Craig Ulmer

Finding Racism in Ebooks

2014-09-08 text code

The downside of reading expired-copyright ebooks is that many of them use offensive, racist terms. I first noticed this a few weeks ago when I downloaded Tom Swift and His Motor Cycle to read to my son. I'd heard of the Swift books and the premise sounded fun (a young inventor has adventures and fixes anything with mechanical problems). Unfortunately, a short way into it Tom meets a bumbling African American named Eradicate, who equates technology with magic and speaks in broken English. It's pretty awful stuff and has a lot of racist terms in it (I wound up doing a lot of filtering and correction while reading it to my son). In the edit logs of Wikipedia's Tom Swift page, I was also surprised to see that a lot of the "this is offensive" comments were struck down by a moderator's "show me proof" comments.

Anyway, I wanted to see how many other books in Project Gutenberg contain racist terms. I downloaded the 2010 ISO with 29k books and then went about writing a Go program to count offensive terms and score each book (1 point for each "could be" term, 100 points for each "definitely" term). The code flagged some 5,900 books, though many of those innocently use terms that have since been commandeered by racists. Below is a plot, where each square is a book in the collection (grey is zero, dark is low score, light is high score).


While there are plenty of books that use the terms for a good purpose (e.g., Huck Finn), I suspect many of them use the terms simply because that's how society was a hundred years ago, unfortunately. I think I'll be doing some more greps before I pick another bedtime story.

Code

People asked me for the code for this, so I posted it on GitHub. You can find it (and documentation) in my gut-buster project: github:gut-buster