The downside of reading expired-copyright ebooks is that many of them use offensive, racist terms. I first noticed this a few weeks ago when I downloaded Tom Swift and His Motor Cycle to read to my son. I'd heard of the Swift books and the premise sounded fun (young inventor has adventures and fixes anything with mechanical problems). Unfortunately, a short way into it Tom meets a bumbling African American named Eradicate, who equates technology with magic and speaks in broken English. It's pretty awful stuff and has a lot of racist terms in it (I wound up doing a lot of filtering and correction while reading it to my son). In the edit logs of Wikipedia's Tom Swift page, I was also surprised to see that a lot of the "this is offensive" comments were struck down by a moderator's "show me proof" comments.
Anyways, I wanted to see how many other books in Project Gutenberg have racist terms. I downloaded the 2010 ISO with 29k books and then wrote a Go program to count offensive terms and score each book (1 point for each "could be" term, 100 points for each "definitely" term). The code flagged some 5900 books, though many of those innocently use terms that have been commandeered by racists. Below is a plot, where each square is a book in the collection (grey is zero, dark is low score, light is high score).
While there are plenty of books that use the terms for a good purpose (e.g., Huck Finn), I suspect many of them use the terms simply because that's how society was a hundred years ago, unfortunately. I think I'll be doing some more greps before I pick another bedtime story.
Code
People asked me for the code for this, so I posted it on GitHub. You can find it (and documentation) in my gut-buster project: github:gut-buster
2014-01-17 Fri
data medical viz
After goofing off with Slicer and a CT scan I had last year, I've come to the conclusion that my head would make an excellent bread bowl.
2013-11-03 Sun
data passwords
One can spend an endless amount of time slicing the Adobe data up. Here's a list of the most common gov addresses in it. Team NASA for the win. It'd be interesting to see how corporate policies affect how often employees wind up registering with a site like Adobe.
Does alphabetical order push some cultures to the back of the bus? This week Livermore had its annual summer reading program award ceremony, where every reader got to shake hands with the city council. Just like last year, they lined up everyone in alphabetical order and jumbled all the grades together. They made us all show up at the same time, and after registration Benjamin's 'U' last name got him spot 594 in the line. Similar to last year, it was frustrating to watch a lot of our friends with earlier names come out the back door, tell us the process was "really efficient!", and go home to have their dinner while we continued to wait. Several of the families in the back didn't know they were in for an hour-long process and hadn't fed their kids. When we pointed out how it sucked to always be last in line, the program people said it was too hard to do the lineup in anything but alphabetical order, starting with A.
The long amount of time we had in line got me wondering if ordering people by their last name is culturally insensitive, as I suspected that each country of origin favored different sounds and letters. Later that night I went home and pulled some simple data off a "top 10 surnames by country" kind of site, computed the average starting letter for the 10 names, and plotted it. While 10 names/country isn't enough of a sample to say much of anything, the first-order numbers showed what I suspected: it sucks to use alphabetical order when you're of German descent. Poland can never catch a break either.
I'd say a hash is in order.
2013-09-17 Tue
tracks gis planes
Someone asked me about the flight over the north pole, so I went back and took a closer look at the data. While I didn't have any exact north pole flights, I did find one in the dataset that was close: UAE230 going from Seattle to Dubai (see UAE230: 13 hours, ouch!). Below are some histograms for lat/lon. I highlighted USA in Pac Man yellow to keep up with my Pac Man plotting pledge.