Craigslist is an interesting source of text data. In addition to providing a continuous stream of user postings from around the country in organized categories, the website stubbornly favors a plain-old-web format that's easy to retrieve and parse. I believe craigslist gets a lot of traffic from different kinds of scrapers. In addition to all the search engine crawlers, you hear stories about how individuals run scripts to continuously watch their local boards so they can be the first to snatch up free items. Craigslist blocks people who aggressively crawl the site, but otherwise lets you wander around if you put in some rational delays.
Back in September I wrote some utilities to go off and scrape job postings from craigslist, because I thought it would be interesting to see what kind of people Bay Area companies wanted. After working out how to grab the data in an unobtrusive way, I updated my script to grab tech job postings from different cities around the country. I run the script about once a week, which over the last 9 months has given me about 32k postings, totaling 470MB in text data. This post just focuses on the scraping. I'll get to the analysis later.
Craigslist puts each post as a separate web page, and uses a city/topic/post directory structure to keep things organized. While the post part of the url is unique and non-sequential, they provide an easy-to-parse index page for each topic that will give you all the urls for the posts in reverse chronological order. All one has to do is pick a city and a topic, walk through the index, and retrieve the individual posts. I put some delay in after every page I fetched to be polite. I also randomized the city list on each run to even out the data if the grabs were taking too long and needed to be cut off (though always getting ATL would have been fine for me). To help with statistics, I had the script store basic information about runs in a local sqlite database. The database helps avoid downloading the same post twice, and gives me a place to store the dates of when I first and last saw a particular post.
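The walk described above can be sketched roughly like this. It is a minimal sketch and not the script from the repo: `list_index` and `fetch_post` are hypothetical callables standing in for the HTTP and parsing work, and the table layout, function names, and ISO date strings are assumptions made for illustration.

```python
import random
import sqlite3
import time


def open_db(path=":memory:"):
    """Open the bookkeeping database, creating the table on first use."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " url TEXT PRIMARY KEY, first_seen TEXT, last_seen TEXT)"
    )
    return conn


def record_sighting(conn, url, day):
    """Note that `url` was listed on `day`. Return True if it is new."""
    row = conn.execute("SELECT 1 FROM posts WHERE url = ?", (url,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO posts VALUES (?, ?, ?)", (url, day, day))
        is_new = True
    else:
        conn.execute("UPDATE posts SET last_seen = ? WHERE url = ?", (day, url))
        is_new = False
    conn.commit()
    return is_new


def crawl(conn, cities, list_index, fetch_post, day, delay=0.0):
    """Visit cities in random order and fetch only posts not seen before.

    list_index(city) -> iterable of post urls   (hypothetical helper)
    fetch_post(url)  -> downloads one post      (hypothetical helper)
    """
    order = list(cities)
    random.shuffle(order)  # a run that gets cut off hits different cities
    fetched = []
    for city in order:
        for url in list_index(city):
            if record_sighting(conn, url, day):
                fetch_post(url)
                fetched.append(url)
                time.sleep(delay)  # be polite between page fetches
    return fetched
```

Posts that were already in the database only get their `last_seen` date bumped, which is what makes the first-seen/last-seen statistics possible later on.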
Grabs Per Day
Below is a breakdown of how many posts I grabbed for each city when I ran the scraper. Since the script only grabs posts that it hasn't seen before, the per day grabs go up and down based on how frequently I ran the script (eg, when I missed a week or two, there was more data available to grab). For this time period, the cities seem to be fairly proportional. The big job cities seem to be San Francisco, Seattle, New York, and Boston (not unexpected). C'mon Atlanta. It's like you're not even trying.
Number of Active Days per Post
Another interesting statistic for me was how long job postings remain active on craigslist. I used the "first seen" and "last seen" dates stored in my meta data to estimate the amount of time a post stays alive. The numbers are off due to the initial posts I pulled (ie, I looked at the grab date instead of the post date) and the most recent posts (ie, which have not expired yet). As the below (logscale!) plot shows, most posts stick around for about a month. However, there are a few that last as long as 80 days.
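The active-days estimate boils down to a date subtraction per post. Here is a small sketch of that arithmetic, assuming the dates are stored as ISO strings (the storage format and function names are assumptions, not taken from the scripts):

```python
from collections import Counter
from datetime import date


def active_days(first_seen, last_seen):
    """Days between the first and last run that saw a post."""
    return (date.fromisoformat(last_seen) - date.fromisoformat(first_seen)).days


def lifetime_histogram(rows):
    """Count posts per active-day value; rows are (first_seen, last_seen)."""
    return Counter(active_days(first, last) for first, last in rows)
```

Since the scraper only runs about once a week, the result is only as fine-grained as the gap between runs, and a post can be live for a few days on either side of the dates recorded for it.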
It isn't much but I put the code for this on github:
In addition to being an interesting source of data for plane statistics, the FAA registration dataset also provides address information for each plane's owner. I was curious to see who owned airplanes in my town (not just the drones), so I wrote a simple script to extract addresses in my zipcode from the database and convert them to geospatial coordinates. Below is a plot of all the registered plane owners for Livermore. I've also outlined different neighborhoods in town and colored them by how expensive their houses are. Unsurprisingly, people that own planes tend to live in wealthier neighborhoods.
Livermore has a busy municipal airport on the north-west side of town, with an east-west landing strip. Planes typically approach the airport by flying west over the city, using the railroad and I580 as visual guides to locate the airport. People that live east of the airport often complain about the noise of descending planes, but the airport was there long before the houses (it was built in 1965). In general, Livermore house prices increase the farther south you get. The cheap houses (where I live, in the yellow) start at about $500k. Down in Ruby Hill they're all well over $1M.
For the above plot, I shaded different parts of town based on how expensive their houses are: the darker the green, the wealthier the neighborhood. The shading wasn't very scientific- I just boxed up regions by hand and then looked up what Zillow said houses were going for in the neighborhood. Sadly, I found that my yellow-ish neighborhood had zero plane owners, which was consistent with other poorer neighborhoods. I think it's interesting that most of the plane owners live south of the landing path. I'm not sure if that's because that's where the more expensive houses usually are, or if plane owners are smart enough to know not to live along the flight path.
East Bay Owners
In addition to Livermore, I pulled out data on the neighboring areas (basically all of Alameda and Contra Costa counties). Below is a snapshot of it, but you can explore the data yourself in a pannable Google map of the data.
The script I used for extracting the data is extract_by_zipcode.py, which I've put in my flight classifier repo. GeoPy needs a newer version of Python than what my CentOS 6 desktop had, so I had to build/install that as well.
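For a rough idea of what the extraction involves, here is a minimal sketch (not the contents of extract_by_zipcode.py). The column names `N-NUMBER`, `STREET`, `CITY`, `STATE`, and `ZIP CODE` are my assumption about the FAA master file header, the sample rows are made up, and `geocode` is any callable that returns an object with `latitude` and `longitude` attributes (or None), which is the shape of result that GeoPy's geocoders give back.

```python
import csv
import io

# Made-up rows in the assumed layout: padded fields and a trailing comma.
SAMPLE_MASTER = (
    "N-NUMBER,NAME,STREET,CITY,STATE,ZIP CODE,\n"
    "100AB,EXAMPLE OWNER ONE   ,1 MAIN ST     ,LIVERMORE  ,CA,945501234,\n"
    "200CD,EXAMPLE OWNER TWO   ,2 OAK AVE     ,ATLANTA    ,GA,303320000,\n"
)


def owners_in_zip(master_file, zipcode):
    """Yield (n_number, address) for registrations in a 5-digit zipcode."""
    for raw in csv.DictReader(master_file):
        # Strip padding and drop the empty column made by the trailing comma.
        row = {k.strip(): (v or "").strip() for k, v in raw.items() if k}
        if row.get("ZIP CODE", "").startswith(zipcode):
            address = "{}, {}, {} {}".format(
                row["STREET"], row["CITY"], row["STATE"], zipcode
            )
            yield row["N-NUMBER"], address


def locate(owners, geocode):
    """Turn addresses into (n_number, lat, lon), skipping failed lookups."""
    points = []
    for n_number, address in owners:
        hit = geocode(address)
        if hit is not None:
            points.append((n_number, hit.latitude, hit.longitude))
    return points


owners = list(owners_in_zip(io.StringIO(SAMPLE_MASTER), "94550"))
```

Passing the geocoder in as a parameter keeps the filtering testable without a network connection, and makes it easy to add a delay or a cache around the real lookups.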
While looking planes up in the FAA dataset for the previous post, I noticed some planes had zero seats, weighed under 55 pounds, and were electric powered. Drones! (or more officially, sUAS - Small Unmanned Aircraft System). I knew that the FAA was making people register their drones, but I was surprised to see them showing up with other aircraft in the FAA database. After a little reading I learned that there are actually two ways to register: (1) online through a simple, instantaneous web page or (2) by mail using the traditional paper form process. While the by-mail approach takes a few weeks, your drone gets an N number and is plugged into the database. I wrote some python scripts to pull out electric plane registration info and plot it.
3,500 drone registrations is tiny compared to the web registration numbers (more than 300k in the first month). Still, it seems like a lot to me, given that I don't see an obvious reason to go through the by-mail process. In any case, I started filtering the data to see which organizations were registering. It wasn't that difficult, since the FAA database provides a registration type that identifies whether the owner is an individual, a corporation, or a government entity.
I first filtered on commercial entities, of which there were 940 different companies. Below is the complete list of companies with 10 or more drones. There are a few interesting stories here. First, Intel topped the charts with 111 drones. They seemed to all be the same ArsTec Hummingbird model, which (surprise) uses an Intel Atom Z530. BNSF Railway is using the drones to inspect rail lines (why not just strap a camera to a train?). Liberty Mutual says they're using them to assess insurance claims (eg natural disasters). San Diego Gas and Electric will do inspections of their service areas. Some companies do general "aerolytics", like this Talon Aerolytics video shows. Lockheed Martin manufactures their own drones. In addition to the electric models, their Missiles and Fire Control group has a few drones under 55 pounds that use turbojet engines. There are also some mysteries in this list. Ashfloyd LLC has little outward info for a company with so many drones, causing some people to wonder who they are.
DRONES COMPANY
------ -------------------------------
   111 Intel Corp
    93 Precisionhawk Usa Inc
    43 Ashfloyd LLC
    40 Aerovironment Inc
    23 Rotor F/X LLC
    22 Lockheed Martin Corp
    18 San Diego Gas & Electric
    17 Unmanned Innovation Inc Dba
    16 Talon Aerolytics LLC
    15 Wintec Arrowmaker
    14 Flirtey Inc
    13 Trimble Navigation Ltd
    12 BNSF Railway UAS Program
    12 Precision Hawk Usa Inc
    12 Cape Productions Inc
    12 Microsoft Corp
    12 Aerodrome LLC
    11 Hazon Solutions LLC
    11 Liberty Mutual Insurance
    11 Unconventional Concepts Inc
    10 Aerocine Ventures Inc
    10 Amazon Logistics Inc
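A tally like the one above is basically a filter plus a counter. This is a hedged sketch rather than tally_drones.py itself: it assumes the rows have already been narrowed down to drones, that each row is a dict with `NAME` and `TYPE REGISTRANT` keys, and that the registrant codes passed in ("3" below) are placeholders that should be checked against the FAA's field documentation.

```python
from collections import Counter


def tally_owners(rows, registrant_types, min_count=1):
    """Count registrations per owner for the given registrant type codes.

    rows             -- dicts with "NAME" and "TYPE REGISTRANT" (assumed keys)
    registrant_types -- set of code strings to keep (assumed values)
    """
    counts = Counter(
        row["NAME"].strip()
        for row in rows
        if row["TYPE REGISTRANT"].strip() in registrant_types
    )
    # most_common() sorts by count, largest first.
    return [(name, n) for name, n in counts.most_common() if n >= min_count]
```

Note that an exact-match count like this treats spelling variants as different owners, which is how a company can show up twice in a list (as with the two Precision Hawk entries).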
I was a little surprised Amazon didn't have more given Amazon Prime Air. They currently have 10 drones with tail fins, and have registered four different models they've developed. They've been adding to their inventory since last year, and appear to have more in the works if you check with the FAA. Here are the counts for the different models:
Model Number  Tailfins Currently Registered
------------------------------------------------
MK9A          0
MK021A        2    Starting March 2015
MK23A         1    December 2015
MK24          7    Starting April 2015
Next, I selected on Government users, which yielded 310 organizations. They're not as exciting as people would think though- they're mostly state schools, NASA, fire departments, and law enforcement. I moved National Labs into their own category to include more schools in this list.
DRONES ORGANIZATION
------ -------------------------------
    32 Kansas State University
    22 Oregon State University
    21 Nasa Langley Research Center
    16 University Of Colorado
    14 Nasa Ames Research Center
    12 Virginia Polytechnic Institute & State University
    12 Department Of Commerce
    11 University Of Maryland Uas Test Site
    11 Georgia Institute Of Technology
    11 Cochise Community College
    10 University Of Alaska Fairbanks
    10 University Of Michigan
     9 University Of North Dakota
     8 Department Of Energy
     8 Center For Disaster Risk Policy
     7 Mississippi State University
     7 Sinclair Community College
     7 Auburn University
     6 Ohio State University
     6 Utah State University
   ...
     4 Bureau Of Alcohol Tobacco Firearms & Explosives
     3 Alameda County Sheriffs Office
I also pulled out national labs from the gov list. All of the drones I saw in this section were the same stuff consumers buy.
DRONES ORGANIZATION
------ -------------------------------
    12 Sandia National Laboratories
     4 Battelle Pacific Northwest National Laboratory
     2 Los Alamos National Laboratory
     1 National Marine Mammal Laboratory
     1 Brookhaven National Laboratory
     1 Oak Ridge National Laboratory
MIT Lincoln Laboratory also popped up in the Aircraft Reference file (which defines airplane types), but does not show up as an owner of a registered plane in the master list. Searching for the drone's manufacturer model number in the master list turned up 9 hits, though all of them had blank owner fields. There are many blanked owner fields in the dataset, so this may just be part of the registration process and not obfuscation.
The drone's name is Locust, which appears to be a micro-UAV developed by students in MIT's Beaverworks program, commissioned by LL and the USAF back in 2010. Some former students mentioned working on LOCUSTS/PERDIX in their LinkedIn pages, and that they'd designed micro-uavs that could be deployed at 30,000ft from a "cartridge mounted on a business jet". I don't know if it's related or not, but the Office of Naval Research has a video of their LOCUST (low-cost uav swarm technology). Didn't these people watch Terminator?
The above plots and data were generated with plot_drones.py and tally_drones.py, which I've put in my flight-classifier repo.
Earlier this year the local Livermore paper had some articles about how air traffic at our municipal airport was going to shoot up during the Superbowl, because there weren't going to be enough places for private jets to land in the Bay Area. I didn't think much of it at the time, since the paper tends to have delusions about how rich people will fly to Livermore and spend time here. However, after the Superbowl, my wife noticed on social media that several of her local friends were chatting about how there were a lot of jets taking off from the airport that night. I fired up dump1090 and let it grab for an hour before bed. After a bit of post-processing work, here's a timeline for all the flights I saw:
Dump1090 is a great program- in addition to displaying where planes are in a webpage, it produces easy-to-parse dump files that contain a good bit of plane information. I captured two types of traces from dump1090 after the Superbowl: the detailed runtime output with all the message info and the distilled, csv-formatted data from the netcat interface. The grabs went on for about an hour and yielded 30MB and 13MB of data, respectively. Looking at the data, I saw basically what I expected: there were a large number of private planes, but none of them were reporting position information. It drives me crazy that they clearly have ADS-B equipment but don't transmit position. Current regulations don't require it though, so nearly all private planes turn it off to prevent you from tracking their exact locations.
The dataset did leave me with a big pile of timestamped ADS-B IDs though, so I started looking for ways I could convert the IDs to something more interesting. I found that the FAA provides an extremely useful database you can download that contains full registration information for all US planes. The database is a collection of easily-parsed CSV files, and contains each plane's ADS-B hex code, tailfin, plane type, and owner information. The master DB files are currently close to 200MB uncompressed, but when I extracted just the ADS-B id and owner columns, it was only about 8MB (small enough to use as a quick lookup table).
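Boiling the master file down to a lookup table can be sketched like this. The column names `MODE S CODE HEX` and `NAME` are my assumption about the master file's header row, and the sample rows are made up, so treat this as an illustration of the idea rather than the script in the repo.

```python
import csv
import io

# Made-up rows in the assumed layout (padded fields, trailing comma).
SAMPLE_MASTER = (
    "N-NUMBER,NAME,MODE S CODE HEX,\n"
    "100AB,EXAMPLE OWNER ONE      ,A00001    ,\n"
    "200CD,EXAMPLE OWNER TWO      ,a00002    ,\n"
    "300EF,NO HEX CODE ON FILE    ,          ,\n"
)


def build_owner_lookup(master_file, id_col="MODE S CODE HEX", owner_col="NAME"):
    """Map upper-cased ADS-B hex id -> owner name, keeping only two columns."""
    reader = csv.reader(master_file)
    header = [name.strip() for name in next(reader)]
    id_idx, owner_idx = header.index(id_col), header.index(owner_col)
    lookup = {}
    for fields in reader:
        if len(fields) <= max(id_idx, owner_idx):
            continue  # skip short or blank lines
        hex_id = fields[id_idx].strip().upper()
        if hex_id:
            lookup[hex_id] = fields[owner_idx].strip()
    return lookup


owners = build_owner_lookup(io.StringIO(SAMPLE_MASTER))
```

An ID that is missing from the dictionary (`owners.get(hex_id)` returning None) is then the signal that a plane needs to be looked up some other way.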
I used the FAA info to find the owners of the planes in my dataset, and then did some simple text processing to assign a classification to each plane to group similar owners together. Since I only had 98 planes to look at, I mostly did the classifications by hand. 36 of the planes were easy to classify because they were owned by commercial, passenger airline companies like Delta. Another 16 planes were owned by banks (fun fact: banks own more planes than any other type of company). Through some Google searches, I identified four private passenger carriers (e.g., Xojet) that took care of 7 more planes. I found 2 more planes owned by oil companies (Eaton and San Joaquin Refining) and 1 emergency helicopter (California Shock Trauma). I also found a plane owned by a gun store and another by a trucking company. There were 21 other planes in the FAA dataset that didn't turn much up in Google searches, that I marked as unclassified. That left me with 15 planes that weren't in the FAA dataset.
The FAA dataset only has info on US planes, so I figured the missing planes must all be foreign owned. I did some reading and learned that the hex IDs reported in ADS-B are from the International Civil Aviation Organization (ICAO), and that each country is assigned its own block of values in the address space. For example the US fits in A00000 to AFFFFF (which explains why I always see A's in my data), while Portugal is in 490000 to 497FFF. Annoyingly, I couldn't find an official table with all the country codes in it anywhere. I did find a website that had deduced the info and put it into a table. I grabbed it and did a lot of awking to put it into a lookup table my scripts could use. Here's where the 15 remaining planes were from, sorted by country:
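The country lookup itself is just a range check on the hex value. Here is a minimal sketch that uses only the two blocks mentioned above; a real table would carry one entry per country, and the names `BLOCKS` and `country_for` are made up for this example.

```python
# (low, high, country) blocks of the 24-bit ICAO address space.
# Only the two ranges quoted in the text are listed here.
BLOCKS = [
    (0xA00000, 0xAFFFFF, "United States"),
    (0x490000, 0x497FFF, "Portugal"),
]


def country_for(hex_id, blocks=BLOCKS):
    """Return the country whose block contains hex_id, or "unknown"."""
    value = int(hex_id, 16)
    for low, high, country in blocks:
        if low <= value <= high:
            return country
    return "unknown"
```

With only a couple hundred blocks a linear scan is plenty fast, so there's no real need for anything smarter than a list of tuples.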
C00738 22:29:12.396 22:50:48.658 Canada
C00964 22:30:02.529 22:40:50.856 Canada
C04852 22:57:57.379 23:16:09.175 Canada
C06E87 22:29:06.825 22:38:57.286 Canada
C08048 22:31:58.590 22:43:20.798 Canada
780A5B 23:16:51.247 23:21:39.138 China
780A70 22:29:07.154 22:29:44.770 China
780DA9 22:44:36.948 22:52:14.966 China
0D049E 23:22:46.770 23:25:08.781 Mexico
0D083B 22:55:46.443 23:02:04.836 Mexico
0C206B 22:44:08.180 22:50:31.290 Panama
52027A 23:10:56.513 23:19:52.647 (reserved, EUR/NAT)
899103 22:34:37.444 22:42:35.777 Taiwan
072233 22:29:13.247 22:32:35.813 unknown
A22E75 22:42:30.730 22:55:33.598 United States
The last plane there is a US plane, which should have been in the FAA database. FlightAware gave me the tail fin (N24JG), which the FAA told me had a December renewal date. My guess is that the plane was just in-between renewals. In any case it was an interesting plane because it's owned by Jeff Gordon, Inc. Jeff Gordon is a race car driver, so I guess I did spot a celebrity. Neat.
The next unknown was 072233. I didn't find this registered anywhere, but Google searches turned the number up in lists where people monitor military plane activity. They reported this as 09-72233, which they say is a US Army UH-72A or EC45 helicopter (unarmed).
The final plane was 52027A, which caught my eye because it falls into a NATO band of the ICAO numbers (I believe). I looked it up in the raw dump1090 data and found that it also used the callsign IRONS12, which sounds like a tough-guy military callsign. I was hoping it might be the F15 that intercepted four planes during the Superbowl (and escorted them to Livermore), but I think it's actually a surveillance plane. I found references to an IRONS12 callsign being used by an RC-26B with serial 920372 in the Bay Area the week before the Superbowl (and leaving after). The RC-26B appears to be an Air National Guard plane with sensors for filming and tracking, and serves to "bridge the gap between Department of Defense and civil authorities". Now that I think about it, a surveillance plane is a lot more interesting than the F15s that the news covered.
The only other analysis I did on this data was look at how long planes were in the air (or otherwise chirping their ADS-B info). Given my antenna configuration that night, most planes were only visible for about 10-15 minutes. The emergency helicopter though operated for more than 20 minutes. I'd been hoping to see some private planes with long running times (a sign that they were sitting at the airport waiting for their owners to show up), but that didn't happen.
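The visibility times come from tracking the first and last message seen for each ID. Below is a rough sketch of that, assuming the csv lines follow the usual BaseStation-style layout where field 4 is the hex ID and fields 6 and 7 are the message date and time. Those field positions, the function names, and the placeholder date in the sample are all assumptions and not pulled from my scripts.

```python
from datetime import datetime


def visible_spans(lines):
    """Return {hex_id: (first_seen, last_seen)} from csv message lines."""
    spans = {}
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 8 or fields[0] != "MSG":
            continue  # not a message line in the assumed format
        hex_id = fields[4].upper()
        stamp = datetime.strptime(
            fields[6] + " " + fields[7], "%Y/%m/%d %H:%M:%S.%f"
        )
        first, last = spans.get(hex_id, (stamp, stamp))
        spans[hex_id] = (min(first, stamp), max(last, stamp))
    return spans


def minutes_visible(spans):
    """Convert each (first, last) pair into minutes of visibility."""
    return {
        hex_id: (last - first).total_seconds() / 60
        for hex_id, (first, last) in spans.items()
    }


# Two sample lines with a placeholder date; times are taken from the
# A22E75 row in the table above.
SAMPLE = [
    "MSG,8,1,1,A22E75,1,2000/01/01,22:42:30.730,2000/01/01,22:42:30.730",
    "MSG,8,1,1,A22E75,1,2000/01/01,22:55:33.598,2000/01/01,22:55:33.598",
]
durations = minutes_visible(visible_spans(SAMPLE))
```

For the sample, A22E75 comes out at a bit over 13 minutes, which lines up with the 10-15 minute window most planes stayed in range.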
I've put the data and the scripts used to do these plots on github. The country code lookup table I made for this work is also in the repo.
For fun, I went through some of the images I captured from my webcam scraper project and converted them into timelapse videos. The videos are all pretty repetitive. However, the videos did help me spot some nice one-offs, such as the Vancouver lightning strike I reported on previously. Here are some Youtube clips of the more interesting ones.
The tides can be interesting to watch in timelapses. Check out the rise and fall of boats in these webcams from Alaska and Hilton Head:
I spent a summer in Metz, France during college, so it was nice to see pictures of the city and Paris come up in the screensaver. One of the Metz webcams tracked an interesting building with curvy architecture being built. The Eiffel Tower cam is fairly constant, but if you stop it around Bastille Day you can see some fireworks.
There were a few other places I grabbed from around Europe. The problem I had with getting data from there was that with the timezone differences, it was often night there when my desktop was running, resulting in night images. Here are timelapses from Warsaw, the Vatican, and Switzerland.
California also has some good cameras out in the mountains. Here are two from Mount Wilson (at the Mt. Wilson Observatory near L.A.) and Mount Shasta.
The Bay Area
The Bay Area has a few good cameras, besides the normal traffic cams. Here are cameras from downtown, Sausalito, and Berkeley.
I only found a few webcams for Atlanta, but the skyline camera always looked good to me. Georgia State used to have a really good, user-controlled camera with a strong zoom that let you look around the downtown streets and buildings. It was interesting to watch how other people controlled the camera. I often thought that if I watched it long enough I'd witness some downtown crime.
I was surprised to find that there are multiple webcams in Antarctica, and that they are well maintained. I don't seem to have it anymore, but on Mother's Day, someone left a sign in front of the webcam that said "I love you mom". The below timelapse is a little boring, but midway through it you see sea lions and penguins (I think) come up on shore.
Finally, here's my favorite camera, the one pointed at the Burrard Bridge in Vancouver. If you look around 34 seconds into it, you'll see the lightning strike I mentioned in the previous post.