Craig Ulmer

Examining Bad Flight Data from the Logger

2015-03-08 Sun tracks gis code planes

One of the problems with capturing your own ADS-B airplane data is that there are always bad values mixed in with the good. As I started looking through my data, I realized that every so often there'd be lon/lats that were nowhere near my location. Initially I thought I might be getting lucky with ionospheric reflections, but a closer look at where these points land shows something else is probably going on.

I wrote some Pylab code to take all of my points and plot them on a world map (well, without the actual map). I marked off a rectangle to bound where Livermore is and then plotted each point I received: blue if it fell within the Livermore box, red if it fell outside. I then extended the box's lon/lat boundaries across the whole map to help show where the far-away points sat relative to Livermore's lon/lat ranges.
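The bookkeeping behind that plot is simple; here's a minimal sketch in Python (the Livermore bounding-box values below are rough stand-ins, not the ones I actually used):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

# Assumed (approximate) Livermore-area bounds, for illustration only
LON_MIN, LON_MAX = -122.0, -121.5
LAT_MIN, LAT_MAX = 37.5, 37.9

def in_livermore(lon, lat):
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX

def plot_points(points, out_png="world_points.png"):
    inside = [(x, y) for x, y in points if in_livermore(x, y)]
    outside = [(x, y) for x, y in points if not in_livermore(x, y)]
    fig, ax = plt.subplots()
    if outside:
        ax.scatter(*zip(*outside), s=2, c="red")
    if inside:
        ax.scatter(*zip(*inside), s=2, c="blue")
    # Extend the box's lon/lat bands across the whole map
    ax.axhspan(LAT_MIN, LAT_MAX, color="blue", alpha=0.1)
    ax.axvspan(LON_MIN, LON_MAX, color="blue", alpha=0.1)
    ax.set_xlim(-180, 180)
    ax.set_ylim(-90, 90)
    fig.savefig(out_png)

plot_points([(-121.7, 37.7), (35.0, 10.0)])
```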


The first thing I noticed was that there were a whole slew of dots horizontally that fall within Livermore's lat range. It's possible that this could be due to bit errors in the lon field. The next thing I noticed was that there were two columns of bad points, one at about -160 degrees, the other around 35. Since both of these columns had data spread across all lats, I realized it probably wasn't from an ionospheric reflection. The right column happens to be at about the same value as you'd get if lon and lat were swapped (drawn as a pink bar). However, I don't think that's what happened, as the dots are distributed all the way across the vertical column.

Individual Offenders

Since I didn't have a good explanation for the bad values, I did some more work on the dataset to pull out the individual offenders. Of the 2092 unique planes, only 16 were giving me problems. I plotted each plane's points individually below using the same plotter as before.


To me, these breakdowns indicate that the problem planes exhibit a few types of bad data. Three of them have purely horizontal errors, while twelve of the rest have some kind of vertical problem. The 8678C0 case doesn't have enough points to tell what it's doing. Interestingly, the vertical cases all seem to have at least a few points near Livermore. This makes me wonder if their GPS units lost sync at some point in the flight and started reporting partially incorrect data. In any case, there seem to be some common failure patterns.

Plane Info

Out of curiosity I went and looked up all 16 of these flights by hand to see what they were. It's interesting that all three of the planes with horizontal errors weren't small planes (one old 747 and two new 777s). All the vertical errors seem to be from smaller planes (though one was a US Airways Express flight). Here's the rundown, including the number of days I saw each plane in February:

#ID    Days Flight  Info
4248D9 1    VQ-BMS  Private 747 (1979) Las Vegas Sands Corp
8678C0 1    JA715A  Nippon Airways 777 
868EC0 1    JA779A  Nippon Airways 777 
A23C2E 4    N243LR  USAirways Express 
A3286B 1    N302TB  Private Beechcraft 400xp 
A346D0 1    N310    Private Gulfstream 
A40C4B 2    N360    Private (San Francisco) 
A5E921 7    N480FL  Private Beechcraft 
A7D68B 1    N604EM  Private Bombardier
A7E28D 1    N607PH  Private Post Foods Inc
A8053D 1    N616CC  Private Gulfstream
A8DAB9 1    N67PW   Private Falcon50
AA7238 7    N772UA  United Airlines 777
AC6316 1    N898AK  Private Red Line Air
AC70DC 2    N900TG  Private (Foster City) 
AD853E 2    N970SJ  Private Gulfstream 

Code and Data

I've put my data and code up on GitHub for anyone that wants to look at it.

github:livermore-arplane-tracks


Flight Data From the Data Logger

2015-03-02 Mon tracks gis

Now that I've been running the Edison airplane data logger for more than a month, it's time to start looking at the data it's been capturing. I pulled the logs off the SD card, reorganized them into tracks, and then generated daily plots using Mapnik. The image below shows all of the flights the logger captured for each day in February.


The first thing to notice is that the SDR has pretty good range, even with the stock antenna. I live just southeast of the dot for Livermore and was only expecting to see planes near town. Instead I'm seeing traffic all over the Tri-Valley and some a little beyond. I was initially surprised to see anything in either the Bay Area or the Central Valley because of the Pleasanton ridge and the Altamont hills. It makes sense, though: planes fly much higher than the hills, except when they're landing.

Logger Statistics

I wanted to know more about the data I was getting, so I wrote a few scripts to extract some statistics. The first thing I wanted to know was what percentage of the time the logger was running each day. I decided early on not to run it around the clock because there just aren't that many flights at night. To help me remember to start and stop the logger each day, I plugged the Edison into the same power strip my home router uses, which I usually turn on when I get up (7am) and turn off when I go to bed (11:30pm). I wrote a Perl script to look through each day's log and find the largest gap of time where there was no data. Since the logger uses UTC, my nightly shutdowns usually appear as a 7-hour gap starting around 7am UTC. The top plot below shows what percentage of the day the logger was up and running. It looks like I was only late turning it on a few times in February.
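The gap-finding itself is simple; here's the idea sketched in Python rather than Perl, assuming timestamps have already been converted to seconds-since-midnight (the real logs store date/time strings):

```python
def uptime_fraction(timestamps):
    """Fraction of the day NOT covered by the largest no-data gap."""
    ts = sorted(timestamps)
    day = 24 * 3600
    # Gaps between consecutive observations...
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    # ...plus the gap that wraps around past midnight
    gaps.append(day - ts[-1] + ts[0])
    return 1.0 - max(gaps) / day

# A logger that sampled once a minute from 7:00 to midnight,
# i.e. an overnight gap of roughly 7 hours
obs = list(range(7 * 3600, 24 * 3600, 60))
print(round(uptime_fraction(obs), 2))   # about 0.71
```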


The next thing I wanted to know was how many flights I was seeing each day. The raw numbers are in green above, but I've also scaled them up using the top chart's data to help normalize them (not a fair comparison, admittedly, since there are fewer night flights). The red lines indicate where Sundays began. It looks like there's definitely lighter activity on Sundays. Things are a little skewed, though, since everything is in UTC instead of Pacific (I was lazy and didn't bother to redistribute the days).

Missing IDs

The logger looks for two types of ADS-B messages from dump1090. The first is an occasional ID message that associates the hex ID for a plane with its call sign (often a tail fin). The second is the current location for a particular plane (which only contains the hex ID). Grepping through the data, I see 2195 unique hex IDs for the position messages, but only 2092 unique hex IDs for the ID messages. I checked and both message streams have some unique values that do not appear in the other message stream.
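A quick way to cross-check the two streams, assuming the tab-separated format my logger writes (message type in column 0, hex ID in column 1):

```python
def split_ids(lines):
    """Collect hex IDs seen in ID (1) and position (3) messages."""
    id_msgs, pos_msgs = set(), set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == "1":
            id_msgs.add(cols[1])
        elif cols[0] == "3":
            pos_msgs.add(cols[1])
    return id_msgs, pos_msgs

lines = ["1\tA23C2E\tN243LR\t20150208\t08:00:00",
         "3\tA23C2E\t37.7\t-121.7\t8000",
         "3\tAA7238\t37.6\t-121.9\t32000"]
ids, pos = split_ids(lines)
print(sorted(pos - ids))   # planes with positions but no ID message
```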

What Airlines am I Seeing?

Another stat I was interested in is what airlines show up the most in my data. It isn't too hard to get a crude estimate of the breakdown because (most?) commercial airlines embed their ICAO code in their flight number. Through the power of awk, grep, sed, and uniq, I was able to pull out the number of different flights each provider had over my area (this is unique flight numbers, not total flights). Here are the top 20:

404 UAL  United Airlines
114 VRD  Virgin America
 84 FDX  Federal Express
 72 AAL  American Airlines
 51 DAL  Delta Airlines
 46 JBU  Jet Blue
 45 SKW  Sky West
 38 AWE  US Airways
 29 EJA  NetJets Aviation "ExecJet"
 26 UPS  United Parcel Service
 22 RCH  Air Mobility Command "Reach"
 18 CPA  Cathay Pacific Airways
 17 OPT  Options
 16 EJM  Executive Jet Management "Jet Speed"
 11 TWY  Sunset Aviation, Twilight
 11 HAL  Hawaiian Airlines
 11 CSN  China Southern Airlines
 10 KAL  Korean Air
 10 EVA  EVA Air (Taiwan)
  7 AAR  Asiana Airlines

There are a few things of interest in that breakdown. First, freight airlines like FedEx and UPS show up pretty high in the list. I think people often overlook them, but they occupy a sizable chunk of what's in the air. Second, I didn't see anything from Southwest in the data. They definitely fly over us, so I was surprised not to see any SWA or WN fins. Finally, there were a ton of planes that didn't have any info associated with them that would help me ID the owner (e.g., there were 456 N fins). There are websites where you can look them up (most of the time it just comes back as a private owner), but it sinks a lot of time. Maybe later I'll revisit this and write something to automate the retrieval.
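For the curious, the awk/grep/sed/uniq pipeline boils down to something like this in Python (the flight-number pattern is an assumption; tails like N-numbers deliberately won't match):

```python
import re
from collections import Counter

def airline_counts(flights):
    """Count unique flight numbers per three-letter ICAO prefix."""
    carriers = Counter()
    seen = set()
    for flt in flights:
        m = re.match(r'^([A-Z]{3})\d+$', flt)
        if m and flt not in seen:     # unique flight numbers, not sightings
            seen.add(flt)
            carriers[m.group(1)] += 1
    return carriers

c = airline_counts(["UAL123", "UAL123", "UAL99", "VRD7", "N243LR"])
print(c.most_common(2))
```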


Building an Airplane Data Logger on an Intel Edison

2015-02-08 Sun gis edison code

Lately I've been spending a good bit of my free time at home doing ad hoc analysis of airline data that I've scraped off of public websites. It's been an interesting hobby that's taught me a lot about geospatial data, the airline world, and how people go about analyzing tracks. However, working with scraped data can be frustrating: it's a pain to keep a scraper going and quite often the data you get doesn't contain everything you want. I've been thinking it'd be useful if I could find another source of airline data.


I started looking into how the flight tracking websites obtain their data and was surprised to learn that a lot of it comes from volunteers. These volunteers hook up a software-defined radio (SDR) to a computer, listen for airline position information broadcast over ADS-B, and then upload the data to aggregators like flightradar24. I've been looking for an excuse to tinker with SDR, so I went about setting up a low-cost data logger of my own that could grab airline location information with an SDR receiver and then store the observations for later analysis. Since I want to run the logger for long periods of time, I decided it'd be useful to set up a small, embedded board of some kind to do the work continuously in a low-power manner. This post summarizes how I went about making the data logger out of an Intel Edison board and an RTL-SDR dongle, using existing open-source software.

RTL-SDR and ADS-B

The first thing I needed to do for my data logger was find a cheap SDR that could plug into USB and work with Linux. Based on many people's recommendations, I bought an RTL-SDR USB dongle from Amazon that only cost $25 and came with a small antenna. The RTL-SDR dongle was originally built to decode European digital TV, but some clever developers realized that it could be adapted to serve as a flexible tuner for GNU Radio. If you look around on YouTube, you'll find plenty of how-to videos that explain how you can use an RTL-SDR and GNU Radio to decode a lot of different signals, including pager transmissions, weather satellite imagery, smart meter chirps, and even some parts of GSM. Of particular interest to me though was that others have already written high-quality ADS-B decoder programs that can be used to track airplanes.


ADS-B (Automatic Dependent Surveillance Broadcast) is a relatively new standard that the airline industry is beginning to use to help prevent collisions. Airplanes with ADS-B transmitters periodically broadcast their vital information in digital form. This information varies depending on the transmitter. On a commercial flight you often get the flight number, the tail fin, longitude, latitude, altitude, current speed, and direction. On private flights you often only see the tail fin. The standard isn't mandatory until 2020, but most commercial airlines already seem to be using it.

If you have an RTL-SDR dongle, it's easy to get started with ADS-B. Salvatore Sanfilippo (of Redis fame) has an open-source program called dump1090 that is written in C and does all of the decode work for you. The easiest way to get started is to run it in interactive mode, which shows a nice, running text display of all the different planes the program has seen within a certain time window. The program also has a handy network option that lets you query the state of the application through a socket. This option makes it easy to interface other programs to the decoder without having to link in any libraries.

Intel Edison

The other piece of hardware I bought for this project was an Intel Edison, which is Intel's answer to the Raspberry Pi. Intel packaged a 32b Atom CPU, WiFi, flash, and memory into a board that's about the size of two quarters. While the Edison is not as popular as the Pi, it does run 32b x86 code. Since I'm lazy, x86 compatibility is appealing because it means I can test things on my desktop and then just move the executables/libraries over to the Edison without having to cross compile anything.

The small size of the Edison boards can make them difficult to interface with, so Intel offers a few carrier dev kits that break out the signals on the board to more practical forms. I bought the Edison Arduino board ($90 with the Edison module at Fry's), which provides two USB ports (one of which can be either a micro connector or the old clunky connector), a microSD slot for removable storage, a pad for Arduino shields, and a DC input/voltage regulator for a DC plug. It seems like the perfect board for doing low-power data collection.

Running dump1090 on the Edison

The first step in getting dump1090 to work on the Edison was compiling it as a 32b application on my desktop. This task took more effort than just adding the -m32 flag to the command line, as my Fedora 21 desktop was missing 32b libraries. I found I had to install the 32b versions of libusbx, libusbx-devel, rtl-sdr, and rtl-sdr-devel. Even after doing that, pkgconfig didn't seem to like things. I eventually wound up hardwiring all the lib directory paths in the Makefile.

The next step was transferring the executable and missing libraries to the Edison board. After some trial and error I found the only things I needed were dump1090 and the librtlsdr.so.0 shared library. I transferred these over with scp. I had to point LD_LIBRARY_PATH to pick up on the shared library, but otherwise dump1090 seemed to work pretty well.

Simple Logging with Perl

The next thing to do was write a simple Perl script that issued a request over a socket to the dump1090 program and then wrote the information to a text file stored on the sdcard. You could probably do this in a bash script with nc, but Perl seemed a little cleaner. The one obstacle I had to overcome was installing the Perl socket package on the Edison. Fortunately, I was able to find the package in an unofficial repo that other Edison developers are using.

#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket;

my $dir = "/media/sdcard";
my $sock;

# Keep retrying until dump1090's network port is up
do {
  sleep 1;
  $sock = IO::Socket::INET->new( PeerAddr => 'localhost',
                                 PeerPort => '30003',
                                 Proto    => 'tcp');
} while(!$sock);

my $fh;
my $prv_date = "";
while(<$sock>){
  chomp;
  next if (!/^MSG,[13]/);   # only keep ID (1) and position (3) messages
  my @x = split /,/;
  my ($id, $d1, $t1, $d2, $t2) = ($x[4], $x[6], $x[7], $x[8], $x[9]);
  my ($flt, $alt, $lat, $lon) = ($x[10], $x[11], $x[14], $x[15]);

  my ($day,$month,$year) = (localtime)[3,4,5];
  my $date = sprintf '%02d%02d%02d', $year-100, $month+1, $day;

  my $line;
  if($x[1]==1){
    $line = "1\t$id\t$flt\t$d1\t$t1";
  } else {
    $line = "3\t$id\t$lat\t$lon\t$alt\t$d1\t$t1\t$d2\t$t2";
  }
  # Roll over to a new log file whenever the date changes
  if($date ne $prv_date){
    close $fh if($prv_date ne "");
    open($fh, '>>', "$dir/$date.txt") or die "Bad file io";
    $fh->autoflush;
  }
  print $fh "$line\n";
  $prv_date = $date;
}

Starting as a Service

The last step of the project was writing some init scripts so that the dump1090 program and the data capture script would run automatically when the board is turned on. Whether you like it or not, the default OS on the Edison uses systemd to control how services are launched on the board. I wound up using someone else's script as a template for my services. The first thing to do was to create the following cdu_dump1090.service script in /lib/systemd/system to get the system to start the dump1090 app. Note that systemd wants to be the one that sets your environment vars.

[Unit]
Description=dump1090 rtl service
After=network.target

[Service]
Environment=LD_LIBRARY_PATH=/home/root/rtl
ExecStart=/home/root/rtl/dump1090 --net --quiet
Environment=NODE_ENV=production 

[Install]
WantedBy=multi-user.target

Next, I used the following cdu_store1090.service script to launch my Perl script. Even with a slight delay at start, I was finding that the logger sometimes started up before the socket was ready and errored out. Rather than mess with timing, I added the sleep/retry loop to the Perl code.

[Unit]
Description=Store 1090 data to sdcard
After=cdu_dump1090.service
After=media-sdcard.mount

[Service]
ExecStartPre=/bin/sleep 2
ExecStart=/home/root/rtl/storeit.pl
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target

In order to get systemd to use the changes at boot, I had to do the following:

systemctl daemon-reload
systemctl enable cdu_dump1090.service
systemctl enable cdu_store1090.service

It Works!

The end result of the project is that the system works: the Edison now boots up and automatically starts logging flight info to its microSD card. Having an embedded solution is handy because I can plug it into an outlet somewhere and have it run automatically without worrying about how much power it's using.


Run Faker

2014-11-14 Fri tracks gis code

TL;DR: It isn't that hard to use third-party sites like RunKeeper to load large amounts of fake health data into Virgin Health Miles. However, it's not worth doing, because VHM only gives you minimal credit for each track (the same problem most legitimate exercise activities have in VHM). I've written a Python program that helps you convert KML files into fake tracks you can upload, if you're passive-aggressive and don't like the idea of some corporation tracking where you go running.


Well-Being Incentive Program

A few years back my employer scaled back our benefits, which resulted in higher health care fees and worse coverage for all employees. As consolation, they started a well-being incentive program through Virgin Health Miles (VHM). It's an interesting idea because it encourages healthier behavior by treating it as a game: an employee receives a pedometer and then earns points for walking 7k, 12k, or 20k steps in a day. The VHM website provides other ways to earn points and includes ways to compete head to head with your friends. At the end of the year, my employer looks at your total points and puts a small amount of money in your health care spending account for each level you've completed. In theory this is a win-win for everyone. The company gets healthier employees that don't die in the office as much. Employees exercise more and get a little money for health care. VHM gets to sell their service to my employer and then gets tons of personal health information about my fellow employees (wait, what?).

As you'd expect, there are mixed feelings about the program. Many people participate, as it isn't hard to keep a pedometer in your pocket and it really does encourage you to do a little bit more. Others strongly resent the program, as it is invasive and has great potential for abuse (could my employer look at this info and use it against me, like in Gattaca?). Others have called out the privacy issues.

Given the number of engineers at my work, a lot of people looked into ways to thwart the hardware. People poked at the USB port and monitored the data transfers upstream, but AFAIK nobody managed to upload fake data. The most common hack was to just put the pedometer on something that jiggled a lot (like your kids). This was enough of a threat that pedometer fraud made it into a story in the Lockheed Martin Integrity Minute (worth watching all three episodes, btw).

My Sob Story

This year I actively started doing VHM, using both a pedometer to log my steps and RunKeeper to log my bike rides. When my first few bike rides netted only 10 points, I discovered that a track of GPS coordinates did not constitute sufficient proof that I had actually exercised (!). In order to "make it real" I had to buy a Polar Heart Rate Monitor, which cost as much as the first two rewards my employer would give me in the health incentive program. I bought it, because it sounded like it would help me get points for other kinds of exercise, like my elliptical machine.

Unfortunately, RunKeeper estimates how much exercise you've done by using your GPS track to calculate your distance. Since my elliptical machine doesn't move, RunKeeper logs my heart rate data, but then reports zero calories, since I went nowhere. When I asked VHM if there was a way to count my elliptical time, they said all I could do was get 10 points for an untrusted log entry, or count the steps with my pedometer (potentially getting 60-100 points).

The pedometer was fine, but then midway through the year it died (and not from going through the washing machine, either). I called up VHM and they wanted $17 for a new one. I wrote my employer and asked for a spare and was told tough luck. $17 isn't much, but I'd already blown $75 on the HRM, and it all started feeling like a scam. Anything where you have to pay money to earn the right to make money just doesn't sound right, especially if it's through your workplace.

Equivalent Stats

Outside of VHM, I've been keeping a log of all the different days I've exercised this year. I had the thought, wouldn't it be nice if I could upload that information in a way that would give me the points VHM would have awarded me if their interfaces weren't so terrible? What if I wrote something that could create data that looked like a run, but was actually just a proxy for my elliptical work?

I discovered that RunKeeper has a good interface for uploading data (since they just want you to exercise, after all), and that it accepts GPX and TCX formatted data files. I wrote a Python script to generate a GPX track file that ran around a circle at a rate and duration that matched my elliptical sessions. Since I knew VHM needed HR data, I then generated heart rate data for each point. Circles are kind of boring, so the next thing I did was add the ability to import a KML file of points and turn it into a run. Thus, you can go to one of the many track generator map sites, drop a bunch of points, and create the route you want to run. My program uses the route as a template to make a running loop, and jiggles the data so it isn't exactly the same every time through. Fun.
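A stripped-down sketch of the generation step (the pace, jitter amount, heart-rate range, and the <hr> element name are all placeholders here; real GPX heart-rate data normally lives in an extensions block, and the actual script derives each segment's duration from the Haversine distance rather than a fixed cadence):

```python
import random
from datetime import datetime, timedelta

def make_gpx(points, seconds_per_point=20, jitter=0.00005):
    """Turn (lon, lat) template points into a crude GPX track."""
    t = datetime(2014, 11, 14, 8, 0, 0)
    out = ["<gpx><trk><trkseg>"]
    for lon, lat in points:
        lon += random.uniform(-jitter, jitter)   # don't repeat exactly
        lat += random.uniform(-jitter, jitter)
        hr = random.randint(120, 150)            # plausible-looking HR
        out.append('<trkpt lat="%.6f" lon="%.6f">'
                   '<time>%sZ</time><hr>%d</hr></trkpt>'
                   % (lat, lon, t.isoformat(), hr))
        t += timedelta(seconds=seconds_per_point)
    out.append("</trkseg></trk></gpx>")
    return "\n".join(out)

gpx = make_gpx([(-121.76, 37.68), (-121.75, 37.68)])
print(gpx.splitlines()[0])
```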

A Few Sample Runs

For fun, I made a few KML template files for runs that would be difficult to actually do. The first one, at the top of this post, is a figure eight around the cooling towers at Three Mile Island. Next, since the Giants were in the World Series, I decided it would be fitting to do some laps around AT&T Park.


With all the stories in the news, I thought it would be fitting to squeeze in a few laps around the White House.


And last, I (in theory) made a break for it at Kirtland AFB and went out to see the old Atlas-I Trestle (a giant wooden platform they used to roll B-52s out on and dose them with EMP).


Mission Aborted

I decided to abort uploading all of my proxy data for two reasons. First, even with GPS and heart rate values, VHM still only assigns 10 points for each run. I was hoping to get "activity minute" credits, which would be on the order of 60 points per run, but alas, they must have some additional check on whether the data came from your phone or another source. This problem really emphasizes why I dislike VHM: they only acknowledge data from a small number of devices that are meant to log certain types of exercise. If you want to do something else, you're out of luck. Second, VHM's user agreement says something about not submitting fraudulent data. While I wouldn't consider uploading proxy data any less ethical than what people with pedometers do at the end of the year to get to the next level, I don't want them coming after me because I was trying to compensate for their poorly-built, privacy-invading system.

If someone wanted to pick this up, I'd recommend looking at the data files RunKeeper generates in its apps. I've compared "valid" routes to my "invalid" ones, and I don't see any red flags that would explain why the invalid ones get rejected upstream. I suspect RunKeeper passes along some info about where the data originated, which VHM uses to reject tracks that didn't come from the app.

Code

I've put my code up on GitHub for anyone that wants to generate their own fake data. It isn't too fancy: it just parses the KML file to get a list of coordinates and then generates a track with lon/lat/time values to simulate a run based on the template. It uses the Haversine formula to compute the distance between points so it can figure out how much time it takes to travel between them at a given speed.

github:rungen


International Airports

2014-10-14 Tue tracks gis code planes

The other day while riding the rental car shuttle to the Albuquerque International Sunport (i.e., the ABQ airport), I started thinking about how some airports are labeled as being international while others are not. My first thought was that it doesn't take much for an airport to become international: all you'd need is a few flights to Mexico. However, Wikipedia points out that international airports have to have customs and immigration, and generally need longer runways to support larger planes. In any case, it still seems like the international label is on a lot more airports than you'd expect. This got me wondering: how different are the workloads of different airports, anyway?

Since I have a big pile of airline data, I decided to see if I could better characterize airports by analyzing where the outgoing flights were heading. I wrote a Python script that takes a day's worth of flight data, extracts all direct flights that left a particular airport, and plots their altitude/distance traveled on the same graph. The X axis is the cumulative distance the plane flew, calculated by summing the distances between the coordinates in its track. The Y axis is altitude in miles. I picked several flights and verified that my distances roughly match the expected optimal paths between cities (see AirMilesCalculator.com).
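The X-axis bookkeeping is just a running sum along the track; here's a minimal sketch, where dist_fn stands in for whatever great-circle distance function you plug in (the real script uses Haversine):

```python
def cumulative_miles(track, dist_fn):
    """Running total of distance along a list of (lon, lat) points."""
    total, xs = 0.0, [0.0]
    for (lon1, lat1), (lon2, lat2) in zip(track, track[1:]):
        total += dist_fn(lon1, lat1, lon2, lat2)
        xs.append(total)
    return xs

# With a dummy dist_fn that calls every segment one mile long:
print(cumulative_miles([(0, 0), (1, 0), (2, 0)], lambda *a: 1.0))
```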

Below are some plots for SFO, ATL, ABQ, and CHS, all of which are international airports. A few interesting things pop out of the charts. First, SFO has a broad mix of travel, including local (LAX is about 340 miles), domestic (many stops between here and the east coast), and international (the large gaps in distances are oceans). ATL is similar, but has a lot more variety in the under-1,000-mile range (due to the number of airports on the east coast). ATL also has plenty of international flights, but they're shorter since Atlanta is closer to Europe. Interestingly, the longest flights for both SFO and ATL in this sample were to Dubai. In contrast, the international Sunport (ABQ) and Charleston (CHS) didn't have much range. In ABQ's case this can partially be attributed to it being towards the middle of the country (and close to Mexico).


This started out as a fun project that I quickly hacked together to get a first-order result. I had the data, so all it took was building something to calculate the distances and plot them. The first plots showed a lot of promise, but I also noticed a lot of problems that needed fixing. Those fixes ate up a good bit of my time.

Distance Problems

The first problem I had was that my distances were completely wrong. I had several instances of flights that were going 30,000 miles, which is bigger than the circumference of the planet. My initial thought was that I wasn't properly handling flights that crossed the international dateline. I went through several variations of the code that looked for dateline crossings (i.e., the lon jumps from near +/-180 to near -/+180) and fixed their distances. This helped, but the long flights were still about 2x longer than they should have been.

I realized later the problem was my distance calculator. I'd hacked something together that just found the Euclidean distance in degrees and then converted degrees to miles. That would be OK if the world were flat and lon/lat formed a rectilinear grid, but the world is round and lon/lat grid cells shrink as you near the poles. I felt pretty stupid when I realized my mistake. A quick look on Stack Overflow pointed me to the Haversine formula and gave me code I could plug in. The numbers started working out after that.
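For reference, a sketch of the Haversine calculation (a nice side effect is that the sin terms treat a 359-degree longitude difference the same as a 1-degree one, so dateline wraparound is handled for free):

```python
import math

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in miles."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 3959.0 * 2 * math.asin(math.sqrt(a))   # mean Earth radius in miles

# SFO to LAX comes out near the ~340 miles mentioned above
print(round(haversine_miles(-122.375, 37.619, -118.408, 33.943)))

# Crossing the dateline: about the same as one degree of lon at the equator
print(round(haversine_miles(179.5, 0.0, -179.5, 0.0)))
```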

Multihop Problems

Another problem I hit was that my original data described what a plane did during the whole day rather than the individual flights it took. At first I tried to compensate in my plotter with an inline splicer that used altitudes to chop a plane's data into multiple hops. This partially worked, but the data wasn't always sampled at the right time, so some flights never quite go to zero. I raised the cutoffs, but then had to add some logic to figure out whether a plane was landing or taking off. A lot of the time this worked, but there were still a bunch of individual cases that slipped by and made the plots noisy.

The solution was to go back to the original data and just use the source field that sits in the data. I reworked the data and generated all single hops based on that field. A nice benefit of this approach is that it makes it easier to search for the airport of origin: previously I'd been using lon/lat boundaries for different airports to detect when a plane was leaving a particular airport, but now I can just look for the tag. This solution does have some problems, as there are many flights that don't label their src/dst. I've also found a few where the pilot must have forgotten to change the src/dst for the flight (e.g., one SFO flight to Chicago went on to Europe).

Velocity Problems

The last problem (and one I don't have a good fix for) is velocity. Looking at the individual distances and times for track segments, I sometimes see situations where the calculated velocity is faster than the speed of sound (I'm assuming no Concordes are flying). Often this happens out in the middle of the ocean, where the data becomes sparse. It could be bad values in the data, or maybe two observers reporting the same thing differently. I'll need to take a closer look later. For now I just put in some kludge filters that throw out flights that are out of range.
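A sketch of that kind of kludge filter; the 600 mph cutoff here is an arbitrary number picked for illustration, not the one the real scripts use:

```python
MAX_MPH = 600.0   # assumed ceiling for plausible ground speed

def plausible(segments):
    """Reject a flight if any (miles, seconds) segment implies
    a ground speed over the cutoff."""
    for miles, seconds in segments:
        if seconds > 0 and miles / (seconds / 3600.0) > MAX_MPH:
            return False
    return True

print(plausible([(100, 720)]))   # 500 mph segment
print(plausible([(300, 720)]))   # 1500 mph segment
```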

Code

I decided to go ahead and make some of the scripts I've been writing available on GitHub. This one is called Cannonball Plotter and can be found in: github:airline-plotters