Gaps in Airline Data

2015-02-14 Sat
gis planes

Mind the gaps. Whenever I use a public website that provides airline information, I'm impressed with how much they know about flights that are in progress all around the world. However, there have been a number of times when I've started drilling down into the data only to find that the samples I want aren't there. Given that (I believe) a lot of the data actually comes from volunteers who monitor their own local regions with something like dump1090, gaps in coverage should be expected. That got me thinking: could I look at a day's worth of airline track data and estimate where there isn't coverage?

Earlier this week (while flying to Albuquerque!) I wrote a script to search for gaps in a collection of airline tracks. All this script does is walk through a track and inspect the amount of time between consecutive sample points. If the time for a segment is greater than a certain threshold, I assume the plane wasn't in a place where anyone could hear it and plot the segment in red. Click on the pictures to get a closer view.
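The core test is tiny. Here's a minimal sketch of the idea in Python (the sample format and field names are hypothetical; the real script is linked at the bottom of this post):

    # Sketch: find segments where a plane went silent for too long.
    # Assumes each sample is a dict with 'time' (epoch seconds), 'lon', 'lat'.
    GAP_THRESHOLD_S = 10 * 60  # assumption: >10 minutes quiet means no coverage

    def find_gaps(track, threshold_s=GAP_THRESHOLD_S):
        """Yield (prev, curr) sample pairs that bracket a reception gap."""
        for prev, curr in zip(track, track[1:]):
            if curr['time'] - prev['time'] > threshold_s:
                yield prev, curr  # this is the segment that gets plotted in red

    track = [{'time': 0,    'lon': -122.4, 'lat': 37.6},
             {'time': 60,   'lon': -122.0, 'lat': 37.7},
             {'time': 3660, 'lon': -115.1, 'lat': 36.1}]  # an hour of silence
    for a, b in find_gaps(track):
        print('gap: (%g, %g) -> (%g, %g)' % (a['lon'], a['lat'], b['lon'], b['lat']))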

Again, missing-data segments are in red (segments with data are not plotted, since they would overwhelm the plots). As expected, a lot of gaps appear over the oceans, where nobody is listening. As FlightRadar24 has pointed out, coverage in a given country depends on its ADS-B policies and ground stations.

USA

I was a little surprised to see that the US has some dead space in the middle and southeast. In hindsight, maybe that shouldn't be surprising: the FAA doesn't make the data available for free (AFAIK), and there are large unpopulated areas in the country.

Europe

Europe seems to be well covered. I'm not sure if that's because people do a lot more tracking there, or if governments do the right thing and make the data available.

Western Pacific Rim

The Western Pacific Rim is interesting. Japan looks like it has excellent coverage. China must have coverage in the cities or along the coast, as there are a lot of flights with missing data over the interior.

I'd been hoping there'd be something more interesting to look at around Korea. I've read that North Korea jams GPS from time to time, and I was hoping I'd see a lot of gaps there, but there weren't any reports of jamming in 2014 (AFAIK). Plus, I don't think many planes fly over North Korea to begin with.

Code

The gap_plotter.py script I threw together for this, along with a tiny sample dataset, can be found here:

github:airline-plotters

This code just overlays line plots, so it's slow and breaks if you throw more than a day's worth of data at it. Some day I'd like to go back and build a proper analysis tool that grids and counts things.

Laser Sketcher

2014-12-12 Fri
edison

Tonight I used my Intel Edison to revive an old laser sketcher project I originally built for an "interfacing small computers" class nearly 20 years ago. This version is slow, crude, and low-res, but good enough to bring back some memories of previous work.

Interfacing Small Computers Class

Back in undergrad I took a class called "Interfacing Small Computers" that was all about connecting hardware to PCs of that era. It was a fun course that covered a number of practical hardware/software design issues you had to deal with when developing cards that plugged into the PC buses of the day (AT, ISA, EISA, MCA, ...). The labs were hard but fun: they used a special breakout board that routed all the signals from the ISA bus out to a large breadboard where you could connect discrete logic chips. Students had to build circuits that decoded the bus's address signals, read and wrote the data bus, and did things like trigger interrupts. On the host side we wrote DOS drivers and simple applications in C to control the hardware. It was the most complicated breadboard work I've ever had to do (teaching me that designing a few 8-bit buses is a lot easier than wiring them up by hand).

In retrospect I was lucky to have taken the class at a time when you could still interface to a PC using a breadboard and TTL logic gates. Towards the end of the class we caught a glimpse of where I/O was heading: we started using Xilinx FPGAs to implement our bus-handling logic. It wasn't long after this that PCI came out and ASICs/FPGAs became the only practical way to put your hardware on a bus.

Class Project

Towards the end of the quarter, my lab partner and I struggled to think of something we could build for the open-ended final lab project. In search of ideas, I talked to a friend of mine outside of school who had a knack for building interesting things. He said he'd just acquired an old, broken laser disc player that had some interesting parts I could have. In addition to a bulky HeNe laser, the player had an X-Y mirror targeting system that was used to aim the laser at the right spot on the disc. The X-Y mirrors were a pretty clever design: each axis simply placed a mirror at the end of a slug in an inductor coil. The inductor moved the slug in or out of the coil depending on changes in current, so all you had to do was connect the coil to an analog output and change its voltage to position the mirror. The two mirror coils came in a single unit that was already oriented properly for X and Y reflections.

We went about building a simple board to feed the coils. The HeNe laser was too big (and had a dicey power supply), but luckily my friend also had a new red laser diode I could use. We positioned it to hit the X-Y mirrors and fastened both to the board. Next, we used a pair of piggy-backed op-amps to amplify the signals going to the inductors. Finally, we used a pair of 8-bit DACs to convert our digital data values to analog signals. We picked DACs with built-in input registers, which meant our ISA bus logic only had to trigger the appropriate DAC to grab data off the ISA data bus whenever the X or Y address appeared on the ISA address bus. Our design had to use a brand-new Xilinx FPGA the lab had just received, so my partner ported the address-decode logic to the FPGA.

I wrote some simple software for the host that continuously streamed coordinate values to the mirror inductor coils (in retrospect, we should have buffered the coordinates in the FPGA, but at the time the FPGA was a big unknown). The mirror coils probably weren't designed to run at high speeds, but we were able to stream data to them fast enough to render simple geometries like squares and circles. We got a lot of praise from others in the lab, as it was definitely a low-complexity, high-satisfaction project.

Reviving the Laser Sketcher with the Edison

After the class ended, the sketcher sat unused because I no longer had an obvious way to stream data into the board. I looked at connecting it to a parallel port, but the whole thing got shelved for lack of time. After graduating, I thought about driving the sketcher with an AVR microcontroller, but the AVRs didn't provide an easy way to update the coordinate data. The Intel Edison lowered the effort bar so much that I didn't have an excuse to put this off anymore.

Connecting the laser sketcher to the Edison was pretty easy. All I needed was a way to supply analog X-Y signals to the amplifiers. The Edison's Arduino breakout board has a few pulse-width modulation (PWM) pins that approximate an analog signal with pulse trains. The board has simple drivers for writing to the PWM generators, so all I had to do was create some data values and stream them to the pins as fast as possible. The pulses make the output a little blocky, but it's good enough for now. It'd probably be better to smooth the signal with a capacitor or an op-amp integrator with timed clearing.
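The streaming loop amounts to something like the sketch below. I'm assuming the libmraa Python bindings that ship with the Edison image, and PWM-capable pins 3 and 5 on the Arduino breakout; your wiring will differ.

    # Sketch: stream a circle to the mirror coils over two PWM pins.
    import math
    import mraa  # Intel's I/O library for the Edison/Galileo boards

    pwm_x = mraa.Pwm(3)  # assumption: pins 3 and 5 are the PWM outputs in use
    pwm_y = mraa.Pwm(5)
    for pwm in (pwm_x, pwm_y):
        pwm.period_us(100)  # short period so the coils see a smoother average
        pwm.enable(True)

    # Precompute one lap of a circle as (x, y) duty cycles in 0..1.
    points = [(0.5 + 0.4 * math.cos(2 * math.pi * i / 64),
               0.5 + 0.4 * math.sin(2 * math.pi * i / 64)) for i in range(64)]

    while True:  # stream the coordinates as fast as possible
        for x, y in points:
            pwm_x.write(x)  # the duty cycle stands in for an analog voltage
            pwm_y.write(y)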

Run Faker

2014-11-14 Fri
tracks gis code

TL;DR: It isn't that hard to use third-party sites like RunKeeper to load large amounts of fake health data into Virgin Health Miles. However, it isn't worth doing, because VHM only gives you minimal credit for each track (the same problem most legitimate exercise activities have in VHM). I've written a Python program that converts KML files into fake tracks you can upload, in case you're passive-aggressive and don't like the idea of some corporation tracking where you go running.

A nice jog on Three Mile Island

Well-Being Incentive Program

A few years back my employer scaled back our benefits, resulting in higher health care fees and worse coverage for all employees. As a consolation, they started a well-being incentive program through Virgin Health Miles (VHM). It's an interesting idea because it encourages healthier behaviors by treating them as a game: an employee receives a pedometer and earns points for walking 7k, 12k, or 20k steps in a day. The VHM website provides other ways to earn points and includes ways to compete head-to-head with your friends. At the end of the year, my employer looks at your total points and puts a small amount of money in your health care spending account for each level you've completed. In theory this is a win-win for everyone. The company gets healthier employees who don't die in the office as much. Employees exercise more and get a little money for health care. VHM gets to sell its service to my employer and then collect tons of personal health information about my fellow employees (wait, what?).

As you'd expect, there are mixed feelings about the program. Many people participate, since it isn't hard to keep a pedometer in your pocket and it really does encourage you to do a little bit more. Others strongly resent the program, as it is invasive and has great potential for abuse (could my employer look at this info and use it against me, like in Gattaca?). Others have called out the privacy issue.

Given the number of engineers at my work, a lot of people looked into ways to thwart the hardware. People poked at the USB port and monitored the data transfers upstream, but AFAIK nobody managed a successful upload of fake data. The most common hack was to just put your pedometer on something that jiggles a lot (like your kids). This was enough of a threat that pedometer fraud made it into a story in the Lockheed Martin Integrity Minute (worth watching all three episodes, BTW).

My Sob Story

This year I started actively doing VHM, using both a pedometer to log my steps and RunKeeper to log my bike rides. When my first few bike rides netted only 10 points each, I discovered that a track of GPS coordinates does not constitute sufficient proof that I actually exercised (!). In order to "make it real" I had to buy a Polar heart rate monitor, which cost as much as the first two rewards my employer would give me in the incentive program. I bought it because it sounded like it would help me get points for other kinds of exercise, like my elliptical machine.

Unfortunately, RunKeeper estimates how much exercise you've done by using your GPS track to calculate your distance. Since my elliptical machine doesn't move, RunKeeper logs my heart rate data, but then reports zero calories, since I went nowhere. When I asked VHM if there was a way to count my elliptical time, they said all I could do was get 10 points for an untrusted log entry, or count the steps with my pedometer (potentially getting 60-100 points).

The pedometer was fine, but midway through the year it died (and not from going through the washing machine, either). I called up VHM and they wanted $17 for a new one. I wrote my employer to ask for a spare and was told tough luck. $17 isn't much, but I'd already blown $75 on the HRM, and the whole thing started feeling like a scam. Anything where you have to pay money to earn the right to make money just doesn't sound right, especially if it's through your workplace.

Equivalent Stats

Outside of VHM, I've been keeping a log of all the days I've exercised this year. That gave me a thought: wouldn't it be nice if I could upload that information in a way that would give me the points VHM would have awarded if their interfaces weren't so terrible? What if I wrote something that could create data that looked like a run, but was actually just a proxy for my elliptical work?

I discovered that RunKeeper has a good interface for uploading data (they just want you to exercise, after all) and accepts GPX- and TCX-formatted data files. I wrote a Python script to generate a GPX track that runs around a circle at a rate and duration matching my elliptical sessions. Since I knew VHM needed HR data, I also generated heart rate data for each point. Circles are kind of boring, so the next thing I did was add the ability to import a KML file of points and turn it into a run. Thus, you can go to one of the many track-generator map sites, drop a bunch of points, and create the route you want to run. My program uses the route as a template for a running loop, and jiggles the data so it isn't exactly the same every time through. Fun.
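Here's a trimmed-down sketch of the circle version. The real script reads a KML template instead, and I'm assuming the Garmin TrackPointExtension convention for embedding heart rate in GPX (the pace, date, and HR numbers here are arbitrary):

    # Sketch: emit a GPX track that runs in a circle with fake HR samples.
    import math
    from datetime import datetime, timedelta

    def circle_run(lat0, lon0, radius_deg=0.001, minutes=30, hr=140, step_s=5):
        start = datetime(2014, 11, 14, 12, 0, 0)  # arbitrary start time
        pts = []
        for i in range((minutes * 60) // step_s):
            theta = 2 * math.pi * i / 60.0  # one lap per 60 samples
            pts.append((lat0 + radius_deg * math.sin(theta),
                        lon0 + radius_deg * math.cos(theta),
                        start + timedelta(seconds=i * step_s),
                        hr + (i % 7)))      # jiggle the HR so it isn't constant
        body = '\n'.join(
            '   <trkpt lat="%.6f" lon="%.6f"><time>%sZ</time><extensions>'
            '<gpxtpx:TrackPointExtension><gpxtpx:hr>%d</gpxtpx:hr>'
            '</gpxtpx:TrackPointExtension></extensions></trkpt>'
            % (lat, lon, t.isoformat(), h) for lat, lon, t, h in pts)
        return ('<?xml version="1.0"?>\n'
                '<gpx version="1.1" creator="rungen"\n'
                '     xmlns="http://www.topografix.com/GPX/1/1"\n'
                '     xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1">\n'
                ' <trk><trkseg>\n%s\n </trkseg></trk>\n</gpx>\n' % body)

    open('fake_run.gpx', 'w').write(circle_run(40.153, -76.724))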

A Few Sample Runs

For fun, I made a few KML template files for runs that would be difficult for people to actually do. The first one, at the top of this post, is a figure eight around the cooling towers at Three Mile Island. Next, since the Giants were in the World Series, I decided it would be fitting to do some laps around AT&T Park.

With all the stories in the news, I also thought I should squeeze in a few laps around the White House.

And last, I (in theory) made a break for it at Kirtland AFB and went out to see the old Atlas-I Trestle (a giant wooden platform they used to roll B-52s onto and dose them with EMP).

Mission Aborted

I decided to abort uploading all of my proxy data for two reasons. First, even with GPS and heartbeat values, VHM still only assigns 10 points per run. I was hoping to get "activity minute" credits, which would be on the order of 60 points per run, but alas, they must have some additional check on whether the data came from your phone or another source. This problem really emphasizes why I dislike VHM: they only acknowledge data from a small number of devices that are meant to log certain types of exercise. If you want to do something else, you're out of luck. Second, VHM's user agreement says something about not submitting fraudulent data. While I wouldn't consider uploading proxy data to be any less ethical than what people with pedometers do at the end of the day to get to the next level, I don't want them coming after me because I was trying to compensate for their poorly-built, privacy-invading system.

If someone wanted to pick this up, I'd recommend looking at the data files RunKeeper generates in its apps. I've compared "valid" routes to my "invalid" ones, and I don't see any red flags that would get the invalid ones rejected upstream. I suspect RunKeeper passes along some info about where the data originated, which VHM uses to reject tracks that didn't come from the app.

Code

I've put my code up on GitHub for anyone who wants to generate their own fake data. It isn't too fancy: it just parses the KML file to get a list of coordinates, and then generates a track with lon/lat/time values that simulates a run based on the template. It uses the Haversine formula to compute the distance between points so it can figure out how long it takes to travel between them at a given speed.

github:rungen
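The distance and timing step looks roughly like this (the helper names are mine, not necessarily what's in the repo):

    # Sketch: Haversine distance between template points, then the time a
    # runner at a given speed needs to cover each leg.
    import math

    EARTH_RADIUS_MI = 3959.0

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance between two lat/lon points, in miles."""
        lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
        a = (math.sin((lat2 - lat1) / 2) ** 2 +
             math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

    def leg_seconds(p1, p2, mph=6.0):
        """Time to cover the leg from p1 to p2 (each a (lat, lon) pair)."""
        return 3600.0 * haversine_miles(p1[0], p1[1], p2[0], p2[1]) / mph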

C++ Serialization Tests

2014-10-20 Mon
cpp

Data object serialization in C++ is something that's been bugging me for a long time. I'm developing a distributed memory management library named Kelpie in C++ that sits on top of Nessie, an RDMA/RPC communication library for HPC that's written in C. Nessie is a solid library and does a number of low-level things that make my life easier and my code portable to different HPC platforms. However, since it's written in C, there have been a number of times where I've had to do awkward things to interface my C++ code to it. Serialization is one major sore point. Nessie uses XDR for serialization, which doesn't work with C++ strings and just feels cumbersome. I decided to write a little benchmark measuring how well different popular serialization libraries perform, so I could convince the Nessie developers to support something else.

Serialization Libraries

Serialization is the process of converting a data structure into a contiguous series of bytes that can be transported over the network and revived into an object on the remote side without any additional information. This process often has to account for hardware differences between source and destination (byte endianness and word widths), as well as conversions between programming languages. Some serialization libraries use a definition file to generate code for your application, while others provide hooks so you can define how objects are serialized in your code. I experimented with four main serialization libraries in this test.

There are other serialization libraries out there that I've experimented with but excluded from this test (Thrift, Avro, Cap'n Proto, FlatBuffers, etc.). Each library requires a good bit of tinkering to set up and incorporate into a test. I'd like to revisit them at some point for a bigger comparison.

Experiments

I'm only interested in seeing how well these libraries perform with the types of data structures I frequently ship in my Kelpie library. My messages usually carry a small number of (integer) default values followed by a series of id/string data entries. The test program generates all the application data at start time and then serializes different amounts of input by varying how many id/string entries are packed. Packing, for me, involves all the steps of converting my application's data into a buffer my RDMA library can transmit. In Boost, Cereal, and PB, this RDMA buffer requirement means copying the packed data to a memory location suitable for transmission (XDR lets me write directly into the buffer).
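To make the message shape concrete, here's a rough Python sketch of the packing pattern using the standard library's xdrlib module (the real benchmarks are in C++, and the field names here are made up):

    # Sketch: a few integer defaults followed by N id/string records,
    # packed XDR-style.
    import xdrlib

    def pack_message(version, flags, records):
        p = xdrlib.Packer()
        p.pack_uint(version)             # the small set of integer defaults...
        p.pack_uint(flags)
        p.pack_uint(len(records))
        for rec_id, payload in records:  # ...then the id/string entries
            p.pack_uhyper(rec_id)        # 64-bit id
            p.pack_string(payload)       # length-prefixed, padded to 4 bytes
        return p.get_buffer()

    buf = pack_message(1, 0, [(42, b'hello'), (43, b'world')])
    print(len(buf), 'bytes packed')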

Measurements

In separate tests I measure the encode/decode times, as well as the size of the packed data that would be transmitted over the network. The packed message sizes show that PB performs compression and achieves a noticeable improvement over the other libraries at all sizes. Boost carries the most overhead. Message size is important to me in Kelpie, as there are sizable communication penalties if a message doesn't fit in a single MTU. Beyond fitting everything into an MTU, the differences between the libraries aren't that significant.

Encoding speed, however, is very important to me, as every step in the communication path counts towards the latency of my operations. I was surprised and impressed by both XDR and PB: they performed well and were significantly better than Boost and Cereal at the left end of the plot, where I care the most.

Decode times were another interesting story. XDR did very well for short messages and remained competitive with Cereal for larger ones. I was surprised PB did so poorly, even getting beaten by Boost once more than 32 records were sent. Decode isn't as critical as encode (since the sender is free to do more work once the message leaves), but it does add to the communication cost.

Back to XDR?

I was pretty disappointed by these tests: I'd hoped that with all their shiny C++ features, the newer libraries would outperform our aging XDR code. By some metrics that was true, but there was no universal winner: the air goes somewhere when you squeeze the balloon. It's important to point out that these tests were designed to reflect my particular scenario and are not meant to be an end-all comparison of serialization libraries. Most people don't have my buffer constraints and would see different results. The best thing to do is code it up yourself and try it with your own data.

International Airports

2014-10-14 Tue
tracks gis code planes

The other day, while riding the rental car shuttle to the Albuquerque International Sunport (i.e., the ABQ airport), I started thinking about how some airports are labeled international while others are not. My first thought was that it didn't take much for an airport to become international: all you'd need is a few flights to Mexico. However, Wikipedia points out that international airports have to have customs and immigration, and generally need longer runways to support larger planes. In any case, it still seems like the international label is on a lot more airports than you'd expect. This got me wondering: how different are the workloads of different airports, anyway?

Since I have a big pile of airline data, I decided to see if I could better characterize airports by analyzing where their outgoing flights head. I wrote a Python script that takes a day's worth of flight data, extracts all direct flights that left a particular airport, and plots their altitude versus distance traveled on the same graph. The X axis is the cumulative distance the plane flew, calculated by summing the distances between the coordinates in its track. The Y axis is altitude in miles. I picked several flights and verified that my distances roughly match the expected optimal paths between cities (see AirMilesCalculator.com).
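A condensed sketch of the plotting step (the track format and field layout are my guesses, not the script's actual structures):

    # Sketch: accumulate Haversine distance along each departing flight's
    # track and plot altitude against it.
    import math
    import matplotlib.pyplot as plt

    def haversine_miles(lat1, lon1, lat2, lon2):
        lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
        a = (math.sin((lat2 - lat1) / 2) ** 2 +
             math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 3959.0 * math.asin(math.sqrt(a))

    def plot_departures(flights):
        """flights: list of tracks, each track a list of (lat, lon, alt_mi)."""
        for track in flights:
            dist, xs, ys = 0.0, [0.0], [track[0][2]]
            for p, q in zip(track, track[1:]):
                dist += haversine_miles(p[0], p[1], q[0], q[1])
                xs.append(dist)
                ys.append(q[2])
            plt.plot(xs, ys, linewidth=0.5)
        plt.xlabel('cumulative distance flown (miles)')
        plt.ylabel('altitude (miles)')
        plt.show()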

Below are some plots for SFO, ATL, ABQ, and CHS, all of which are international airports. A few interesting things pop out of the charts. First, SFO has a broad mix of travel: local (LAX is about 340 miles), domestic (many stops between here and the east coast), and international (the large gaps in distances are oceans). ATL is similar, but has a lot more variety in the under-1,000-mile range (due to the number of airports on the east coast). ATL also has plenty of international flights, but they're shorter since Atlanta is closer to Europe. Interestingly, the longest flights from both SFO and ATL in this sample were to Dubai. In contrast, the international Sunport (ABQ) and Charleston (CHS) didn't have much range. In ABQ's case, this can partially be attributed to it being near the middle of the country (and close to Mexico).

This started out as a fun project I quickly hacked together to get a first-order result. I had the data, so all it took was building something to calculate the distances and plot them. The first plots showed a lot of promise, but I also noticed a lot of problems that needed fixing. Those fixes ate up a good bit of my time.

Distance Problems

The first problem was that my distances were completely wrong. I had several instances of flights covering 30,000 miles, which is bigger than the circumference of the planet. My initial thought was that I wasn't properly handling flights that crossed the International Date Line. I went through several variations of code that looked for dateline crossings (i.e., longitude jumping from near +/-180 to near -/+180) and fixed their distances. This helped, but the long flights were still twice as long as they should have been.

I later realized the problem was my distance calculator. I'd hacked something together that just found the Euclidean distance in degrees and converted degrees to miles. That would be OK if the world were flat and lon/lat formed a rectilinear grid, but the world is round and lon/lat grid cells shrink as you near the poles. I felt pretty stupid when I realized my mistake. A quick look on Stack Overflow pointed me to the Haversine formula and gave me code I could plug in. The numbers started working out after that.

Multihop Problems

Another problem was that my original data described what a plane did during the whole day rather than the individual flights it took. At first I tried to compensate in my plotter with an inline splicer that used altitude to chop the plane's data into multiple hops. This partially worked, but the data wasn't always sampled at the right time, so some flights never quite reach zero altitude. I raised the cutoffs, but then had to add logic to figure out whether a plane was landing or taking off. A lot of the time this worked, but a bunch of individual cases still slipped by and made the plots noisy.
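The splicer boiled down to something like this sketch (simplified, with a hypothetical track format); the landing/takeoff logic here is just "only split on a downward crossing of the cutoff":

    # Sketch: chop a plane's day-long track into hops whenever its altitude
    # falls through a cutoff (i.e., it appears to be landing).
    ALT_CUTOFF_MI = 0.5  # raised above zero because landings are sampled spottily

    def splice_hops(track, cutoff=ALT_CUTOFF_MI):
        """track: [(time, lat, lon, alt_mi), ...] -> list of per-hop tracks."""
        hops, current = [], [track[0]]
        for prev, curr in zip(track, track[1:]):
            current.append(curr)
            if curr[3] < cutoff <= prev[3]:  # downward crossing: call it a landing
                hops.append(current)
                current = [curr]
        hops.append(current)
        return [h for h in hops if len(h) > 2]  # drop tiny fragments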

The solution was to go back to the original data and use the source field that already sits in it. I reworked the data and generated all single hops based on that field. A nice side benefit was that it made it easier to search for the airport of origin: previously I'd been using lon/lat boundaries for different airports to detect when a plane was leaving a certain airport, and now I can just look for the tag. This approach still has problems, as many flights don't label their src/dst. I've also found a few where the pilot must have forgotten to change the src/dst for the flight (e.g., one SFO flight to Chicago went on to Europe).

Velocity Problems

The last problem (and one I don't have a good fix for) is velocity. Looking at the individual distances and times for track segments, I sometimes see calculated velocities faster than the speed of sound (and I'm assuming Concordes aren't flying). This often happens out in the middle of the ocean, where the data becomes sparse. It could be bad values in the data, or maybe two observers reporting the same thing differently. I'll take a closer look later. For now I've put in some kludge filters that throw out flights that are out of range.
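The kludge is about as simple as it sounds; a sketch, reusing the haversine_miles helper from the plotting sketch above (the speed limit is my own arbitrary pick):

    # Sketch: drop any flight with a segment velocity that's implausibly fast.
    MAX_MPH = 700.0  # a bit above airliner cruise; no Concordes assumed

    def plausible(track):
        """track: [(time_s, lat, lon), ...]; False if any segment is too fast."""
        for p, q in zip(track, track[1:]):
            dt_h = (q[0] - p[0]) / 3600.0
            if dt_h <= 0:
                return False  # duplicate or out-of-order timestamps
            if haversine_miles(p[1], p[2], q[1], q[2]) / dt_h > MAX_MPH:
                return False
        return True

    # flights = [f for f in flights if plausible(f)]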

Code

I decided to go ahead and make some of the scripts I've been writing available on GitHub. This one is called the Canonball Plotter, and it can be found at: github:airline-plotters