Craig Ulmer

Run Faker

2014-11-14 Fri tracks gis code

TL;DR: It isn't that hard to use third-party sites like RunKeeper to load large amounts of fake health data into Virgin Health Miles. However, it's not worth doing, because VHM only gives you minimal credit for each track (the same problem most legitimate exercise activities have in VHM). I've written a Python program that helps you convert KML files into fake tracks you can upload, if you're passive-aggressive and don't like the idea of some corporation tracking where you go running.


Well-Being Incentive Program

A few years back my employer scaled back our benefits, which resulted in higher health care fees and worse coverage for all employees. As consolation, they started a well-being incentive program through Virgin Health Miles (VHM). It's an interesting idea because they encourage healthier behaviors by treating it as a game: an employee receives a pedometer and then earns points for walking 7k, 12k, or 20k steps in a day. The VHM website provides other ways to earn points and includes ways to compete head-to-head with your friends. At the end of the year, my employer looks at your total points and puts a small amount of money in your health care spending account for each level you've completed. In theory this is a win-win for everyone. The company gets healthier employees that don't die in the office as much. Employees exercise more and get a little money for health care. VHM gets to sell their service to my employer and then gets tons of personal health information about my fellow employees (wait, what?).

As you'd expect, there are mixed feelings about the program. Many people participate, as it isn't hard to keep a pedometer in your pocket and it really does encourage you to do a little bit more. Others strongly resent the program, as it is invasive and has great potential for abuse (could my employer look at this info and use it against me, like in Gattaca?). Others have called out the privacy issue.

Given the number of engineers at my work, a lot of people looked into ways to thwart the hardware. People poked at the USB port and monitored the data transfers upwards, but afaik, nobody managed a successful upload of fake data. The most common hack was to just put the pedometer on something that jiggled a lot (like your kids). This was such a threat that pedometer fraud made it into a story in the Lockheed Martin Integrity Minute (worth watching all three episodes, btw).

My Sob Story

This year I actively started doing VHM, using both a pedometer to log my steps and RunKeeper to log my bike rides. When my first few bike rides netted only 10 points, I discovered that a track of GPS coordinates did not constitute sufficient proof that I had actually exercised (!). In order to "make it real" I had to buy a Polar Heart Rate Monitor, which cost as much as the first two rewards my employer would give me in the health incentive program. I bought it, because it sounded like it would help me get points for other kinds of exercise, like my elliptical machine.

Unfortunately, RunKeeper estimates how much exercise you've done by using your GPS track to calculate your distance. Since my elliptical machine doesn't move, RunKeeper logs my heart rate data, but then reports zero calories, since I went nowhere. When I asked VHM if there was a way to count my elliptical time, they said all I could do was get 10 points for an untrusted log entry, or count the steps with my pedometer (potentially getting 60-100 points).

The pedometer was fine, but then midway through the year it died (not from going through the washing machine, either). I called up VHM and they wanted $17 for a new one. I wrote my employer to ask for a spare and was told tough luck. $17 isn't much, but I'd already blown $75 on the HRM, and it all started feeling like a scam. Any system where you have to pay money to earn the right to make money just doesn't sound right, especially if it's through your workplace.

Equivalent Stats

Outside of VHM, I've been keeping a log of all the different days I've exercised this year. I had a thought: wouldn't it be nice if I could upload that information in a way that would give me the points VHM would have awarded me if their interfaces weren't so terrible? What if I wrote something that could create data that looked like a run, but was actually just a proxy for my elliptical work?

I discovered that RunKeeper has a good interface for uploading data (since they just want you to exercise, after all), and that they accept GPX- and TCX-formatted data files. I wrote a Python script to generate a GPX track file that ran around a circle at a rate and duration that matched my elliptical runs. Since I knew VHM needed HR data, I then generated heart rate data for each point. Circles are kind of boring, so the next thing I did was add the ability to import a KML file of points and turn it into a run. Thus, you can go to one of the many track-generator map sites, drop a bunch of points, and create the route you want to run. My program uses the route as a template to make a running loop, and jiggles the data so it isn't exactly the same every time through. Fun.
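The generation step can be sketched roughly like this. This is a simplified version of the idea, not the actual rungen code: the cadence, jitter amount, and heart-rate range here are made-up illustration values, and the heart-rate extension tag follows the Garmin TrackPointExtension convention.

```python
import random
from datetime import datetime, timedelta

def fake_gpx_points(template, secs_per_point=15, jitter=0.00005, hr_base=140):
    """template: list of (lat, lon) points pulled from a KML route.
    Emits GPX <trkpt> elements that pace through the template at a
    fixed cadence, jiggling each point so repeat uploads differ."""
    t = datetime(2014, 11, 14, 7, 0, 0)   # arbitrary start time
    out = []
    for lat, lon in template:
        # jiggle each coordinate a little so the track isn't identical
        # every time through the loop
        jlat = lat + random.uniform(-jitter, jitter)
        jlon = lon + random.uniform(-jitter, jitter)
        hr = hr_base + random.randint(-5, 5)  # synthetic heart rate sample
        out.append(
            '<trkpt lat="%.6f" lon="%.6f"><time>%sZ</time>'
            '<extensions><gpxtpx:hr>%d</gpxtpx:hr></extensions></trkpt>'
            % (jlat, jlon, t.strftime("%Y-%m-%dT%H:%M:%S"), hr))
        t += timedelta(seconds=secs_per_point)
    return "\n".join(out)
```

Wrapping these points in the usual `<gpx><trk><trkseg>` boilerplate gives a file RunKeeper will accept for upload.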

A Few Sample Runs

For fun, I made a few KML template files for runs that would be difficult for people to actually do. The first one at the top of this post was a figure eight around the cooling towers at Three Mile Island. Next, since the Giants were in the World Series, I decided it would be fitting to do some laps around AT&T Park.


With all the stories in the news, I thought it would be fitting to squeeze in a few laps around the White House.


And last, I (in theory) made a break for it at Kirtland AFB and went out to see the old Atlas-I Trestle (a giant wooden platform they used to roll B-52s out on and EMP dose them).


Mission Aborted

I decided to abort uploading all of my proxy data for two reasons. First, even with GPS and heart rate values, VHM still only assigns 10 points for each run. I was hoping to get "activity minute" credits, which would be on the order of 60 points per run, but alas, they must have some additional check on whether the data came from your phone or another source. This problem really emphasizes why I dislike VHM: they only acknowledge data from a small number of devices that are meant to log certain types of exercise. If you want to do something else, you're out of luck. Second, VHM's user agreement says something about not submitting fraudulent data. While I wouldn't consider uploading proxy data to be any less ethical than what people with pedometers do at the end of the day to get to the next level, I don't want them coming after me because I was trying to compensate for their poorly-built, privacy-invading system.

If someone wanted to pick this up, I'd recommend looking at the data files RunKeeper generates in its apps. I've compared "valid" routes to my "invalid" ones, and I don't see any red flags that explain why the invalid ones would be rejected upstream. I suspect RunKeeper passes on some info about where the data originated, which VHM uses to reject tracks that didn't come from the app.

Code

I've put my code up on GitHub for anyone that wants to generate their own fake data. It isn't too fancy: it just parses the KML file to get a list of coordinates, and then generates a track with lon/lat/time values to simulate a run based on the template. It uses the Haversine formula to compute the distance between points so it can figure out how much time it takes to go between them at a certain speed.

github:rungen


International Airports

2014-10-14 Tue tracks gis code planes

The other day while riding the rental car shuttle to the Albuquerque International Sunport (ie, the ABQ airport), I started thinking about how some airports are labeled as international while others are not. My first thought was that it didn't take much for an airport to become international: all you'd need is a few flights to Mexico. However, Wikipedia points out that international airports have to have customs and immigration, and generally have longer runways to support larger planes. In any case, it still seems like the international label is on a lot more airports than seems warranted. This got me wondering: how different are the workloads of different airports, anyways?

Since I have a big pile of airline data, I decided to see if I could better characterize airports by analyzing where the outgoing flights were heading. I wrote a Python script that takes a day's worth of flight data, extracts all direct flights that left a particular airport, and plots their altitude/distance traveled on the same graph. The X axis here is the cumulative distance the plane flew, calculated by summing up the distance between the coordinates in its track. The Y axis is altitude in miles. I picked several flights and verified that my distances roughly match the expected optimal path between cities (see AirMilesCalculator.com).

Below are some plots for SFO, ATL, ABQ, and CHS, all of which are international airports. A few interesting things pop out of the charts. First, SFO has a broad mix of travel, including local (LAX is about 340 miles), domestic (many stops between here and the east coast), and international (the large gaps in distances are for oceans). ATL is similar, but has a lot more variety in the under-1,000-mile range (due to the number of airports on the east coast). ATL also has plenty of international flights, but they're shorter since Atlanta is closer to Europe. Interestingly, the longest flights for both SFO and ATL in this sample were both to Dubai. In contrast, the international Sunport (ABQ) and Charleston (CHS) didn't seem to have much range. In ABQ's case, this can partially be attributed to the fact that it's towards the middle of the country (and close to Mexico).


This started out as a fun project that I quickly hacked together to get a first order result. I had the data, so all it took was building something to calculate the distances and plot them. The first plots I did showed a lot of promise, but I also noticed a lot of problems that need fixing. These fixes ate up a good bit of my time.

Distance Problems

The first problem I had was that my distances were completely wrong. I had several instances of flights going 30,000 miles, which is bigger than the circumference of the planet. My initial thought was that I wasn't properly handling flights that crossed the international dateline. I went through several variations of the code that looked for dateline crossings (ie, lon goes from near +/-180 to near -/+180) and fixed their distances. This helped, but the long flights were still 2x longer than they should have been.
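The crossing detection amounts to unwrapping the longitude sequence: a minimal sketch of the idea (not my actual Go code), assuming input longitudes are in [-180, 180].

```python
def dateline_fix(lons):
    """Unwrap a longitude sequence so a flight crossing the +/-180
    dateline doesn't look like it jumped ~360 degrees. The returned
    values form a continuous path and may leave [-180, 180]."""
    out = [lons[0]]
    for lon in lons[1:]:
        delta = lon - out[-1]
        # a jump bigger than 180 degrees means we wrapped around
        if delta > 180:
            delta -= 360
        elif delta < -180:
            delta += 360
        out.append(out[-1] + delta)
    return out
```

With the track unwrapped, segment distances can be computed normally instead of spanning the whole globe.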

I realized later the problem was my distance calculator. I'd hacked something together that just found the Euclidean distance in degrees and then converted degrees to miles. That would be OK if the world were flat and lon/lat formed a rectilinear grid, but the world is round and lon/lat grid cells become smaller as you near the poles. I felt pretty stupid when I realized my mistake. A quick look on Stack Overflow pointed me to the Haversine formula and gave me code I could plug in. The numbers started working out after that.
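For reference, the Haversine formula is short enough to show in full; this is the standard version (Earth radius in miles is approximate, so results are good to a fraction of a percent):

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius, approximately

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))
```

Summing this over consecutive track points gives the cumulative-distance X axis used in the plots above.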

Multihop Problems

Another problem I hit was that my original data described what a plane did during the day, as opposed to the actual flights it took. At first, I tried to compensate in my plotter by making an inline splicer that used altitudes to chop the plane's data into multiple hops. This partially worked, but unfortunately the data wasn't always sampled at the right time, so some flights don't always go to zero. I raised the cutoffs, but then had to add some logic to figure out whether a plane was landing or taking off. A lot of the time this worked, but there were still a bunch of individual cases that slipped by and made the plots noisy.

The solution was to go back to the original data and just use the source field that sits in the data. I reworked the data and generated all single hops based on that field. A nice benefit of this approach was that it made it easier to search for the airport of origin. Previously I'd been using the lon/lat boundaries of different airports to detect when a plane was leaving a certain airport. Now I can just look for the tag. This solution does have some problems, as there are many flights that don't label their src/dst. I've also found a few where the pilot must have forgotten to change the src/dst for the flight (ie, one SFO flight to Chicago went on to Europe).
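The rework boils down to grouping consecutive observations by their source tag. A rough sketch of that step (the tuple layout and field names here are made up for illustration):

```python
def split_hops(points):
    """points: a day's observations for one plane, in time order, as
    (lon, lat, alt, src) tuples where src is the labeled origin airport.
    Returns a list of (src, track) pairs, one per hop: a new hop starts
    whenever the source tag changes."""
    hops = []
    current, current_src = [], None
    for lon, lat, alt, src in points:
        if src != current_src and current:
            hops.append((current_src, current))
            current = []
        current_src = src
        current.append((lon, lat, alt))
    if current:
        hops.append((current_src, current))
    return hops
```

Filtering for a particular airport is then just matching on the hop's src value, rather than testing lon/lat bounding boxes.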

Velocity Problems

The last problem (and one I don't have a good fix for) is velocity. Looking at the individual distances and times for track segments, I sometimes see situations where the calculated velocity is faster than the speed of sound (I'm assuming Concordes aren't flying). Often this seems to happen out in the middle of the ocean, when the data becomes more sparse. It could be bad values in the data, or maybe two observers reporting the same thing differently. I'll need to take a closer look at it later. For now I just put in some kludge filters that throw out flights that are out of range.
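The kludge filter is about as simple as it sounds: compute a speed for each segment and reject the whole flight if anything comes out supersonic. A sketch, assuming per-segment distances and durations have already been computed:

```python
SPEED_OF_SOUND_MPH = 767.0  # rough sea-level figure, used as a sanity cutoff

def plausible_track(segments, max_mph=SPEED_OF_SOUND_MPH):
    """segments: list of (miles, seconds) pairs for consecutive track
    points. Returns False if any segment implies an airliner flying
    faster than the speed of sound (ie, bad or duplicated data)."""
    for miles, seconds in segments:
        if seconds <= 0:
            return False  # duplicate or out-of-order timestamps
        mph = miles / (seconds / 3600.0)
        if mph > max_mph:
            return False
    return True
```

A smarter fix would probably merge or discard the individual bad samples rather than tossing the whole flight, but this keeps the plots clean for now.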

Code

I decided to go ahead and make some of the scripts I've been writing available on github. This one is called Canonball Plotter, which can be found in: github:airline-plotters


Blogging with Org-Mode

2014-10-04 Sat web

A few weeks ago Matthew Pugh pointed me to Testing Grounds, a blog he started writing to help motivate him to learn more Python for doing statistical problems. The math is beyond me, but seeing the blog got me thinking about my old websites and how I haven't done anything with them for a long time. I used to write a personal blog and put my publications on CraigUlmer.com, but I stopped updating both around the same time we had kids. A few years back I started writing things on Google+. As much as people like to hate on G+, it's been a good outlet for me. It's easy to add new things, you can write short or long posts, and it's a great way to hear from all my friends that left the lab and went to Google.

Still, there are plenty of annoyances with G+. The main complaint I have is that it's effectively a "0-day content" site, where anything older than a day or so doesn't matter to anyone (well, except Google's data mining machines). It's easy for good posts to get buried in your stream if you've got a crazy uncle that likes to post about guns, cats, and Obama. When I first started posting, I often found myself wondering if my posts had actually been dropped. It turned out they hadn't; other friends were just posting more stuff, which pushed mine down the list. At some level, you either get caught up in the popularity contest, or you write things for yourself and stop caring whether others read them. I'm the latter. The crazy people are also a problem on G+. Comments aren't as bad as YouTube, but there are a lot of people on G+ that have nothing better to do than shout you down for mildly mentioning a view on a touchy subject. The last big concern is what Google will do with the content in the future. I don't think G+ will be going away anytime soon, but then again I didn't think Google Reader would go away either. I'd just as soon have a copy of what I write somewhere else.

Getting the Blog Back Together

So, I started thinking about ways I could pull some of my previous writing out of G+ and host it myself. Getting the content was easy- I just walked through my page and started cutting-and-pasting posts that were technical in nature. I use Emacs's org-mode a great deal at home and work, so I just stuck the content in a flat file and used org's markup to group and tag posts. The images got placed in a local directory, but those were easy enough to link back into the org file. The more stuff I put into the org doc, the more I felt like it was the native content format I wanted to use to host a blog. The next step was looking for something that could render it into a webpage.

People have written tons of tools that convert org-mode data into other things. Org-mode has a built-in exporter to HTML or LaTeX. While this would be the easiest to get running, I decided against it because static-site generators drive me crazy. I wrote a static generator for my first blog, and it was a pain to maintain. Plus, I don't like all the annotation you have to add to the org file to make it render right. Next, I saw lots of people doing wiki-like blogs with org-mode. Most of these used org's doc linking to stitch everything together, but I wanted everything plugged into a single org file. Plus, the styles never looked like what I wanted.

Next I started looking at client-side rendering. I figured I'd push org chunks out from the server and then do something in the client to render them. I'd been meaning to check out Dart, so I picked up the dev kit and started writing simple examples. Dart is pretty cool, especially since it looks like it'd let me skip having to learn JavaScript. However, I ran into a few roadblocks. First, I realized the parsing I'd need to do was way beyond what I could figure out without a lot more work. Second, I realized it would be dumb to send the entire org file to the client, and that what I really needed was something on the server side to chop up entries and serve them. That meant parsing on both the client and the server, which sounded like a lot of work. During this time, I discovered org-js, a JavaScript interpreter for org-mode that could do the rendering. I got it to work, but I had trouble shoehorning it into the web page I liked, because the JavaScript is complex enough that I wouldn't want to mess with it.

Back to the 90's

Since I needed some server-side code to chop things up anyway, I decided to revert to my 1990s view of the web and just write some CGI. I looked into using Go, as there's an easy-to-use CGI lib available. However, my provider doesn't have Go support, and the iterative development process of web work makes compiled languages a pain. So, sadly, after all my hopes of doing something new, I wound up just using Perl again.

OK, to be fair, parsing and web page generation is the kind of thing Perl does very well. It was satisfying to know that I could just throw in a couple of lines of regex and get useful features. The bulk of the work was coming up with a simple tokenizer that could drive a simple state machine and open/close sections properly. Fortunately, org's main keywords are usually easy to pick off, as they usually start at the beginning of a line. Bulletized lists were a pain because:

  • They require surrounding stuff
    • You have to know when you're still in a list
      • You still have to parse for other markup
  • They can jump around a lot
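The tokenize-then-state-machine idea can be sketched like this. The actual CGI is Perl, so this Python version is just the shape of it, with made-up token names and deliberately simplified patterns:

```python
import re

# Classify org-mode lines by matching anchored patterns in priority
# order; "text" is the catch-all.
ORG_TOKENS = [
    ("heading", re.compile(r"^(\*+)\s+(.*)$")),
    ("bullet",  re.compile(r"^(\s*)[-+]\s+(.*)$")),
    ("text",    re.compile(r"^(.*)$")),
]

def tokenize(lines):
    """Yield (kind, groups) for each line so a downstream state machine
    can open/close HTML sections (headings, lists) properly."""
    for line in lines:
        for kind, pattern in ORG_TOKENS:
            m = pattern.match(line)
            if m:
                yield kind, m.groups()
                break
```

The state machine then watches the bullet's leading whitespace to decide when to open a nested list, continue the current one, or close back out.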

I took advantage of Google's Code Prettify to let me put code stubs in the stream. It's a bit of JavaScript that sets the colors of code, which meant I didn't have to do anything to display code (besides handle the lt/gt kinds of symbols that HTML has problems with). I need to tweak the colors, but it's good enough.

#!/bin/bash
echo "Hello, suckers!"
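The escaping step is tiny. In Python terms (the CGI itself is Perl, so this is just the idea, and the function name is hypothetical):

```python
import html

def prettify_block(src):
    """Escape the characters HTML chokes on (&, <, >) and wrap the
    result in a <pre> block that Code Prettify colorizes client-side."""
    return '<pre class="prettyprint">%s</pre>' % html.escape(src)
```

Escaping must happen before wrapping, so the `<pre>` tags themselves survive while the code's own angle brackets render literally.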

The one thing left I probably should do is put in tables. Org-mode has an awesome table entry system that makes it easy to create tables and add new things to them. I think that should be a simple fix, but I'll get to it when I need it.

I probably spent just as much time writing the parser as I did trying to figure out how to make HTML do what I want. I know it's not popular, but I decided to make this page a side scroller. Vertical scrollers are nice, but I get sick of dragging down to get to the next thing. Everyone has widescreen monitors these days; I don't know why more people don't do stuff horizontally. At the last minute, I decided not to use absolute positioning for the top heading bar. I like the idea of having navigation fields in a constant location, but I don't like the idea of losing screen space. If you have a better compromise, let me know. For the record, yeah, I know this looks pretty 90s-ish. The 90s were still good times, though.

It's Up

If you're reading this, then it looks like I've gotten the whole thing up and running. My intention is to post longer writeups on technical things to craigulmer.com and crosspost to G+, so people can send me comments there. I probably won't post personal blog stuff here, as I intend this to be the place where people can see what kind of work I do (plus, the world has gotten a lot creepier since I first started running a personal blog, so I'm a lot more worried about what I say these days). Here's to hoping this isn't the last post made on this blog.


The Crimea Conflict and Airline Tracks

2014-08-24 Sun tracks gis planes

How does military conflict affect commercial airline flight paths? Below are the daily flight heat maps around Ukraine from late March to early April. Each frame is an aggregate of all flights for a particular day. The darker the color, the more flights went there that day. On 3/31, you can see a lot of flights circling a city in Crimea (maybe Simferopol?). The next day the airlines started steering their flights around Crimea (except some flights coming from Russia).


I know others have plotted this before, but it's been fun writing something to walk through the data on my own. Python/Matplotlib was taking forever, so I wrote something in Go that allowed me to clean up the original data (eg, split dateline-crossing flights) and filter it down to regions of interest. I still plot with Matplotlib, but now that the data is simplified it takes a lot less time to do.


Parsing Flight Data

2014-08-14 Thu tracks gis planes code viz

This week at home I've been parsing through some flight data I acquired, looking for a good way to generate heat maps of where planes fly. I thought of a few elaborate ways to do this, but in the end, I just ran a chunk of 10k flights through a simple xy plotter in matplotlib using a low alpha (transparency) value. It's interesting to see the flight corridors emerge as you throw more data at it. The plot below shows about half of the data for a single day.
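The whole trick is overplotting lots of translucent lines. A minimal matplotlib sketch of the approach (the alpha, line width, and figure size here are guesses for illustration, not the values I used):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_heatmap(flights, out="heatmap.png"):
    """flights: list of tracks, each a list of (lon, lat) points.
    Drawing every track with a very low alpha makes the busy flight
    corridors accumulate into dark bands."""
    fig, ax = plt.subplots(figsize=(12, 6))
    for track in flights:
        lons = [p[0] for p in track]
        lats = [p[1] for p in track]
        ax.plot(lons, lats, color="black", alpha=0.02, linewidth=0.5)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    fig.savefig(out, dpi=150)
    plt.close(fig)
```

With 10k flights, each individual line is nearly invisible, so only the heavily traveled routes show up.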


Code

I've now posted the plotting code in github, under my airline-plotters project: github:airline-plotters