Craig Ulmer

Data-Management Services for ECP

2017-01-31 net systems

For the last few years, the majority of my time at work has gone into developing data-management services for high-performance computing (HPC). While we still have a ways to go before an official software release, we're starting to get performance numbers and have initial support for the Cray XC40's DataWarp NVMe array. I was asked to make a poster about our work and present it at an Exascale Computing Project (ECP) meeting that took place in Knoxville, Tennessee.


ECP is a new, multi-laboratory effort in the US that is scaling scientific computing to new levels through advances in hardware, applications, and system software. The goal is to develop an exaflops computing platform in the US that can serve the HPC needs of multiple domains (eg, science, energy, manufacturing, and national security). The data-management software we've been writing for asynchronous, many task (AMT) programming models falls into the system software category, and could be adapted to fit other needs in ECP (eg, workflows, checkpointing, and analysis).

Poster

Poster presented at ECP Meeting


Switching to Raspberry Pi

2017-01-07 edison pi

Back in October I mentioned in an ADS-B logging post that I had mixed feelings about using the Intel Edison for my embedded projects. This December I decided to make the jump and switch over to the Raspberry Pi. I don't have anything to say about the Pi that hasn't already been said- this post is just to provide me with some closure on the Edison.


Edison's Downfall

The Intel Edison had a lot going for it when it first came out. It featured a 32b x86 CPU, memory, flash, and wireless networking all on a tiny package for about $60. My first devkit had a socket for plugging in Arduino shields, which made it easy to use a lot of existing hardware and software. I bought a second Edison with a minimal breakout board so I could do low-power ADS-B monitoring. It looked like Intel was making smart moves to make a play in the emerging IoT market.

And then... nothing. While the Intel message boards had a good amount of chatter on them, the only follow-up hardware relating to Edison was some SparkFun gear and an expensive sensor kit/book from Intel. In retrospect, there were a number of missteps on the hardware front. The Edison's micro i/o bus made it difficult for users to get at GPIO on their own. x86 compatibility was a wash because you usually had to recompile code to 32b in order to get it to run on the Edison. USB ports were limited. Worst of all, Intel never followed up with better hardware. Add Edison to the long list of Intel efforts where Intel was going to throw its weight around and take over a market but didn't (Hadoop, high-end GPUs, enterprise storage, smartphones... RealSense and Omni-Path aren't looking so great these days either).

Edison vs Pi 3 Hardware

Back to the positive, though- I got a Raspberry Pi 3 kit for Christmas. The stats for the Pi are well known, but here's a summary of the differences between an Edison on a Breakout Board (BB) and the Pi 3:

 Feature          Edison(BB)  Pi3
 --------------   ----------  --------------
 ISA              32b x86     64b ARM
 Cores            2           4
 Clock            0.5GHz      1.2GHz
 Memory           1GB         1GB
 Internal Flash   4GB         0
 Micro SD card    no          Yes
 Networking       WiFi, BT    Eth, WiFi, BT
 USB              1           4
 GPIO             40pin       40pin
 IO Connector     70pin micro 40pin
 Video            no          HDMI
 Size             61x29mm     85x56mm
 Idle Power       0.7W        1.7W
 Cost             $60         $40

Performance and Power

I did a few simple benchmarks to do a rough comparison between the boards. First I found and built a Monte Carlo program for computing Pi and let it run for 100M iterations. Using only a single core, the Edison and Pi3 took 52s and 19s respectively, making the Pi 2.7x faster. I started working on some multicore tests via OpenMP, but my Edison installation was missing libgomp and it didn't seem worth fixing. I also downloaded and ran ramspeed on the boards. The Pi's caches seem to be 2.2x faster, while memory was 20% to 50% faster.
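For readers who haven't seen the technique, here is a minimal sketch of a Monte Carlo estimate of Pi. It is not the program used for the timings above (that one isn't identified in this post); the function name and the seed parameter are made up for illustration.

```python
import random

def estimate_pi(iterations, seed=None):
    """Estimate pi by throwing random points at the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x = rng.random()
        y = rng.random()
        # Count the points that land inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # (quarter-circle area) / (square area) = pi / 4
    return 4.0 * inside / iterations
```

Calling estimate_pi(1_000_000) should land within a few thousandths of 3.1416. The loop is pure arithmetic on one core, which is why this kind of program works as a quick single-core CPU comparison.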


I hooked up the Pi3 to a few wall-socket power meters and took some preliminary readings. The meters said the Pi3 used about 1.7W when running headless with wireless (no usb, no hdmi). Launching 1-4 instances of the Pi program on the board raised the power to 2.2W, 2.7W, 3.2W, and 3.7W (thus 0.5W per active core). For comparison, the breakout board version of the Edison idles at 0.7W, and runs 1-2 instances of the Pi program at 0.9W and 1.1W (0.2W per active core). While the Pi3 uses 2.45x the power (per core) of the Edison, it's 2.7x faster and has a lot of I/O hardware ready for users. The Edison breakout board only has one micro-usb port on it, which limits what you can do with it for projects. Upgrading to the Edison Arduino board to get more usb ports and more friendly pins eats more power. My empty Edison Arduino board used 2.4W while idle.
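The per-core figures can be rechecked from the readings above with a few lines of arithmetic. This is only a sketch: the constant and function names are made up, and the energy estimate assumes the wall-meter reading holds steady for the whole single-core benchmark run.

```python
# Wall-meter readings quoted above, in watts.
PI3_IDLE_W = 1.7
PI3_LOADED_W = [2.2, 2.7, 3.2, 3.7]   # 1 to 4 busy cores
EDISON_IDLE_W = 0.7
EDISON_LOADED_W = [0.9, 1.1]          # 1 to 2 busy cores

def watts_per_core(idle_w, loaded_w):
    """Average extra power drawn per busy core."""
    return (loaded_w[-1] - idle_w) / len(loaded_w)

def energy_joules(power_w, seconds):
    """Energy for a run at constant power."""
    return power_w * seconds

pi3_core = watts_per_core(PI3_IDLE_W, PI3_LOADED_W)
edison_core = watts_per_core(EDISON_IDLE_W, EDISON_LOADED_W)
print(f"per-core power ratio: {pi3_core / edison_core:.2f}")

# Single-core benchmark: 19s on the Pi3 at 2.2W, 52s on the Edison at 0.9W.
print(f"Pi3 energy:    {energy_joules(2.2, 19):.1f} J")
print(f"Edison energy: {energy_joules(0.9, 52):.1f} J")
```

With these rounded readings the per-core ratio comes out at about 2.5, in line with the 2.45x quoted above. The energy for one benchmark run is roughly 42 J on the Pi3 versus 47 J on the Edison, so the faster board is not obviously worse on energy per job.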

Other Perks

Another feature I really like about the Pi is that it boots off a removable microsd card. This feature means I can create multiple microsd os images and boot the hardware differently by plugging in a different disk (similar to the Amigas of my youth). It's also useful to be able to work on the os image on a desktop (installing packages on an embedded box always takes forever). Most importantly though, you don't have to worry about bricking your device if an install goes bad. Upgrading the on-device OS image for the Edison was nerve-racking because it was difficult to repair the board if an install went wrong.

I didn't expect to use it, but the Pi's HDMI port has also been a source of fun. I hooked the HDMI up to the TV so I could do the initial network configuration. The display was faster and better looking than I expected, and I soon found myself taking a side trip into emulators via RetroPie. My family took an interest in the Pi when they realized we could plug multiple PS3 controllers into it and play Gauntlet. I wound up buying a Buffalo Classic USB Gamepad controller so we could get a more authentic-feeling SNES controller. At some point I'd like to revisit the emulation side of things and get the Amiga emulator configured right. I may also have to buy one of those X-Arcade Tankstick + Trackball controllers so I can show the kids the wonders of Centipede.


Possibilities

Now that I've started working with the Pi I'm kicking myself for not moving towards it earlier. The boards are cheap enough I can use them for one-off projects around the house, and there's plenty of info out there on how to do things with the hardware. That may mean I'll never do anything original with the Pi, but it's certainly more fun to get a project working than it is to spend a lot of time figuring out a workaround for a proprietary board's under-documented hardware.

Update: Intel Quits IoT in June

Sure enough, Intel announced that they were killing off their IoT business and shutting down Edison, Joule, and Galileo on June 20th. A sad consequence of this decision was that on July 5th they announced layoffs for 140 people (100 here in Santa Clara, 40 in Ireland). It's not as if the hardware wasn't selling, either: Intel's IoT area generated $721 million in revenue for Q1, which would be astounding for anyone but Intel.


Edison ADS-B Logger Version 2

2016-10-09 edison planes

Last year I wrote about how I built an ADS-B data logger to track planes using an RTL-SDR and an Intel Edison board. It was a fun project, but I eventually took it offline because I only had one Edison and running the logger meant I couldn't use the board to do other hardware experiments. I bought a second, smaller Edison board to fix the problem, but got sidetracked with other projects and didn't get a chance to finish building the new logger until recently. This post goes through some of the hardware details of building the second version of my Edison-based ADS-B logger.


Intel Edison Breakout Board

In the previous version of the logger I used the Edison Arduino board. The board is large but has a number of useful built-in features (eg, removable micro-sd storage, usb ports, Arduino pins, and a power jack). The other main Edison board Intel sells is a Breakout Board, which is very small and has just enough I/O for basic apps. I think I paid about $60 for it at Fry's. Both boards use the same Intel Edison module, which provides dual-core 32b x86, 1GB RAM, 4GB flash, 802.11, and BT. Similar to the Edison Arduino, the breakout board has two micro USB ports: one for Serial-to-USB and another that can be either a master or a slave (depending on power). The only other I/O on the breakout board is raw pads that you can solder wires to for interfacing with the Edison's micro connector.


Power Problems

The first problem I ran into was power. The board has four ways you can power it:

  • OTG USB Port: The easiest way to power the board is through the OTG USB port. The downside here is that doing so makes the USB controller boot up as a slave device. If you want the board to be the master, you have to use a different power option.
  • DC: The J21 pins (bottom right in the above picture) let you plug in a DC source (7-15V). There's a voltage regulator on the pins to make the voltage right for the board.
  • Battery: The J2 pins (top left) let you plug in a 3.7V Li-ion battery. There's circuitry that will charge the battery if you're powering from another source. The battery pins aren't an option if you want the Edison to be the bus master (ie, USB runs at 5V).
  • Optional Barrel Connector: The board has pads on it for a standard barrel-connector power plug (which wall warts often use). Unfortunately, the breakout board does not come with the actual barrel connector, so it's up to you to buy it and solder it to the board (not hard, just a hassle).

I initially hoped that I might be able to use a powered USB hub between the Edison and the RTL-SDR to power and connect both. Unfortunately, (1) hubs don't provide power to a USB master (duh) and (2) the breakout board makes the OTG USB port a slave unless you power from an external source. Since the battery port is only 3.7V (ie, not the 5V USB needs) and I didn't have a barrel connector, the only option for me was to rig something up to the J21 DC pins. I cut up the wires to an old wall wart, assembled an on/off switch for it, and then stuffed everything into an old film capsule.

Storage Problems

The breakout board's limited I/O was also annoying because what I really wanted to do was plug in an external device for storing the data to prevent my Edison's internal storage from getting worn out. The breakout board lacked the micro-sd slot the Edison Arduino had, and with only one USB port I'd need to use a hub to plug in both the RTL-SDR and a thumb drive. I did some tests and found that when connected to DC, the breakout board could provide enough power to run a hub, the RTL-SDR, and a USB stick. However, it all felt kind of junky. I decided to go with compactness and just write the data to the Edison's internal storage. Given that Intel seems to have abandoned Edison, I'm not too concerned about the flash on these boards lasting forever anymore.

RTL-SDR Tweaks

On the radio side of the project I decided to add a bandpass filter to see if it improved my range. FlightAware sells a $20 bandpass filter that attenuates everything outside of ADS-B's frequency range (1090MHz). Annoyingly, the filter has SMA connectors so I also had to buy SMA-to-MCX and SMA-to-Coax adapters. To make matters worse, I got the orientation of the filter backwards when I originally ordered the adapters, so I had to order a second set later with the genders reversed (a pair of adapters cost $10). I verified that the filter was attenuating the strength of other frequencies by booting up the gqrx app on my desktop and looking at radio stations. I don't have a way to get the real frequency response of the filter right now, but others have reported that it is wide enough to capture the family of frequencies plane watchers usually want.


I used two RTL-SDR dongles to do some visual comparisons between the filtered and unfiltered ADS-B results. I pulled up the web interfaces for dump1090 on both SDRs and then compared the number of planes each observed. In my core visibility range, both systems seemed to capture the same plane info. The planes I tracked at the edge of my visibility tended to be seen more by the filtered line. For example, the filtered line followed one plane for an additional 20 miles (and recorded more data points than the unfiltered line when it could see it). This performance wasn't always perfect though- sometimes the unfiltered line would see an incoming plane before the filtered line. I'll need to gather some data and analyze it, but my initial impression is that the filter does work, but doesn't make a huge amount of difference in my case.

Power Use and Heat

I hooked up a power meter to get a rough estimate of how much power the Edison and RTL-SDR use when dump1090 is running. An idle Edison with 802.11 enabled and no peripherals consumes less than a watt. When I hook up the RTL-SDR and enable dump1090, the power is about 3.5W. I noticed that the RTL-SDR dongle got a little warm after being on all day, so I used my cheap-o infrared thermometer to get some estimates. The thermometer said the dongle was 112 degrees F in the hottest spot (center, where the air holes are). The Edison also hit 110 degrees F (right side of the silver Edison can). When probing the thermal sensors on the device (ie, via /sys/class/thermal/thermal_zoneX/temp), I see 14, 55, and 54 degrees C (or 57, 131, 129 degrees F). That seems hot to me, but then again the Edison is passively cooled. I've been running it like this for months without problems, so it seems ok.
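The sensor probing above can be scripted. Below is a minimal sketch, assuming the kernel reports each zone's temp file in millidegrees Celsius (the usual Linux sysfs convention, not verified against the Edison image). The function names are made up for illustration.

```python
import glob
import os

def millidegrees_to_c(raw):
    """Convert a raw sysfs thermal reading (assumed millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0

def c_to_f(celsius):
    """Convert degrees C to degrees F."""
    return celsius * 9.0 / 5.0 + 32.0

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees C} for every readable thermal zone."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                temps[zone] = millidegrees_to_c(f.read())
        except (OSError, ValueError):
            # Some zones refuse reads or return nothing useful; skip them.
            continue
    return temps

for zone, temp_c in read_thermal_zones().items():
    print(f"{zone}: {temp_c:.1f} C / {c_to_f(temp_c):.1f} F")
```

On a machine without thermal zones the loop prints nothing. The conversion helper agrees with the figures above: 55 degrees C is 131 degrees F.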

Mixed Feelings

The Intel Edison is still a fine embedded board for building a data logger, but I've got to say that the breakout board's connectors disappointed me. It's stupid that they put in pads for a barrel jack but didn't populate it. Having only a single USB port also limits what I can do with the board. I'd hoped to make this a multi-purpose controller for the garage (eg, ADS-B monitor, webcam, temperature, etc), but to do that I'd need to add in a USB hub. The Raspberry Pi 3 is out now, and has built-in wireless and four USB ports. I'll probably change to that in the next version of things.


Blimp Tracking

2016-09-18 planes

Close, but no blimp data. Last Thursday at my son's soccer practice, one of the Goodyear Blimps circled the field as it descended for a landing at the Livermore Airport. It was a little surreal, since it looked like the blimp was monitoring the practice the same way it circles big bowl games. However, blimp sightings aren't that uncommon out here. Livermore is on the fringe of the Bay Area and we have a large municipal airport with wide open spaces around it. It seems like the perfect place to launch, land, and park a blimp if you knew you were going to be visiting the area by dirigible airship.


Sunday morning I started wondering where the blimp was going while it was in the area. Since I've been running a dump1090 data logger on my Edison board for the last few weeks, I began pulling the data and parsing it for signs of dirigibles. As I was puzzling through how I might identify a blimp in the pile of points, I heard a faint buzzing sound coming from outside and realized that the blimp was at that very moment passing by my house.

Getting the ID From Dump1090

I went over to the webpage that dump1090 generates to see what aircraft were in the local area. I was disappointed to see that the map was pretty much empty nearby, which meant that the blimp wasn't transmitting position data. I looked at the list of planes and noticed that there were some planes in the area that were reporting their presence, but not identifying their position. I sorted them by altitude and found one that was cruising along at only 2,000ft with an ICAO hex ID of A4A7EF. Some searching around and I found this ID belongs to tail N4A, which is a 33-year-old (!) blimp owned by Goodyear.
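The manual search above (find aircraft that report an altitude but no position, then sort by altitude) can be sketched in code. The record layout is an assumption: the keys "hex", "altitude", "lat", and "lon" are modeled on the JSON that some dump1090 builds serve, and the function name and altitude cutoff are made up for illustration. Only the first sample record comes from this post; the other two are invented.

```python
def low_fliers_without_position(aircraft, max_altitude_ft=3000):
    """Return aircraft that report altitude but no lat/lon, lowest first."""
    hits = []
    for plane in aircraft:
        altitude = plane.get("altitude")
        has_position = plane.get("lat") is not None and plane.get("lon") is not None
        # Skip planes we can already see on the map, or that report no altitude.
        if has_position or not isinstance(altitude, (int, float)):
            continue
        if altitude <= max_altitude_ft:
            hits.append(plane)
    return sorted(hits, key=lambda p: p["altitude"])

sample = [
    {"hex": "A4A7EF", "altitude": 2000},                               # from the text above
    {"hex": "ABC123", "altitude": 36000},                              # invented record
    {"hex": "DEF456", "altitude": 1500, "lat": 37.7, "lon": -121.8},   # invented record
]
print([p["hex"] for p in low_fliers_without_position(sample)])
```

The hex IDs that come back still need a lookup against a registry to turn them into a tail number, as was done by hand above.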

Looking it up in FR24

While I was disappointed that my own logger didn't get any position info for the blimp, I knew that other aviation sites have tracks for aircraft based on other data sources. I looked it up on FlightRadar24 and found that they had logged a few flights for the blimp on Saturday:



Well, that solves the mystery: they were out here to watch Stanford play against USC (for the people back home, that's U of Southern California, not South Carolina). They circled that stadium for more than 5 hours, trying to make sense out of the whole situation. Then they went and blew off some steam in San Francisco. I'd like to think that the highlight of their trip though was watching my son's soccer team practice.


Job Postings on Craigslist

2016-06-02 data text code

Craigslist is an interesting source of text data. In addition to providing a continuous stream of user postings from around the country in organized categories, the website stubbornly favors a plain-old-web format that's easy to retrieve and parse. I believe craigslist gets a lot of traffic from different kinds of scrapers. In addition to all the search engine crawlers, you hear stories about how individuals run scripts to continuously watch their local boards so they can be the first to snatch up free items. Craigslist blocks people that aggressively crawl the site, but otherwise lets you wander around if you put in some rational delays.

Back in September I wrote some utilities to go off and scrape job postings from craigslist, because I thought it would be interesting to see what kind of people Bay Area companies wanted. After working out how to grab the data in an unobtrusive way, I updated my script to grab tech job postings from different cities around the country. I run the script about once a week, which over the last 9 months has given me about 32k postings, totaling 470MB in text data. This post just focuses on the scraping. I'll get to the analysis later.

Scraping

Craigslist puts each post as a separate web page, and uses a city/topic/post directory structure to keep things organized. While the post part of the url is unique and non-sequential, they provide an easy-to-parse index page for each topic that will give you all the urls for the posts in reverse chronological order. All one has to do is pick a city and a topic, walk through the index, and retrieve the individual posts. I put some delay in after every page I fetched to be polite. I also randomized the city list on each run to even out the data if the grabs were taking too long and needed to be cut off (though always getting ATL would have been fine for me). To help with statistics, I had the script store basic information about runs in a local sqlite database. The database helps avoid downloading the same post twice, and gives me a place to store the dates of when I first and last saw a particular post.
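The bookkeeping described above can be sketched with Python's built-in sqlite3 module. This is not the code from the repository: the table layout, the function names, and the caller-supplied fetch function are all assumptions made for illustration.

```python
import random
import sqlite3
import time

def open_db(path=":memory:"):
    """Create the table that tracks when each post url was seen."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " url TEXT PRIMARY KEY,"
        " first_seen TEXT NOT NULL,"
        " last_seen TEXT NOT NULL)"
    )
    return db

def record_sighting(db, url, today):
    """Note that url was listed today; return True if the post is new."""
    cur = db.execute("UPDATE posts SET last_seen = ? WHERE url = ?", (today, url))
    is_new = cur.rowcount == 0
    if is_new:
        db.execute("INSERT INTO posts VALUES (?, ?, ?)", (url, today, today))
    db.commit()
    return is_new

def shuffled(cities, seed=None):
    """Return the cities in random order so a cut-off run isn't biased."""
    order = list(cities)
    random.Random(seed).shuffle(order)
    return order

def grab_new_posts(db, post_urls, fetch, today, delay_s=5.0):
    """Fetch only posts not seen before, pausing after each download."""
    saved = {}
    for url in post_urls:
        if record_sighting(db, url, today):
            saved[url] = fetch(url)   # fetch is supplied by the caller
            time.sleep(delay_s)       # be polite between requests
    return saved
```

A run would call shuffled() on the city list, pull the post urls out of each index page, and hand them to grab_new_posts(). Posts already in the table only get their last_seen date refreshed, which both avoids downloading them again and keeps the dates needed for later statistics.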

Grabs Per Day

Below is a breakdown of how many posts I grabbed for each city when I ran the scraper. Since the script only grabs posts that it hasn't seen before, the per day grabs go up and down based on how frequently I ran the script (eg, when I missed a week or two, there was more data available to grab). For this time period, the cities seem to be fairly proportional. The big job cities seem to be San Francisco, Seattle, New York, and Boston (not unexpected). C'mon Atlanta. It's like you're not even trying.


Number of Active Days per Post

Another interesting statistic for me was how long job postings remain active on craigslist. I used the "first seen" and "last seen" dates stored in my metadata to estimate the amount of time a post stays alive. The numbers are off due to the initial posts I pulled (ie, I looked at the grab date instead of the post date) and the most recent posts (ie, which have not expired yet). As the below (logscale!) plot shows, most posts stick around for about a month. However, there are a few that last as long as 80 days.
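The active-days estimate boils down to a date subtraction per post. A minimal sketch follows, assuming the "first seen" and "last seen" dates are stored as ISO date strings (YYYY-MM-DD); the storage format and function names are assumptions for illustration.

```python
from collections import Counter
from datetime import date

def active_days(first_seen, last_seen):
    """Days between the first and last sighting (ISO date strings)."""
    return (date.fromisoformat(last_seen) - date.fromisoformat(first_seen)).days

def histogram(sightings):
    """Count posts per active-day value from (first_seen, last_seen) pairs."""
    return Counter(active_days(first, last) for first, last in sightings)
```

Because the scraper ran only about once a week, each value is no more precise than the gap between runs. A post first and last seen on the same run counts as zero days.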


Code

It isn't much but I put the code for this on github:

github:craigslist_scraper