Craig Ulmer

Brute-Force Hardware Fuzzing

2012-09-11 fpga clusters

I recently received an unexpected email notifying me that a project team I'd worked with last year had been nominated for an NNSA Defense Program Award of Excellence. The team worked on verifying that a high-value hardware design would operate correctly in a noisy environment. I joined the team to help scale their VHDL simulation work up to a cluster so we could test a much larger portion of the design space and gain greater confidence in the design. After a month of simulations on the cluster, we discovered two subtle bugs that would have been disastrous under certain conditions. The designers fixed the bugs, and we wound up winning a DP award.


Hardware Simulation with Open Source Tools

The project team was responsible for verifying that a complex hardware design would work properly, even in harsh operating conditions. One of the main concerns was that information from one device had to be transmitted to another over a communication link that was susceptible to errors. The designers had engineered the communication hardware to employ standard data protection mechanisms to ensure the receiver would detect and correct data errors. The verification team wanted to know how well those mechanisms worked when errors were injected, and whether the implementation itself had any bugs. My friend Brian had developed a complex testbench for the VHDL design that endlessly transmitted data over the channel and applied random bit flips to simulate a noisy channel. The more he worked on the testbench, though, the larger the search space grew. The team realized they needed a lot more computing power than their simulation workstation could give them.
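To give a flavor of the approach: the sketch below is not the team's VHDL testbench, just the same question restated in a few lines of Go. It flips randomly chosen bits in a protected frame and counts how often a CRC-32 fails to notice; the frame size, flip counts, and trial count are all made up for illustration.

    package main

    import (
        "fmt"
        "hash/crc32"
        "math/rand"
    )

    func main() {
        rng := rand.New(rand.NewSource(42)) // fixed seed keeps the run repeatable
        const trials = 100000
        frame := make([]byte, 256) // 2048-bit frame, a made-up size

        for flips := 1; flips <= 4; flips++ {
            missed := 0
            for t := 0; t < trials; t++ {
                rng.Read(frame)
                want := crc32.ChecksumIEEE(frame)
                // corrupt the frame at 'flips' distinct bit positions
                for _, bit := range rng.Perm(len(frame) * 8)[:flips] {
                    frame[bit/8] ^= 1 << uint(bit%8)
                }
                if crc32.ChecksumIEEE(frame) == want {
                    missed++ // the corruption slipped past the checksum
                }
            }
            fmt.Printf("%d-bit flips: %d of %d trials undetected\n",
                flips, missed, trials)
        }
    }

The VHDL version asked a much harder question than this, of course, because the injected errors interacted with the decoder's internal state rather than a static buffer.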

Running the simulation on multiple workstations was complicated by the fact that they were using a commercial VHDL simulator to run their testbenches. The simulator was fast, but non-academic versions of EDA tools are notoriously expensive (sometimes $100k/year for just a handful of licenses). I'd done a good bit of hardware development on other projects and knew the strengths and weaknesses of some of the open source hardware simulation tools (e.g., GHDL for VHDL and Icarus for Verilog). I took the team's design and worked through the pain of getting it up and running in the GHDL simulator on Linux (interestingly, GHDL compiles your VHDL into a standalone program: ghdl -a analyzes the sources, ghdl -e elaborates them into an executable, and you run the result like any other binary). After getting the design to run, I did some benchmarking and found that the commercial tools were approximately 6x faster than the open source ones. However, using the open source version meant we could run the simulations on as many cluster nodes as we wanted. Suddenly, brute-force fuzzing a hardware design seemed much more plausible.

Scaling to a Cluster

I happened to have a test cluster of 12 compute nodes that I could dedicate to this project. The nodes were dual-socket motherboards with six-core Xeon chips and 64GB of memory, giving me 144 physical cores for running simulations in parallel. The nodes ran a oneSIS-managed Linux image off an NFS mount, but had local disks where I could store the simulation results. While other clusters typically allocate nodes to users for a maximum of a few days, I was the owner of this cluster and could dedicate the hardware for as many weeks as we needed. Long-term allocation is important for this work, as GHDL has no way to checkpoint and restart a simulation.

I wrote some scripts to help me launch a batch of 144 simulations on the cluster. Each simulation was given a unique seed value for its random number generator, making the runs independent but repeatable. The testbench used assertions to terminate a simulation when a bad condition occurred, and I logged the assertion messages and other debugging info to disk so we could see why a simulation had died. The log files also gave us a way to spy on what a simulation was doing while it was still running. We realized that if a simulation did terminate, we'd need to see everything that had happened just before it died. The easiest way to capture that information was to have each simulation log all of its signals to a waveform file as it progressed. These files were large (tens of gigabytes), but tools like GTKWave can parse and display them. I launched the simulations on the cluster and then waited. The first day went by without any problems, then a week. Then two weeks. Then three. Then, one morning I checked on the simulations and noticed that a few had mysteriously died.
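A stripped-down sketch of that launch flow in Go might look like the following. It's an approximation, not the original scripts: the testbench binary name ./tb_channel is hypothetical, everything here forks on one machine rather than spreading across twelve nodes, and passing the seed as -gSEED=N assumes a GHDL new enough to override a top-level generic at run time. The --wave flag is what dumps every signal to a GHW file that GTKWave can open.

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "sync"
    )

    func main() {
        const nsims = 144 // one simulation per physical core
        var wg sync.WaitGroup
        for i := 0; i < nsims; i++ {
            wg.Add(1)
            go func(seed int) {
                defer wg.Done()
                logf, err := os.Create(fmt.Sprintf("sim_%03d.log", seed))
                if err != nil {
                    fmt.Fprintln(os.Stderr, err)
                    return
                }
                defer logf.Close()
                // GHDL builds the testbench into a normal executable; --wave
                // records all signals so a dead run can be examined afterwards.
                cmd := exec.Command("./tb_channel",
                    fmt.Sprintf("-gSEED=%d", seed), // unique, repeatable RNG seed
                    fmt.Sprintf("--wave=sim_%03d.ghw", seed))
                cmd.Stdout = logf // assertion reports and debug prints land here
                cmd.Stderr = logf
                if err := cmd.Run(); err != nil {
                    fmt.Printf("sim %03d terminated: %v\n", seed, err)
                }
            }(i)
        }
        wg.Wait()
    }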

Discovering Bugs

I pulled up the waveforms and log messages leading up to the terminations. I mostly expected the simulations to have failed due to an error in the simulation package (e.g., out of memory), but the assertions all seemed to be legitimate. Brian and I walked through the traces leading up to the assertions and found two flaws in the design. The first was that the checksum guarding the data was too weak for the number of bits it was protecting (easily fixed). The second was a difficult-to-find bug hiding in an unusual corner case: it only surfaced when a bit flip happened at a specific time during the decoding of an incoming message, and only for certain data values. The timing problem was subtle enough that I don't think I would ever have spotted it just by inspecting the source code. However, shaking the box long enough caused the error condition to eventually fall out. After fixing the bugs, we reran the simulations for many weeks but didn't find any other flaws.
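As a miniature illustration of that first flaw (using a deliberately weak stand-in, not the design's actual checksum): a longitudinal XOR parity word catches any single bit flip, but two flips landing in the same bit column of different words cancel out and sail straight through.

    package main

    import "fmt"

    // xorParity is a deliberately weak stand-in checksum: the XOR of all words.
    func xorParity(words []uint16) uint16 {
        var p uint16
        for _, w := range words {
            p ^= w
        }
        return p
    }

    func main() {
        data := []uint16{0x1234, 0xABCD, 0x0F0F, 0x5555}
        check := xorParity(data)

        data[0] ^= 0x0010 // one flip: the parity changes, error caught
        fmt.Println("one flip caught: ", xorParity(data) != check) // true

        data[2] ^= 0x0010 // second flip, same bit column: the two cancel
        fmt.Println("two flips caught:", xorParity(data) != check) // false
    }

A stronger code sized to the number of bits being guarded closes exactly this kind of gap.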

The Power of Brute Force

Scaling the simulations up to a cluster was interesting to me because it demonstrated how a platform running many slower-but-free simulations can give better results than a single faster-but-expensive one. Even at a 6x slowdown, 144 unlicensed cores gave us roughly 24x the throughput of one commercial seat. It did take time on my part to adapt the simulation to the open source tools and to the cluster. However, it's a lot cheaper to throw more hardware at a problem like this than it is to optimize the simulation to run faster. In the end, I launched the job and let it spin on its own for weeks while I worked on other things.

I have hopes of eventually going back to this kind of work and doing some time-series analysis on the waveforms. I think it would be interesting if tools could learn to identify unusual behavior in the waveforms and then flag it for the designer. I wrote some initial programs in Go to compute basic statistics on waveforms, but this side project stalled out due to other commitments. At some point I'd like to come back to this idea, but I'll have to shelve it for now and just be happy that we found some hard-to-spot bugs through brute-force testing.
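As a sketch of where that analysis could start (a toy written for this post, not one of the original programs): the snippet below reads a VCD dump, which GHDL can emit with --vcd, and counts value changes per signal. Signals that toggle wildly more or less than their peers make a crude first cut at "unusual behavior."

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open(os.Args[1]) // e.g. vcdstats sim_007.vcd
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer f.Close()

        names := map[string]string{} // VCD id code -> signal name
        changes := map[string]int{}  // VCD id code -> value-change count

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := strings.TrimSpace(sc.Text())
            switch {
            case strings.HasPrefix(line, "$var"):
                // e.g. "$var wire 8 ! data $end": field 3 is the id, 4 the name
                if fs := strings.Fields(line); len(fs) >= 5 {
                    names[fs[3]] = fs[4]
                }
            case line == "" || line[0] == '#' || line[0] == '$':
                // timestamps and directives carry no value changes to count
            case line[0] == 'b' || line[0] == 'r':
                // vector change, e.g. "b10110100 !"
                if fs := strings.Fields(line); len(fs) == 2 {
                    changes[fs[1]]++
                }
            default:
                // scalar change, e.g. "0!" or "1!": the id follows the value
                changes[line[1:]]++
            }
        }
        for id, n := range changes {
            fmt.Printf("%-24s %d changes\n", names[id], n)
        }
    }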