One of the lessons we learned from building an FPGA-based network intrusion detection system (NIDS) a few years back is that it is hard to develop a detector that can catch attacks that weren't known at build time. For example, while SNORT provides a convenient way to classify malicious packets based on a set of rules, attackers can sometimes escape detection by jiggling bits in their packets so they no longer match any of the known signatures. Additionally, attackers continuously invent new ways to abuse a system, and a rule-based system can only catch a new attack after defenders observe it and add the proper signatures to the rule set. What NIDSs need in many cases is a classifier that is a little more fuzzy: something that can take into account data that has been rearranged or obfuscated, or that otherwise doesn't look "normal".
Last year I had the opportunity to work with a group of researchers at LLNL who were interested in building a new NIDS for detecting and filtering malicious activity in web requests. Tina Eliassi-Rad and Brian Gallagher came up with a classifier that used TFIDF and cosine similarity to estimate whether an HTTP request looked like an attack. My job was to adapt the algorithm to a form that could run in an FPGA and filter network requests in real time. This turned out to be a great deal of work, as I had to refactor the algorithm and find a way to compress multiple massive dictionaries into on-chip FPGA memory. In the end I built a real implementation that ran at Gigabit speeds and could detect and filter attacks that were not part of the original training data.
Training Data and the Original Approach
In 2007 the ECML/PKDD conference held a data challenge contest to see who could build the best classifier for correctly detecting different attacks on web servers. In support of this contest, the organizers provided a training dataset containing 120K labeled HTTP traces. Each trace was identified as either malicious or non-malicious, with malicious requests further identified as belonging to one of seven different attack categories (e.g., XSS and SQL/LDAP/XPATH injection).
My LLNL collaborators used document similarity techniques to evaluate the risk of a new incoming HTTP request. In this approach, each HTTP request was split into a collection of terms. The training data sets were then analyzed using term frequency-inverse document frequency (TFIDF) to numerically estimate how relevant each term was to a particular attack. Finally, cosine similarity was used to match a request's vector of terms to known attacks. The approach worked well, achieving an accuracy of 94% when run against the (previously unseen) testing data. While the bulk of the accuracy could be achieved with only a handful of terms (e.g., "drop+table"), they found that using larger dictionaries of terms did help improve accuracy.
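To make the approach concrete, here is a minimal sketch of the TFIDF-plus-cosine-similarity idea in Python. The toy "attack profile" documents and the incoming request below are hypothetical examples, not data from the actual classifier:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF weight vector for each document (a list of terms)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))             # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical per-category term lists and an incoming request's term weights.
attacks = [["drop+table", "users"], ["script", "alert"]]
profiles = tfidf_vectors(attacks)
request = {"drop+table": 1.0, "select": 1.0}
scores = [cosine_similarity(request, p) for p in profiles]
```

The request scores highest against the category whose weighted terms it shares, which is the core of the matching step.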
Refactoring for FPGAs
I faced two algorithmic hurdles in porting the classifier to FPGA hardware. First, I realized it would be difficult to implement the scoring algorithms in floating point, because FPU cores in FPGAs have high latencies (e.g., 12 pipeline stages of delay). Thus, we needed to find a different way to score values using integers, and then deal with the resulting precision discrepancies. Second, we needed to find a way to compress our multi-megabyte dictionaries into a collection of smaller tables that could fit in our FPGA's on-chip SRAM (on the order of a few kilobytes). While others in the project looked at lossless compression through tables, I favored a lossy approach that allowed us to pack as many terms into the chip as possible.
My solution to this problem was to quantize our TFIDF statistics to a small number of discrete integer values, and then use Bloom filters to associate many terms with each value. The implementation uses an array of Q Bloom filters for each attack category. When a new term arrives, the system consults all the Bloom filters to determine whether the term is known and which quantized statistic to assign it for each attack category. The system then plugs the statistics for any hits in the Bloom filter array into a scoring equation for the HTTP request. When all terms for a request have been processed, the category with the highest score is the winner.
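A software sketch of this structure might look like the following. The Q level values and dictionary contents are hypothetical, and SHA-256 stands in for the hardware's hash functions purely for illustration (the real design used Pearson hashes):

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, term):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, term):
        for p in self._positions(term):
            self.bits[p] = 1

    def __contains__(self, term):
        return all(self.bits[p] for p in self._positions(term))

# Hypothetical quantized integer weights: one Bloom filter per level.
Q_LEVELS = [1, 4, 16, 64]

def build_category(term_levels):
    """term_levels: {term: level_index}; returns the Q filters for one category."""
    filters = [BloomFilter() for _ in Q_LEVELS]
    for term, level in term_levels.items():
        filters[level].add(term)
    return filters

def score_request(terms, filters):
    """Sum the quantized weight of every term that hits a filter."""
    score = 0
    for term in terms:
        for level, bf in enumerate(filters):
            if term in bf:
                score += Q_LEVELS[level]
                break               # take the first level that reports a hit
    return score
```

In hardware, all Q filters for all categories can be probed in parallel for each arriving term, which is what makes this layout attractive for an FPGA.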
My collaborators were skeptical of my approach at first because of all the places the numbers get fuzzy. Quantizing the statistics into just a few sets of numbers loses a lot of precision. Bloom filters are probabilistic (no false negatives, but definitely false positives), so many terms will trigger erroneous hits in the array. Finally, I adjusted the equations a bit so we could reduce some of the computational hardware we needed (e.g., fewer multiplies and divides, and no square roots). These manipulations were all valid, but made people nervous because they departed from the textbook definitions of the equations. I did a great deal of engineering and testing to make sure that my approach was statistically sound, and that it made a good quality classifier for our NIDS.
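As one example of the kind of manipulation involved: when you only need to know *which* category's cosine similarity is highest, the square roots and divisions can be eliminated by comparing squared similarities via cross-multiplication. This is a standard transformation, sketched below; it is not necessarily the exact form used in the hardware:

```python
def cosine_argmax_no_sqrt(dots, norms_sq):
    """Pick the category with the highest cosine similarity using only
    integer multiplies.  Since cos_i = dot_i / (|r| * |c_i|) and the request
    norm |r| is common to every category, comparing
        dot_i^2 * |c_j|^2  vs  dot_j^2 * |c_i|^2
    ranks categories identically (assuming non-negative dot products,
    which holds for non-negative TFIDF weights).

    dots[i]     -- integer dot product of the request with category i
    norms_sq[i] -- integer squared norm of category i's weight vector
    """
    best = 0
    for i in range(1, len(dots)):
        # cross-multiply to avoid both the division and the square root
        if dots[i] ** 2 * norms_sq[best] > dots[best] ** 2 * norms_sq[i]:
            best = i
    return best
```

The squared category norms are constants known at build time, so they can live in small lookup tables alongside the filters.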
There were a number of challenges in building the actual system, most of which involved writing software tools to reshape the statistics in different ways. I did a lot of up-front work with the training datasets, inserting the data into a SQL database so I could easily run experiments on different subsets of the data. I wrote tools to generate the TFIDF data, quantize the data (an iterative algorithm that merged the least influential points until only a nonlinear list of Q values remained), generate hashing tables for the Bloom filters, generate Bloom filters for each dictionary, and evaluate how well the filter array would perform against new data. I also had to build a tool to generate all the hardware lookup tables the design would need. The whole process took a great deal of time, and involved making a build system that could take you from training dataset all the way down to hardware(!).
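The iterative quantization step can be sketched as a greedy merge. The version below merges the closest pair of adjacent levels, which is a simplification; the original tool weighed each point's influence on the classifier rather than just its distance to its neighbor:

```python
def quantize(values, q):
    """Greedily merge the closest pair of adjacent levels until only q
    distinct levels remain, then map each value to its nearest level.
    The surviving levels form a nonlinear list of Q representative values.
    """
    levels = sorted(set(values))
    while len(levels) > q:
        # find the adjacent pair whose merge perturbs the data least
        i = min(range(len(levels) - 1),
                key=lambda j: levels[j + 1] - levels[j])
        merged = (levels[i] + levels[i + 1]) / 2
        levels[i:i + 2] = [merged]
    return [min(levels, key=lambda lv: abs(lv - v)) for v in values]
```

Because the levels are chosen from the data rather than spaced linearly, clusters of similar TFIDF weights collapse to one integer code while outliers keep their own.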
The hardware I built for this had its own set of challenges. The data flow through the system wasn't all that complicated because I designed it for a lot of re-use, but it took a while to get all the generics right so that VHDL would construct as much of the array hardware as possible. The hashing units in the Bloom filters took a good bit of work, as I needed a lot of good quality hash functions to keep the Blooms accurate. Pearson hashes were wonderful here: I could define a simple lookup table for each hash and be assured that it would give me decent quality results.
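A Pearson hash is just a byte-at-a-time walk through a 256-entry permutation table, which is exactly the shape of thing a small FPGA lookup table implements well. A software sketch (the seeds and the input term are illustrative):

```python
import random

def make_pearson_hash(seed):
    """Build one Pearson hash function from a seeded 256-entry permutation
    table -- cheap to realize as a single lookup table in hardware."""
    table = list(range(256))
    random.Random(seed).shuffle(table)

    def h(data: bytes) -> int:
        value = 0
        for byte in data:
            value = table[value ^ byte]   # one XOR and one table lookup per byte
        return value

    return h

# A family of independent 8-bit hashes for the Bloom filters,
# each defined by nothing more than its own permutation table.
hashes = [make_pearson_hash(seed) for seed in range(4)]
indices = [h(b"drop+table") for h in hashes]
```

Wider indices can be built by concatenating several independent 8-bit Pearson hashes, so one term can address a Bloom filter larger than 256 bits.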
Results and Validation
This project was one of the more rewarding efforts I've worked on, because I felt that we came up with a good solution to a tough problem and proved it worked much better than people originally expected. It's always a thrill to build software that cranks through data to generate custom hardware, so I spent a good number of my nights refining my tools for this project. In the end we wrote a paper for the Reconfigurable Architectures Workshop (RAW) and later a journal paper (JPDC).
A short time after we did this work, someone started knocking over web servers with a hand-crafted attack that was tuned to evade many commercial IDS systems. Out of curiosity, I acquired some pcaps of the requests and plugged them into our system. We may have just gotten lucky, but the classifier reported the requests as suspicious, even though it hadn't been trained against them.
JPDC Paper Craig Ulmer, Maya Gokhale, Brian Gallagher, Philip Top, and Tina Eliassi-Rad, "Massively Parallel Acceleration of a Document Similarity Classifier to Detect Web Attacks", Journal of Parallel and Distributed Computing, February 2011.
RAW Paper Craig Ulmer and Maya Gokhale, "A Configurable-Hardware Document Similarity Classifier to Detect Web Attacks", Reconfigurable Architectures Workshop, April 2010.
RAW Slides Presentation given at the Reconfigurable Architecture Workshop.
Seminar Slides Presentation given on an early version of the work.