Abstract: The Feature Characterization Library (FCLib) is a software library that simplifies the process of interrogating, analyzing, and understanding complex data sets generated by finite element applications. This document provides an overview of the library, a description of both the design philosophy and implementation of the library, and examples of how the library can be utilized to extract understanding from raw datasets.
This is the 1.7.0 code release of the Feature Characterization Library (FCLib). Please see the Official FCLib Page for more information.
This is the deliverables package we handed off to LLNL for our summer 2008 work. The status report describes our microbenchmark work with the flash memory device and our experiences with the XtremeData XD1000 FPGA accelerator. The delivery bundle provides our source code and results.
This is a poster presentation I did for HPEC on our threaded microbenchmarks for out-of-core storage applications. The poster and intro presentation have better/newer performance numbers than the paper.
This is an extended abstract I wrote for HPEC on our threaded microbenchmarks for out-of-core storage applications. After observing that flash memory performance increased when more threads were utilized, we constructed a small set of threaded microbenchmark programs to investigate acceleration opportunities. These microbenchmarks include block transfer, k-nearest neighbors (kNN), external sort, and binary search. We observed that the ioDrive provided 3x-300x performance gains over a hard drive RAID with three SATA drives. More recent numbers are reported in the poster presentation.
This is a quick one-page handout I did for some visitors that presents the core data results from the Fusion-io microbenchmark work. This data was later used in the HPEC poster.
This is a paper I helped the SISC team at LLNL put together for IEEE Computer. My contribution was explaining the low-level mechanics of flash memory, as well as initial benchmarks of the Fusion-io device. Abstract: Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Experiments with three benchmarks suggest that emerging hardware technologies can significantly boost performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.
This is the 1.6.1 code release of the Feature Characterization Library (FCLib). Please see the Official FCLib Page for more information.
I presented this talk on flash memory at a DOE Office of Science conference. The talk walks through the basics of flash memory and gives some early performance measurements of the Fusion-io card. The Fusion-io card was officially announced on the same day (thus, this information is public), but we still received approval from Fusion-io to talk about the card.
Xilinx's flagship FPGAs feature multi-gigabit transceivers (MGTs) called RocketIO modules that enable user circuits to communicate with external hardware via high-speed serial links. These MGTs perform several physical layer operations (e.g., SERDES, 8B/10B encoding, CRC) and can be used in a variety of applications that require chip-to-chip communication. While flexible, MGTs are somewhat difficult for new users to get a handle on because of the complexity associated with the hardware. In this talk I'll explain how these units work and walk through a few examples that demonstrate how you can take advantage of these units.
Abstract: Reconfigurable computing leveraging field programmable gate arrays (FPGAs) is one of many accelerator technologies that are being investigated for application to high performance computing (HPC). Like most accelerators, FPGAs are very efficient at both dense matrix multiplication and FFT computations, but two important aspects of how to deliver that performance to applications have received too little attention. First, the standard API for important compute kernels hides parallelism from the system. Second, the issue of system architecture is virtually never addressed. This paper explores both issues and their implications for applications. We find that high bandwidth, low latency connectivity can be important, but the right API can be even more important.
Abstract: Field programmable gate arrays (FPGAs) have been used as alternative computational devices for over a decade; however, they have not been used for traditional scientific computing due to their perceived lack of floating-point performance. In recent years, there has been a surge of interest in alternatives to traditional microprocessors for high performance computing. Sandia National Labs began two projects to determine whether FPGAs would be a suitable alternative to microprocessors for high performance scientific computing and, if so, how they should be integrated into the system. We present results that indicate that FPGAs could have a significant impact on future systems. FPGAs have the potential to have order of magnitude levels of performance wins on several key algorithms; however, there are serious questions as to whether the system integration challenge can be met. Furthermore, there remain challenges in FPGA programming and system level reliability when using FPGA devices.
Abstract: The recent emergence of high-quality floating-point libraries for FPGAs has sparked a renewed interest in accelerating scientific applications through Reconfigurable Computing (RC) techniques. Unfortunately, the sheer size of these floating-point units makes it difficult to house a large number of units in a single FPGA. In order to support the adaptation of non-trivial algorithms to hardware, it is therefore necessary to consider methods by which a set of floating-point units can be reused to perform different operations in an algorithm. In this paper we discuss a "recycling architecture" that reuses a fixed number of floating-point units to implement an algorithm. We customize the hardware data path for this architecture at compile time based on a static computational schedule that is generated for an algorithm. As a means of illustrating tradeoffs, we step through the adaptation process with an example application that computes ray-triangle intersection points. By reusing hardware, we are able to halve resource requirements while maintaining acceptable performance. As a means of motivating future work, we also discuss our experiences constructing tools that translate an algorithm's equations into a synthesizable netlist.
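To give a feel for what a static schedule over a fixed number of reused units looks like, here is a minimal Python sketch of greedy list scheduling. It is not the paper's tool or its algorithm: the function names, the `DOT3` example (a three-element dot product), and the simplification that any unit can execute any operation in a single cycle are all assumptions made for illustration.

```python
def list_schedule(ops, num_units):
    """Assign each operation a cycle, using at most num_units units per cycle.

    ops maps an operation name to the names of the operations it depends on.
    Assumption (hypothetical): every unit runs any operation in one cycle.
    """
    if num_units < 1:
        raise ValueError("need at least one unit")
    cycle_of = {}
    cycle = 0
    while len(cycle_of) < len(ops):
        # An operation is ready once all of its inputs were produced
        # in earlier cycles (cycle_of only holds earlier cycles here).
        ready = sorted(
            name for name, deps in ops.items()
            if name not in cycle_of and all(d in cycle_of for d in deps)
        )
        if not ready:
            raise ValueError("dependency cycle or unknown operand")
        # Greedily fill the available units for this cycle.
        for name in ready[:num_units]:
            cycle_of[name] = cycle
        cycle += 1
    return cycle_of


def makespan(cycle_of):
    """Number of cycles the schedule occupies."""
    return max(cycle_of.values()) + 1


# Hypothetical example: a 3-element dot product (3 multiplies, 2 adds).
DOT3 = {
    "m0": [], "m1": [], "m2": [],
    "a0": ["m0", "m1"],
    "a1": ["a0", "m2"],
}

if __name__ == "__main__":
    for units in (1, 2, 3):
        print(units, "unit(s):", makespan(list_schedule(DOT3, units)), "cycles")
```

In this toy example one unit needs five cycles, while two units finish in the same three cycles as three units, which is the kind of tradeoff a reuse-oriented schedule is looking for. Real floating-point units are typed (adders, multipliers) and pipelined over several cycles, which this sketch ignores.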
These are the slides I presented at ERSA for our floating-point reuse work.
This is an extended version of the ARC paper that the International Journal of Electronics published (vol. 93, no. 6, June 2006, pp. 403-420). In this version we also added a monitoring unit that used a PowerPC to check NIDS activity and summarize results to a serial port for external viewing.
This was a Dean seminar talk about the Ray-Triangle Intersection Unit that my intern Adrian Javelo and I did. The talk goes through how we ported a simple floating-point algorithm to hardware. This work looks at two scenarios. First, we examine scheduling tradeoffs when a limited number of FP units are available and we must reuse the units to complete the algorithm. We do one and two unrolls of the loop to improve utilization. Second, we consider the case where infinite hardware is available and simply instantiate an FP unit for each calculation in the algorithm. We also discuss tools that we built to automate the process of building this hardware.
This is a paper I sketched out for FPGA '06. In the end, I lacked performance numbers and thus it didn't make its way out to see the light of day. However, I'm including it here because it gives a snapshot of where we were in September.
Presentation slides for the CUG paper.
Abstract: Reconfigurable Computing (RC) refers to the use of reconfigurable hardware devices to accelerate the computational performance of a system for particular applications. Cray's new XD1 computer presents an appealing substrate for RC research because it places Field-Programmable Gate Arrays (FPGAs) in close proximity to host processor memory. In this paper we present our early experiences with the XD1 in the context of RC. In order to gain more insight into the inner mechanics of the architecture, we have constructed four simple FPGA-based applications: a data transfer engine, a linear sorting array, a data hashing function, and a distance calculation kernel that involves double-precision floating-point operations.
Presentation slides for the NIDS work we did at ARC '05.
Abstract: Network intrusion detection systems (NIDS) are critical network security tools that help protect distributed computer installations from malicious users. Traditional host-based NIDS architectures are becoming strained as network data rates increase and attacks intensify in volume and complexity. In recent years researchers have proposed using FPGAs to perform the computationally-intensive components of a NIDS. In this work we present the next logical step in NIDS architecture: the integration of network interface hardware and packet analysis hardware into a single FPGA chip. This integration allows for better customization of the NIDS as well as a more flexible substrate for network security operations. As a means of demonstration, we have implemented a complete and functional NIDS in a Xilinx Virtex II/Pro FPGA that performs in-line packet filtering on multiple Gigabit Ethernet links based on the SNORT rule set.
An update on some of our RC progress, and information about the Cray XD1 system that we are currently evaluating. Presented at the Dean 8900 R&D Seminar.
I was asked to give a talk at the SOS8 workshop in Charleston, SC, on the use of FPGAs in upcoming HPC systems. This talk provides an introduction to RC, a description of our near-term strategy for the technology, and a list of challenges that the research field faces. A brief description of Sandia's LDRD work is presented, including Keith Underwood's (SNL/NM) recent floating-point performance results and our (SNL/CA) network interface progress.
This talk provides an overview of reconfigurable computing. It includes a brief history of the field, the architecture of FPGAs, and three examples of how hardware can be designed for an FPGA to accelerate an application. Presented at the Dean 8900 R&D Seminar by Mitch Sukalski.
This talk provides an overview of reconfigurable computing. It includes a brief history of the field, the architecture of FPGAs, and three examples of how hardware can be designed for an FPGA to accelerate an application. Presented at the Dean 8900 R&D Seminar by Mitch Sukalski. Note: The introduction has some slick history slides.
My doctoral defense at Georgia Tech. Describes resource-rich cluster computers for multimedia applications, and the GRIM message layer that facilitates communication in these clusters. A portion of this work includes the adaptation of a Xilinx Virtex 1000 FPGA card to function as a networked, computational resource.
Presentation slides given at PDPTA.
Abstract: A key task for providing high performance in cluster computers is efficiently transferring data between cluster resources. This study focuses on one component of the communication pipeline: the host to peripheral card interface. As Moore's Law continues to progress, we are seeing successive generations of clusters with increasing compute power and communications bandwidth, but with roughly the same I/O systems. Communication software is continuously being re-optimized for each succeeding generation of hardware. In this paper we describe a tunable library for host-to-device communication. The library profiles performance characteristics of the host's hardware environment and utilizes this information to automatically configure host-to-device transfer mechanisms. In addition to taking advantage of CPU-specific features, the library exposes I/O characteristics of individual peripheral devices in data transfer optimizations. The benefit of the library is demonstrated by providing measurements and experiences with three generations of clusters.
This is a poster I helped put together for an internal CERCS review by Intel. Funny how you do eight out of nine slides and get listed as the sixth of seven authors.
My advisor asked me to give a lecture on message layers for cluster computers for his CS8803 High Performance Communication graduate class. I did a core dump on the things I thought were important, which probably wasn't all that useful to most of his students. Sudha later told me that I should have put in some tedious numerical calculations, since all students are looking for is a formula they can plug some numbers into.
Presentation slides.
Abstract: This paper explores the view that the SAN network infrastructure can be an active computational entity capable of supporting certain classes of data intensive computations effectively during communication. The performance is achieved via the use of Field Programmable Gate Arrays (FPGAs) in the network interfaces (NIs). This paper describes the programming model and the design of a prototype hardware/software implementation using commercial FPGA devices coupled with Myrinet. An active messages style of programming is used to support application-transparent, dynamic reconfiguration of the FPGA hardware to accommodate different computations over time. Performance evaluation of this implementation quantifies the overheads and sources of performance improvement.
During my second internship at JPL I rewrote my sensor network simulator. This version has a number of improvements in both the GUI and the simulator. The GUI presents nodes in 3D that you can drag around (I wrote most of that from scratch, by the way). I also beefed up the nodes in the simulator to make it easier to add in new functionality. The node has an OK communication stack built into it that allows a user to plug in different MAC protocols (well, provided they write them). Once again, users can write up applications as state machines that the nodes execute cycle-by-cycle. The link goes to a launch page that can execute the simulator. I wrote up a new clustering algorithm for the simulator that is based on politics: random nodes in the network campaign to be leaders. Once a node gains enough supporters, it becomes an official leader. However, if the cluster starts becoming too large, nodes can mutiny and go off to form their own cluster.
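The campaign/support/mutiny idea can be illustrated with a toy, centralized Python sketch. This is not the simulator's code and not the actual distributed algorithm: the function name `form_clusters`, the `min_supporters` and `max_cluster_size` thresholds, and the choice to model a mutiny as surplus supporters simply staying free to campaign later are all assumptions made for illustration.

```python
import random


def form_clusters(neighbors, min_supporters=2, max_cluster_size=4, seed=0):
    """Toy leader election. Returns a dict mapping each node to its leader.

    neighbors maps a node to the list of nodes it can hear (assumed symmetric).
    All thresholds are hypothetical, chosen only for this sketch.
    """
    rng = random.Random(seed)
    leader_of = {}

    # Visit nodes in a random order; each free node gets one chance to campaign.
    order = sorted(neighbors)
    rng.shuffle(order)
    for candidate in order:
        if candidate in leader_of:
            continue  # already belongs to a cluster
        supporters = [n for n in neighbors[candidate] if n not in leader_of]
        if len(supporters) < min_supporters:
            continue  # campaign fails; the node stays free for now
        # Cap the cluster size: surplus supporters "mutiny" and stay free,
        # so they can campaign or join another cluster later.
        joined = supporters[:max_cluster_size - 1]
        leader_of[candidate] = candidate
        for node in joined:
            leader_of[node] = candidate

    # Any node that never joined a cluster leads a cluster of its own.
    for node in neighbors:
        leader_of.setdefault(node, node)
    return leader_of


if __name__ == "__main__":
    # Hypothetical topology: a ring of 8 nodes.
    ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
    print(form_clusters(ring))
```

In the real simulator each node would run this logic as its own state machine and exchange messages over the communication stack; the sketch collapses that into a single loop to show only the decision rules.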
Abstract: Resource rich clusters are an emerging category of clusters of workstations where cluster nodes comprise of modern CPUs as well as high-performance peripheral devices such as intelligent I/O interfaces, active disks, and capture devices that directly access the network. These clusters target specific applications such as digital libraries, web servers, and multimedia kiosks. We argue that such clusters benefit from a re-examination of the design of the message layer to retain high performance communication while facilitating the interface to endpoints for a variety of devices. This paper describes a message layer design which includes optimistic flow control, the use of logical channels, a push-style cut-through injection optimization, and an API supporting cluster-wide active message handler management. The goal is to support a number of diverse cluster hardware configurations where communication endpoints exist in a variety of locations within a node. The current implementation has been tested on a Myrinet cluster with communication endpoints located in the host CPUs as well as Intel i960 based I2O server cards.
Abstract: Resource rich clusters are an emerging category of computational platform where cluster nodes have both CPUs as well as high-performance I/O cards. These clusters target specific applications such as digital libraries, web servers, and multimedia kiosks. The presence of communication endpoints at locations other than the host CPU requires a re-examination of how middleware for these clusters should be constructed. A key issue of middleware design is the management of flow control for the reliable delivery of messages. We propose using a network interface based optimistic flow control scheme to address resource rich cluster requirements. We implement this functionality with a message layer called GRIM, and compare its general performance to other well-known message layers. This implementation suggests that the necessary middleware functionality can not only be constructed efficiently, but also in a way that provides additional middleware benefits.
During my first internship at JPL, my manager asked me to look at wireless sensor networks for deployment on Mars. It was a big change from what I had been doing, but an interesting project in any case. When I went back to school, I took a wireless and mobile networking class that had a large course project. I decided to put together a basic wireless sensor network simulator so I could begin looking at distributed algorithms. For lack of a better name, I called it "sensorsim". This name was a terrible choice, as about a dozen other people around that time did the same. My sensorsim has nothing to do with anything you've probably heard of before. The attached link is a copy of the web page. The great thing is that I wrote it in Java, so you may be able to launch it from your browser. It should run a simulation where many nodes wake up at different times and try to organize themselves into neighboring clusters. Each node in the simulation really runs a state machine that implements the distributed algorithm. While the communication assumptions are as silly as anyone else's were at the time, it's still fun to watch clusters form in the network. I later revamped the whole thing in SensorSimII.
PeZ is a Pole-Zero editor for Matlab that I developed while working for Dr. Schafer and Dr. McClellan as an undergrad. Similar to other PZ editors, PeZ allows users to place poles and zeros in the complex z-plane and visualize the corresponding filter results. This version includes code tweaks to allow it to run well on different hardware platforms and different versions of Matlab. It also provides support for importing filter data from other Matlab programs (e.g., the filtdemo filter design program). PeZ was a pretty amazing piece of software for the time. While there were plenty of other PZ editors available at the time, PeZ was the first one that I know of that (1) worked in Matlab, (2) ran on Windows, Mac, and Unix using the same source, and (3) was able to do so with real-time performance. The GUI code was all written by hand without a layout manager. The software was included on a CD for the DSP First book. Numerous people helped along the way. Brad North came up with the first optimizations for Windows that made real-time drawing possible. Dr. Schafer, Dr. Yoder, and Dr. McClellan pushed me to add new features and even debugged some of my code to make it run faster. Finally, Amer Abufadel and the EE2200 students helped test out the software to make sure it worked well.
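For readers unfamiliar with what a pole-zero editor computes, the relationship between pole/zero positions and the filter's frequency response can be sketched in a few lines. The sketch below is in Python rather than Matlab and is not PeZ code; the function names and the example filter are assumptions. It evaluates the standard formula $H(e^{j\omega}) = g \prod_k (e^{j\omega} - z_k) / \prod_k (e^{j\omega} - p_k)$ on the unit circle.

```python
import cmath
import math


def frequency_response(zeros, poles, omega, gain=1.0):
    """Evaluate H(e^{j*omega}) for a filter described by its zeros and poles."""
    z = cmath.exp(1j * omega)  # point on the unit circle at frequency omega
    numerator = complex(gain)
    for zero in zeros:
        numerator *= (z - zero)
    denominator = complex(1.0)
    for pole in poles:
        denominator *= (z - pole)
    return numerator / denominator


def magnitude_response(zeros, poles, num_points=5):
    """Sample |H| at num_points frequencies from 0 to pi (inclusive)."""
    omegas = [math.pi * k / (num_points - 1) for k in range(num_points)]
    return [(w, abs(frequency_response(zeros, poles, w))) for w in omegas]


if __name__ == "__main__":
    # Hypothetical example: one zero at z = -1 and one pole at the origin,
    # i.e. the two-point sum y[n] = x[n] + x[n-1], a simple low-pass filter.
    for omega, mag in magnitude_response(zeros=[-1.0], poles=[0.0]):
        print(f"omega = {omega:.3f} rad/sample, |H| = {mag:.3f}")
```

Dragging a zero toward the unit circle at some angle pulls the magnitude toward zero at that frequency, and dragging a pole toward the circle produces a peak; an interactive editor simply re-evaluates a function like this as the points move.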
Abstract: This paper explores early analysis of the complex relationships between system architectures and the active and packaging materials from which they are implemented. The goals of this analysis are to enable the designer to specify cost effective technologies for a particular system and to uncover resources which may be exploited to increase performance of such a system, early in the design process. We describe a prototype tool called IMPACT, which will predict cost, performance, power, and reliability, and present several case studies demonstrating its use.
Abstract: This paper explores early analysis of the complex relationships between system architectures and the active and packaging materials from which they are implemented. The goals of this analysis are to enable the designer to specify cost effective technologies for a particular system and to uncover resources which may be exploited to increase performance of such a system, early in the design process. We describe a prototype tool called IMPACT, which will predict cost, performance, power, and reliability, and demonstrate its use on several problems.
Abstract: Computer system design addresses the optimization of metrics such as cost, performance, power, and reliability in the presence of physical constraints. The advent of large area, low cost Multi-Chip Modules (MCM) will lead to a new class of optimal system designs. This paper explores the early analysis of the impact of packaging technology on this design process. Our goal is to develop a suite of tools to evaluate computing system architectures under the constraints of various technologies. The design of the memory hierarchy in high speed microprocessors is used to explore the nature and type of trade-offs that can be made during the conceptual design of computing systems. |