Craig Ulmer

SmartNICs Project Final Report

2024-04-01 pub smartnics hpc

In December we finished our three-year, ASCR-funded "Offloading Data Management Services to SmartNICs" project. One of our deliverables was a final report that consolidates what we learned into a single document. This 144-page (!) report includes sections from our proposal and previous papers, and examines using SmartNICs from multiple perspectives.


There are three new topics in this report that we haven't covered before:

  • Apache Arrow vs Kokkos: Previously we've talked about Arrow as a way to write code that scales to multiple cores. In Chapter 4 we port three types of simple analytics to both Arrow and Kokkos and examine how well they scale on Host and SmartNIC processors. Arrow was tedious to write, but was competitive! The code listings are included in Appendix A.
  • Injection Optimizations: Host-to-NIC transfer performance has always been a problem due to memory addressing issues. In Chapter 8 we cover some optimizations that enable us to use the SmartNIC to gather data from the host's native buffers so that it can be serialized during injection.
  • Job-local Storage with SmartNICs: As a means of addressing performance issues with using a shared filesystem in a platform, we investigated using SmartNICs to host a private BeeOND filesystem on a job's SmartNICs. Given the limited flash memory of the SmartNIC, we borrowed the host's disk for this work via NVMeoF. Chapter 9 talks about the challenges of getting NVMeoF to work with offloading and covers some early jitter experiments.


Abstract

Modern workflows for high-performance computing (HPC) platforms rely on data management and storage services (DMSSes) to migrate data between simulations, analysis tools, and storage systems. While DMSSes help researchers assemble complex pipelines from disjoint tools, they currently consume resources that ultimately increase the workflow's overall node count. In FY21-23 the DOE ASCR project "Offloading Data Management Services to SmartNICs" explored a new architectural option for addressing this problem: hosting services in programmable network interface cards (SmartNICs). This report summarizes our work in characterizing the NVIDIA BlueField-2 SmartNIC and defining a general environment for hosting services in compute-node SmartNICs that leverages Apache Arrow for data processing and Sandia's Faodel for communication. We discuss five different aspects of SmartNIC use. Performance experiments with Sandia's Glinda cluster indicate that while SmartNIC processors are an order of magnitude slower than servers, they offer an economical and power efficient alternative for hosting services.

Publication

  • SAND Report Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Aldrin Montana, Matthew L. Curry, Scott Levy, Whit Schonbein, and John Shawger, "Offloading Data Management Services to SmartNICs: Project Summary". SAND2024-03873, April 2024.

Presentations

  • HPC Initiatives Slides: Presentation I gave at the SNL HPC Initiatives seminar in January.
  • SRU Slides: Presentation I gave to an undergraduate Computer Engineering class at Slippery Rock University in October.

The Glinda Cluster

2023-10-04 pub hpc smartnics

During the pandemic Sandia procured a new 126-node HPDA cluster named Glinda. While it was a nightmare working through all the supply chain issues with the global shutdown, the hardware has been quite good: compute nodes feature a 32-core Zen3 processor, 512GB of RAM, a BlueField-2 InfiniBand SmartNIC, and an Ampere A100 GPU. People like that they can grab a few nodes to do some deep learning experiments before heading over to the DGX boxes for full runs. We received some requests for a publication that they can reference, so we wrote the below tech report with all the details. The report covers background info for the types of platforms at the labs, details about the hardware and data center, power measurements, and practical installation and operational info for the A100 and BlueField-2.


The Glinda name for this cluster is a reference to the Wizard of Oz. Glinda's Book of Records really resonated with us, as she uses it to record all the important things that are happening throughout the Land of Oz.



Abstract

Sandia National Laboratories relies on high-performance data analytics (HPDA) platforms to solve data-intensive problems in a variety of national security mission spaces. In a 2021 survey of HPDA users at Sandia, data scientists confirmed that their workloads had largely shifted from CPUs to GPUs and indicated that there was a growing need for a broader range of GPU capabilities at Sandia. While the multi-GPU DGX systems that Sandia employs are essential for large-scale training runs, researchers noted that there was also a need for a pool of single-GPU compute nodes where users could iterate on smaller-scale problems and refine their algorithms.

In response to this need, Sandia procured a new 126-node HPDA research cluster named Glinda at the end of FY2021. A Glinda compute node features a single-socket, 32-core, AMD Zen3 processor with 512GB of DRAM and an NVIDIA A100 GPU with 40GB of HBM2 memory. Nodes connect to a 100Gb/s InfiniBand fabric through an NVIDIA BlueField-2 VPI SmartNIC. The SmartNIC includes eight Arm A72 processor cores and 16GB of DRAM that network researchers can use to offload HPDA services. The Glinda cluster is adjacent to the existing Kahuna HPDA cluster and shares its storage and administrative resources.

This report summarizes our experiences in procuring, installing, and maintaining the Glinda cluster during the first two years of its service. The intent of this document is twofold. First, we aim to help other system architects make better-informed decisions about deploying HPDA systems with GPUs and SmartNICs. This report lists challenges we had to overcome to bring the system to a working state and includes practical information about incorporating SmartNICs into the computing environment. Second, we provide detailed platform information about Glinda's architecture to help Glinda's users make better use of the hardware.

Publication

  • SAND Report Craig Ulmer, Jerry Friesen, and Joseph Kenny, "Glinda: An HPDA Cluster with Ampere A100 GPUs and BlueField-2 VPI SmartNICs". SAND2023-10451, October 2023.

Opportunistic Query Execution on SmartNICs

2023-09-26 pub hpc smartnics arrow

In our SmartNIC project we've been using Apache Arrow to represent and process in-transit data that flows between different jobs in a workflow. One of the advantages of using Arrow is that it includes a sophisticated compute engine named Acero that allows you to execute queries on tabular data. Previously we've written some basic queries in C++ to have Acero split entries in a table based on a field. Lately we've been using Acero to execute queries that a user might create at runtime (via tools like DuckDB or Ibis that can generate Substrait query plans). Jianshen and I wrote some client/server code for Faodel that allows a client to transmit a serialized Substrait plan to an endpoint, deserialize the requested objects into Arrow tables, apply the plan to the data, and send the serialized results back to the client. This conduit gives us a handy way to query a remote SmartNIC and inspect its in-transit data.


For this paper (and his dissertation), Jianshen focused on making a decision engine that could quickly estimate whether it would be faster to execute the query at the SmartNIC or simply return the raw data and defer execution to the client. He measured overheads for executing queries and transmitting data, and then used machine learning techniques to make predictions about how long a query would take and how much data it would return. He used Apache DataSketches to rapidly characterize the in-transit data the SmartNIC held. At runtime the decision engine parsed the query syntax and applied probabilities to each clause to estimate how selective a query would ultimately be.
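To make the offload vs. push-back tradeoff concrete, here is a minimal Python sketch of the kind of comparison a decision engine has to make. It is not the model from the paper: Jianshen's engine used machine learning and DataSketches summaries, while this sketch swaps in a simple linear cost model and assumes the clauses of a query are ANDed and statistically independent. The names PathCosts, and_selectivity, and choose_site, and all of the throughput numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PathCosts:
    """Measured overheads for one client/SmartNIC pair (all values hypothetical)."""
    nic_rows_per_sec: float      # query throughput on the SmartNIC cores
    client_rows_per_sec: float   # query throughput on the client host
    link_bytes_per_sec: float    # usable network bandwidth
    bytes_per_row: float         # average serialized row size


def and_selectivity(clause_probs):
    """Fraction of rows expected to pass ANDed clauses, assuming independence."""
    result = 1.0
    for p in clause_probs:
        result *= p
    return result


def choose_site(num_rows, clause_probs, costs):
    """Return (decision, est. offload seconds, est. push-back seconds)."""
    selectivity = and_selectivity(clause_probs)
    table_bytes = num_rows * costs.bytes_per_row

    # Offload: run the query on the SmartNIC, then ship only the matching rows.
    offload = (num_rows / costs.nic_rows_per_sec
               + selectivity * table_bytes / costs.link_bytes_per_sec)

    # Push-back: ship the raw table, then run the query on the client.
    pushback = (table_bytes / costs.link_bytes_per_sec
                + num_rows / costs.client_rows_per_sec)

    decision = "offload" if offload <= pushback else "push-back"
    return decision, offload, pushback


if __name__ == "__main__":
    costs = PathCosts(nic_rows_per_sec=1e6, client_rows_per_sec=1e7,
                      link_bytes_per_sec=1e8, bytes_per_row=100)
    print(choose_site(1_000_000, [0.1, 0.1], costs))  # highly selective query
    print(choose_site(1_000_000, [1.0], costs))       # query that keeps every row
```

With these made-up numbers, a highly selective query favors running on the SmartNIC because very little data has to cross the network, while a query that keeps every row is better pushed back to the faster client.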


Abstract

High-performance computing (HPC) systems researchers have proposed using current, programmable network interface cards (or SmartNICs) to offload data management services that would otherwise consume host processor cycles in a platform. While this work has successfully mapped data pipelines to a collection of SmartNICs, users require a flexible means of inspecting in-transit data to assess the live state of the system. In this paper, we explore SmartNIC-driven opportunistic query execution, i.e., enabling the SmartNIC to make a decision about whether to execute a query operation locally (i.e., "offload") or defer execution to the client (i.e., "push-back"). Characterizations of different parts of the end-to-end query path allow the decision engine to make complexity predictions that would not be feasible by the client alone.

Publication

  • HPEC Paper Jianshen Liu, Carlos Maltzahn, and Craig Ulmer, "Opportunistic Query Execution on SmartNICs for Analyzing In-Transit Data" in IEEE High Performance Extreme Computing, September 2023.

Anycubic Kobra-2 FDM Printer

2023-06-18 3d print

A few years ago I bought a 3D resin printer so the kids and I could learn a little bit more about modeling and fabricating 3D objects. While it's been a great experience, we haven't printed much in the last year because of all the headaches of dealing with resin. Every time we do a print we have to deal with temperatures, level the plate, put on all the safety gear, and then clean up everything at the end. It's a lot of overhead and dangerous enough I don't want my kids doing it when I'm not home. I've been thinking it would be nice to have a traditional FDM printer on hand to lower the barrier for printing simple things so that printing will be more accessible to my kids. After a lot of internet wandering, I decided to get the new Anycubic Kobra-2. It's new, works with Linux, and shipped from Amazon with a 1KG spool of filament for $300.


Setup

The kids and I set up the Kobra-2 on my desk in the garage. The assembly wasn't too difficult, although it took us a while to figure out how to hold the frame so we could get some of the machined screws lined up properly. It was also a little unclear how the feeder tube was supposed to go in the header (does this go any farther in?). Once it was set up we ran the auto calibration tool to probe the height of the build plate. Auto calibration was a required feature for me, and one of the reasons why I'm happy to be buying a printer after the technology has had a chance to mature. We then preheated the filament and had it print the famous 3DBenchy boat design. The kids and I watched with wonder as the extruder spun around the plate with robot brrrrr noises. FDM printing is so much more exciting to watch than resin because you really see it happen. With resin the plate moves up and down every few seconds, with an upside-down design that's coated in excess resin. While you add a whole layer at a time, it takes a long time to get through all the pads and supports before you get to your actual design.

Sample Prints

3D Benchy only took 30 minutes to print out. One of the other selling points of this printer is that it can do higher speed prints (150mm/s to 250mm/s, compared to the 60mm/s of the stock Ender printers). I was really tempted to get one of the $600 Bambu printers, which can do up to 500mm/s, but decided we should start with a basic printer and see how much we like it first. Benchy came out looking pretty good, though you can see some pixelation in the windows that I don't think you'd have in resin. That's fine though, as I think I'm more interested in building functional widgets with this printer than detailed figures.


The next thing we printed was a small mesh cup I pulled from Thingiverse. This design came as a plain STL object so I had to load it into a slicer to render to gcode. Anycubic says to use PrusaSlicer, which is a powerful slicer built for Prusa printers. It's free and has a Linux version that worked on my Chromebook's Linux container. I had to download the settings from the Anycubic support site, but they came up fine. For this design I just loaded the cup, hit slice, and saved the gcode. Prusa had a lot of detailed info about how it built the object. I liked that it recognized the interior and autofilled it with a grid to save on material. The scaled down version of the print took about an hour to build (correctly predicted by Prusa). I was impressed that the printer was able to build a thin mesh and have it come out ok (though later I broke it trying to trim some of the base).


Next up was a micro-sd card holder. I found a clever design someone had made that had a radial container with a screw-on lid. The threading is really interesting to me because it gives you a way to connect parts together (someone also modified the design so you could screw together multiple micro-sd containers, though I doubt I'll ever fill this one). The parts I printed screwed together just fine. Two of the slots weren't deep enough, but that's ok. I should have added an up label though, as the slots don't have enough friction to keep cards in place if you open it upside down.


Finally, I printed a baby guardian dragon dice holder from Thingiverse for my niece. This design has a spot for you to put a die. It's a cute design, though the FDM version resulted in a bunch of lines on the angled surfaces.


Issues

We have had a few issues with the Kobra-2 during our first week of use. My son had a few failed prints that we're trying to figure out. The printer would get partway through the base of the design, get stuck, and then go into an endless calibration loop. It's possible this is because we installed a newer version of the slicer than we were previously using. When I went back and sliced the design with my Chromebook it printed fine. Again, it's nice that the setup/cleanup for a print is so easy. The other main issue has been quality. The FDM prints look good, but they're not as detailed as the resin prints. Below are some zoom-ins that show how this results in the FDM prints coming out jagged in certain spots.


Power

One thing I've noticed about the FDM printer is that the motors really take a beating, zigzagging back and forth all the time. Our house doesn't have great wiring, so the lights in the garage (and bathroom) flicker slightly when the printer is bouncing. Also, there's a spike in power when you start up because it needs to warm up the build plate and nozzle. Maybe I'll look into getting a battery or power conditioner for the plug to smooth out the signal.

Overall

Overall, I'm pretty happy with the Kobra-2 so far. After dealing with all the resin printing pains it's been a breeze to get FDM working. I don't think we'll print a ton of things, but it's nice to have the option to design and build stuff when we want.


Extending Composable Data Services into SmartNICs

2023-05-19 pub smartnics hpc

In my SmartNICs project we've been busy building examples of how collections of SmartNICs in an HPC platform can work together to implement data services that are useful to HPC workflows. In the Fall we built a distributed, particle-sifting example that reorganizes simulation results into a form that's easier for downstream applications to consume. We used Faodel to control the distribution of data between collections of SmartNICs and Apache Arrow to reorganize the data. Thanks to Sandia's Glinda cluster, this is the first time we've had the opportunity to run SmartNIC experiments at the scale of 100 cards.
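As a rough illustration of what "sifting" means here, the Python sketch below routes particle records from several sources to a fixed set of destinations so that each particle's records end up together and in time order. This is only a toy under stated assumptions: the real service used Faodel for data distribution and Apache Arrow tables for the reorganization, and the record layout (particle_id, timestep, value), the modulo routing rule, and the function names sift and merge are all made up for this example.

```python
from collections import defaultdict


def sift(records, num_destinations):
    """Group (particle_id, timestep, value) records by destination index.

    The destination is particle_id modulo num_destinations, an assumed
    routing rule chosen only for illustration.
    """
    buckets = defaultdict(list)
    for particle_id, timestep, value in records:
        dest = particle_id % num_destinations
        buckets[dest].append((particle_id, timestep, value))
    return dict(buckets)


def merge(per_source_buckets, destination):
    """Combine what every source sent to one destination.

    Results are ordered by particle id and then by timestep.
    """
    combined = []
    for buckets in per_source_buckets:
        combined.extend(buckets.get(destination, []))
    return sorted(combined, key=lambda r: (r[0], r[1]))


if __name__ == "__main__":
    # Two sources, each holding a mix of particles from different timesteps.
    source_a = [(7, 0, 1.0), (4, 0, 2.0)]
    source_b = [(7, 1, 1.5), (4, 1, 2.5)]
    sifted = [sift(source_a, 2), sift(source_b, 2)]
    print(merge(sifted, 0))  # all records for particle 4
    print(merge(sifted, 1))  # all records for particle 7
```

A downstream application that wants one particle's full history only has to contact a single destination instead of scanning every source's output.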



Abstract

Advanced scientific-computing workflows rely on composable data services to migrate data between simulation and analysis jobs that run in parallel on high-performance computing (HPC) platforms. Unfortunately, these services consume compute-node memory and processing resources that could otherwise be used to complete the workflow's tasks. The emergence of programmable network interface cards, or SmartNICs, presents an opportunity to host data services in an isolated space within a compute node that does not impact host resources. In this paper we explore extending data services into SmartNICs and describe a software stack for services that uses Faodel and Apache Arrow. To illustrate how this stack operates, we present a case study that implements a distributed, particle-sifting service for reorganizing simulation results. Performance experiments from a 100-node cluster equipped with 100Gb/s BlueField-2 SmartNICs indicate that current SmartNICs can perform useful data management tasks, albeit at a lower throughput than hosts.

Best Paper Award

At the conference they gave us an award for best paper. I really enjoyed meeting the other CompSYS folks, and had good, friendly discussions with a number of people.


Publication

  • Compsys Paper Craig Ulmer, Jianshen Liu, Carlos Maltzahn, and Matthew L. Curry, "Extending Composable Data Services into SmartNICs" in Second Workshop on Composable Systems (CompSYS), May 2023.

Presentation