SPARC Burst Buffer Work

2018-01-01 Mon
faodel io hpc pub

We published an unclassified unlimited release (UUR) report.

Abstract

Recent high-performance computing (HPC) platforms such as the Trinity Advanced Technology System (ATS-1) feature burst buffer resources that can have a dramatic impact on an application's I/O performance. While these non-volatile memory (NVM) resources provide a new tier in the storage hierarchy, developers must find the right way to incorporate the technology into their applications in order to reap the benefits. Similar to other laboratories, Sandia is actively investigating ways in which these resources can be incorporated into our existing libraries and workflows without burdening our application developers with excessive, platform-specific details. This FY18Q1 milestone summarizes our progress in adapting the Sandia Parallel Aerodynamics and Reentry Code (SPARC) in Sandia's ATDM program to leverage Trinity's burst buffers for checkpoint/restart operations. We investigated four different approaches with varying tradeoffs in this work: (1) simply updating the job script to use stage-in/stage-out burst buffer directives, (2) modifying SPARC to use LANL's hierarchical I/O (HIO) library to store/retrieve checkpoints, (3) updating Sandia's IOSS library to incorporate the burst buffer in all meshing I/O operations, and (4) modifying SPARC to use our Kelpie distributed memory library to store/retrieve checkpoints. Team members were successful in generating initial implementations for all four approaches, but were unable to obtain performance numbers in time for this report, for two reasons: the initial problem sizes were not large enough to stress I/O, and the SPARC refactor will require changes to our code. When we presented our work to the SPARC team, they expressed the most interest in the second and third approaches. The HIO work was favored because it is lightweight, unobtrusive, and should be portable to ATS-2. The IOSS work is seen as a long-term solution, and is favored because all I/O work (including checkpoints) can be deferred to a single library.

Publications

EMPRESS Metadata Harvesting

2017-11-01 Wed
faodel io hpc pub

We published an unclassified unlimited release (UUR) paper.

Abstract

Significant challenges exist in the efficient retrieval of data from extreme-scale simulations. An important and evolving method of addressing these challenges is application-level metadata management. Historically, HDF5 and NetCDF have eased data retrieval by offering rudimentary attribute capabilities that provide basic metadata. ADIOS simplified data retrieval by utilizing metadata for each process' data. EMPRESS provides a simple example of the next step in this evolution by integrating per-process metadata with the storage system itself, making it more broadly useful than single file or application formats. Additionally, it allows for more robust and customizable metadata.

Publications

Reference Architecture for Emulytics Clusters

2017-10-01 Sun
clusters pub

We published an unclassified unlimited release (UUR) technical report.

Abstract

In this document we describe a reference architecture developed for Emulytics clusters at Sandia National Laboratories. Taking into consideration the constraints of our Emulytics software and the requirements for integration with the larger computing facilities at Sandia, we developed a cluster platform suitable for use by Sandia's several Emulytics toolsets and also useful for more general large-scale computing tasks.

Publication

Kahuna in Sandia Lab News

2017-04-21 Fri
clusters news

We discussed Kahuna in an unclassified unlimited release (UUR) lab newspaper article.

News Article

Data-Management Services for ECP

2017-01-31 Tue
net systems

We presented an unclassified unlimited release (UUR) poster.

Poster