Feb 20, 2017
Jolley Hall, Room 309
The SMURFS Project: Simulation and Modeling for Understanding Resilience and Faults at Scale
Department of Computer Science
University of New Mexico
Current HPC research explorations target computer systems with exaflop (10^18 or a quintillion floating point operations per second) capabilities. Such computational power will enable new, important discoveries across all basic science domains. Application resilience is a major challenge to the realization of extreme scale computing systems. The SMURFS Project addresses this challenge by developing methods to improve our predictive understanding of the complex interactions amongst a given application, a given real or hypothetical hardware and software system environment and a given fault-tolerance strategy at extreme scale. Specifically, SMURFS explores: (1) Advanced simulation and modeling capabilities for studying application resilience at scale; (2) Comprehensive, comparative studies of existing and new fault-tolerance strategies; (3) Detailed understandings of how application features interplay with different fault-tolerance strategies and hardware technologies; and (4) Effective prescriptions to guide application developers, hardware architects and system designers to realize efficient, resilient extreme scale capabilities.
(This project is a collaboration amongst the University of New Mexico, the University of Tennessee and Sandia National Laboratories. It is funded in part by the National Science Foundation.)
Dorian Arnold is an associate professor in the Department of Computer Science at the University of New Mexico. His broad research interests include operating and distributed systems, system software, middleware and run-time systems, online (streaming) data analysis, fault-tolerance and high-performance tools.
In particular, he is interested in the performance, scalability and reliability issues that abound in extreme scale computing environments comprising hundreds of thousands or even millions of components. Professor Arnold's research group maintains strong collaborations with the Los Alamos, Livermore, and Sandia National Laboratories and Cray Inc. These collaborations lend the privilege of working with world-class scientists and engineers on world-class computing systems. In part due to such collaborations, Professor Arnold's research projects were selected as Top 100 R&D technologies in 1999 and 2011.
Arnold is very active in the HPC community: he has held many leadership roles in major HPC conferences and currently serves on the SC steering committee. He is also very dedicated to diversity and inclusion in computer science and serves as the General Chair for the 2017 Tapia Conference. He is an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems and was recently appointed as an ACM Distinguished Speaker.
Arnold holds a Ph.D. in Computer Science from the University of Wisconsin, an M.S. in Computer Science from the University of Tennessee and a B.S. in Mathematics and Computer Science from Regis University (Denver, CO).