Christian Engelmann, Ph.D.
https://www.christian-engelmann.info/

Home (https://www.christian-engelmann.info/)

Solutions (https://www.christian-engelmann.info/?page_id=1800)
- xSim – The Extreme-scale Simulator (https://www.christian-engelmann.info/?page_id=1804)
- redMPI – A Redundant MPI (https://www.christian-engelmann.info/?page_id=1873)
- Proactive Fault Tolerance Framework (https://www.christian-engelmann.info/?page_id=1912)
- Hybrid Full/Incremental System-level Checkpointing (https://www.christian-engelmann.info/?page_id=1939)
- Symmetric Active/Active High Availability for HPC System Services (https://www.christian-engelmann.info/?page_id=1955)

Current Research (https://www.christian-engelmann.info/?page_id=148)
- 2021-…: An Open Federated Architecture for the Laboratory of the Future (https://www.christian-engelmann.info/?page_id=3858)
- 2015-…: Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (https://www.christian-engelmann.info/?page_id=2416)

Past Research (https://www.christian-engelmann.info/?page_id=2481)
- 2018-2019: rOpenMP: A Resilient Parallel Programming Model for Heterogeneous Systems (https://www.christian-engelmann.info/?page_id=3213)
- 2015-2019: Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems (https://www.christian-engelmann.info/?page_id=2470)
- 2013-2016: Hobbes – OS and Runtime Support for Application Composition (https://www.christian-engelmann.info/?page_id=2099)
- 2013-2016: MCREX – Monte Carlo Resilient Exascale Solvers (https://www.christian-engelmann.info/?page_id=2047)
- 2012-2014: Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing (https://www.christian-engelmann.info/?page_id=1443)
- 2011-2012: Extreme-scale Algorithms and Software Institute (https://www.christian-engelmann.info/?page_id=1024)
- 2009-2011: Soft-Error Resilience for Future-Generation High-Performance Computing Systems (https://www.christian-engelmann.info/?page_id=909)
- 2008-2011: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond (https://www.christian-engelmann.info/?page_id=914)
- 2008-2011: Scalable Algorithms for Petascale Systems with Multicore Architectures (https://www.christian-engelmann.info/?page_id=916)
- 2006-2009: Harness Workbench: Unified and Adaptive Access to Diverse HPC Platforms (https://www.christian-engelmann.info/?page_id=918)
- 2006-2008: Virtualized System Environments for Petascale Computing and Beyond (https://www.christian-engelmann.info/?page_id=922)
- 2004-2007: MOLAR – Modular Linux and Adaptive Runtime Support for High-End Computing (https://www.christian-engelmann.info/?page_id=924)
- 2004-2006: Reliability, Availability, and Serviceability (RAS) for Terascale Computing (https://www.christian-engelmann.info/?page_id=926)
- 2002-2004: Super-Scalable Algorithms for Next-Generation High-Performance Cellular Architectures (https://www.christian-engelmann.info/?page_id=928)
- 2000-2005: Harness – Heterogeneous Distributed Computing (https://www.christian-engelmann.info/?page_id=930)

Teaching (https://www.christian-engelmann.info/?page_id=835)

Publications (https://www.christian-engelmann.info/?page_id=15)
- Peer-reviewed Journal Papers (https://www.christian-engelmann.info/?page_id=99)
- Peer-reviewed Conference Papers (https://www.christian-engelmann.info/?page_id=109)
- Peer-reviewed Workshop Papers (https://www.christian-engelmann.info/?page_id=1125)
- Peer-reviewed Conference Posters (https://www.christian-engelmann.info/?page_id=116)
- White Papers (https://www.christian-engelmann.info/?page_id=97)
- Technical Reports (https://www.christian-engelmann.info/?page_id=118)
- Talks and Lectures (https://www.christian-engelmann.info/?page_id=122)
- Co-advised Theses (https://www.christian-engelmann.info/?page_id=124)
- Theses (https://www.christian-engelmann.info/?page_id=127)
- BibTeX Citations (https://www.christian-engelmann.info/?page_id=26)

Activities (https://www.christian-engelmann.info/?page_id=288)

About Me (https://www.christian-engelmann.info/?p=1)

Senior Scientist (https://csmd.ornl.gov/profile/christian-engelmann)
Intelligent Systems and Facilities Group (https://csmd.ornl.gov/group/intelligent-systems-and-facilities)
Oak Ridge National Laboratory (ORNL) (http://www.ornl.gov)

Download (engelmann.pdf)
Download (publications.pdf)

XING: https://www.xing.com/profile/Christian_Engelmann7
LinkedIn: http://www.linkedin.com/in/christianengelmann
Google Scholar: https://scholar.google.com/citations?user=99-rbwsAAAAJ
DBLP: https://dblp.org/pid/71/4514
ORCID: https://orcid.org/0000-0003-4365-6416
Scopus Author ID 18037364000: https://www.scopus.com/authid/detail.uri?authorId=18037364000

Google Scholar profile (https://scholar.google.com/citations?user=99-rbwsAAAAJ): 4,434 citations, h-index 32, i10-index 71
Erdős number 3 (Erdős Number Project: http://www.oakland.edu/enp)
Committees: 171, 46 (?page_id=288#committees)
Reviews: 60 (?page_id=288#reviews)

2015 US Department of Energy Early Career Research Award (https://science.osti.gov/early-career)

New Approach to Fault Tolerance Means More Efficient High-Performance Computers (https://www.energy.gov/science/ascr/articles/new-approach-fault-tolerance-means-more-efficient-high-performance-computers)
What’s New in HPC Research: GPU Lifetimes, the Square Kilometre Array, Support Tickets & More (https://www.hpcwire.com/2021/01/04/whats-new-in-hpc-research-gpu-lifetimes-the-square-kilometre-array-support-tickets-more)
What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More (https://www.hpcwire.com/2018/11/19/whats-new-in-hpc-research-thrill-for-big-data-scaling-resilience-and-more)
Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems (https://insidehpc.com/2018/08/characterizing-faults-errors-failures-extreme-scale-computing-systems/)
Mounting a charge. Early-career awardees attack exascale computing on two fronts: Power and resilience (https://ascr-discovery.org/2015/07/mounting-a-charge)
Tackling Power and Resilience at Exascale (http://www.hpcwire.com/2015/07/21/tackling-power-and-resilience-at-exascale)
Supercomputers face growing resilience problems (https://www.computerworld.com/article/2493336/supercomputers-face-growing-resilience-problems.html)

21st ACM International Conference on Supercomputing (ICS) 2007 (http://ics07.ac.upc.edu)
DOI: 10.1145/1274971.1274978 (http://dx.doi.org/10.1145/1274971.1274978)
Abstract: Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from unhealthy nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
Paper: http://www.christian-engelmann.info/publications/nagarajan07proactive.pdf
Slides: http://www.christian-engelmann.info/publications/nagarajan07proactive.ppt.pdf
BibTeX: ?page_id=26#nagarajan07proactive
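
The control loop described in this abstract is simple at its core: a daemon watches node health and, when health deteriorates, live-migrates the guest OS (and the MPI task inside it) to a spare healthy node before the node fails. The following is a minimal, hypothetical sketch of such a loop; the temperature threshold, the sysfs sensor path, the spare-host pool, and the use of libvirt's `virsh migrate --live` command are assumptions for illustration, not the paper's actual Xen-based implementation.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a proactive fault-tolerance daemon: monitor node
health and trigger live migration of a guest OS away from a deteriorating
node. Thresholds, paths, commands, and host names are illustrative only."""

import subprocess
import time

TEMP_THRESHOLD_C = 85.0                          # assumed health threshold (CPU temperature)
POLL_INTERVAL_S = 10                             # how often to sample node health
SPARE_HOSTS = ["node-spare-1", "node-spare-2"]   # assumed pool of healthy migration targets


def read_cpu_temperature() -> float:
    """Read the CPU temperature from the Linux thermal sysfs interface (assumed sensor)."""
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0


def live_migrate(domain: str, target_host: str) -> None:
    """Trigger live migration of a virtual machine via libvirt's virsh CLI."""
    subprocess.run(
        ["virsh", "migrate", "--live", domain, f"qemu+ssh://{target_host}/system"],
        check=True,
    )


def monitor(domain: str) -> None:
    """Poll health and migrate the guest before the node fails outright."""
    while True:
        temp = read_cpu_temperature()
        if temp > TEMP_THRESHOLD_C and SPARE_HOSTS:
            target = SPARE_HOSTS.pop(0)
            print(f"health deteriorating ({temp:.1f} C): migrating {domain} to {target}")
            live_migrate(domain, target)
            return
        time.sleep(POLL_INTERVAL_S)


if __name__ == "__main__":
    monitor("mpi-guest-0")   # hypothetical domain name of the guest running the MPI task
```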

International Journal of High Performance Computing Applications (IJHPCA) (http://hpc.sagepub.com)
DOI: 10.1177/1094342014522573 (http://dx.doi.org/10.1177/1094342014522573)
Abstract: We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.
Paper: http://www.christian-engelmann.info/publications/snir14addressing.pdf
BibTeX: ?page_id=26#snir14addressing

25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012 (http://sc12.supercomputing.org)
DOI: 10.1109/SC.2012.49 (http://dx.doi.org/10.1109/SC.2012.49)
Abstract: Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI applications while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best suited protocols for detecting and correcting MPI data that is the result of corruption. To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy. Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0% and 30% as compared to unprotected applications without redundancy. Using our fault injector within RedMPI, we observe that even a single soft error can have profound effects on running applications, causing a cascading pattern of corruption that, in most cases, spreads to all other processes. RedMPI's protection has been shown to successfully mitigate the effects of soft errors while allowing applications to complete with correct results even in the face of errors.
Paper: http://www.christian-engelmann.info/publications/fiala12detection2.pdf
Slides: http://www.christian-engelmann.info/publications/fiala12detection2.ppt.pdf
BibTeX: ?page_id=26#fiala12detection2
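
The core idea the abstract describes is that silent corruption in one replica shows up as a mismatch between the MPI message payloads the replicas produce, so comparing two replicas detects corruption and majority-voting among three can also correct it. A minimal illustrative sketch of that voting step, outside of any real MPI layer, might look like the following; the function names and byte-level comparison are assumptions for illustration, not RedMPI's actual consistency protocols.

```python
"""Illustrative sketch of replica-message voting for silent-data-corruption
detection and correction, in the spirit of dual/triple modular redundancy.
This is not RedMPI's implementation; names and semantics are assumed."""

from collections import Counter
from typing import Optional, Sequence


def detect_mismatch(replica_msgs: Sequence[bytes]) -> bool:
    """Dual redundancy: two replicas can detect corruption, but not tell which copy is right."""
    return len(set(replica_msgs)) > 1


def vote(replica_msgs: Sequence[bytes]) -> Optional[bytes]:
    """Triple (or higher) redundancy: majority vote corrects a single corrupt replica.
    Returns the majority payload, or None if no majority exists."""
    payload, count = Counter(replica_msgs).most_common(1)[0]
    return payload if count > len(replica_msgs) // 2 else None


if __name__ == "__main__":
    good = b"\x00\x01\x02\x03"
    corrupt = b"\x00\x01\x02\x07"                  # one flipped bit in one replica's message

    print(detect_mismatch([good, corrupt]))        # True: detected, cannot correct
    print(vote([good, good, corrupt]) == good)     # True: majority vote recovers the message
```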

21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008 (http://sc08.supercomputing.org)
DOI: 10.1145/1413370.1413414 (http://dx.doi.org/10.1145/1413370.1413414)
Abstract: As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
Paper: http://www.christian-engelmann.info/publications/wang08proactive.pdf
Slides: http://www.christian-engelmann.info/publications/wang08proactive.ppt.pdf
BibTeX: ?page_id=26#wang08proactive

32nd International Conference on Distributed Computing Systems (ICDCS) 2012 (http://icdcs-2012.org/)
DOI: 10.1109/ICDCS.2012.56 (http://dx.doi.org/10.1109/ICDCS.2012.56)
Abstract: Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is an additional overhead on communication and synchronization. This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resource utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy - but not for partial redundancy - and show a close fit to the model. At 80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds 770,000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availabilities.
Paper: http://www.christian-engelmann.info/publications/elliott12combining.pdf
Slides: http://www.christian-engelmann.info/publications/elliott12combining.ppt.pdf
BibTeX: ?page_id=26#elliott12combining
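
The trade-off quantified in this abstract can be illustrated with a back-of-the-envelope model: replication multiplies the resources used but divides the failure rate the application sees, which shrinks the checkpoint and rework overhead that dominates at large scale. The sketch below is a simplified, hypothetical model (exponential failures, Young's first-order approximation for the checkpoint interval, and a deliberately crude assumption that replication divides the visible failure rate by the replication degree); it is not the paper's model and the parameter values are made up.

```python
"""Back-of-the-envelope model of checkpoint/restart cost with and without
replication. Simplified assumptions only; not the paper's model or data."""

import math


def wallclock(solve_h: float, node_mtbf_h: float, processes: int,
              checkpoint_h: float, replicas: int = 1) -> float:
    """Estimated wallclock hours under a first-order checkpoint/restart model.

    Crude assumption: replication divides the application-visible failure rate
    by the replication degree (the real reduction is typically much larger,
    since a replicated rank only fails once all of its replicas have failed).
    """
    failure_rate = processes / node_mtbf_h / replicas   # failures per hour seen by the app
    system_mttf = 1.0 / failure_rate
    tau = math.sqrt(2.0 * checkpoint_h * system_mttf)   # Young's optimal checkpoint interval
    waste = min(checkpoint_h / tau + tau / (2.0 * system_mttf), 0.95)
    return solve_h / (1.0 - waste)


if __name__ == "__main__":
    solve, node_mtbf, ckpt = 128.0, 500_000.0, 0.1       # hours; illustrative values only
    for n in (10_000, 80_000, 500_000):
        plain = wallclock(solve, node_mtbf, n, ckpt, replicas=1)
        dual = wallclock(solve, node_mtbf, n, ckpt, replicas=2)
        print(f"{n:>7} ranks: no redundancy {plain:6.1f} h, "
              f"dual redundancy {dual:6.1f} h (using 2x the nodes)")
```

Even this crude model shows the qualitative effect from the abstract: as the process count grows, the checkpoint/rework waste of the unreplicated run grows faster than that of the replicated run, so spending extra nodes on redundancy can buy back wallclock time.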

27th European Conference on Parallel and Distributed Computing (Euro-Par) 2021 Workshops (http://2021.euro-par.org)
14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (http://www.csm.ornl.gov/srt/conferences/Resilience/2021)
Abstract: Resilience to faults, errors, and failures in extreme-scale HPC systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. While prior work focused on developing performance, reliability, and availability models for resilience design patterns, this paper extends it by providing a Resilience Design Patterns Modeling (RDPM) tool which allows (1) exploring performance, reliability, and availability of each resilience design pattern, (2) offering customization of parameters to optimize performance, reliability, and availability, and (3) allowing investigation of trade-off models for combining multiple patterns for practical resilience solutions.
BibTeX: ?page_id=26#kumar21rdpm

Journal of Parallel and Distributed Computing (JPDC) (http://www.elsevier.com/locate/jpdc)
DOI: 10.1016/j.jpdc.2021.03.001 (http://dx.doi.org/10.1016/j.jpdc.2021.03.001)
Abstract: Today's High Performance Computing (HPC) systems contain thousands of nodes which work together to provide performance in the order of petaflops. The performance of these systems depends on various components like processors, memory, and interconnect. Among all, the interconnect plays a major role as it glues together all the hardware components in an HPC system. A slow interconnect can severely impact a scientific application running on multiple processes, as the processes rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores different interconnect errors, congestion events and application characteristics on a large-scale HPC system. In our previous work, we process and analyze interconnect data of the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion to predict applications encountering congestion with more than 90% accuracy.
Paper: http://www.christian-engelmann.info/publications/kumar21study.pdf
BibTeX: ?page_id=26#kumar21study

25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020 (http://prdc.dependability.org/PRDC2020)
DOI: 10.1109/PRDC50213.2020.00014 (http://dx.doi.org/10.1109/PRDC50213.2020.00014)
Abstract: For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and system software design. Among these challenges, providing resiliency and stability to the scientific applications in the presence of high fault rates requires new approaches to software architecture and design. As HPC systems become increasingly complex, they require intricate solutions for detection and mitigation for various modes of faults and errors that occur in these large-scale systems, as well as solutions for failure recovery. These resiliency solutions often interact with and affect other system properties, including application scalability, power and energy efficiency. Therefore, resilience solutions for HPC systems must be thoughtfully engineered and deployed. In previous work, we developed the concept of resilience design patterns, which consist of templated solutions based on well-established techniques for detection, mitigation and recovery. In this paper, we use these patterns as the foundation to propose new approaches to designing runtime systems for HPC systems. The instantiation of these patterns within a runtime system enables flexible and adaptable end-to-end resiliency solutions for HPC environments. The paper describes the architecture of the runtime system, named Plexus, and the strategies for dynamically composing and adapting pattern instances under runtime control. This runtime-based approach enables actively balancing the cost-benefit trade-off between performance overhead and protection coverage of the resilience solutions. Based on a prototype implementation of Plexus, we demonstrate the resiliency and performance gains achieved by the pattern-based runtime system for a parallel linear solver application.
Paper: http://www.christian-engelmann.info/publications/hukerikar20plexus.pdf
BibTeX: ?page_id=26#hukerikar20plexus

33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020 (http://sc20.supercomputing.org)
DOI: 10.1109/SC41405.2020.00045 (http://dx.doi.org/10.1109/SC41405.2020.00045)
Abstract: The Cray XK7 Titan was the top supercomputer system in the world for a very long time and remained critically important throughout its nearly seven-year life. It was also a very interesting machine from a reliability viewpoint as most of its power came from 18,688 GPUs whose operation was forced to execute three very significant rework cycles, two on the GPU mechanical assembly and one on the GPU circuit boards. We write about the last rework cycle and a reliability analysis of over 100,000 operation years in the GPU lifetimes, which correspond to Titan's 6-year-long productive period after an initial break-in period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability is dependent on heat dissipation to an extent that strongly correlates with detailed nuances of the system cooling architecture and job scheduling. In addition to describing some of the system history, the data collection, data cleaning, and our analysis of the data, we provide reliability recommendations for designing future state-of-the-art supercomputing systems and their operation. We make the data and our analysis codes publicly available.
Paper: http://www.christian-engelmann.info/publications/ostrouchov20gpu.pdf
Slides: http://www.christian-engelmann.info/publications/ostrouchov20gpu.ppt.pdf
BibTeX: ?page_id=26#ostrouchov20gpu
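
The statistical survival analysis mentioned in this abstract treats each GPU's time in service as a possibly right-censored lifetime (a GPU that was still working when observation stopped contributes information without an observed failure). As an illustration of the basic technique only, and not of the paper's published analysis code, here is a small hand-rolled Kaplan-Meier estimator of the survival function for a toy set of lifetimes:

```python
"""Toy Kaplan-Meier estimator for right-censored lifetimes, illustrating the
survival-analysis technique referenced in the abstract (not the paper's code).
Each observation is (time_in_service_years, failed); failed=False means the
unit was still working when observation stopped (right-censored)."""


def kaplan_meier(observations):
    """Return [(time, estimated survival S(t))] with one entry per observed failure."""
    # sort by time; at tied times, process failures before censored units (K-M convention)
    obs = sorted(observations, key=lambda o: (o[0], not o[1]))
    n_at_risk = len(obs)
    survival = 1.0
    curve = []
    for time, failed in obs:
        if failed:
            survival *= 1.0 - 1.0 / n_at_risk   # survival steps down at each observed failure
            curve.append((time, survival))
        n_at_risk -= 1                          # failures and censored units both leave the risk set
    return curve


if __name__ == "__main__":
    # illustrative lifetimes in years; True = failed, False = censored (still in service)
    gpus = [(0.8, True), (1.5, False), (2.1, True), (3.0, True),
            (4.2, False), (5.5, True), (6.0, False), (6.0, False)]
    for t, s in kaplan_meier(gpus):
        print(f"S({t:.1f} y) = {s:.3f}")
```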

33rd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2020 (http://sc20.supercomputing.org)
10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2020 (http://sites.google.com/site/ftxsworkshop/home/ftxs-2020)
DOI: 10.1109/FTXS51974.2020.00008 (http://dx.doi.org/10.1109/FTXS51974.2020.00008)
Abstract: Resilience plays an important role in supercomputers by providing correct and efficient operation in case of faults, errors, and failures. Resilience design patterns offer blueprints for effectively applying resilience technologies. Prior work focused on developing initial efficiency and performance models for resilience design patterns. This paper extends it by (1) describing performance, reliability, and availability models for all structural resilience design patterns, (2) providing more detailed models that include flowcharts and state diagrams, and (3) introducing the Resilience Design Pattern Modeling (RDPM) tool that calculates and plots the performance, reliability, and availability metrics of individual patterns and pattern combinations.
Paper: http://www.christian-engelmann.info/publications/kumar20models.pdf
BibTeX: ?page_id=26#kumar20models
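
The kinds of metrics the RDPM tool reports build on textbook reliability formulas: for a component with mean time to failure (MTTF) and mean time to repair (MTTR), steady-state availability is MTTF / (MTTF + MTTR), and under an exponential failure model the reliability over a mission time t is R(t) = exp(-t / MTTF). The sketch below computes these for single pattern instances and for a simple serial combination of two patterns; it is a hedged illustration of the modeling approach under those standard assumptions, not the RDPM tool itself, and the parameter values are made up.

```python
"""Illustrative reliability/availability arithmetic in the spirit of the
Resilience Design Patterns Modeling (RDPM) tool described in the abstract.
Standard textbook formulas; made-up parameter values; not the tool itself."""

import math
from dataclasses import dataclass


@dataclass
class PatternModel:
    name: str
    mttf_h: float   # mean time to failure, hours
    mttr_h: float   # mean time to repair/recover, hours

    def availability(self) -> float:
        """Steady-state availability A = MTTF / (MTTF + MTTR)."""
        return self.mttf_h / (self.mttf_h + self.mttr_h)

    def reliability(self, mission_h: float) -> float:
        """R(t) = exp(-t / MTTF) under an exponential failure model."""
        return math.exp(-mission_h / self.mttf_h)


def serial(patterns, mission_h: float):
    """Serial combination: the solution works only if every pattern-protected
    component works, so availabilities and reliabilities multiply."""
    a = math.prod(p.availability() for p in patterns)
    r = math.prod(p.reliability(mission_h) for p in patterns)
    return a, r


if __name__ == "__main__":
    ckpt = PatternModel("checkpoint/restart", mttf_h=200.0, mttr_h=0.5)
    rollback = PatternModel("rollback recovery", mttf_h=500.0, mttr_h=0.2)
    a, r = serial([ckpt, rollback], mission_h=24.0)
    print(f"combined availability = {a:.4f}, 24 h reliability = {r:.4f}")
```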