USRC Publications

View the publications from various staff members of the USRC.
2019
  • Gagandeep Panwar, Da Zhang, Yihan Pang, Mai Dahshan, Nathan DeBardeleben, Binoy Ravindran, and Xun Jian. 2019. Quantifying Memory Underutilization in HPC Systems and Using it to Improve Performance via Architecture Support. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52). ACM, New York, NY, USA, 821-835. DOI: https://doi.org/10.1145/3352460.3358267 (ACM Digital Library) (data from paper)
  • Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben, Qiang Guan, and Zizhong Chen. 2019. TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). ACM, New York, NY, USA, 106-116.  (ACM Digital Library)
  • BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems
    Zitao Chen, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben. To appear in The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2019. (TensorFI-BinaryFI on GitHub) (blog at UBC about the paper)
  • S. Huang, S. Liang, S. Fu, W. Shi, D. Tiwari and H. Chen, "Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems," 2019 IEEE International Conference on Autonomic Computing (ICAC), Umeå, Sweden, 2019, pp. 157-166.
    doi: 10.1109/ICAC.2019.00027 (IEEE Digital Library)
  • Z. Qiao, S. Liang, S. Fu, H. Chen and B. Settlemyer, "Characterizing and Modeling Reliability of Declustered RAID for HPC Storage Systems," 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Industry Track, Portland, OR, USA, 2019, pp. 17-20.
    doi: 10.1109/DSN-Industry.2019.00011 (IEEE Digital Library)
  • Zhi Qiao, Song Fu, Hsing-Bung Chen and Bradley Settlemyer, Exploring Declustered Software RAID for Enhanced Reliability and Recovery Performance in Storage Systems, The 38th International Symposium on Reliable Distributed Systems (SRDS 2019). Oct. 1st – Oct. 4th, 2019, Lyon, France.  (to appear)
  • Nathan Hjelm, Howard Pritchard, Samuel K. Gutierrez, Daniel J. Holmes, Ralph Castain and Anthony Skjellum, "MPI Sessions: Evaluation of an Implementation in Open MPI," IEEE Cluster 2019.  (to appear)
2018
  • M. Hickman et al., "Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code," 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Memphis, TN, 2018, pp. 100-105. doi: 10.1109/ISSREW.2018.00-23 (IEEE Digital Library)
  • E. Baseman et al., "Physics-Informed Machine Learning for DRAM Error Modeling," 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Chicago, IL, 2018, pp. 1-6. doi: 10.1109/DFT.2018.8602983 (IEEE Digital Library)
  • Amvrosiadis G, Park J W, Ganger G, Gibson G, Baseman E, and DeBardeleben N.  On the Diversity of Cluster Workloads and its Impact on Research Results.  USENIX ATC 2018. (USENIX link with abstract, presentation slides, and audio)
  • Z Qiao, J Hochstetler, S Liang, S Fu, H Chen, B Settlemyer. Developing Cost-Effective Data Rescue Schemes to Tackle Disk Failures in Data Centers.  International Conference on Big Data, 194-208
  • Qiang Liu, Nageswara SV Rao, Satyabrata Sen, Bradley W Settlemyer, Hsing-Bung Chen, Joshua M Boley, Rajkumar Kettimuthu, Dimitrios Katramatos.  Virtual Environment for Testing Software-Defined Networking Solutions for Scientific Workflows.  Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science, Pages 3-11. ACM.
  • Nageswara SV Rao, Qiang Liu, Satyabrata Sen, Raj Kettimuthu, Josh Boley, Bradley W Settlemyer, Hsing B Chen, Dimitrios Katramatos, Dantong Yu.  Software-Defined Network Solutions for Science Scenarios: Performance Testing Framework and Measurements.  Proceedings of the 19th International Conference on Distributed Computing and Networking, Pages 53-64.  ACM.
  • Michael A Sevilla, Carlos Maltzahn, Peter Alvaro, Reza Nasirigerdeh, Bradley W Settlemyer, Danny Perez, David Rich, Galen M Shipman.  Programmable Caches with a Data Management Language and Policy Engine.  Proceedings of the International Symposium on Cluster, Cloud and Grid Computing (CCGrid'18).
  • Scott Levy, Kurt B. Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, and Elisabeth Baseman. 2018. Lessons learned from memory errors observed over the lifetime of Cielo. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18). IEEE Press, Piscataway, NJ, USA, Article 43, 12 pages. DOI: https://doi.org/10.1109/SC.2018.00046 (ACM Digital Library)
  • DeLucia, A, Baseman E. Work in Progress: Topic Modeling for HPC Job State Prediction. Machine Learning for Computing Systems Workshop, HPDC 2018.
  • Goetting I, Baseman E, Cao H. Work in Progress: Causal Relationships amongst Sensors in the Trinity Supercomputer. Machine Learning for Computing Systems Workshop, HPDC 2018.
  • DeLucia, A, Baseman E. High Performance Computing Job Outcome by Mining System Logs. Southern Data Science Conference 2018.
2017
  • Tan L, DeBardeleben N, Guan Q, Blanchard S, Lang M. 2017. RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage using Register Vulnerability. The 3rd International Workshop on Recent Advances in the DependabIlity AssessmeNt of Complex systEms (RADIANCE).
  • Tan L, DeBardeleben N, Guan Q, Blanchard S, Lang M. 2017. Using Virtualization to Quantify Power Conservation via Near-Threshold Voltage Reduction for Inherently Resilient Applications. Parallel Computing. 
  • Otstott D, Ionkov L, Lang M, Zhao M. 2017. TCASM: An asynchronous shared memory interface for high-performance application composition. Parallel Computing. 63: 61-78.
  • Wu P, DeBardeleben N, Guan Q, Blanchard S, Chen J, Tao D, Liang X, Ouyang K, Chen Z. 2017. Silent Data Corruption Resilient Two-sided Matrix Factorizations. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 415–427, ACM, Austin, Texas, USA, 2017, ISBN: 978-1-4503-4493-7.
  • Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth A Gibson, Charles D Cranor, Bradley W Settlemyer, Gary Grider, Fan Guo.  2017.  Software-defined storage for fast trajectory queries using a deltaFS indexed massive directory.  Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 7-12, ACM
  • Lei Cao, Bradley W Settlemyer, John Bent.  2017.  To share or not to share: comparing burst buffer architectures.  Conference Proceedings of the 25th High Performance Computing Symposium.  Pages 4-14, Society for Computer Simulation International.
  • Baseman E. Helping Exascale Computers Help Us: Machine Learning for High Performance Computing. Women in Machine Learning Workshop, NIPS 2017.
  • Haque A, DeLucia A, Baseman E. Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs. HPC User Support Tools Workshop, Supercomputing 2017.
  • Siddiqua T, Sridharan V, Raasch S, DeBardeleben N, Ferreira K, Levy S, Baseman E, Guan Q. Lifetime Memory Reliability Data from the Field. DFT 2017.
  • Baseman E, DeBardeleben N, Ferreira K, Sridharan V, Siddiqua T, Tkachenko O. Automating DRAM Fault Mitigation by Learning from Experience. DSN (Industrial Track) 2017.
2016
  • Baseman E, Blanchard S, Li Z, Fu S. 2016. Relational Synthesis of Text and Numeric Data for Anomaly Detection on Computing System Logs. 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 882-885. 
  • Baseman E, DeBardeleben N, Ferreira K, Levy S, Raasch S, Sridharan V, Siddiqua T, Guan Q. 2016. Improving DRAM Fault Characterization through Machine Learning. 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 250-253.
  • DeBardeleben N. 2016. Extreme scale and bleeding edge technology lead to a need for resilient high performance computing systems. 2016 IEEE International Reliability Physics Symposium (IRPS), pp. 3B-1-1-3B-1-8.
  • Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J, Chen Z. 2016. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 31–42, ACM, Kyoto, Japan, 2016, ISBN: 978-1-4503-4314-5.
  • Fang B, Wu P, Guan Q, DeBardeleben N, Monroe L, Blanchard S, Chen Z, Pattabiraman K, Ripeanu M. 2016. SDC is in the Eye of the Beholder: A Survey and Preliminary Study. 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops 2016, Toulouse, France, June 28 - July 1, 2016, pp. 72–76.
  • Monroe L, Jones WM, Lavigne SR, Davis C IV, Guan Q, DeBardeleben N. 2016. On the Inherent Resilience of Integer Operations. Euro-Par 2016: Parallel Processing Workshops - Euro-Par 2016 International Workshops, Grenoble, France, August 24-26, 2016, Revised Selected Papers, pp. 648–659.
  • Nageswara SV Rao, Qiang Liu, Satyabrata Sen, Greg Hinkel, Neena Imam, Ian Foster, Rajkumar Kettimuthu, Bradley W Settlemyer, Chase Q Wu, Daqing Yun.  2016.  Experimental analysis of file transfer rates over wide-area dedicated connections.  IEEE 18th International Conference on High Performance Computing and Communications Pages 198-205, Best Paper Winner.
  • John Bent, Bradley W Settlemyer, Gary Grider.  Serving data to the lunatic fringe: The evolution of HPC storage.  The USENIX Magazine 41 (2), 34-39.
  • NSV Rao, G Hinkel, N Imam, BW Settlemyer.  Measurements of file transfer rates over dedicated long-haul connections.  2nd International Workshop on The Lustre Ecosystem.
  • Morrow A, Baseman E, Blanchard S. Ranking Anomalous High Performance Computing Sensor Data using Unsupervised Clustering. CSCI: Symposium on Parallel and Distributed Computing and Computational Science 2016.
  • Baseman E, Blanchard S, DeBardeleben N, Bonnie A, Morrow A. Interpretable Anomaly Detection for Monitoring of High Performance Computing Systems. Outlier Definition, Detection, and Description on Demand Workshop, KDD 2016.
  • Guan Q, DeBardeleben N, Wu P, Eidenbenz S, Blanchard S, Monroe L, Baseman E, Tan L. Design, Use, and Evaluation of P-FSEFI: A Parallel Soft Error Fault Injection Framework for Emulating Soft Errors in Parallel Applications. SIMUTOOLS 2016.
  • Baseman E, DeBardeleben N, Ferreira K, Levy S, Raasch S, Sridharan V, Siddiqua T, Guan Q. Improving DRAM Fault Characterization through Machine Learning. DSN (Industrial Track) 2016.
2015
  • Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms. 5th Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop with HPDC 2015.
  • Wang K, Zhou X, Qiao K, Lang M, McClelland B, Raicu I. 2015. Towards Scalable Distributed Workload Manager with Monitoring-Based Weakly Consistent Resource Stealing. ACM HPDC.
  • Wang K, Qiao K, Sadooghi I, Zhou X, Li T, Lang M, Raicu I. 2015. Load-balanced and locality-aware scheduling for data-intensive workloads at extreme scales. Concurrency and Computation: Practice and Experience (00): 1-29.
  • Sridharan V, DeBardeleben N, Blanchard S, Ferreira K, Stearley J, Shalf J, Gurumurthi S. 2015. Memory Errors in Modern Systems: The Good, The Bad, and the Ugly. Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems. 
  • Tiwari D, Gupta S, Rogers J, Maxwell D, Rech P, Vazhkudai S, Oliveira D, Londo D, DeBardeleben N, Navaux P and others. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. IEEE 21st International Symposium on High Performance Computer Architecture (HPCA): 331-342.
  • Huang S, Fu S, DeBardeleben N, Guan Q, Xu C. 2015. Differentiated Failure Remediation with Action Selection for Resilient Computing. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
  • Guan Q, DeBardeleben N, Blanchard S, Fu S. 2015. Addressing Statistical Significance of Fault Injection: Empirical Studies of the Soft Error Susceptibility. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC).
  • Guan Q, DeBardeleben N, Atkinson B, Robey R, Jones W. 2015. Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with CLAMR Hydrodynamics Mini-App. IEEE Cluster 2015.
  • DeBardeleben N, Blanchard S, Kaeli D, Rech P. 2015. Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design. 2015 IEEE 33rd VLSI Test Symposium (VTS), pp. 1-2, 2015, ISSN: 1093-0167.
2014
  • Snir M, Wisniewski R, Abraham J, Adve S, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2014. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications.
  • DeBardeleben N, Blanchard S, Sridharan V, Gurumurthi S, Stearley J, Ferreira K, Shalf J. 2014. Extra Bits on SRAM and DRAM Errors - More Data From the Field. Silicon Errors in Logic - System Effects (SELSE-10), Stanford University. 
  • Bautista Gomez L, Cappello F, Carro L, DeBardeleben N, Fang B, Gurumurthi S, Pattabiraman K, Rech P, Sonza Reorda M. 2014. GPGPUs: How to Combine High Computational Power with High Reliability. Design, Automation & Test in Europe (DATE14), Dresden, Germany. 
  • Guan Q. 2014. F-SEFI: A Fine-grained Soft Error Fault Injector for Profiling Application Vulnerability. Poster presentation: LANL Predictive Science Panel Review, Los Alamos, NM. 
  • DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. HiPEAC 2014, Vienna, Austria.
  • DeBardeleben N. 2014. Reliability Requirements for GPUs in HPC. Design, Automation & Test in Europe (DATE14), as part of "Embedded Tutorial: GPGPUs: how to combine high computational power with high reliability". 
  • Atkinson B, DeBardeleben N, Guan Q, Robey R, Jones WM. 2014. Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App. Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium: 6-9. 
2013
  • Ionkov L, Lang M, Maltzahn C. 2013. DRepl: Optimizing Access to Application Data for Analysis and Visualization. 
  • Yuan X, Mahapatra S, Lang M, Pakin S. 2013. RRR: A Load Balanced Routing Scheme for Slimmed Fat-trees. 
  • Pakin S, Lang M. 2013. Understanding the Performance of Two Production Supercomputers. 
  • Akkan H, Lang M, Liebrock L. 2013. Understanding and isolating the noise in the Linux kernel. International Journal of High Performance Computing Applications.
  • Soltero P, Bridges P, Arnold D, Lang M. 2013. A Gossip-based Approach to Exascale System Services. 
  • Akkan H, Ionkov L, Lang M. 2013. Transparently Consistent Asynchronous Shared Memory. 
  • Pakin S, Lang M. 2013. Energy Modeling of Supercomputers and Large-Scale Scientific Applications. IEEE. 
  • Wang K, Kulkarni A, Lang M, Arnold D, Raicu I. 2013. Using Simulation to Explore Distributed Key-Value Stores for Extreme-Scale System Services. 
  • Yuan X, Mahapatra S, Nienaber W, Pakin S, Lang M. 2013. A New Routing Scheme for Jellyfish and its Performance with HPC Workloads. Supercomputing Conference. 
  • Akkan H, Lang M, Ionkov L. 2013. HPC Runtime Support for Fast and Power Efficient Locking and Synchronization. IEEE. 
  • Pakin S, Yuan X, Lang M. 2013. Predicting the performance of extreme-scale supercomputer networks. The Next Wave (http://www.nsa.gov/research/tnw/). 20(2).
  • Huang B, Sass R, DeBardeleben N, Blanchard S. 2013. PyDac: A Resilient Run-time Framework for Divide-and-Conquer Applications on a Heterogeneous Many-core Architecture. Proceedings of the 6th Workshop on UnConventional High Performance Computing 2013 (UCHPC 2013).
  • DeBardeleben N, Blanchard S, Monroe L, Romero P, Grunau D, Idler C, Wright C. 2013. GPU Behavior on a Large HPC Cluster. 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in conjunction with the 19th International European Conference on Parallel and Distributed Computing (Euro-Par 2013), Aachen, Germany.
  • Jian X, Blanchard S, DeBardeleben N, Sridharan V, Kumar R. 2013. Reliability Models for Double Chipkill Detect/Correct Memory Systems. SELSE (Silicon Errors in Logic, System Effects): 6. 
  • Snir M, Wisniewski RW, Abraham JA, Adve SV, Bagchi S, Balaji P, Belak J, Bose P, Cappello F, Carlson B and others. 2013. Addressing Failures in Exascale Computing. Argonne National Laboratory Technical Report.
  • Jian X, DeBardeleben N, Blanchard S, Sridharan V, Kumar R. 2013. Analyzing Reliability of Memory Subsystems with Double Chipkill Detect/Correct. The 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2013). Vancouver, BC, Canada. 
  • Sridharan V, Stearley J, DeBardeleben N, Blanchard S, Gurumurthi S. 2013. Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults. SC13, Denver, Colorado.
2012
  • Kulkarni A, Wang K, Lang M. 2012. Exploring the Design Tradeoffs for Exascale System Services Through Simulation.
  • Kulkarni A, Lumsdaine A, Lang M, Ionkov L. 2012. Optimizing Latency and Throughput for Spawning Processes on Massively Multicore Processors. 
  • Akkan H, Lang M, Liebrock LM. 2012. Stepping Towards Noiseless Linux Environment.
  • Kulkarni A, Manzanares A, Ionkov L, Lang M, Lumsdaine A. 2012. The Design and Implementation of a Multi-level Content-Addressable Checkpoint File System.
  • Jones WM, Daly JT, DeBardeleben N. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. Proceedings of the 50th Annual Southeast Regional Conference: 262-267. 
  • DeBardeleben N, Blanchard S, Guan Q, Zhang Z, Fu S. 2012. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience. Euro-Par 2011: Parallel Processing Workshops Lecture Notes in Computer Science. 7156: 282-291. 
  • Geist A, Snir M, Roman E, Still B, Clay R, Engelmann C, Ross R, Schulz M, Krishnamoorthy S, Vishnu A and others. 2012. US Department of Energy Fault Management Workshop Report.
  • Daly J, Harrod B, Hoang T, Nowell L, Adolf B, Borkar S, DeBardeleben N, Elnozahy M, Heroux M, Rogers D and others. 2012. Inter-Agency Workshop on HPC Resilience at Extreme Scale.
2011
  • DeBardeleben N, Blanchard SP, Fu S, Guan Q, Zhang Z. 2011. Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience. 
  • Kulkarni A, Lang M, Lumsdaine A. 2011. GoDEL: A multidirectional dataflow execution model for large-scale computing. 
  • Ionkov L. 2011. Gostor: Storage beyond POSIX.
  • Greenberg H, Lang M, Ionkov L, Blanchard SP. 2011. REDfish - REsilient Dynamic dIstributed Scalable System Services for Exascale.