This page lists all 21 workshops that are part of the IPDPS 2020 program. Click on a workshop of interest – Monday workshops at the top of the page and Friday workshops at the bottom – and the link will take you to that workshop's home page, which provides detailed information on the workshop's papers, other program material, and any planned events.
The Main Conference program that follows lists the papers accepted for the conference, organized in Technical Sessions originally scheduled for Tuesday, Wednesday, and Thursday. These papers, along with all workshop papers, are published in the proceedings and accompanied by presentation slides from the authors.
The proceedings will be released by May 15 and will be available to all registrants.
IPDPS will hold virtual events to coincide with the conference dates of 18-22 May. Participation details are available here and in the links in the program that follows.
- Tuesday, May 19: Best paper presentations and Q&A session.
- Wednesday, May 20: Best paper announcement and TCPP public meeting.
- Thursday, May 21: IPDPS Town Hall meeting.
Events on these three days will take place from 9:00 AM to 10:00 AM US Central Daylight Time / 2:00 PM UTC. Check individual workshops for any scheduled events.
MONDAY - 18 May 2020
MONDAY WORKSHOPS
Visit individual workshop websites at the links shown.
TUESDAY - 19 May 2020
Virtual Session
9:00 AM to 10:00 AM US Central Daylight Time / 2:00 PM UTC
Best Paper Presentations and Q&A Session
See this page for details and a link to join the session.
Parallel Technical Sessions 1, 2, 3, & 4
SESSION 1: Communication & NoCs
DozzNoC: Reducing Static and Dynamic Energy in NoCs with Low-latency Voltage Regulators using Machine Learning
Mark Clark, Yingping Chen, Avinash Karanth, Brian Ma, and Ahmed Louri
Neksus: An Interconnect for Heterogeneous System-In-Package Architectures
Vidushi Goyal, Xiaowei Wang, Valeria Bertacco, and Reetuparna Das
Accelerated Reply Injection for Removing NoC Bottleneck in GPGPUs
Yunfan Li and Lizhong Chen
Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures
Jahanzeb Maqbool Hashmi, Shulei Xu, Bharath Ramesh, Hari Subramoni, Mohammadreza Bayatpour, and Dhabaleswar K. (DK) Panda
SESSION 2: Storage & IO
ClusterSR: Cluster-Aware Scattered Repair in Erasure-Coded Storage
Zhirong Shen, Jiwu Shu, Zhijie Huang, and Yingxun Fu
Stitch It Up: Using Progressive Data Storage to Scale Science
Jay Lofstead, John Mitchel, and Enze Chen
HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments
Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun
CanarIO: Sounding the Alarm on IO-Related Performance Degradation
Michael Wyatt, Stephen Herbein, Kathleen Shoga, Todd Gamblin, and Michela Taufer
SESSION 3: Applications
A Study of Graph Analytics for Massive Datasets on Large-Scale Distributed GPUs
Vishwesh Jatala, Roshan Dathathri, Gurbinder Gill, Loc Hoang, V. Krishna Nandivada, and Keshav Pingali
A Highly Efficient Dynamical Core of Atmospheric General Circulation Model based on Leap-Format
Hang Cao, Liang Yuan, He Zhang, Baodong Wu, Shigang Li, Pengqi Lu, Yunquan Zhang, Yongjun Xu, and Minghua Zhang
Understanding GPU-Based Lossy Compression for Extreme-Scale Cosmological Simulations
Sian Jin, Pascal Grosset, Christopher M. Biwer, Jesus Pulido, Jiannan Tian, Dingwen Tao, and James P. Ahrens
Optimizing High Performance Markov Clustering for Pre-Exascale Architectures
Oguz Selvitopi, Md Taufique Hussain, Ariful Azad, and Aydin Buluç
SESSION 4: Distributed Algorithms
Tightening Up the Incentive Ratio for Resource Sharing Over the Rings
Yukun Cheng, Xiaotie Deng, and Yuhao Li
Communication-Efficient String Sorting
Timo Bingmann, Peter Sanders, and Matthias Schimek
SCSL: Optimizing Matching Algorithms to Improve Real-time for Content-based Pub/Sub Systems
Tianchen Ding, Shiyou Qian, Jian Cao, Guangtao Xue, and Minglu Li
Distributed Graph Realizations
John Augustine, Keerti Choudhary, Avi Cohen, David Peleg, Sumathi Sivasubramaniam, and Suman Sourav
Parallel Technical Sessions 5, 6, 7, & 8
SESSION 5: Reliability and QoS
Transaction-Based Core Reliability
Sang Wook Stephen Do and Michel Dubois
Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer
Seung-Hwan Lim, Ross Miller, and Sudharshan Vazhkudai
EC-Fusion: An Efficient Hybrid Erasure Coding Framework to Improve Both Application and Recovery Performance in Cloud Storage Systems
Han Qiu, Chentao Wu, Jie Li, Minyi Guo, Tong Liu, Xubin He, Yuanyuan Dong, and Yafei Zhao
SESSION 6: Learning Algorithms
Learning an Effective Charging Scheme for Mobile Devices
Tang Liu, Baijun Wu, Wenzheng Xu, Xiaobo Cao, Jian Peng, and Hongyi Wu
Optimize Scheduling of Federated Learning on Battery-powered Mobile Devices
Cong Wang, Xin Wei, and Pengzhan Zhou
Harnessing Deep Learning via a Single Building Block
Kunal Banerjee, Michael J. Anderson, Sasikanth Avancha, Anand Venkat, Gregory M. Henry, Evangelos Georganas, Hans Pabst, Alexander Heinecke, and Dhiraj D. Kalamkar
Experience-Driven Computational Resource Allocation of Federated Learning by Deep Reinforcement Learning
Yufeng Zhan, Peng Li, and Song Guo
SESSION 7: Data Analysis and Management
An Active Learning Method for Empirical Modeling in Performance Tuning
Jiepeng Zhang, Jingwei Sun, Wenju Zhou, and Guangzhong Sun
DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection
Bin Dong, Veronica Rodriguez, Xin Xing, Suren Byna, Jonathan Ajo-Franklin, and Kesheng Wu
Scaling of Union of Intersections for Inference of Granger Causal Networks from Observational Data
Mahesh Balasubramanian, Trevor Ruiz, Brandon Cook, Mr Prabhat, Sharmodeep Bhattacharyya, Aviral Shrivastava, and Kristofer Bouchard
GPU-Based Static Data-Flow Analysis for Fast and Scalable Android App Vetting
Xiaodong Yu, Fengguo Wei, Xinming Ou, Michela Becchi, Tekin Bicer, and Danfeng (Daphne) Yao
SESSION 8: Edge Computing
Robust Server Placement for Edge Computing
Dongyu Lu, Yuben Qu, Fan Wu, Haipeng Dai, Chao Dong, and Guihai Chen
EdgeIso: Effective Performance Isolation for Edge Devices
Yoonsung Nam, Yongjun Choi, Byeonghun Yoo, Yongseok Son, and Hyeonsang Eom
Busy-Time Scheduling on Heterogeneous Machines
Runtian Ren and Xueyan Tang
Scheduling Malleable Jobs Under Topological Constraints
Evripidis Bampis, Konstantinos Dogeas, Alexander Kononov, Giorgio Lucarelli, and Fanny Pascual
PLENARY SESSION: Best Papers
XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, and Wen-mei Hwu
Abstract—There has been a rapid proliferation of machine learning/deep learning (ML) models and wide... (full abstract below)
Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection
Ricardo Nobre, Aleksandar Ilic, Sergio Santander-Jiménez, and Leonel Sousa
Abstract—Genome-wide association studies are performed to correlate a number of diseases and other... (full abstract below)
Understanding and Improving Persistent Transactions on Optane DC Memory
Pantea Zardoshti, Michael Spear, Aida Vosoughi, and Garret Swart
Abstract—Storing data structures in high-capacity byte-addressable persistent memory instead... (full abstract below)
CycLedger: A Scalable and Secure Parallel Protocol for Distributed Ledger via Sharding
Mengqian Zhang, JiChen Li, Zhaohua Chen, Hongyin Chen, and Xiaotie Deng
Abstract—Traditional public distributed ledgers have not been able to scale out well and work... (full abstract below)
WEDNESDAY - 20 May 2020
Virtual Session
9:00 AM to 10:00 AM US Central Daylight Time / 2:00 PM UTC
Best Paper Announcement and TCPP Public Meeting
See this page for details and a link to join the session.
Parallel Technical Sessions 9, 10, 11, & 12
SESSION 9: Cloud Technology
Mitigating Large Response Time Fluctuations through Fast Concurrency Adapting in the Cloud
Jianshu Liu, Shungeng Zhang, Qingyang Wang, and Jinpeng Wei
DAG-Aware Joint Task Scheduling and Cache Management in Spark Clusters
Yinggen Xu, Liu Liu, and Zhijun Ding
Solving the Container Explosion Problem for Distributed High Throughput Computing
Tim Shaffer, Nicholas Hazekamp, Jakob Blomer, and Douglas Thain
Amoeba: QoS-Awareness and Reduced Resource Usage of Microservices with Serverless Computing
Zijun Li, Quan Chen, Shuai Xue, Tao Ma, Yong Yang, Zhuo Song, and Minyi Guo
SESSION 10: Machine Learning
Efficient I/O for Neural Network Training with Compressed Data
Zhao Zhang, Lei Huang, J. Gregory Pauloski, and Ian T. Foster
Not All Explorations Are Equal: Harnessing Heterogeneous Profiling Cost for Efficient MLaaS Training
Jun Yi, Chengliang Zhang, Wei Wang, Cheng Li, and Feng Yan
ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning
Saeed Soori, Bugra Can, Mert Gurbuzbalaban, and Maryam Dehnavi
Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs
Cheng Li, Abdul Dakkak, Jinjun Xiong, and Wen-mei Hwu
SESSION 11: GPUs
Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription
Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Alberto Zeni, Giulia Guidi, Marquita Ellis, Nan Ding, Marco D. Santambrogio, Steven Hofmeyr, Aydin Buluç, Leonid Oliker, and Katherine Yelick
Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs
Qi Yu, Bruce R. Childers, Libo Huang, Cheng Qian, Hui Guo, and Zhiying Wang
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, and Satoshi Matsuoka
SESSION 12: Applications
DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-point Computing Power of GPUs
Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, and Charles Weems
Scalability Challenges of an Industrial Implicit Finite Element Code
Francois-Henry Rouet, Cleve Ashcraft, Jef Dawson, Roger Grimes, Erman Guleryuz, Seid Koric, Robert F. Lucas, James S. Ong, Todd Simons, and Ting-Ting Zhu
ETH: An Architecture for Exploring the Design Space of In-Situ Scientific Visualization
Greg Abram, Vignesh Adhinarayanan, Wu-chun Feng, David H. Rogers, and James P. Ahrens
Scaling Betweenness Approximation to Billions of Edges by MPI-based Adaptive Sampling
Alexander van der Grinten and Henning Meyerhenke
Parallel Technical Sessions 13, 14, 15, & 16
SESSION 13: Data Management
Improved Intermediate Data Management for MapReduce Frameworks
Haoyu Wang, Haiying Shen, Charles Reiss, Arnim Jain, and Yunqiao Zhang
Bandwidth-Aware Page Placement in NUMA
David Gureya, João Neto, Reza Karimi, João Barreto, Pramod Bhatotia, Vivien Quéma, Rodrigo Rodrigues, Paolo Romano, and Vladimir Vlassov
HCompress: Hierarchical Data Compression for Multi-Tiered Storage Environments
Hariharan Devarajan, Anthony Kougkas, Luke Logan, and Xian-He Sun
FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data
Robert R. Underwood, Sheng Di, Jon Calhoun, and Franck Cappello
SESSION 14: Storage & Caching
DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors
Nadja Holtryd, Madhavan Manivannan, Per Stenström, and Miquel Pericas
Coordinated Management of Processor Configuration and Cache Partitioning to Optimize Energy under QoS Constraints
Mehrzad Nejat, Madhavan Manivannan, Miquel Pericas, and Per Stenström
StragglerHelper: Alleviating Straggling in Computing Clusters via Sharing Memory Access Patterns
Wenjie Liu, Ping Huang, and Xubin He
SESSION 15: Numerics
Evaluating the Numerical Stability of Posit Floating Point Arithmetic
Nicholas Buoncristiani, Sanjana Shah, David Donofrio, and John Shalf
Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing
Ignacio Laguna
Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
Da Yan, Wei Wang, and Xiaowen Chu
SESSION 16: IoT and Consensus
Data Collection of IoT Devices Using an Energy-Constrained UAV
Yuchen Li, Weifa Liang, Wenzheng Xu, and Xiaohua Jia
Argus: Multi-Level Service Visibility Scoping for Internet-of-Things in Enterprise Environments
Qian Zhou, Omkant Pandey, and Fan Ye
G-PBFT: A Location-based and Scalable Consensus Protocol for IoT-Blockchain Applications
LapHou Lao, Xiaohai Dai, Bin Xiao, and Songtao Guo
Byzantine Generalized Lattice Agreement
Giuseppe Antonio Di Luna, Emmanuelle Anceaume, and Leonardo Querzoni
THURSDAY - 21 May 2020
Virtual Session
9:00 AM to 10:00 AM US Central Daylight Time / 2:00 PM UTC
IPDPS Town Hall Meeting
See this page for details and a link to join the session.
Parallel Technical Sessions 17, 18, 19, & 20
SESSION 17: Graph Processing & Coding
A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing
Yu Huang, Long Zheng, Pengcheng Yao, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling Xue
Spara: An Energy-Efficient ReRAM-based Accelerator for Sparse Graph Analytics Applications
Long Zheng, Jieshan Zhao, Yu Huang, Qinggang Wang, Zhen Zeng, Jingling Xue, Xiaofei Liao, and Hai Jin
Optimal Encoding and Decoding Algorithms for the RAID-6 Liberation Codes
Zhijie Huang, Hong Jiang, Zhirong Shen, Hao Che, Nong Xiao, and Ning Li
Sturgeon: Preference-aware Co-location for Improving Utilization of Power Constrained Computers
Pu Pang, Quan Chen, Deze Zeng, Chao Li, Jingwen Leng, Wenli Zheng, and Minyi Guo
SESSION 18: Parallel Algorithms
A High-Throughput Solver for Marginalized Graph Kernels on GPU
Yu-Hang Tang, Oguz Selvitopi, Doru Thom Popovici, and Aydin Buluç
Dynamic Graphs on the GPU
Muhammad A. Awad, Saman Ashkiani, Serban D. Porumbescu, and John D. Owens
Accelerating Parallel Hierarchical Matrix-Vector Products via Data Driven Sampling
Lucas Erlandson, Difeng Cai, Yuanzhe Xi, and Edmond Chow
NC Algorithms for Popular Matchings in One-Sided Preference Systems and Related Problems
Changyong Hu and Vijay Garg
SESSION 19: Performance, Power, and Energy
Smartly Handling Renewable Energy Instability in Supporting A Cloud Datacenter
Jiechao Gao, Haoyu Wang, and Haiying Shen
A Self-Optimized Generic Workload Prediction Framework for Cloud Computing
Vinodh Kumaran Jayakumar, Jaewoo Lee, In Kee Kim, and Wei Wang
SeeSAw: Optimizing Performance of In-Situ Analytics Applications under Power Constraints
Ivana Marincic, Venkatram Vishwanath, and Henry Hoffmann
SESSION 20: Resource Management
What does Power Consumption Behavior of HPC Jobs Reveal?
Tirthak Patel, Adam Wagenhäuser, Christopher Eibel, Timo Hönig, Thomas Zeiser, and Devesh Tiwari
Efficient Parallel Adaptive Partitioning for Load-balancing in Spatial Join
Jie Yang and Satish Puri
Union: An Automatic Workload Manager for Accelerating Network Simulation
Xin Wang, Misbah Mubarak, Yao Kang, Robert B. Ross, and Zhiling Lan
Auto-Tuning Parameter Choices using Bayesian Optimization
Harshitha Menon, Abhinav Bhatele, and Todd Gamblin
Parallel Technical Sessions 21, 22, 23, & 24
SESSION 21: Runtime Systems
Inter-Job Scheduling of High-Throughput Material Screening Applications
Zhihui Du, Xining Hui, Yurui Wang, Jun Jiang, Jason Liu, Baokun Lu, and Chongyu Wang
Reservation and Checkpointing Strategies for Stochastic Jobs
Ana Gainaru, Brice Goglin, Valentin Honore, Guillaume Pallez, Padma Raghavan, Yves Robert, and Hongyang Sun
A Scheduling Approach to Incremental Maintenance of Datalog Programs
Shikha Singh, Sergey Madaminov, Michael Bender, Michael Ferdman, Ryan Johnson, Benjamin Moseley, Hung Ngo, Dung Nguyen, Soeren Olesen, Kurt Stirewalt, and Geoffrey Washburn
Dynamic Scheduling in Distributed Transactional Memory
Costas Busch, Maurice Herlihy, Miroslav Popovic, and Gokarna Sharma
SESSION 22: Performance Analysis
Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling
Marcus Ritter, Alexandru Calotoiu, Sebastian Rinke, Thorsten Reimann, Torsten Hoefler, and Felix Wolf
The Case of Performance Variability on Dragonfly-based Systems
Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, and David Lowenthal
Predicting and Comparing the Performance of Array Management Libraries
Donghe Kang, Oliver Ruebel, Suren Byna, and Spyros Blanas
Demystifying the Performance of HPC Scientific Applications on NVM-based Memory
Ivy B. Peng, Kai Wu, Jie Ren, Dong Li, and Maya Gokhale
SESSION 23: Communication
Packet-in Request Redirection for Minimizing Control Plane Response Time
Rui Xia, Haipeng Dai, Jiaqi Zheng, Hong Xu, Meng Li, and Guihai Chen
PCGCN: Partition-Centric Processing for Accelerating Graph Convolutional Network
Chao Tian, Lingxiao Ma, Zhi Yang, and Yafei Dai
ConMidbox: Consolidated Middleboxes Selection and Routing in SDN/NFV-Enabled Networks
Guiyan Liu, Songtao Guo, Pan Li, and Liang Liu
Scalable and Memory-Efficient Kernel Ridge Regression
Gustavo Chávez, Yang Liu, Pieter Ghysels, Xiaoye Sherry Li, and Elizaveta Rebrova
SESSION 24: Storage
SSDKeeper: Self-Adapting Channel Allocation to Improve the Performance of SSD Devices
Renping Liu, Xianzhang Chen, Yujuan Tan, Runyu Zhang, Liang Liang, and Duo Liu
FlashKey: A High-Performance Flash Friendly Key-Value Store
Madhurima Ray, Krishna Kant, Peng Li, and Sanjeev Trika
Pacon: Improving Scalability and Efficiency of Metadata Service through Partial Consistency
Yubo Liu, Yutong Lu, Zhiguang Chen, and Ming Zhao
Parallel Technical Sessions 25, 26, 27, & 28
SESSION 25: Program Analysis and Runtime Library
XPlacer: Automatic Analysis of Data Access Patterns on Heterogeneous CPU/GPU Systems
Peter Pirkelbauer, Pei-Hung Lin, Tristan Vanderbruggen, and Chunhua Liao
Improving Transactional Code Generation via Variable Annotation and Barrier Elision
João P.L. de Carvalho, Bruno C. Honorio, Alexandro Baldassin, and Guido Araujo
Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi
Hancheng Wu and Michela Becchi
AnySeq: A High Performance Sequence Alignment Library based on Partial Evaluation
André Müller, Bertil Schmidt, Andreas Hildebrandt, Richard Membarth, Roland Leißa, Matthis Kruse, and Sebastian Hack
SESSION 26: Scheduling
Analysis of a List Scheduling Algorithm for Task Graphs on Two Types of Resources
Lionel Eyraud-Dubois and Suraj Kumar
Optimal Convex Hull Formation on a Grid by Asynchronous Robots with Lights
Rory Hector, Ramachandran Vaidyanathan, Gokarna Sharma, and Jerry L. Trahan
On the Complexity of Conditional DAG Scheduling in Multiprocessor Systems
Alberto Marchetti-Spaccamela, Nicole Megow, Jens Schlöter, Martin Skutella, and Leen Stougie
Weaver: Efficient Coflow Scheduling in Heterogeneous Parallel Networks
Xin Sunny Huang, Yiting Xia, and T. S. Eugene Ng
SESSION 27: Fault Tolerance
Fault-Tolerant Containers Using NiLiCon
Diyu Zhou and Yuval Tamir
Aarohi: Making Real-Time Node Failure Prediction Feasible
Anwesha Das, Frank Mueller, and Barry Rountree
FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications
Pinchao Liu, Hailu Xu, Dilma Da Silva, Qingyang Wang, Sarker Tanzir Ahmed, and Liting Hu
SESSION 28: Multidisciplinary
Implementation and Evaluation of a Hardware Decentralized Synchronization Lock for MPSoCs
Maxime France-Pillois, Jérôme Martin, and Frederic Rousseau
Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, and Edgar Solomonik
Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs
Kyle Berney and Nodari Sitchinava
The Impossibility of Fast Transactions
Karolos Antoniadis, Diego Didona, Rachid Guerraoui, and Willy Zwaenepoel
FRIDAY - 22 May 2020
FRIDAY WORKSHOPS
Visit individual workshop websites at the links shown.
IPDPS 2020 BEST PAPERS
XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, and Wen-mei Hwu
Abstract—There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results.
This paper proposes XSP — an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.
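The leveled measurement idea is easy to picture with nested, level-tagged trace spans. The following minimal sketch only illustrates the concept, not XSP's implementation; the span helper, level numbering, and op and kernel names are all invented for this example:

```python
import time
from contextlib import contextmanager

# Hypothetical illustration of leveled, across-stack tracing in the spirit of
# XSP: every span records which level of the HW/SW stack it came from (model,
# framework, library, ...) so spans from different sources can be correlated
# on one timeline.
TRACE = []

@contextmanager
def span(name, level):
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        TRACE.append({"name": name, "level": level, "start": start, "end": end})

# Nested spans mimic one inference: a framework op containing library kernels.
with span("model/predict", level=1):
    with span("framework/conv2d", level=2):
        with span("cudnn/implicit_gemm", level=3):  # invented kernel name
            time.sleep(0.002)  # stand-in for kernel execution time
    with span("framework/relu", level=2):
        time.sleep(0.001)

# Report per-level latencies; a real tool would also subtract tracing overhead.
for s in sorted(TRACE, key=lambda s: (s["level"], s["start"])):
    print(f'L{s["level"]} {s["name"]}: {(s["end"] - s["start"]) * 1e3:.2f} ms')
```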
Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection
Ricardo Nobre, Aleksandar Ilic, Sergio Santander-Jiménez, and Leonel Sousa
Abstract—Genome-wide association studies are performed to correlate a number of diseases and other physical or even psychological conditions (phenotype) with substitutions of nucleotides at specific positions in the human genome, mainly single-nucleotide polymorphisms (SNPs). Some conditions, possibly because of the complexity of the mechanisms that give rise to them, have been identified to be more statistically correlated with genotype when multiple SNPs are jointly taken into account. However, the discovery of new associations between genotype and phenotype is exponentially slowed down by the increase of computational power required when epistasis, i.e., interactions between SNPs, is considered. This paper proposes a novel graphics processing unit (GPU)-based approach for epistasis detection that combines the use of modern tensor cores with native support for processing binarized inputs with algorithmic and target-focused optimizations. Using only a single mid-range Turing-based GPU, the proposed approach is able to evaluate 64.8 × 10¹² and 25.4 × 10¹² sets of SNPs per second, normalized to the number of patients, when considering 2-way and 3-way epistasis detection, respectively. This proposal is able to surpass the state-of-the-art approach by 6× and 8.2× in terms of the number of pairs and triplets of SNP allelic patient data evaluated per unit of time per GPU.
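At its core, binarized epistasis detection reduces to populating contingency tables by ANDing per-SNP bit vectors over patients and counting set bits; the paper's contribution is mapping such counts onto binary tensor-core operations. Here is a minimal pure-Python sketch of the counting idea, with made-up data:

```python
# Sketch of the binarized-counting idea: encode, per SNP and genotype, one bit
# per patient, then count patients matching a joint genotype with AND+popcount.
# The data here is invented; the paper's contribution is mapping such counts
# onto binary tensor-core matrix operations, which this sketch does not do.
snp_a_genotype0 = 0b10110101  # bit i = 1 if patient i has genotype 0 at SNP A
snp_b_genotype2 = 0b10011100  # bit i = 1 if patient i has genotype 2 at SNP B

joint = snp_a_genotype0 & snp_b_genotype2  # patients with both genotypes
count = bin(joint).count("1")              # popcount: one contingency cell
print(count)  # such counts feed the association score for the SNP pair
```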
Understanding and Improving Persistent Transactions on Optane DC Memory
Pantea Zardoshti, Michael Spear, Aida Vosoughi, and Garret Swart
Abstract—Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance, and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory.
In this paper we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.
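For readers unfamiliar with durability domains: persistence requires explicitly forcing data past the volatile part of the hierarchy before an update may be considered committed. The sketch below imitates that discipline in ordinary Python, using a redo log on disk and os.fsync as a stand-in for a persist barrier; it is an analogy only, not the paper's PDRAM mechanism:

```python
import json
import os

# A persistent "transaction" on an in-memory dict, with a redo log on disk
# standing in for a durability domain. os.fsync plays the role of the persist
# barrier: nothing is considered durable until it returns.
STORE = {}

def persistent_txn(updates, log_path="redo.log"):
    # 1. Make the redo record durable before touching the live data.
    with open(log_path, "a") as log:
        log.write(json.dumps(updates) + "\n")
        log.flush()
        os.fsync(log.fileno())  # the "persist barrier" of this analogy
    # 2. Apply the updates; after a crash, replaying the log redoes them.
    STORE.update(updates)

persistent_txn({"balance:alice": 90, "balance:bob": 110})
print(STORE)
```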
CycLedger: A Scalable and Secure Parallel Protocol for Distributed Ledger via Sharding
Mengqian Zhang, JiChen Li, Zhaohua Chen, Hongyin Chen, and Xiaotie Deng
Abstract—Traditional public distributed ledgers have not been able to scale out well and work efficiently. Sharding is deemed a promising way to solve this problem. By partitioning all nodes into small committees and letting them work in parallel, we can significantly lower the amount of communication and computation, reduce the overhead on each node's storage, and enhance the throughput of the distributed ledger. Existing sharding-based protocols still suffer from several serious drawbacks. First, all non-faulty nodes must connect well with each other, which demands a huge number of communication channels in the network. Moreover, previous protocols lose much of their efficiency when the honesty of each committee's leader is in question. Finally, no explicit incentive is provided for nodes to actively participate in the protocol.
We present CycLedger, a scalable and secure parallel protocol for distributed ledger via sharding. Our protocol selects a leader and a partial set for each committee, who are in charge of maintaining intra-shard consensus and communicating with other committees, to reduce the amortized complexity of communication, computation, and storage on all nodes. We introduce a novel semi-commitment scheme between committees and a recovery procedure to prevent the system from crashing even when committee leaders are malicious. To add incentive to the network, we use the concept of reputation, which measures each node's trustworthy computing power. As nodes with a higher reputation receive more rewards, nodes with strong computing ability are encouraged to work honestly to gain reputation. In this way, we strike out a new path to establish scalability, security, and incentive for the sharding-based distributed ledger.
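The reputation-based incentive can be illustrated with a toy reward split in which a committee's reward is divided in proportion to member reputations. The function below is a hypothetical sketch, not CycLedger's actual reward formula:

```python
# Toy reputation-weighted reward split: nodes with a higher reputation (their
# demonstrated honest computing power) receive proportionally more reward.
# This formula is invented for illustration; it is not CycLedger's scheme.
def split_reward(total_reward, reputations):
    total_rep = sum(reputations.values())
    return {node: total_reward * rep / total_rep
            for node, rep in reputations.items()}

committee = {"node-a": 5.0, "node-b": 3.0, "node-c": 2.0}
print(split_reward(100.0, committee))
# -> {'node-a': 50.0, 'node-b': 30.0, 'node-c': 20.0}
```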