Adaptive Runtime Systems

Productive parallel programming with Charm++ and Charm4Py

Modern high-performance computing systems increasingly rely on heterogeneous architectures that combine traditional CPUs with GPU accelerators, creating new challenges for efficient parallel programming and communication. The Charm++ ecosystem addresses these challenges with multiple programming models designed to optimize performance across distributed-memory systems. This research explores several critical aspects of modern parallel programming: GPU-aware communication optimization, Python framework performance in distributed environments, and comparative analysis of asynchronous task-based runtime systems.

In our work on GPU-aware communication, we implemented efficient communication layers using the Unified Communication X (UCX) framework across multiple parallel programming models (Choi et al., 2021; Choi et al., 2022). This research addresses the critical challenge of moving GPU data efficiently in modern HPC systems. By developing optimized communication layers for Charm++, AMPI, and Charm4Py, we achieved substantial performance improvements: up to a 17.4x reduction in communication latency and a 10.5x increase in bandwidth. These improvements translate directly to real-world applications, demonstrated through a Jacobi iterative method proxy application whose communication performance improved by up to 19.7x.
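
To make the communication path concrete, below is a minimal sketch of a GPU ping-pong microbenchmark in Charm4Py. It assumes a Charm4Py build with the UCX machine layer; passing CuPy device arrays directly through Channel send/recv reflects the GPU-aware path described in the papers, and the exact API in released versions may differ. Message size and iteration count are illustrative.

```python
# GPU ping-pong latency sketch for Charm4Py (run on exactly 2 PEs).
# Assumption: a UCX-enabled, GPU-aware build that can send CuPy arrays
# through Channels without staging them on the host.
import time
import cupy as cp
from charm4py import charm, Chare, Group, Channel, Future, coro


class PingPong(Chare):
    @coro
    def run(self, n_bytes, n_iters, done):
        if self.thisIndex > 1:
            return  # only PEs 0 and 1 participate
        partner = self.thisProxy[1 - self.thisIndex]
        ch = Channel(self, remote=partner)
        buf = cp.zeros(n_bytes, dtype=cp.uint8)  # message lives on the GPU
        t0 = time.perf_counter()
        for _ in range(n_iters):
            if self.thisIndex == 0:
                ch.send(buf)       # device buffer, no explicit host copy
                buf = ch.recv()
            else:
                buf = ch.recv()
                ch.send(buf)
        if self.thisIndex == 0:
            done((time.perf_counter() - t0) / (2 * n_iters))


def main(args):
    done = Future()
    Group(PingPong).run(1 << 20, 100, done)  # 1 MiB messages, 100 round trips
    print(f'average one-way latency: {done.get() * 1e6:.1f} us')
    charm.exit()


charm.start(main)
```

Repeated once per halo face, this is essentially the exchange pattern performed by the Jacobi proxy application.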

Our research on Python parallel programming frameworks addresses the growing importance of Python in scientific computing and machine learning (Fink et al., 2021). Through comprehensive benchmarking and analysis of Charm4Py and mpi4py, we evaluated the relative performance characteristics of these frameworks in both CPU and GPU-accelerated environments. This work provides insight into the performance characteristics of Charm4Py and seeds ongoing optimization efforts, including fine-grained optimizations of the runtime system to reduce Python overhead, and coarse-grained optimizations that improve communication performance using copy-avoiding communication technologies such as RDMA.
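
For comparison, here is the equivalent ping-pong written with mpi4py, representative of the paired microbenchmarks used in this kind of evaluation; the message size and iteration count are again illustrative. Note that the buffer-based Send/Recv (rather than the lowercase, pickle-based send/recv) is itself one of the copy-avoiding choices that affect measured performance.

```python
# mpi4py host-memory ping-pong (run with: mpirun -n 2 python pingpong.py).
import time
import numpy as np
from mpi4py import MPI


def pingpong(n_bytes=1 << 20, n_iters=100):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = np.zeros(n_bytes, dtype=np.uint8)
    comm.Barrier()  # align both ranks before timing
    t0 = time.perf_counter()
    for _ in range(n_iters):
        if rank == 0:
            comm.Send(buf, dest=1)   # buffer protocol: no pickling
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    if rank == 0:
        elapsed = time.perf_counter() - t0
        print(f'average one-way latency: {elapsed / (2 * n_iters) * 1e6:.1f} us')


if __name__ == '__main__':
    pingpong()
```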

We also conducted extensive research on Asynchronous Many-Task (AMT) runtime systems, comparing modern approaches like Charm++ and HPX with traditional programming models such as MPI and OpenMP (Wu et al., 2022). Using the Task Bench benchmark suite, we quantified system overheads under various scenarios and analyzed scalability characteristics. This analysis illuminates the comparative advantages of different parallel programming approaches and their effectiveness in hiding communication latency, providing essential guidance for HPC application developers.
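
The flavor of this overhead measurement can be sketched in a few lines: serialize a long chain of empty tasks so that per-task runtime cost dominates, then divide total elapsed time by the task count. This is an illustrative reduction of what Task Bench automates across many task-graph patterns and scales, not Task Bench's actual configuration; the chare structure and task count below are hypothetical.

```python
# Per-task overhead sketch in Charm4Py: a chain of empty method
# invocations on a single chare, so elapsed time / task count
# approximates the runtime's scheduling and messaging overhead.
import time
from charm4py import charm, Chare, Future


class EmptyTasks(Chare):
    def step(self, remaining, t0, done):
        if remaining == 0:
            done(time.perf_counter() - t0)  # deliver elapsed time to main
        else:
            self.thisProxy.step(remaining - 1, t0, done)  # next empty task


def main(args):
    n_tasks = 100000
    done = Future()
    tasks = Chare(EmptyTasks, onPE=0)  # a single chare on PE 0
    tasks.step(n_tasks, time.perf_counter(), done)
    print(f'per-task overhead: {done.get() / n_tasks * 1e6:.2f} us')
    charm.exit()


charm.start(main)
```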

References

2022

  1. Accelerating communication for parallel programming models on GPU systems
    Jaemin Choi, Zane Fink, Sam White, and 3 more authors
    Parallel Computing, 2022
  2. Quantifying Overheads in Charm++ and HPX using Task Bench
    Nanmiao Wu, Ioannis Gonidelis, Simeng Liu, and 6 more authors
    In Euro-Par 2022: Parallel Processing Workshops, 2022

2021

  1. GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python
    Jaemin Choi, Zane Fink, Sam White, and 3 more authors
    In 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2021
  2. Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py
    Zane Fink, Simeng Liu, Jaemin Choi, and 2 more authors
    In 2021 IEEE/ACM Sixth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), 2021