Asynchronous many-task runtimes
Computer architectures are evolving at an astonishing pace. We’re no longer just adding a few more cores; we’re building systems with vast numbers of potentially heterogeneous processing units. This scale and complexity present a fundamental challenge: how do we write software that fully utilizes such intricate systems?
Conventional programming models that worked well on smaller-scale systems begin to show their limitations under massive parallelism. As systems scale, coordination overhead and the complexity of managing it become major bottlenecks. To fully leverage modern hardware, programming models and tools must evolve in step with the architectures they target.
Two long-standing approaches to managing parallelism
In the early days of parallel computing, many systems relied heavily on synchronous operations. All processes or threads had to coordinate tightly, which meant that if one part of the code lagged, the rest of the system had to wait. This rigid synchronization model became a major bottleneck, particularly on massively distributed computers.
To address the growing complexity of parallelism, two primary approaches emerged. One focused on shared-memory systems within a single node. Languages and libraries like Cilk (now OpenCilk) and Threading Building Blocks (TBB) introduced structured abstractions for expressing concurrency, helping developers implement efficient parallelism without managing every low-level detail.
The other approach targeted distributed memory systems, building on the Message Passing Interface (MPI). Over time, MPI evolved to support asynchronous communication, enabling better overlap between computation and data transfer. While this improved scalability, writing high-performance distributed applications with MPI still demands significant manual effort. Developers must explicitly manage data distribution, synchronization, and inter-node coordination—tasks that become increasingly challenging as system scale and complexity grow.
A newer approach
While these approaches to parallel computing have proven powerful, they begin to strain under the demands of extreme scale. Recognizing both their strengths and limitations, a new class of parallel runtime systems has been developed: asynchronous many-task (AMT) runtimes. Purpose-built to support concurrency at massive scales and maximize system utilization, AMT runtimes combine fine-grained parallelism with asynchronous execution, offering a robust platform for parallel programming at any scale—from laptops to supercomputers. Let’s take a closer look at the core principles that define AMTs.
Asynchrony
A defining feature of AMT runtimes is their emphasis on asynchronous execution. Unlike early parallel models, which often required tasks to move in lockstep, AMTs embrace a "don't wait" approach, allowing independent tasks to make progress without waiting for others to complete. When a task initiates a computation or data access that would otherwise leave it idle, it can continue with other work instead of stalling.
This is made possible through constructs like futures and continuations, which specify what should happen once the delayed operation completes. As a result, computation and communication can proceed in parallel. This is a particularly powerful feature in distributed systems, where latency is often a significant constraint. By enabling overlap between work and waiting, asynchrony improves resource utilization and allows developers to build applications that remain productive even in the presence of communication delays or I/O stalls.
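To make the idea concrete, here is a minimal sketch in standard C++: a slow operation is launched asynchronously, the caller keeps doing unrelated work, and synchronization happens only when the result is actually needed. The fetch_remote_value function is a made-up stand-in for a slow data access; AMT runtimes generalize this pattern with lightweight tasks and richer continuation support.

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Stand-in for a slow operation (remote read, file I/O, long computation).
int fetch_remote_value()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    return 42;
}

int main()
{
    // Start the slow operation; the future represents its eventual result.
    std::future<int> pending = std::async(std::launch::async, fetch_remote_value);

    // Keep making progress on independent work instead of stalling.
    long long local_work = 0;
    for (int i = 0; i < 1000000; ++i)
        local_work += i;

    // Block only when the remote value is actually needed.
    std::cout << "local: " << local_work << ", remote: " << pending.get() << "\n";
}
```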
Fine-grained parallelism
AMT runtimes are designed to support massive numbers of small, lightweight threads. By minimizing the overhead of task creation, scheduling, and context switching, lightweight threads make it practical to work with a significantly larger number of smaller tasks. This fine granularity improves both load balancing and system utilization, as it gives the runtime more flexibility to make dynamic scheduling decisions.
With many tasks available at any given moment, the system can keep processing units busy even when some tasks are waiting (because of data dependencies, communication delays, or other blocking operations). Ready tasks can be scheduled immediately, reducing idle time and sustaining high throughput. As a result, AMT runtimes enable programmers to express concurrency at much finer granularities and across much larger scales.
This model also simplifies the implementation of algorithms with dynamic or irregular task structures, which are common in domains such as scientific computing, graph analytics, and data-intensive workloads. In these contexts, where task size and duration can vary widely, the ability to schedule and distribute many small tasks adaptively is essential for scalable and efficient execution.
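As a rough illustration of this decomposition style, the sketch below recursively splits a sum into subtasks. Standard C++ std::async stands in here for the lightweight tasks an AMT runtime would provide; a real runtime would make much finer grain sizes practical because its per-task overhead is far lower.

```cpp
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Recursively split the range; below a threshold, compute serially.
long long parallel_sum(const std::vector<int>& data, std::size_t lo, std::size_t hi)
{
    if (hi - lo < (1u << 16))
        return std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);

    std::size_t mid = lo + (hi - lo) / 2;

    // Spawn the left half as an independent task and handle the right half
    // ourselves; any idle core can pick up ready work in the meantime.
    auto left = std::async(std::launch::async, parallel_sum,
                           std::cref(data), lo, mid);
    long long right = parallel_sum(data, mid, hi);

    return left.get() + right;
}

int main()
{
    std::vector<int> data(1 << 20, 1);
    std::cout << parallel_sum(data, 0, data.size()) << "\n";  // prints 1048576
}
```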
Global Address Space (GAS)
Many AMT runtimes incorporate a global address space (GAS) to simplify programming across distributed systems. GAS presents a unified logical view of memory, even though the underlying hardware is composed of physically separate memory spaces. With this abstraction, developers can reference and access data globally without needing to manage its physical location manually, significantly reducing the complexity of writing distributed applications.
A common refinement of this model is the partitioned global address space (PGAS), which maintains the unified view but logically associates portions of the address space with specific processing units. This makes the performance distinction between local and remote data explicit, enabling both developers and compilers to optimize for memory locality.
Some AMT runtimes go further with an active global address space (AGAS). AGAS adds dynamic capabilities, allowing the runtime to move data at execution time based on changing application behavior. This supports adaptive strategies like load balancing and communication minimization by relocating data closer to where it's needed.
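The following is a purely conceptual sketch, not any real runtime's API, of what such an abstraction looks like from the programmer's side: data is named by a global identifier, and the runtime decides whether an access is local or remote (and, with AGAS, whether to migrate the data). Every name in it (global_ref, owner_of, this_locality) is hypothetical.

```cpp
#include <cstdint>
#include <iostream>

using global_id = std::uint64_t;

int this_locality() { return 0; }              // the node this code runs on
int owner_of(global_id id) { return id % 4; }  // toy directory over 4 nodes

// A handle that names data globally; the runtime resolves where it lives.
template <typename T>
struct global_ref {
    global_id id;
    T value;   // stand-in for the object's actual storage

    T get() const
    {
        if (owner_of(id) == this_locality())
            return value;                      // fast local access
        // A real runtime would issue a (typically asynchronous) remote read
        // here; with AGAS it could instead migrate the object closer to the
        // task that keeps accessing it.
        return value;                          // sketch: pretend it arrived
    }
};

int main()
{
    global_ref<int> local_obj{0, 42};   // owner_of(0) == 0, so local
    global_ref<int> remote_obj{3, 7};   // owner_of(3) == 3, so "remote"
    std::cout << local_obj.get() + remote_obj.get() << "\n";  // prints 49
}
```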
A little bit of history
While the term asynchronous many-task (AMT) runtime is relatively recent, the ideas that define these systems have been evolving over several decades. Many of the core concepts were explored and refined through earlier runtimes that laid the groundwork for what AMTs have become today. Before wrapping up, let’s take a brief look at some influential systems, along with a few recommended reads for those interested in learning more.
Cilk (now OpenCilk)
Originally developed in the 1990s, Cilk was a pioneering system for shared-memory parallelism. One of its most influential contributions to AMT concepts was its efficient work-stealing scheduler. Using simple language constructs like cilk_spawn and cilk_sync, developers could express parallelism while the runtime handled dynamic task distribution across available processor cores. Although Cilk was designed for single-node systems, its focus on lightweight tasking and dynamic scheduling laid the groundwork for the fine-grained parallelism seen in AMT runtimes.
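For readers who haven't seen it, the classic recursive Fibonacci example shows how little syntax this required; it needs OpenCilk's compiler to build and is included only to illustrate the constructs mentioned above.

```c
#include <cilk/cilk.h>
#include <stdio.h>

int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n - 1);  // the spawned call may run on another worker
    int y = fib(n - 2);             // the caller keeps working in the meantime
    cilk_sync;                      // wait for all calls spawned in this function
    return x + y;
}

int main(void)
{
    printf("%d\n", fib(30));  // 832040
    return 0;
}
```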
Charm++
Designed for distributed memory systems, Charm++ introduced a message-driven execution model built around migratable objects called chares. It encourages overdecomposition, where programmers create more chares than there are physical processors, enabling the runtime to dynamically schedule and migrate them between nodes based on workload and communication patterns. This asynchronous, object-oriented approach addresses the challenges of large-scale distributed computing and reflects core AMT principles.
HPX
Taking a modern C++ approach, this runtime implements AMT concepts across both shared and distributed memory systems. It provides a standard library–like API that supports lightweight tasks, futures and continuations for expressing asynchrony, and an Active Global Address Space (AGAS) to abstract data access and migration. The design focuses on scalability and aims to support a broad range of hardware architectures, making it a useful reference point for studying how AMT principles can be applied in general-purpose runtime systems.
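A small sketch of that style is shown below, using hpx::async to create a lightweight task and a .then() continuation on the resulting hpx::future; the header layout follows recent HPX releases and may differ between versions.

```cpp
#include <hpx/hpx_main.hpp>   // wraps main() so it runs as an HPX thread
#include <hpx/future.hpp>
#include <iostream>

int main()
{
    // hpx::async schedules the lambda as a lightweight HPX task.
    hpx::future<int> f = hpx::async([] { return 6 * 7; });

    // .then() attaches a continuation that runs once the result is ready.
    hpx::future<void> done = f.then(
        [](hpx::future<int> r) { std::cout << r.get() << "\n"; });

    done.get();
    return 0;
}
```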
Legion
With its data-centric approach to asynchronous tasking, Legion introduces a model where programmers explicitly define how data is structured, partitioned, and accessed using constructs called logical regions. This shifts the focus away from task flow toward a clearer specification of data layout and usage. Based on this information, the runtime can make informed decisions about how to schedule tasks, place them in memory, and manage data movement.
If you’d like to explore these systems and their design philosophies in more depth, here are two excellent reads worth checking out:
"A Comparative Study of Asynchronous Many-Tasking Runtimes: Cilk, Charm++, ParalleX and AM++" by Abhishek Kulkarni and Andrew Lumsdaine (2019)
"A Taxonomy of Task-Based Parallel Programming Technologies for High-Performance Computing" by Thoman et al. (2018)