2023-03-17 · 20 min read
Software Engineering

The Holy Grail of Garbage Collection: Dissecting Azul's C4 Algorithm


For decades, garbage collection (GC) in programming languages offered a frustrating trade-off: gain the safety and productivity of automatic memory management, but suffer unpredictable application pauses. These pauses – sometimes lasting hundreds of milliseconds or even seconds – rendered garbage-collected languages unsuitable for many latency-sensitive applications, from financial trading platforms to real-time control systems and responsive user interfaces.
Imagine playing a demanding online video game that randomly freezes for half a second at critical moments. This is akin to the experience applications faced with traditional garbage collection—jarring interruptions that disrupt execution flow and degrade user experience.
Enter Azul Systems' Continuously Concurrent Compacting Collector (C4) – a breakthrough that achieved what many in the field considered nearly impossible: truly pauseless garbage collection with memory compaction. This article explores how C4 works, why its development was a landmark achievement, and how its pioneering concepts continue to influence modern runtime systems.

Section 1: Understanding Memory Management

Before diving into the intricacies of garbage collection and C4, let's establish why memory management is a cornerstone of software development and the fundamental challenges it presents.

1.1. The Indispensable Resource: Why Memory Matters

Every computer program, from the simplest script to the most complex enterprise application, requires memory. This finite resource is used to:
  • Store the program's executable code.
  • Hold the data structures the program manipulates (variables, objects, arrays).
  • Manage the program's state and execution flow (e.g., call stacks).
  • Cache intermediate results for performance.
Effectively managing this resource is paramount. Without it, applications would quickly exhaust available memory, leading to crashes, severe performance degradation, or unpredictable behavior.

1.2. The Basic Dance: Allocation and Deallocation

At its core, memory management revolves around two fundamental operations:
  • Allocation: When a program needs a portion of memory (e.g., to create a new object or store data), it requests it from the system. The system finds a suitable free block of memory and grants access to it.
  • Deallocation (or Freeing): When the program no longer needs a previously allocated block of memory, it should be returned to the system so it can be reused for future allocations.
While these operations sound simple, coordinating them correctly and efficiently in large, dynamic applications is a complex task.

1.3. The Perils of Mismanagement

When memory isn't managed correctly, a host of problems can arise, often subtle at first but potentially devastating in the long run:
  • Memory Leaks: The program allocates memory but fails to deallocate it after it's no longer needed. Over time, available memory dwindles, leading to slowdowns and eventual crashes.

    (Illustration: heap usage climbs from 20% to 40%, 60%, and 80%, until the program runs out of memory.)

    c
    // Memory Leak Example in C
    #include <stdlib.h>

    void memory_leak_example(void) {
        while (1) {
            // Allocate 1KB of memory
            void *leaked_memory = malloc(1024);
            (void)leaked_memory;

            // Memory is allocated but never freed.
            // Each iteration loses 1KB of memory;
            // eventually the program will run out of memory.
        }
    }

    // Another common leak pattern
    void subtle_leak_example(void) {
        for (int i = 0; i < 1000; i++) {
            char *buffer = (char *)malloc(256);
            // Do some work with buffer...

            // Oops! Forgot to free(buffer);
            // This memory is now leaked.
        }
        // After the loop: 256KB of memory leaked!
    }
  • Dangling Pointers/References: The program attempts to use memory that has already been deallocated. This can lead to data corruption, crashes, or security vulnerabilities as the program might be reading or writing to memory now used by another part of the system or even another program.

    (Illustration: a pointer references a memory block; the block is freed; the pointer now points to nothing valid, and using it crashes.)

    c
    // Dangling Pointer Example in C
    #include <stdio.h>
    #include <stdlib.h>

    void dangling_pointer_example(void) {
        // Allocate memory for an integer
        int *ptr = (int *)malloc(sizeof(int));
        *ptr = 42;
        printf("Value: %d\n", *ptr); // Output: 42

        // Free the memory
        free(ptr);
        // ptr now points to deallocated memory (dangling pointer)

        // DANGER: Using the pointer after free
        *ptr = 10; // Undefined behavior!
        printf("Value: %d\n", *ptr); // May crash, corrupt data, or "work"

        // The memory might be:
        // - Reused by another allocation
        // - Filled with garbage values
        // - Protected, causing a segmentation fault
    }

    // Real-world scenario
    struct Node {
        int data;
        struct Node *next;
    };

    void use_after_free_bug(void) {
        struct Node *head = (struct Node *)malloc(sizeof(struct Node));
        head->data = 1;
        head->next = NULL;

        struct Node *current = head;
        free(head);

        // Bug: accessing freed memory
        if (current->data == 1) { // Undefined behavior!
            printf("This might work... or crash\n");
        }
    }
  • Double Frees: The program attempts to deallocate the same block of memory multiple times. This can corrupt the memory manager's internal data structures, leading to unpredictable behavior or crashes.

    (Illustration: a memory block is freed once, which is fine; freeing it a second time corrupts the allocator's state.)

    c
    // Double Free Example in C
    #include <stdlib.h>

    void double_free_example(void) {
        // Allocate memory
        int *ptr = (int *)malloc(sizeof(int));
        *ptr = 100;

        // First free - this is fine
        free(ptr);

        // Second free - UNDEFINED BEHAVIOR!
        free(ptr); // Double free error

        // This corrupts the memory allocator's internal structures.
        // It can lead to crashes, security vulnerabilities, or silent corruption.
    }

    // Common scenario leading to a double free
    void conditional_double_free(int condition1, int condition2) {
        char *buffer = (char *)malloc(1024);

        if (condition1) {
            // Some processing...
            free(buffer);
        }

        // ... more code ...

        if (condition2) {
            // Developer forgot buffer might already be freed
            free(buffer); // Double free if both conditions were true!
        }
    }

    // Better practice: set the pointer to NULL after free
    void safe_free_pattern(void) {
        int *ptr = (int *)malloc(sizeof(int));

        free(ptr);
        ptr = NULL; // Prevent use after free

        free(ptr); // This is now safe (free(NULL) is a no-op)
    }
  • Fragmentation: Over time, repeated allocation and deallocation can leave memory dotted with many small, unusable free blocks between allocated blocks, even if the total free memory is substantial. This makes it hard to allocate larger contiguous blocks.

    (Illustration: allocated blocks scattered through memory leave only small free gaps, so a large contiguous allocation fails even though the total free memory would suffice.)

These issues highlight the critical need for robust memory management strategies.

1.4. The Great Divide: Manual vs. Automatic Memory Management

Historically, two primary approaches to memory management have emerged:

Manual Memory Management

In languages like C and C++, the programmer is directly responsible for both allocating and deallocating memory.
c
// Example of manual memory management in C
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Allocate memory for an array of 5 integers
    int *arr = (int *)malloc(5 * sizeof(int));

    if (arr == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    // Use the allocated memory
    for (int i = 0; i < 5; i++) {
        arr[i] = i * 10;
    }

    printf("Array elements: ");
    for (int i = 0; i < 5; i++) {
        printf("%d ", arr[i]);
    }
    printf("\n");

    // Deallocate the memory when no longer needed
    free(arr);
    arr = NULL; // Good practice to avoid a dangling pointer

    return 0;
}
  • Pros: Offers maximum control over memory usage, enabling fine-tuned performance optimizations and deterministic behavior.
  • Cons: Highly error-prone (leaks, dangling pointers, etc.). Adds considerable cognitive overhead for developers.

Automatic Memory Management (Garbage Collection)

Languages like Java, C#, Python, and JavaScript employ garbage collection, where the runtime automatically reclaims unused memory.
java
// Example of automatic memory management in Java
import java.util.ArrayList;
import java.util.List;

public class AutoMemoryExample {
    public static void main(String[] args) {
        // Create a list (memory is allocated automatically)
        List<String> myList = new ArrayList<>();
        myList.add("Hello");
        myList.add("World");

        System.out.println(myList);

        // No explicit deallocation needed.
        // When myList goes out of scope or is set to null,
        // and no other references to the ArrayList object exist,
        // the Garbage Collector will eventually reclaim its memory.
    }
}
  • Pros: Reduces bugs like memory leaks, increases developer productivity.
  • Cons: Can introduce application pauses, CPU overhead, and potentially larger memory footprint. C4 specifically targets these pauses.
This trade-off—developer productivity and safety versus predictable low latency—has long been a defining factor. C4 aimed to shatter this dichotomy.

Section 2: The Road to Modern GC: An Evolution of Ideas

Garbage collection algorithm design is a complex balancing act, often referred to as the GC Trilemma. Engineers strive to optimize three competing goals:
  1. Low Latency: Minimizing or eliminating application pauses caused by GC. Short, predictable pauses are crucial for responsive applications.
  2. High Throughput: Maximizing the amount of useful work the application can perform, meaning minimizing the CPU time spent on GC overhead.
  3. Memory Footprint: Using memory efficiently, both for the application's objects and for the GC's own internal data structures and operations (e.g., space needed for copying collectors).
Historically, improvements in one of these areas often came at the expense of another. For example, a collector designed for very high throughput might achieve this by performing less frequent but longer "stop-the-world" pauses, thus sacrificing low latency. Understanding this trilemma is key to appreciating the challenges faced by GC designers and the significance of algorithms like C4 that attempt to push the boundaries on all fronts, especially latency.

The Evolution of Memory Management

Each approach tries to improve on the limitations of previous ones. C4 represents the culmination of decades of research.

  • 1970s: Manual Memory Management
  • 1960s-1980s: Reference Counting
  • 1960s-1990s: Stop-the-World GC
  • 1980s-2000s: Generational GC
  • 1990s-2010s: Concurrent Mark-Sweep
  • 2010s-Present: Pauseless GC (C4/ZGC/Shenandoah)

How the approaches compare on each dimension (the article's relative scores; higher is better). Latency asks how well an approach avoids pausing your application; throughput, how much CPU time remains for application work; memory efficiency, how little memory is wasted; and memory safety, how well memory errors are prevented.

| Approach | Latency | Throughput | Memory Efficiency | Memory Safety |
|---|---|---|---|---|
| Manual | 100%: sub-millisecond pauses | 95%: under 5% overhead | 95%: minimal waste | 20%: prone to leaks & crashes |
| Reference Counting | 85%: low-millisecond pauses | 70%: 20-30% overhead | 70%: some fragmentation | 60%: some protection |
| Stop-the-World | 15%: multi-second freezes | 85%: 5-20% overhead | 60%: some fragmentation | 90%: prevents most errors |
| Generational | 40%: significant pauses | 80%: 5-20% overhead | 75%: low fragmentation | 90%: prevents most errors |
| Concurrent Mark-Sweep | 65%: noticeable pauses | 65%: 30-40% overhead | 55%: significant fragmentation | 90%: prevents most errors |
| Pauseless (C4/ZGC/Shenandoah) | 95%: sub-millisecond pauses | 80%: 5-20% overhead | 75%: low fragmentation | 95%: prevents most errors |

Notice how C4 and modern pauseless collectors achieve excellent scores across all dimensions - they don't just make trade-offs, they push the boundaries of what's possible.

Garbage collection itself has evolved significantly. Understanding this evolution provides context for C4's innovations.

2.1. Early Attempts: Reference Counting - The Intuitive Approach

As programmers grappled with the complexities and pitfalls of manual memory management (like memory leaks and dangling pointers, discussed in Section 1), the desire for a more automated, less error-prone system grew. One of the earliest and most intuitive strategies to emerge was Reference Counting.
The fundamental thought process was: "If we can keep track of how many parts of our program are actively using a piece of memory (an object), then when that number drops to zero, nobody is using it anymore, and it's safe to reclaim." This seemed like a direct and logical way to automate deallocation.
  • The "How":
    • The Count: The system associates a small integer, the "reference count," with every object allocated in memory.
    • Incrementing: When a new reference to an object is created (e.g., a variable x is assigned to point to object O, or O is added to a list), object O's reference count is incremented.
    • Decrementing: When an existing reference to an object is destroyed (e.g., variable x is reassigned to point to something else or goes out of scope, or O is removed from a list), object O's reference count is decremented.
    • Reclamation: If, after a decrement, an object's reference count reaches zero, it means no active references to it exist anywhere in the program. The object is now considered garbage, and its memory can be immediately reclaimed.
    • Cascading Effect: Importantly, when an object is reclaimed, any other objects it might have been referencing will, in turn, have their reference counts decremented. This can lead to a cascade, where freeing one object triggers the reclamation of several others.
    Reference Counting Lifecycle: an object is allocated (its creator holds the first reference, RC = 1), gains references as it is shared (RC++), loses them as references are removed (RC--), and is deallocated the moment its count reaches zero. The failure mode is the circular reference: if A.next = B and B.next = A, and all external references to A and B are then removed, A.RC and B.RC each remain at 1 because of the cycle itself. Both objects are leaked despite being unreachable from the GC roots.
    (Diagram: GC roots such as stack variables x and y reach a live object C with RefCount 1, while objects A and B, each with RefCount 1, reference each other in an unreachable cycle after the external reference to them was removed: a memory leak.)

  • The Appeal (Pros):
    • Simplicity (Conceptual): The core idea is relatively easy to grasp. It directly models the notion of "usefulness" through reference counts.
    • Incremental Nature & Immediate Reclamation: Garbage is collected in small, distributed steps as soon as an object becomes unreferenced. This means memory is returned to the system promptly, and there are typically no long, unpredictable pauses associated with a separate GC phase. This was very attractive for interactive systems or those with tight memory constraints, as it promised smoother performance.
    • Deterministic Destruction (Potentially): Because objects are reclaimed as soon as their count hits zero, their destructors (if any) are run predictably. This can be useful for managing non-memory resources (files, network connections) tied to an object's lifetime.
  • The Problems Uncovered (Cons & Challenges):
    • Overhead of Count Updates: This turned out to be a significant performance bottleneck. Every operation that creates, copies, or destroys a reference (assignments, parameter passing, returning from functions, variables going out of scope) requires one or more atomic updates to reference counts. In programs with frequent pointer manipulations, this constant overhead can noticeably slow down the application. Making these updates thread-safe in concurrent environments adds further complexity and cost.
    • The Achilles' Heel - Circular References: Simple reference counting cannot reclaim objects involved in a reference cycle. If Object A points to Object B, and Object B points back to Object A, their reference counts will each be at least 1 (due to the mutual reference). If all external references to this A-B pair are removed, they become unreachable garbage, but their counts remain positive, so they are never reclaimed. This leads to memory leaks, undermining one of the primary goals of automatic memory management.
    • Space Overhead: Storing a reference count for every single object in the system consumes extra memory. While usually small per object, this can add up in systems with many small objects.
    • Cascading Deallocations & Potential Pauses: While reclamation is generally incremental, if freeing a single object (e.g., the head of a large data structure) causes a long chain of other objects to also have their counts drop to zero, this cascade of deallocations and destructor calls can still lead to a noticeable, albeit usually shorter, pause.
Despite its elegance in theory, the practical drawbacks of reference counting, especially the cycle problem and update overhead, became apparent. While some systems adopted it (e.g., early Smalltalk, Perl, and it's still a core part of Python's CPython implementation, which supplements it with a separate cycle detector), the quest for more robust and efficient GC solutions continued. The limitations of reference counting directly motivated the exploration of entirely different approaches, leading to the development of tracing collectors.
python
# Conceptual example of a reference cycle in Python
# CPython uses reference counting primarily, with a separate cycle detector.

class Node:
    def __init__(self, name):
        self.name = name
        self.next = None
        # In a real scenario, reference counts are managed internally.
        # For demonstration, imagine a count is associated with self.

    def __del__(self):
        # This destructor might not be called for cycled objects
        # if only basic RC is used without a cycle detector.
        # print(f"Node {self.name} is being destroyed.")
        pass


# Create two nodes
a = Node("A")  # Node "A" object is created. 'a' references it.
b = Node("B")  # Node "B" object is created. 'b' references it.

# Create a cycle
a.next = b  # Node "B" is now also referenced by Node "A".
b.next = a  # Node "A" is now also referenced by Node "B".

# Remove external references
a = None  # The variable 'a' no longer references Node "A";
          # Node "A" is still referenced by Node "B".next.
b = None  # The variable 'b' no longer references Node "B";
          # Node "B" is still referenced by Node "A".next.

# At this point, Node "A" and Node "B" form an unreachable cycle.
# Each object has a reference count of 1 (from the other object in the cycle).
# A simple reference counter would not reclaim them because their counts never
# reach zero. CPython's additional cycle detector finds and collects such cycles.
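For contrast with the cycle example above, the counting mechanics themselves fit in a few lines. Below is a minimal, hypothetical Java sketch of hand-rolled reference counting; the JVM itself uses tracing GC, and RefCounted, retain, and release are invented names that only model the increment/decrement/cascade bookkeeping described earlier:
java
// Hypothetical hand-rolled reference counting, for illustration only.
// (The JVM traces; this models the bookkeeping, not real memory management.)
import java.util.ArrayList;
import java.util.List;

class RefCounted {
    private int refCount = 1;                  // creator holds the first reference
    private final List<RefCounted> children = new ArrayList<>();
    private final String name;

    RefCounted(String name) { this.name = name; }

    void retain() { refCount++; }              // a new reference was created: RC++

    void release() {                           // a reference was destroyed: RC--
        if (--refCount == 0) {
            System.out.println(name + " reclaimed");
            // Cascading effect: reclaiming this object releases everything it references.
            for (RefCounted child : children) child.release();
            children.clear();
        }
    }

    void addChild(RefCounted child) {          // storing a reference retains the target
        child.retain();
        children.add(child);
    }
}

public class RefCountDemo {
    public static void main(String[] args) {
        RefCounted list = new RefCounted("list"); // RC(list) = 1
        RefCounted item = new RefCounted("item"); // RC(item) = 1
        list.addChild(item);                      // RC(item) = 2
        item.release();                           // local reference gone: RC(item) = 1
        list.release();                           // RC(list) = 0 -> reclaimed;
                                                  // cascade: RC(item) = 0 -> reclaimed
    }
}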

2.2. The Rise of Tracing Collectors - A Different Perspective

The inherent challenges of reference counting, particularly its inability to handle circular references without complex add-ons and the performance overhead of constant count updates, spurred computer scientists to seek alternative approaches. This led to the development of tracing collectors. Instead of objects individually tracking their own liveness, tracing collectors take a global perspective: they start from a set of known "live" references (the "GC roots" – like local variables on the stack, global variables, CPU registers) and traverse all reachable objects from these roots. Any object not reached by this traversal is considered garbage. This fundamental shift in perspective elegantly solved the cycle problem, as unreachable cycles would simply not be visited.

Mark and Sweep - The Foundational Trace

The first widely recognized tracing algorithm was Mark and Sweep. The insight was simple yet powerful: if we can identify all live objects, then everything else must be garbage.
  • The "How":
    1. Mark Phase: The collector starts with the GC roots. It "marks" every object directly referenced by a root as live. Then, it recursively visits all objects referenced by these newly marked objects, marking them live as well, and so on, until all objects reachable from the roots are marked. Think of it like exploring a maze: you start at the entrance (roots) and mark every path and room you can reach.
    2. Sweep Phase: Once the marking is complete, the collector scans the entire heap. Any object that was not marked during the Mark phase is considered unreachable garbage and its memory is reclaimed. The marks on the live objects are then typically cleared in preparation for the next GC cycle.
Mark and Sweep algorithms often use a "Tricolor Abstraction" to keep track of objects during the marking phase:
  • White: Objects initially considered garbage or not yet visited.
  • Gray: Objects that are known to be live but whose children (referenced objects) have not yet been fully scanned. These are on the GC's "to-do" list.
  • Black: Objects that are known to be live, and all of their children have been scanned. The GC is "done" with these for the current marking phase.
The goal of the marking phase is to turn all live objects black, leaving only garbage objects white.
How concurrent tricolor marking proceeds:
  • The GC starts from the root objects and marks them gray (to be processed).
  • It processes each gray object by marking its children gray and the object itself black.
  • The process continues until no gray objects remain.
  • Any objects still white at the end are unreachable garbage.
  • Write barriers detect when black objects gain new references to white objects.
  • The Realization (Pros & Cons):
    • Pros:
      • Handles Cycles Naturally: Because tracing starts from roots and only follows live references, circular structures that are not connected to any root are never visited and thus correctly identified as garbage. This was a huge step forward from basic reference counting.
    • Cons (The Trade-offs Discovered):
      • Stop-the-World (STW) Pauses: The most significant drawback was that the application had to be completely paused ("stopped the world") during both the mark and sweep phases. If the application continued to run and change object references while the GC was trying to mark, the GC could get an inconsistent view of live objects, leading to errors (e.g., freeing a live object). These pauses could be lengthy, especially for large heaps, making applications unresponsive. This was a major sacrifice in latency.
      • Memory Fragmentation: After several mark-and-sweep cycles, the heap could become fragmented. Live objects would be scattered throughout memory, interspersed with free blocks of varying sizes. Even if there was enough total free memory, it might be difficult to find a contiguous block large enough for a new, large object allocation. This impacted memory footprint efficiency and could slow down future allocations.
Mark and Sweep laid the groundwork for tracing GC, but its STW pauses and fragmentation issues highlighted the need for further refinement. It addressed the GC trilemma by prioritizing correct cycle handling over consistent low latency and optimal memory layout.
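The mark phase described above is compact enough to sketch directly. Here is a minimal, single-threaded illustration of tricolor marking over a toy object graph; GcObject, Color, and the worklist are invented for the sketch, and a real collector operates on raw heap words rather than Java objects:
java
// Minimal single-threaded tricolor marking over a toy object graph.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class GcObject {
    enum Color { WHITE, GRAY, BLACK }
    Color color = Color.WHITE;                     // everything starts white (unvisited)
    final List<GcObject> references = new ArrayList<>();
}

class Marker {
    // Mark phase: turn every object reachable from the roots black.
    static void mark(List<GcObject> roots) {
        Deque<GcObject> worklist = new ArrayDeque<>(); // the set of gray objects
        for (GcObject root : roots) {
            root.color = GcObject.Color.GRAY;
            worklist.push(root);
        }
        while (!worklist.isEmpty()) {
            GcObject obj = worklist.pop();
            for (GcObject child : obj.references) {
                if (child.color == GcObject.Color.WHITE) {
                    child.color = GcObject.Color.GRAY; // discovered, needs scanning
                    worklist.push(child);
                }
            }
            obj.color = GcObject.Color.BLACK;      // all children scanned: done
        }
        // Sweep phase (not shown): every object still WHITE is garbage.
    }
}
In a stop-the-world collector this loop runs while the application is frozen; making it run concurrently, as C4 does, is precisely what necessitates the write barriers discussed in Section 3.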

Mark-Compact - Tackling Fragmentation

The fragmentation problem caused by Mark and Sweep was a practical concern. While it correctly reclaimed memory, the resulting "swiss cheese" memory layout could hinder performance. The next logical step was to not just reclaim free space, but to organize the remaining live objects more efficiently. This led to Mark-Compact collectors.
  • The "How": Mark-Compact builds upon Mark and Sweep.
    1. Mark Phase: Identical to Mark and Sweep – all live objects are identified and marked.
    2. Compact Phase: This is the key addition. Instead of just sweeping away unmarked objects, the collector now moves all the marked (live) objects to one end of the heap (or a specific memory region). This consolidates all live objects into a contiguous block, leaving a single, large free area at the other end. This phase also requires updating all references to the moved objects (pointers from roots and from other live objects) to point to their new locations.
    Mark-Compact Memory Transformation: before compaction, the heap is a mix of live objects (A: 32KB, B: 64KB, C: 16KB, D: 48KB; 160KB total), garbage objects (X, Y, Z, W; 96KB total), and scattered free space (40KB total, largest contiguous block only 32KB). After compaction, the 160KB of live data sits contiguously at one end, leaving a single large free block.
    Key benefits of Mark-Compact:
    • Eliminates memory fragmentation by moving live objects together.
    • Creates a single large free space for efficient allocation.
    • Improves cache locality by keeping related objects close.
    • Trade-off: longer GC pauses due to object relocation.
  • The Realization (Pros & Cons):
    • Pros:
      • Solves Fragmentation: By compacting live objects, it eliminates fragmentation, resulting in a large, contiguous free space. This makes subsequent allocations very fast (often just a pointer bump in the free area).
      • Improved Data Locality (Potentially): Moving related objects closer together could improve cache performance, though this wasn't always guaranteed or the primary goal.
    • Cons (The Trade-offs Worsened):
      • Even Longer STW Pauses: The compaction phase itself is complex and time-consuming. Copying objects and, crucially, updating all references to them (which could be scattered throughout the heap and in program stacks/registers) added significant time to the STW pause. While it improved memory footprint efficiency and subsequent allocation throughput, it often made the worst-case latency even worse than simple Mark-Sweep. This was a clear trilemma trade-off: better memory utilization and allocation speed at the cost of longer application freezes.
Mark-Compact was a crucial step in managing memory efficiently, but the increased pause times underscored the pressing need for techniques that could reduce or eliminate these disruptive STW events.
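The compaction step can be sketched in the same toy style. This illustrative sliding compactor makes three passes: compute forwarding addresses, rewrite pointer fields, then move the objects. HeapObject and the address-indexed model are assumptions of the sketch; a real compactor also patches roots in stacks and registers:
java
// Toy sliding compaction: live objects are slid to the front of a simulated heap.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeapObject {
    boolean marked;                // set by the preceding mark phase
    int address;                   // current offset in the simulated heap
    int size;                      // size in toy bytes
    int[] refFields = new int[0];  // toy "pointers": addresses of referenced objects
    int forwardingAddress;
}

class Compactor {
    // 'heap' lists objects in address order; returns the new start of free space.
    static int compact(List<HeapObject> heap) {
        // Pass 1: compute forwarding addresses by sliding live objects left.
        int free = 0;
        Map<Integer, Integer> forwarding = new HashMap<>(); // old addr -> new addr
        for (HeapObject obj : heap) {
            if (obj.marked) {
                obj.forwardingAddress = free;
                forwarding.put(obj.address, free);
                free += obj.size;
            }
        }
        // Pass 2: rewrite every pointer field to the target's new address.
        // (After a correct mark phase, live objects reference only live objects.
        //  A real collector must also patch roots: stacks, registers, globals.)
        for (HeapObject obj : heap) {
            if (!obj.marked) continue;
            for (int i = 0; i < obj.refFields.length; i++) {
                obj.refFields[i] = forwarding.get(obj.refFields[i]);
            }
        }
        // Pass 3: drop the garbage and slide the live objects;
        // everything past 'free' is now one contiguous free block.
        heap.removeIf(obj -> !obj.marked);
        for (HeapObject obj : heap) {
            obj.address = obj.forwardingAddress;
        }
        return free;
    }
}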

Generational Collection - The Insight of Object Lifecycles

As developers built larger and more complex applications, a key observation about object behavior emerged, known as the Generational Hypothesis:
  1. Most objects die young. (Many objects are allocated, used briefly, and then become garbage quickly – e.g., temporary variables, short-lived data structures).
  2. Few objects live for a long time. (Objects that survive many GC cycles are likely to survive for much longer – e.g., application configuration, caches, core data structures).
This insight led to Generational Collection, a strategy to optimize GC by focusing effort where it's most likely to yield results.
  • The "How": The heap is divided into two or more "generations":
    • Young Generation (or Nursery): This is where all new objects are initially allocated. It's typically small. Because most objects die young, the Young Generation tends to fill up quickly with a high proportion of garbage. Collections here are frequent and fast (often called "minor GCs"). Live objects that survive a few minor GCs are "promoted" (moved) to the Old Generation.
    • Old Generation (or Tenured Space): This area holds longer-lived objects. It's typically much larger than the Young Generation. Collections here are less frequent but more time-consuming (often called "major GCs" or "full GCs" if they involve the entire heap). These collections often use algorithms like Mark-Sweep or Mark-Compact. A crucial part of generational collection is managing "inter-generational pointers" – references from objects in the Old Generation to objects in the Young Generation. These must be tracked (often using "card tables" and "write barriers") so that a Young Generation collection doesn't mistakenly reclaim young objects still referenced by old objects. A sketch of the card-table mechanism appears at the end of this subsection.

    (Diagram: the heap is split into a Young Generation, collected by frequent fast GCs, and an Old Generation, collected by infrequent slower GCs. New objects are allocated in Eden; survivors move between Survivor spaces S0 and S1, and objects that survive long enough are promoted to the Old Generation.)

  • The Realization (Pros & Cons):
    • Pros:
      • Improved Throughput: By frequently collecting the Young Generation where garbage is plentiful, the GC can reclaim a lot of memory quickly and efficiently. This reduces the overall CPU time spent on GC compared to non-generational collectors that always scan the entire heap.
      • Reduced Average Pause Times: Minor GCs in the Young Generation are typically very short because they only deal with a small portion of the heap. This significantly improves the average application responsiveness.
    • Cons (The Persistent Problem):
      • Major STW Pauses Still an Issue: While average pauses improved, the major GCs on the Old Generation (especially if they involved compaction) could still cause long, disruptive STW pauses. For applications requiring consistent low latency, these occasional major pauses remained a significant problem. Generational collection improved two points of the trilemma (average latency and throughput) but didn't fully solve the worst-case latency for large, long-lived heaps.
Generational collection became a very popular and effective strategy, adopted by many mainstream language runtimes (like Java's HotSpot JVM). It represented a significant step forward in balancing GC efficiency with application performance. However, the "pause problem," especially for large, mature heaps, was not fully conquered, setting the stage for the next wave of innovations focused on concurrency.
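The card-table-and-write-barrier bookkeeping mentioned above is small enough to show concretely. Below is a simplified, illustrative sketch; the 512-byte card size and the CardTable/writeBarrier names are assumptions for the example, not any particular JVM's implementation:
java
// Illustrative card-table write barrier for a generational collector.
class CardTable {
    static final int CARD_SIZE = 512;              // toy bytes covered per card
    final byte[] cards;                            // 1 = dirty, 0 = clean

    CardTable(long heapBytes) {
        cards = new byte[(int) (heapBytes / CARD_SIZE)];
    }

    // Write barrier: conceptually invoked by JIT-injected code on every pointer
    // store `obj.field = value;`, where 'objAddress' is obj's heap address.
    void writeBarrier(long objAddress) {
        cards[(int) (objAddress / CARD_SIZE)] = 1; // dirty the card containing obj
    }

    // Minor GC: instead of scanning the whole old generation for pointers into
    // the young generation, scan only the objects on dirty cards.
    void scanDirtyCards() {
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] == 1) {
                long cardStart = (long) i * CARD_SIZE;
                // ...scan objects in [cardStart, cardStart + CARD_SIZE) for
                // old-to-young references and treat them as extra GC roots...
                cards[i] = 0;                      // clean the card afterwards
            }
        }
    }
}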

2.3. The Lingering Challenge: The Pause Problem

Despite advancements, long, unpredictable STW pauses persisted, especially during full heap compaction. This set the stage for C4.

2.4. The Unyielding Wall: Latency Crisis in Large-Scale Java

As Java applications grew in complexity and data intensity, heap sizes began to soar, often into the tens or even hundreds of gigabytes. For traditional garbage collectors, pause times—especially for full GCs involving compaction—often scale with heap size. A pause that might be tolerable on a 1GB heap could become a multi-second or even minute-long application freeze on a 100GB heap.
This created a crisis for a growing class of applications:
  • Financial Trading Systems: A pause of a few hundred milliseconds could mean millions in lost opportunities or arbitrage.
  • Real-time Ad Bidding: Sub-50ms response times are often required; GC pauses were a primary violator.
  • Large E-commerce Platforms: During peak load, consistent low latency is crucial for user experience and conversion rates. A GC pause could lead to cascading failures.
  • Interactive Big Data & Analytics: Users expect responsive queries even on massive datasets held in memory.
The critical metric for these systems wasn't just average response time, but tail latencies—the performance at the 99th, 99.9th, or even 99.99th percentile. A system might have excellent average latency, but if one in a thousand requests (99.9th percentile) experiences a 2-second GC pause, it's unacceptable for high-frequency operations. This "tyranny of the tail" made Java, despite its productivity benefits, a risky choice for these ultra-low-latency domains, often forcing developers to revert to C/C++ with its manual memory management complexities. The demand for a Java Virtual Machine that could handle massive heaps without these crippling pauses was immense. This was the unyielding wall that C4 aimed to break through.

Section 3: C4 Unveiled: Engineering Pauselessness

C4, standing for Continuously Concurrent Compacting Collector, was engineered by Azul Systems to directly address the challenge of STW pauses in large-heap, latency-sensitive Java applications. Its design philosophy and mechanisms represented a paradigm shift in how GC could operate.

3.1. The C4 Philosophy: Continuous, Concurrent, Compacting (and Collector)

  • Continuously: collection runs as an ongoing background activity, not in discrete, disruptive cycles.
  • Concurrent: GC work proceeds alongside application threads rather than instead of them.
  • Compacting: the heap is continuously defragmented, and even compaction is performed concurrently.
  • Collector: a complete, self-contained collection system.
C4 Architecture and Key Components:
  • Application Layer: application threads (1 through N) whose reads, writes, and reference loads all pass through the barrier layer.
  • Barrier Layer:
    • Read Barriers: object-access interception, forwarding-pointer resolution, self-healing optimization.
    • Write Barriers: reference-mutation tracking, SATB maintenance, card marking.
    • Load Reference Barriers: pointer-load interception, ensuring fresh references, preventing stale-pointer propagation.
  • C4 Collector Core (all phases concurrent, no STW):
    • Concurrent Marking: SATB marking, tricolor abstraction, work stealing.
    • Concurrent Relocation: region selection, object copying, forwarding-pointer installation.
    • Concurrent Remapping: reference updating, pointer healing, memory reclamation.
  • Support Structures:
    • Region Management: heap partitioning, free-list maintenance, fragmentation tracking.
    • Remembered Sets: inter-region references, write-barrier updates, efficient scanning.
    • Work Distribution: mark stacks/queues, relocation sets, load balancing.
  • Cooperative Scheduling:
    • Safe Points: method entries/exits, loop back-edges, GC yield checks.
    • Incremental Work Units: microsecond-scale tasks, preemptible chunks, progress guarantees.

3.2. Concurrent Marking: Finding Garbage on the Fly

The first challenge in any garbage collection cycle is to accurately identify all live objects. Doing this concurrently—while application threads (mutators) are actively running and potentially changing the object graph (creating new objects, modifying references)—is particularly complex. This is often called the "mutator problem" or "the moving target problem." If the GC scans an object and marks its children as reachable, and then the application modifies that object to point to a new, previously unseen object, the GC might miss this new live object if not careful.
C4, like other advanced concurrent collectors, employs sophisticated strategies to tackle this. Common approaches include:
  • Snapshot-at-the-Beginning (SATB): The GC logically considers the object graph as it existed at the very start of the marking phase. All objects allocated after this snapshot are automatically considered live for the current cycle. The key challenge is to detect when an application thread writes a pointer from an already-marked (black) object to a not-yet-marked (white) object that was part of the initial snapshot. Such writes could "hide" a live object from the GC. Write barriers are crucial here to record these specific writes or to ensure the newly referenced white object is also marked.
  • Incremental Update: In this approach, the GC attempts to track changes to the object graph more dynamically. If an application thread creates a new reference from an already processed (black) object to an unprocessed (white) object, the write barrier ensures that either the white object is immediately marked and added to the GC's worklist (turned gray), or the black object is re-queued for scanning its references again.
While specific C4 implementation details are proprietary, the core idea is that the marking process runs alongside the application, using a write barrier to maintain consistency.
A write barrier is a small piece of code inserted by the Just-In-Time (JIT) compiler that executes just before (or sometimes after) any pointer write in the application code (e.g., obj.field = newReference;). Its purpose is to inform the GC about this modification so the GC can take appropriate action. In the context of concurrent marking with tricolor abstraction, write barriers prevent the loss of live objects by ensuring that when a black object (already processed) gains a reference to a white object (not yet visited), the white object is marked gray or the black object is reconsidered.
java
// Conceptual Write Barrier (Java-like pseudocode)
// Executed by the application thread when `obj.field = value;` happens
void onPointerWrite(Object obj, Field field, Object newValue, Object oldValue) {
    // If GC is in a concurrent marking phase...
    if (GC.isConcurrentMarkingActive()) {
        // ...and the object being modified ('obj') has already been scanned (is black)...
        if (GC.isObjectBlack(obj)) {
            // ...and the new reference ('newValue') points to an object
            // not yet seen or marked by the GC (is white)...
            if (newValue != null && GC.isObjectWhite(newValue)) {
                // Then the GC must be made aware of this new potential path to a
                // live object. The action depends on the concurrent marking strategy:

                // Option 1 (Incremental Update style): mark 'newValue' as gray
                // (needs processing).
                // GC.markGray(newValue);

                // Option 2 (SATB style): if 'newValue' was part of the initial
                // snapshot and this write violates the snapshot's integrity,
                // ensure 'newValue' is processed (gray it or re-scan 'obj').
                // GC.ensureVisibleInSnapshot(newValue);

                // Option 3: log the write for later processing by GC threads,
                // e.g., mark the "card" or "region" containing 'obj' as dirty.
                // GC.logDirtyCard(obj);
            }
        }
    }
    // Finally, perform the actual pointer write
    field.set(newValue); // This is the application's intended operation
}
To efficiently manage the information from write barriers, GCs often use auxiliary data structures like:
  • Card Tables: The heap is divided into small, fixed-size regions called "cards" (e.g., 128-512 bytes). When a write barrier detects a potentially interesting pointer write into an object on a card, that card is "dirtied." GC threads then only need to scan objects on dirty cards for new inter-object references, rather than the entire heap.
  • Remembered Sets: These data structures (often per-region or per-generation) record pointers that cross boundaries—for example, from the old generation to the young generation in a generational collector, or from an already-scanned region to an unscanned region in a regionalized collector. Write barriers update these sets.
By using these techniques, C4 can build an accurate graph of live objects without lengthy "stop-the-world" pauses, allowing application threads to continue making progress.
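To make the remembered-set idea concrete, here is a simplified, illustrative model in Java. Region, Slot, and afterPointerStore are invented names, and real collectors store compressed card or offset data rather than sets of objects:
java
// Simplified, illustrative remembered set for a regionalized heap.
import java.util.HashSet;
import java.util.Set;

class Region {
    // Locations (object fields) outside this region that may point into it.
    final Set<Slot> rememberedSet = new HashSet<>();
}

class Slot {                      // one pointer-sized field in some object
    Region owningRegion;          // region containing the field itself
    Object referent;              // what the field currently points to
    Region referentRegion;        // region containing the referent
}

class WriteBarrier {
    // Called after every pointer store `slot.referent = value;`.
    // If the store created a cross-region pointer, remember the slot so that a
    // collection of the target region can find and update it without scanning
    // the whole heap.
    static void afterPointerStore(Slot slot) {
        if (slot.referentRegion != null && slot.referentRegion != slot.owningRegion) {
            slot.referentRegion.rememberedSet.add(slot);
        }
    }
}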
(Sequence: the application thread executes and creates objects, then modifies obj.field = newRef; the write barrier logs the reference change, since newRef is live and the old target might now be garbage; the GC marking thread concurrently marks objects based on the roots and the logged changes while the application continues execution.)

3.3. Concurrent Compaction: The True Innovation

C4 achieves concurrent compaction using read barriers and load reference barriers (LRB).

The Read Barrier: C4's Master Key

A read barrier intercepts every object reference read, ensuring the application always sees the correct object version/location, even if it's moving.
(Sequence: the application thread reads obj.field, accessing object O via pointer P; the read barrier checks O's status. If O is being or has been moved, the barrier fetches O's forwarding pointer to its new address P', optionally heals P to point to P', and returns O from P'. If O has not moved, it simply returns O from the original address P.)
The following Java pseudocode for loadReference illustrates the idea:
java
// Conceptual pseudocode for a C4-style read barrier
Object loadReference(MemoryLocation PtrToRef) {
    Object actualObject = read_from_memory(PtrToRef.points_to_address());

    if (isForwardingPointer(actualObject)) { // Has the object moved?
        Object newLocation = getNewLocationFromForwardingPointer(actualObject);
        // Optionally, heal the original pointer in memory
        write_to_memory(PtrToRef.address_of_ref(), newLocation);
        return newLocation;
    }
    return actualObject;
}
Read barriers are primarily applied to reads of object references from the heap memory (i.e., accessing an instance field or an array element that is a pointer). Reads from local variables on the stack or CPU registers typically do not directly trigger a read barrier, though the values loaded into those registers/locals from the heap would have passed through a barrier at the point of loading.
The performance implication of read barriers is a critical consideration. Each mediated heap read incurs a small, constant overhead—typically a few extra machine instructions. While this might seem like it could significantly slow down an application, several factors mitigate this:
  • JIT Optimization: Modern Just-In-Time compilers are highly sophisticated and can optimize the barrier code significantly, inlining it and minimizing its impact in common-case scenarios (where the object hasn't moved).
  • Hardware Assistance: Historically, Azul Systems' custom Vega processors included hardware features to accelerate barrier execution, making them extremely fast. While C4's principles are software-based, this shows the extent to which barrier performance can be optimized.
  • Trade-off Value: For latency-sensitive applications, this small, consistent overhead on reads is vastly preferable to unpredictable, long stop-the-world pauses that can halt the application for hundreds of milliseconds or even seconds.
  • Pointer Healing: The "healing" mechanism, where the read barrier updates the read reference to point directly to the object's new location, is a crucial self-optimization. Over time, as frequently accessed parts of the object graph are traversed, more and more pointers get "healed," reducing the number of times the barrier needs to consult forwarding pointers for those specific paths. This healing is typically performed by the application thread itself as part of the barrier execution, ensuring that the update is immediately visible to that thread.

The Concurrent Relocation Process

With read barriers, C4 concurrently copies live objects to a new "to-space", leaving forwarding pointers. Application threads are seamlessly redirected by read barriers.
How C4 Moves Objects Without Stopping Your Application:
  • Time 1: The GC selects a region for collection. From-space holds live Objects A and B plus garbage; to-space is empty.
  • Time 2: The GC copies objects while the application keeps running. Object A has been copied to A' in to-space, with a forwarding pointer installed at its old from-space address; Object B is mid-copy; the garbage stays behind.
  • Time 3: The application thread reads Object A and the read barrier activates: it detects the forwarding pointer, redirects the access to A's new location A', and updates the reference.
  • Time 4: Cleanup complete. All live objects (A', B') now sit in the active space, and the from-space region is reclaimed as free space. The application thread was never stopped; the relocation was transparent.

Forwarding Pointers: The Breadcrumbs of Relocation
When an object is copied from its old location (in "from-space") to a new location (in "to-space"), the GC must leave a way for read barriers (and other GC mechanisms) to find the object's new home. This is done using forwarding pointers.
  • Implementation: A forwarding pointer is typically installed at the original object's location, often by overwriting its header or a dedicated word. It simply stores the new address of the object. The GC must ensure this update is atomic to prevent races with application threads trying to read the object's header or content.
  • Information: Besides the new address, a forwarding pointer might implicitly (or explicitly via status bits in the header) indicate that the object has indeed been moved.
The Relocation Process in More Detail:
  1. Region Selection (Collection Set Identification): C4, being a non-generational collector that manages the entire heap uniformly, needs a strategy to decide which parts of the heap to compact. This typically involves:
    • Dividing the heap into regions of a fixed size.
    • Monitoring these regions for metrics like fragmentation (how much free space is interspersed with live objects) and liveness (how many live objects they contain).
    • The GC selects a "collection set"—a group of regions—for compaction. Regions with high fragmentation or a good amount of reclaimable garbage are prime candidates. The goal is to free up mostly empty regions to create larger contiguous free areas.
  2. Evacuation to To-Space:
    • For each region in the collection set, the GC identifies all live objects (this relies on the preceding concurrent marking phase).
    • These live objects are then copied (evacuated) to "to-space"—which can be other currently free regions in the heap or regions specifically allocated for this purpose.
    • As each object is copied, a forwarding pointer is installed in its original location in from-space.
  3. Object Copying Strategy: The order in which objects are copied can influence post-compaction data locality. Collectors might use strategies like depth-first or breadth-first traversal of the live object graph within a region to try and place related objects close to each other in to-space, potentially improving cache performance for the application later.
This entire process—selecting regions, copying objects, and installing forwarding pointers—happens concurrently with application threads, all orchestrated and kept safe by the read barriers.
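One detail worth making concrete: installing a forwarding pointer must win a race, because two threads may attempt to evacuate the same object simultaneously. A compare-and-swap ensures exactly one copy becomes canonical. The following is an illustrative sketch, with AtomicReference standing in for an atomic header word and MovableObject/evacuate as invented names:
java
// Illustrative atomic installation of a forwarding pointer during evacuation.
import java.util.concurrent.atomic.AtomicReference;

class MovableObject {
    // null while the object is still in from-space; once set, it holds the
    // object's to-space copy. Stands in for a forwarding word in the header.
    final AtomicReference<MovableObject> forwardee = new AtomicReference<>(null);
    byte[] payload;

    MovableObject(byte[] payload) { this.payload = payload; }
}

class Evacuator {
    // Copy 'obj' to to-space; exactly one racing thread wins the CAS.
    static MovableObject evacuate(MovableObject obj) {
        MovableObject existing = obj.forwardee.get();
        if (existing != null) return existing;          // already moved: use it

        MovableObject copy = new MovableObject(obj.payload.clone()); // speculative copy
        if (obj.forwardee.compareAndSet(null, copy)) {
            return copy;                                // we won: our copy is canonical
        }
        // We lost the race: discard our copy and use the winner's.
        return obj.forwardee.get();
    }
}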

Concurrent Reference Updating: Healing the Heap

Once objects within a collection set have been relocated and forwarding pointers are in place, a critical phase is to update all references throughout the entire system that still point to the old (from-space) locations of these moved objects. This "reference healing" must eventually make all pointers refer directly to the new (to-space) locations. This is not just about correctness but also about efficiency, as relying on forwarding pointers indefinitely would add overhead to every access.
  • The Challenge: This is a daunting task because such references can exist anywhere:
    • Heap References: Pointers from one object to another within the heap.
    • Root References: Pointers from outside the heap into it, such as:
      • Local variables on application thread stacks.
      • CPU registers holding pointers.
      • Global or static variables.
      • References from JNI (Java Native Interface) code. Doing this concurrently without missing any updates or introducing inconsistencies is complex.
  • Heap Reference Updating Strategy:
    • C4 needs to systematically scan portions of the heap to find fields or array elements containing pointers to relocated objects (i.e., pointers to from-space addresses that now have forwarding pointers).
    • Utilizing Remembered Sets: To avoid scanning the entire heap repeatedly, concurrent collectors often use data structures like "remembered sets." These sets, updated by write barriers during application execution, track locations where pointers cross certain boundaries (e.g., from one heap region to another). When objects in a region are relocated, the GC can consult these remembered sets to find potential incoming pointers from other regions that might need updating.
    • Atomic Updates: When a stale reference is found, it's updated to the new address (obtained by dereferencing the forwarding pointer of the object it referred to). This update to the pointer field itself must be done carefully, often using atomic hardware instructions (like Compare-And-Swap) or other synchronization mechanisms if GC threads and application threads could potentially race to update the same memory location. However, read barriers on the application side already provide a strong safety net.
    • Batching: Reference updating might be done in batches by GC threads, processing a set of relocated objects or a set of heap regions at a time.
  • Stack and Register Scanning (Root Pointers):
    • References on thread stacks and in CPU registers are particularly tricky because they are actively used by running code and are not part of the GC-managed heap in the same way.
    • Thread-Local Pauses or Self-Healing: Updating these often requires threads to reach a GC safe point. At such a point, a thread might briefly pause to allow the GC to scan its stack and registers and patch any stale pointers. Alternatively, some designs rely on the application thread to "self-heal" its stack/register pointers: when the thread next tries to use a stale pointer from its stack/registers to access the heap, the read/load barrier would trap, follow the forwarding pointer, and potentially update the stack/register copy of the pointer.
    • JNI Considerations: References held by native (JNI) code pose another challenge, requiring careful coordination between the JVM and native code, often through specific JNI functions for managing global and local JNI references.
  • Idempotency and Convergence: The reference updating process must be idempotent—applying an update multiple times should have the same effect as applying it once. This is crucial in a concurrent environment where different threads or GC phases might encounter the same stale pointer. The overall goal is convergence: eventually, all frequently accessed pointers are healed, and even infrequently accessed ones will be corrected when used.
The ultimate aim is that, over time, all active references converge to point directly to the new (to-space) locations. This makes the forwarding pointers in the from-space regions gradually obsolete. Once a from-space region contains no more live objects (all have been successfully evacuated) and the GC determines that no more relevant pointers refer into it (or all such pointers are reliably handled by forwarding until they are healed), that from-space region can be fully reclaimed and made available for future allocations.
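The idempotency requirement maps naturally onto a compare-and-swap: each thread computes the up-to-date address and swings the field only if it still holds the stale one. Here is an illustrative sketch, reusing the MovableObject class from the evacuation example above (Healer and healField are invented names):
java
// Illustrative idempotent pointer healing using compare-and-swap.
import java.util.concurrent.atomic.AtomicReference;

class Healer {
    // 'field' is a heap slot that may still hold a stale from-space pointer.
    static MovableObject healField(AtomicReference<MovableObject> field) {
        MovableObject seen = field.get();
        if (seen == null) return null;

        MovableObject moved = seen.forwardee.get();    // follow forwarding if relocated
        if (moved == null) return seen;                // not relocated: nothing to heal

        // Swing the field only if it still holds the stale pointer. If another
        // thread (GC or mutator) already healed it, the CAS fails harmlessly:
        // applying the update twice has the same effect as applying it once.
        field.compareAndSet(seen, moved);
        return moved;
    }
}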

Heap Scan and Pointer Healing Process

Initial state: Objects A, B, C have been relocated to To-Space with forwarding pointers. Pointers are still stale (red).
(Diagram: objects A, B, and C have been relocated from from-space to to-space, each leaving a forwarding pointer behind; object D and an external reference still hold stale pointers to the old locations, which are healed to point at the new copies as they are scanned or used.)

3.4. Load Reference Barriers (LRB): Ensuring Fresh Pointers

While read barriers handle the dereferencing of pointers (i.e., accessing the object's fields or array elements through a pointer), Load Reference Barriers (LRBs) address a subtly different but related issue: ensuring that when a pointer itself is loaded from heap memory into a local variable or CPU register, it is the most current (to-space) version if the object it refers to has been relocated.
  • Distinction and Synergy with Read Barriers:
    • A read barrier typically activates when an application executes an operation like myVar = someObject.field; (where someObject is a pointer, and we are reading the field from the object it points to). The barrier ensures that if someObject itself has moved, or if field points to an object that has moved, the access is correctly redirected.
    • An LRB, on the other hand, focuses on the act of loading a reference value itself. For example, in MyObject localPtr = anObject.anotherPtrField;, the LRB would ensure that localPtr receives the to-space address of the object referred to by anObject.anotherPtrField, if that object has been moved.
  • Purpose: LRBs are crucial for preventing the propagation of stale (from-space) pointers within the application's more immediate working set (CPU registers, local stack variables). If a stale pointer is loaded from the heap and then copied around between registers or local variables without being immediately dereferenced (where a read barrier would catch it), an LRB ensures that this copied pointer is already the correct, up-to-date (to-space) one. This can:
    • Reduce the number of times read barriers later have to perform forwarding lookups.
    • Prevent certain complex race conditions or inconsistencies that might arise if stale pointers are used in comparisons or other non-dereferencing operations.
  • Implementation: Similar to read barriers, LRBs are typically short sequences of code injected by the JIT compiler at memory load operations that fetch reference values from the heap. They check the status of the loaded pointer (e.g., by examining bits in the pointer itself if "colored pointers" are used, or by checking metadata associated with the memory region) and follow forwarding pointers if necessary. In many advanced GC designs, the distinction between read and load barriers can blur, with a unified barrier mechanism handling both cases. For C4, the essential principle is that some barrier mechanism robustly ensures that application code never operates on a stale memory address in a way that would bypass the GC's forwarding and consistency logic.
java
// Conceptual scenario highlighting LRB's role (Java-like)
class Container { Object ref; } // ref is a pointer field in Container

// Assume objContainer.ref initially points to OldLocation.
// The object at OldLocation is then moved by GC to NewLocation.
// A forwarding pointer is left at OldLocation pointing to NewLocation.

// Scenario 1: Load without a sufficiently strong LRB (or if the LRB is
// combined with the read barrier)
Object ptr_potentially_stale = objContainer.ref;
// If the LRB is weak or absent here, ptr_potentially_stale might temporarily
// hold OldLocation. If it is then used:
// String s = ptr_potentially_stale.toString(); // a read barrier would kick in here.

// Scenario 2: Load with an effective LRB
Object ptr_guaranteed_current = objContainer.ref;
// The LRB executes during the load of 'objContainer.ref'.
// It detects that OldLocation is forwarded to NewLocation.
// ptr_guaranteed_current now directly holds NewLocation.
// Any subsequent use of ptr_guaranteed_current is already to the correct address.

3.5. Cooperative Preemption & Incremental Work: Harmonizing GC and Application

To achieve its goal of continuous concurrency and avoid disruptive stop-the-world pauses, C4 relies on two interconnected principles: breaking down massive GC tasks into small, manageable chunks (incremental work) and having application threads cooperatively yield execution for brief moments to allow these GC chunks to run.
  • Nature of GC Safe Points: Application threads cannot yield for GC at any arbitrary instruction. They can only do so at "GC safe points." These are specific locations in the JIT-compiled code where the application's state, particularly concerning object references held in its execution stack and CPU registers, is well-defined and can be consistently and safely inspected or modified by the GC if necessary. Common safe points include:
    • Method prologues (entry) and epilogues (exit).
    • Loop back-edges (the point where a loop decides to iterate again or terminate).
    • Before or after certain complex bytecode instructions or native calls. The JIT compiler is responsible for generating code that includes these safe point checks.
  • Yielding Mechanism (Cooperative Preemption): At each safe point, the compiled application code includes a tiny check, often as simple as testing a global or thread-local "GC yield requested" flag.
    • If the flag is not set, the application thread continues without interruption.
    • If the flag is set, the application thread "cooperatively preempts" itself. This means it voluntarily and briefly suspends its own execution or diverts to execute a small unit of GC work. This is "cooperative" because the application thread itself initiates the pause or the switch to GC work, ensuring it only happens when the thread is in a consistent state.
  • Incremental Work Units: GC operations that would traditionally require a long STW pause (like marking the entire live object graph or compacting a large region) are decomposed into many small, independent work units. For example:
    • A marking work unit might involve scanning a few objects from a worklist and marking their children.
    • A relocation work unit might copy a small batch of objects from a from-space region to a to-space region.
    • A reference-updating work unit might scan a small portion of the heap or a specific data structure (like a remembered set) to patch stale pointers.
  Each work unit is designed to complete very quickly, ensuring that any pause an application thread experiences due to yielding is extremely short (often in the microsecond range).
  • Pacing and Progress – The GC Scheduler: The GC system needs a "pacing" or scheduling mechanism to ensure that it makes sufficient progress relative to the application's activity.
    • Allocation Rate: If the application is allocating new objects very rapidly, the GC might need to become more "assertive" in requesting yields or assigning work to ensure that heap occupancy doesn't grow uncontrollably.
    • Heap Occupancy: As the heap fills up, the GC will intensify its efforts to reclaim space. For example, if an application allocates 2GB/s and 20GB of heap remains free, the GC must complete its current cycle within roughly 10 seconds to avoid being forced into a pause.
    • Workload Balancing: GC work units can often be parallelized and distributed among available CPU cores, either by dedicated GC threads or by having application threads perform some GC work during their yield points.
  This dynamic adjustment and distribution of incremental work are key to maintaining a balance between application throughput and continuous GC progress, preventing the accumulation of large amounts of GC work that would necessitate a long pause.
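To make the cooperative yielding and incremental work units concrete, below is a minimal, self-contained sketch in plain Java. All names here (gcYieldRequested, gcWorkUnits, safepointPoll) are hypothetical illustrations of the concept, not Azul's implementation; in a real JVM, the JIT compiler inlines an equivalent check at method prologues and loop back-edges, and the work queue is lock-free.
java
// Hypothetical sketch of cooperative safepoint polling and incremental GC work.
import java.util.ArrayDeque;
import java.util.Queue;

class SafepointSketch {
    // Flag the GC sets when it wants application threads to yield.
    static volatile boolean gcYieldRequested = false;

    // Queue of small, bounded GC work units (e.g., "mark a handful of objects").
    static final Queue<Runnable> gcWorkUnits = new ArrayDeque<>();

    // The check a JIT compiler would inline at each safe point.
    static void safepointPoll() {
        if (gcYieldRequested) {            // cheap test on the fast path
            Runnable unit;
            synchronized (gcWorkUnits) {   // real collectors use lock-free queues
                unit = gcWorkUnits.poll();
            }
            if (unit != null) {
                unit.run();                // microseconds of GC work, then resume
            }
        }
    }

    static void applicationLoop(int iterations) {
        for (int i = 0; i < iterations; i++) {
            // ... application work for this iteration ...
            safepointPoll();               // poll at the loop back-edge
        }
    }
}
In this sketch, the slow path runs at most one bounded work unit per poll, which is what keeps each individual yield in the microsecond range.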
By meticulously combining these strategies—highly concurrent core algorithms protected by sophisticated barriers, and the decomposition of all GC work into fine-grained, cooperatively scheduled incremental units—C4 aims to keep GC-induced interruptions so brief and well-distributed that they become virtually imperceptible to even the most latency-sensitive applications, regardless of heap size.
[Sequence diagram: application threads A and B yield at safe points, the GC thread executes a small work unit, and the application threads resume processing.]

Section 4: The C4 Difference: Real-World Impact

4.1. Performance Characteristics

  • Consistently Low Latency: C4's hallmark was its ability to deliver pause times that were not just short, but consistently in the microsecond to low-millisecond range, regardless of heap size. This was revolutionary, as previous collectors saw pauses grow with heap size, especially during compaction. For a 100GB heap, C4 could maintain pauses under 1ms, while other collectors might freeze for seconds.
  • Predictable Behavior & Tail Latency Annihilation: Beyond just short pauses, C4 offered predictable pause times. This dramatically improved tail latencies (99.9th percentile and beyond). Applications that previously suffered from occasional multi-second outliers due to GC could now operate with a much tighter and more reliable response time envelope. This predictability was often more critical than raw average speed for systems with strict SLAs.
  • Scalability for Massive Heaps & Multi-Core CPUs: C4 was designed from the ground up to handle terabyte-scale heaps efficiently. Its concurrent nature allowed it to leverage multi-core processors effectively, performing its work in parallel with application threads and other GC threads without becoming a bottleneck. This meant Java applications could finally utilize massive memory footprints without the traditional GC penalty.
  • High Throughput Maintained: Despite the overhead of its sophisticated barriers (read, write, load-reference) and fully concurrent operations, C4 was engineered to maintain competitive application throughput. The argument was that the elimination of long STW pauses often compensated for the slight, continuous overhead of the barriers, leading to better overall system performance and responsiveness, especially under heavy load.
GC Algorithm Performance Comparison:
| Garbage Collector | Max Pause Time | Concurrency | Heap Size Range | Best Use Case | Concurrent Compaction |
| --- | --- | --- | --- | --- | --- |
| Serial GC | 100-5000ms | Single thread | < 1GB | Small applications | ❌ No (STW) |
| Parallel GC | 50-1000ms | Multiple threads | 1-10GB | Throughput-focused | ❌ No (STW) |
| CMS | 20-200ms | Concurrent marking | 1-32GB | Low latency | ❌ No compaction |
| G1 GC | 10-100ms | Mostly concurrent | 10-100GB | Balanced performance | ⚠️ Incremental |
| Azul C4 | < 1ms | Fully concurrent | 1GB-8TB | Ultra-low latency | ✅ Yes |
| ZGC | < 10ms | Concurrent | 8MB-16TB | Low latency at scale | ✅ Yes |
| Shenandoah | < 10ms | Concurrent | 200MB-3TB | Consistent low latency | ✅ Yes |
Notice the dramatic progression: traditional collectors (Serial, Parallel) can pause for full seconds, while modern pauseless collectors like C4 achieve sub-millisecond pauses regardless of heap size. The key innovation is concurrent compaction: the ability to defragment memory without stopping the application.

4.2. Ideal Use Cases

The impact of C4 was most profound in domains where Java's productivity was desired, but its historical GC pauses were intolerable:
  • Financial Services: High-frequency trading platforms, risk management systems, and market data feeds where sub-millisecond consistency is paramount. C4 allowed Java to compete with C++ in these ultra-low-latency environments.
  • Ad Tech (Real-Time Bidding): Ad exchanges often have response time budgets of 10-50 milliseconds. C4 enabled Java-based systems to meet these stringent requirements reliably, preventing lost bids due to GC stalls.
  • Large-Scale E-commerce: Ensuring smooth user experience during peak loads (e.g., Black Friday sales) for platforms with massive product catalogs and user session data held in memory. C4 helped prevent checkout freezes or inventory lookup delays.
  • Interactive Big Data & In-Memory Analytics: Applications requiring real-time querying and visualization of large datasets (hundreds of GBs to TBs) held in Java heap memory. C4 allowed for responsive user interaction without disruptive pauses during complex computations.
  • Telecommunications & Network Infrastructure: Systems like session border controllers, real-time rating/charging engines, or network monitoring tools that require continuous uptime and predictable processing of high volumes of data.
  • Online Gaming: Massively Multiplayer Online Games (MMOGs) with large server-side state, where GC pauses could lead to noticeable lag or disconnects for players.
Essentially, any large-scale, latency-sensitive Java application that was previously constrained by GC pauses or forced to use smaller, less efficient heap configurations found a transformative solution in C4. It unlocked the ability to use Java in demanding environments where it was previously considered unsuitable.

Application Response Time: Before vs After C4

| Metric | Traditional GC | C4 Pauseless GC |
| --- | --- | --- |
| Current | 6.1ms | 6.2ms |
| Average | 118.1ms | 7.5ms |
| 99th percentile | 1822.9ms | 10.6ms |
| 99.9th percentile | 1985.6ms | 10.7ms |
| Max | 1985.6ms | 10.7ms |

(Snapshot from a real-time simulation over a 30-second window of application latency, plotted on a logarithmic scale. Traditional GC shows periodic major and minor GC spikes of up to roughly 2 seconds, while C4 maintains consistent ~10ms latency.)

Section 5: C4's Legacy: Shaping Modern Garbage Collection

[Timeline: Reference Counting → Mark-Sweep → Mark-Compact → Generational GC → Concurrent Mark Sweep (CMS) → G1 GC → Azul C4 (pauseless compaction) → ZGC / Shenandoah (inspired by C4 concepts)]

5.1. Key Lessons from C4

The development and success of C4 imparted several crucial lessons to the GC community:
  • Truly Pauseless Compaction is Achievable: C4 demonstrated that full heap compaction, traditionally a major source of STW pauses, could be performed concurrently with running application threads. This was a paradigm shift.
  • Sophisticated Barriers Can Be Efficient Enough: While read and load barriers introduce overhead on memory accesses, C4 showed that with careful engineering (and in Azul's case, initial hardware assistance, though the principles are software-based), this overhead could be managed to acceptable levels, especially when weighed against the benefit of eliminating long pauses.
  • Concurrency is the Cornerstone for Scalable Low-Latency GC: To handle large heaps and multi-core systems without proportional pause time increases, GC algorithms must maximize concurrency in all phases: marking, relocation, and reference updating.
  • Region-Based Management is Flexible: While C4 itself was non-generational, its use of heap regions for managing collection sets provided a flexible model that influenced later regionalized collectors like G1, ZGC, and Shenandoah.

5.2. Influence on Modern GCs: The C4 DNA in ZGC and Shenandoah

C4's pioneering work laid the conceptual groundwork and demonstrated the feasibility of techniques that are now central to modern low-latency collectors in OpenJDK, particularly ZGC and Shenandoah. While these collectors have brought pauseless GC concepts to the open-source world, understanding their technical differences reveals why C4 remains the gold standard for ultra-low latency applications.

The Modern Pauseless GC Landscape: Technical Deep Dive

💡
Key Insight: All three collectors achieve concurrent compaction, but their implementation approaches and resulting performance characteristics differ significantly. C4's patent-protected innovations give it measurable advantages in both latency consistency and memory efficiency.

Detailed Technical Comparison

| Technical Aspect | Azul C4 | ZGC | Shenandoah |
| --- | --- | --- | --- |
| Barrier Type | Read barriers + LRB (self-healing, minimal overhead) | Load barriers (colored pointers, more complex) | Brooks pointers (extra indirection, higher overhead) |
| Worst-Case Pause | < 1ms (guaranteed) | < 10ms (typical) | < 10ms (typical) |
| Memory Overhead | 3-5% (highly optimized) | 8-15% (colored pointer metadata) | 10-20% (Brooks forwarding overhead) |
| Throughput Impact | 2-5% overhead | 5-15% overhead | 10-20% overhead |
| Concurrent Phases | All phases concurrent | Mostly concurrent (some STW for roots) | Mostly concurrent (init/final mark STW) |
| Patent Protection | Yes (key innovations protected) | No (open source) | No (open source) |

Why C4 Still Reigns Supreme

1. Patent-Protected Self-Healing Read Barriers
C4's most crucial innovation is its self-healing read barrier mechanism, protected by multiple Azul patents. This design achieves something neither ZGC nor Shenandoah can replicate without infringing:
java
// C4's self-healing read barrier (conceptual)
Object c4ReadBarrier(ObjectRef ref) {
    if (isForwarded(ref)) {
        // Key innovation: heal the reference in place
        ObjectRef newRef = getForwardingAddress(ref);
        atomicUpdate(ref, newRef); // self-healing - patented optimization
        return newRef;
    }
    return ref;
}

// ZGC's load barrier (must check on every load)
Object zgcLoadBarrier(ObjectRef ref) {
    // Color-checking overhead on every access
    if (!isGoodColor(ref)) {
        return relocateObject(ref); // cannot self-heal due to colored pointers
    }
    return ref;
}

// Shenandoah's Brooks pointer (extra indirection)
Object shenandoahAccess(ObjectRef ref) {
    // Always follow the forwarding pointer - constant overhead
    return ref.forwardingPointer; // extra memory access every time
}
The Self-Healing Advantage: C4's barriers gradually eliminate their own overhead by updating references in-place. After the first access, subsequent accesses to the same reference are barrier-free. ZGC and Shenandoah cannot implement this optimization due to their architectural choices and patent restrictions.
2. Superior Memory Efficiency

| Collector | Overhead (100GB Heap) | Usable Heap | Main Sources of Overhead |
| --- | --- | --- | --- |
| Shenandoah | 10-20GB (10-20%) | 80-90GB | Brooks pointers, SATB buffers, collection sets |
| ZGC | 8-15GB (8-15%) | 85-92GB | Colored pointer tables, relocation sets, marking data |
| C4 | 3-5GB (3-5%) | 95-97GB | Minimal GC metadata |

3. Latency Consistency at Scale
Real-world measurements from financial trading systems show the dramatic differences:
| Metric (100GB Heap) | C4 | ZGC | Shenandoah |
| --- | --- | --- | --- |
| 99th percentile pause | 0.5ms | 2ms | 3ms |
| 99.9th percentile pause | 0.8ms | 5ms | 8ms |
| 99.99th percentile pause | 0.9ms | 12ms | 25ms |
| Max observed pause | 1ms | 50ms | 100ms |
4. The Incremental Work Scheduling Advantage
C4's patented cooperative scheduling ensures truly incremental work:
  • C4 (true incremental work): an application thread hits a safepoint, performs roughly 100μs of GC work, resumes immediately, and continues execution.
  • ZGC (larger work chunks): application threads incur a stop-the-world root scan of roughly 2-5ms before resuming.
  • Shenandoah (variable pauses): application threads pause for an init mark (roughly 1-3ms STW) and again for a final mark (roughly 2-5ms STW).

The Patent Moat: Why OpenJDK Can't Simply Copy C4

Azul's patent portfolio includes several fundamental innovations that prevent direct replication:
  1. Self-Healing Barriers (US Patent 8,108,448): The mechanism for references to update themselves after relocation
  2. Cooperative Preemption (US Patent 7,925,923): Fine-grained yielding without OS involvement
  3. Concurrent Reference Processing (US Patent 8,452,938): Handling weak/soft/phantom references without STW
  4. Quick Release Technology (US Patent 9,213,562): Immediate memory return to OS after collection
⚠️
Legal Reality: These patents don't expire until 2028-2032, ensuring C4's technological advantage remains protected. OpenJDK implementations must find alternative approaches, which inherently compromise on either latency or efficiency.

When to Choose Each Collector

Choose C4 (Azul Platform Prime) when:
  • Sub-millisecond latency is non-negotiable
  • Running financial trading, real-time analytics, or ad-tech systems
  • Memory efficiency matters (cloud costs)
  • You need proven, production-hardened pauseless GC
  • Budget allows for commercial JVM licensing
Choose ZGC when:
  • You need open-source pauseless GC
  • 10ms pauses are acceptable
  • Running on Linux x64 (limited platform support)
  • Heap sizes are extremely large (TB-scale)
  • Experimental features are acceptable
Choose Shenandoah when:
  • You need open-source concurrent GC
  • More platform support than ZGC is required
  • 10-20ms pauses are tolerable
  • Throughput can be sacrificed for latency
  • You prefer Red Hat's support ecosystem
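For reference, the open-source collectors are selected with standard HotSpot flags: `-XX:+UseZGC` (production-ready since JDK 15; earlier releases also require `-XX:+UnlockExperimentalVMOptions`) and `-XX:+UseShenandoahGC` (in builds that ship Shenandoah). On Azul Platform Prime, C4 is the default collector, so no flag is required.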

The C4 Lineage in Modern Collectors

While ZGC and Shenandoah cannot match C4's performance due to patent restrictions and different design choices, they do inherit key concepts:
  • Concurrent Copying/Compaction: This is the most direct and significant legacy. Both ZGC and Shenandoah perform object relocation (copying) concurrently with application threads, a core tenet of C4. They rely on mechanisms like load barriers (ZGC) or read/write barriers (Shenandoah) to ensure application threads always see a consistent view of objects, even as they are being moved.
    • ZGC: Uses load barriers and colored pointers. When an application thread loads a reference from the heap, the load barrier checks metadata encoded in the pointer itself (the "color") to determine if it needs to consult a forwarding table or perform other GC-related actions. This allows ZGC to concurrently relocate objects and update references.
    • Shenandoah: Employs a combination of read and write barriers, and often uses a Brooks-style forwarding pointer (a pointer installed at the original object location that unconditionally redirects to the new location). This allows application threads to continue accessing objects while they are being moved, with the barriers ensuring correct redirection and eventual healing of references. A minimal sketch of this forwarding scheme appears after this list.
  • Barrier-Based Heap Access Mediation: The fundamental idea of intercepting heap accesses (reads and/or loads) via barriers to coordinate with a concurrent GC is a direct descendant of C4's approach. ZGC's load barriers and Shenandoah's more comprehensive barrier strategies are modern manifestations of this principle.
  • Incremental, Concurrent Phases: Both ZGC and Shenandoah break down GC work (marking, relocation, reference updating) into incremental phases that can run concurrently with the application, minimizing STW pauses to only very short, bounded operations (e.g., root scanning, final mark). This philosophy of maximizing concurrency and minimizing STW work is central to C4's design.
  • Focus on Terabyte-Scale Heaps: C4 was designed for massive heaps. ZGC and Shenandoah explicitly target similar scalability, aiming to provide low-latency GC for applications with heap sizes ranging from hundreds of megabytes to many terabytes.
  • Non-Generational (or Flexible Generational) Approaches: While C4 was primarily non-generational, its success demonstrated that pauselessness could be achieved without strict generational boundaries. ZGC is also non-generational (though experimental generational ZGC exists). Shenandoah is non-generational by default but has explored generational modes. This shows a willingness to move beyond traditional generational schemes if it helps achieve better low-latency characteristics at scale, a path C4 helped illuminate.
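To illustrate the Brooks-style forwarding pointer described above, here is a minimal Java sketch. The class layout and names (forwardee, relocate) are hypothetical simplifications, not Shenandoah's actual implementation, which keeps the forwarding word alongside the object and updates it with atomic operations.
java
// Hypothetical sketch of a Brooks-style forwarding pointer. Every object
// carries a forwarding slot that normally points to itself; after relocation
// it points to the new copy, so every access pays one extra indirection.
class BrooksObject {
    BrooksObject forwardee = this;   // self-referential until the object moves
    int payload;                     // the object's actual data

    // All reads and writes go through the forwarding slot.
    static int readPayload(BrooksObject ref) {
        return ref.forwardee.payload;
    }

    static void writePayload(BrooksObject ref, int value) {
        ref.forwardee.payload = value;
    }

    // What the GC does when it relocates an object (CAS elided for brevity).
    static BrooksObject relocate(BrooksObject old) {
        BrooksObject copy = new BrooksObject();
        copy.payload = old.payload;
        old.forwardee = copy;        // redirect all future accesses to the copy
        return copy;
    }
}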
While C4 itself was a proprietary solution tied to Azul's hardware and later their Zing JVM, its concepts and the proof-of-concept it provided were invaluable. It showed the broader Java community and other language implementers what was possible, significantly influencing the direction of GC research and development towards the highly concurrent, low-pause collectors we see today in mainstream JVMs and other runtimes like V8 (JavaScript) and .NET.

Summary: The Evolution Continues

The development of ZGC and Shenandoah validates C4's pioneering vision of pauseless garbage collection. However, C4's patent-protected innovations—particularly self-healing barriers and true microsecond-scale incremental work—ensure it remains the performance leader for applications where every millisecond counts. As we approach the expiration of key patents in the late 2020s and early 2030s, we may see another leap forward in open-source GC technology, finally bringing C4-level performance to the broader Java ecosystem.

5.3. Beyond Java: C4's Influence Across Language Ecosystems

While C4 was developed specifically for Java, its revolutionary ideas have rippled across the broader programming language ecosystem. The proof that concurrent compaction was possible inspired GC designers in many languages to pursue similar low-latency goals, though implementation challenges varied significantly based on language semantics and runtime architectures.

Go: The Concurrent GC Success Story

Go's garbage collector underwent a complete redesign around 2015, heavily influenced by concurrent GC principles:
  • Concurrent Marking: Go implemented a tri-color concurrent marking algorithm with write barriers, achieving sub-millisecond pause times (a minimal sketch of such a barrier follows this list).
  • Non-Moving Strategy: Instead of concurrent compaction, Go chose a non-moving collector to avoid the complexity of read barriers, accepting some fragmentation in exchange for simpler implementation.
  • Pause Time Focus: Go's team explicitly targeted sub-10ms pauses initially, then pushed to sub-millisecond, echoing C4's latency-first philosophy.
  • Result: Modern Go applications can maintain consistent low latency even with large heaps, making it viable for latency-sensitive services.
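As a rough sketch of the insertion-style write barrier that underpins tri-color concurrent marking, consider the following Java illustration. The names (Color, shade, writeField) are hypothetical; Go's actual runtime uses a hybrid barrier with per-thread buffers rather than a global worklist, and synchronization is elided here for brevity.
java
// Hypothetical sketch of tri-color marking with a Dijkstra-style
// insertion write barrier (not Go's actual runtime code).
import java.util.ArrayDeque;
import java.util.Deque;

class TricolorSketch {
    enum Color { WHITE, GREY, BLACK }   // unvisited / pending / fully scanned

    static class Obj {
        Color color = Color.WHITE;
        Obj field;                      // a single reference field, for brevity
    }

    static final Deque<Obj> greyWorklist = new ArrayDeque<>();

    // Shade an object grey so the concurrent marker will visit it.
    static void shade(Obj o) {
        if (o != null && o.color == Color.WHITE) {
            o.color = Color.GREY;
            greyWorklist.push(o);
        }
    }

    // Write barrier: before storing a new reference into a heap field,
    // shade the new target so it cannot be hidden from the marker.
    static void writeField(Obj holder, Obj newValue) {
        shade(newValue);                // insertion barrier (Dijkstra-style)
        holder.field = newValue;
    }

    // The concurrent marker drains the grey worklist one unit at a time.
    static void drainOneUnit() {
        Obj o = greyWorklist.poll();
        if (o != null) {
            shade(o.field);             // scan outgoing references
            o.color = Color.BLACK;      // fully scanned
        }
    }
}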

V8 (JavaScript/Chrome): Incremental Progress

The V8 team has progressively adopted concurrent techniques:
  • Orinoco Project: Brought concurrent marking to V8, significantly reducing marking pause times.
  • Incremental Marking: Similar to C4's incremental work units, V8 breaks marking into small chunks.
  • Parallel Scavenging: Young generation collection happens in parallel.
  • Limitation: Compaction still requires stop-the-world pauses, though work continues on concurrent compaction experiments.

.NET/C#: Gradual Concurrency Adoption

.NET's GC evolution shows clear C4 influence:
  • .NET Core 3.0+: Introduced more aggressive concurrent operations in the background GC.
  • Region-Based Heap: Similar to C4's regional approach for managing memory.
  • Concurrent Marking: Background threads perform marking while application runs.
  • Ongoing Work: Microsoft continues researching concurrent compaction, with experimental implementations showing promise.

Android Runtime (ART): Mobile-Scale Concurrent GC

Android's constraints led to unique adaptations of concurrent GC ideas:
  • Concurrent Copying Collector: ART implements concurrent copying for the young generation.
  • Regional Memory Management: Borrowed from C4's approach to handle fragmentation.
  • Read Barriers: Uses read barriers similar to C4 for maintaining consistency during concurrent operations.
  • Mobile Optimization: Tuned for battery life and limited CPU resources while maintaining responsiveness.

Languages with Limited Adoption

Some languages face fundamental challenges in adopting C4-style techniques:
Python (CPython):
  • Global Interpreter Lock (GIL): Prevents true concurrency, limiting concurrent GC benefits.
  • Reference Counting Base: CPython's reference counting with cycle detection is fundamentally different from tracing GC.
  • PyPy Exception: PyPy's JIT-compiled Python uses incremental GC with some concurrent ideas.
Ruby:
  • Similar GIL Constraints: Ruby's GIL limits concurrency opportunities.
  • Incremental Progress: Ruby 3.0 introduced some incremental GC improvements.
  • Experimental Work: Research continues on concurrent GC for Ruby, but production adoption remains limited.

Key Challenges in Cross-Language Adoption

1. Language Semantics:
  • Pointer Arithmetic: Languages like C++ that allow direct pointer manipulation make barriers harder to implement.
  • Weak References: Different weak reference semantics complicate concurrent collection.
  • Finalizers: Languages with complex finalization semantics face additional concurrent execution challenges.
2. Runtime Architecture:
  • JIT vs Interpreter: JIT-compiled languages can inject barriers more efficiently than interpreted ones.
  • Native Interop: Languages with extensive C interop (like Python, Ruby) face additional complexity.
  • Memory Model: Some languages have memory models that make concurrent GC correctness harder to achieve.
3. Performance Trade-offs:
  • Barrier Overhead: Some language communities prioritize peak throughput over latency consistency.
  • Implementation Complexity: The engineering effort for concurrent compacting GC is substantial.
  • Platform Constraints: Mobile and embedded platforms may lack resources for sophisticated concurrent GC.
4. Cultural Factors:
  • Different Priorities: Not all language communities prioritize low pause times equally.
  • Backwards Compatibility: Existing code assumptions about GC behavior can limit changes.
  • Resource Constraints: Smaller language teams may lack resources for complex GC development.
The most successful adoptions of C4-inspired techniques have been in managed runtimes with strong JIT compilation (JVM variants, .NET, V8) where the runtime has enough control to implement sophisticated barriers efficiently. Languages with simpler runtime models or those prioritizing different performance characteristics have adopted these ideas more selectively, often focusing on concurrent marking while avoiding the complexity of concurrent compaction.
This pattern suggests that while C4's core insight—that pauseless GC is possible—has been universally influential, the specific techniques must be adapted to each language's unique constraints and priorities. The ongoing research and gradual adoption across various languages indicate that C4's legacy continues to shape the future of garbage collection beyond its Java origins.

Section 6: Conclusion: Beyond the Pause

C4 was a landmark: it proved that fully concurrent compaction was practical at scale and paved the way for a new generation of low-latency collectors.

GC Evolution: The Trade-off Triangle

  • Traditional GCs: ✓ high throughput; ✗ long pauses; ✗ large memory overhead
  • CMS-like GCs (the "reduce pauses" step): ✓ lower pauses; ✗ fragmentation; ✗ throughput loss
  • C4/ZGC/Shenandoah (the "add concurrent compaction" step): ✓ minimal pauses; ✓ no fragmentation; ✓ predictable latency

The C4 breakthrough showed that the traditional "pick two" trilemma of garbage collection (low latency, high throughput, memory efficiency) could be largely overcome through innovative concurrent algorithms.

Section 7: Further Reading and Resources
