A couple of weeks ago, we noticed that the <span class="codeinline">HeapInUse</span> metric reported by the Go runtime, which tracks the number of in-use bytes on the heap, looked like the following for the WarpStream control plane:
This was alarming, as the linear increase strongly indicates a memory leak. The leak was very slow, and our control planes are deployed almost daily (sometimes multiple times per day), so while the memory leak didn’t represent an immediate issue, we wanted to get to the bottom of it.
The WarpStream control plane is written in Go, which has excellent built-in support for debugging application memory issues with pprof. We’ve used pprof hundreds of times in the past to debug performance issues, and usually memory leaks are particularly easy to spot.
The pprof heap profiles can be used to see which objects are still “live” on the heap as of the latest garbage collection run, so figuring out the source of a memory leak is usually as simple as grabbing a couple of heap profiles at different points in time. The differences in the memory occupied by live objects will explain the leak.
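For example, with the standard net/http/pprof endpoint enabled, a before-and-after comparison can be done roughly like this (host, port, and file names are illustrative):

```bash
# Capture a heap profile now, and another one a few hours later.
curl -s -o heap_t0.pb.gz http://localhost:6060/debug/pprof/heap
curl -s -o heap_t1.pb.gz http://localhost:6060/debug/pprof/heap

# Diff the two profiles by in-use space of live objects; the growth between
# them points at the allocation sites of whatever is being retained.
go tool pprof -inuse_space -base heap_t0.pb.gz heap_t1.pb.gz
```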
As expected, comparing heap profiles taken at different times showed something very suspicious:
The profile on the right, which was taken later, showed that the size of the live <span class="codeinline">FileMetadata</span> objects created by the compaction scheduler almost doubled! To understand what the profile is telling us here, we have to get into WarpStream’s job scheduling framework briefly.
For a WarpStream cluster to function efficiently, a few background jobs need to run regularly. One example is compaction, which periodically rewrites and merges the data files in object storage. These jobs run in the Agent, but are scheduled by the control plane.
To orchestrate these jobs, a polling model is used as shown in Figure 3 below. The control plane maintains a job queue to which the various job schedulers submit jobs. The Agent will periodically poll the control plane for outstanding jobs to run, and once a job is completed, an acknowledgement is sent back to the control plane, allowing the control plane to remove the specified job from the queue. Additionally, the control plane regularly scans the jobs in the job queue to remove jobs it considers timed out, preventing queue buildup.
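In Go terms, the queue that sits between the schedulers and the Agent-facing poll handlers can be pictured roughly like this (a hypothetical sketch; none of these names come from the actual WarpStream code):

```go
package jobs // hypothetical package, for illustration only

import (
	"context"
	"errors"
	"time"
)

// Job is a placeholder for a scheduled unit of work, e.g. a compaction job
// whose inputs are FileMetadata objects.
type Job struct {
	ID      string
	Payload []byte
}

// ErrQueueFull is returned by Submit when every queue slot is occupied.
var ErrQueueFull = errors.New("job queue is full")

// JobQueue sketches the queue shared by the job schedulers (producers) and
// the Agents polling for work (consumers).
type JobQueue interface {
	// Submit enqueues a job produced by a scheduler; it fails if the queue is full.
	Submit(ctx context.Context, job Job) error
	// Poll returns outstanding jobs for an Agent that asked for work.
	Poll(ctx context.Context, maxJobs int) ([]Job, error)
	// Ack removes a job from the queue once an Agent reports it completed.
	Ack(ctx context.Context, jobID string) error
	// ExpireTimedOutJobs removes jobs that have been outstanding for too long.
	ExpireTimedOutJobs(now time.Time)
}
```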
Knowing how job scheduling works, it was surprising to see the <span class="codeinline">FileMetadata</span> objects being highlighted in the heap profiles. These objects, serving as inputs for the compaction jobs, have a pretty deterministic lifecycle: they should be removed from the queue and eventually garbage collected as compaction jobs complete or time out.
So, how can we explain the increased memory usage due to these <span class="codeinline">FileMetadata</span> objects? We had two hypotheses:

1. Compaction jobs were not completing or timing out, so the queue was genuinely accumulating jobs.
2. The job queue code was holding on to references to jobs even after they had completed or timed out.

With our logs and metrics, the first hypothesis was ruled out. To confirm the second one, we carefully went through the job queue code, spotted and fixed a potential source of the leak, and yet the fix did not stop the leak. Much of this relied on our familiarity with the codebase, so even when we thought we had a fix, there was no concrete proof.
We were stumped. We set out thinking that profiling would provide all the answers, but were left perplexed. With no remaining hypothesis to validate, we had to revisit the fundamentals.
The Go runtime comes with a garbage collector (GC) and most of the time we don’t have to think about how it works, until we need to understand why a certain object is being retained. The fact that the <span class="codeinline">FileMetadata</span> objects showed up in the in-use space view of the heap profiles means that the GC still considered them live. But what does that mean?
The Go GC employs the mark-sweep algorithm, meaning its cycles include a mark phase and a sweep phase. The mark phase figures out if an object is reachable and the sweep phase reclaims the unreachable objects determined from the mark phase.
To figure out whether an object is reachable, the GC has to traverse the object graph starting from the GC roots, marking any object referenced by a reachable object as reachable itself. The complete list of GC roots can be found below, but examples include global variables and live goroutine stacks.
<figcaption>Figure 4: Pseudocode based on the Go GC’s logic showing the mark phase starting from the GC roots.</figcaption>
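In the same spirit as the pseudocode in Figure 4, the mark phase can be pictured as a flood fill over the object graph. The following is a heavily simplified Go sketch (not the runtime's actual implementation):

```go
// object is a stand-in for a heap object and the pointers it contains.
type object struct {
	references []*object
}

// mark flood-fills reachability starting from the GC roots; anything not in
// the returned set is unreachable and gets reclaimed during the sweep phase.
func mark(roots []*object) map[*object]bool {
	reachable := make(map[*object]bool)
	work := append([]*object(nil), roots...)
	for len(work) > 0 {
		obj := work[len(work)-1]
		work = work[:len(work)-1]
		if obj == nil || reachable[obj] {
			continue
		}
		reachable[obj] = true
		// Every object referenced by a reachable object is itself reachable.
		work = append(work, obj.references...)
	}
	return reachable
}
```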
That means that for the <span class="codeinline">FileMetadata</span> objects to be retained, they must be traceable back to some GC root. The question then became: could we figure out the precise chain of object references leading to the <span class="codeinline">FileMetadata</span> objects? Unfortunately, this isn’t something that pprof could help with.
The heap profiles were very effective at telling us the allocation sites of live objects, but provided no insights into why specific objects were being retained. Getting the GC roots of these objects would be crucial for understanding the leak.
For that, we used gcore from gdb to take a core dump of the control plane process in our staging environment, with a command along the following lines (the process name and output path shown here are illustrative):
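```bash
# Attach to the running control plane process and write out a core dump
# (the process name and output prefix are placeholders).
gcore -o /tmp/control-plane-core $(pidof control-plane)
```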
However, raw core dumps can be notoriously difficult to interpret. While the snapshot of the heap in the core dump captures the relationships between objects, understanding what those objects mean in the context of our application is a whole other challenge. So, we turned to viewcore for analysis: it interprets the core dump using the DWARF debugging information from the binary and provides handy utilities for exploring the state of the dumped process.
We ran commands along the following lines to see the live <span class="codeinline">FileMetadata</span> objects along with their virtual addresses (the core file and binary paths are illustrative):
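```bash
# Open the core dump alongside the binary with viewcore
# (golang.org/x/debug/cmd/viewcore) and list live objects,
# filtering for the FileMetadata type.
viewcore /tmp/control-plane-core.12345 --exe ./control-plane objects | grep FileMetadata
```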
The resulting output looked like this:
<figcaption>Figure 5: A sample of the live FileMetadata objects that viewcore showed from the core dump.</figcaption>
To get the GC root information for a given object, we used viewcore's reachable command, along these lines (the object address is illustrative):
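```bash
# Print the chain of pointers from a GC root to one of the FileMetadata
# objects listed above (the address is a placeholder).
viewcore /tmp/control-plane-core.12345 --exe ./control-plane reachable c000155f00
```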
That gave us the chain of references shown below:
<figcaption>Figure 6: The precise chain of references from a FileMetadata object to a GC root.</figcaption>
This chain of references from the core dump revealed something far less obvious: the <span class="codeinline">FileMetadata</span> objects, which we said were created by the compaction scheduler, were being retained by the deadscanner scheduler, whose job is to scan for and remove files in the object store that are no longer tracked by the control plane.
This gave us another angle to consider: how could the deadscanner scheduler possibly be retaining jobs that it did not create? As revealed by the object relationships in Figure 6 and the diagram in Figure 3, the compaction and deadscanner schedulers share a reference to the same job queue. Consequently, the fact that a compaction job was retained not by the compaction scheduler but by the deadscanner scheduler implies that the compaction scheduler had already terminated, while the deadscanner scheduler continued to run.
This behavior was unexpected. All job schedulers for a virtual cluster are bundled into a single computational unit called an actor, and the actor dictates the lifecycle of its internal components. Consequently, the various schedulers shut down if and only if the job actor shuts down. At least, that’s how it’s supposed to work!
That information narrowed down the scope of the search, and upon investigation, we discovered that the memory leak could be attributed to a goroutine leak in the deadscanner scheduler. The important logic is sketched below in simplified form (only <span class="codeinline">RunAsync</span> and <span class="codeinline">runOnce</span> are names from the actual code; everything else is illustrative):
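```go
// A simplified, hypothetical sketch of the deadscanner scheduler, building on
// the JobQueue / Job / ErrQueueFull sketch from earlier; not the original code.
type deadScannerScheduler struct {
	queue         JobQueue
	interval      time.Duration
	retryInterval time.Duration
}

// RunAsync runs the scheduler in a background goroutine until the job actor
// cancels the provided context.
func (s *deadScannerScheduler) RunAsync(ctx context.Context) {
	go func() {
		ticker := time.NewTicker(s.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // job actor is shutting down
			case <-ticker.C:
				s.runOnce(ctx)
			}
		}
	}()
}

// runOnce schedules one round of dead-file scan jobs.
func (s *deadScannerScheduler) runOnce(ctx context.Context) {
	for _, job := range s.planDeadScanJobs() {
		// Block on submission until the job is accepted: if the queue is full,
		// wait for in-flight jobs to complete and a slot to open up, then retry.
		for {
			err := s.queue.Submit(ctx, job)
			if err == nil {
				break // job accepted, move on to the next one
			}
			if errors.Is(err, ErrQueueFull) {
				time.Sleep(s.retryInterval)
				continue
			}
			return // any other error is treated as a terminating condition
		}
	}
}

// planDeadScanJobs stands in for the logic that decides which dead-file scans
// to run on this tick.
func (s *deadScannerScheduler) planDeadScanJobs() []Job {
	return []Job{{ID: "dead-scan"}} // illustrative
}
```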
The scheduler runs in the background and periodically schedules jobs for the Agents to execute. These jobs are submitted to a queue, and we block on job submission until one of the terminating conditions is met. The rationale is simple: if the queue is full at the time of the submission, the scheduler will wait for in-flight jobs to complete and queue slots to become available.
And that precisely was the cause of the leak. When a job actor is shutting down, it signals to the contained job schedulers that a shutdown is in progress by canceling the context passed to the <span class="codeinline">RunAsync</span> function.
However, there is a catch. If the deadscanner scheduler is busy spinning inside the for loop in <span class="codeinline">runOnce</span> due to backpressure from a full queue at the time of the context cancellation, it will not be aware of the cancellation! What is worse is that during job actor shutdown, the queue will most likely be full, because the queue will no longer be serving poll requests from the Agents and the outstanding jobs will remain, causing job submissions to be backpressured continuously and the goroutine from the deadscanner scheduler to stay stuck.
The fix was simple. All we needed to do was make the job queue submission function check for context cancellation before doing anything else. The deadscanner scheduler then sees the job submission error due to the canceled context, breaks out of the loop in <span class="codeinline">runOnce</span>, and shuts down properly.
<figcaption>Figure 8: The patch to the job queue that returns an error for job submissions with a canceled context.</figcaption>
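In code, the guard might look roughly like this, sketched against the hypothetical queue from earlier (the channel-based implementation is purely illustrative):

```go
// jobQueue is a minimal concrete stand-in for the real queue implementation.
type jobQueue struct {
	slots chan Job // bounded buffer of outstanding jobs
}

// Submit with the fix applied: fail fast when the caller's context is already
// canceled, before checking for a free slot or doing anything else.
func (q *jobQueue) Submit(ctx context.Context, job Job) error {
	if err := ctx.Err(); err != nil {
		// The job actor is shutting down (or the caller gave up): surface the
		// cancellation instead of reporting the queue as full forever.
		return err
	}
	select {
	case q.slots <- job:
		return nil
	default:
		return ErrQueueFull
	}
}
```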
At this point one might start to wonder when the job actor gets shut down. If this only happened during control plane shutdowns, the effects would have been benign. The reality is more complex. The control plane explicitly shuts down job actors in the following scenarios:

1. A virtual cluster stops polling for jobs.
2. Job actors are rebalanced across control plane replicas.

Consider a scenario where a tenant disconnects all their Agents from the control plane. This corresponds to the first case: if a cluster is no longer receiving poll job requests, then the job actor can be purged to free up resources. Scenario 2 is related to the multi-tenant nature of the control plane.
As shown in Figure 7, every virtual cluster gets its own job actor for isolation, and the various job actors are distributed among the control plane replicas. To avoid overloading individual replicas with memory-intensive job actors, the control plane periodically assesses the memory usage of the replicas. When significant imbalances are detected, it redistributes actors by shutting down an actor on the replica with the highest memory usage and re-spawning it on the replica with the lowest usage. The combination of these two factors led to more frequent and yet less predictable memory leak occurrences.
To confirm that we had the right fix, we deployed the patch and monitored the <span class="codeinline">HeapInUse</span> metric shown previously in Figure 1. This time, the metric looked a lot healthier:
The cause of a memory leak is always more obvious in retrospect. The investigation took several twists and turns before we arrived at the correct solution. So we wondered: could we have approached this more effectively? Since we now know that the root cause was a goroutine leak, we should have been able to rely on the goroutine profiles to uncover the problem.
It turned out that sometimes the global picture is not very telling. When comparing two profiles showing all goroutines, the leak was not very obvious to the human eye:
However, when we zoomed in on the offending deadscanner package, a more significant change was revealed:
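With pprof, that kind of zooming in can be done by diffing two goroutine profiles and restricting the view to a single package, roughly like this (host, file names, and the focus pattern are illustrative):

```bash
# Capture goroutine profiles at two points in time.
curl -s -o goroutine_t0.pb.gz http://localhost:6060/debug/pprof/goroutine
curl -s -o goroutine_t1.pb.gz http://localhost:6060/debug/pprof/goroutine

# Diff them globally (where the leak was hard to spot)...
go tool pprof -base goroutine_t0.pb.gz goroutine_t1.pb.gz

# ...and again restricted to stacks that go through the deadscanner package.
go tool pprof -focus=deadscanner -base goroutine_t0.pb.gz goroutine_t1.pb.gz
```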
The art of debugging complex systems lies in simultaneously holding the system-wide perspective and the microscopic view, and in knowing which tools to use at each level of detail. As we have seen, seemingly subtle changes can have a significant impact at the global level.
The debugging journey often begins with examining global trends using diagnostic tools like profiling. However, when those observations are inconclusive, isolating the data by specific dimensions can also be beneficial. While the selection of these dimensions might involve some trial and error, the results can still be very insightful. And as a last resort, reverting to the lowest-level tools is always a viable option.