Advanced Deep Learning Interview Questions #2 - The Memory Fragmentation Trap

AI Interview Prep · Mar 23, 2026

Key Takeaways

  • Stack traces often mislead on OOM origins
  • Use torch.cuda.memory_snapshot() to detect fragmentation
  • Avoid storing full loss tensors; use loss.item()
  • Profile memory peaks with PyTorch Profiler
  • Adjust hyperparameters only after structural fixes

Summary

In a Meta senior ML engineer interview, candidates are asked how to debug a 500‑line PyTorch out‑of‑memory (OOM) stack trace without simply lowering the batch size. The post argues that stack traces are unreliable and that the real issue is often memory fragmentation or lingering computational graphs. It prescribes a three‑step workflow: capture a memory allocator snapshot, hunt for dangling graph references, and profile peak memory usage with PyTorch Profiler. Only after confirming code integrity should hyperparameters be tweaked.

Pulse Analysis

Out‑of‑memory errors have become a routine obstacle for teams training large neural networks, yet many engineers treat the symptom rather than the cause. PyTorch's caching allocator carves GPU memory into blocks that can become fragmented, so a single allocation failure may mask a series of smaller leaks. Relying on the final line of a stack trace is akin to diagnosing a heart attack by looking at the last ECG reading; the underlying pathology remains hidden. Understanding how PyTorch's memory allocator works and recognizing that the reported line is rarely the offender is the first step toward efficient debugging.
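When fragmentation is confirmed, the caching allocator itself can be tuned via the documented PYTORCH_CUDA_ALLOC_CONF environment variable. This is a complement to fixing the code, not a substitute; the specific values below are illustrative starting points, not recommendations from the post:

```shell
# Cap the block size the allocator will split, which limits how badly
# large free blocks get chopped into unusable fragments.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Alternatively (PyTorch 2.x): let memory segments grow in place instead of
# reserving fixed-size segments, which reduces fragmentation for workloads
# with varying allocation sizes.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

Both options are described in PyTorch's CUDA memory management notes; they change allocator behavior process-wide, so measure before and after with the snapshot workflow described below.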

The recommended workflow begins with an immediate snapshot of the CUDA memory allocator using torch.cuda.memory_snapshot(). This dump reveals fragmented blocks and lingering allocations that standard usage metrics overlook. Next, developers should audit their code for dangling computational graphs—common when raw loss tensors are appended to Python lists instead of extracting scalar values with loss.item(). Such patterns trap entire graphs in memory across training steps, inflating usage dramatically. Finally, employing the PyTorch Profiler to visualize memory peaks uncovers transient buffers that persist longer than intended, allowing engineers to pinpoint and eliminate unnecessary allocations before they cascade into OOM failures.
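The three steps can be sketched in a few lines. This is a minimal CPU-only illustration assuming a recent PyTorch install; the model, data, and hyperparameters are placeholders, and the snapshot call is guarded so the sketch runs without a GPU:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and data; stand-ins for a real training setup.
model = torch.nn.Linear(64, 1)
data = torch.randn(16, 64)
target = torch.randn(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
# Step 3: profile_memory=True records per-op allocations during the run.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    for _ in range(3):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()
        optimizer.step()
        # Step 2: loss.item() stores a plain float. Appending `loss` itself
        # would keep each step's entire autograd graph alive via grad_fn,
        # inflating memory across the whole training run.
        losses.append(loss.item())

# Step 1 (GPU only): dump the allocator state; each entry describes a
# memory segment and its blocks, exposing fragmentation directly.
if torch.cuda.is_available():
    snapshot = torch.cuda.memory_snapshot()

# Step 3 continued: rank ops by memory to spot transient buffers that
# persist longer than intended.
print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=5))
```

On a GPU run, the same profiler call with ProfilerActivity.CUDA and sort_by="cuda_memory_usage" surfaces device-side peaks; the snapshot can also be inspected offline with torch.cuda.memory._dump_snapshot and PyTorch's memory visualizer.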

Beyond technical correctness, this method carries strategic weight in hiring and production environments. Interviewers at top tech firms like Meta gauge a candidate’s depth by probing beyond the obvious batch‑size suggestion, expecting a systematic, data‑driven approach. In production, eliminating memory leaks translates to higher GPU utilization, reduced cloud costs, and smoother scaling of models from research to deployment. Embedding these practices into the development lifecycle not only safeguards resources but also cultivates a culture of rigorous performance engineering.
