
LLM System Design Interview #31 - The View vs Copy Trap

Key Takeaways
- Transposing a tensor changes its strides without moving data, leaving it non-contiguous.
- Calling .view() on a non-contiguous tensor raises an error; .reshape() silently falls back to a memory copy.
- The hidden copy inflates VRAM usage, causing unexpected spikes in the profiler.
- Calling .contiguous() before .view() makes the copy explicit and measurable rather than avoiding it.
- Proper tensor memory handling improves GPU utilization, and interview performance along with it.
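The trap summarized above can be reproduced in a few lines. A minimal sketch (shapes are arbitrary, chosen small for inspection):

```python
import torch

x = torch.arange(6).reshape(2, 3)    # contiguous: strides (3, 1)
y = x.t()                            # transpose: strides (1, 3), no data moved
assert not y.is_contiguous()
assert y.data_ptr() == x.data_ptr()  # same underlying storage

# .view() refuses to operate on a non-contiguous tensor
try:
    y.view(6)
except RuntimeError as e:
    print("view failed:", e)

# .reshape() succeeds, but silently allocates and fills a new buffer
z = y.reshape(6)
assert z.data_ptr() != y.data_ptr()  # different storage: a copy happened
```

The two assertions on `data_ptr()` are the whole story: the transpose shares storage with the original, while the reshape quietly does not.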
Pulse Analysis
In PyTorch, a tensor is essentially a descriptor: a pointer to a flat block of memory plus stride information that defines how indices map into that block. Operations such as .transpose() or .permute() merely reorder the stride metadata; they do not relocate the underlying data. The result is a mathematically transposed tensor whose logical element order no longer matches its storage order, which is what "non-contiguous" means. While this makes the transpose itself essentially free, it also means that operations requiring contiguous storage behave differently: .view() refuses to run and raises a RuntimeError, while .reshape() silently falls back to creating a contiguous copy. This behavior is documented in PyTorch's tensor views documentation.
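The metadata-only nature of a transpose is easy to verify by printing the strides before and after (the shape here is purely illustrative):

```python
import torch

a = torch.zeros(4, 8)
print(a.shape, a.stride())           # torch.Size([4, 8]) (8, 1)

b = a.transpose(0, 1)                # only the stride metadata is swapped
print(b.shape, b.stride())           # torch.Size([8, 4]) (1, 8)

print(b.data_ptr() == a.data_ptr())  # True: the data never moved
```

Reading `b` row by row now jumps through memory in steps of 8 elements, which is exactly why a flat `.view()` of `b` cannot be expressed with strides alone.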
The copy operation is invisible in the source code but shows up as a sudden jump in VRAM consumption in the PyTorch profiler. When a developer adds .reshape() or .contiguous().view() to silence a contiguity error, the runtime allocates a new contiguous buffer and copies the original values into it; until the old buffer is freed, both allocations coexist. This hidden allocation can nearly double peak memory usage for large matrices, leading to out-of-memory crashes or degraded performance. Explicitly calling .contiguous() before reshaping makes the copy intentional and lets developers measure its cost. Profiling tools like torch.utils.benchmark can quantify the overhead before deployment.
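The cost of the hidden copy can be measured directly. A sketch using torch.utils.benchmark, with an illustrative matrix size:

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(2048, 2048).t()      # non-contiguous view, no copy yet
assert not x.is_contiguous()

# Time the copy that .contiguous() (or a fallback .reshape()) performs
timer = benchmark.Timer(stmt="x.contiguous()", globals={"x": x})
print(timer.timeit(100))

# While the copy runs, a second full-sized buffer exists alongside the original
extra_bytes = x.nelement() * x.element_size()
print(f"transient extra allocation: ~{extra_bytes / 1e6:.0f} MB")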
For senior machine‑learning engineers, especially those interviewing at research labs like DeepMind, understanding this view‑vs‑copy trap is a litmus test of systems‑level fluency. Efficient GPU utilization translates directly into faster training cycles and lower cloud expenses, which are critical in large‑scale model development. Moreover, recognizing when a tensor is non‑contiguous helps avoid ad‑hoc fixes such as torch.cuda.empty_cache(), which merely masks the symptom. Mastery of PyTorch’s memory model therefore strengthens both interview performance and production‑grade code reliability.