01 — The Problem

GPU memory hierarchy explained: the reason your 70B model inference underperforms the theoretical peak by 10× is almost never the ALUs. It is the memory system. The H100 SXM5 delivers 3,958 TFLOPS of FP8 compute — a figure that appears in every press release. What appears less often is that the memory bandwidth to feed those units at full utilization is 3.35 TB/s from HBM3, which works out to roughly 0.85 bytes per FLOP. For a matrix-vector multiply (the dominant operation in decode), you need approximately 2 bytes per FLOP of memory traffic. The arithmetic ceiling is not the problem. The memory wall is.

This shapes every architectural decision downstream: which attention variant fits a given sequence length, why speculative decoding works, why continuous batching changes the economics, and why H200’s bandwidth bump matters more than its FLOPs bump for most production inference loads.

02 — Five Levels, Five Latencies

The H100’s memory subsystem is a five-level hierarchy, each with a different capacity/bandwidth/latency tradeoff:

LevelCapacityBandwidthLatency
Registers256 KB / SM~100 TB/s~1 cycle
L1 / Shared mem228 KB / SM (configurable)~30 TB/s~30 cycles
L2 cache50 MB~10 TB/s~200 cycles
HBM380 GB3.35 TB/s~600 cycles
NVLink / PCIe900 GB/s / 64 GB/s~5 µs

The key insight: bandwidth drops by roughly 3× at each level boundary, while latency increases by 5–10×. A kernel that cannot hide memory latency with compute will stall.

03 — Roofline Analysis

The roofline model characterizes whether a kernel is compute-bound or memory-bound. For a given operation with arithmetic intensity II (FLOPs per byte), peak performance PP (FLOPs/s) is:

P=min(Ppeak, IBW)P = \min\left(P_{\text{peak}},\ I \cdot BW\right)

where BWBW is memory bandwidth.

The ridge point for H100 with HBM3 is:

Iridge=PpeakBW=3,958×10123.35×10121,180 FLOPs/byteI_{\text{ridge}} = \frac{P_{\text{peak}}}{BW} = \frac{3{,}958 \times 10^{12}}{3.35 \times 10^{12}} \approx 1{,}180\ \text{FLOPs/byte}

Any kernel with arithmetic intensity below 1,180 FLOPs/byte is memory-bound on H100. For comparison:

  • GEMM (large batch): ~200–1,000 FLOPs/byte depending on tile size
  • Attention (decode, single token): ~1–4 FLOPs/byte — deeply memory-bound
  • LayerNorm / elementwise ops: ~2–10 FLOPs/byte — memory-bound
  • Large dense GEMM (prefill): ~500–2,000 FLOPs/byte — can be compute-bound

This is why decode is fundamentally different from prefill. During decode you process one token against all KV cache entries — the operation is memory-bandwidth-bound. During prefill you process the entire prompt at once — sufficiently large batch sizes make it compute-bound.

04 — Shared Memory and Tiling

Kernels improve arithmetic intensity by tiling: loading a block of data into shared memory (L1, 228 KB on H100), reusing it across many operations before going back to HBM. A tiled GEMM with tile size T×TT \times T achieves:

Itiled=2T34T2=T2 FLOPs/byteI_{\text{tiled}} = \frac{2T^3}{4T^2} = \frac{T}{2}\ \text{FLOPs/byte}

For 32-bit floats. At T=128T = 128, I=64I = 64 FLOPs/byte — still memory-bound but ~64× better than an untiled kernel. At BF16 (T=128T = 128), it reaches ~128 FLOPs/byte. Still below the ridge point, but approaching it for large-batch prefill.

This is the logic behind FlashAttention. Rather than materializing the full N×NN \times N attention matrix in HBM — which requires O(N2)O(N^2) memory bandwidth — it tiles the computation entirely in shared memory, trading recomputation during the backward pass for dramatically reduced HBM traffic. The FlashAttention-2 paper shows this achieves 2–4× speedup for sequence lengths ≥ 1024.

# Simplified tiled attention (pseudocode, not production)
# Full implementation: https://github.com/Dao-AILab/flash-attention
import torch

def flash_attention_naive_tiled(Q, K, V, block_size=64):
    B, H, N, d = Q.shape
    O = torch.zeros_like(Q)
    L = torch.zeros(B, H, N, 1, device=Q.device)

    for i in range(0, N, block_size):
        q_block = Q[:, :, i:i+block_size]          # load from HBM once
        acc = torch.zeros_like(q_block)
        l_acc = torch.zeros(B, H, block_size, 1, device=Q.device)

        for j in range(0, N, block_size):
            k_block = K[:, :, j:j+block_size]       # tile K in SRAM
            v_block = V[:, :, j:j+block_size]       # tile V in SRAM
            s = torch.einsum('bhid,bhjd->bhij', q_block, k_block)
            s = s / (d ** 0.5)
            p = torch.exp(s - s.amax(dim=-1, keepdim=True))
            acc += torch.einsum('bhij,bhjd->bhid', p, v_block)
            l_acc += p.sum(dim=-1, keepdim=True)

        O[:, :, i:i+block_size] = acc / l_acc

    return O

05 — HBM Bandwidth and Inference Economics

For single-token decode, the operation per layer is a matrix-vector multiply: weight matrix WRdmodel×dffW \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}} against a single activation vector. For Llama-3-70B (dmodel=8192d_{\text{model}} = 8192, dff=28672d_{\text{ff}} = 28672), one feed-forward layer requires:

bytes=2×8192×28672=469 MB\text{bytes} = 2 \times 8192 \times 28672 = 469\ \text{MB}

at BF16. Across 80 layers (8B model) or 80 layers (70B model), the total weight traffic per token is in the range of 140 GB for a 70B model. At H100 HBM bandwidth of 3.35 TB/s, the minimum time per token is:

tmin=140 GB3.35 TB/s42 mst_{\text{min}} = \frac{140\ \text{GB}}{3.35\ \text{TB/s}} \approx 42\ \text{ms}

That is ~24 tokens/second maximum for single-batch inference, purely from memory bandwidth. This is why batching is critical: with batch size BB, the same weights serve BB tokens simultaneously, and throughput scales linearly until you hit a different bottleneck (KV cache memory, or eventually compute).

For a deeper treatment of the inference economics and hardware tradeoffs, see H100 vs H200 vs B200 TCO — the B200’s 8 TB/s bandwidth changes this picture significantly. For the interconnect implications in multi-GPU inference, see distributed interconnects.

References

  1. [1] Dao et al.. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness . NeurIPS, 2022. arXiv:2205.14135.
  2. [2] Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning . ICLR, 2024. arXiv:2307.08691.
  3. [3] Williams et al.. Roofline: An Insightful Visual Performance Model for Multicore Architectures . Communications of the ACM, 2009. arXiv:2212.09561.

BibTeX

@article{fp4-2606001,
  title   = {GPU Memory Hierarchy and Kernel Performance},
  author  = {fp4 editorial desk},
  year    = {2026},
  url     = {https://fp4.dev/silicon/gpu-memory-hierarchy/},
  journal = {fp4}
}