GPU Memory Hierarchy and Kernel Performance

01 — The Problem

The reason your 70B model inference underperforms the theoretical peak by 10× is almost never the ALUs. It is the memory system. The H100 SXM5 delivers 3,958 TFLOPS of FP8 compute (NVIDIA's with-sparsity figure), a number that appears in every press release. What appears less often is that the HBM3 bandwidth available to feed those units is 3.35 TB/s, which works out to roughly 0.00085 bytes per FLOP. For a matrix-vector multiply (the dominant operation in decode), each weight is read once and used for a single multiply-add, so you need roughly 1 byte of memory traffic per FLOP at BF16. That is about a thousand times more than the hardware can supply at peak compute. The arithmetic ceiling is not the problem. The memory wall is.

This shapes every architectural decision downstream: which attention variant fits a given sequence length, why speculative decoding works, why continuous batching changes the economics, and why H200’s bandwidth bump matters more than its FLOPs bump for most production inference loads.

02 — Five Levels, Five Latencies

The H100’s memory subsystem is a five-level hierarchy, each with a different capacity/bandwidth/latency tradeoff:

Level	Capacity	Bandwidth	Latency
Registers	256 KB / SM	~100 TB/s	~1 cycle
L1 / Shared mem	228 KB / SM (configurable)	~30 TB/s	~30 cycles
L2 cache	50 MB	~10 TB/s	~200 cycles
HBM3	80 GB	3.35 TB/s	~600 cycles
NVLink / PCIe	—	900 GB/s / 64 GB/s	~5 µs

The key insight: bandwidth drops by roughly 3× at each level boundary, while latency climbs by anywhere from 3× to 30× per step in the table above. A kernel that cannot hide memory latency behind compute or other in-flight memory requests will stall.
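
To make "hiding latency" concrete, here is a minimal sketch using Little's law (concurrency = throughput × latency) to estimate how many bytes must be in flight to keep HBM saturated. The bandwidth and latency come from the table; the 1.7 GHz clock and the 132-SM count are assumptions made for illustration, not figures from this article.

```python
def bytes_in_flight(bandwidth_bytes_per_s: float,
                    latency_cycles: float,
                    clock_hz: float) -> float:
    """Little's law: bytes outstanding = bandwidth x latency (in seconds)."""
    latency_s = latency_cycles / clock_hz
    return bandwidth_bytes_per_s * latency_s


HBM_BW = 3.35e12       # bytes/s, from the table
HBM_LATENCY = 600      # cycles, from the table
CLOCK_HZ = 1.7e9       # ASSUMPTION: SM clock used to convert cycles to seconds
NUM_SMS = 132          # ASSUMPTION: SM count for H100 SXM5

total = bytes_in_flight(HBM_BW, HBM_LATENCY, CLOCK_HZ)
per_sm = total / NUM_SMS
print(f"{total / 1e6:.2f} MB in flight, {per_sm / 1e3:.1f} KB per SM")
```

Under these assumptions roughly 1.2 MB of loads must be outstanding at any moment, or about 9 KB per SM. A kernel with too few resident warps or too few independent loads per thread cannot reach that and leaves bandwidth unused.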

03 — Roofline Analysis

The roofline model characterizes whether a kernel is compute-bound or memory-bound. For a given operation with arithmetic intensity $I$ (FLOPs per byte), the attainable performance $P$ (FLOPs/s) is:

P = \min\left(P_{\text{peak}},\ I \cdot BW\right)

where $BW$ is memory bandwidth.

The ridge point for H100 with HBM3 is:

I_{\text{ridge}} = \frac{P_{\text{peak}}}{BW} = \frac{3{,}958 \times 10^{12}}{3.35 \times 10^{12}} \approx 1{,}180\ \text{FLOPs/byte}

Any kernel with arithmetic intensity below 1,180 FLOPs/byte is memory-bound on H100. For comparison:

GEMM (large batch): ~200–1,000 FLOPs/byte depending on tile size
Attention (decode, single token): ~1–4 FLOPs/byte — deeply memory-bound
LayerNorm / elementwise ops: ~2–10 FLOPs/byte — memory-bound
Large dense GEMM (prefill): ~500–2,000 FLOPs/byte — can be compute-bound

This is why decode is fundamentally different from prefill. During decode you process one token against all KV cache entries — the operation is memory-bandwidth-bound. During prefill you process the entire prompt at once — sufficiently large batch sizes make it compute-bound.
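
The roofline formula above is simple enough to evaluate directly. The sketch below uses the peak and bandwidth figures quoted in this article; the per-kernel intensities are illustrative picks from inside the ranges listed above, not measurements.

```python
P_PEAK = 3958e12   # FLOPs/s, FP8 peak quoted in this article
BW = 3.35e12       # bytes/s, HBM3 bandwidth


def attainable_flops(intensity: float, p_peak: float = P_PEAK, bw: float = BW) -> float:
    """Roofline: P = min(P_peak, I * BW)."""
    return min(p_peak, intensity * bw)


def classify(intensity: float, p_peak: float = P_PEAK, bw: float = BW) -> str:
    """Compare the intensity against the ridge point P_peak / BW."""
    ridge = p_peak / bw
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Illustrative intensities (FLOPs/byte), chosen from the ranges in the list above.
kernels = {
    "attention decode": 2.0,
    "layernorm": 5.0,
    "large-batch GEMM": 600.0,
    "prefill GEMM": 2000.0,
}

for name, intensity in kernels.items():
    fraction = attainable_flops(intensity) / P_PEAK
    print(f"{name:18s} I={intensity:7.1f}  {classify(intensity):13s} "
          f"{100 * fraction:6.2f}% of peak")
```

At 2 FLOPs/byte the attainable rate is 6.7 TFLOPS, well under 1% of the quoted peak, which is the gap the decode discussion above describes.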

04 — Shared Memory and Tiling

Kernels improve arithmetic intensity by tiling: loading a block of data into shared memory (L1, up to 228 KB per SM on H100) and reusing it across many operations before going back to HBM. A tiled GEMM with tile size $T \times T$ achieves:

I_{\text{tiled}} = \frac{2T^3}{4T^2} = \frac{T}{2}\ \text{FLOPs/byte}

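This is for 32-bit floats, counting $T^2$ four-byte elements loaded per $2T^3$ FLOPs; counting both operand tiles halves the figure. At $T = 128$, $I = 64$ FLOPs/byte: still memory-bound, but far better than an untiled kernel, which reloads its operands for every multiply-add and sits below 1 FLOP/byte. At BF16 ($T = 128$), the same accounting gives ~128 FLOPs/byte. That is still below the ridge point, though large-batch prefill GEMMs get closer to it.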
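
A small sketch of the tiling arithmetic, following the article's accounting of $2T^3$ FLOPs per $T^2$ loaded elements. The `tiles_loaded` parameter is a hypothetical knob added here to show how the estimate changes when both operand tiles are counted.

```python
def tiled_intensity(tile: int, bytes_per_element: int, tiles_loaded: int = 1) -> float:
    """FLOPs per byte for one T x T x T tile multiply.

    tiles_loaded=1 matches the formula in the text; tiles_loaded=2 counts
    both operand tiles (hypothetical parameter, for comparison only).
    """
    flops = 2 * tile ** 3
    bytes_moved = tiles_loaded * bytes_per_element * tile ** 2
    return flops / bytes_moved


for label, nbytes in (("FP32", 4), ("BF16", 2)):
    for t in (32, 64, 128):
        print(f"{label} T={t:3d}: {tiled_intensity(t, nbytes):6.1f} FLOPs/byte "
              f"(both operands: {tiled_intensity(t, nbytes, tiles_loaded=2):6.1f})")
```

Intensity grows linearly with the tile edge, so the tile size is bounded in practice by how much shared memory and register space each SM can give to one thread block.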

This is the logic behind FlashAttention. Rather than materializing the full $N \times N$ attention matrix in HBM, which costs $O(N^2)$ HBM reads and writes, it computes attention tile by tile in shared memory and recomputes the attention matrix during the backward pass instead of storing it, which sharply reduces HBM traffic. The FlashAttention-2 paper [2] reports that this approach achieves a 2–4× speedup for sequence lengths ≥ 1024.

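# Simplified tiled attention in PyTorch (illustrative, not production).
# It mirrors the FlashAttention loop structure and its online softmax; a real
# kernel pins each tile in on-chip SRAM, which eager PyTorch does not control.
# Full implementation: https://github.com/Dao-AILab/flash-attention
import torch

def flash_attention_naive_tiled(Q, K, V, block_size=64):
    B, H, N, d = Q.shape
    O = torch.zeros_like(Q)
    scale = d ** -0.5

    for i in range(0, N, block_size):
        q_block = Q[:, :, i:i+block_size]          # load Q tile from HBM once
        rows = q_block.shape[2]                    # last tile may be shorter
        acc = torch.zeros_like(q_block)            # unnormalized output
        l_acc = Q.new_zeros((B, H, rows, 1))       # running softmax denominator
        m_acc = Q.new_full((B, H, rows, 1), float('-inf'))  # running row max

        for j in range(0, N, block_size):
            k_block = K[:, :, j:j+block_size]       # tile K in SRAM
            v_block = V[:, :, j:j+block_size]       # tile V in SRAM
            s = torch.einsum('bhid,bhjd->bhij', q_block, k_block) * scale
            m_new = torch.maximum(m_acc, s.amax(dim=-1, keepdim=True))
            alpha = torch.exp(m_acc - m_new)        # rescale earlier partial sums
            p = torch.exp(s - m_new)
            acc = acc * alpha + torch.einsum('bhij,bhjd->bhid', p, v_block)
            l_acc = l_acc * alpha + p.sum(dim=-1, keepdim=True)
            m_acc = m_new

        O[:, :, i:i+block_size] = acc / l_acc      # normalize once per Q tile

    return O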
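
Processing K and V one tile at a time only gives the exact softmax result if partial sums from earlier tiles are rescaled whenever a larger score appears. The sketch below isolates that step (often called online softmax) for a single query row with scalar values, in plain Python so it runs without a GPU. The function names are hypothetical and chosen for this illustration.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Same result, but seeing only one block of scores/values at a time."""
    m = float("-inf")   # running max of all scores seen so far
    l = 0.0             # running softmax denominator, relative to m
    acc = 0.0           # running unnormalized output, relative to m
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        alpha = math.exp(m - m_new)          # 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_blk]
        acc = acc * alpha + sum(pi * vi for pi, vi in zip(p, v_blk))
        l = l * alpha + sum(p)
        m = m_new
    return acc / l


scores = [1.0, 3.0, -2.0, 5.0, 0.5]
values = [0.1, 0.2, 0.3, 0.4, 0.5]
full = softmax_weighted_sum(scores, values)
tiled = online_softmax_weighted_sum(scores, values, block_size=2)
print(full, tiled, math.isclose(full, tiled, rel_tol=1e-9))
```

The two results agree up to floating-point rounding for any block size, which is what lets a tiled kernel avoid ever holding a full row of the attention matrix.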

05 — HBM Bandwidth and Inference Economics

For single-token decode, the dominant operation in each layer is a matrix-vector multiply: a weight matrix $W \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ against a single activation vector. For Llama-3-70B ($d_{\text{model}} = 8192$, $d_{\text{ff}} = 28672$), one such feed-forward projection matrix requires:

\text{bytes} = 2 \times 8192 \times 28672 \approx 470\ \text{MB}

at BF16. Summed over every weight matrix in every layer (32 layers for the 8B model, 80 layers for the 70B model), the total weight traffic per token is about 140 GB for a 70B model, which is simply $70 \times 10^9$ parameters at 2 bytes each. At H100 HBM bandwidth of 3.35 TB/s, the minimum time per token is:

t_{\text{min}} = \frac{140\ \text{GB}}{3.35\ \text{TB/s}} \approx 42\ \text{ms}

That is ~24 tokens/second maximum for single-batch inference, purely from memory bandwidth. This is why batching is critical: with batch size $B$ , the same weights serve $B$ tokens simultaneously, and throughput scales linearly until you hit a different bottleneck (KV cache memory, or eventually compute).
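
The bandwidth bound above can be reproduced in a few lines. This is a rough sketch that counts weight traffic only: it ignores KV cache reads, activations, and the point at which compute becomes the limit, so the linear scaling with batch size holds only below those bottlenecks. The function names are hypothetical.

```python
def matrix_bytes(rows: int, cols: int, bytes_per_param: int) -> int:
    """Bytes needed to read one weight matrix once."""
    return rows * cols * bytes_per_param


def decode_step_seconds(n_params: float, bytes_per_param: int, bw_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight is read once from HBM."""
    return n_params * bytes_per_param / bw_bytes_per_s


def tokens_per_second(batch: int, n_params: float, bytes_per_param: int,
                      bw_bytes_per_s: float) -> float:
    """One step yields `batch` tokens because the weights are shared across the batch."""
    return batch / decode_step_seconds(n_params, bytes_per_param, bw_bytes_per_s)


HBM_BW = 3.35e12    # bytes/s, H100 HBM3
N_PARAMS = 70e9     # 70B model
BF16 = 2            # bytes per parameter

print(f"one FFN projection: {matrix_bytes(8192, 28672, BF16) / 1e6:.0f} MB")
print(f"step time: {1e3 * decode_step_seconds(N_PARAMS, BF16, HBM_BW):.1f} ms")
for batch in (1, 8, 64):
    print(f"batch {batch:3d}: {tokens_per_second(batch, N_PARAMS, BF16, HBM_BW):7.1f} tokens/s")
```

Swapping in a different bandwidth figure shows directly why a bandwidth increase moves decode throughput more than a FLOPs increase does.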

For a deeper treatment of the inference economics and hardware tradeoffs, see H100 vs H200 vs B200 TCO — the B200’s 8 TB/s bandwidth changes this picture significantly. For the interconnect implications in multi-GPU inference, see distributed interconnects.

References

[1] Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS, 2022. arXiv:2205.14135.
[2] Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR, 2024. arXiv:2307.08691.
[3] Williams et al. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009. arXiv:2212.09561.

BibTeX

@article{fp4-2606001,
  title   = {GPU Memory Hierarchy and Kernel Performance},
  author  = {fp4 editorial desk},
  year    = {2026},
  url     = {https://fp4.dev/silicon/gpu-memory-hierarchy/},
  journal = {fp4}
}