The engineering layer of AI.
Deep technical writing on LLM, GPU, and ML systems internals. Decoded from silicon to system to algorithm — for the engineers who already know what RAG is.
The hardware that runs the model.
HBM hierarchy, NVLink topology, tensor core generations, CUDA streams, kernel-level reality.
The infra that serves the model.
KV cache, paged attention, continuous batching, parallelism strategies, NCCL collectives.
The math that is the model.
Attention variants, MoE routing, quantization, alignment, long context, positional encoding.
H100 vs H200 vs B200: TCO for Inference Infrastructure
Beyond the spec sheet: deriving actual cost per million tokens for each generation, accounting for memory capacity, bandwidth, rack power, and cooling — the numbers that determine your infrastructure decision.
Intra-node vs Inter-node Interconnects in Distributed Training
NVLink, NVSwitch, InfiniBand, and RoCE — the bandwidth and latency numbers that determine whether your distributed training job scales or stalls.
GPU Memory Hierarchy and Kernel Performance
Why memory bandwidth — not FLOPs — is the binding constraint for most LLM workloads, and how H100's five-level hierarchy determines what your kernels can actually achieve.
We decode AI one layer at a time. Silicon tells you what the hardware can do. System tells you how inference actually runs. Algorithm tells you why the math works. All three, in depth, without the vendor gloss.
Written for the engineers building inference infrastructure — not the engineers explaining what inference is. Dense. Verifiable. No filler.