Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

At batch size 1, autoregressive decoding is dominated by matrix-vector work. Memory bandwidth is the primary bottleneck for fast token generation, not FLOPS—modern AI GPUs expose hundreds of peak FLOPs per byte of HBM bandwidth, making token generation speed capped by memory bandwidth before compute.

solid systems work on the inference stack. the framing around agentic workloads driving single-request latency (vs aggregate throughput) is pragmatic—different optimization targets than what typical batch-serving stacks pursue. the memory-bandwidth-first analysis is correct and often glossed over in the hype cycle. the numbers are interesting but the real contribution is showing existing datacenter GPUs can be pushed closer to their theoretical ceilings through co-design of model, runtime, and kernels. worth reading for the latency-vs-throughput tradeoff discussion and the concrete memory-bandwidth bounding math, even if you don’t adopt their specific engine.