sicutdeux@blog:~/links$cat llms-are-complicated-now.md
LLMs Are Complicated Now
---
source_url:
source_name:
ianbarber.blog
published:
2026-06-23
status:
published
---
The complexity came from the tension between the need to continually increase capabilities and the need to stay efficient, particularly for inference. If you want to swap attention variant A for variant B, you can afford for B to be ten percent slower. You probably can't afford for it to be an order-of-magnitude worse.
barber traces how llms went from clean transformer stacks to baroque collections of specialized attention variants, moe routing, and cross-gpu communication ops. the parallel to recsys is instructive: once performance becomes load-bearing rather than optional, you can’t iterate on novel architectures without expensive hand-fused baselines. the argument for composable designs (flexattention as exemplar) cuts against both naive pure-python approaches and handwavy “agents will optimize this” thinking. worth reading if you’ve wondered why modern model development involves so much kernel work.