LLMs Are Complicated Now

---

source_url: https://ianbarber.blog/2026/06/19/llms-are-complicated-now/
source_name: ianbarber.blog
published: 2026-06-23
status: published

---

The complexity came from the tension between the need to continually increase capabilities and the need to stay efficient, particularly for inference. If you want to swap attention variant A for variant B, you can afford for B to be ten percent slower. You probably can't afford for it to be an order-of-magnitude worse.

barber traces how llms went from clean transformer stacks to baroque collections of specialized attention variants, moe routing, and cross-gpu communication ops. the parallel to recsys is instructive: once performance becomes load-bearing rather than optional, you can’t iterate on novel architectures without expensive hand-fused baselines. the argument for composable designs (flexattention as exemplar) cuts against both naive pure-python approaches and handwavy “agents will optimize this” thinking. worth reading if you’ve wondered why modern model development involves so much kernel work.
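
to make the composability point concrete, here is a minimal pure-python sketch. it assumes a single head and plain lists. the names `attention`, `score_mod`, `causal`, and `sliding_window` are hypothetical, only loosely modeled on the flexattention idea of a per-score hook, and this is not the flexattention api. the shape is the point: one reference loop, with each variant written as a small function rather than its own hand-fused kernel.

```python
import math


def attention(q, k, v, score_mod=None):
    """Reference single-head attention over lists of vectors.

    score_mod is a hypothetical hook: it receives the scaled score plus the
    query and key positions, and returns the score to use instead.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            s = scale * sum(a * b for a, b in zip(q_vec, k_vec))
            if score_mod is not None:
                s = score_mod(s, q_idx, kv_idx)
            scores.append(s)
        # numerically stable softmax; assumes at least one unmasked key per query
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_vec[d] for w, v_vec in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def causal(score, q_idx, kv_idx):
    # a query may only attend to keys at or before its own position
    return score if kv_idx <= q_idx else -math.inf


def sliding_window(width):
    # a query attends to itself and the (width - 1) keys before it
    def mod(score, q_idx, kv_idx):
        return score if 0 <= q_idx - kv_idx < width else -math.inf
    return mod


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = attention(tokens, tokens, tokens)
masked = attention(tokens, tokens, tokens, score_mod=causal)
local = attention(tokens, tokens, tokens, score_mod=sliding_window(2))
```

swapping variant a for variant b here is a one-argument change. the catch barber describes is that a loop like this is orders of magnitude too slow to be a real baseline. the composable approach keeps this interface and has a compiler generate the fused kernel behind it.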