In work published at CVPR 2025, researchers Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and project lead Zhuang Liu challenge the conventional belief that normalization layers are necessary in Transformers. Their paper introduces Dynamic Tanh (DyT), a simple yet effective drop-in replacement for traditional normalization methods such as Layer Norm and RMSNorm, without sacrificing performance.
Transformers equipped with DyT match or exceed their normalized counterparts across a wide range of tasks and architectures, largely without extra hyperparameter tuning: supervised vision (ViT, ConvNeXt), self-supervised vision (MAE, DINO), diffusion models (DiT), large language models (LLaMA), speech self-supervised learning (wav2vec 2.0), and DNA sequence modeling (HyenaDNA, Caduceus).
The key insight behind DyT is an empirical observation about layer normalization itself: in trained Transformers, its input-output mappings closely resemble scaled tanh curves. In early layers these mappings are mostly linear, but deeper in the network they take on the distinct S-shape characteristic of a tanh function. DyT captures this behavior directly as DyT(x) = tanh(αx) with a learnable scalar α, followed by the usual learnable per-channel scale and shift.
DyT can be integrated into existing PyTorch Transformer models with just a few lines of code; a reference implementation is provided in the authors' GitHub repository (https://github.com/jiachenzhu/DyT). For further details, readers are encouraged to download the full paper from arXiv (https://arxiv.org/abs/2503.10622) or check out a concise summary on X (https://x.com/liuzhuang1234/status/1900370738588135805).
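Following the formulation above, a minimal PyTorch sketch of a DyT layer might look like the following. This is an illustration rather than the authors' exact code (their repository is the authoritative reference); the default α initialization of 0.5 follows the paper's recommendation.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: y = weight * tanh(alpha * x) + bias.

    A drop-in replacement for LayerNorm/RMSNorm. `alpha` is a single
    learnable scalar; `weight` and `bias` are per-channel affine
    parameters, as in LayerNorm.
    """

    def __init__(self, num_features: int, alpha_init_value: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash activations with a tanh whose steepness is learned,
        # then apply the usual affine transform over the last dimension.
        return torch.tanh(self.alpha * x) * self.weight + self.bias
```

Swapping normalization out of a model then amounts to replacing each `nn.LayerNorm(dim)` with `DyT(dim)` in the model definition.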
The paper’s BibTeX entry for citation purposes is given below:
@inproceedings{Zhu2025DyT,
  title     = {Transformers without Normalization},
  author    = {Zhu, Jiachen and Chen, Xinlei and He, Kaiming and LeCun, Yann and Liu, Zhuang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}
Correspondence regarding this study can be directed to jiachen [dot] zhu [at] nyu [dot] edu or zhuangl [at] princeton [dot] edu.