Published on Nov 1, 2024
Written by Sunny Sanyal¹
Joint work with Ravid Shwartz-Ziv², Alex Dimakis¹, and Sujay Sanghavi¹
UT Austin¹ and New York University²
Paper: https://arxiv.org/abs/2404.08634
Code: https://github.com/sanyalsunny111/LLM-Inheritune
Slides: https://docs.google.com/presentation/d/1RIdDTbIR14P9cH75w41AO5LIWlLeMx7ISpqjvjtpgis/edit?usp=sharing
TL;DR: We observe that in decoder-style LLMs (built from vanilla transformer blocks with multi-headed attention), many attention matrices in the deeper layers degenerate to rank-1, and a significant number of these are essentially single-column attention matrices. To address this, we propose a simple recipe: remove the deeper layers with degenerated attention and progressively grow the model back in a stacking-like manner. This lets us train smaller yet equally performant language models compared to their larger, less efficient counterparts.
Attention Matrices Degenerate to a Single Column in Pre-trained Decoder-Style GPT-2 Models
We computed the layer-wise rank and mass of the attention matrices across all attention heads in GPT-2 medium and GPT-2 large, as shown in Fig. 1. It is evident that the attention matrices of many deeper layers focus on only a single column. We refer to this phenomenon as attention degeneration.
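To make the measurement concrete, here is a minimal sketch (not the exact script used for the paper) of how one might probe per-head attention rank and a single-column "mass" proxy in a pre-trained GPT-2 medium model with Hugging Face transformers; the prompt and the mass proxy below are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch: per-layer attention rank and top-column mass for GPT-2 medium.
# The prompt and the "mass" proxy are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2-medium")

text = "Attention degeneration in deeper transformer layers is easy to visualize."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape (batch, num_heads, seq_len, seq_len)
for layer_idx, attn in enumerate(out.attentions):
    attn = attn[0]                                      # drop the batch dim
    ranks = [torch.linalg.matrix_rank(a).item() for a in attn]
    # Proxy "mass": fraction of total attention landing on the single
    # most-attended key column, averaged over heads (an assumption,
    # not necessarily the paper's exact metric).
    col_mass = attn.sum(dim=1)                          # (heads, seq_len)
    top_mass = (col_mass.max(dim=-1).values / col_mass.sum(dim=-1)).mean()
    print(f"layer {layer_idx:2d} | mean rank {sum(ranks) / len(ranks):5.1f} "
          f"| top-column mass {top_mass:.2f}")
```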
Now let's visualize some of the attention matrices of a pre-trained GPT-2 medium model (refer to Fig. 2).
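For a quick visual check of your own, the short sketch below (continuing from the variables in the previous snippet) plots a single attention map with matplotlib; the layer and head indices are arbitrary choices for illustration, not the ones shown in the paper's figures.

```python
# Sketch: eyeball one attention map from the outputs computed above.
import matplotlib.pyplot as plt

layer, head = 20, 0                      # a deeper layer, chosen arbitrarily
attn_map = out.attentions[layer][0, head].numpy()

plt.imshow(attn_map, cmap="viridis")
plt.xlabel("key position")
plt.ylabel("query position")
plt.title(f"GPT-2 medium, layer {layer}, head {head}")
plt.colorbar()
plt.show()
```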
Inheritune: Cut Off Degenerated Layers and Retrain Stagewise
We explain our method, Inheritune, with the example of a 36-layer GPT-2 large model.
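To make the recipe concrete, below is a minimal sketch of the two ingredients, inheriting the first k layers of the 36-layer GPT-2 large and then growing the smaller model by stacking, written against Hugging Face's GPT-2 classes. The stage sizes (k=18, extra=6) and the omitted training loops are placeholder assumptions, not the exact schedule from the paper.

```python
# Sketch of the inherit-then-grow idea, assuming Hugging Face GPT-2 modules.
# Stage sizes and training loops are placeholders, not the paper's recipe.
import copy
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Config

reference = GPT2LMHeadModel.from_pretrained("gpt2-large")   # 36 layers

def inherit_first_k(ref_model, k):
    """Build a k-layer GPT-2 that reuses the reference model's early blocks,
    embeddings, and final layer norm."""
    cfg = GPT2Config.from_pretrained("gpt2-large", n_layer=k)
    small = GPT2LMHeadModel(cfg)
    small.transformer.wte = copy.deepcopy(ref_model.transformer.wte)
    small.transformer.wpe = copy.deepcopy(ref_model.transformer.wpe)
    small.transformer.h = nn.ModuleList(
        copy.deepcopy(ref_model.transformer.h[:k])
    )
    small.transformer.ln_f = copy.deepcopy(ref_model.transformer.ln_f)
    small.tie_weights()          # re-tie lm_head to the inherited embeddings
    return small

def grow_by_stacking(model, extra):
    """Stagewise growth: append copies of the current top blocks."""
    blocks = list(model.transformer.h)
    blocks += [copy.deepcopy(b) for b in blocks[-extra:]]
    model.transformer.h = nn.ModuleList(blocks)
    model.config.n_layer = len(blocks)
    return model

# Stage 1: inherit the first 18 of 36 layers, then retrain.
student = inherit_first_k(reference, k=18)
# ... train `student` on the pre-training data ...

# Stage 2: grow by stacking a few more blocks, then continue training.
student = grow_by_stacking(student, extra=6)
# ... continue training ...
```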
Main Results
It turns out that we can train much smaller models once we address the attention degeneration issue (refer to Fig. 9 and the other results below).
Further Reading
Paper 1: https://arxiv.org/abs/2406.04267
Paper 2: https://proceedings.mlr.press/v119/bhojanapalli20a.html
Paper 3: https://arxiv.org/abs/2404.07647
Bibtex
@article{sanyal2024inheritunetrainingsmallerattentive,
title={Inheritune: Training Smaller Yet More Attentive Language Models},
author={Sunny Sanyal and Ravid Shwartz-Ziv and Alexandros G. Dimakis and Sujay Sanghavi},
year={2024},
eprint={2404.08634},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2404.08634},
}