
Inheritune: Training Smaller Yet More Attentive Language Models

Published on November 1, 2024

Written by Sunny Sanyal1
Joint work with Ravid Shwartz-Ziv2, Alex Dimakis1 and Sujay Sanghavi1
UT Austin1 and New York University2

Paper: https://arxiv.org/abs/2404.08634

Code: https://github.com/sanyalsunny111/LLM-Inheritune

Slides: https://docs.google.com/presentation/d/1RIdDTbIR14P9cH75w41AO5LIWlLeMx7ISpqjvjtpgis/edit?usp=sharing

TL;DR: We observe that in decoder-style LLMs (built from vanilla transformer blocks with multi-head attention), many attention matrices in the deeper layers degenerate to rank-1, and a significant fraction of these are essentially single-column attention matrices. To address this, we propose removing the deeper layers with degenerate attention and then progressively growing the model in a stacking-like manner. This lets us train smaller yet equally performant language models compared to their larger, less efficient counterparts.
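To make "rank-1, single-column" concrete, here is a toy example of ours (not from the paper, and ignoring the causal mask): if every query row of a row-stochastic attention matrix places nearly all of its weight on the same key position, then every column is a scalar multiple of the all-ones vector, so the matrix has rank 1.

# Toy illustration (not from the paper, causal mask ignored): every query
# attends almost entirely to key position 0, so each column is a multiple of
# the all-ones vector and the attention matrix is exactly rank 1.
import torch

seq_len = 6
attn = torch.full((seq_len, seq_len), 0.01)   # small uniform weight everywhere
attn[:, 0] = 1.0 - 0.01 * (seq_len - 1)       # dominant first column
print(attn.sum(dim=1))                        # each row sums to 1
print(torch.linalg.matrix_rank(attn))         # tensor(1)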

Attention Matrices Degenerate to a Single Column in Pre-trained Decoder-Style GPT-2 Models

We computed the layerwise rank and mass of the attention matrices across all attention heads in both GPT-2 Medium and GPT-2 Large, as shown in Fig. 1. It is evident that the attention matrices in many deeper layers focus almost entirely on a single column. We refer to this phenomenon as attention degeneration.
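For readers who want to reproduce a rough version of this analysis, below is a minimal sketch that loads GPT-2 Medium from Hugging Face transformers, extracts per-head attention maps, and reports an approximate rank together with the fraction of attention mass received by the most-attended key column. The rank tolerance and the "top-column mass" proxy are our illustrative choices, not necessarily the paper's exact definitions.

# A minimal sketch (not the paper's exact analysis code): inspect layerwise
# rank and single-column mass of attention matrices in GPT-2 Medium.
# Requires: torch, transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium", output_attentions=True)
model.eval()

inputs = tokenizer("Attention degeneration in deeper transformer layers.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer
for layer_idx, attn in enumerate(outputs.attentions):
    attn = attn[0]                          # (heads, seq, seq); rows sum to 1
    ranks, masses = [], []
    for head in attn:
        ranks.append(torch.linalg.matrix_rank(head, atol=1e-3).item())
        col_mass = head.sum(dim=0)          # attention received per key column
        masses.append((col_mass.max() / col_mass.sum()).item())
    print(f"layer {layer_idx:2d}: mean rank {sum(ranks)/len(ranks):6.1f}, "
          f"mean top-column mass {sum(masses)/len(masses):.2f}")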

Now let's visualize some of the attention matrices of a pre-trained GPT-2 Medium model (see Fig. 2).
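Heatmaps like those in Fig. 2 can be approximated with a few lines of matplotlib. The snippet below continues the sketch above and reuses its outputs object; the layer and head indices are arbitrary choices for illustration.

# Continuation of the previous sketch: plot a few attention heads from a
# deeper layer of GPT-2 Medium as heatmaps (requires matplotlib).
import matplotlib.pyplot as plt

layer_idx = 20                                   # an arbitrary deeper layer
attn = outputs.attentions[layer_idx][0]          # (heads, seq, seq)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for head_idx, ax in enumerate(axes):
    ax.imshow(attn[head_idx].numpy(), cmap="viridis")
    ax.set_title(f"layer {layer_idx}, head {head_idx}")
    ax.set_xlabel("key position")
    ax.set_ylabel("query position")
plt.tight_layout()
plt.show()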

Inheritune: Cut Off Degenerated Layers and Retrain Stagewise

We explain our method, Inheritune, using a 36-layer GPT-2 Large model as an example: we remove the deeper layers whose attention has degenerated, retrain the resulting smaller model, and then progressively grow it in a stacking-like manner, retraining at each stage.
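The sketch below captures this recipe at a high level, assuming the Hugging Face GPT-2 implementation. The helper inherit_first_k_layers, the choice of k, and the growth schedule are illustrative placeholders rather than the paper's exact settings; see the paper and the repository for the real training details.

# A minimal sketch of an Inheritune-style recipe on a 36-layer GPT-2 Large:
# keep the first k transformer blocks, retrain, then progressively grow by
# appending blocks and retraining again. All hyperparameters are placeholders.
import copy
from transformers import GPT2LMHeadModel

def inherit_first_k_layers(reference_model, k):
    """Build a smaller GPT-2 that reuses the first k blocks of the reference."""
    small = copy.deepcopy(reference_model)
    small.transformer.h = small.transformer.h[:k]   # drop the deeper blocks
    small.config.n_layer = k
    return small

reference = GPT2LMHeadModel.from_pretrained("gpt2-large")   # 36 layers

# Stage 0: inherit the shallower half of the layers (placeholder choice of k).
model = inherit_first_k_layers(reference, k=18)
# train(model, ...)   # retrain the smaller model on the pre-training data

# Later stages: grow in a stacking-like manner by duplicating the current top
# blocks, then retrain after each growth step.
for extra in (4, 4):
    model.transformer.h.extend(copy.deepcopy(model.transformer.h[-extra:]))
    model.config.n_layer = len(model.transformer.h)
    # train(model, ...)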

Main Results

It turns out we can train much smaller models once we address the attention degeneration issue (see Fig. 9 and the other results below).

Further Reading


Paper 1: https://arxiv.org/abs/2406.04267

Paper 2: https://proceedings.mlr.press/v119/bhojanapalli20a.html

Paper 3: https://arxiv.org/abs/2404.07647


BibTeX

@article{sanyal2024inheritunetrainingsmallerattentive,
  title={Inheritune: Training Smaller Yet More Attentive Language Models},
  author={Sunny Sanyal and Ravid Shwartz-Ziv and Alexandros G. Dimakis and Sujay Sanghavi},
  year={2024},
  eprint={2404.08634},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2404.08634},
}
