The broad theme of my current work is efficient pre-training of Large Language Models (LLMs). I believe the pre-training of large-scale language models is poorly understood, and most research focuses on applications of these models. Moreover, only a handful of institutions with extremely large computing resources can pre-train LLMs of reasonably good performance. Through my research (sitting in academia with fewer resources), I want to democratize LLM pre-training by creating better models with far fewer resources. To this end, in [2] we show that LLMs can be trained much faster, without compromising generalization (validation loss, i.e., log perplexity), by averaging checkpoints sampled far apart along the training trajectory. We presented results with Pythia 1B, 2.8B, 6.9B, and 12B models, along with GPT2-small, GPT2-medium, and GPT2-large. Next, in [1] we empirically explored the question: how much performance can one achieve when pre-training an LLM of a few billion parameters with just 1 billion tokens? Here we provide insights into sample-efficient pre-training of small base language models.
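The sketch below illustrates the checkpoint-averaging idea from [2] in its simplest form, assuming PyTorch-style state dicts saved periodically during training; the function name, file paths, and checkpoint spacing are illustrative placeholders, not the exact recipe from the paper.

```python
import torch

def average_checkpoints(ckpt_paths):
    """Uniformly average the parameters of several saved model checkpoints."""
    avg_state = None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # Initialize the running sum with a float copy of the first checkpoint.
            avg_state = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # Divide by the number of checkpoints to get the uniform average.
    return {k: v / len(ckpt_paths) for k, v in avg_state.items()}

# Example: average checkpoints taken far apart along training (hypothetical paths).
paths = [f"ckpt_step_{s}.pt" for s in (10_000, 20_000, 30_000)]
averaged = average_checkpoints(paths)
# model.load_state_dict(averaged)  # evaluate validation loss with the averaged weights
```

The averaged weights are evaluated in place of the latest checkpoint; the choice of how far apart the checkpoints are sampled is a key knob studied in [2].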
- Pre-training Small Base LMs with Fewer Tokens
Sunny Sanyal, Sujay Sanghavi and Alex Dimakis.
Under Submission
[paper][code][blog]
- Early Weight Averaging Meets High Learning Rates for LLM Pre-training
Sunny Sanyal, Atula Tejaswi, Jean Kaddour, Abhishek Kumar, and Sujay Sanghavi.
Under Submission
[paper][code][blog]
Presented at the NeurIPS 2023 WANT workshop (OpenReview).
Featured in two popular newsletters: a) Ahead of AI, b) Interconnect.ai.