BabyLM is a recent NLP shared task (competition) for developing methods to train large language models (LLMs) on small datasets. This project will extend these methods to multilingual datasets.

School

Computer Science and Engineering

Research Area

Natural language processing | Deep learning

Research Theme: BabyLM provided a dataset of under 100M words, roughly the amount of linguistic input a 13-year-old human has been exposed to. The objective of BabyLM was two-fold: to provide a sandbox for developing new methods for sample-efficient pre-training of LLMs, and to compare learning in LLMs with what we know about human language acquisition.

Intuitively speaking, the question underlying BabyLM is: "If humans can learn a language from a relatively small amount of linguistic input, how far can LLMs get with a dataset of a similar size?" With its multilingual focus, this project extends the question to: "If bilingual humans can learn two languages from a similarly small amount of input, how far can LLMs get with datasets of similar sizes in the two languages?"

Datasets: We will be using BabyLM's dataset as the base English dataset. 
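As a rough illustration of the data budget, the following sketch counts whitespace-separated word tokens in a local copy of the BabyLM training split. The directory path and the .train file extension are assumptions; adjust them to match the downloaded data.

```python
# Minimal sketch: check that a local copy of the BabyLM training split stays
# under the ~100M-word budget. The directory path and file extension are
# placeholders; adjust them to wherever/however the dataset was downloaded.
from pathlib import Path

DATA_DIR = Path("babylm_data/train")  # hypothetical local path

total_words = 0
for file in sorted(DATA_DIR.glob("*.train")):  # extension assumed
    with file.open(encoding="utf-8") as f:
        n = sum(len(line.split()) for line in f)
    print(f"{file.name}: {n:,} words")
    total_words += n

print(f"Total: {total_words:,} words (budget: <100M)")
```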

Programming language: The research will involve coding in Python using popular NLP/deep learning libraries.
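To give a concrete picture of the workflow, below is a minimal sketch of pre-training a small causal LM from scratch with the Hugging Face transformers and datasets libraries. The data path, model size, and hyperparameters are illustrative placeholders, not the project's final configuration.

```python
# Minimal sketch: pre-train a small GPT-2-style model from scratch.
# Paths, model size, and hyperparameters are illustrative placeholders only.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in; a custom tokenizer could be trained instead
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical path to the plain-text training files.
raw = load_dataset("text", data_files={"train": "babylm_data/train/*.train"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Deliberately small architecture; the model is randomly initialised,
# i.e. trained from scratch rather than fine-tuned.
config = GPT2Config(vocab_size=len(tokenizer), n_layer=6, n_head=8, n_embd=512)
model = GPT2LMHeadModel(config)

args = TrainingArguments(
    output_dir="babylm-small",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-4,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```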

The student will be closely supported by the supervisor throughout the project. Broadly, the research activities will involve:

  • Creating a multilingual dataset of a similar scale (a data-preparation sketch follows the note below)
  • Examining reported pre-training methods on the multilingual dataset
  • (Stretch) Extending one of these methods to the multilingual setting

Note: For the purposes of this project, 'multilinguality' means languages closely related to English.
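As one possible starting point for the dataset-creation activity, the sketch below subsamples a second-language corpus to a fixed word budget and trains a shared byte-level BPE tokenizer over the English + second-language mixture with the Hugging Face tokenizers library. The choice of Dutch, the file paths, and the word budget are illustrative assumptions only.

```python
# Minimal sketch: subsample a second-language corpus to a fixed word budget
# and train a shared byte-level BPE tokenizer on the English + L2 mixture.
# Language choice (Dutch), paths, and budget are illustrative assumptions.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

WORD_BUDGET = 10_000_000  # e.g. a 10M-word slice per language

def subsample(src: Path, dst: Path, budget: int) -> int:
    """Copy lines from src to dst until roughly `budget` words are reached."""
    words = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if words >= budget:
                break
            fout.write(line)
            words += len(line.split())
    return words

Path("mix").mkdir(exist_ok=True)
en_words = subsample(Path("raw/english.txt"), Path("mix/english_10M.txt"), WORD_BUDGET)
nl_words = subsample(Path("raw/dutch.txt"), Path("mix/dutch_10M.txt"), WORD_BUDGET)
print(f"English: {en_words:,} words, Dutch: {nl_words:,} words")

# Shared subword vocabulary over both languages.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["mix/english_10M.txt", "mix/dutch_10M.txt"],
    vocab_size=16_000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)
tokenizer.save_model("mix")  # writes vocab.json and merges.txt
```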

The expected outcomes are:

  • a multilingual dataset,
  • small-scale LLMs trained on the dataset,
  • code repository,
  • a project report summarizing the approach and results.

At the end of the project, the student will be able to:

  • Understand different pre-training strategies
  • Train LLMs from scratch
  • Apply this experience to training use-case-specific LMs

BabyLM provides a controlled setting for pre-training LLMs, and this setting can be extended to investigate open questions about the future of LLMs and their connections with human language acquisition. As a result, the project has the potential to be extended into an honours project.

References:
  • https://babylm.github.io/
  • https://news.mit.edu/2023/language-models-scalable-self-learners-0608
  • https://arxiv.org/abs/2301.11796