Computer Science and Engineering

BabyLM is a recent NLP competition aimed at developing methods for training large language models (LLMs) on small datasets. This project will extend these methods to multilingual datasets.
Natural language processing | Deep learning
Research Theme: BabyLM provided a dataset of fewer than 100M words, roughly the amount of linguistic input a child has been exposed to by age 13. The objective of BabyLM was two-fold: to provide a sandbox for developing new methods for sample-efficient pre-training of LLMs, and to compare learning in LLMs with what we know about human language acquisition.
Intuitively speaking, the question underlying BabyLM is: "If humans can learn a language after hearing a relatively small number of words, how far do LLMs get with a dataset of a similar size?" With its multilingual focus, this project extends the question to: "If bilingual humans can learn two languages from a small amount of input, how far do LLMs get with datasets of similar sizes in two languages?"
Datasets: We will use the BabyLM dataset as the base English dataset.
Programming language: The research will involve coding in Python using popular NLP/deep learning libraries.
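As an illustration only, the sketch below shows one way such small-scale pre-training could look in Python using the Hugging Face datasets and transformers libraries. The file name babylm_train.txt, the model size, and all hyperparameters are placeholders for this sketch, not the project's actual pipeline; the model is kept deliberately small because the emphasis is on sample efficiency rather than scale.

# Minimal sketch: pre-train a small GPT-2-style model on a plain-text corpus.
# "babylm_train.txt" is a placeholder path, not the real BabyLM file layout.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load a local plain-text corpus (placeholder file name).
raw = load_dataset("text", data_files={"train": "babylm_train.txt"})

# Reuse the GPT-2 tokenizer for simplicity; a real study might instead
# train a fresh tokenizer on the small corpus.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# A deliberately small model configuration; values are illustrative only.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_layer=6,
    n_head=8,
    n_embd=256,
    n_positions=128,
)
model = GPT2LMHeadModel(config)

# Causal language modelling (mlm=False) with dynamic padding.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="babylm-small",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()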
The student will be closely supported by the supervisor throughout the project. Broadly, the research activities will involve extending BabyLM-style pre-training methods to multilingual datasets.
Note: for the purposes of this project, 'multilinguality' means languages closely related to English.
The expected outcomes are:
At the end of the project, the student will be able to:
BabyLM provides a controlled setting for pre-training LLMs and can be extended towards questions about the future of LLMs and their connections with human language acquisition. As a result, the project has the potential to be extended into an honours project.