Reconstructing ancient proto-languages is a foundational task in historical linguistics: it lets researchers trace how languages evolved and uncover historical connections between them. Recent work has produced semisupervised neural models such as the DPD-BiReconstructor (Lu, Xie and Mortensen 2024; Best Paper Award, ACL 2024), which learns from unlabeled cognate sets and has been shown to work for language families such as Romance and Sinitic (Chinese). The model jointly trains a reconstructor (D2P, daughter-to-proto) and a validator (P2D, proto-to-daughter), checking whether a proposed proto-word can be transformed back into its modern descendants.

However, a key limitation of this method is that the validator network can often predict the correct modern words (reflexes) even when the reconstructed proto-word is incorrect (Lu, Xie and Mortensen 2024). When that happens, an incorrect reconstruction goes unpenalized, which weakens the training signal and limits the reconstruction model’s potential performance. This project aims to address this weakness by implementing a more discerning adversarial training setup and evaluating its effectiveness across different language families.
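To make the limitation concrete, the following is a minimal toy sketch, not the DPD-BiReconstructor itself. The two daughter languages, their sound changes, and the forms `kat` and `kap` are invented for illustration, and `toy_p2d` and `round_trip_ok` are hypothetical helper names. It shows one way a round-trip check can accept a wrong proto-form: both daughters have lost the contrast that separates it from the correct one.

```python
from typing import Callable, Dict

STOPS = ("p", "t", "k")

def toy_p2d(proto: str, lang: str) -> str:
    """Made-up proto-to-daughter rules for two hypothetical languages."""
    if lang == "A":
        # Language A drops a final stop.
        return proto[:-1] if proto.endswith(STOPS) else proto
    if lang == "B":
        # Language B merges every final stop into a glottal stop.
        return proto[:-1] + "ʔ" if proto.endswith(STOPS) else proto
    raise ValueError(f"unknown language: {lang}")

def round_trip_ok(proto: str, reflexes: Dict[str, str],
                  p2d: Callable[[str, str], str]) -> bool:
    """Consistency check: does P2D regenerate every observed reflex?"""
    return all(p2d(proto, lang) == form for lang, form in reflexes.items())

# Observed (invented) reflexes of one cognate set.
reflexes = {"A": "ka", "B": "kaʔ"}

gold_ok = round_trip_ok("kat", reflexes, toy_p2d)   # assumed correct proto-form
wrong_ok = round_trip_ok("kap", reflexes, toy_p2d)  # incorrect proto-form

print(gold_ok, wrong_ok)
```

Both candidates pass the check, so a consistency test alone gives the reconstructor no reason to prefer `kat` over `kap`. This is the kind of weak signal the adversarial setup is meant to address.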

This project has two primary contributions:

  1. Dataset Expansion (WikiHan+): We will expand the original, widely cited WikiHan dataset (Chang et al. 2022) by incorporating Sino-Xenic loanwords from Korean, Japanese, and Vietnamese. These loanwords are valuable because they preserve features of older stages of Chinese pronunciation. To keep the data pipeline efficient and scalable, we will work with Wiktionary dumps rather than scraping the live website. This follows the original WikiHan paper’s use of a CBOR snapshot and allows rapid extraction of the necessary structured data. The extracted entries will be cross-referenced with other open-source datasets such as KanjiDictVN (Nguyen 2023) and CVDICT (Phong 2021) to create a new, comprehensive resource, WikiHan+, intended to support historical-linguistic studies of East Asian languages.
  2. Methodological Improvement (Adversarial Reconstruction): We will replace the DPD-BiReconstructor’s simple consistency check with a more robust Generative Adversarial Network (GAN) framework. A language embedding is attached to each component of the framework so that reconstruction can be conditioned on a specific language.
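As a rough illustration of the second contribution, the sketch below shows one way an adversarial signal with a language embedding could be wired up in PyTorch. It is an assumption-laden sketch, not the project's design: the `Discriminator` class, the sizes `VOCAB`, `N_LANGS` and `DIM`, the GRU encoder, the soft (softmax) relaxation of the reconstructor's output, and the standard non-saturating GAN losses are all illustrative choices, and random tensors stand in for real cognate data and for D2P outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, N_LANGS, DIM = 32, 4, 16  # hypothetical sizes

class Discriminator(nn.Module):
    """Scores how plausible a proto-form is for a given reflex and language."""
    def __init__(self) -> None:
        super().__init__()
        self.phon = nn.Embedding(VOCAB, DIM)
        self.lang = nn.Embedding(N_LANGS, DIM)  # language embedding (assumed form)
        self.enc = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(3 * DIM, 1)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        _, h = self.enc(x)       # h: (num_layers, batch, DIM)
        return h[-1]             # (batch, DIM)

    def forward(self, proto_probs, reflex_ids, lang_ids):
        # proto_probs: (batch, T, VOCAB); one-hot for gold forms, softmax
        # for generated ones, so gradients can reach the reconstructor.
        proto_vec = self.encode(proto_probs @ self.phon.weight)
        reflex_vec = self.encode(self.phon(reflex_ids))
        feats = torch.cat([proto_vec, reflex_vec, self.lang(lang_ids)], dim=-1)
        return self.out(feats).squeeze(-1)  # one logit per example

def discriminator_loss(disc, real_proto, fake_probs, reflex, lang):
    real = disc(F.one_hot(real_proto, VOCAB).float(), reflex, lang)
    fake = disc(fake_probs.detach(), reflex, lang)  # no gradient to generator
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_loss(disc, fake_probs, reflex, lang):
    fake = disc(fake_probs, reflex, lang)
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))

torch.manual_seed(0)
B, T = 8, 5
disc = Discriminator()
reflex = torch.randint(0, VOCAB, (B, T))       # placeholder reflex phonemes
lang = torch.randint(0, N_LANGS, (B,))         # placeholder language ids
real_proto = torch.randint(0, VOCAB, (B, T))   # placeholder gold proto-forms
gen_logits = torch.randn(B, T, VOCAB, requires_grad=True)  # stand-in for D2P output
fake_probs = gen_logits.softmax(dim=-1)

d_loss = discriminator_loss(disc, real_proto, fake_probs, reflex, lang)
g_loss = generator_loss(disc, fake_probs, reflex, lang)
g_loss.backward()  # gradients flow back to the stand-in reconstructor output
```

In a semisupervised setting, the "real" examples would presumably come only from the labeled subset of cognate sets, while generated proto-forms could come from both labeled and unlabeled sets. How the two are balanced is left open here.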

School

Computer Science and Engineering

Research Area

Natural language processing (NLP)

Suitable for recognition of Work Integrated Learning (industrial training)? 

No

This project is suitable for a student with demonstrated experience in deep learning for natural language processing, gained through either academic courses or projects. The project will not involve LLM API calls, so the student must be skilled in the use of PyTorch-based NLP libraries. The project forms part of the NLP research group's focus on language varieties. The student will have access to computing infrastructure (NCI).

The student will be expected to attend 30-minute weekly meetings and present ongoing updates in the form of PowerPoint slides and a code walkthrough.

  • Well-documented code and a report ready to be submitted as a paper.
  • A new, publicly available dataset, WikiHan+, that provides a richer source of evidence for East Asian historical linguistics.
  • A quantitative demonstration that the adversarial training framework outperforms the baseline DPD-BiReconstructor across two distinct language families.
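The quantitative comparison in the last outcome could be scored along the following lines. This is a sketch under assumptions: the proposal does not fix its metrics, so exact-match accuracy and length-normalized phoneme edit distance are used here as plausible choices, `edit_distance` and `evaluate` are hypothetical helper names, and the forms below are placeholders rather than results.

```python
from typing import Dict, List, Sequence

def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def evaluate(preds: List[List[str]], golds: List[List[str]]) -> Dict[str, float]:
    """Exact-match accuracy and mean edit distance normalized by gold length."""
    n = len(golds)
    acc = sum(p == g for p, g in zip(preds, golds)) / n
    ned = sum(edit_distance(p, g) / max(len(g), 1)
              for p, g in zip(preds, golds)) / n
    return {"accuracy": acc, "mean_normalized_edit_distance": ned}

# Placeholder forms only; these are not outputs of any real system.
golds = [["k", "a", "t"], ["m", "i"]]
system_a = [["k", "a", "p"], ["m", "i"]]
system_b = [["k", "a", "t"], ["m", "i"]]

scores_a = evaluate(system_a, golds)
scores_b = evaluate(system_b, golds)
print(scores_a, scores_b)
```

Running the same `evaluate` call on the baseline and on the adversarial model, separately for each language family, would give directly comparable numbers.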

  1. Kalvin Chang, Chenxuan Cui, Youngmin Kim, and David R. Mortensen. 2022. WikiHan: A New Comparative Dataset for Chinese Languages. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3563–3569, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  2. Liang Lu, Peirong Xie, and David Mortensen. 2024. Semisupervised Neural Proto-Language Reconstruction. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14715–14759, Bangkok, Thailand. Association for Computational Linguistics.
  3. Carlo Meloni, Shauli Ravfogel, and Yoav Goldberg. 2021. Ab Antiquo: Neural Proto-language Reconstruction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4460–4473, Online. Association for Computational Linguistics.
  4. Phan Phong. 2021. CVDICT [Source code]. https://github.com/ph0ngp/CVDICT. Accessed 15 August 2025.
  5. Trung Nguyen. 2023. KanjiDictVN [Source code]. https://github.com/trungnt2910/KanjiDictVN. Accessed 15 August 2025.