Reading the Methods section of a scientific paper is often slow and difficult because crucial details are spread across formats: text, equations, and figures. For example, the text may describe the optimisation procedure, the corresponding loss function appears only as an equation, and the overall workflow is shown only in a diagram. Current document analysis systems usually process text alone, or at best treat equations and figures as separate artefacts. As a result, subtle but important methodological innovations remain hidden, forcing researchers to read papers line by line.

Research Gap: While recent tools (e.g., Nougat for equation recognition, PDFFigures 2.0 for figure extraction, LayoutLMv3 and MDocAgent for multimodal document analysis) can process different components individually, no existing approach aligns method sentences with the exact equations and figures that implement them. Without this alignment, computers cannot “understand” how methods are structured or compare innovations across papers.

Project Goal: This project will close that gap by creating the first small dataset and prototype system for methodology-aware alignment in research papers. Specifically, the student will:

  • Build a gold-standard dataset linking text, equations, and figures in the Methods sections of ~50–100 research papers.
  • Develop a baseline alignment model that can automatically suggest which equations and figures belong to each method step.
  • Evaluate the prototype against the annotated dataset to measure performance.
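To make the baseline concrete, the steps above can be sketched as a simple lexical-overlap aligner: each method sentence is matched to the equation or figure caption with the highest bag-of-words cosine similarity, and predictions are scored against gold annotations. This is a minimal illustration only, using invented toy data; the actual project would train on the annotated corpus, likely with learned embeddings rather than word counts.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def align(sentences, captions):
    """For each method sentence, return the index of the most similar caption."""
    return [max(range(len(captions)),
                key=lambda j: cosine(bow(s), bow(captions[j])))
            for s in sentences]

# Toy example (hypothetical sentences and captions, not from any real paper)
sentences = [
    "We minimise the cross-entropy loss with stochastic gradient descent.",
    "The overall pipeline first extracts figures, then aligns them to text.",
]
captions = [
    "Equation 1: cross-entropy loss over the training set.",
    "Figure 2: overview of the extraction and alignment pipeline.",
]
pred = align(sentences, captions)

# Evaluate against hand-labelled gold alignments (sentence i -> caption gold[i])
gold = [0, 1]
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
```

Even this trivial baseline gives a measurable starting point for the evaluation step, and it highlights why the task is hard: the lexical overlap between a sentence and the equation that implements it is often near zero, which is exactly the gap a learned alignment model would need to close.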

This project is ambitious but achievable within 60 FTE days for a motivated student, and it will make a valuable research contribution by providing resources and a baseline for future Honours, Masters, and PhD projects.

School

Computer Science and Engineering

Research Area

Natural language processing (NLP) | Machine learning | Deep learning | Information retrieval | Document understanding | Scientific knowledge extraction

Suitable for recognition of Work Integrated Learning (industrial training)?

No

The student will join an active NLP and document understanding research group, attend regular supervision meetings, and receive structured guidance and feedback.

Dataset: First annotated corpus of text–equation–figure alignments in Methods sections.

Prototype: Baseline alignment model demonstrating feasibility.

Experience: Student gains practical skills in NLP, computer vision, dataset annotation, and scholarly document understanding.

Contribution: Provides a foundation for follow-up research (e.g., scaling dataset, training advanced multi-agent systems).

Section-aware / multi-agent scholarly document understanding

  • Han, S., Xia, P., Zhang, R., Sun, T., Li, Y., Zhu, H., & Yao, H. (2025). MDocAgent: A multi-modal multi-agent framework for document understanding. arXiv preprint arXiv:2503.13964. https://arxiv.org/abs/2503.13964
  • Gokdemir, O., Siebenschuh, C., Brace, A., Wells, A., Hsu, B., Hippe, K., Setty, P. V., Ajith, A., Pauloski, J. G., Sastry, V., Zheng, H., Ma, H., Kale, B., Chia, N., Gibbs, T., Papka, M. E., Brettin, T., Alexander, F. J., Anandkumar, A., Foster, I., Stevens, R., Vishwanath, V., & Ramanathan, A. (2025). HiPerRAG: High-performance retrieval augmented generation for scientific insights. arXiv preprint arXiv:2505.04846 (PASC ’25). https://arxiv.org/abs/2505.04846

Equation & table alignment / structural understanding

  • Zhong, Y., Zeng, Z., Chen, L., Yang, L., Zheng, L., Huang, J., Yang, S., & Ma, L. (2025). DocTron-Formula: Generalized formula recognition in complex and structured scenarios. arXiv preprint arXiv:2508.00311. https://arxiv.org/abs/2508.00311
  • Ho, X., Kumar, S., Wu, Y.-A., Boudin, F., Takasu, A., & Aizawa, A. (2025). Table–text alignment: Explaining claim verification against tables in scientific papers. arXiv preprint arXiv:2506.10486. https://arxiv.org/abs/2506.10486

Diagram/layout-grounded explanations (DocVQA)

  • Souibgui, M. A., Choi, C., Barsky, A., Jung, K., Valveny, E., & Karatzas, D. (2025). DocVXQA: Context-aware visual explanations for document question answering. arXiv preprint arXiv:2505.07496. https://arxiv.org/abs/2505.07496
  • Hormazábal Lagos, M., Cerezo-Costas, H., & Karatzas, D. (2025). Spatially grounded explanations in vision–language models for document visual question answering. arXiv preprint arXiv:2507.12490. https://arxiv.org/abs/2507.12490