Reading the Methods section of a scientific paper is often slow and difficult because crucial details are spread across formats: text, equations, and figures. For example, the text may describe the optimisation procedure, the corresponding loss function appears only as an equation, and the overall workflow is shown only in a diagram. Current document analysis systems usually process text alone, or at best treat equations and figures as separate artefacts. As a result, subtle but important methodological innovations remain hidden, forcing researchers to read papers line by line.

Research Gap: While recent tools (e.g., Nougat for equation recognition, PDFFigures 2.0 for figure extraction, LayoutLMv3 and MDocAgent for multimodal document analysis) can process different components individually, no existing approach aligns method sentences with the exact equations and figures that implement them. Without this alignment, computers cannot “understand” how methods are structured or compare innovations across papers.

Project Goal: This project will close that gap by creating the first small dataset and prototype system for methodology-aware alignment in research papers. Specifically, the student will:

  • Build a gold-standard dataset linking text, equations, and figures in the Methods sections of ~50–100 research papers.
  • Develop a baseline alignment model that can automatically suggest which equations and figures belong to each method step.
  • Evaluate the prototype against the annotated dataset to measure performance.
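To make the baseline concrete, the steps above can be sketched as a simple lexical-overlap aligner: each method sentence is matched to the equation or figure caption with the highest bag-of-words cosine similarity, and predictions are scored against gold annotations. This is a minimal illustration only, using invented toy data; the actual project would train on the annotated corpus, likely with learned embeddings rather than word counts.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def align(sentences, captions):
    """For each method sentence, return the index of the most similar caption."""
    return [max(range(len(captions)),
                key=lambda j: cosine(bow(s), bow(captions[j])))
            for s in sentences]

# Toy example (hypothetical sentences and captions, not from any real paper)
sentences = [
    "We minimise the cross-entropy loss with stochastic gradient descent.",
    "The overall pipeline first extracts figures, then aligns them to text.",
]
captions = [
    "Equation 1: cross-entropy loss over the training set.",
    "Figure 2: overview of the extraction and alignment pipeline.",
]
pred = align(sentences, captions)

# Evaluate against hand-labelled gold alignments (sentence i -> caption gold[i])
gold = [0, 1]
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
```

Even this trivial baseline gives a measurable starting point for the evaluation step, and it highlights why the task is hard: the lexical overlap between a sentence and the equation that implements it is often near zero, which is exactly the gap a learned alignment model would need to close.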

This project is ambitious but achievable within 60 FTE days for a motivated student, and it will make a valuable research contribution by providing resources and a baseline for future Honours, Masters, and PhD projects.

School

Computer Science and Engineering

Research Area

Natural language processing (NLP) | Machine learning | Deep learning | Information retrieval | Document understanding | Scientific knowledge extraction

Suitable for recognition of Work Integrated Learning (industrial training)?

No

The student will join an active NLP and document understanding research group, attend regular supervision meetings, and receive structured guidance and feedback.

Dataset: First annotated corpus of text–equation–figure alignments in Methods sections.

Prototype: Baseline alignment model demonstrating feasibility.

Experience: Student gains practical skills in NLP, computer vision, dataset annotation, and scholarly document understanding.

Contribution: Provides a foundation for follow-up research (e.g., scaling dataset, training advanced multi-agent systems).

Section-aware / multi-agent scholarly document understanding

  • Han, S., Xia, P., Zhang, R., Sun, T., Li, Y., Zhu, H., & Yao, H. (2025). MDocAgent: A multi-modal multi-agent framework for document understanding. arXiv preprint arXiv:2503.13964. https://arxiv.org/abs/2503.13964
  • Gokdemir, O., Siebenschuh, C., Brace, A., Wells, A., Hsu, B., Hippe, K., Setty, P. V., Ajith, A., Pauloski, J. G., Sastry, V., Zheng, H., Ma, H., Kale, B., Chia, N., Gibbs, T., Papka, M. E., Brettin, T., Alexander, F. J., Anandkumar, A., Foster, I., Stevens, R., Vishwanath, V., & Ramanathan, A. (2025). HiPerRAG: High-performance retrieval augmented generation for scientific insights. arXiv preprint arXiv:2505.04846 (PASC ’25). https://arxiv.org/abs/2505.04846

Equation & table alignment / structural understanding

  • Zhong, Y., Zeng, Z., Chen, L., Yang, L., Zheng, L., Huang, J., Yang, S., & Ma, L. (2025). DocTron-Formula: Generalized formula recognition in complex and structured scenarios. arXiv preprint arXiv:2508.00311. https://arxiv.org/abs/2508.00311
  • Ho, X., Kumar, S., Wu, Y.-A., Boudin, F., Takasu, A., & Aizawa, A. (2025). Table–text alignment: Explaining claim verification against tables in scientific papers. arXiv preprint arXiv:2506.10486. https://arxiv.org/abs/2506.10486

Diagram/layout-grounded explanations (DocVQA)

  • Souibgui, M. A., Choi, C., Barsky, A., Jung, K., Valveny, E., & Karatzas, D. (2025). DocVXQA: Context-aware visual explanations for document question answering. arXiv preprint arXiv:2505.07496. https://arxiv.org/abs/2505.07496
  • Hormazábal Lagos, M., Cerezo-Costas, H., & Karatzas, D. (2025). Spatially grounded explanations in vision–language models for document visual question answering. arXiv preprint arXiv:2507.12490. https://arxiv.org/abs/2507.12490