Description of field of research:

Vision and language are the most common ways of comprehending and expressing our knowledge about the world. One of the overarching goals of Artificial Intelligence is to develop systems that can understand visual content and answer questions about it as humans do. A Visual Question Answering (VQA) system takes as input an image and a natural language question about the image and produces a natural language answer as output. Like VQA in the general domain, the medical VQA (Med-VQA) task aims to design systems that can answer diagnostically relevant natural language questions asked about medical images. With the current rise of Electronic Health Records (EHRs) and the increasing interest in AI to support clinical decision making and improve patient engagement, Med-VQA systems can give clinicians a second opinion and give patients a better opportunity to interpret their radiology exams. In contrast to the general domain, the joint modeling of visual and linguistic information in the medical domain has received little attention. The Med-VQA task combines techniques from Computer Vision (CV) and Natural Language Processing (NLP): CV provides an understanding of the content of the image, and NLP provides an understanding of the question and the ability to produce answers. Med-VQA is a challenging task because a system must interpret a natural language question, reason over the medical image, and then generate a natural language answer that is coherent, fluent, and diagnostically accurate. In addition, clinical questions and medical images are highly diverse. Med-VQA systems can potentially augment radiologists by providing a second opinion, be used in medical education to train medical professionals, and be integrated into medical conversational agents, in turn improving radiology workflow and patient engagement.
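To make the image-plus-question-to-answer pipeline concrete, the following is a minimal PyTorch sketch, not the project's model. It assumes a simplification: answering is treated as classification over a fixed set of candidate answers rather than free-form text generation. The class name `TinyMedVQA`, the layer sizes, the element-wise fusion, and the random inputs are all hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


class TinyMedVQA(nn.Module):
    """Toy VQA model (hypothetical): image encoder + question encoder + fusion + answer head."""

    def __init__(self, vocab_size: int, num_answers: int, hidden_dim: int = 64):
        super().__init__()
        # Image branch (computer vision): a very small CNN for grayscale images.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 16, 1, 1)
            nn.Flatten(),             # -> (batch, 16)
            nn.Linear(16, hidden_dim),
        )
        # Question branch (NLP): token embeddings followed by a GRU.
        self.embedding = nn.Embedding(vocab_size, hidden_dim, padding_idx=0)
        self.question_encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Answer head: scores each candidate answer from the fused features.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image: torch.Tensor, question_tokens: torch.Tensor) -> torch.Tensor:
        image_features = self.image_encoder(image)            # (batch, hidden_dim)
        _, last_hidden = self.question_encoder(self.embedding(question_tokens))
        question_features = last_hidden[-1]                    # (batch, hidden_dim)
        fused = image_features * question_features             # element-wise fusion (assumed)
        return self.classifier(fused)                          # (batch, num_answers)


# Random stand-ins for a batch of 2 images and 2 tokenized questions.
model = TinyMedVQA(vocab_size=100, num_answers=10)
images = torch.randn(2, 1, 64, 64)
questions = torch.randint(1, 100, (2, 8))
logits = model(images, questions)
print(logits.shape)  # torch.Size([2, 10])
```

The predicted answer for each image-question pair would be the highest-scoring entry, for example `logits.argmax(dim=1)`. A system that generates answers word by word would replace the classifier with a text decoder.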

Research Area

Computer vision | Natural language processing | Deep learning | Medical imaging

The research team for this project consists of Sonit Singh from the School of Computer Science and Engineering (CSE) and an undergraduate student from UNSW. The team will be advised by Prof. Arcot Sowmya, a senior researcher in the field of medical imaging who leads a team of Senior Research Associates, Research Associates, and multiple PhD students. The team will use publicly available datasets such as VQA-RAD, ROCO, and SLAKE to develop a Med-VQA system. The research will involve reading research articles to understand existing Med-VQA methods and developing new algorithms using deep learning frameworks such as TensorFlow, PyTorch, and Keras.

A research report will be written covering the introduction, literature review, data exploration and analysis, the developed model, the experimental setup, and the results. The team aims to produce a research paper that can be submitted to a relevant conference for publication.

  1. Lau et al., A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, Vol. 5, No. 1, pp. 1-10, 2018.
  2. Pelka et al., Radiology Objects in Context (ROCO): A multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-scale Annotation of Biomedical Data and Expert Label Synthesis, 2018.
  3. Abacha et al., Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain, CEUR, Vol. 2936, 2021.
  4. Liu et al., SLAKE: A Semantically Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering, 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021, pp. 1650-1654.
  5. Singh et al., Pushing the Limits of Radiology with Joint Modeling of Visual and Textual Information, Proceedings of ACL 2018, Student Research Workshop, Association for Computational Linguistics, 2018.