Description of field of research:

Advances in technology and innovative approaches to studying living organisms produce huge amounts of biological data. This data can broadly be categorized into three classes: (1) Sequences, data generated by omics technologies such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics; (2) Images, data generated by biomedical imaging techniques, including cellular, sub-cellular, and diagnostic images; and (3) Signals, electrical signals generated by the brain or body muscles and acquired by appropriate sensors. Of these three biological data ecosystems, images have benefitted most from deep learning through the application of Convolutional Neural Networks (CNNs). Signals and sequences have not benefitted to the same extent, because CNNs designed for images cannot be applied directly to data that is not in a 2D, image-like format. Recent advances in high-throughput next-generation sequencing technologies have generated a plethora of gene-expression data, which has made multi-omics studies increasingly attractive.

Analysing multi-omics data helps us gain insight into the mechanisms of diseases, supports drug discovery, and can lead to the prevention and treatment of diseases such as cancer. In general, omics data suffer from the curse of dimensionality: the number of samples is usually much smaller than the number of genes. Also, although sequencing technologies generate hundreds of thousands of gene-expression measurements, only a few of them are considered clinically significant. In the past, domain experts manually selected specific genes to form gene signatures (also called biomarkers) that can be used to predict clinical outcomes such as recurrence, treatment benefit, and metastasis. In classical machine learning, hand-crafted feature engineering is used to reduce the dimensionality of the input space and to select essential features for classification. However, as the complexity of the data increases, manual feature engineering scales poorly and leads to sub-optimal model performance. Deep learning algorithms such as CNNs can overcome this limitation by automatically extracting features from raw data and training the model end to end, which has yielded superior performance on various tasks.
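
As a rough illustration of the dimensionality problem and of a classical feature-selection step, the sketch below builds a synthetic expression matrix with far more genes than samples and keeps only the highest-variance genes. The data, the function name `top_variance_features`, and the choice of a variance filter are assumptions made for illustration; they stand in for the expert-driven selection described above and are not a method used by the project.

```python
import numpy as np

def top_variance_features(X, k):
    """Return the sorted column indices of the k highest-variance features."""
    variances = X.var(axis=0)            # one variance per gene (column)
    top = np.argsort(variances)[-k:]     # indices of the k largest variances
    return np.sort(top)

# Synthetic stand-in for gene-expression data: 50 samples, 5000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))
X[:, :10] *= 5.0                         # make the first ten genes highly variable

selected = top_variance_features(X, 10)
X_reduced = X[:, selected]               # keep only the selected genes
print(X.shape, "->", X_reduced.shape)
```

A filter like this ignores the class labels and any interaction between genes, which is one reason such hand-designed steps can limit model performance on complex data.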

As CNNs are suited to data in the form of 2D or 3D arrays such as images, they cannot be applied directly to omics data to exploit their high classification performance. Non-image data therefore need to be transformed into well-organised images that can be given as input to CNNs for various downstream tasks. Early approaches to applying CNNs to non-image data were restricted to 1D CNN architectures. More recently, algorithms such as DeepInsight, REpresentation of Features as Images with NEighborhood Dependencies (REFINED), and Image Generator for Tabular Data (IGTD) have been proposed to transform tabular or other non-image data into images.
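
The general idea of turning a tabular sample into an image can be sketched as follows. This is a simplified illustration, not the published DeepInsight, REFINED, or IGTD algorithm: it assumes that each feature is placed on a pixel grid using a 2D PCA projection of the features, and that features landing on the same pixel are averaged. The function names `feature_coordinates` and `to_image` are hypothetical.

```python
import numpy as np

def feature_coordinates(X, size):
    """Assign each feature (column of X) a pixel on a size x size grid.

    Each feature is treated as a point described by its values across
    samples, projected to 2D with PCA (via SVD), then rescaled to
    integer pixel indices in [0, size - 1]. Requires at least 2 samples.
    """
    F = X.T - X.T.mean(axis=0)                   # (n_features, n_samples), centred
    U, S, _ = np.linalg.svd(F, full_matrices=False)
    coords = U[:, :2] * S[:2]                    # 2D PCA scores, one row per feature
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # avoid division by zero
    return np.floor((coords - lo) / span * (size - 1) + 0.5).astype(int)

def to_image(sample, pixels, size):
    """Render one sample as an image, averaging features that share a pixel."""
    total = np.zeros((size, size))
    count = np.zeros((size, size))
    np.add.at(total, (pixels[:, 0], pixels[:, 1]), sample)
    np.add.at(count, (pixels[:, 0], pixels[:, 1]), 1)
    # Pixels with no feature assigned stay at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

# Synthetic stand-in for an omics matrix: 20 samples, 200 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
pixels = feature_coordinates(X, size=16)
image = to_image(X[0], pixels, size=16)          # 16 x 16 array for the first sample
```

The pixel assignment is computed once from the whole dataset and reused for every sample, so that a given pixel always corresponds to the same group of features and a CNN can learn spatial patterns across them.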

The project aims to develop methods to transform non-image multi-omics data into images for the application of CNNs. The developed methods will be applied to various publicly available omics datasets, such as gene-expression data from The Cancer Genome Atlas (TCGA) program and the Gene Expression Omnibus repository. Applying state-of-the-art deep learning algorithms to multi-omics data transformed into images can help answer broader questions in basic and applied areas of biology.

School

Computer Science and Engineering

Research areas

Computer Vision, Machine Learning, Deep Learning, Bioinformatics

The research team for this project consists of Dr. Sonit Singh from the School of Computer Science and Engineering (CSE) and an undergraduate student from UNSW. The team will be advised by Prof. Arcot Sowmya, a senior researcher in the field of medical imaging leading a team of Senior Research Associates, Research Associates, and multiple PhD students. The team will utilize publicly available datasets such as TCGA RNA-seq and population genomics data to develop models. The research will involve reading research articles to understand existing methods for transforming tabular/non-image data into images for the application of CNNs and developing methods using deep learning frameworks such as TensorFlow, PyTorch, and Keras.

A research report will be written outlining motivation, literature review, data exploration and analysis, developed model, experimental setup, and results. The team aims to produce a research paper that can be submitted to a relevant conference for publication. 

  1. Mahmud et al. (2021). Deep learning in mining biological data. Cognitive Computation 13, 1-33. https://doi.org/10.1007/s12559-020-09773-x
  2. Sharma et al. (2019). DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Scientific Reports 9, 11399. https://doi.org/10.1038/s41598-019-47765-6
  3. Bazgir et al. (2020). Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nature Communications 11, 4391. https://doi.org/10.1038/s41467-020-18197-y
  4. Zhu et al. (2021). Converting tabular data into images for deep learning with convolutional neural networks. Scientific Reports 11, 11325. https://doi.org/10.1038/s41598-021-90923-y