Dr Penghao Wang
Mass spectrometry (MS) provides a high-throughput method for studying proteins. A central component of MS-based protein analysis is protein identification, which is the basis for estimating protein expression and understanding protein function. Accurate protein identification can lead to the discovery of disease biomarkers, new drugs and treatments, and more effective diagnoses and prognoses. In an MS-based protein experiment, proteins are digested into shorter peptides, which are then fragmented at their backbones, ionised, and finally captured by the mass spectrometer. The proteins then need to be inferred from the captured mass spectra using statistical and computational methods.
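As a rough illustration of the digestion and fragmentation steps, the following minimal Python sketch digests a protein sequence in silico and computes the theoretical masses an identification method would compare against observed spectra. It makes several assumptions that the abstract does not state: trypsin is the enzyme (cleaving after K or R unless the next residue is P), there are no missed cleavages or modifications, and fragment ions are singly protonated b- and y-ions. The function names are hypothetical and are not part of the methods presented in the talk.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum of an intact peptide
PROTON = 1.00728   # mass of the proton carried by a singly charged ion


def digest(protein):
    """Assumed tryptic digestion: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values for each backbone cleavage."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b, y


if __name__ == "__main__":
    for pep in digest("MKWVTFISLLR"):
        print(pep, round(peptide_mass(pep), 4))
```

Identification then amounts to scoring how well observed spectra match such theoretical masses, which is where the statistical questions arise.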
Protein identification is very challenging, and existing methods have several limitations, the most serious being low identification coverage and a high false discovery rate. I will first introduce the statistical challenges involved and discuss why existing identification methods give sub-optimal results. I will then briefly describe our new probability-based protein identification methods. One of our methods, PTMexplorer, simultaneously identifies proteins and their post-translational modifications, thus significantly increasing identification coverage. Our EMSSL method, which uses a self-boosted learning algorithm, can further decrease the false discovery rate. Experimental results on real datasets demonstrate that our methods identify significantly more proteins than existing methods while keeping the false discovery rate low.
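To make the false discovery rate concrete, here is a small Python sketch of one widely used way to estimate it, the target-decoy approach. This is an assumption for illustration only: the abstract does not say how the false discovery rate is estimated, and the sketch does not represent PTMexplorer or EMSSL. Each match is a hypothetical (score, is_decoy) pair, where decoy matches come from searching a database of nonsense sequences, and the scores are made-up values rather than experimental results.

```python
def estimate_fdr(matches, threshold):
    """Estimate FDR as (decoy hits) / (target hits) at or above a score threshold."""
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Made-up example scores; True marks a match to a decoy sequence.
matches = [(9.1, False), (8.4, False), (7.9, True),
           (7.2, False), (6.5, False), (5.0, True)]

if __name__ == "__main__":
    for cutoff in (8.0, 7.0, 5.0):
        print(cutoff, estimate_fdr(matches, cutoff))
```

Lowering the cutoff accepts more identifications but also admits more decoy hits, which is the trade-off between coverage and false discovery rate described above.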