16 Host proteomics (HP) data processing

In contrast to nucleic acids, proteins are highly complex molecules, and extensive data processing is required to derive meaningful biological insights from the large number of spectra generated by mass spectrometers. In quantitative proteomics and metabolomics, tandem mass spectrometry (MS/MS) is the most widely used form of data collection. Features that elute from the chromatography column at an expected retention time are quantified at the MS1 level and identified at the MS2 level. Quantification is achieved by calculating the area under the curve (AUC) or the peak height of each feature. This applies to HP, MP, and ME alike: features are quantified from MS1 detection, and they are identified by search algorithms that compare each recorded MS2 spectrum with a feature spectrum from a predefined database. HP and MP databases are typically protein databases translated from genomic data, although other strategies, such as spectral libraries or mRNA databases, have also been successful. Assembling the identified peptides into proteins can nevertheless be challenging, especially when dealing with redundant peptides or spliced proteins. Recent advances in computational methods for predicting protein structures are expected to expand the reference databases for proteomics.
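
To make the MS1 quantification step concrete, the following minimal sketch computes both the AUC and the peak height of a single feature from its extracted ion chromatogram. It is an illustration under stated assumptions, not the procedure of any particular software: the function name `quantify_feature` and its parameters are hypothetical, the AUC is approximated with the trapezoidal rule, and no baseline subtraction, smoothing, or peak fitting is applied, although production tools commonly include such steps.

```python
from typing import Sequence, Tuple


def quantify_feature(
    rt: Sequence[float],
    intensity: Sequence[float],
    rt_start: float,
    rt_end: float,
) -> Tuple[float, float]:
    """Return (auc, peak_height) for one feature.

    rt and intensity are the retention times and MS1 intensities of the
    feature's chromatogram; rt_start and rt_end delimit the expected
    elution window (hypothetical parameters, chosen for illustration).
    """
    # Keep only the points that fall inside the expected retention window.
    points = [(t, i) for t, i in zip(rt, intensity) if rt_start <= t <= rt_end]
    if len(points) < 2:
        raise ValueError("need at least two points inside the window")

    # Trapezoidal rule: sum the area of each slice between adjacent points.
    auc = 0.0
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        auc += 0.5 * (i0 + i1) * (t1 - t0)

    # Peak height is the maximum intensity observed inside the window.
    peak_height = max(i for _, i in points)
    return auc, peak_height


if __name__ == "__main__":
    # Toy triangular peak sampled at five time points (invented values).
    rt = [0.0, 1.0, 2.0, 3.0, 4.0]
    intensity = [0.0, 50.0, 100.0, 50.0, 0.0]
    print(quantify_feature(rt, intensity, 0.0, 4.0))
```

AUC integrates the whole elution profile, whereas peak height depends on a single scan, so the two measures can rank features differently when peaks are broad or sparsely sampled.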