Improvements in software tools

The massive amount of MS data generated in proteomics experiments requires computational aid for effective data processing and analysis. A growing number of open-access computational tools covering all steps of proteomics data analysis are now freely available to users, a subset of which is listed in Table 1.

Table 1 Selected open-access software tools in proteomics

Open access tools | Language/framework | License | Publication | Website

Database search engine (untargeted proteomics)
Comet*,† | C++ | Apache 2.0 | [93] | [94]
MS-GF+*,† | Java | Custom/Academic | [31] | [95]
MS Amanda† | C#/Mono | Custom/Academic | [96] | [97]
ProLuCID*,† | Java | Custom/Academic | [26] | [98]
X!Tandem*,† | C++ | OSI Artistic | [99] | [100]

Targeted proteomics and/or data-independent acquisition
Skyline* | C# | Apache 2.0 | [101] | [102]
OpenSWATH*,† | C++ | BSD 3-Clause | [103] | [104]

Protein inference and/or search post-processing
Percolator*,† | C++ | Apache 2.0 | [34] | [105]
ProteinProphet*,† | C++ | GNU LGPLv2 | [106] | [107]
ProteinInferencer† | Java | Custom/Academic | [35] | [98]

Protein quantification
MaxQuant | .NET | Custom/Academic | [108] | [109]
Census† | Java | Custom/Academic | [110] | [98]
PLGEM*,† | R | GNU GPLv2 | [111] | [112]
QPROT*,† | C | GNU GPLv3 | [37] | [113]

Pipelines and toolkits
Perseus | .NET | Custom/Academic | [114] | [115]
Crux* | C++ | Apache 2.0 | [116] | [105]
OpenMS* | C++ | BSD 3-Clause | [117] | [118]
TPP* | C++ | GNU LGPLv2 | [119] | [120]

Data access and reuse
PeptideShaker* | Java | Apache 2.0 | [121] | [122]
PRIDE Inspector* | Java | Apache 2.0 | [123] | [124]

Proteomics software tools that provide open access to users. Many of these tools are also open source, which potentially allows users to participate in their continual development.
* Open-source code repository available at the time of writing
† Platform-independent (Windows, Linux, Mac)

A major computational task in shotgun proteomics is to efficiently interpret the mass and intensity information within mass spectra to identify proteins. The task can be formulated thus: given a particular tandem mass spectrum, identify, within a reasonable time frame, the peptide sequences most likely to have given rise to the observed precursor mass and fragment ion pattern. A general solution to this problem is “database search”, which involves generating theoretical spectra by in silico fragmentation of the peptide sequences contained in a protein database, and then systematically comparing the experimental MS spectra against the theoretical spectra to find the best peptide-spectrum matches. The SEQUEST algorithm, first proposed to solve this peptide-spectrum matching problem in 1994 [25], and its variants (e.g., Comet, ProLuCID [26]) remain among the most widely used algorithms to date for peptide identification. SEQUEST-style algorithms score peptide-spectrum matches in two steps: the first calculates a rough preliminary score that empirically restricts the number of candidate sequences, and the second derives a cross-correlation score to select the best peptide-spectrum match among the remaining candidates. Recent descendants of the SEQUEST algorithm have focused on optimizing search speed and improving the statistical rigor of candidate scoring, with some programs reporting ~30 % more peptides/proteins identifiable from identical MS datasets and better discrimination between true- and false-positive identifications [26–29].
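For illustration, the sketch below implements a greatly simplified version of this matching step in Python: candidate peptides are fragmented in silico into singly charged b- and y-ions, observed and theoretical spectra are vectorized onto fixed m/z bins, and candidates are ranked by a normalized dot product. The candidate sequences, toy spectrum, and binning constants are hypothetical, and the score is not SEQUEST's preliminary score or XCorr; real engines apply far more elaborate preprocessing and cross-correlation scoring.

```python
# Minimal, illustrative peptide-spectrum matching: in silico b/y-ion generation,
# spectrum binning, and ranking of candidates by a normalized dot product.
import numpy as np

# Monoisotopic residue masses (Da); proton and water masses for ion arithmetic
AA = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.007276, 18.010565

def theoretical_ions(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [AA[aa] for aa in peptide]
    prefix = np.cumsum(masses)            # cumulative residue masses from the N-terminus
    suffix = np.cumsum(masses[::-1])      # cumulative residue masses from the C-terminus
    b_ions = prefix[:-1] + PROTON
    y_ions = suffix[:-1] + WATER + PROTON
    return np.concatenate([b_ions, y_ions])

def binned(mz, weights, bin_width=1.0005, n_bins=2000):
    """Map peaks onto fixed-width m/z bins, keeping the largest weight per bin."""
    vec = np.zeros(n_bins)
    for m, w in zip(mz, weights):
        i = int(m / bin_width)
        if i < n_bins:
            vec[i] = max(vec[i], w)
    return vec

def match_score(exp_mz, exp_intensity, peptide):
    """Normalized dot product (0 to 1) between observed and theoretical spectra."""
    obs = binned(exp_mz, np.sqrt(exp_intensity))      # sqrt de-emphasizes dominant peaks
    theo_mz = theoretical_ions(peptide)
    theo = binned(theo_mz, np.ones_like(theo_mz))
    denom = np.linalg.norm(obs) * np.linalg.norm(theo)
    return float(obs @ theo / denom) if denom else 0.0

# Rank hypothetical candidate peptides against a toy fragment spectrum
candidates = ["SAMPLER", "ELPMASR", "GRADIENT"]
exp_mz, exp_int = [175.119, 262.151, 333.188, 432.256], [8e3, 5e3, 9e3, 4e3]
ranked = sorted(candidates, key=lambda p: match_score(exp_mz, exp_int, p), reverse=True)
print(ranked)
```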
Other search engines in common use include X!Tandem, which calculates the dot product between experimental and theoretical spectra and then derives the expectation value of the score being achieved in a random sequence match; MaxQuant/Andromeda, which considers fragment ion intensities and utilizes a probabilistic model for fragment observations [30]; MS-GF+ [31]; and others. Methods have also been developed to combine the unique strengths and biases of multiple search engines to improve total protein identifications [32].

Means to distinguish true and false positives are critical to all large-scale approaches. The “two-peptide rule” was once commonly adopted to decrease false positives at the protein level by requiring each protein to be identified by at least two independent peptides. However, this rather conservative rule can inflate false negatives, as some short or protease-incompatible proteins may produce at most one identifiable peptide. More recent conventions forego the two-peptide rule and instead estimate the false discovery rate (FDR) of identification through statistical models, often with the aid of decoy databases. The use of decoy databases/sequences (reversed or scrambled peptide sequences) allows a quick estimate of the number of false-positive proteins, under the assumption that false-positive hits and decoy hits share an identical distribution of identification scores. A maximum acceptable FDR can then be specified (conventionally 1–5 %) to determine which protein identifications are accepted in the final result. To explicitly reveal the posterior probability that any particular identification is correct (also called the local FDR), a mixture model has been used that treats the peptide identification result as a mixture of correct and incorrect peptides with two distinct Poisson distributions of identification scores [33]. Auxiliary determinants, including the presence of other identified peptides from the same protein, can also be applied to infer the overall likelihood of a protein assignment [33]. Machine learning algorithms (e.g., Percolator) have been used to build classifiers that automatically separate peptide-spectrum matches into true and false positives [34]. New inference approaches have also been demonstrated that consider peptide- and protein-level information together to improve the confidence of identification [35].
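As a concrete illustration of decoy-based FDR control, the sketch below sorts peptide-spectrum matches (PSMs) from a hypothetical concatenated target-decoy search by score and finds the deepest score threshold at which the estimated FDR (decoy hits divided by target hits above the threshold) stays under a chosen cutoff. The scores and decoy labels are toy values; production tools additionally compute q-values and apply model-based corrections.

```python
# Target-decoy FDR estimation (illustrative sketch).
import numpy as np

def decoy_fdr_threshold(scores, is_decoy, max_fdr=0.01):
    """Return (score threshold, number of accepted target PSMs) at the requested FDR."""
    order = np.argsort(scores)[::-1]                    # best scores first
    scores = np.asarray(scores, dtype=float)[order]
    is_decoy = np.asarray(is_decoy, dtype=bool)[order]
    decoys = np.cumsum(is_decoy)                        # decoy hits at or above each rank
    targets = np.cumsum(~is_decoy)                      # target hits at or above each rank
    fdr = decoys / np.maximum(targets, 1)               # estimated FDR at each rank
    ok = np.where(fdr <= max_fdr)[0]
    if len(ok) == 0:
        return None, 0                                  # nothing passes the cutoff
    k = ok[-1]                                          # deepest rank still under the cutoff
    return float(scores[k]), int(targets[k])

# Toy usage: accept PSMs at an estimated 5 % FDR
scores   = [4.1, 3.9, 3.5, 3.2, 2.8, 2.7, 2.5, 2.1]
is_decoy = [False, False, False, False, True, False, True, False]
thr, n_accepted = decoy_fdr_threshold(scores, is_decoy, max_fdr=0.05)
print(thr, n_accepted)
```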
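The mixture-model idea can likewise be sketched with simulated data. The example below substitutes a two-component Gaussian mixture (not the component distributions or auxiliary features of the cited model [33]) purely to show how posterior probabilities of correctness, i.e., local FDRs, fall out of such a model.

```python
# Illustrative mixture-model posterior probabilities for PSM scores.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated scores: a low-scoring "incorrect" population and a high-scoring "correct" one
scores = np.concatenate([rng.normal(1.5, 0.5, 800), rng.normal(3.5, 0.7, 200)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
correct = int(np.argmax(gmm.means_))                     # component with the higher mean
posterior = gmm.predict_proba(scores.reshape(-1, 1))[:, correct]

# Count PSMs with >= 95 % posterior probability of being correct
print(int(np.sum(posterior >= 0.95)))
```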
With the increase in data size and multiplexity (number of samples compared) in proteomics experiments, statistical approaches to data analysis have also evolved to tackle high-dimensional data. Whereas early studies relied mostly on confirmatory statistics, modern proteomics datasets typically contain thousands of features (e.g., protein expression values) over a handful of observations; simply testing whether each protein is significantly altered across experimental conditions can therefore result in under-analysis and failure to distinguish latent structures across multiple dimensions, e.g., whether there exists a subproteome of co-regulated proteins across multiple treatment categories. To gain biological insights, quantitative proteomics datasets are now routinely mined using statistical learning strategies that comprise feature selection (e.g., penalized regression methods), dimensionality reduction (e.g., principal component analysis), and supervised and unsupervised learning (e.g., support vector machines and hierarchical models) to discern significant protein signatures, disease-implicated pathways, or interconnected co-expression networks (Fig. 4).

Fig. 4 Proteomics data mining and functional annotations. Common computational approaches to extract information from massive proteomics datasets include (1) unsupervised cluster analysis, class discovery, and visualization; (2) motif analysis and annotation term enrichment; (3) statistical learning methods for disease signature extraction; (4) network analysis; and (5) annotation with other functional information, including protein motifs and cardiac disease relevance.

Improvements to computational methods that allow more robust results from label-free quantification are an area of active research. For example, recent approaches (e.g., QPROT) have attempted to resolve the respective quantities of multiple proteins that share common peptide sequences in spectral counting, using either weighted-average methods or more statistically motivated models [36, 37]. In ion intensity approaches, chromatographic features that correspond to peptide signals over the mass and retention-time dimensions are identified using image analysis or signal processing algorithms. Because LC gradients are seldom perfectly reproducible, nonlinear distortions in retention time may occur. To ensure identical ions are compared between experiments, automatic chromatographic alignment and clustering methods are used. Some software can identify small chromatographic features based on accurate mass and retention time alone, such that some peptides may be quantified even in experiments where they were not explicitly identified. These processes tend to become computationally expensive for large experimental files [38], and faster solutions are continuously being developed.
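A simple way to picture such retention-time alignment is sketched below: peptides identified in two runs serve as anchors, a LOWESS curve is fitted to the retention-time drift, and the second run is warped onto the first run's time scale. The simulated drift and smoothing fraction are arbitrary choices; real software combines this step with accurate-mass feature matching and more robust regression.

```python
# Illustrative nonlinear retention-time alignment between two LC-MS runs.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
rt_run1 = np.sort(rng.uniform(5, 115, 300))                        # anchor RTs in run 1 (min)
rt_run2 = rt_run1 + 0.02 * rt_run1 + 1.5 * np.sin(rt_run1 / 30)    # simulated nonlinear drift
rt_run2 = rt_run2 + rng.normal(0, 0.1, rt_run1.size)               # measurement noise

# LOWESS fit of the drift (run 2 minus run 1) as a function of run 2's retention time
fit = lowess(rt_run2 - rt_run1, rt_run2, frac=0.3, return_sorted=True)

def align_to_run1(rt, fit=fit):
    """Map retention times observed in run 2 onto run 1's time scale."""
    drift = np.interp(rt, fit[:, 0], fit[:, 1])
    return rt - drift

print(align_to_run1(np.array([20.0, 60.0, 100.0])))
```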
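Similarly, the unsupervised statistical learning strategies described above (Fig. 4) can be illustrated with a minimal example: the sketch below builds a simulated samples-by-proteins expression matrix, projects the samples onto principal components, and clusters them hierarchically. All data, dimensions, and parameters are invented for demonstration.

```python
# Illustrative unsupervised mining of a quantitative proteomics matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
n_samples, n_proteins = 12, 2000
X = rng.normal(size=(n_samples, n_proteins))          # log-intensity-like expression values
X[6:, :50] += 2.0                                     # a co-regulated subproteome in half the samples

Xz = StandardScaler().fit_transform(X)                # z-score each protein across samples
pcs = PCA(n_components=2).fit_transform(Xz)           # project samples onto two principal components

Z = linkage(pcs, method="ward")                       # hierarchical clustering of samples
labels = fcluster(Z, t=2, criterion="maxclust")       # cut the dendrogram into two groups
print(labels)
```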
With the proliferation of inter-compatible tools, an ongoing trend is to daisy-chain individual tools into user-friendly pipelines that provide complete solutions to a set of related data analysis problems. An ideal proteomics pipeline may combine identification, quantification, and validation tools in a modular organization accessible from a single location. Computation may be performed in the cloud to avoid the need to repeatedly copy, transfer, and store large files, allowing researchers to carry out computational tasks remotely from a browser on any computer system and obviating redundant infrastructure investments. Currently, the Trans-Proteomic Pipeline [39] and the Integrated Proteomics Pipeline [40] are two example “end-to-end” pipelines that connect raw MS proteomics data to analysis output, whereas comprehensive, open-access pipelines have also been demonstrated in other omics fields, including Galaxy for genomics/transcriptomics [41] and MetaboAnalyst for metabolomics [42].

In parallel, tools are also being federated into interoperable networks through open frameworks. A modular, open-source software development paradigm, in which individual software functionalities interoperate via common interfaces and standards, helps ensure that new software can dovetail with existing tools with ease and that development may continue even after the original research team becomes inactive. Examples of such frameworks include the GalaxyP proteomics extension [43] and the proteomics packages within the R/Bioconductor framework [44].