Despite impressive gains, improvement of protein identification will likely continue to be an area of active research. It is estimated that up to 75–85 % of mass spectra generated in a proteomics experiment can remain unidentified by current data analysis workflows [62, 86], thus leaving room for continuous growth through better bioinformatics in the near future. Currently the unidentified “junk” spectra are mostly siloed or discarded, thus they constitute a major untapped source of biomedical big data. More inclusive search criteria (e.g., considering non-tryptic cleavage) can enhance identification, but there also exists a substantial portion of spectra that represent bona fide peptides not amenable to existing methods. These include peptides too short to score well in searching algorithms (≤5 amino acids), and peptides that are absent from protein sequence databases, e.g., variant peptides from polymorphisms or unknown splice isoforms. The advent of massive publicly available datasets has opened new avenues to tackle this problem [62, 63]. For instance, the millions of unidentified spectra that are uploaded to the PRIDE proteomics data repository may be systematically sorted and clustered, then analyzed via more exhaustive search protocols to identify what peptides are commonly present but unidentified across datasets and experiments.