Protein Identification False Discovery Rates for Very Large Proteomics Data Sets Generated by Tandem Mass Spectrometry

TLDR

Proteomics seeks comprehensive proteome coverage, but the rapid growth of large, heterogeneous tandem mass spectrometry datasets creates unsolved challenges in controlling protein identification uncertainty beyond peptide‑level confidence measures. This work introduces MAYU, a strategy designed to reliably estimate false discovery rates for protein identifications in large‑scale proteomics data sets. MAYU was validated and applied to diverse large datasets, and is available both as standalone software and integrated into the Trans‑Proteomic Pipeline. The analyses reveal that dataset size substantially reduces protein identification reliability, with protein false discovery rates markedly higher than peptide‑level rates, underscoring MAYU’s importance for maintaining proteome repository quality.

Abstract

Comprehensive characterization of a proteome is a fundamental goal in proteomics. To achieve saturation coverage of a proteome or specific subproteome via tandem mass spectrometric identification of tryptic protein sample digests, proteomics data sets are growing dramatically in size and heterogeneity. The trend toward very large integrated data sets poses as yet unsolved challenges in controlling the uncertainty of protein identifications, challenges that go beyond the well-established confidence measures for peptide-spectrum matches. We present MAYU, a novel strategy that reliably estimates false discovery rates for protein identifications in large-scale data sets. We validated and applied MAYU using various large proteomics data sets. The data show that the size of the data set has an important and previously underestimated impact on the reliability of protein identifications. In particular, we found that protein false discovery rates are significantly elevated compared with those of peptide-spectrum matches. The function provided by MAYU is critical for controlling the quality of proteome data repositories and thereby for enhancing any study that relies on these data sources. The MAYU software is available as standalone software and is also integrated into the Trans-Proteomic Pipeline.
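
The claim that data set size inflates protein-level false discovery rates relative to peptide-spectrum match (PSM) rates can be illustrated with a toy calculation. The sketch below is not the MAYU estimator; it is a simplified model under stated assumptions: false PSMs land uniformly at random on database entries, true PSMs land uniformly on the proteins actually present in the sample, and a protein counts as identified once it receives at least one PSM. The function name and all parameter values are hypothetical and chosen only for illustration.

```python
def expected_protein_fdr(n_psms, psm_fdr, db_size, n_present):
    """Expected protein-level FDR under a toy uniform-assignment model.

    Assumptions (illustrative only, not taken from the paper):
      - a fraction `psm_fdr` of the PSMs is false, the rest is true
      - false PSMs hit any of the `db_size` database proteins uniformly
      - true PSMs hit any of the `n_present` sample proteins uniformly
      - one PSM is enough to report a protein as identified
    """
    n_false = psm_fdr * n_psms
    n_true = n_psms - n_false
    n_absent = db_size - n_present

    # Expected number of absent proteins hit by at least one false PSM.
    false_proteins = n_absent * (1.0 - (1.0 - 1.0 / db_size) ** n_false)
    # Expected number of present proteins hit by at least one true PSM.
    true_proteins = n_present * (1.0 - (1.0 - 1.0 / n_present) ** n_true)

    return false_proteins / (false_proteins + true_proteins)


# Hypothetical settings: 20,000 database entries, 5,000 proteins present,
# PSM-level FDR held fixed at 1% while the data set grows.
DATASET_SIZES = [1_000, 10_000, 100_000, 1_000_000]
PROTEIN_FDRS = [
    expected_protein_fdr(n, psm_fdr=0.01, db_size=20_000, n_present=5_000)
    for n in DATASET_SIZES
]

for size, fdr in zip(DATASET_SIZES, PROTEIN_FDRS):
    print(f"{size:>9} PSMs -> expected protein FDR {fdr:.3f}")
```

In this model the true identifications saturate once every present protein has been observed, while false PSMs keep adding new, wrongly identified proteins. The protein-level rate therefore climbs well above the fixed PSM-level rate as the data set grows, which is the qualitative behavior the abstract describes.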
