Publication | Open Access
Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies
Year: 2016 | Citations: 511 | References: 27
Multiple Natures, Engineering, Multiomics, Value Imputation Methods, Proteomic Technology, Data Science, Biostatistics, Biomarker Discovery, Proteomics, Statistics, Omics, Functional Genomics, Bioinformatics, Compare Imputation Strategies, Omics Datasets, Computational Biology, Imputation Method, Systems Biology, Medicine, Accurate Imputation Method
Missing values are a major challenge in label‑free quantitative proteomics, and prior imputation surveys have overlooked that missingness mechanisms differ across datasets and that each method is tailored to a specific mechanism. The study aims to identify the most appropriate imputation method for a given dataset rather than a universally best one, and to provide practical guidelines for selecting and applying imputation strategies. Comparisons show that an apparently under‑performing method can outperform a state‑of‑the‑art method when applied at the correct stage of the pipeline and to a dataset with matching missingness, supporting the proposed guidelines.
Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the statistical methods available for imputation, compared them on real or simulated data sets, and recommended a list of missing value imputation methods for proteomics applications. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views. For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.
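The dependence of imputation accuracy on the missingness mechanism can be illustrated with a toy simulation. The sketch below is not the authors' benchmark: the two mechanisms (values removed completely at random versus the lowest intensities removed, as under a detection limit), the two imputation rules (observed mean versus observed minimum), the simulated intensity distribution, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def rmse(truth, imputed, missing):
    """Root-mean-square error computed only over the missing positions."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in missing) / len(missing))


def impute(values, missing, fill):
    """Replace every missing position with a single fill value."""
    return [fill if i in missing else v for i, v in enumerate(values)]


def compare(truth, missing):
    """Return (RMSE of mean imputation, RMSE of minimum imputation)."""
    observed = [v for i, v in enumerate(truth) if i not in missing]
    mean_fill = sum(observed) / len(observed)
    min_fill = min(observed)
    return (
        rmse(truth, impute(truth, missing, mean_fill), missing),
        rmse(truth, impute(truth, missing, min_fill), missing),
    )


rng = random.Random(0)
n, k = 200, 60  # 200 simulated log-intensities, 30% of them missing

# Hypothetical log-intensities for one analyte across many samples.
truth = [rng.gauss(20.0, 1.0) for _ in range(n)]

# Mechanism A (assumed): values go missing completely at random.
random_missing = set(rng.sample(range(n), k))

# Mechanism B (assumed): the k lowest intensities go missing (left-censoring).
censored_missing = set(sorted(range(n), key=lambda i: truth[i])[:k])

rand_mean, rand_min = compare(truth, random_missing)
cens_mean, cens_min = compare(truth, censored_missing)

print(f"random missingness:   mean-fill RMSE={rand_mean:.2f}, min-fill RMSE={rand_min:.2f}")
print(f"censored missingness: mean-fill RMSE={cens_mean:.2f}, min-fill RMSE={cens_min:.2f}")
```

In this toy setting, mean imputation has the lower error when values are missing at random, while the cruder minimum imputation has the lower error when the low intensities are censored. Neither rule is better in general, which is the sense in which the question becomes finding the most appropriate method rather than the most accurate one.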