
Abstract

Human genome sequencing has generated population variant datasets containing millions of variants from hundreds of thousands of individuals [1-3]. These datasets show that the genomic distribution of genetic variation is influenced on genic and sub-genic scales by gene essentiality [1,4,5], protein domain architecture [6] and the presence of genomic features such as splice donor/acceptor sites [2]. However, the variant data are still too sparse to provide a comparative picture of genetic variation between individual protein residues in the proteome [1,6]. Here, we overcome this sparsity for ∼25,000 human protein domains in 1,291 domain families by aggregating variants over equivalent positions (columns) in multiple sequence alignments of sequence-similar (paralogous) domains [7,8]. We then compare the resulting variation profiles from the human population to residue conservation across all species [9] and find that the same tertiary structural and functional pressures that affect amino acid conservation during domain evolution also constrain missense variant distributions. Thus, depletion of missense variants at a position implies that it is structurally or functionally important. We find that such positions are enriched in known disease-associated variants (OR = 2.83, p ≈ 0), while positions that are both missense depleted and evolutionarily conserved are further enriched in disease-associated variants (OR = 1.85, p = 3.3×10⁻¹⁷) compared to those that are only evolutionarily conserved (OR = 1.29, p = 4.5×10⁻¹⁹). Unexpectedly, a subset of evolutionarily Unconserved positions are Missense Depleted in humans (UMD positions), and these are also enriched in pathogenic variants (OR = 1.74, p = 0.02). UMD positions are further differentiated from other unconserved residues in that they are enriched in ligand, DNA and protein binding interactions (OR = 1.59, p = 0.003), which suggests that this stratification can identify functionally important positions.
A different class of positions, those that are Conserved and Missense Enriched (CME), shows an enrichment of ClinVar risk factor variants (OR = 2.27, p = 0.004). We illustrate these principles with the G-Protein Coupled Receptor (GPCR) family, the Nuclear Receptor Ligand Binding Domain family and In Between Ring-Finger (IBR) domains, and list a total of 343 UMD positions in 211 domain families. This study has broad applications, including: (a) providing focus for functional studies of specific proteins by mutagenesis; (b) refining pathogenicity prediction models; and (c) highlighting which residue interactions to target when refining the specificity of small-molecule drugs.
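
The two computational ideas in the abstract, pooling variants over alignment columns and summarising enrichment as an odds ratio, can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract does not describe in detail. The function names, the input layout (a dictionary of gapped sequences and a list of `(sequence_id, residue_number)` pairs) and the toy data in the usage note are all hypothetical, and the odds ratio shown is the plain 2×2 cross-product ratio without any significance test or continuity correction.

```python
from collections import Counter


def column_missense_counts(alignment, variants):
    """Pool missense variants over alignment columns.

    alignment: dict mapping a sequence id to its aligned sequence,
               with '-' marking gaps (hypothetical input layout).
    variants:  iterable of (sequence_id, residue_number) pairs, where
               residue_number is 1-based within the ungapped sequence.
    Returns a Counter mapping 0-based column index to variant count.
    """
    counts = Counter()
    for seq_id, aligned in alignment.items():
        # Map each ungapped residue number to its alignment column.
        residue_to_column = {}
        residue_number = 0
        for column, char in enumerate(aligned):
            if char != "-":
                residue_number += 1
                residue_to_column[residue_number] = column
        # Add this sequence's variants to the columns they fall in.
        for variant_seq, position in variants:
            if variant_seq == seq_id and position in residue_to_column:
                counts[residue_to_column[position]] += 1
    return counts


def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for the 2x2 table [[a, b], [c, d]].

    For example, a and b could be positions in the class of interest
    with and without a disease-associated variant, and c and d the
    same counts for all other positions (an assumed table layout).
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)
```

As a usage example, two toy paralogs aligned as `{"P1": "MK-LV", "P2": "M-ALV"}` with variants `[("P1", 3), ("P2", 3), ("P2", 2), ("P1", 1)]` place both leucine variants in the same column, which is the pooling effect that makes sparse per-residue data comparable across a domain family. An odds ratio above 1 from `odds_ratio` then indicates enrichment in the class of interest; assessing significance would need a separate test such as Fisher's exact test.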
