Concepedia

Publication | Open Access

Genome-wide analysis relating expression level with protein subcellular localization

Citations: 67

References: 14

Year: 2000

Abstract

We investigate the relationship between protein subcellular localization and gene expression for a variety of whole-genome expression datasets. We find high expression levels for cytoplasmic proteins and low ones for nuclear and membrane proteins. Excreted proteins have large fluctuations in expression level over various time courses. Our results can be interpreted in terms of protein structure and function. Detailed statistics are at http://bioinfo.mbb.yale.edu/genome/expression.

The recent advent of experiments that measure gene expression levels (mRNA transcript abundance levels) on a genome-wide scale allows a comprehensive view of gene activity patterns in cells. For instance, these experiments have demonstrated that the expression patterns of many functionally related genes are similar [1] (Velculescu V.E. et al. Characterization of the yeast transcriptome. Cell 1997; 88: 243-251), [2] (DeRisi J.L. et al. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997; 278: 680-686), [3] (Cho R.J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 1998; 2: 65-73), [4] (Holstege F.C. et al. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 1998; 95: 717-728), [5] (Roth F.P. et al. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 1998; 16: 939-945), [6] (Spellman P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998; 9: 3273-3297), [7] (Jelinsky S.A., Samson L.D. Global response of Saccharomyces cerevisiae to an alkylating agent. Proc. Natl. Acad. Sci. U. S. A. 1999; 96: 1486-1491), [8] (Niehrs C., Pollet N. Synexpression groups in eukaryotes. Nature 1999; 402: 483-487). Here, we show that, for yeast, expression levels in these experiments are clearly correlated with the subcellular localization of the corresponding protein. Furthermore, this correlation can be interpreted in terms of broad classes of protein structures and functions.

We scaled the expression levels generated by a range of techniques (gene chip, SAGE, cDNA microarray) for yeast in a variety of conditions into a common framework and cross-referenced them with the known localizations of approximately 2000 yeast proteins found in the MIPS [9] (Mewes H.W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 1999; 27: 44-48) and YPD [10] (Hodges P.E. et al. The Yeast Proteome Database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Res. 1999; 27: 69-73) databases. (Further details are given in the caption to Figure 1 and on the associated website, http://bioinfo.mbb.yale.edu/genome/expression.) As shown in Figure 1, high expression levels can be observed for cytoplasmic proteins, low levels for nuclear and membrane proteins, and middling levels for secretory pathway proteins, i.e. those secreted or in the endoplasmic reticulum (ER) and Golgi apparatus.

Figure 2 shows a more detailed representation of the absolute expression levels in Figure 1. We chose the dataset of Holstege et al. [4] as a reference because it results from careful averaging over many experiments. Figure 2 shows simplified box-plot representations of the underlying distributions of expression levels for each of the different subcellular compartments for this dataset. It is evident that each compartment shows an appreciable spread of expression levels and that the distributions for the compartments with the highest expression levels are more spread out than those for compartments with lower expression levels. Full representations of each expression level distribution are shown on our website (http://bioinfo.mbb.yale.edu/genome/expression). These show that the distributions are roughly exponential, although the distribution tails are somewhat longer. The exponential shape may reflect the fact that many genes are expressed at a basal level whereas a smaller number are highly active in the particular state of the cell.

Table 1 shows key statistics based on the box-plots in Figure 2 for the Holstege et al. [4] dataset. For comparison, we show the same statistics for a variety of other expression experiments that used different techniques (gene chips and SAGE). These comparisons show that our results are largely consistent over the variety of experiments we analysed. By this we mean that the overall trend of high expression in the cytoplasm and low expression in the nucleus can be observed consistently in the different datasets. However, the datasets do differ in the exact value of each statistic, with these differences probably resulting from the slightly different protocols and growth conditions employed in each experiment.

Table 1. Key statistics of the box-plot representations (a)

| Data set | Distribution parameters (b) | Cytoplasm | Extracellular | ER | Plasma-membrane | Golgi | Mitochondria | Membrane | Nucleus |
|---|---|---|---|---|---|---|---|---|---|
| Holstege et al. [4] | 75% | 23.5 | 10.5 | 3.6 | 2.7 | 2.3 | 2.6 | 1.8 | 1.3 |
| | 50% (median) | 4.5 | 2.5 | 1.7 | 1.2 | 1.3 | 1.1 | 0.9 | 0.7 |
| | 25% | 1.3 | 0.6 | 0.8 | 0.5 | 0.9 | 0.6 | 0.4 | 0.4 |
| | No. of ORFs | 479 | 16 | 113 | 106 | 48 | 266 | 148 | 698 |
| Jelinsky and Samson [7] | 75% | 26.8 | 5.5 | 3.0 | 1.9 | 2.0 | 2.3 | 1.5 | 1.0 |
| | 50% (median) | 4.2 | 2.5 | 1.1 | 0.5 | 0.9 | 0.8 | 0.6 | 0.4 |
| | 25% | 0.9 | 0.4 | 0.5 | 0.2 | 0.5 | 0.4 | 0.2 | 0.2 |
| | No. of ORFs | 480 | 17 | 116 | 115 | 48 | 262 | 149 | 668 |
| Roth et al. [5], mating type a | 75% | 11.9 | 6.7 | 4.7 | 3.3 | 3.3 | 3.4 | 3.3 | 3.3 |
| | 50% (median) | 3.4 | 3.3 | 3.3 | 1.7 | 1.9 | 2.7 | 1.7 | 1.4 |
| | 25% | 2.0 | 1.5 | 1.4 | 1.2 | 1.3 | 1.4 | 1.2 | 1.2 |
| | No. of ORFs | 494 | 18 | 117 | 128 | 49 | 273 | 166 | 739 |
| Velculescu et al. [1], SAGE, log phase | 75% | 15.0 | 5.2 | 5.5 | 2.0 | 3.0 | 3.0 | 2.0 | 2.0 |
| | 50% (median) | 4.0 | 2.0 | 2.0 | 1.5 | 2.0 | 2.0 | 1.0 | 1.0 |
| | 25% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| | No. of ORFs | 242 | 6 | 55 | 48 | 23 | 97 | 50 | 215 |

(a) Key statistics derived from Figure 2. For comparative purposes we show these statistics for a variety of different experiments as well as for that of Holstege et al. [4].

(b) The row ‘No. of ORFs’ gives the number of different transcripts with expression levels that are associated with the cellular compartment in each experiment; these were obtained by cross-referencing with the localization databases. The row ‘75%’ contains the expression value of the top quartile line, i.e. the 75% line. For instance, 75% of transcripts belonging to the cytoplasm in the Holstege data have a value of less than 23.5 copies per cell. Similarly, the row ‘50%’ contains the median value of expression for transcripts in each compartment and the row ‘25%’ contains the expression value of the bottom quartile line. Although the experiments vary in methodology and growth conditions, the same trend can be observed. The differentiation of expression according to subcellular compartmentalization is evident in all cases. More detailed statistics can be found on the website. All the data have been scaled into a common reference system based on the Holstege data. Furthermore, on the website, we give results similar to those in this table for a ‘combined’ dataset derived from averaging many different experiments. Abbreviation: ORFs, open reading frames.

Expression measurements over time also enable us to relate fluctuations in expression to localization. As described in the caption to Figure 1, and shown in the body of the figure, we measure the magnitude of fluctuation in terms of the standard deviation in expression ratio over a time course. For the yeast cell cycle time course [3, 6] we find that secreted proteins have, perhaps predictably, high fluctuations. (However, proteins in the secretory pathway with final destinations in the ER or Golgi show fluctuations slightly below average.) Plasma membrane proteins, needed for transporting molecules out of the cell, have the second highest fluctuations. Biologically, these results are quite reasonable, as the export of proteins from the cell is quite variable and depends on the exact state of the cell, whereas the amount of intracellular protein has to be maintained at a more constant level. For the diauxic shift time course [2], which is also shown in Figure 1, we again observe high fluctuations for secreted proteins.
We also observe particularly high fluctuations for cytoplasmic and mitochondrial proteins, as expected when the cell shifts from fermentation to respiration and alters the activity of many metabolic enzymes.

Analysis of the functions of the associated proteins further elucidates the relationship between expression and localization. Using the MIPS classification [9] (Mewes H.W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 1999; 27: 44-48), we can subdivide the yeast genome into various functional categories, e.g. ‘cell structure’ and ‘transcription’. Figure 3a shows the average absolute expression levels for each functional category for a variety of experiments, doing for function what Figure 1 does for localization. Likewise, Figure 3b shows box-plot representations of the expression distributions for select functional categories in a similar fashion to Figure 2. We observe that proteins in the ‘transcription’ and ‘transport’ categories have lower average expression levels (1.8 and 2.7 versus 3.2 for all classified genes; all values are in copies per cell for the data set of Holstege et al. [4] (Holstege F.C. et al. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 1998; 95: 717-728); see Figure 3a). On the other hand, expression levels in the categories ‘protein synthesis’ and ‘energy’ are above average (16.0 and 5.4). This is all in accord with the localization results (as noted above, nuclear and membrane proteins have low expression levels and cytoplasmic proteins have high expression levels). Proteins involved in transcription include DNA-binding and regulatory proteins with clear nuclear localization; proteins involved in extracellular transport are often located in the cell membrane.
Furthermore, the high level of expression associated with protein synthesis is due largely to the ribosomal proteins (average expression level of 23.2), which are in the cytoplasm. In contrast, the amino-acyl-tRNA synthetases, which are also part of the broad category of protein synthesis, have considerably lower levels of expression. Proteins involved in energy production include the cytoplasmic proteins involved in glycolysis, which have high levels of expression (20.5), and the more lowly expressed mitochondrial proteins involved in the tricarboxylic acid (TCA) cycle (2.0).

The relationship between expression and localization is also linked to protein structure, although to a lesser extent. Using membrane protein prediction plus classifications of the known soluble protein folds [11] (Gerstein M. Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins 1998; 33: 518-534), [12] (Jansen R., Gerstein M. Analysis of the yeast transcriptome with broad structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 2000; 28: 1481-1488), we can subdivide the proteins in the yeast genome into helical membrane proteins and soluble proteins and, among the latter, proteins with a structural architecture that is all α, all β or mixed αβ. Figure 3c shows average expression levels for these structural classes, in analogy to the presentation in Figures 1 and 3a for localization and function. We find a low average expression level for the transmembrane proteins (1.7). On the other hand, proteins with mixed αβ architecture, which are typically found in the cytosol [13] (Hegyi H., Gerstein M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 1999; 288: 147-164), are the most highly expressed among the soluble proteins (3.5 versus 2.5 for other soluble proteins).

To conclude, we find a clear statistical relationship between a gene’s expression level and its subcellular localization, with cytoplasmic proteins tending to be highly expressed and nuclear or membrane proteins more lowly expressed. This relationship may be useful in predicting subcellular localization, given expression information [15] (Drawid A., Gerstein M. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol. 2000; 360: 1077-1093). The correlation between expression and localization may be related to the volumes of the various subcellular compartments. The cytoplasm, for instance, has much more space for proteins than the other compartments. To achieve the same effective concentration, the expression level for freely diffusing proteins destined for larger compartments may need to be higher than for those destined for smaller ones. In other words, genes associated with cytoplasmic proteins may be regulated differently, and over a different dynamic range, from those associated with membrane and mitochondrial proteins.
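The two summary measures used in this analysis, the per-compartment quartiles of expression level reported in Table 1 and the standard deviation of the expression ratio over a time course, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the dictionary-based input format, the toy values, the choice of the "inclusive" quantile interpolation and the use of the population (rather than sample) standard deviation are all assumptions made for illustration; the text does not specify these details.

```python
from collections import defaultdict
from statistics import pstdev, quantiles


def compartment_quartiles(expression, localization):
    """Summarize expression levels per subcellular compartment.

    expression:   dict mapping ORF name -> expression level (e.g. copies per cell)
    localization: dict mapping ORF name -> compartment name
    Both input formats are hypothetical; ORFs without a known
    localization are skipped, mirroring the cross-referencing step.
    """
    groups = defaultdict(list)
    for orf, level in expression.items():
        compartment = localization.get(orf)
        if compartment is not None:
            groups[compartment].append(level)

    stats = {}
    for compartment, levels in groups.items():
        if len(levels) < 2:
            continue  # statistics.quantiles needs at least two data points
        # Assumption: "inclusive" interpolation; the paper does not say
        # how its quartile lines were computed.
        q25, q50, q75 = quantiles(levels, n=4, method="inclusive")
        stats[compartment] = {
            "n_orfs": len(levels),
            "25%": q25,
            "50%": q50,
            "75%": q75,
        }
    return stats


def fluctuation(ratios):
    """Standard deviation of the expression ratio over a time course.

    Assumption: population standard deviation; the sample standard
    deviation (statistics.stdev) would be an equally plausible reading.
    """
    return pstdev(ratios)


if __name__ == "__main__":
    # Toy values, not taken from any of the datasets discussed above.
    toy_expression = {"ORF1": 1.0, "ORF2": 2.0, "ORF3": 3.0, "ORF4": 10.0, "ORF5": 20.0}
    toy_localization = {
        "ORF1": "nucleus", "ORF2": "nucleus", "ORF3": "nucleus",
        "ORF4": "cytoplasm", "ORF5": "cytoplasm",
    }
    for compartment, row in compartment_quartiles(toy_expression, toy_localization).items():
        print(compartment, row)
    print("fluctuation:", fluctuation([0.8, 1.0, 1.6, 1.1]))
```

Grouping by compartment before computing quartiles keeps each row of a Table 1 style summary independent, so the same function can be applied to each expression dataset in turn once its values have been scaled to a common reference.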
