The Sedimentary Geochemistry and Paleoenvironments Project

Abstract

Geobiology explores how Earth's system has changed over the course of geologic history and how living organisms on this planet are impacted by or are indeed causing these changes. For decades, geologists, paleontologists, and geochemists have generated data to investigate these topics. Foundational efforts in sedimentary geochemistry utilized spreadsheets for data storage and analysis, suitable for several thousand samples, but not practical or scalable for larger, more complex datasets. As results have accumulated, researchers have increasingly gravitated toward larger compilations and statistical tools. New data frameworks have become necessary to handle larger sample sets and encourage more sophisticated or even standardized statistical analyses. In this paper, we describe the Sedimentary Geochemistry and Paleoenvironments Project (SGP; Figure 1), which is an open, community-oriented, database-driven research consortium. The goals of SGP are to (1) create a relational database tailored to the needs of the deep-time (millions to billions of years) sedimentary geochemical research community, including assembling and curating published and associated unpublished data; (2) create a website where data can be retrieved in a flexible way; and (3) build a collaborative consortium where researchers are incentivized to contribute data by giving them priority access and the opportunity to work on exciting questions in group papers. Finally, and more idealistically, the goal was to establish a culture of modern data management and data analysis in sedimentary geochemistry. Relative to many other fields, the main emphasis in our field has been on instrument measurement of sedimentary geochemical data rather than data analysis (compared with fields like ecology, for instance, where the post-experiment ANOVA (analysis of variance) is customary). 
Thus, the longer-term goal was to build a collaborative environment where geobiologists and geologists can work and learn together to assess changes in geochemical signatures through Earth history. With respect to the data product, SGP is focused on assembling a well-vetted and comprehensive dataset that is tractable to multivariate statistical analyses accounting for multiple geological and methodological biases. Phase 1 of the project, which focused on the Neoproterozoic and Paleozoic, has been completed. Future phases will capture a broader range of geologic time, data types, and geography. The database contains tens of thousands of unpublished data points provided by consortium members, as well as detailed metadata that go beyond what is contained in papers. In many cases, these represent measurements that are tangential to a given published study but still of high utility to database studies; these allow the community to address questions that would be impossible to answer solely with the published data. For instance, in order to use a proxy such as Mo/TOC (total organic carbon) ratios in mudrocks deposited under a euxinic water column, the full suite of trace metal, iron speciation, and total organic carbon data is needed. Likewise, geospatial information is required to account for sampling biases, and many statistical learning approaches cannot accept, or have difficulty with, incomplete geological predictor variables. Ultimately, it is this complete data matrix that will allow for SGP’s most insightful analyses. This paper serves as an introduction to SGP, the process by which our data products are created, a description of the Phase 1 data product and a citable reference for that product, a description of the SGP website and API (Application Programming Interface) for open access, and a statement of our future goals. 
In recent years, there has been a welcome trend in the broader geochemical community toward increased data accessibility, documentation of sample context, and sample curation, albeit with challenges still ahead (Brantley et al., 2020; Cutcher-Gershenfeld et al., 2016; Planavsky et al., 2020). First, progress has been made through journals and organizations adopting stringent data archiving rules and promoting adherence to FAIR principles—findability, accessibility, interoperability, and reusability (“FAIR Play in Geoscience Data,” 2019; Wilkinson et al., 2016). Second, several databases now house geochemical data at different scales and with different focuses (Brantley et al., 2020; Gard et al., 2019; He et al., 2019; Lehnert et al., 2000). Among the largest and most active are projects such as EarthChem (earthchem.org), the Geobiodiversity Database (geobiodiversity.com), Pangaea (https://www.pangaea.de), and the StabisoDB (https://cnidaria.nat.uni-erlangen.de/stabisodb/). The SGP database was built with the data structures and standards of these other projects in mind, in keeping with FAIR principles and with the hope that data can be easily shared in the future. Consistent with the stance taken by other organizations in the community (Hanson, 2016), we also strongly encourage all members to register their samples for an International Geo Sample Number (IGSN; i.e., globally unique alphanumeric sample identifiers), which can be obtained from the System for Earth Sample Registration (www.geosamples.org). However, SGP is a domain-specific project that differs from other databases in the way the data are collected, the nature of the data collected, and the tailored way in which they are presented to our research community. 
Although some other databases contain sedimentary geochemical data, the vast majority of deep-time data is not available from any single source, and samples are not readily associated with critical contextual data—such as age constraints and environmental data—necessary for the types of proxy-through-time and/or environmental studies typically conducted in historical geobiology. When the SGP was founded in 2015, we believed that a “team science” philosophy would be the most effective way to move beyond spreadsheets to the type and abundance of data required. The research consortium framework we have implemented is modeled after mature consortia in human statistical genetics, such as the Psychiatric Genomics Consortium (PGC). In the PGC, researchers have aggregated data to make statistically robust observations and landmark findings not possible with the data generated by any single research group alone (Duncan et al., 2017; Schizophrenia Working group of the Psychiatric Genomics Consortium, 2014; Wray et al., 2018). Similar to biomedical research consortia, we hope that the intellectual and collaborative environment fostered by SGP will ultimately be as important as our data products or specific insights in research papers. The first priority for Phase 1 of SGP was to assemble or generate multi-proxy sedimentary geochemical data (carbon and sulfur abundances and isotopes, iron speciation, major and trace metal abundances, and trace metal isotopes, primarily from fine-grained siliciclastic rocks) from multiple regions worldwide for every Paleozoic Epoch and equivalent ~25 Myr Neoproterozoic time slice. In addition to data compilation, this has involved an effort by SGP members to generate new geochemical data from “background” intervals in the Paleozoic (i.e., not associated with events such as mass extinctions or significant climatic shifts). The first phase of data collection came to an end in 2019. 
At that point, a copy of the database was vetted by SGP team members and then archived as the first data “freeze” (following the best-practices approach used in medical consortia). Working groups were formed (with working group leadership established through an open call to SGP team members), and data were made available to working group analysts via the website and through tailored queries. The first working group papers have recently been published (LeRoy et al., 2021; Lipp et al., 2021; Mehra et al., 2021), and more are in progress. Meanwhile, data collection continues, and the Phase 2 goal is to include more Mesozoic–Cenozoic and pre-Neoproterozoic time intervals and to expand the geochemical record to more diverse lithologies and grain-specific phases. The Phase 2 data freeze is currently anticipated for 2023, followed by data vetting and analyses toward group papers.
SGP utilizes a relational database implemented with the PostgreSQL database management system. A full database diagram and documentation are available at https://github.com/ufarrell/sgp_phase1, and a simplified diagram is shown in Figure 2. The design was inspired by several existing data models in the geological and natural history museum communities. Tables for analytical geochemistry are from the British Geological Survey (BGS) geochemistry data model (Watson et al., 2014), with minor modifications. Tables for geological, geographical, and sample details are based on established museum collection management databases (Specify 6 https://www.specifysoftware.org/ and Arctos https://arctosdb.org/) in addition to the Observations Data Model 2 (ODM2, Horsburgh et al., 2016; Hsu et al., 2017), an information model for Earth observations. The SGP database is centered on the sample table (Figure 2). A sample generally comprises an individual rock sample and all analyzed powders derived from it.
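As a rough illustration of a sample-centered relational layout of this kind, the sketch below builds a tiny in-memory database in which analytical, geographical, and geological tables all link back to a sample table. It is a minimal sketch under stated assumptions and not the SGP schema: every table name, column name, and data value is hypothetical, and SQLite (through Python's standard library) stands in for PostgreSQL so that the example is self-contained. The actual data model is documented in the project repository.

```python
import sqlite3

# Hypothetical, heavily simplified schema. The real SGP tables and columns
# differ; see the project's own database documentation.
SCHEMA = """
CREATE TABLE site (
    site_id    INTEGER PRIMARY KEY,
    latitude   REAL,
    longitude  REAL
);
CREATE TABLE geol_context (
    context_id INTEGER PRIMARY KEY,
    strat_unit TEXT,
    lithology  TEXT
);
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,
    original_name TEXT NOT NULL,
    site_id       INTEGER REFERENCES site(site_id),
    context_id    INTEGER REFERENCES geol_context(context_id)
);
CREATE TABLE analytical_result (
    result_id  INTEGER PRIMARY KEY,
    sample_id  INTEGER NOT NULL REFERENCES sample(sample_id),
    analyte    TEXT NOT NULL,
    value      REAL,
    unit       TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up sample."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO site VALUES (1, 45.0, -110.0)")
    conn.execute(
        "INSERT INTO geol_context VALUES (1, 'Example Formation', 'shale')"
    )
    conn.execute("INSERT INTO sample VALUES (1, 'EX-01', 1, 1)")
    conn.executemany(
        "INSERT INTO analytical_result VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "TOC", 2.5, "wt%"), (2, 1, "Mo", 30.0, "ppm")],
    )
    conn.commit()
    return conn


def results_for_sample(conn, sample_id):
    """Join a sample to its geological context and all of its results."""
    query = """
        SELECT s.original_name, g.lithology, r.analyte, r.value, r.unit
        FROM sample AS s
        JOIN geol_context AS g ON g.context_id = s.context_id
        JOIN analytical_result AS r ON r.sample_id = s.sample_id
        WHERE s.sample_id = ?
        ORDER BY r.result_id
    """
    return conn.execute(query, (sample_id,)).fetchall()
```

Calling `results_for_sample(build_demo_db(), 1)` returns one row per analytical result, each carrying the sample name and lithology. This shows how one rock sample can fan out to many analyses while its contextual information is stored only once.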
The three key sections of the database linked to samples are (1) analytical results and associated methods, (2) geographical context, and (3) geological context. Dictionary tables (standardized lists of terms, also known as “controlled vocabularies”) are based on existing community vocabularies where possible (e.g., from EarthChem, ODM2, Macrostrat, U.S. Geological Survey (USGS), and BGS). However, in many cases, these vocabularies required additions, such as the inclusion of specific sedimentary geochemical experimental methods (e.g., sequential iron extraction techniques; Poulton & Canfield, 2005). The BGS data model for analytical methods and geochemical results has been adopted almost without modification. We store analytical data in their submitted or published format and do not standardize the results to any given unit. An analytical result may be empty (NULL) only if it is below or above detection limits, and those values are also stored if they are available. If the results are published, they are linked directly to a reference work on an individual basis so that a fine-level distinction can be made between published and related unpublished data from the same samples. Any geostandards that are analyzed alongside samples in a study are also recorded. In the SGP, we make every effort not to include the same result twice. However, replicates may legitimately be added if the same sample has undergone analysis for the same analyte more than once (this could include anything from true replicate analyses using the same methods in the same laboratory to analyses of the same sample by different research groups using different methods). We do not currently assign new sample identifiers to sub-samples. A parent–child relationship may be added in Phase 2 when the focus will expand to include carbonate data. The SGP welcomes contributions from any interested researchers. 
Specifically, contributing data automatically makes a researcher part of the SGP Collaborative Team, rather than one needing to “join” SGP to contribute data. In the first consortium-building stage, potential collaborators were targeted if their work was particularly relevant to the Phase 1 goals, and additional researchers were recruited via SGP representation at multiple conferences. SGP collaborators are involved in providing details about their samples and providing published data tables and unpublished data from their own archives. In addition, some data have been collected from relevant published studies where the authors are not directly involved. In such cases, contextual information was coded by SGP team members using information provided in the paper. SGP collaborators are asked to fill in a template with contextual information as completely as possible, but with an emphasis on key fields such as modern latitude and longitude, stratigraphic unit name, depositional environment, and lithology. A particularly important field is interpreted age, which is a numerical estimate for the age of each sample in millions of years (Ma). Whenever possible, the original authors, who are most familiar with the samples and stratigraphic sections, are asked to provide the interpreted age. They can use whatever method with which they feel most comfortable; for example, ages may be estimated based on assumed sedimentation rates and/or linear interpolation, or groups of samples can be assigned one age based on proximity to any available time markers. A brief justification is required for each age provided, which may be used in the future to refine ages further. Maximum and minimum age estimates can also be stored, and indeed, are critical for the type of re-weighted bootstrap analyses employed by many SGP working groups (Mehra et al., 2021). A subset of samples from two USGS databases has been integrated into the SGP database. 
The first of the databases used is the National Geochemical Database: Rock (USGS NGDB, U.S. Geological Survey, 2008), comprising data from USGS projects from the 1960s to 1990s, largely from North America. The second is the Global Geochemical Database for Critical Metals in Black Shales project (USGS CMIBS, Granitto et al., 2017), which includes predominantly Phanerozoic shale data from all continents. Data from both USGS databases lack much of the contextual information available for samples directly coded by the SGP team members (most specifically basin type, metamorphic/maturity grade, depositional environment, and detailed age justification), and there is a higher proportion of analytes with less detailed geochemical methodology. Nevertheless, they represent large numbers of samples (74% of samples in Phase 1 are from USGS sources) with age, lithology, and geographic information that can be utilized for many types of analysis. In the case of USGS NGDB, only sedimentary samples were incorporated into SGP, and in the case of USGS CMIBS, we did not include samples with lithologies indicative of ore or studies where the authors were primarily concerned with mineral deposits or studying the effects of metamorphism on shales. An attempt was made to match USGS fields to SGP fields, with some data cleaning needed in order to extract important information such as up-to-date stratigraphic names. Samples can easily be traced back to the original USGS databases using their original identifiers. The USGS NGDB data were enhanced by adding interpreted ages. Samples were matched, using a combination of stratigraphy and location, to the continuous-time age model in Macrostrat (Peters et al., 2018). Specifically, the minimum and maximum age estimates from the Macrostrat model were entered, and the interpreted age was entered as the average of these values. Only samples with matched interpreted ages were included from USGS NGDB.
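The age assignment described above for the USGS NGDB samples can be sketched in a few lines. This is an illustrative sketch only: the `AgeEstimate` class, the `assign_interpreted_ages` function, and the lookup by unit name are hypothetical simplifications. The actual matching combined stratigraphy and location, and no Macrostrat API is called here. The sketch shows only the two rules stated in the text: the interpreted age is the mean of the minimum and maximum ages, and unmatched samples are excluded.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class AgeEstimate:
    """Age bounds in millions of years (Ma); names are hypothetical."""
    min_age_ma: float  # youngest bound
    max_age_ma: float  # oldest bound

    @property
    def interpreted_age_ma(self) -> float:
        # Interpreted age taken as the average of the two bounds.
        return (self.min_age_ma + self.max_age_ma) / 2.0


def assign_interpreted_ages(
    samples: Iterable[Tuple[str, str]],
    age_model: Dict[str, AgeEstimate],
) -> Dict[str, AgeEstimate]:
    """Map sample IDs to age estimates, dropping unmatched samples.

    `samples` yields (sample_id, unit_name) pairs; `age_model` maps a unit
    name to its age bounds. Matching on unit name alone is an assumption
    made to keep the example short.
    """
    matched = {}
    for sample_id, unit_name in samples:
        estimate = age_model.get(unit_name)
        if estimate is None:
            continue  # no age match: the sample is not included
        matched[sample_id] = estimate
    return matched
```

For example, with an age model containing a single unit bounded at 480 and 500 Ma, a sample from that unit receives an interpreted age of 490 Ma, while a sample from an unlisted unit is omitted from the result. Keeping the minimum and maximum alongside the interpreted age preserves the uncertainty range for later resampling analyses.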
The USGS CMIBS samples were associated with Macrostrat continuous-time age models where possible and given age information by SGP team members where not. However, a proportion (36%) remain without ages, and filling those in is a key goal for Phase 2. These three sources of data (direct entry by SGP team members (26% of samples), the CMIBS compilation (16% of samples), and the USGS NGDB (58% of samples)) provide a robust base platform for statistical analyses of aggregated sedimentary geochemical data through Earth history. Moving forward, we will continue direct entry from SGP team members, and work toward incorporating geochemical data compiled by additional geological surveys (for instance, incorporation of the OZCHEM whole-rock database from Geoscience Australia is currently in progress). Phase 1 of data collection ended in August 2019. A static version of the database was archived and made available to collaborators through the website (sgp-search.io) and via tailored queries. was for and any were the freeze in The Phase 1 data freeze includes samples, with analytical and was made through our website in This paper be in the future use of Phase 1 data complete information on the Phase 1 data product can be on the SGP including by age, lithology, and geochemical as well as the of how USGS databases were incorporated into the SGP The dataset includes samples with two of the data from published sources The are from unpublished including new and data. 
The samples from individual from (Figure Consistent with the Phase 1 goals, of samples were from the (Figure of samples are fine-grained siliciclastic or as are the majority of lithologies (Figure The data from USGS NGDB that are incorporated into the SGP database include samples with all of the samples are from the are are and do not have a specific details may be available in Figure including depositional environment and are not available for these samples, and methodological information is In the USGS NGDB samples than the SGP are from the Paleozoic, from the and from the of samples are from the The USGS database of the but given the of the with focus on deposits and sedimentary mineral the sampling may not be of the This is from the in geochemical data by which are focused on mass Earth system and other stratigraphic The data incorporated from USGS CMIBS into the SGP database include samples with The samples are from with from from the and from The majority of samples are fine-grained siliciclastic or Figure of samples with interpreted ages are Paleozoic, are are and are As was the case for USGS NGDB, contextual including depositional environment and are for these samples. However, more detailed geochemical methodological information is available. sample in CMIBS has a result from multiple values that were available et al., The of was made using a which included of the sample the sample (e.g., full the used in the analysis, and the detection et al., The SGP website (sgp-search.io) utilizes an to the Phase 1 database via an The two main types are and with a that any data. 
This methodological distinction is made data can be for some (e.g., and it is for many (e.g., et al., data represent of the total results and of data; this is a we given the and utility of A will an individual sample on each with geological information and geochemical analytes the Data are to one and are to (e.g., to and values are if more than one analysis was made this may average values using different analytical methods, the of samples in the database with multiple analytical values for a specific analyte is any analyses below or above detection are as these cannot be This has for abundance (e.g., in sedimentary as only results above detection limits, and higher will be We that this will the data for most interested in Earth a with age, geological context, and geochemical data for each If are to into the data and the analyses and that were to each geochemical data, then the is it lists every analysis in the database in a The also to data to the laboratory where the sample was the who made the geochemical At the time, from the to data, the and types will not information or have the to geochemical methodology. 
who are interested in methodological details or who would like to a data beyond the the SGP a the has a type, samples can be based on both geological and geochemical that for many samples some of geological contextual information are Thus, for example, a for samples deposited in a basin will only samples as such and not all samples in the database deposited in that samples will have data, many may result in a results will in a that can be used to the sample also has an information associated with this will a with detailed sample Finally, the may to reference information for their For every analysis is shown as an individual this will the specific for that individual For other types, this will for every a of all geochemical data to that specific When the is with their they can then of the data and a the and age of samples in their Thus, an API call would be This API call is a type for samples that from or and have total organic carbon In other for samples from America. In addition, the API call is for a results table with that or name, collection in each and the age in millions of documentation and a are available on the The goal of SGP was to provide intellectual and for the Earth community to our of environmental changes on Earth through A of Earth's history data but it a new of researchers with the data and statistical to make from large sedimentary geochemical datasets. of the focus in SGP Phase 1 was in the consortium and the data product to the where it was for analyses by the community. We now to increasingly move toward a of for data a culture of and a shared intellectual framework for such datasets. the course of Phase we to continue at also to progress and for data analysis. We will also to the community the and of different this point, we that the SGP is a research and we welcome on how to move toward our shared goals. 
We for the version of the SGP and and for We and for their contributions to We and for the and research consortium We the of The for of SGP website is by National BGS authors with of the of the British Geological Survey, Any use of or product is for only and not by the U.S. The authors of
