A Beginner's Guide to Conducting Reproducible Research

Abstract

Replication is a fundamental tenet of science, but there is increasing fear among scientists that too few scientific studies can be replicated. This has been termed the “replication crisis” (Ioannidis 2005, Schooler 2014). Scientific papers often include inadequate detail to enable replication (Haddaway and Verhoeven 2015, Archmiller et al. 2020), many attempted replications of well-known scientific studies have failed in a wide variety of disciplines (Moonesinghe et al. 2007, Hewitt 2012, Bohannon 2015, Open Science Collaboration 2015), and rates of paper retractions are increasing (Cokol et al. 2008, Steen et al. 2013). Because of this, researchers are working to develop new ways for researchers, research institutions, research funders, and journals to overcome this problem (Peng 2011, Fiedler et al. 2012, Sandve et al. 2013, Stodden et al. 2013). Because replicating studies with new independent data is expensive, rarely published in high-impact journals, and sometimes even methodologically impossible, computationally reproducible research (most often termed simply “reproducible research”) is often suggested as a pathway for increasing our ability to assess the validity and rigor of scientific results (Peng 2011). Research is reproducible when others can reproduce the results of a scientific study given only the original data, code, and documentation (Essawy et al. 2020). This approach focuses on the research process after data collection is complete, and it has many (though not all) of the advantages of replicating studies with independent data while minimizing the largest barrier (i.e., the financial and time costs of collecting new data). Replicating studies remains the gold standard for rigorous scientific research, but reproducibility is increasingly viewed as a minimum standard that all scientists should strive toward (Peng 2011, Sandve et al. 2013, Archmiller et al. 2020, Culina et al. 2020). 
This commentary describes basic requirements for such reproducible research in the fields of ecology and evolutionary biology. In it, we make the case for why all research should be reproducible, explain why research is often not reproducible, and present a simple three-part framework all researchers can use to make their research more reproducible. These principles are applicable to researchers working in all sub-disciplines within ecology and evolutionary biology with data sets of all sizes and levels of complexity.
Reproducible research is a by-product of careful attention to detail throughout the research process and allows researchers to ensure that they can repeat the same analysis multiple times with the same results, at any point in that process. Because of this, researchers who conduct reproducible research are the primary beneficiaries of this practice.
First, reproducible research helps researchers remember how and why they performed specific analyses during the course of a project. This enables easier explanation of work to collaborators, supervisors, and reviewers, and it allows collaborators to conduct supplementary analyses more quickly and more efficiently.
Second, reproducible research enables researchers to quickly and simply modify analyses and figures. This is often requested by supervisors, collaborators, and reviewers across all stages of a research project, and expediting this process saves substantial amounts of time. When analyses are reproducible, creating a new figure may be as easy as changing one value in a line of code and re-running a script, rather than spending hours recreating a figure from scratch.
Third, reproducible research enables quick reconfiguration of previously conducted research tasks so that new projects that require similar tasks become much simpler and easier. Science is an iterative process, and many of the same tasks are performed over and over.
Conducting research reproducibly enables researchers to re-use earlier materials (e.g., analysis code, file organization systems) to execute these common research tasks more efficiently in subsequent iterations.
Fourth, conducting reproducible research is a strong indicator to fellow researchers of rigor, trustworthiness, and transparency in scientific research. This can increase the quality and speed of peer review, because reviewers can directly access the analytical process described in a manuscript. Peer reviewers' work becomes easier, and they may be able to answer methodological questions without asking the authors. Reviewers can check whether the code matches the methods described in the text of a manuscript to make sure that authors correctly performed the analyses as described. This increases the probability that errors are caught during the peer-review process, decreasing the likelihood of corrections or retractions after publication. It also protects researchers from accusations of research misconduct due to analytical errors, because it is unlikely that researchers would openly share fraudulent code and data with the rest of the research community.
Finally, reproducible research increases paper citation rates (Piwowar et al. 2007, McKiernan et al. 2016) and allows other researchers to cite code and data in addition to publications. This enables a given research project to have more impact than it would if the data or methods were hidden from the public. For example, researchers can re-use code from a paper with similar methods, organize their data in the same manner as the original paper, and then cite code from the original paper in their manuscript. A third team of researchers may conduct a meta-analysis on the phenomenon described in these two research papers and thus use and cite both of these papers and the data from those papers in their meta-analysis.
Papers are more likely to be cited in these re-use cases if full information about data and analyses is available (Whitlock 2011, Culina et al. 2018).
Reproducible research also benefits others in the scientific community. Sharing data, code, and detailed research methods and results leads to faster progress in methodological development and innovation because research is more accessible to more scientists (Parr and Cummings 2005, Roche et al. 2015, Mislan et al. 2016).
First, reproducible research allows others to learn from your work. Scientific research has a steep learning curve, and allowing others to access data and code gives them a head start on performing similar analyses. For example, researchers who are new to an analytical technique can use code shared with the research community by researchers with more experience with that technique to learn how to rigorously perform and validate these analyses. This allows researchers to conduct research that is more rigorous from the outset, rather than having to spend months or years trying to figure out current “best practices” through trial and error. Modifying existing resources can also save time and effort for experienced researchers; even experienced coders can modify existing code much faster than they can write code from scratch. Sharing code thus allows experienced researchers to perform similar analyses more quickly.
Second, reproducible research allows others to understand and reproduce a researcher's work. Allowing others to access data and code makes it easier for other scientists to perform follow-up studies to increase the strength of evidence for the phenomenon of interest. It also increases the likelihood that similar studies are compatible with one another, and that a group of studies can together provide evidence in support of or in opposition to a concept.
In addition, sharing data and code increases the utility of these studies for meta-analyses that are important for generalizing and contextualizing the findings of studies on a topic. Meta-analyses in ecology and evolutionary biology are often hindered by incompatibility of data between studies, or lack of documentation for how those data were obtained (Stewart 2010, Culina et al. 2018). Well-documented, reproducible findings enhance the likelihood that data can be used in future meta-analyses (Gerstner et al. 2017).
Third, reproducible research allows others to protect themselves from your mistakes. Mistakes happen in science. Allowing others to access data and code gives them a better chance to critically analyze the work, which can lead to coauthors or reviewers discovering mistakes during the revision process, or other scientists discovering mistakes after publication. This prevents mistakes from compounding over time and provides protection for collaborators, research institutions, funding organizations, journals, and others who may be affected when such mistakes happen.
There are a number of reasons that most research is not reproducible. Rapidly developing technologies and analytical tools, novel interdisciplinary approaches, unique ecological study systems, and increasingly complex data sets and research questions hinder reproducibility, as does pressure on scientists to publish novel research quickly. This multitude of barriers can be simplified into four primary themes: (1) complexity, (2) technological change, (3) human error, and (4) concerns over intellectual property rights. Each of these concerns can contribute to making research less reproducible and can be valid in some scenarios. However, each of these factors can also be addressed easily via well-developed tools, protocols, and institutional norms concerning reproducible research.
Science is difficult, and scientific research requires specialized (and often proprietary) knowledge and tools that may not be available to everyone who would like to reproduce research. For example, studies in the fields of ecology and evolutionary biology often involve study systems, mathematical models, and statistical techniques that require a large amount of domain knowledge to understand, and these analyses can therefore be difficult to reproduce for those with limited understanding of any of the necessary underlying bases of knowledge. Some analyses may require high-performance computing clusters that use several different programming languages and software packages, or that are designed for specific hardware configurations. Other analyses may be performed using proprietary software programs such as SAS statistical software (SAS Institute Inc., Cary, North Carolina, USA) or ArcGIS (Esri, Redlands, California, USA) that require expensive software licenses. Lack of knowledge, lack of institutional infrastructure, and lack of funding all make research less reproducible. However, most of these issues can be mitigated fairly easily. Researchers can cite primers on complex subjects or analyses to reduce knowledge barriers. They can also thoroughly annotate analytical code with comments explaining each step in an analysis or provide extensive documentation on research software. Using open software (when possible) makes research more accessible for other researchers as well.
Hardware and software used to analyze data both change over time, and they often change quickly. When old tools become obsolete, research becomes less reproducible. For example, reproducing research performed in 1960 using that era's computational tools would require a completely new set of tools today. Even research performed just a few years ago may have been conducted using software that is no longer available or is incompatible with other software that has since been updated.
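As a minimal sketch of how software change can at least be tracked, the snippet below records the interpreter, operating system, and package versions used for an analysis in a small JSON file that can be archived alongside the code. Python is assumed here purely for illustration, and the function names (`snapshot_environment`, `write_snapshot`) and the output file name are hypothetical; the same idea applies in any analysis language.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions.

    `packages` is a list of distribution names chosen by the analyst
    (hypothetical here); packages that are not installed map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


def write_snapshot(path, packages):
    """Write the environment snapshot to `path` as JSON and return it."""
    snapshot = snapshot_environment(packages)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2, sort_keys=True)
    return snapshot


if __name__ == "__main__":
    # Illustrative package list and file name; adjust to the actual analysis.
    write_snapshot("software_versions.json", ["numpy", "pandas"])
```

Saving such a file each time an analysis is run gives later readers a record of which versions produced the results, even if those versions are no longer current.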
One minor update in a piece of software used in one minor analysis in an analytical workflow can render an entire project less reproducible. However, this too can be mitigated by using established tools in reproducible research. Careful documentation of versions of software used in analyses is a baseline requirement that anyone can meet. There are also more advanced tools that can help overcome such challenges in making research reproducible, including software containers, which are described in further detail below. Though fraudulent research is often cited as reason to make research more reproducible (Ioannidis 2005, Laine et al. 2007, Crocker and Cooper 2011), many more innocent reasons exist as to why research is often difficult to reproduce (Elliott 2014). People forget small details of how they performed analyses. They fail to describe data collection protocols or analyses completely despite their best efforts and multiple reviewers checking their work. They fail to collect or thoroughly data that during collection but out to be for Science is performed by and a wide variety of common can render research less reproducible. not all of these challenges can be by performing research a research process can small errors and analyses. For example, details such as when and data were were during data and were used can make a in making sure that those data can be used or errors often during the data of a project, and these can be mitigated by multiple of data to data the process for data into data, and a small set of data the data set as a Researchers often to share data and code because so may other researchers to use data and code or Other researchers may use available data without to about the data that in analyses. 
Researchers may use available data or code without the original data or code who then not for expensive data or Researchers may to data from others so that they can perform new analyses on those data in the future without about others them using the shared can lead to to share data and code via many and we that making data openly available is likely the most of reproducible research and et al. 2013, et al. 2015, et al. et al. 2016). However, new tools for sharing data and code and in are making it easier for researchers to for so and to others from using their data during an Conducting reproducible research is not difficult does it require knowledge of research tools and they it or most researchers perform much of the work to make research reproducible. this we some basic toward making research more reproducible in stages of a research (1) data (2) during and (3) after that anyone can as as more advanced tools for those who would like to basic requirements that reproducible research of best for scientific research and that to a of reproducibility is both more and more than researchers may in the with data It does not simply from sharing data and code after a project is It is difficult to reproduce research when data are or or when it is to or how data First, data should be at of the research process and in multiple This data (e.g., data or data (i.e., data and in Because it is that researchers or data while it data should be as a It is to and save data or with a data set to ensure that these are with the data different should be in different and using different (e.g., paper and an and to of data from any are and and are should not themselves to those data should be in data in a file is a file are those that data as text with one line (e.g., or and are the most across as they can be by anyone without proprietary software For more complex data such as or other (e.g., and may be However, the of these makes them difficult for many researchers to access and use so it is 
best to with simpler file when It is often to data into a when and data are in (i.e., in in have data (e.g., data are not with data for a and have and (e.g., that not include like and in this are easy to and during explaining to the data and each of the should be with the are they can be et al. 2015), and is how we data across a all data sets should include that how and why data were whether a of or data, and how are should be in a that it with the data set it A few of a of within the same file may work in some or a text file can be in the same as the data if the be more In the it is best to with a simple file for to Finally, researchers should organize in a and make sure that all have It should be easy to is in a file or from and a (e.g., the with the or provides even more information when through in a A for both and also makes simpler by data, and in with It is often more to organize in small of similar rather than having one large full of of For example, computational projects within a for each project, with for the manuscript data analyses or and analysis within that this specific organization may for other of research, all of the research and documentation for a given project in this makes it much easier to at all stages of the research process and to it or share it with others the project is the research process, from data to can be used to a and provide a of that have over the of a project or research to a file or set of over time so that can specific versions between versions of and even to in the of mistakes. researchers use to in code and over time. most is which is often used via such as and These are easy to set and and they of data, code, and throughout the of a project. also enables a specific of data or code to be easily so that code used for analyses at a specific point in time (e.g., when a manuscript is can be even if that code is updated. 
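The one-directory-per-project organization described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: Python is assumed, the function name `create_project_skeleton` is hypothetical, and the subdirectory names are placeholders chosen for the example rather than names prescribed by the text.

```python
import tempfile
from pathlib import Path

# Hypothetical subdirectory names, chosen for illustration only.
DEFAULT_SUBDIRS = ("data/raw", "data/processed", "code", "figures", "manuscript")


def create_project_skeleton(root, subdirs=DEFAULT_SUBDIRS):
    """Create one directory per project with a small set of subdirectories.

    Existing directories are left untouched, so the function is safe to
    re-run. Returns the list of subdirectory paths.
    """
    root = Path(root)
    created = []
    for sub in subdirs:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    # A plain-text README keeps basic documentation next to the data and code.
    readme = root / "README.txt"
    if not readme.exists():
        readme.write_text(
            "Describe here what each subdirectory contains and how the "
            "data were collected.\n",
            encoding="utf-8",
        )
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project_skeleton(Path(tmp) / "example_project"):
            print(path.relative_to(tmp))
```

Keeping raw data in its own subdirectory, separate from processed data, is one way to reduce the chance of overwriting original files; placing the whole project directory under version control then records how every file changes over time.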
When all data and analysis should be performed using to using or that step is and by and both on data and as a of analytical Because of this code is reproducible. errors are mistakes during data or so having a of these that analyses can be for errors and are on future data are not to script, then they should be in a file that is in the code should be thoroughly with within code as for that code, increasing should information for an to easily understand the code but not so much that through comments is a comments can be for this by a who is about the of research but is not a project In most the few of a should include a of the does and who it, by small that data, packages, and and analytical code then those and are using a and comments to explain each of code a makes code easier to well-known (e.g., for software code that were by many Researchers should of these while in that all are to some Researchers should work to develop a that for This using a (e.g., or to and information in (e.g., using as a for to or to data should also be in and into as our process of data more easily than longer of code also tasks together and can like to make code more There are several ways to mistakes and make code easier to First, researchers should For example, if a set of analysis are used those can be as a and at the of the This the of a and the of some of a so that it in different within a researchers can use to make code more by performing the same on multiple or in (though it is also important to that too many one can quickly make code A third to reduce mistakes is to reduce the number of that be to analyses on an or new data It is often best to in the data and at the of a script, so that those can then be used throughout the rest of the When on new data, these can then be at the of a rather than multiple times in throughout the Because incompatibility between or versions can the reproducibility of research, the current gold standard for that analyses can be used in the future is 
to a software such as a or et al. are that the entire computing used in an all of and all into one can then be or allowing them to be used in the even as packages, or change over time. creating a software is or a step than researchers are to it is important to thoroughly all software including the have been it is time for the step most with reproducible sharing research with should be by sharing the data and code is from the only of reproducible and are it becomes the data, and important results should be and easily are available to make data sharing and accessible in a variety of research There are many ways to this, several of which are described below. as it is better to use than tools in it is better to and directly from code than to these using or other A large number of errors in from not to change all or when a of an analysis and this can be when a manuscript. reproducible and are directly with code and into in a that allows when analyses are creating a For example, in and directly from a so a figure be in the when the figure is in the for a much of and can also be used to that can when code or data change, so that can be reproducible as well. using one of these tools is too large a then simply directly from of and make a substantial in increasing the reproducibility of these creating it is to make data and of and a process using is a that can be used to and such as a of independent For example, a can be that the data, and it, analyze it, and with results, and update a or manuscript with those and any in the research projects to in this some time, but it can and reduce errors in code and data that can be used to research are often in the supplementary of Some journals (e.g., are even with data and code in However, this is not a of data and analyses. materials can be if a or when a In addition, research is only reproducible if it can be and many papers are published in journals that are that make them to many researchers et al. 2013, McKiernan et al. 
increase access to authors can of versions of on a or of on There are several used for and at many research data and code shared on are only available as as are and can be difficult to when researchers to domain or on are also often difficult for other scientists to as they are not to the published research and lack a make research accessible to it is therefore better to use tools like data and code than in has become more in a from a of in for sharing data, an increase in data and an increasing number of and funding who or data (Whitlock et al. 2010, 2011, et al. are large that and data sets for and may be or or that multiple data Some are and others require a for often on their and these should be when a manuscript. used are and each of these a that allows data and code to be by a researchers should used in their specific fields of research. When data, code, and of a research project are these are termed a and Research are increasingly for is in research between scientific They provide a and easily to organize the materials of a research project, which enables other researchers to and research et al. 2018). In the Open Science is a project that the of and to and share of a research project using of the is to enable research to be shared at step of the scientific developing a research and a to and data and and or papers et al. Open Science is with many other reproducible research tools, including used and many researchers reproducible research with a set of advanced tools for sharing research, reproducibility is just as much about simple work as the tools used to share data and are not reproducible not use all the tools in this commentary all the time and often fail to our to our we that reproducible research is a process rather than a and work to increase the reproducibility of our work. 
others to the Researchers can make toward a more reproducible research process by simply about data and and for making and and be in learning and these tools and and this can be we our fellow researchers to work toward more open and reproducible research so we can all the in work scientific rigor, and in science. to and for comments on versions of this manuscript and to for this project during course at the of

References
