diff --git a/manuscript/manuscript.md b/manuscript/manuscript.md index d1a135b..a0cf80d 100644 --- a/manuscript/manuscript.md +++ b/manuscript/manuscript.md @@ -76,12 +76,6 @@ organization: # Introduction {-} - - Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to the development of a wide variety of data formats, dataset sizes, data @@ -110,47 +104,22 @@ storage often begin and end with, "use a community standard repository." This is a good advice; however, data storage policies are highly variable between repositories [@Marcial2010]. A data management plan utilizing best practices across all stages of the data life cycle will facilitate transition from local -storage to repository[@Michener2015]. Similarly it can facilitate transition +storage to repository [@Michener2015]. Similarly it can facilitate transition from repository to repository if funding runs out or needs change. Good storage practices are important even (or especially) in cases where data may not fit with an existing repository, where only derived data products (versus raw data) are suitable for archiving, or in the case where an existing repository may have lax standards. - - Therefore, this manuscript describes 10 simple rules for digital data storage that grew out of a long discussion among instructors for the Software and Data Carpentry -initiative [@Wilson2014; @Teal2015]. Software and Data Carpentry instructors are scientists from +initiatives [@Wilson2014; @Teal2015]. Software and Data Carpentry instructors are scientists from diverse backgrounds who have encountered a variety of data storage challenges and are active in teaching other scientists best practices for scientific computing and data management. Thus, this paper represents a distillation of collective experience, and hopefully will be useful to scientists facing a variety of data storage challenges. - - - # Rule 1: Anticipate how your data will be used {-} One can avoid most of the troubles encountered during the analysis, management, @@ -178,22 +147,21 @@ managed locally with a simple data management plan, whereas larger datasets (e.g. gigabytes to petabytes) will in almost all cases require careful planning and preparation (Rule 10). -Early consideration and planning should be given to the metadata of the project. -A plan should be developed early as to what metadata will be collected, and how it will be maintained and stored (Rule 7). - - +Early consideration and planning should be given to the metadata of +the project. A plan should be developed early as to what metadata will +be collected, and how it will be maintained and stored (Rule 7). # Rule 2: Know your use case {-} -Well-identified use cases make data storage easier. Ideally prior to beginning -data collection, one can answer the following questions: +Well-identified use cases make data storage easier. Ideally, prior to beginning +data collection, researchers should be able to answer the following questions: - Should the raw data be archived (Rule 3)? - Should the data used for analysis be prepared once, or re-generated from the raw data each time (and what difference would this choice make for storage, computing requirements, and reproducibility)? - - Can manual corrections be avoided in favor of programmatic or self-documenting - (e.g., ipython notebooks) approaches? + - Can manual corrections be avoided in favor of programmatic or + self-documenting (e.g., Jupyter notebook) approaches? 
- How will changes to the data be tracked, and where will these tracked changes be logged? - Will the final data be released, and if so, in what format? @@ -202,14 +170,15 @@ data collection, one can answer the following questions: threatened species, or confidential business information)? - Will institutional validation be required prior to releasing the data? - - Does the funding agency mandate data deposition in a publicly available archive, and - if so, where and under what license? + - Does the funding agency mandate data deposition in a publicly + available archive, and if so, when, where, and under what license? - Does the target journal mandate data deposition? -None of these questions have universal answers, nor are they the only questions -one should ask before starting data acquisition. But knowing the -what, when, and how of *your* use of the data will bring you close to a reliable -roadmap on how to handle data from acquisition through publication to archive. +None of these questions have universal answers, nor are they the only +questions to ask before starting data acquisition. But knowing the +what, when, and how of *your* use of the data will bring you close to +a reliable roadmap on how to handle data from acquisition through +publication to archive. # Rule 3: Keep raw data raw {-} @@ -241,15 +210,13 @@ For large enough datasets the likelihood of silent data corruption is high. This technique has been widely used by many Linux distributions to distribute images and has been very effective with minimal effort. - - # Rule 4: Store data in open formats {-} To maximize accessibility and long-term value, it is preferable to store data in formats whose specifications are freely available. The appropriate file type will depend on the data being stored (e.g. numeric measurements, text, images, video), but the key idea is that accessing data should not require proprietary -software, hardware, or purchasing a commercial license. Proprietary formats +software, hardware, or purchase of a commercial license. Proprietary formats change, maintaining organizations go out of business, and changes in license fees make access to data in proprietary formats unaffordable to end-users. Examples of open data formats include comma-separated values (CSV) @@ -259,7 +226,7 @@ graphics (PNG) for images, KML (or other Open Geospatial Consortium (OGC) format) for spatial data, and extensible markup language (XML) for documents. Examples of closed formats include DWG for AutoCAD drawings, Photoshop document (PSD) for bitmap images, Windows Media Audio (WMA) for audio recording files, -and Microsoft Excel (XLSX) for tabular data. Even if day-to-day processing uses +and Microsoft Excel (XLS) for tabular data. Even if day-to-day processing uses closed formats (e.g., due to software requirements), data being stored for archival purposes should be stored in open formats. This is generally not prohibitive; most closed-source software enables users to export data to an open @@ -267,19 +234,19 @@ format. # Rule 5: Data should be stored in an easily-usable format {-} -Not only data should be stored in an open format (Rule 4), but it should also be -stored in a format that computers can easily use for processing. This is -especially crucial as datasets become larger. Easily-usable data is best -achieved by using standard data formats that have open specifications (e.g., -CSV, XML, JSON, HDF5), or by using databases. 
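The checksum technique referenced under Rule 3 can be illustrated with a minimal sketch (not code from the manuscript): a digest is recorded when the raw data are first archived and recomputed later, so that silent corruption shows up as a mismatch. The file name `raw_survey_data.csv` and the choice of SHA-256 are assumptions made only for the example.

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record this value alongside the archived raw data; comparing against it
# later reveals silent corruption or accidental modification.
print(sha256sum("raw_survey_data.csv"))  # hypothetical file name
```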
Such data formats can be handled -by a variety of programming languages, as efficient and well-tested libraries -for parsing them are typically available. These standard data formats also -ensure interoperability, facilitate re-use, and reduce the chances of data loss -or mistakes being introduced during conversion between formats. Examples of -machine-readable open formats that would *not* be easy to process include data -included in the text of a Word or PDF file, or scanned images of tabular data -from a paper source. - +Not only should data be stored in an open format (Rule 4), but it +should also be stored in a format that computers can easily use for +processing. This is especially crucial as datasets become larger. +Easily-usable data is best achieved by using standard data formats +that have open specifications (e.g., CSV, XML, JSON, HDF5), or by +using databases. Such data formats can be handled by a variety of +programming languages, as efficient and well-tested libraries for +parsing them are typically available. These standard data formats also +ensure interoperability, facilitate re-use, and reduce the chances of +data loss or mistakes being introduced during conversion between +formats. Examples of machine-readable open formats that would *not* be +easy to process include data included in the text of a Microsoft Word +or PDF file, or scanned images of tabular data from a paper source. When data can be easily imported into familiar software, whether it be a scripting language, a spreadsheet, or any other computer program that can import @@ -313,7 +280,7 @@ applications, and disciplines. With machine-readable, standards-compliant data, it easier to build an Application Programming Interface (API) to query the dataset and retrieve a -subset of interest as outlined in Rule 10 +subset of interest as outlined in Rule 10. # Rule 6: Data should be uniquely identifiable {-} @@ -341,7 +308,7 @@ Semantic versioning is a richer approach to solving the same problem incremented (or bumped) when a dataset scheme has been updated, or some other change is made that is not compatible with previous versions of the data with the same major version number. This means that an experiment using version -`1.0.0` of the dataset may not run on version `2.0.0` without changes to the +`1.0.0` of the dataset may not run on version `2.0.0` without changes to the data analysis. The *minor version* should be bumped when a change has been made which is compatible with older versions of the data with the same major version. This means that any analysis that can be performed on version `1.0.0` of the @@ -359,12 +326,12 @@ written about at length in guides for data management best practices [@Michener2 @Strasser2012; @White2013]. Metadata should be as comprehensive as possible, using standards and conventions -of a discipline, and should be machine-readable. Metadata should always +of a discipline, and should be machine-readable. Metadata should always accompany a dataset, wherever it is stored, but the best way to do this depends -on the format of the data. Text files can contain meta-data in in well defined -text files such as XML or JSON). Some file formats are self-documenting, for +on the format of the data. Text files can contain meta-data in in well defined +text files such as XML or JSON). Some file formats are self-documenting, for example NetCDF, HDF5, and many image files allow for embedded metadata -[@rew1990netcdf; @koziol1998hdf5]. 
In a relational database, metadata tables +[@rew1990netcdf; @koziol1998hdf5]. In a relational database, metadata tables should be clearly labeled and linked to the data. Ideally a schema will be provided that also shows the linkages between data tables and metadata tables. Another scenario is a set of flat text files--in this case a @@ -409,8 +376,6 @@ bringing computation to data storage facilities instead of vice versa [@Gaye2014 Having a plan for privacy before data acquisition is important, because it can determine or limit how data will be stored. - - # Rule 9: Have a systematic backup scheme {-} Every storage medium can fail, and every failure can result in loss of data. @@ -446,8 +411,8 @@ repository dissolves? # Rule 10: The location and method of data storage depends on how much you have {-} The storage method you should choose depends on the size and nature of your -data, the cost of storage, the time it takes to transfer the data, how the data -will be used and any privacy concerns. Data is increasingly generated in the +data, the cost of storage and later access, the time it takes to transfer the data, how the data +will be used, and any privacy concerns. Data is increasingly generated in the range of many terabytes by environmental sensors, satellites, automated analytical tools, simulation models, and genomic sequencers. Even larger data generating machines like the Large Hadron Collider (LHC) and the Large Scale @@ -457,10 +422,8 @@ study. While the cost of storage continues to decrease, the volume of data to be stored impacts the choice of storage methods and locations: for large datasets it is necessary to balance the cost of storage with the time of access and costs of re-generating the data. With new commercial cloud offerings (e.g., Amazon S3) -the cost of retrieving the data might exceed the cost analysis or re-generating -the data. - - +the cost of retrieving the data might exceed the cost of analysis or re-generating +the data from scratch. When data takes too long to transfer or is costly to store, it can become more efficient to use a computer that can directly access and use the data in place. @@ -487,18 +450,15 @@ careless abuse of resources. The time required to re-download and recompute results can be reduced by 'caching'. Caching stores copies of downloads and generated files that are recognized when the same script is run multiple times. - # Further Reading and Resources {-} Digital data storage is a vast topic; the references given here and elsewhere in -this paper proivde some starting points for interested readers. For beginning +this paper provide some starting points for interested readers. For beginning users of scientific data, [Data Carpentry](http://datacarpentry.org) offers workshops and resources on data management and analysis, as do the DataONE education modules [@Dataone2012]. For librarians and others who are responsible for data archiving, Data Curation Profiles [@Witt2009] may be of interest. - - # Glossary and abbreviations used in the manuscript {-} ## Projects and Initiatives {-} @@ -573,8 +533,6 @@ for data archiving, Data Curation Profiles [@Witt2009] may be of interest. attacks. Key Derivation Function (KDF) implementations like BCrypt and PBKDF2 are considered significantly more secure, but by design more costly to compute. - - * Apache **Spark** is an open source computing platform for querying large data sets in memory, in contrast to on disk based methods like MapReduce. 
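Rule 10 closes by mentioning "caching" of downloads and generated files; the sketch below (not from the manuscript) shows one simple way a script can reuse a previously downloaded file instead of re-fetching it on every run. The URL and cache directory are hypothetical, and a checksum (Rule 3) or a versioned file name (Rule 6) could be added to guard against stale copies.

```python
import os
import urllib.request

def cached_download(url, cache_dir="cache"):
    """Return a local copy of `url`, downloading it only on the first call."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Cache miss: fetch once and keep the copy for later runs.
        urllib.request.urlretrieve(url, local_path)
    return local_path

# Hypothetical usage: repeated runs of an analysis reuse the same download.
data_file = cached_download("https://example.org/sensor_readings.csv")
```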
@@ -609,7 +567,6 @@ for data archiving, Data Curation Profiles [@Witt2009] may be of interest. * **URL** (Uniform Resource Locator) gives the location of an object on the World Wide Web; the most familiar type of URL is a website address. - # Acknowledgements {-} We would like to thank G. Wilson and the Software Carpentry instructor community @@ -637,7 +594,6 @@ Ontario. \newpage - # Figure Legends {-} \textbf{Figure 1}: Example of an untidy dataset (A) and its tidy equivalent @@ -661,11 +617,8 @@ and length), information about "where", "when", and "what" animals were measured can be considered meta-data. Using the tidy format makes this distinction clearer. - # Figures {-} -\textbf{Figure 1} - \begin{figure}[h!] \centering \includegraphics[width=\columnwidth]{resources/tidy_data.eps} @@ -673,11 +626,6 @@ distinction clearer. \label{fig:tidy-data} \end{figure} - - - \nolinenumbers \newpage diff --git a/manuscript/manuscript.pdf b/manuscript/manuscript.pdf index fcda5b7..ef52974 100644 Binary files a/manuscript/manuscript.pdf and b/manuscript/manuscript.pdf differ
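To complement the tidy-data figure legend above, the sketch below reshapes a small, made-up table (not the dataset shown in Figure 1) from an untidy layout, in which the measurement type and date are packed into column names, into a tidy layout with one observation per row. pandas is used here only as one convenient example of a tool for this kind of reshaping.

```python
import pandas as pd

# Hypothetical untidy table: one column per measurement-and-date combination.
untidy = pd.DataFrame({
    "animal_id": ["A1", "A2"],
    "weight_2015-06-01": [12.1, 9.8],
    "length_2015-06-01": [33.0, 28.5],
})

# Tidy layout: one row per animal, date, and measured variable.
tidy = untidy.melt(id_vars="animal_id", var_name="variable", value_name="value")
tidy[["measurement", "date"]] = tidy["variable"].str.split("_", expand=True)
tidy = tidy.drop(columns="variable")
print(tidy)
```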