Data Management
This page documents all aspects of data management: day-to-day practice, formal NGS and proteomics plans, and general archiving options. It was inspired by similar documentation developed by Tim Essington and Gordon Holtgrieve.
Data must be 1) adequately described via metadata, 2) managed for data quality, 3) backed up in a secure manner, and 4) archived in an easily reproducible format.
All research data must be accompanied by a thorough description of those data from the beginning of the work. Metadata describes a dataset such that it can be understood, reused, and integrated with other datasets. Information in a metadata record includes where the data were collected, who is responsible for the dataset, why the dataset was created, and how the data are organized.
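As a rough illustration of the four questions above (where, who, why, how), a metadata record could be kept as a small structured file next to the data. The field names and placeholder values below are hypothetical and are not a lab standard; this is only a sketch of how a record might be checked for completeness.

```python
import json

# Hypothetical field names. The four questions (where, who, why, how) come from
# the text above; the names themselves are illustrative only.
REQUIRED_FIELDS = ("where_collected", "responsible_person", "purpose", "organization")


def missing_metadata_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


# Placeholder values, not real project data.
example = {
    "where_collected": "placeholder site name",
    "responsible_person": "placeholder name",
    "purpose": "placeholder reason the dataset was created",
    "organization": "one row per sample; columns described in the readme",
}

if __name__ == "__main__":
    print(json.dumps(example, indent=2))
    print("missing:", missing_metadata_fields(example))
```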
Students must take care that protocols and methods are employed to ensure that data are properly collected, handled, processed, used, and maintained, and that this process is documented in the metadata.
Primary data should be stored in several locations, with canonical versions on Owl (see below).
Data, including intermediate analyses, need to have a URL. This most often means they will live on a network-attached storage device (NAS; i.e. a server).
Owl is a Synology DS1812+ NAS with a Synology DX513-2 extension unit:
- DS1812+ uses 4TB HDDs (n = 8) and DX513-2 uses 8TB HDDs (n = 5)
-
- Mirrors data across HDDs, which reduces total storage capacity by 50%
- Allows for up to two concurrent HDD failures before data loss occurs
-
- One-way sync from Owl to UW Google Drive via the Synology Cloud Sync app.
- Backup frequency: Daily
- Access: Public (read-only)
-
- One-way sync from Owl to Amazon Glacier via the Synology Glacier Backup app.
- Backup frequency: Weekly
- Access: Private (requires Amazon AWS credentials)
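For a sense of scale, the drive counts above give the following nominal capacity. This is only back-of-the-envelope arithmetic: it assumes every drive in both units is mirrored, and the real usable space depends on how the volumes are actually configured on Owl.

```python
# Drive counts and sizes are taken from the list above.
ds1812_tb = 8 * 4   # DS1812+: eight 4TB drives
dx513_tb = 5 * 8    # DX513-2: five 8TB drives

raw_tb = ds1812_tb + dx513_tb

# Assumption: all drives are mirrored, so usable space is half of raw space.
usable_tb = raw_tb * 0.5

print(f"raw: {raw_tb} TB, usable after mirroring: {usable_tb:g} TB")
```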
Using the Owl NAS to store your data:
- Ask Steven or Sam to generate a user account for you. A folder will be created for you in owl/web/. Ask Steven/Sam for the name of the folder, as well as your username and password.
- Upload data to your Owl web folder:
- Navigate to http://owl.fish.washington.edu/
- Click on Web Browser login.
- If it's your first time visiting this page, your browser will present you with a warning about an insecure site or bad certificate. That's OK. Click on the option to add an exception for this site.
- Enter username and password. (NOTE: If it's your first time accessing your account, please change your password by clicking on the silhouette in the upper right corner, then "Personal" in the dropdown menu).
- Navigate to File Station > web > your_folder (If you don't see the File Station icon, click on the icon of four squares in the upper left corner and select File Station from the subsequent menu).
- Click-and-drag files from your computer to your owl/web folder.
Files that you have uploaded to your_folder are publicly viewable: http://owl.fish.washington.edu/your_folder
You can use these URLs to link to your files from your notebook.
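A small helper can build the public URL for a file from its location inside your folder. This is a sketch: the base URL comes from this page, while the function name and the example folder and file names are hypothetical.

```python
from urllib.parse import quote

OWL_BASE_URL = "http://owl.fish.washington.edu"  # from the text above


def owl_url(folder, *path_parts):
    """Build the public URL for a file stored under owl/web/<folder>/."""
    parts = [folder, *path_parts]
    # quote() percent-encodes characters such as spaces that are unsafe in URLs.
    return OWL_BASE_URL + "/" + "/".join(quote(p.strip("/")) for p in parts)


if __name__ == "__main__":
    # Hypothetical folder and file names.
    print(owl_url("your_folder", "2017-01-01_results", "counts.csv"))
```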
All folders need to contain a readme file.
The readme files should be plain text (i.e. do not create/edit the file with a word processor like Microsoft Word or LibreOffice Writer) and should describe the contents of the folder. If there are directories in the same folder as your readme file, the directory names should be listed and a brief description of their contents should be provided.
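One way to find folders that still need a readme is to walk a directory tree and report every folder without one. This is a minimal sketch; it assumes any file whose name starts with "readme" (in any letter case) counts, which is a looser rule than the lab may intend.

```python
import os


def folders_missing_readme(root):
    """Return every directory under root (root included) with no readme file."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Assumption: any file name starting with "readme", ignoring case, counts.
        if not any(name.lower().startswith("readme") for name in filenames):
            missing.append(dirpath)
    return sorted(missing)
```

Calling `folders_missing_readme("path/to/your_folder")` on a local copy of your folder returns the list of directories that still need a readme.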
Please refrain from using any non-alphanumeric characters (including spaces) in file and folder names.
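Names can be checked before upload with a short pattern test. The sketch below assumes that underscores, hyphens, and periods are acceptable, since names like your_folder, YYYY-MM-DD, and checksums.md5 appear on this page; tighten the pattern if the rule is meant more strictly.

```python
import re

# Assumption: underscores, hyphens, and periods are tolerated because names such
# as your_folder, YYYY-MM-DD, and checksums.md5 appear on this page.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_clean_name(name):
    """True when a file or folder name has no spaces or other special characters."""
    return ALLOWED_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("checksums.md5", "my file.txt", "data(1).csv"):
        print(candidate, is_clean_name(candidate))
```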
Raw Data
- As the sequencing facility provides data, files are downloaded to our local NAS (owl), into the correct species subdirectory within nightingales: http://owl.fish.washington.edu/nightingales/
- MD5 checksums are generated and compared to those supplied by the sequencing facility.
- Append the generated MD5 checksums to the checksums.md5 file. If that file does not yet exist, create it and add the generated checksums to the new checksums.md5 file.
- The Nightingales Google Spreadsheet is updated.
- Each library (i.e. each sample with a unique sequencing barcode) is entered in its own row.
- SeqID is the base name of the sequencing file (i.e. no file extensions like ".fq.gz" or ".gz").
- Each library receives a unique, incremented Library_ID number.
- Each library receives a Library_name; this may or may not be unique.
- Update the Nightingales Google Fusion Table with new information from the Nightingales Google Spreadsheet. This is accomplished by:
- Deleting all rows in the Nightingales Google Fusion Table (Edit > Delete all rows)
- Importing data from the Nightingales Google Spreadsheet (File > Import more rows...)
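The checksum steps in the list above can be sketched as follows. This is one possible approach, not necessarily how the lab does it: the function names are hypothetical, and the line layout written to checksums.md5 is assumed to match what the `md5sum` tool produces (hash, two spaces, file name).

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_record(path, expected_md5, checksum_file="checksums.md5"):
    """Compare a file's MD5 with the facility's value; append it on a match."""
    actual = md5_of_file(path)
    if actual != expected_md5.strip().lower():
        return False
    # Append mode creates checksums.md5 if it does not exist yet.
    with open(checksum_file, "a") as out:
        # Assumed layout: "<hash>  <file name>", as written by md5sum.
        out.write(f"{actual}  {os.path.basename(path)}\n")
    return True
```

A return value of `False` means the downloaded file does not match the facility's checksum and should be downloaded again before anything is recorded.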
Backup
- The Nightingales Google Spreadsheet is backed up on a regular basis (?) by downloading it as a tab-delimited file and pushing it to the LabDocs Repository with the file name Nightingales.tsv
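Before pushing the export, it may be worth checking it against the spreadsheet rules described earlier. The sketch below assumes the tab-delimited file has a header row containing the column names SeqID and Library_ID exactly as written on this page; the function name and the list of extensions are illustrative.

```python
import csv
import io


def check_nightingales(tsv_text):
    """Return a list of problems found in a tab-delimited export."""
    problems = []
    seen_ids = set()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        library_id = row["Library_ID"]
        if library_id in seen_ids:
            problems.append(f"row {row_number}: duplicate Library_ID {library_id}")
        seen_ids.add(library_id)
        # SeqID should be a base name; these extensions are an assumed shortlist.
        if row["SeqID"].endswith((".gz", ".fq", ".fastq")):
            problems.append(f"row {row_number}: SeqID keeps a file extension")
    return problems
```

For a file on disk, pass in the result of `open("Nightingales.tsv", newline="").read()`; an empty list means no problems were found.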
SRA Upload ...
Raw Data
- As the sequencing facility provides data, files are downloaded to our local NAS (owl), in the root phainopepla directory: http://owl.fish.washington.edu/phainopepla/ These data are organized by species, then by mass spectrometer run date (e.g. YYYY-MM-DD). For each run date, all RAW files (including blank, sample, and QC files) should be included in the directory with their original names. Inside the YYYY-MM-DD directory there should be a readme file with the following information: a description of each file (e.g. treatment, blank, etc.), the experimental design, and a link to more information.
- The Spreadsheet is then updated. Each "mass spectrometer run date" will be a new row in the sheet.
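The species-then-run-date layout described above can be checked with a small date validator. This is a sketch under assumptions: the function names are hypothetical, the species directory name is a placeholder, and the path is written relative to the phainopepla root.

```python
from datetime import datetime


def is_run_date(name):
    """True when a directory name is a real calendar date in YYYY-MM-DD form."""
    try:
        parsed = datetime.strptime(name, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts "2017-3-9", so require the zero-padded spelling.
    return parsed.strftime("%Y-%m-%d") == name


def run_directory(species, run_date):
    """Build the species/run-date path under the phainopepla root."""
    if not is_run_date(run_date):
        raise ValueError(f"run date must be YYYY-MM-DD, got {run_date!r}")
    return f"phainopepla/{species}/{run_date}"
```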
- Before histology cassettes are sent off for processing, fill out the Histology-databank with all relevant information at the organism level except for image-location. Reserve space for blocks and slides by adding block-location and slide-location information.
- After histology blocks are returned, image each organism's tissue. Use the following convention for saving images: FULLTIMESTAMP-[unique-organism-ID]-[magnification].jpeg
- Add image-location information to the spreadsheet
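The image naming convention above can be generated rather than typed by hand. This page does not define the exact FULLTIMESTAMP format, so the sketch below assumes YYYYMMDDTHHMMSS; the function name, organism ID, and magnification label are likewise hypothetical.

```python
from datetime import datetime


def histology_image_name(organism_id, magnification, taken_at):
    """Build FULLTIMESTAMP-[unique-organism-ID]-[magnification].jpeg."""
    # Assumption: FULLTIMESTAMP is rendered as YYYYMMDDTHHMMSS; the page does
    # not define its exact format.
    stamp = taken_at.strftime("%Y%m%dT%H%M%S")
    return f"{stamp}-{organism_id}-{magnification}.jpeg"


if __name__ == "__main__":
    # Hypothetical organism ID and magnification label.
    print(histology_image_name("org042", "10x", datetime.now()))
```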
The goal of data archiving is to make your research easily understandable and reproducible in the future. It is therefore incumbent upon the researcher that, by the end of a project, care and effort are given to providing a highly organized and traceable accounting of the research that is archived in perpetuity. At a minimum, this archive should include: raw and fully processed data, complete metadata, all computer code, and any research products (manuscripts, published articles, figures, etc.). You will find that creating a usable data archive is much easier to do as you go, rather than waiting until the end of your project!
- GitHub -> Zenodo.
- Figshare
- UW ResearchWorks
- Open Science Framework
Finally, data are most usable when they are as flexible as possible. An Excel spreadsheet with different information on different tabs is not very flexible. It is much better to have a text file with the data in “long form”: rather than having a ton of columns, have a ton of rows.
See: Broman KW, Woo KH. (2017) Data organization in spreadsheets. PeerJ Preprints 5:e3183v1 https://doi.org/10.7287/peerj.preprints.3183v
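To make the "long form" idea concrete, here is a minimal sketch that reshapes a wide table (one column per measurement) into a long one (one row per measurement). The sample names, column names, and values are made up for illustration.

```python
def wide_to_long(rows, id_column):
    """Turn one-column-per-measurement rows into one-row-per-measurement rows."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append(
                    {id_column: row[id_column], "variable": column, "value": value}
                )
    return long_rows


# Made-up example data in wide form: many columns, few rows.
wide = [
    {"sample": "s1", "length_mm": 12, "weight_g": 3},
    {"sample": "s2", "length_mm": 15, "weight_g": 4},
]

if __name__ == "__main__":
    for long_row in wide_to_long(wide, "sample"):
        print(long_row)
```

If pandas is available, `DataFrame.melt` performs the same reshaping on a data frame.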