Many funding bodies, including RCUK funders, the Wellcome Trust and a number of other charities, now require a Data Sharing or Data Management Plan as part of grant proposals. Even if your funder does not require one, a data management plan is a useful starting point for a checklist of what data you will generate (types, volumes), how and where you will store it (and for how long), and how you will share and disseminate it (to whom, which parts, and when). The plan can help you to write an effective impact statement, as it prompts you to consider all the ways in which you will disseminate the results of your research. It can also help with grant costing, since it makes you consider the cost of storing your data for as long as the funder expects after the grant ends, of building infrastructure for managing and sharing your data, and of the time required to prepare and submit it to suitable repositories.

Different funders, different requirements, same principles

Although funders may have different practical requirements for data sharing/management plans (mandatory or optional, free-text pages or a form), they all ask the same basic questions. These aim to ensure that the data arising from your research project are stored appropriately and securely, and are disseminated effectively in the public domain in standard formats, together with sufficient supporting information (metadata) to make them reusable.

A checklist table showing the basic requirements of different common College funders is available from the Digital Curation Centre. All RCUK funders, the Wellcome Trust and a number of others offer specific guidance for their proposals, and some (e.g. MRC) provide a template to complete. The JISC-funded Digital Curation Centre also offers an online tool with standardised templates for all major funders, pre-loaded with the questions to address for each funder. It now also has customised versions for Imperial College that offer some pre-completed general policy fields. We are very familiar with DMP-Online forms and can help you to complete them for projects generating biological or biomedical data and software.

The College has recently published a great deal of information on research data management and data sharing plans on the College website, and we refer you there for general information.


Services we offer

We can help you to prepare individually tailored data management plans for your grant proposals. To do this effectively, we will need to see your case for support. When requesting help with data plans, please send the following information:

  • Funder
  • Project duration and initial start date
  • Submission deadline
  • Your internal submission deadlines
  • A draft case for support so we have an idea of the project as a whole and can glean the types of data that you will generate, and projected volumes
  • A draft data management plan if you have started to complete one

Once we have heard from you, we will be able to help with:

  • data volume calculations based on your experimental design and types
  • suggestions of suitable standard data formats for sharing, storage and publication, depending on your experimental types
  • suggestions of suitable metadata standards (and later on, how to use and adhere to them) for your experimental types
  • suggestions of suitable public repositories for submission of your data (and associated costs for preparing and submitting your data if required)
  • costs associated with storage of your active data during the project lifetime (based on volume), and later costs for longer-term storage and/or archiving if required
  • general information on hardware security for our systems (physical security, redundancy, back-up policies etc.)
  • suggestions (and associated costs if required) for additional mechanisms for data dissemination, e.g. hosting a project-based website with dynamic data searching and/or visualisation
  • suggestions and costings for project-based methods for organising, searching and sharing complex project data internally, or later externally, e.g. a project database (if required). This will also include specific security information for that system’s design, e.g. different authority levels, passwords, encryption etc.
  • We can also advise on other College resources that may be useful for your project.
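
The data volume and storage cost estimates mentioned in the list above can be sketched roughly as follows. This is a minimal illustration only: the sample counts, per-sample sizes, the derived-data factor, the number of stored copies and the cost per TB per year are all hypothetical placeholders, and the class and function names are invented for this example. Real figures depend on your instruments, experimental design and the storage you use.

```python
from dataclasses import dataclass


@dataclass
class Assay:
    """One experimental type in the project (all figures are illustrative)."""
    name: str
    samples: int            # number of samples or runs planned
    gb_per_sample: float    # expected raw data per sample, in GB
    derived_factor: float   # processed/derived data as a multiple of raw size


def total_volume_gb(assays, copies=2):
    """Raw plus derived data for every assay, multiplied by the number of stored copies."""
    per_copy = sum(a.samples * a.gb_per_sample * (1 + a.derived_factor) for a in assays)
    return per_copy * copies


def storage_cost(volume_gb, years, cost_per_tb_year):
    """Cost of holding a fixed volume for a number of years (1 TB taken as 1000 GB)."""
    return volume_gb / 1000 * years * cost_per_tb_year


# Hypothetical project: 48 RNA-Seq samples and 200 confocal image stacks.
plan = [
    Assay("RNA-Seq", samples=48, gb_per_sample=5.0, derived_factor=0.5),
    Assay("Confocal imaging", samples=200, gb_per_sample=2.0, derived_factor=0.25),
]

volume = total_volume_gb(plan, copies=2)
cost = storage_cost(volume, years=3, cost_per_tb_year=100.0)
print(f"Estimated volume: {volume:.0f} GB, 3-year storage cost: {cost:.2f}")
```

Keeping the estimate as a small script makes it easy to rerun when the experimental design changes while the proposal is being drafted.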

 

FAIR principles

Data arising from research projects should adhere to FAIR principles. It should be:

  • Findable
  • Accessible
  • Interoperable
  • Reusable/Reproducible

Your data sharing plan should help you to achieve this by ensuring that your data are appropriately annotated with the necessary metadata, stored in standard formats and submitted (where appropriate) to a public repository in a timely fashion. You can find more information in the FAIR Data Guiding Principles from the FAIR Data Publishing Group.

A number of technical areas in a data management plan require specific input: the types and volumes of data that you intend to generate, the formats in which you will store them, the metadata standards you will follow to ensure that your data are understandable and fully reusable, and the public repositories you may use to disseminate your data. These areas tend to be highly discipline-specific. The life sciences and biomedical areas are particularly rich in public repositories, data standards and recognised data formats. If you are not sure which ones are appropriate for your datasets, we can help.

Officially, information on standards, databases and ontologies for the life sciences is collated and regularly updated under the Biosharing banner, which draws on several sources including the Nucleic Acids Research Database issue and the MIBBI set of common data standards. You may find this a useful (if somewhat dense) resource. It can appear complex at first sight, as it also lists deprecated standards and currently seems geared more towards developers than users.

Some of the most commonly used databases, formats and standards for particular experimental types are outlined below. This is just an example selection so if you need help with your specific projects, please get in touch. 

 

Top tips

Write early - When writing a grant, it is usually faster to start filling in the data management plan as you write the case for support, as both documents need much of the same information about the data you expect to generate. Leaving the plan to the last minute tends to lead to missed opportunities: costs for data management, storage, archiving, curation and submission get forgotten, and impact statements make less of how you will disseminate your data and results.

Software is ‘data’ too - In some cases your project will not generate new data as such, for instance a Wellcome Trust Bio-resources or a BBSRC Tools and Resources Development Fund project that produces new infrastructure and/or software. In these cases you will still need to explain your plans for storing any third-party data and keeping them secure, for instance input files submitted to a web server by users. New software and databases produced as output from a project should be referenced in the data management plan, together with information on their storage and distribution (‘Software as Data’). More information on ways of sharing and publishing software and models is included in a separate section.

What are ‘metadata’ anyway? - In this context, metadata are the additional structured information about your dataset that explains what the dataset ‘is’ and allows it to be understood and reused by others. Metadata are context-specific and can be minimal or very rich, depending on what is required for that dataset. For instance, they may contain information about:

  • the bio-specimen from which a sample is generated (e.g. species, taxonomy, gender, age, tissue, cell-type and growth conditions)
  • the experimental protocol used to extract a sample on which to work (e.g. standard operating procedure used, chemical vendor/batch, conditions)
  • the experimental design of an assay
  • auto-generated information from the instrumentation used to make a measurement or generate a dataset (e.g. vendor, model, version, software version), perhaps with manually added experimental parameters or conditions for the assay
  • analysis methods - quality assurance methods used, normalisation methods, software used (versions, parameters)
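
As a rough illustration of the kinds of metadata listed above, the sketch below stores them as a structured record and checks it against a small checklist. The section and field names, the checklist and every value in the example record are hypothetical; a real minimum metadata standard defines its own required fields and file format.

```python
import json

# Hypothetical minimal checklist; a real standard (e.g. MIAME) defines its own fields.
REQUIRED_FIELDS = {
    "specimen": ["species", "tissue"],
    "protocol": ["sop", "conditions"],
    "instrument": ["vendor", "model", "software_version"],
    "analysis": ["software", "normalisation"],
}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return 'section.field' names that are absent or empty in the record."""
    missing = []
    for section, fields in required.items():
        values = record.get(section, {})
        for field in fields:
            if not values.get(field):
                missing.append(f"{section}.{field}")
    return missing


# Illustrative record for one sample; every value here is made up.
record = {
    "specimen": {"species": "Mus musculus", "tissue": "liver", "age_weeks": 8},
    "protocol": {"sop": "SOP-012", "conditions": "4 C, 30 min"},
    "instrument": {"vendor": "ExampleVendor", "model": "X100"},
    "analysis": {"software": "ExampleTool 1.2", "normalisation": "quantile"},
}

print(json.dumps(record, indent=2))
print("Missing:", missing_fields(record))
```

Recording metadata in a structured form from the start of a project, rather than reconstructing it at submission time, makes this kind of completeness check straightforward.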

Metadata standards

Commonly used minimum metadata standards
Type of experiment/dataset | Standard name | Acronym | More information
Microarray | Minimum information (MI) about a microarray experiment | MIAME | MIAME
Proteomics | MI about a proteomics experiment | MIAPE-* (e.g. MIAPE-MS, MIAPE-Quant) | Different MIAPE extensions for different proteomics methods; see full listing
Metabolomics | Core Information for Metabolomics Reporting | CIMR | CIMR
RNAi | MI about an RNAi experiment | MIARE | MIARE
Genome/metagenome | MI about a Genome (or metagenome) experiment | MIGS/MIMS | MIGS/MIMS
Generic NGS including RNA-Seq, ChIP-Seq | MI about a high throughput SEQuencing Experiment | MINSEQE | MINSEQE
Glycomics array | MI required for a glycomics experiment | MIRAGE | MIRAGE
Simulation | MI about a simulation experiment | MIASE | MIASE
Bio-Model | MI Required In the Annotation of Models | MIRIAM | MIRIAM

Minimum metadata reporting standards aim to list the most important metadata fields needed to accompany a dataset of a certain experimental type in order to make it understandable and reusable. In many cases, your dataset will need to be compliant with an appropriate data standard before it can be submitted to a public repository (that is, a public database). One of the oldest established standards is MIAME (Minimum Information About a Microarray Experiment), the standard for microarray data.

There are currently over 80 minimum metadata standards for different types of biological data, and some are more mature and stable than others. The good news is that a relatively small number of them cover most of the common biological experiment types.

File formats

A large number of file formats are in common use for biological data, and some are more stable than others. In general, you should store data in open standard file formats rather than proprietary ones (i.e. those from commercial vendors), since proprietary formats may only be readable with specific versions of commercial software, which may not be available in the longer term. Public biological data repositories will generally only accept data submitted in specific data formats. A few common open file formats are shown below, together with the type of experiment they originate from. Example file formats are also covered in more detail in our help pages.

Selected primary data formats for data sharing/submission
Experiment type | Description | Filename | Type/use
Microarray | Affymetrix | cel | Tab-delimited text
Microarray | Other microarray data formats | mev, Stanford | Can contain data from single or many chips. Tab-delimited text, but with different column orders and degrees of commenting
Microarray | Simple Omnibus Format in Text | SOFT | GEO microarray data exchange format; line-based plain text
Next generation sequencing | Binary alignment | BAM | Compressed (binary) version of SAM
Next generation sequencing | Sequence alignment/map | SAM | Created by alignment programs
Next generation sequencing | Defining annotation lines on a reference sequence | BED | For visualising annotations in genome browser
Next generation sequencing | ‘wiggle’ format for continuous-valued data in a track format, also binary compressed version (BigWIG) | WIG, BIGWIG | e.g. visualisation of GC percent, probability scores, and transcriptome data on genome sequence
Next generation sequencing | Contains sequence and quality scores | FASTQ | Fasta format sequence and quality data
Next generation sequencing | Variant calling format (variant positions in genome) | VCF/BCF | Text (VCF) or binary (BCF)
Next generation sequencing | Reference-based compression | CRAM | Tuneable binary format for multiple sequences
Next generation sequencing | General feature format | GFF | Placing features on a genome (reference) sequence
Medical imaging | Open file format for medical imaging | DICOM |
Confocal microscopy | Tagged image file format (Generic) | TIFF | Information not changed when format created
Confocal microscopy | Joint Photographic Experts Group image format | JPEG | Uses lossy image compression; different compression ratios available
Confocal microscopy | Multipage TIFF with OME XML data block | OME-TIFF | Encodes additional metadata
Confocal microscopy | Proprietary image formats containing microscope-specific metadata | Zeiss LSM, Leica LEI | Instrument or software-specific
Super-resolution microscopy | Tagged spot file format | tsf | Binary format for methods that generate images by locating the position of single fluorescent emitters
Metabolomics - Mass Spectroscopy | Network Common Data Format | netCDF | Machine independent array-oriented binary data format
Metabolomics - Mass Spectroscopy | MS and MS/MS proteomics data | mznld | Open data format for storage and exchange of mass spectroscopy data
Metabolomics - Mass Spectroscopy | Proprietary examples: Thermo, Bruker, ABI/Agilent | RAW, Baf, wiff |
Metabolomics - NMR | Self-defining Text Archival and Retrieval format | NMR-STAR | Chemical shift file
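
To show why open, text-based formats are easy to work with, here is a minimal sketch of reading FASTQ, the sequence-plus-quality format listed in the table. It assumes the common layout of four lines per record with Phred+33 quality encoding and no line-wrapped sequences; the function name and the two example reads are made up. For real data an established library is a safer choice than a hand-written parser.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line beginning with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        # Phred+33: each quality character encodes a score as ord(char) - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Made-up two-record example held in memory instead of a file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
for name, seq, scores in records:
    print(name, seq, scores)
```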
 

 

Public bio-data repositories and databases

The NAR online Molecular Biology Database Collection currently lists more than 1550 different databases.

Some are organism (or even gene-) specific, some contain secondary data – mined from literature and curated, and there are repositories/databases available for data arising from genomics (and metagenomics), transcriptomics, proteomics, metabolomics, protein structure and imaging studies. There are also public repositories for some types of bio-models. If you are not sure which repository/database is right for your data, or would like help in preparing your datasets for submission, please get in touch.

Commonly used public bio-data repositories
Repository | Primary use | Home
European Nucleotide Archive (ENA) | DNA sequence with/without annotation | https://www.ebi.ac.uk/ena, http://www.ebi.ac.uk/ena/submit
Short Read Archive (SRA, part of ENA) | NGS raw data (reads) | http://www.ebi.ac.uk/ena/submit
ArrayExpress | Transcriptomics, array based and RNA-Seq | https://www.ebi.ac.uk/arrayexpress/
GEO | Transcriptomics, array based and RNA-Seq | http://www.ncbi.nlm.nih.gov/geo/
UNIPROT | Protein sequence with annotation | http://www.uniprot.org/
European Genome Phenome Archive (EGA) | Genomic studies where access to datasets is controlled by an access committee | https://www.ebi.ac.uk/ega/home
dbSNP | Small genetic variations (SNP) | http://www.ncbi.nlm.nih.gov/snp
PRIDE | Proteomics data | http://www.ebi.ac.uk/pride/
MetaboLights | Metabolomics and related data | http://www.ebi.ac.uk/metabolights/
BioModels | Computational models of biological processes | https://www.ebi.ac.uk/biomodels-main/

A word on DOIs - Some types of bio-data still have no established public repositories (e.g. kinetic data). These datasets can still be published with a persistent identifier (DOI) on a more general repository site, in addition to embedding them in the supplementary materials for a publication, which may be persistent but are often difficult to search. A DOI gives a dataset a stable identifier, just as it more commonly does for a publication. The College has recently set up a central facility to mint DOIs for datasets as part of its Open Access Policy, but this is still relatively new. If you would like to explore how to use this for your biological datasets, please get in touch, as we may be able to help.
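
A DOI becomes a stable web link when it is appended to the https://doi.org/ resolver address. The sketch below normalises a DOI written in a few common ways and builds that link. The function name, the loose syntax check and the example DOI are hypothetical, and the check only tests the general shape of a DOI, not whether it has actually been registered.

```python
import re

# Loose shape check for modern DOIs: "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Normalise a DOI string and return its https://doi.org/ resolver URL."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"Not a recognisable DOI: {doi!r}")
    return f"https://doi.org/{cleaned}"


# Hypothetical dataset DOI, used only to show the formatting.
print(doi_to_url("doi:10.1234/example.dataset.1"))
```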

Other relevant Central College resources - The College version of Symplectic has recently been extended to allow you to track your dataset publications, and it can be linked to your ORCID identifier to assist with unambiguous searching and identification. If you are not already familiar with ORCID and the updates to Symplectic, we recommend that you check out the web pages on ORCID.

The College offers help pages on using a number of generalised repositories, including Figshare and Zenodo, for bio-datasets that are not suitable for an established biological data repository.