Research data management

Good research data management is key to making research reproducible and accessible to others. But where to start? On this page we give you an overview of the various tasks typically involved in research data management, and we have gathered resources where you can get further help (Research data management resources) and training (Research data management training).

[Image: the research data lifecycle. Icons from Streamlinehq, CC BY 4.0 Licence.]

The research data lifecycle

The research data lifecycle gives an overview of the different stages you go through with your data during a research project: plan, collect, process & analyse, preserve & share, and find & reuse. Each stage is associated with different tasks. Below we describe some typical data management tasks for each phase:

Phases of the research data lifecycle

    • Writing a data management plan: The data management plan is your roadmap to ensuring good data management during each phase of the data lifecycle. Data management plans can come in several formats, but should explain how your data meet the FAIR principles: data should be Findable, Accessible, Interoperable and Reusable.
    • Reviewing your data management plan: Just like a project proposal, it is wise to have someone else review your data management plan. Several institutions offer data management plan reviews; check the resources table to see if this is available at your institution.
    • Estimating costs and resources needed: The data management plan will help you estimate the costs and resources needed for your data management, such as costs for storing data in a repository. Some organisations, for example the Swiss National Science Foundation and the European Research Council, offer funds for certain research data management costs.
    • Organising files and folders: Creating a good organisational structure for your project data and outputs is important. This includes creating a naming convention for your files (see the naming sketch after this list). If you do a lot of lab work, consider using an Electronic Lab Notebook, which simplifies many organisation and storage tasks. Check the resources section to see if your institution offers support for an Electronic Lab Notebook.
    • Creating metadata: Metadata are data that describe other data; they help other researchers understand your data. Metadata can take many forms depending on your project, ranging from a README file accompanying your data to metadata generated automatically for each data file (see the metadata sketch after this list). If possible, adhere to a metadata standard for your field.
    • Storing and backing up: Your data should be safely stored and backed up, ideally following the 3-2-1 rule: 3 copies of the data, on 2 different media, with 1 copy offsite. If you are working with sensitive data (such as health data), it is particularly important to consider whether your storage solution offers the right data protection; this may not be the case for commercial offerings such as Google Drive.
    • Documenting your workflow: All processing and analysis of data should be clearly explained (typically in a README file). It is best to avoid manual processing of data, and instead to run all processing from a script (e.g. in R or Python), as in the workflow sketch after this list. This allows others (and yourself!) to reproduce your workflow.
    • Versioning: Versioning allows you to track the history of your files in detail and to go back to an earlier version of a file if required. Versioning can be done manually by saving files with a unique timestamp or version number (e.g. "v.1", "v.2"). There is also software that does this automatically for you, such as Git, which underpins platforms like GitLab and GitHub. Check the resources table to see if your institution runs a GitLab instance.
    • Reviewing: Regularly review your data, code and documentation to ensure they are up to date. You can do this yourself by attempting to reproduce your own findings, starting from the raw data and following each step of your analysis. This is an important sanity check that your results do not arise from peculiarities of your historical workflow (e.g. the particular order in which code segments were run while exploring the data). If you can, have someone else review your documentation and code.
    • Selecting data to preserve: Not all data generated during a project may be important to preserve for the long term. Furthermore, you must ensure that you own the data or have obtained the necessary rights to share it. Sensitive data (e.g. containing personal information) usually has to be anonymised before it is shared. If you are unsure whether your data needs to be anonymised, contact the Data Protection Officer or legal representative at your institution.
    • Ensuring sustainable file formats: If necessary, convert your files to open formats that are recommended for long-term storage, such as csv, txt and tif instead of xlsx, doc and jpg (see the conversion sketch after this list).
    • Identifying a suitable repository: There are many data repositories available. When choosing one, you might want to consider: funding requirements, file size limits, access restrictions, costs, and whether it is widely used in your discipline and/or at your institution. Institution-run repositories are available for Eawag, WSL and PSI. Storing your data in a repository instead of in supplementary files hosted with a journal ensures long-term preservation and usually offers more flexibility in allowed file types and sizes.
    • Finding datasets: Search for datasets in a subject-appropriate repository (e.g. the Protein Data Bank or Materials Cloud), or use a general dataset search tool such as DataCite Commons (see the search sketch after this list).
    • Understanding the usage rights of data: Make sure that the data is shared under conditions that allow you to use it in the way you intend. This information should be available in the dataset's licence. The DMLawTool is a great resource for understanding what rights different data licences give you. Any datasets you reuse should be cited with reference to the dataset itself (as opposed to the accompanying article).
    • Validating and cleaning data: Data may be of varying quality. It is important to validate the data to ensure it meets the quality you require for your research project and, if necessary, to clean it (see the validation sketch after this list). Of course, any modification of the data should be documented, just as if it were your own data.
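
To make some of these tasks concrete, below are a few short sketches in Python. First, the naming sketch: encoding a file naming convention in a small helper keeps names consistent across a project. The project name, experiment label and version scheme here are hypothetical placeholders.

    from datetime import date

    def data_filename(project: str, experiment: str, version: int, ext: str) -> str:
        """Build a file name following a (hypothetical) convention:
        <project>_<experiment>_<YYYYMMDD>_v<NN>.<ext>"""
        return f"{project}_{experiment}_{date.today():%Y%m%d}_v{version:02d}.{ext}"

    print(data_filename("lakemonitoring", "temp-profiles", 1, "csv"))
    # e.g. "lakemonitoring_temp-profiles_20240101_v01.csv" (date will vary)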
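
The metadata sketch: a minimal machine-readable description of a dataset, written to a JSON file. The fields loosely resemble common elements of metadata standards such as DataCite, and all values are made up; where your field has an established standard, use that instead.

    import json

    # Hypothetical descriptive metadata for a dataset.
    metadata = {
        "title": "Lake temperature profiles 2023",
        "creators": ["Doe, Jane"],
        "description": "Hourly temperature profiles from lake monitoring buoys.",
        "keywords": ["limnology", "temperature", "time series"],
        "licence": "CC-BY-4.0",
        "date_collected": "2023-01-01/2023-12-31",
    }

    with open("metadata.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)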
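
The workflow sketch: a single script that takes the raw data to the processed data, so the derived file can always be regenerated by rerunning the script. File names, column names and the processing steps are hypothetical; the pandas library is assumed.

    import pandas as pd

    RAW = "data/raw/measurements.csv"                    # raw data, never edited by hand
    PROCESSED = "data/processed/measurements_clean.csv"  # regenerated by this script

    def process(raw_path: str, out_path: str) -> None:
        """Read the raw data, apply all processing steps, write the derived file."""
        df = pd.read_csv(raw_path)
        df = df.dropna(subset=["temperature_f"])                   # drop incomplete records
        df["temperature_c"] = (df["temperature_f"] - 32) * 5 / 9   # unit conversion
        df.to_csv(out_path, index=False)

    if __name__ == "__main__":
        process(RAW, PROCESSED)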
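
The conversion sketch: moving a proprietary spreadsheet to an open format with pandas. The file names are hypothetical, and reading .xlsx files additionally requires the openpyxl package.

    import pandas as pd

    # Convert a proprietary spreadsheet to an open, long-term format.
    df = pd.read_excel("results.xlsx")       # needs openpyxl installed for .xlsx
    df.to_csv("results.csv", index=False)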
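
The search sketch: DataCite can also be queried programmatically through its public REST API (api.datacite.org), which backs DataCite Commons. The search term is just an example, and the response fields shown assume DataCite's JSON:API response format; check the DataCite documentation for the details.

    import requests

    resp = requests.get(
        "https://api.datacite.org/dois",
        params={"query": "lake temperature", "page[size]": 5},
        timeout=30,
    )
    resp.raise_for_status()

    # Each record carries its DOI as the id, plus descriptive attributes.
    for record in resp.json()["data"]:
        titles = record["attributes"].get("titles") or [{}]
        print(record["id"], "-", titles[0].get("title", "(no title)"))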
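
Finally, the validation sketch: basic quality checks on a reused dataset before analysis, with the cleaning step recorded in a log file so the modifications stay documented. Column names and plausibility ranges are hypothetical.

    import pandas as pd

    df = pd.read_csv("external_dataset.csv")  # a hypothetical reused dataset

    # Validate: basic sanity checks before using the data.
    assert not df["sample_id"].duplicated().any(), "duplicate sample IDs found"
    invalid = (df["ph"] < 0) | (df["ph"] > 14)
    print(f"{invalid.sum()} rows with impossible pH values")

    # Clean: remove invalid rows and document what was done.
    cleaned = df[~invalid].copy()
    cleaned.to_csv("external_dataset_clean.csv", index=False)
    with open("CLEANING_LOG.txt", "a", encoding="utf-8") as log:
        log.write(f"Removed {invalid.sum()} rows with pH outside 0-14.\n")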