The wonderful world of OpenScience

Séminaire Scientifique et Technique de l’UR PROSE

Cédric Midoux

PROSE

June 30, 2023

The fourth paradigm of science

Schleder et al. (2019)

The data deluge

Innovation’s march

In the past ...

Science today

The ravages of time

Reproducibility?

Baker (2016)

Reproducibility?

Legend

Classification
- BC Paper where the results are backed by code.
- NC Paper excluded due to results not being backed by code.
- HW Paper excluded due to replication requiring special hardware.
- EX Paper excluded due to overlapping author lists.
Code Location
- Article Code is found from link in the article itself.
- Web Code is found from a web search.
- EMyes Code is provided by author after email request.
- EMno Author responds that the code cannot be provided.
- EMØ Author does not respond to email request within 2 months.
Build Results
- OK≤30 We succeed in building the system in ≤30 minutes.
- OK>30 We succeed in building the system in >30 minutes.
- OK>Author We fail to build, but the author says the code builds with reasonable effort.
- Fails We fail to build, and the author doesn’t respond to survey or says code may have problems building.

Threats to reproducible science

A data management horror story

State of play …

Data deluge
Reproducibility crisis
Ethics crisis
- P-hacking
- Publish or Perish
Scientific-political crisis
- Research funding
- Private research
- Academic publishing company

“UNESCO Recommendation on Open Science” (2021)

According to the UNESCO Recommendation, open science is a set of principles and practices that aim to make scientific research from all fields accessible to everyone for the benefit of scientists and society as a whole. The Recommendation aims to ensure not only that scientific knowledge is accessible but also that the production of that knowledge itself is inclusive, equitable and sustainable.

By promoting science that is more accessible, inclusive and transparent, open science furthers the right of everyone to share in scientific advancement and its benefits, as stated in Article 27.1 of the Universal Declaration of Human Rights.

FAIR Guiding Principles

Cost of not having FAIR research data

Following this approach, we found that the annual cost of not having FAIR research data costs the European economy at least €10.2bn every year

European Commission and Directorate-General for Research and Innovation (2019)

Research data

Factual records

Primary sources for scientific research

Necessary to validate research findings

This Recommendation principally concerns research data in a digital, computer-readable format.

Open Data 5★

OL★ : Open License
RE★ : machine REadable
OF★ : Open Format
URI★ : Uniform Resource Identifier
LD★ : Linked Data
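
The five levels are cumulative: each star presupposes the ones before it. A minimal sketch of that rule, assuming a dataset is described by five yes/no answers (the `DatasetDescription` class and its field names are illustrative, not part of any standard tool):

```python
from dataclasses import dataclass


@dataclass
class DatasetDescription:
    """Hypothetical yes/no description of a published dataset."""
    open_license: bool      # OL: available under an open license
    machine_readable: bool  # RE: structured, machine-readable data
    open_format: bool       # OF: non-proprietary format
    uses_uris: bool         # URI: items identified by URIs
    linked_data: bool       # LD: linked to other data


def star_rating(d: DatasetDescription) -> int:
    """Count stars in order, stopping at the first unmet criterion."""
    criteria = [d.open_license, d.machine_readable, d.open_format,
                d.uses_uris, d.linked_data]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# Illustrative case: a CSV file under an open license, without URIs.
csv_on_web = DatasetDescription(True, True, True, False, False)
print(star_rating(csv_on_web))  # 3
```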

Metadata

Data on data.

Date experiment was done
Time a measurement was made
Number of repeated measurements
Who conducted the experiment
Dosage of treatment
How many subjects were in study
How many subjects dropped out of study
Experimental design
…
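
One common way to keep this kind of information attached to the data is a small sidecar file saved next to the data file. A sketch, assuming a JSON sidecar; the field names and values are illustrative and do not come from any formal metadata standard:

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(data_file: Path, metadata: dict) -> Path:
    """Write metadata next to the data file as '<name>.meta.json'."""
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2, ensure_ascii=False),
                       encoding="utf-8")
    return sidecar


# Hypothetical values, for illustration only.
example_metadata = {
    "experiment_date": "2023-06-30",
    "measurement_time": "14:05:00",
    "repeated_measurements": 3,
    "operator": "A. Researcher",
    "treatment_dosage": {"value": 5, "unit": "mg/L"},
    "subjects_enrolled": 24,
    "subjects_dropped_out": 2,
    "experimental_design": "randomised block",
}

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "measurements.csv"
    data_file.write_text("sample,value\nS1,0.42\n", encoding="utf-8")
    sidecar = write_sidecar(data_file, example_metadata)
    # Read the sidecar back to check that nothing was lost.
    reloaded = json.loads(sidecar.read_text(encoding="utf-8"))
    print(sidecar.name, reloaded["repeated_measurements"])
```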

WHO? WHAT? WHEN? WHERE? HOW? WHY?

Electronic Lab Notebook

Keep track of your experiments and collaborate with your team easily!

Lab notebook for experiments
Use templates for your experiments
Add steps to your protocols
Draw doodles & attach documents
Management of schedules and reservations
Database for lab equipment, storage, …
Todolist
Legally timestamp your experiments

Live Demo

Controlled vocabulary

Communities need standards

In essence, a standard is an agreed way of doing something. A standard provides the requirements, specifications, guidelines or characteristics that can be used for the description, interoperability, citation, sharing, publication, or preservation of all kinds of digital objects such as data, code, algorithms, workflows, software, or papers.

Create your own metadata standards

Metadata standards - FAIRsharing

Metadata standards - MIxS

Open file format

Open File Formats are file formats that are published and freely available for anyone to use. A file format is a standard way of encoding storage of computer information. Open file formats can be contrasted with proprietary, protected file formats. Open file formats are often recommended for preservation purposes because they typically do not require special software to open.

	Open	Closed
Text	`txt`, `odf`, `rtf`	`doc`, `pages`
Images	`png`, `jpg`, `gif`, `svg`	`tiff`
Spreadsheets	`csv`, `ods`	`xls`
Archives	`tar`, `zip`	`rar`

Not all proprietary formats are closed. For example, Adobe’s .pdf format has become an ISO standard. Anyone can open a PDF file.

Open file format - The game

Personal data

Personal data is “any information relating to an identified or identifiable person”.

directly / indirectly
from data alone / from metadata / from cross-referencing data
patients / agents / customers / …
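
Indirect identification usually comes from combinations of harmless-looking attributes. A sketch that counts how many records share each combination of such attributes; a group of size 1 points to a single person. The records and column names are invented for illustration:

```python
from collections import Counter

# Invented records: no names, yet some combinations are unique.
records = [
    {"unit": "A", "birth_year": 1985, "postcode": "92160"},
    {"unit": "A", "birth_year": 1985, "postcode": "92160"},
    {"unit": "A", "birth_year": 1990, "postcode": "75013"},
    {"unit": "B", "birth_year": 1985, "postcode": "92160"},
]


def group_sizes(rows, attributes):
    """Count records per combination of the chosen attributes."""
    return Counter(tuple(row[a] for a in attributes) for row in rows)


def unique_records(rows, attributes):
    """Return records whose attribute combination appears only once."""
    sizes = group_sizes(rows, attributes)
    return [row for row in rows
            if sizes[tuple(row[a] for a in attributes)] == 1]


# One attribute alone singles out one record; three together single out two.
print(len(unique_records(records, ["birth_year"])))
print(len(unique_records(records, ["unit", "birth_year", "postcode"])))
```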

Take particular care with sensitive data!

Compliance procedures

Tidydata

Tidydata

Tidydata - Example
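
In the sense of Wickham (2014), a tidy table has one variable per column and one observation per row. A sketch of reshaping a "wide" table into that form using only the Python standard library; the sample names, column names and values are made up:

```python
# Wide layout: one column per time point (values are illustrative).
wide = [
    {"sample": "S1", "day_0": 1.2, "day_7": 3.4},
    {"sample": "S2", "day_0": 0.9, "day_7": 2.8},
]


def to_long(rows, id_column, variable_name, value_name):
    """Turn every non-identifier column into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows


tidy = to_long(wide, "sample", "timepoint", "measurement")
for row in tidy:
    print(row)
```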

Excel error plague

The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.
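
A simple screen for this problem looks for values in a gene-symbol column that resemble Excel dates or scientific-notation numbers. The patterns below are a rough heuristic, not the method used by Ziemann et al. (2016), and the example column is illustrative:

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Values such as "2-Sep" (what SEPT2 becomes) or "1-Mar" (MARCH1).
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE)
# Values such as "2.31E+13" (what an identifier like 2310009E13 becomes).
FLOAT_LIKE = re.compile(r"^\d+(?:\.\d+)?E\+?\d+$", re.IGNORECASE)


def suspicious(values):
    """Return the values that look like converted gene names."""
    flagged = []
    for value in values:
        text = value.strip()
        if DATE_LIKE.match(text) or FLOAT_LIKE.match(text):
            flagged.append(value)
    return flagged


column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "2.31E+13"]
print(suspicious(column))  # ['2-Sep', '1-Mar', '2.31E+13']
```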

Storing data - Challenges - the 6 V

Large and growing Volumes of data
Wide Variety of information
Velocity in data acquisition frequency
Guarantee of the Value and Veracity of the information
Take advantage of the Valorization (intellectual, scientific, social, economic, …) of data

Storing data - Recommendations

Plan how the data will be described, structured and organised
Always store data with metadata
As soon as possible, use a persistent identifier
Include the costs of data storage in the funding plan
Identify the person(s) responsible for the data.

Storing data

Organised working environment

Structuring folders and files in a tree structure
Use clear, coherent and shared naming conventions
Explicitly define and track versions of tools and databases
Control file-system permissions
Ensure file integrity (md5sum)
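
The integrity check can be sketched in a few lines: compute a checksum when the file is stored, keep it next to the data, and recompute it later. The file name and contents below are placeholders:

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not fill the memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "measurements.csv"
    data_file.write_bytes(b"sample,value\nS1,0.42\n")
    recorded = file_checksum(data_file)  # store this next to the data
    unchanged = file_checksum(data_file) == recorded
    # Simulate an accidental modification of the file.
    data_file.write_bytes(b"sample,value\nS1,0.43\n")
    modified = file_checksum(data_file) != recorded
    print(unchanged, modified)  # True True
```

MD5 is enough to detect accidental corruption; passing `algorithm="sha256"` is a reasonable choice when deliberate tampering is a concern.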

Backup

Cybersecurity - Passwords

Newsletter “Sécurité informatique” by Cédric Goby

Cybersecurity - Examples

Open Source Software

Access to code
Right to use, study, change and distribute it

Free ≠ Open source
Ensures the longevity of software
Avoids being captive to a development company

Beyond reproducibility: transparency in research

Explaining to justify and understand

Redo to check, correct and reuse

Obliges you to check your work (share data + code)
Your future self will thank you
And your colleagues too
By being reproducible, you strengthen your credibility and reputation
Reproducibility fosters confidence in the scientific process

You’re contributing to faster scientific progress

You don’t lose time …

For that you need to code a little …

What do we need to make research reproducible?

Data in some coherent format
Programming language (R, Python)
Text, figures and code in the same environment (literate programming)
Continuous and transparent editions and updates (version control)

Notebook

Unify in a single document:
- Context details
- Code
- Computations and results
- Interpretations

Ensures the consistency of analyses and improves traceability

Generates an exportable document (e.g. html) for improved portability and readability

Notebook - RMarkdown

Notebook - Jupyter Notebook

Tidyverse

Distributed version control (`git`)

Record changes made to a set of files
Track history and review any changes
Go back to earlier versions
Collaborative work on parallel features

It works with scripts and code, protocols and documentation, reports, any document!

What is a commit?
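
Conceptually, a commit is a snapshot of the files, a message, and a pointer to the previous commit. The sketch below is a toy model of that idea and not git's actual object format; `make_commit`, `history` and the file contents are hypothetical:

```python
import hashlib
import json


def make_commit(files, message, parent=None):
    """Toy commit: file contents, a message and the parent's id."""
    payload = json.dumps(
        {"files": files, "message": message, "parent": parent},
        sort_keys=True,
    )
    # The identifier is derived from the content, so any change gives a new id.
    commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    return {"id": commit_id, "files": dict(files),
            "message": message, "parent": parent}


def history(commit, store):
    """Walk back from a commit to the first one, newest first."""
    messages = []
    while commit is not None:
        messages.append(commit["message"])
        commit = store.get(commit["parent"])
    return messages


store = {}
first = make_commit({"analysis.R": "x <- 1\n"}, "Initial analysis")
store[first["id"]] = first
second = make_commit({"analysis.R": "x <- 2\n"}, "Fix parameter",
                     parent=first["id"])
store[second["id"]] = second

print(history(second, store))  # ['Fix parameter', 'Initial analysis']
print(store[second["parent"]]["files"])  # the earlier version is still there
```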

With a visual interface

Git branching

Online remote repositories (e.g. GitLab)

Sharing code with others
Contribute
Online backup server
Monitoring the project’s progress (Issues)
Run code (CI/CD)
Host website (Pages)
…

Why not keep it all?

Data storage has a human, financial and ecological cost.
What are the legal obligations for which data?
Technological obsolescence (media, formats, documentation)

What should we keep?

Why should I keep it?

For how long?

Where should I keep it?

And how?

The two major use cases and drivers for what to keep are Research Integrity and Reproducibility (availability of the data supporting the findings in research); and the Potential for Reuse (availability of data for sharing with other users).

Beagrie (2019)

Long-term archiving

Preservation of research unit archives

“Deuxième Plan national pour la science ouverte” (2021)

LPRN 2016 & Décret 2021-1572

[I.] When a scientific text results from a research activity at least half funded by State grants, (…) its author has, (…) the right to make freely available in an open format, by digital means, subject to the agreement of any co-authors, the final version of the manuscript accepted for publication, (…) upon expiry of a period running from the date of first publication. This period is at most six months for a publication in the fields of science, technology and medicine (…).

[II.] Where data resulting from a research activity at least half funded by State grants, (…) are not protected by a specific right or particular regulation and have been made public (…) their reuse is free.

[Art. 1] Scientific integrity is defined as the set of rules and values that must govern research activities in order to guarantee their honest and scientifically rigorous character.

[Art. 2] Public institutions and foundations recognised as being of public utility promote the dissemination of open access publications and the provision of the methods and protocols, data and source code associated with research results, in order to guarantee their traceability and reproducibility.

[Art. 6] They ensure that their staff implement data management plans and contribute to the infrastructures that enable the preservation, communication and reuse of data and source code.

Olivier et al. (2022)

Data repositories

Why use a repository?

Submit, share, re-use and archive data with FAIR principles
Link metadata
Provides a PID
Increases the visibility of your research
Obligations of funders / publishers

Disciplinary repository

Institutional repository

Recherche Data Gouv - Organization

Recherche Data Gouv - Content

Persistent identifier

Permanent identification
Identification and referencing
Interoperability
Aggregating scientific production and improving visibility
Distributed by a trusted organisation

For Object

Data and papers
DOI, ISBN, SWHID, …

For Contributors

People and organisations
ORCiD, idHAL, PID, …
French MESR will include ORCiD in agents records.

Persistent identifier

License

Without a licence, data is not truly open.

Allows users to be granted specific rights of use in advance
May include restrictions on use
It is necessary to use one in all cases to clearly display the associated rights

LPRN Guidelines

DataPaper

HAL

HAL INRAE is the open access repository, visible to everyone, for depositing and consulting scientific output.

Help promote open access to scientific and technical information
Make INRAE researchers’ results as accessible as possible
Increase the visibility of INRAE research

Reuse data

Find Datasets

By publications
By repositories
By DataPaper
By social networks
By visualization

Forging new collaborations

Citations

Essential for linking data to the scientific publications that use them
Always cite the datasets used and their version
DOI Citation Formatter
DataCite
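
A dataset citation can be assembled from a handful of fields. The sketch below loosely follows the creator, year, title, version, publisher, identifier pattern commonly recommended for data citation; the function name, the dataset and the DOI are hypothetical:

```python
def format_dataset_citation(creators, year, title, publisher, doi,
                            version=None):
    """Build a plain-text citation; the version is included when known."""
    parts = [f"{'; '.join(creators)} ({year})", title]
    if version:
        parts.append(f"Version {version}")
    parts.append(publisher)
    return ". ".join(parts) + f". https://doi.org/{doi}"


# Hypothetical dataset and DOI, for illustration only.
citation = format_dataset_citation(
    creators=["Doe, J.", "Martin, A."],
    year=2023,
    title="Example soil dataset",
    publisher="Example Repository",
    doi="10.1234/example",
    version="2.1",
)
print(citation)
```

Keeping the version as its own field makes it harder to forget when the dataset is updated.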

Research Data Lifecycle & DMP

What is a DMP?

A Data Management Plan (DMP) is a formal document describing how the data produced during and at the end of a research process or project will be obtained, documented, analysed, disseminated and archived.

It is a tool for managing data throughout the project, built around the notion of a life cycle.

Data management is not an end in itself but the means of leading to the discovery of knowledge and innovation through the integration and reuse of the knowledge produced.

Reymonet et al. (2018)

PGD

Plan de Gestion de Données (Data Management Plan)

PGD

Pour Générer du Dialogue (To Generate Dialogue)

DMP - Why, Who and When?

Why?

Plan the management of project data (obviously)
Describe how the data is obtained
Ensure that the data is understandable
Clarify the legal and ethical framework
Provide appropriate data storage
Define everyone’s responsibilities

Who?

Project Coordination Team (and associated members)

When?

Generally three releases (6 months, mid-project, end of project)

DMP - How?

Many templates
- ANR, INRAE, European Research Council, …
Many tools
- OPIDoR, DSW, ARGOS, Word/Nextcloud, …
- Machine-actionable DMP
- Comments & Guidance

DMP - Project / Structure

Project DMP

Defined in terms of the duration and scope of the project

Structure DMP

Defined for the scope of the structure to harmonise and document practices, in a more modular way

INRAE Project template

Information concerning the management plan
Information on the research project
Brief presentation of project data
Description and organisation of data
Intellectual property rights
Data Sensitivity
Data storage and backup during the project
Access and sharing of data at the end of the project
Data archiving and conservation after the end of the project

1. Information concerning the management plan

Author of the DMP
Affiliation of the author of the DMP
Date of creation of DMP
Current version: (n°, date)

2. Information on the research project

Identifier of the call for proposal
Project funder(s)
Name of research programme
Reference of funding agreement
Project acronym
Name of research project
Project leader institution, coordinator & beneficiary (name, country)
Other partners
Unit to which project leader belongs
Project dates and duration

3. Brief presentation of project data

Type, scope, scale
Origin
Associated publications

4. Description and organisation of data

What methods and tools are used to acquire and process data?
Documentation associated with the data
What types of metadata will be produced to accompany the data?
What standards or taxonomies will be used to describe the data?
How will the metadata be produced?
How will the data files be managed and organised during the project: control of versions, conventions for naming files, organisation of files, …
What is the quality control procedure of the data?
Enclose the quality assurance plan if possible

5. Intellectual property rights

Who owns the rights on data and other information created during the project?
Will material protected by specific rights be used during the project? In this case, who will deal with the formalities required, obtain the authorisations for use and possible dissemination?

6. Data Sensitivity

Identification of the data sensitivity Level
What are the measures taken and the norms that must be met to guarantee the security of sensitive data?
If there is personal data, what measures are envisaged to protect it during the project or in the context of re-use?

7. Data storage and backup during the project

Have the information systems used been subjected to a risk analysis or certification?
What types of physical media are used to store data during the project?
What security measures are in place during the data transfer stages of the project?
What is the estimated amount of data?
Where will the data be located geographically?
Does the entity physically hosting the data have a security policy for its information system and security assurance plan?
Security – Confidentiality: will the data be exchanged or shared with third parties?
How are rights of access to data determined during the research project?
Security – Integrity – Traceability: what measures of protection will be taken to monitor data production and analysis during the project?

9. Data archiving and conservation after the end of the project

What data will be conserved in the medium and long term and what data will be destroyed?
On what permanent archive platform will the data that are to be conserved long-term be archived?
What procedures will be set up for long-term conservation?
What is the duration of data conservation?
Who will be responsible for long-term conservation?
Name an individual contact
What will be the volume of these data?
What funding guarantees will cover the costs of long-term conservation?

Is it all good?

Ready to go?

Check-list

To read

Arnould, Pierre-Yves, and Marie-Christine Jacquemot-Perbal. 2016. “Guide de bonnes pratiques. Gestion et valorisation des données de la recherche.” Research Report. OTELo ; INIST-CNRS. https://hal.science/hal-01275841.

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.

Beagrie, Neil. 2019. “What to Keep: A Jisc Research Data Study.” Jisc. https://repository.jisc.ac.uk/id/eprint/7262.

CIRAD-DGDRS-DIST-FRA, ed. 2017. “Le Cycle de Vie Des Données. Intégrer La Gestion de Données Scientifiques Aux Activités de Recherche.” CIRAD. https://agritrop.cirad.fr/594579/.

Collberg, Christian, and Todd A. Proebsting. 2016. “Repeatability in Computer Systems Research.” Communications of the ACM 59 (3): 62–69. https://doi.org/10.1145/2812803.

“Deuxième Plan national pour la science ouverte.” 2021. Ministère de l’Enseignement supérieur, de la Recherche et de l’Innovation. https://www.ouvrirlascience.fr/wp-content/uploads/2021/06/Deuxieme-Plan-National-Science-Ouverte_2021-2024.pdf.

European Commission and Directorate-General for Research and Innovation. 2019. Cost-Benefit Analysis for FAIR Research Data : Cost of Not Having FAIR Research Data. Publications Office. https://doi.org/10.2777/02999.

Gibney, Elizabeth, and Richard Van Noorden. 2013. “Scientists Losing Data at a Rapid Rate.” Nature, December. https://doi.org/10.1038/nature.2013.14416.

Hart, Edmund M., Pauline Barmby, David LeBauer, François Michonneau, Sarah Mount, Patrick Mulrooney, Timothée Poisot, Kara H. Woo, Naupaka B. Zimmerman, and Jeffrey W. Hollister. 2016. “Ten Simple Rules for Digital Data Storage.” Edited by Scott Markel. PLOS Computational Biology 12 (10): e1005097. https://doi.org/10.1371/journal.pcbi.1005097.

Hey, Tony, Stewart Tansley, Kristin Tolle, and Jim Gray. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research. https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/.

Lister, Allyson, and Susanna-Assunta Sansone. 2023. “FAIRsharing in a Nutshell.” Zenodo. https://doi.org/10.5281/zenodo.7737367.

Michener, William K., James W. Brunt, John J. Helly, Thomas B. Kirchner, and Susan G. Stafford. 1997. “Nongeospatial Metadata for the Ecological Sciences.” Ecological Applications 7 (1): 330–42. https://doi.org/10.1890/1051-0761(1997)007[0330:nmftes]2.0.co;2.

Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1). https://doi.org/10.1038/s41562-016-0021.

Murphy, Denis J. 2014. “Using Modern Plant Breeding to Improve the Nutritional and Technological Qualities of Oil Crops.” OCL 21 (6): D607. https://doi.org/10.1051/ocl/2014038.

OCDE. 2020. Enhanced Access to Publicly Funded Data for Science, Technology and Innovation. https://doi.org/10.1787/947717bc-en.

Olivier, Philippe, Stephanie Rennes, Dimitri Szabo, and Anne-Sophie Martel. 2022. “Ouverture des données : … aussi ouvert que possible ... aussi fermé que nécessaire.” https://doi.org/10.17180/991x-t610.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716. https://doi.org/10.1126/science.aac4716.

Quintana, Daniel. 2022. “Five Things about Open and Reproducible Science That Every Early Career Researcher Should Know.” Open Science Framework, August. https://doi.org/10.17605/OSF.IO/DZTVQ.

Réseau Qualinous, Véronique Batifol, Laurent Burnel, Aurélie Cardona, and François Johany. 2021. “Affiche "Cycle de vie des données : un outil pour améliorer la gestion, la mise en qualité et l’ouverture des données".” https://doi.org/10.15454/hsc3-b796.

Reymonet, Nathalie, Magalie Moysan, Aurore Cartier, and Renaud Délémontez. 2018. “Réaliser un plan de gestion de données "FAIR" : modèle.” https://archivesic.ccsd.cnrs.fr/sic_01690547.

Russo, Francesco, Dario Righelli, and Claudia Angelini. 2016. Advantages and Limits in the Adoption of Reproducible Research and r-Tools for the Analysis of Omic Data. Edited by Claudia Angelini, Paola MV Rancoita, and Stefano Rovetta. Cham: Springer International Publishing.

Schleder, Gabriel R, Antonio C M Padilha, Carlos Mera Acosta, Marcio Costa, and Adalberto Fazzio. 2019. “From DFT to Machine Learning: Recent Approaches to Materials Science–a Review.” Journal of Physics: Materials 2 (3): 032001. https://doi.org/10.1088/2515-7639/ab084b.

Sébire, Fanny. 2023. “Check-list de l’Institut Pasteur pour des bonnes pratiques de gestion des données de recherche.” https://hal.science/hal-04123336.

The Turing Way Community. 2022. “The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research.” Zenodo. https://doi.org/10.5281/ZENODO.3233853.

“UNESCO Recommendation on Open Science.” 2021. UNESCO. https://doi.org/10.54677/mnmh8546.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

Wickham, Hadley, and Jenny Bryan. 2023. R Packages. 2nd ed. O’Reilly Media. https://r-pkgs.org/.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1). https://doi.org/10.1038/sdata.2016.18.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Yilmaz, Pelin, Renzo Kottmann, Dawn Field, Rob Knight, James R Cole, Linda Amaral-Zettler, Jack A Gilbert, et al. 2011. “Minimum Information about a Marker Gene Sequence (MIMARKS) and Minimum Information about Any (x) Sequence (MIxS) Specifications.” Nature Biotechnology 29 (5): 415–20. https://doi.org/10.1038/nbt.1823.

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1). https://doi.org/10.1186/s13059-016-1044-7.
