Metabarcoding analyses:
from sequences to plots.

Formation PROSE

Cédric Midoux

PROSE & MaIAGE

April 24, 2024

Introduction to amplicon analyses

Road to Metagenomics

Escobar-Zepeda, Vera-Ponce de León, and Sanchez-Flores (2015)

Meta-omics using next-generation sequencing (NGS)

Meta-omics using next-genertation sequencing (NGS)

Strengths

Detect subdominant microorganisms present in complex samples → microbial inventories
Get (approximate) relative abondances of different taxa in samples
Analyze and compare many taxa (hundreds) at the same time
Taxonomic profiles of the communities (usually up to genus level, and sometimes up to species or strain)
Low cost

Weaknesses

Compositional data, many biases → no absolute quantification
Exact identification of the organisms difficult
Hard to distinguish live and dead fractions of the communities
No functional view of the ecosystem

Choice of a marker gene

The perfect / ideal gene marker:

is ubiquist
is conserved among taxa
is enough divergent to distinguish stains
is not submitted to lateral transfer
has only one copy in genome
has conserved regions to design specific primers
is enough characterized to be present in databases for taxonomic affiliation

The gene encoding the small subunit of the ribosomal RNA

The most widely used gene in molecular phylogenetic studies
Ubiquist gene: 16S rDNA in prokayotes ; 18S rDNA in eukaryotes
Gene encoding a ribosomal RNA : non-coding RNA (not translated), part of the small subunit of the ribosome which is responsible for the translation of mRNA in proteins
Not submitted to lateral gene transfer
Availability of databases facilitating comparison
- Silva v138.1 - 2021: available SSU/LSU sequences to over 10,700,000

The 16S resolution

16S rRNA copy number

Median of the number of 16S rRNA copies in 3,070 bacterial species according to data reported in rrnDB database – 2018

16S rRNA copy variation

[B] The positions of sequence variation within 16S and 23S rRNA are shown along the gene organization of rrn operons. A total of 33 and 77 differences were identified in 16S rRNA and 23S rRNA, respectively.

[C] The number of bases that are different from the conserved sequence are shown for 16S and 23S rRNA for each rrn operon

16S rRNA copy variation

Only a minority of bacterial genomes harbors identical 16S rRNA gene copies
Sequence diversity increases with increasing copy numbers
While certain taxa harbor dissimilar 16S rRNA genes, others contain sequences common to multiple species

Planning an experiment

Challenges

Sequencing technologies

Illumina technology

Illumina technology

Sequencing biases

Contamination between samples during the same run
Contamination during successive runs (residual contaminants)
Variability between runs: take into account for experimental plan
Variability inside run: add some controls

Synthetis of biases

Synthesis

Bioinformatics

A pile of pipelines

Conclusion 1: sequencing data do not contain exactly what you sampled …

Summary

Conclusion 2: … but you now know how to deal with

FROGS
Find, Rapidly, OTUs with Galaxy Solution

FROGS team

FROGS is a INRAE development project

FROGS articles

How to use FROGS

Command line

remove_chimera.py 
  --input-biom clustering.biom \ 
  --input-fasta clustering.fasta \
  --non-chimera remove_chimera.fasta \
  --out-abundance remove_chimera.biom \
  --summary remove_chimera.html

Galaxy instances via web

FROGS docs and help

Website: https://frogs.toulouse.inrae.fr
Github: https://github.com/geraldinepascal/FROGS.git
Newsletter: subscription request at frogs-support@inrae.fr
Need help
- frogs-support@inrae.fr for generic questions
- help-migale@inrae.fr for bugs/quotas/errors with Galaxy Migale instance

TP1: Introduction to Galaxy

Sequencing data

FASTQ format

@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***

Meaning

@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ

Quality score encoding

Quality score

Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing

FASTQ compression

Compression is essential to deal with FASTQ files (reduce disk storage)
extension: file.fastq.gz
Tools are (almost all) able to deal with compressed files

Quality control

One of the most easy step in bioinformatics …
… but one of the most important
check if everything is ok
Indicates if/how to clean reads
Shows possible sequencing problems
The results must be interpreted in relation to what has been sequenced

Reads are not perfect

Why QC’ing your reads?

Try to answer to (not always) simple questions:

Are data conform to the expected level of performance?
- Size / Number of reads / Quality
Residual presence of adapters or indexes?
(Un)expected techincal biases?
(Un)expected biological biases?

Warning

QC without context leads to misinterpretation!

TP2: Quality control

Preprocess

What FROGS preprocess does?

Merging of R1 and R2 reads with vsearch Rognes et al. (2016), flash Magoč and Salzberg (2011) or pear Zhang et al. (2013) (only in command line)
Deletes sequences without good primers
Finds and removes adapter sequences with cutadapt
Deletes sequence with not expected lengths
Deletes sequences with ambiguous bases (N)
Dereplication
removing homopolymers (size = 8) for 454 data
quality filter for 454 data

Merging of paired-end reads

TP FROGS preprocess

Clustering

Sequencing data are noised

How to deal with these noised sequences?

Comparison all against all
- Very accurate
- Requires a lot of memory and/or time
Clustering
- closed-reference / open-reference
- de novo
Denoising

Vocabulary

A lot of terms for features built by softwares
- OTUs, zOTUS, ASVs, ESVs…
A recent review establishes the vocabulary Hakimzadeh et al. (2023)
- OTUs / ASVs / swarm clusters

ASVs are identical denoised reads with as few as 1 base pair difference between variants, representing an inference of the biological sequences prior to amplification and sequencing errors.

OTUs are formed with a % threshold clustering
Swarm clusters are a third feature type

OTU paradigm

Operational Taxonomic Unit

Operational Taxonomic Units

Operational Taxonomic Units

ASV paradigm

Amplicon Sequence Variants

ASV are inferred by a de novo process in which biological sequences are discriminated from errors on the basis of the expectation that biological sequences are more likely to be repeatedly observed than are error-containing sequences

ASV resolution

ASV resolution changes the composition for these samples

Swarm: A smart idea

Why Swarm?

Fixed clustering threshold is a real problem
OTUs construction is input-order depenent

`d`: the small local linking threshold

Swarm steps

Swarm

A robust and fast clustering method for amplicon-based studies
The purpose of swarm is to provide a novel clustering algorithm to handle large sets of amplicons
swarm results are resilient to input-order changes and rely on a small local linking threshold d, the maximum number of differences between two amplicons
swarm forms stable high-resolution clusters, with a high yield of biological information
Default: forms a lot of low-abundant OTUs that are in fact artifacts and need to be removed
Swarm (fastidious method + d=1) clusters + filters → ASVs

TP FROGS clustering

Chimera removal

Chimera detection strategies

Reference based: against a database of «genuine» sequences
- dependant of the references used
De novo: against abundant sequences in the samples
FROGS uses vsearch Rognes et al. (2016) as chimera removal tool

Sample-cross validation

FROGS adds a sample-cross validation

TP FROGS remove chimera

Abundance/Prevalence filters

How to filter clusters?

Low abundant sequences
Clusters not shown in few replicates
Contamination

TP FROGS cluster filters

Taxonomic affiliation

The FROGS databanks

Affiliation with blast
- Check Blast metrics to avoid concluding too fast
- Take care of the reference databank used!
Command line: you can use your own databank
Galaxy
- You have access to several databanks
- Admins have to add your databank

The multi-affiliations

FROGS gives all identical hits

Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus xylosus
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus saprophyticus

Strictly identical (V1-V3 amplification) on 499 nucleotides

FROGS can’t decide if it’s one or another
You have to check if you can choose between multi-affiliations

Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Multi-affiliations

To help you

https://shiny.migale.inrae.fr/app/affiliationexplorer
a very user-friendly Shiny web app, allowing users to modify very simply the affiliations from a FROGS abundance file

TP FROGS Taxonomic affiliation

Affiliation filters

Remaining contamination?
Want to analyse only the Firmicutes?
2 modes
- Deleting: remove ASVs
- Hiding: only the affiliation is modified, not the abundance

Phylogenetic tree

FROGS tree

This tool builds a phylogenetic tree thanks to affiliations of ASVs contained in the BIOM file
Needed to compute beta-diversity indices based on phylogenetic distances
Interesting to explore poor-characterized environments

TP FROGS Tree

Conclusion of bioinformatics analysis

FROGS Procedure

FROGS Output

BIOM file
- Count matrix for each ASV for each sample
- Taxonomic affiliation table for each ASV
Multiaffiliation table
FASTA file (Representative sequence for each ASV)
Newick file (Phylogenetic tree)
Many HTML report

Metadata table for each sample to be added

Road to statistics

Analysis of community composition

Easy16S

a user-friendly Shiny web-service for exploration and visualization of microbiome data
Documentation: https://easy16s.migale.inrae.fr
Server: https://shiny.migale.inrae.fr/app/easy16S

Data import

Demo dataset
Build from FROGS
- BIOM
- Metadata
- Phylogenetic tree
RData/RDS

Preprocess data

Samples names
Sample variables (metadata)
Depth (samples sums)
Aggregate taxa
Spread taxonomy
Rarefaction
Transform the abundances using fun

Data structure

Summary

Table

phyloseq data structure
- OTU Table
- Taxomony Table
- Agglomerate OTU Table
- Sample Data Table
  - metadata with esquisse
- a phylogenetic tree (not viewable)
- a set of reference sequences (not viewable)

Barplot

Rarefaction curves

Heatmap composition

Biodiversity indices

Different kinds of biodiversity indices…
- α-diversity: diversity within a community
- β-diversity: diversity between communities
- γ-diversity: diversity at the landscape scale (blurry for bacterial communities)

Based on different types of data

Presence/Absence (qualitative) vs. Abundance (quantitative)
- Presence/Absence gives less weight to dominant species
- is more sensitive to differences in sampling depths
- emphasizes difference in taxa diversity rather than differences in composition
Compositional vs. Phylogenetic
- Compositional does not require a phylogenetic tree
- is more sensitive to erroneous otu picking
- gives the same importance to all otus

α-diversities - indices

Note \(c_i\) the number of species observed \(i\) times (\(i = 1, 2, \ldots\)) and \(p_s\) the proportion of species \(s\) (\(s = 1, \ldots, S\))

Species richness: number of observed otus
- \(S_{rich} = \sum_s1_{p_s>0}=\sum_{i}c_i\)

Chao1: number of observed otu + estimate of the number of unobserved otus
- \(S_{Chao} = S_{rich}+\hat{c}_0\)

Shannon entropy: the width of the otu relative abundance distribution. Roughly, it reflects our (in)ability to predict the otu of a randomly picked bacteria
- \(S_{Shan}=-\sum_sp_s\log(p_s)\leq\log(S)\)

Inverse Simpson: inverse of the probability that two bacteria picked at random belong to the same otu.
- \(S_{InvSimp}=\frac{1}{p^2_1 + \ldots + p^2_S}\leq S\)

α-diversities

β-dissimilarities - compositional

Note \(\color{red}{n_s^1}\) the count of species \(s\) (\(s = 1, \ldots , S\) ) in \(\color{red}{community \space 1}\) and \(\color{blue}{n_s^2}\) the count in \(\color{blue}{community \space 2}\).

Jaccard: Fraction of species specific to either 1 or 2
Bray-Curtis: Fraction of the community specific to 1 or to 2

β-dissimilarities - phylogenetic

For each branch \(e\), note \(l_e\) its length and \(\color{red}{p_e}\) (resp. \(\color{blue}{q_e}\)) the fraction of \(\color{red}{community \space 1}\) (resp. \(\color{blue}{community \space 2}\)) below branch \(e\).

Unifrac: Fraction of tree specific to either 1 or 2
Weighted Unifrac: Fraction of the diversity specific to 1 or to 2

β-dissimilarities - Compo vs. Quali

Tip

Different distances capture different features of the samples.
There is no “one size fits all”!

Community structure

Ordination

Principal Component Analysis (PCA)

Each community is described by otus abundances
Otus abundance maybe correlated
PCA finds linear combinations of otus that
- are uncorrelated
- capture well the variance of community composition

But variance is not a very good measure of β-diversity.

MultiDimensional Scaling (MDS)

Start from a distance matrix \(D = (d_{ij})\)
Project the communities \(Com_i \mapsto X_i\) in a euclidian space such that distances are preserved \(\lVert X_i - X_j \rVert \approx d_{ij}\)

MultiDimensional Scaling (MDS)

Hierarchical Clustering

Merge closest communities (according to some distance)
Update distances between sets of communities using linkage function
Repeat until all communities have been merged

Linkage function

complete
ward.D2
single
…

Clustering

Diversity Partitioning

Objectives

Test composition differences of communities from different groups using a distance matrix
Compare within group to between group distances

Permutational Multivariate ANOVA

Test differences in the community composition of communities from different groups using a distance matrix
Permutation is simply the act of rearranging objects
ANOVA-like test statistics from a matrix of resemblances

Does weaning affect community composition?
Are groups A and B different?

Permutational Multivariate ANOVA

Assumptions

Objects in the data set are exchangeable under the null hypothesis
Exchangable objects (sites, samples, observations, etc) are independent
Exchangable objects have similar multivariate disperson (i.e. each group has a similar degree of multivariate scatter)

Limitations

If groups have different dispersions, p-value are not adequate
p-values computed using permutations, permutations must respect the design

Differential abundance

Comparisons at the global level: is there structure in the data?

We know that groups A and B are different.
How do they differ (in terms of taxa)?

Differential abundance

Specificity

Negative binomial models were developed for transcriptomics data
Normalization assumes that most transcripts are not DA
Reasonable for comparison before/after antibiotic intervention
Not so when comparing Soil against Seawater
Not clear how negative binomial models cope with this sparsity

Differential abundance

Settings

Select a experimental design and a contrast.

Axis

large magnitude fold-changes (log2, x axis)
high statistical significance (-log10 of p-value, y axis)

Volcano plot

Conclusion

Any questions?

Bibliography

Bernard, Maria, Olivier Rué, Mahendra Mariadassou, and Géraldine Pascal. 2021. “FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers.” Briefings in Bioinformatics 22 (6). https://doi.org/10.1093/bib/bbab318.

Bharti, Richa, and Dominik G Grimm. 2019. “Current challenges and best-practice protocols for microbiome analysis.” Briefings in Bioinformatics 22 (1): 178–93. https://doi.org/10.1093/bib/bbz155.

Brooks, J Paul, David J Edwards, Michael D Harwich, Maria C Rivera, Jennifer M Fettweis, Myrna G Serrano, Robert A Reris, et al. 2015. “The Truth about Metagenomics: Quantifying and Counteracting Bias in 16S rRNA Studies.” BMC Microbiology 15 (1): 1–14.

Cruaud, Perrine, Jean-Yves Rasplus, Lillian Jennifer Rodriguez, and Astrid Cruaud. 2017. “High-Throughput Sequencing of Multiple Amplicons for Barcoding and Integrative Taxonomy.” Scientific Reports 7 (1): 41948.

Escobar-Zepeda, Alejandra, Arturo Vera-Ponce de León, and Alejandro Sanchez-Flores. 2015. “The Road to Metagenomics: From Microbiology to DNA Sequencing Technologies and Bioinformatics.” Frontiers in Genetics 6: 348.

Escudié, Frédéric, Lucas Auer, Maria Bernard, Mahendra Mariadassou, Laurent Cauquil, Katia Vidal, Sarah Maman, Guillermina Hernandez-Raquet, Sylvie Combes, and Géraldine Pascal. 2018. “FROGS: Find, Rapidly, OTUs with Galaxy Solution.” Bioinformatics 34 (8): 1287–94. https://doi.org/10.1093/bioinformatics/btx791.

Espejo, Romilio T, and Nicolás Plaza. 2018. “Multiple Ribosomal RNA Operons in Bacteria; Their Concerted Evolution and Potential Consequences on the Rate of Evolution of Their 16S rRNA.” Frontiers in Microbiology 9: 1232.

Hakimzadeh, Ali, Alejandro Abdala Asbun, Davide Albanese, Maria Bernard, Dominik Buchner, Benjamin Callahan, J Gregory Caporaso, et al. 2023. “A Pile of Pipelines: An Overview of the Bioinformatics Software for Metabarcoding Data Analyses.” Molecular Ecology Resources.

Maeda, Michihisa, Tomohiro Shimada, and Akira Ishihama. 2015. “Strength and Regulation of Seven rRNA Promoters in Escherichia Coli.” PLoS One 10 (12): e0144697.

Magoč, Tanja, and Steven L Salzberg. 2011. “FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies.” Bioinformatics 27 (21): 2957–63.

Mahé, Frédéric, Torbjørn Rognes, Christopher Quince, Colomban de Vargas, and Micah Dunthorn. 2015. “Swarm V2: Highly-Scalable and High-Resolution Amplicon Clustering.” PeerJ 3: e1420.

Rognes, Torbjørn, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” PeerJ 4: e2584.

Salter, Susannah J, Michael J Cox, Elena M Turek, Szymon T Calus, William O Cookson, Miriam F Moffatt, Paul Turner, Julian Parkhill, Nicholas J Loman, and Alan W Walker. 2014. “Reagent and Laboratory Contamination Can Critically Impact Sequence-Based Microbiome Analyses.” BMC Biology 12 (1). https://doi.org/10.1186/s12915-014-0087-z.

Stoler, Nicholas, and Anton Nekrutenko. 2021. “Sequencing error profiles of Illumina sequencing instruments.” NAR Genomics and Bioinformatics 3 (1). https://doi.org/10.1093/nargab/lqab019.

Tan, Yung Cheng, Asqwin Uthaya Kumar, Ying Pei Wong, and Anna Pick Kiong Ling. 2022. “Bioinformatics Approaches and Applications in Plant Biotechnology.” Journal of Genetic Engineering and Biotechnology 20 (1): 1–13.

Whon, Tae Woong, Won-Hyong Chung, Mi Young Lim, Eun-Ji Song, Pil Soo Kim, Dong-Wook Hyun, Na-Ri Shin, Jin-Woo Bae, and Young-Do Nam. 2018. “The Effects of Sequencing Platforms on Phylogenetic Resolution in 16 s rRNA Gene Profiling of Human Feces.” Scientific Data 5 (1): 1–15.

Yarza, Pablo, Pelin Yilmaz, Elmar Pruesse, Frank Oliver Glöckner, Wolfgang Ludwig, Karl-Heinz Schleifer, William B Whitman, Jean Euzéby, Rudolf Amann, and Ramon Rosselló-Móra. 2014. “Uniting the Classification of Cultured and Uncultured Bacteria and Archaea Using 16S rRNA Gene Sequences.” Nature Reviews Microbiology 12 (9): 635–45.

Zhang, Jiajie, Kassian Kobert, Tomáš Flouri, and Alexandros Stamatakis. 2013. “PEAR: A Fast and Accurate Illumina Paired-End reAd mergeR.” Bioinformatics 30 (5): 614–20.

Metabarcoding analyses:from sequences to plots.

Introduction to amplicon analyses

Road to Metagenomics

Meta-omics using next-generation sequencing (NGS)

Meta-omics using next-genertation sequencing (NGS)

Strengths

Weaknesses

Choice of a marker gene

The gene encoding the small subunit of the ribosomal RNA

The 16S resolution

16S rRNA copy number

16S rRNA copy variation

16S rRNA copy variation

Planning an experiment

Challenges

Sequencing technologies

Sequencing technologies

Illumina technology

Illumina technology

Illumina technology

Sequencing biases

Synthetis of biases

Synthesis

Bioinformatics

A pile of pipelines

Conclusion 1: sequencing data do not contain exactly what you sampled …

Summary

Conclusion 2: … but you now know how to deal with

FROGS Find, Rapidly, OTUs with Galaxy Solution

FROGS team

FROGS articles

How to use FROGS

FROGS docs and help

Sequencing data

FASTQ format

Quality score encoding

Quality score

FASTQ compression

Quality control

Quality control

Reads are not perfect

Why QC’ing your reads?

Preprocess

What FROGS preprocess does?

Merging of paired-end reads

Clustering

Sequencing data are noised

How to deal with these noised sequences?

Vocabulary

OTU paradigm

Operational Taxonomic Units

Operational Taxonomic Units

Operational Taxonomic Units

ASV paradigm

ASV resolution

Swarm: A smart idea

Why Swarm?

d: the small local linking threshold

Swarm steps

Swarm

Chimera removal

Chimera removal

Chimera detection strategies

Sample-cross validation

Abundance/Prevalence filters

How to filter clusters?

Taxonomic affiliation

The FROGS databanks

The multi-affiliations

To help you

Affiliation filters

Phylogenetic tree

FROGS tree

Conclusion of bioinformatics analysis

FROGS Procedure

FROGS Output

Road to statistics

Analysis of community composition

Easy16S

Data import

Metabarcoding analyses:
from sequences to plots.

FROGS
Find, Rapidly, OTUs with Galaxy Solution

`d`: the small local linking threshold