Metabarcoding analyses:
from sequences to plots.

Formation PROSE

Cédric Midoux

PROSE & MaIAGE

April 24, 2024

Introduction to amplicon analyses

Road to Metagenomics

Escobar-Zepeda, Vera-Ponce de León, and Sanchez-Flores (2015)

Meta-omics using next-generation sequencing (NGS)

Meta-omics using next-genertation sequencing (NGS)

Strengths

  • Detect subdominant microorganisms present in complex samples → microbial inventories
  • Get (approximate) relative abondances of different taxa in samples
  • Analyze and compare many taxa (hundreds) at the same time
  • Taxonomic profiles of the communities (usually up to genus level, and sometimes up to species or strain)
  • Low cost

Weaknesses

  • Compositional data, many biases → no absolute quantification
  • Exact identification of the organisms difficult
  • Hard to distinguish live and dead fractions of the communities
  • No functional view of the ecosystem

Choice of a marker gene

The perfect / ideal gene marker:

  • is ubiquist
  • is conserved among taxa
  • is enough divergent to distinguish stains
  • is not submitted to lateral transfer
  • has only one copy in genome
  • has conserved regions to design specific primers
  • is enough characterized to be present in databases for taxonomic affiliation

The gene encoding the small subunit of the ribosomal RNA

  • The most widely used gene in molecular phylogenetic studies
  • Ubiquist gene: 16S rDNA in prokayotes ; 18S rDNA in eukaryotes
  • Gene encoding a ribosomal RNA : non-coding RNA (not translated), part of the small subunit of the ribosome which is responsible for the translation of mRNA in proteins
  • Not submitted to lateral gene transfer
  • Availability of databases facilitating comparison
    • Silva v138.1 - 2021: available SSU/LSU sequences to over 10,700,000

The 16S resolution

16S rRNA copy number

Median of the number of 16S rRNA copies in 3,070 bacterial species according to data reported in rrnDB database – 2018

16S rRNA copy variation

[B] The positions of sequence variation within 16S and 23S rRNA are shown along the gene organization of rrn operons. A total of 33 and 77 differences were identified in 16S rRNA and 23S rRNA, respectively.

[C] The number of bases that are different from the conserved sequence are shown for 16S and 23S rRNA for each rrn operon

16S rRNA copy variation

  • Only a minority of bacterial genomes harbors identical 16S rRNA gene copies
  • Sequence diversity increases with increasing copy numbers
  • While certain taxa harbor dissimilar 16S rRNA genes, others contain sequences common to multiple species

Planning an experiment

Challenges

Sequencing technologies

Sequencing technologies

Illumina technology

Illumina technology

Illumina technology

Sequencing biases

  • Contamination between samples during the same run
  • Contamination during successive runs (residual contaminants)
  • Variability between runs: take into account for experimental plan
  • Variability inside run: add some controls

Synthetis of biases

Synthesis

Bioinformatics

A pile of pipelines

Conclusion 1: sequencing data do not contain exactly what you sampled …

Summary

Conclusion 2: … but you now know how to deal with

FROGS
Find, Rapidly, OTUs with Galaxy Solution

FROGS team

  • FROGS is a INRAE development project

FROGS articles

Escudié et al. (2018)

Bernard et al. (2021)

How to use FROGS

  • Command line
remove_chimera.py 
  --input-biom clustering.biom \ 
  --input-fasta clustering.fasta \
  --non-chimera remove_chimera.fasta \
  --out-abundance remove_chimera.biom \
  --summary remove_chimera.html
  • Galaxy instances via web

FROGS docs and help

TP1: Introduction to Galaxy

Sequencing data

FASTQ format

@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***

Meaning

@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ

Quality score encoding

Quality score

Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing

FASTQ compression

  • Compression is essential to deal with FASTQ files (reduce disk storage)
  • extension: file.fastq.gz
  • Tools are (almost all) able to deal with compressed files

Quality control

Quality control

  • One of the most easy step in bioinformatics …
  • … but one of the most important
  • check if everything is ok
  • Indicates if/how to clean reads
  • Shows possible sequencing problems
  • The results must be interpreted in relation to what has been sequenced

Reads are not perfect

Why QC’ing your reads?

Try to answer to (not always) simple questions:

  • Are data conform to the expected level of performance?
    • Size / Number of reads / Quality
  • Residual presence of adapters or indexes?
  • (Un)expected techincal biases?
  • (Un)expected biological biases?

Warning

QC without context leads to misinterpretation!

TP2: Quality control

Preprocess

What FROGS preprocess does?

  • Merging of R1 and R2 reads with vsearch Rognes et al. (2016), flash Magoč and Salzberg (2011) or pear Zhang et al. (2013) (only in command line)
  • Deletes sequences without good primers
  • Finds and removes adapter sequences with cutadapt
  • Deletes sequence with not expected lengths
  • Deletes sequences with ambiguous bases (N)
  • Dereplication
  • removing homopolymers (size = 8) for 454 data
  • quality filter for 454 data

Merging of paired-end reads

TP FROGS preprocess

Clustering

Sequencing data are noised

How to deal with these noised sequences?

  • Comparison all against all
    • Very accurate
    • Requires a lot of memory and/or time
  • Clustering
    • closed-reference / open-reference
    • de novo
  • Denoising

Vocabulary

  • A lot of terms for features built by softwares
    • OTUs, zOTUS, ASVs, ESVs…
  • A recent review establishes the vocabulary Hakimzadeh et al. (2023)
    • OTUs / ASVs / swarm clusters

ASVs are identical denoised reads with as few as 1 base pair difference between variants, representing an inference of the biological sequences prior to amplification and sequencing errors.

  • OTUs are formed with a % threshold clustering
  • Swarm clusters are a third feature type

OTU paradigm

  • Operational Taxonomic Unit

Operational Taxonomic Units

Operational Taxonomic Units

Operational Taxonomic Units

ASV paradigm

  • Amplicon Sequence Variants

ASV are inferred by a de novo process in which biological sequences are discriminated from errors on the basis of the expectation that biological sequences are more likely to be repeatedly observed than are error-containing sequences

ASV resolution

  • ASV resolution changes the composition for these samples

Swarm: A smart idea

Why Swarm?

  • Fixed clustering threshold is a real problem
  • OTUs construction is input-order depenent

d: the small local linking threshold

Swarm steps

Swarm

  • A robust and fast clustering method for amplicon-based studies
  • The purpose of swarm is to provide a novel clustering algorithm to handle large sets of amplicons
  • swarm results are resilient to input-order changes and rely on a small local linking threshold d, the maximum number of differences between two amplicons
  • swarm forms stable high-resolution clusters, with a high yield of biological information
  • Default: forms a lot of low-abundant OTUs that are in fact artifacts and need to be removed
  • Swarm (fastidious method + d=1) clusters + filters → ASVs

TP FROGS clustering

Chimera removal

Chimera removal

Chimera detection strategies

  • Reference based: against a database of «genuine» sequences

    • dependant of the references used
  • De novo: against abundant sequences in the samples

  • FROGS uses vsearch Rognes et al. (2016) as chimera removal tool

Sample-cross validation

  • FROGS adds a sample-cross validation

TP FROGS remove chimera

Abundance/Prevalence filters

How to filter clusters?

  • Low abundant sequences
  • Clusters not shown in few replicates
  • Contamination

TP FROGS cluster filters

Taxonomic affiliation

The FROGS databanks

  • Affiliation with blast
    • Check Blast metrics to avoid concluding too fast
    • Take care of the reference databank used!
  • Command line: you can use your own databank
  • Galaxy

The multi-affiliations

  • FROGS gives all identical hits
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus xylosus
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus saprophyticus

Strictly identical (V1-V3 amplification) on 499 nucleotides

  • FROGS can’t decide if it’s one or another
  • You have to check if you can choose between multi-affiliations
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Multi-affiliations

To help you

TP FROGS Taxonomic affiliation

Affiliation filters

  • Remaining contamination?
  • Want to analyse only the Firmicutes?
  • 2 modes
    • Deleting: remove ASVs
    • Hiding: only the affiliation is modified, not the abundance

Phylogenetic tree

FROGS tree

  • This tool builds a phylogenetic tree thanks to affiliations of ASVs contained in the BIOM file
  • Needed to compute beta-diversity indices based on phylogenetic distances
  • Interesting to explore poor-characterized environments

TP FROGS Tree

Conclusion of bioinformatics analysis

FROGS Procedure

FROGS Output

  • BIOM file
    • Count matrix for each ASV for each sample
    • Taxonomic affiliation table for each ASV
  • Multiaffiliation table
  • FASTA file (Representative sequence for each ASV)
  • Newick file (Phylogenetic tree)
  • Many HTML report
  • Metadata table for each sample to be added

Road to statistics

Analysis of community composition

Easy16S

Data import

  • Demo dataset
  • Build from FROGS
    • BIOM
    • Metadata
    • Phylogenetic tree
  • RData/RDS

Preprocess data

  • Samples names
  • Sample variables (metadata)
  • Depth (samples sums)
  • Aggregate taxa
  • Spread taxonomy
  • Rarefaction
  • Transform the abundances using fun

Data structure

Summary

Table

  • phyloseq data structure
    • OTU Table
    • Taxomony Table
    • Agglomerate OTU Table
    • Sample Data Table
      • metadata with esquisse
    • a phylogenetic tree (not viewable)
    • a set of reference sequences (not viewable)

Barplot

Rarefaction curves

Heatmap composition

Biodiversity indices

Biodiversity indices

  • Different kinds of biodiversity indices…
    • α-diversity: diversity within a community
    • β-diversity: diversity between communities
    • γ-diversity: diversity at the landscape scale (blurry for bacterial communities)

Based on different types of data

  • Presence/Absence (qualitative) vs. Abundance (quantitative)
    • Presence/Absence gives less weight to dominant species
    • is more sensitive to differences in sampling depths
    • emphasizes difference in taxa diversity rather than differences in composition
  • Compositional vs. Phylogenetic
    • Compositional does not require a phylogenetic tree
    • is more sensitive to erroneous otu picking
    • gives the same importance to all otus

α-diversities - indices

Note \(c_i\) the number of species observed \(i\) times (\(i = 1, 2, \ldots\)) and \(p_s\) the proportion of species \(s\) (\(s = 1, \ldots, S\))

  • Species richness: number of observed otus
    • \(S_{rich} = \sum_s1_{p_s>0}=\sum_{i}c_i\)
  • Chao1: number of observed otu + estimate of the number of unobserved otus
    • \(S_{Chao} = S_{rich}+\hat{c}_0\)
  • Shannon entropy: the width of the otu relative abundance distribution. Roughly, it reflects our (in)ability to predict the otu of a randomly picked bacteria
    • \(S_{Shan}=-\sum_sp_s\log(p_s)\leq\log(S)\)
  • Inverse Simpson: inverse of the probability that two bacteria picked at random belong to the same otu.
    • \(S_{InvSimp}=\frac{1}{p^2_1 + \ldots + p^2_S}\leq S\)

α-diversities

β-dissimilarities - compositional

Note \(\color{red}{n_s^1}\) the count of species \(s\) (\(s = 1, \ldots , S\) ) in \(\color{red}{community \space 1}\) and \(\color{blue}{n_s^2}\) the count in \(\color{blue}{community \space 2}\).

  • Jaccard: Fraction of species specific to either 1 or 2
  • Bray-Curtis: Fraction of the community specific to 1 or to 2

β-dissimilarities - phylogenetic

For each branch \(e\), note \(l_e\) its length and \(\color{red}{p_e}\) (resp. \(\color{blue}{q_e}\)) the fraction of \(\color{red}{community \space 1}\) (resp. \(\color{blue}{community \space 2}\)) below branch \(e\).

  • Unifrac: Fraction of tree specific to either 1 or 2
  • Weighted Unifrac: Fraction of the diversity specific to 1 or to 2

β-dissimilarities - Compo vs. Quali

Tip

Different distances capture different features of the samples.
There is no “one size fits all”!

Community structure

Ordination

Principal Component Analysis (PCA)

  • Each community is described by otus abundances
  • Otus abundance maybe correlated
  • PCA finds linear combinations of otus that
    • are uncorrelated
    • capture well the variance of community composition

But variance is not a very good measure of β-diversity.

MultiDimensional Scaling (MDS)

  • Start from a distance matrix \(D = (d_{ij})\)
  • Project the communities \(Com_i \mapsto X_i\) in a euclidian space such that distances are preserved \(\lVert X_i - X_j \rVert \approx d_{ij}\)

MultiDimensional Scaling (MDS)

Hierarchical Clustering

  • Merge closest communities (according to some distance)
  • Update distances between sets of communities using linkage function
  • Repeat until all communities have been merged

Linkage function

  • complete
  • ward.D2
  • single

Clustering

Diversity Partitioning

Objectives

  • Test composition differences of communities from different groups using a distance matrix
  • Compare within group to between group distances

Permutational Multivariate ANOVA

  • Test differences in the community composition of communities from different groups using a distance matrix
  • Permutation is simply the act of rearranging objects
  • ANOVA-like test statistics from a matrix of resemblances

Does weaning affect community composition?
Are groups A and B different?

Permutational Multivariate ANOVA

Assumptions

  • Objects in the data set are exchangeable under the null hypothesis
  • Exchangable objects (sites, samples, observations, etc) are independent
  • Exchangable objects have similar multivariate disperson (i.e. each group has a similar degree of multivariate scatter)

Limitations

  • If groups have different dispersions, p-value are not adequate
  • p-values computed using permutations, permutations must respect the design

Differential abundance

  • Comparisons at the global level: is there structure in the data?

We know that groups A and B are different.
How do they differ (in terms of taxa)?

Differential abundance

Specificity

  • Negative binomial models were developed for transcriptomics data
  • Normalization assumes that most transcripts are not DA
  • Reasonable for comparison before/after antibiotic intervention
  • Not so when comparing Soil against Seawater
  • Not clear how negative binomial models cope with this sparsity

Differential abundance

Settings

  • Select a experimental design and a contrast.

Axis

  • large magnitude fold-changes (log2, x axis)
  • high statistical significance (-log10 of p-value, y axis)

Volcano plot

Conclusion

Any questions?

Bibliography

Bernard, Maria, Olivier Rué, Mahendra Mariadassou, and Géraldine Pascal. 2021. FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers.” Briefings in Bioinformatics 22 (6). https://doi.org/10.1093/bib/bbab318.
Bharti, Richa, and Dominik G Grimm. 2019. Current challenges and best-practice protocols for microbiome analysis.” Briefings in Bioinformatics 22 (1): 178–93. https://doi.org/10.1093/bib/bbz155.
Brooks, J Paul, David J Edwards, Michael D Harwich, Maria C Rivera, Jennifer M Fettweis, Myrna G Serrano, Robert A Reris, et al. 2015. “The Truth about Metagenomics: Quantifying and Counteracting Bias in 16S rRNA Studies.” BMC Microbiology 15 (1): 1–14.
Cruaud, Perrine, Jean-Yves Rasplus, Lillian Jennifer Rodriguez, and Astrid Cruaud. 2017. “High-Throughput Sequencing of Multiple Amplicons for Barcoding and Integrative Taxonomy.” Scientific Reports 7 (1): 41948.
Escobar-Zepeda, Alejandra, Arturo Vera-Ponce de León, and Alejandro Sanchez-Flores. 2015. “The Road to Metagenomics: From Microbiology to DNA Sequencing Technologies and Bioinformatics.” Frontiers in Genetics 6: 348.
Escudié, Frédéric, Lucas Auer, Maria Bernard, Mahendra Mariadassou, Laurent Cauquil, Katia Vidal, Sarah Maman, Guillermina Hernandez-Raquet, Sylvie Combes, and Géraldine Pascal. 2018. FROGS: Find, Rapidly, OTUs with Galaxy Solution.” Bioinformatics 34 (8): 1287–94. https://doi.org/10.1093/bioinformatics/btx791.
Espejo, Romilio T, and Nicolás Plaza. 2018. “Multiple Ribosomal RNA Operons in Bacteria; Their Concerted Evolution and Potential Consequences on the Rate of Evolution of Their 16S rRNA.” Frontiers in Microbiology 9: 1232.
Hakimzadeh, Ali, Alejandro Abdala Asbun, Davide Albanese, Maria Bernard, Dominik Buchner, Benjamin Callahan, J Gregory Caporaso, et al. 2023. “A Pile of Pipelines: An Overview of the Bioinformatics Software for Metabarcoding Data Analyses.” Molecular Ecology Resources.
Maeda, Michihisa, Tomohiro Shimada, and Akira Ishihama. 2015. “Strength and Regulation of Seven rRNA Promoters in Escherichia Coli.” PLoS One 10 (12): e0144697.
Magoč, Tanja, and Steven L Salzberg. 2011. “FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies.” Bioinformatics 27 (21): 2957–63.
Mahé, Frédéric, Torbjørn Rognes, Christopher Quince, Colomban de Vargas, and Micah Dunthorn. 2015. “Swarm V2: Highly-Scalable and High-Resolution Amplicon Clustering.” PeerJ 3: e1420.
Rognes, Torbjørn, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” PeerJ 4: e2584.
Salter, Susannah J, Michael J Cox, Elena M Turek, Szymon T Calus, William O Cookson, Miriam F Moffatt, Paul Turner, Julian Parkhill, Nicholas J Loman, and Alan W Walker. 2014. “Reagent and Laboratory Contamination Can Critically Impact Sequence-Based Microbiome Analyses.” BMC Biology 12 (1). https://doi.org/10.1186/s12915-014-0087-z.
Stoler, Nicholas, and Anton Nekrutenko. 2021. Sequencing error profiles of Illumina sequencing instruments.” NAR Genomics and Bioinformatics 3 (1). https://doi.org/10.1093/nargab/lqab019.
Tan, Yung Cheng, Asqwin Uthaya Kumar, Ying Pei Wong, and Anna Pick Kiong Ling. 2022. “Bioinformatics Approaches and Applications in Plant Biotechnology.” Journal of Genetic Engineering and Biotechnology 20 (1): 1–13.
Whon, Tae Woong, Won-Hyong Chung, Mi Young Lim, Eun-Ji Song, Pil Soo Kim, Dong-Wook Hyun, Na-Ri Shin, Jin-Woo Bae, and Young-Do Nam. 2018. “The Effects of Sequencing Platforms on Phylogenetic Resolution in 16 s rRNA Gene Profiling of Human Feces.” Scientific Data 5 (1): 1–15.
Yarza, Pablo, Pelin Yilmaz, Elmar Pruesse, Frank Oliver Glöckner, Wolfgang Ludwig, Karl-Heinz Schleifer, William B Whitman, Jean Euzéby, Rudolf Amann, and Ramon Rosselló-Móra. 2014. “Uniting the Classification of Cultured and Uncultured Bacteria and Archaea Using 16S rRNA Gene Sequences.” Nature Reviews Microbiology 12 (9): 635–45.
Zhang, Jiajie, Kassian Kobert, Tomáš Flouri, and Alexandros Stamatakis. 2013. “PEAR: A Fast and Accurate Illumina Paired-End reAd mergeR.” Bioinformatics 30 (5): 614–20.