13.3 Genome-resolved metagenomics

Genome-resolved metagenomics aims to recover near-complete bacterial genomes from metagenomic mixtures. It relies on the assembly and read-mapping procedures explained in the assembly-based approach section, followed by a binning procedure that produces the so-called metagenome-assembled genomes (MAGs).

Note that entire suites and pipelines are available for conducting all the steps outlined in this section, and often more. Some of them include:

  • Anvi’o
  • metaWRAP
  • ATLAS

Binning

Metagenomic binning is the bioinformatic process that attempts to group metagenomic sequences by their organism of origin (Goussarov et al.). In practice, binning clusters the contigs of a metagenomic assembly into putative bacterial genomes. Over the last decade, more than a dozen binning algorithms have been released, each relying on different structural and mathematical properties of the input data.

Two of the most relevant structural properties for grouping contigs into bins are the oligonucleotide composition of contigs and the presence of universally conserved genes in them. MaxBin, for example, relies on such universally conserved genes to initialize clusters, which are then expanded using the oligonucleotide composition of contigs. Besides these structural attributes, the main quantitative measure used for binning is differential coverage, which is computed by counting the number of reads from different samples mapped to the assembly. This information is used by the binning algorithms CONCOCT and MetaBAT, for example.

MetaBAT and MaxBin require a depth file to be generated first, which can be produced with the jgi_summarize_bam_contig_depths script distributed with MetaBAT.

jgi_summarize_bam_contig_depths \
    --outputDepth {output.depth} \
    {input.assemblybampath}

Example code for launching MetaBAT2.

metabat2 \
    -i {input.assemblypath} \
    -a {input.depth} \
    -o {output.basepath} \
    -m 1500 \
    -t {threads} \
    --unbinned

Example code for launching MaxBin.

run_MaxBin.pl \
    -contig {input.assemblypath} \
    -abund {input.depth} \
    -out {output.basepath} \
    -thread {threads}

Bin refinement

The performance of binning algorithms is largely dependent on the specific properties of each sample: a tool that performs very well on one sample can easily be outcompeted by another one on the next. In consequence, many researchers opt for ensemble approaches, whereby assemblies are binned using multiple algorithms, followed by a refinement step that merges all the generated information into consensus bins. This final step is often referred to as “bin refinement”, and can be performed using tools like metaWRAP [46] or DAS Tool [47] (a DAS Tool sketch is shown below). Several benchmarking studies have shown that such ensemble approaches usually outperform individual binning tools.
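
DAS Tool takes a contig-to-bin table from each binner and scores candidate consensus bins. The following is a minimal sketch under the assumption that MetaBAT2 and MaxBin output folders are available at the placeholder paths {metabat2_bins} and {maxbin2_bins}; flag names follow recent DAS Tool releases, so check DAS_Tool --help for your version.

# Convert each binner's output into the contig-to-bin tables DAS Tool
# expects (helper script shipped with recent DAS Tool releases).
Fasta_to_Contig2Bin.sh -i {metabat2_bins} -e fa > metabat2_contigs2bin.tsv
Fasta_to_Contig2Bin.sh -i {maxbin2_bins} -e fasta > maxbin2_contigs2bin.tsv

# Compute consensus bins from the individual binning results.
DAS_Tool \
    -i metabat2_contigs2bin.tsv,maxbin2_contigs2bin.tsv \
    -l metabat2,maxbin2 \
    -c {input.assemblypath} \
    -o {output.basepath} \
    -t {threads} \
    --write_bins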

The following code can be used to run ensemble binning using metaWRAP.

metawrap binning -o {params.outdir} \
    -t {threads} \
    -m {params.memory} \
    -a {params.assembly} \
    -l 1500 \
    --metabat2 \
    --maxbin2 \
    --concoct \
    {input.r1} {input.r2}

The following code can be used to refine bins using metaWRAP.

metawrap bin_refinement \
    -m {params.memory} \
    -t {threads} \
    -o {params.outdir} \
    -A {params.concoct} \
    -B {params.maxbin2} \
    -C {params.metabat2} \
    -c 70 \
    -x 10

Bin quality assessment

Metagenomic binning is a powerful yet complex procedure that often yields bins that do not properly represent bacterial genomes. It is therefore essential to assess the quality of bins before considering them representative of bacterial genomes. The two main parameters used for bin assessment are completeness and contamination. Completeness refers to the fraction of a given bacterial genome estimated to be represented in the bin, while contamination refers to the proportion of the bin estimated to belong to a different genome. The most commonly employed software to assess bin quality is CheckM, which derives completeness and contamination metrics from single-copy core genes.
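
For instance, a minimal CheckM sketch over a directory of bins might look as follows; {bins_dir} and {checkm_dir} are illustrative placeholders, and -x sets the FASTA extension of the bin files.

# Run the lineage-specific workflow and write a tab-separated quality table.
checkm lineage_wf \
    -t {threads} \
    -x fa \
    --tab_table \
    -f {checkm_dir}/quality.tsv \
    {bins_dir} \
    {checkm_dir}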

Based on completeness and contamination metrics, a group of experts proposed community standards (MIMAG) to classify bins according to their quality and to establish the minimum requirements for considering a bin a MAG [48].
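
As an illustration, the following sketch assigns bins to the MIMAG completeness/contamination tiers from the tab-separated CheckM table generated above. Note that the high-quality tier additionally requires rRNA and tRNA genes, which this simple filter does not check.

# Columns 12 (completeness) and 13 (contamination) assume CheckM's
# default lineage_wf table layout; verify against the header line.
awk -F '\t' 'NR > 1 {
    if ($12 > 90 && $13 < 5)        q = "high"
    else if ($12 >= 50 && $13 < 10) q = "medium"
    else                            q = "low"
    print $1, q
}' {checkm_dir}/quality.tsv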

Bin curation

Contamination is an issue that can in certain cases be minimised by curating bins. The Anvi’o suite [49] provides a powerful visual interface for manually curating bins by dropping contigs whose features (e.g., taxonomic annotation, coverage, GC%) differ from the rest of the contigs in a bin. GUNC provides a way to implement a similar curation step in a more automated manner [50].
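
A minimal GUNC sketch, assuming its reference database has already been downloaded (e.g., with gunc download_db) to the placeholder path {gunc_db}; check gunc run --help for the exact flag names of your version.

# Screen all bins in a directory for chimerism and contamination.
gunc run \
    --input_dir {bins_dir} \
    --db_file {gunc_db} \
    --out_dir {gunc_dir} \
    --threads {threads}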

Dereplication

Dereplication is the reduction of a set of MAGs based on high sequence similarity between them [51]. Although this step is neither essential nor meaningful in certain cases (e.g., when studying strain-level variation or pangenomes), in most cases it helps overcome issues such as excessive computational demands, inflated diversity estimates or nonspecific read mapping. If the catalogue of MAGs to which sequencing reads are mapped (see the read mapping section below) contains many similar genomes, read mapping yields multiple high-quality alignments. Depending on the software used and the parameters chosen, this leads to sequencing reads either being randomly distributed across the redundant genomes or being reported at all redundant locations. This can bias quantitative estimations of the relative representation of each MAG in a given metagenomic sample.

Dereplication is based on pairwise comparisons of average nucleotide identity (ANI) between MAGs. This implies that the number of comparisons scales quadratically with the number of MAGs, which calls for efficient strategies to perform dereplication in a cost-efficient way. A popular tool for dereplicating MAGs is dRep [52], which combines the fast yet inaccurate algorithm Mash with the slow but accurate gANI computation to yield fast and accurate estimations of ANI between MAGs. A threshold of 98% ANI has been found to strike a good balance between retaining genome diversity and minimising cross-mapping issues.
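
A minimal dRep sketch using that 98% threshold; -sa sets the secondary clustering ANI, while -comp and -con mirror the completeness and contamination cut-offs used in the refinement step above. Paths are placeholders.

# Dereplicate all MAGs at 98% ANI, keeping the best genome per cluster.
dRep dereplicate {drep_dir} \
    -p {threads} \
    -g {MAG.path}/*.fa \
    -sa 0.98 \
    -comp 70 \
    -con 10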

Taxonomic annotation

Although not necessary for conducting most of the downstream analyses, taxonomic annotation of MAGs is an important step to provide context, improve comparability and facilitate result interpretation in holo-omic studies. MAGs can be taxonomically annotated using different algorithms and reference databases, but the Genome Taxonomy Database (GTDB) [53] and associated taxonomic classification toolkit (GTDB-Tk) [54] have become the preferred option for many researchers.
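
A minimal GTDB-Tk sketch (v2 syntax): recent versions require either --skip_ani_screen or a Mash database supplied via --mash_db, and the GTDB-Tk reference data must be available through the GTDBTK_DATA_PATH environment variable.

# Classify all MAGs against the GTDB taxonomy.
gtdbtk classify_wf \
    --genome_dir {MAG.path} \
    --extension fa \
    --out_dir {gtdbtk_dir} \
    --cpus {threads} \
    --skip_ani_screen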

Functional annotation

Functional annotation refers to the process of identifying the putative functions of the genes present in MAGs based on information available in reference databases. As explained in the assembly-based approach section, the first step is to predict genes in the MAGs (unless these are already available from the assembly), followed by functional annotation, in which the protein sequences predicted from the genes are matched against reference databases. Multiple tools currently perform all of these procedures in a single pipeline, such as DFAST [44] and DRAM [45]. DFAST annotates genes against the TIGRFAM and Clusters of Orthologous Groups (COG) databases, while DRAM performs the annotation using the Pfam, KEGG, UniProt, CAZy and MEROPS databases. The following code can be used to annotate a MAG using DRAM.

DRAM.py annotate \
      -i {input.MAG} \
      -o {outdir} \
      --threads {threads} \
      --min_contig_size 1500

These functional annotations can be used for functional gene enrichment analyses, for distilling genome-inferred functional traits, and for many other downstream operations explained in the statistics part.

Read mapping

When the objective of a genome-resolved metagenomic analysis is to reconstruct and analyse a microbiome, researchers usually require relative abundance information to measure how abundant or rare each bacterium was in the analysed sample. To achieve this, it is necessary to map the reads of each sample back to the MAG catalogue and retrieve mapping statistics. The procedure is identical to that explained in the assembly read-mapping section, but using the MAG catalogue rather than the metagenomic assembly as the reference database. It usually happens in two steps. First, reads are mapped to the MAG catalogue to generate BAM or CRAM mapping files. Second, these mapping files are used to extract quantitative read-abundance information in the form of a table displaying the number of reads mapped to each MAG in each sample.

First, all MAGs need to be concatenated into a single file, which will become the reference MAG catalogue or database.

cat {MAG.path}/*.fa.gz > {all_MAGs}.fa.gz

The MAG catalogue needs to be indexed before the mapping.

bowtie2-build \
      --large-index \
      --threads {threads} \
      {all_MAGs}.fa.gz \
      {all_MAGs}

Then, the following step needs to be iterated for each sample, yielding a BAM mapping file for each sample.

bowtie2 \
      --time \
      --threads {threads} \
      -x {all_MAGs} \
      -1 {input.r1} \
      -2 {input.r2} \
      | samtools sort -@ {threads} -o {output}

Finally, CoverM can be used to extract the required stats, such as covered fraction per MAG per sample.

coverm genome \
      -b {input} \
      -s ^ \
      -m count covered_fraction length \
      -t {threads} \
      --min-covered-fraction 0 \
      > {output.count_table}

Or relative abundance per MAG per sample.

coverm genome \
      -b {params.BAMs}/*.bam \
      -s ^ \
      -m relative_abundance \
      -t {threads} \
      --min-covered-fraction 0 \
      > {output.mapping_rate}

Contents of this section were created by Antton Alberdi.

References

44. Tanizawa Y, Fujisawa T, Nakamura Y. DFAST: A flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics. 2017;34:1037–9.
45. Shaffer M, Borton MA, McGivern BB, Zayed AA, La Rosa SL, Solden LM, et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 2020;48:8883–900.
46. Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6:1–13.
47. Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat Microbiol. 2018;3:836–43.
48. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;35:725–31.
49. Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: An advanced analysis and visualization platform for ’omics data. PeerJ. 2015;3:e1319.
50. Orakov A, Fullam A, Coelho LP, Khedkar S, Szklarczyk D, Mende DR, et al. GUNC: Detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 2021;22:178.
51. Evans JT, Denef VJ. To dereplicate or not to dereplicate? mSphere. 2020;5.
52. Olm MR, Brown CT, Brooks B, Banfield JF. dRep: A tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017;11:2864–8.
53. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, Hugenholtz P. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022;50:D785–94.
54. Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk v2: Memory friendly classification with the genome taxonomy database. Bioinformatics. 2022;38:5315–6.