13.2 Assembly-based
Assembly-based approaches for processing metagenomic data are based on assembling sequencing reads into longer DNA sequences known as contigs, which can then be used to predict genes and perform functional analyses. The main limitation of this approach is that the entire metagenome (set of contigs) in each sample is considered as a single unit, thus overlooking which bacterial genome each detected gene belongs to. Assembly-based approaches can be divided in two main strategies:
Individual assembly-based
Two of the most popular metagenome assemblers are Megahit and MetaSpades. Metaspades is considered superior in terms of assembly quality, yet memory requirements are much larger than those of Megahit. Thus, one of the most relevant criteria to choose the assembler to be employed is the balance between amount of data and available memory. Another minor, yet relevant difference between both assemblers is that Megahit allows removing contings below a certain size, while MetaSpades needs to be piped with another software (e.g. bbmap) to get rid off barely informative yet often abundant short contigs.
Assembly statistics using Quast
The metagenome assemblies can have very different properties depending on the amount of data used for the assembly, the complexity of the microbial community, and other biological and technical aspects. It is therefore convenient to obtain some general statistics of the assemblies to decide whether they look meaningful to continue with downstream analyses. This can be easily done using the software Quast.
Coassembly-based
Coassembly is the process of assembling input files consisting of reads from multiple samples, as opposed to performing an independent assembly for each sample, where the input would only include reads from that particular sample. Coassembly has several advantages, such as increased read depth, simplified comparison across samples by utilizing a single reference assembly for all, and frequently, a better capability to recover genomes from metagenomes by obtaining differential coverage information. However, it can also limit the capacity to recover strain-level variation.
Coassembling multiple samples does not require special assemblers, but only preparing the input files in the correct way to enable assemblers to perform the assembly over multiple samples. An example for metaspades is shown below:
#Concatenate input reads into a single big input file
cat {input.reads}/*_1.fq.gz > {params.r1_cat}
cat {input.reads}/*_2.fq.gz > {params.r2_cat}
# Run metaspades
metaspades.py \
-t {threads} \
-k 21,33,55,77,99 \
-1 {params.r1_cat} -2 {params.r2_cat} \
-o {params.workdir}
2> {log}
# Remove contigs shorter than 1,500 bp using bbmap
reformat.sh \
in={params.workdir}/scaffolds.fasta \
out={output.Coassembly} \
minlength=1500
Note that the genome-resolved metagenomic approach also relies on assemblies or co-assemblies, but downstream binning procedures are explained in the genome-resolved approach section.
Gene annotation
Gene annotation refers to the process of identifing and assigning function to genes present in an assembly. In the first step, protein-coding and other types of genes are identified using tools such as Prodigal based on structural information of the DNA sequences. These software also predict the protein sequences these genes are expected to yield, which are then used to assign functions by contrasting them with functionally annotated reference databases. Due to the amount of reference databases available, it is common practice to match the genes against multiple databases and yield multiple annotations per gene. Currently, multiple tools exist that perform all these procedures in a single pipeline, such as DFAST [44] and DRAM [45]. DFAST annotates genes against the TIGRFAM and Clusters of Orthologous Groups (COG) databases, while DRAM performs the annotation using Pfam, KEGG, UniProt, CAZY and MEROPS databases.
The procedure for annotating MAGs, which is explained in the genome-resolved approach section, is identical to this one, with the only difference that the a MAG or a set of MAGs are used as input data rather than a metagenomic assembly.
Read mapping
The aim of assembly-based analyses is often to obtain gene-abundance information per studied sample, to characterise the functional properties of the entire metagenome as a whole (as opposed to the genome-resolved approach approach, in which functional attributes are assigned to each MAG). This requires reads from each sample to be mapped against the sequence of all protein-coding genes identified during the annotation process. All gene prediction software and annotation pipelines produce a FASTA file only containing gene sequences, which is used as the reference material.
The gene catalogue needs to be indexed before the mapping.
Then, the following step needs to be iterated for each sample, yielding a BAM mapping file for each sample.
bowtie2 \
--time \
--threads {threads} \
-x {all_genes} \
-1 {input.r1} \
-2 {input.r2} \
| samtools sort -@ {threads} -o {output}
Or relative abundance per gene per sample.