12.1 Host reference genome

Eukaryotic genomes are generally complex: most organisms carry more than one copy of their genome, and each copy contains duplications, repetitive sequences, mobile elements and other features that are difficult to reconstruct. Consequently, generating a high-quality reference genome that represents all this complexity is a demanding effort that currently requires combining multiple complementary molecular techniques. Although multiple genome assembly protocols exist, in this guidebook we will focus on the one employed by the Vertebrate Genomes Project (VGP), the largest consortium aiming at generating animal reference genomes in a standardised way [38]. The VGP assembly pipeline uses data generated by a variety of technologies, including PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.

12.1.1 Genome quality

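Before advancing with genome generation procedures, it is important to acknowledge that reference genomes can have different qualities. Quality is commonly summarised through assembly statistics such as N50 (the length of the shortest sequence among the longest sequences that together span at least 50% of the assembly) and L90 (the smallest number of sequences needed to span at least 90% of the assembly). These metrics describe the contiguity of the assembly, while completeness and accuracy require additional evaluation. Depending on how far the sequences have been joined, eukaryotic genome assemblies are usually categorised into three levels: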

Contig level: the lowest level of genome assembly, in which the genome is represented as a set of contigs. Contigs are contiguous, gap-free sequences of DNA whose length ranges from a few hundred base pairs in fragmented short-read assemblies to many megabases in long-read assemblies. Contig-level assemblies carry no information about the order and orientation of the contigs, or about the size of the gaps between them.

Scaffold level: the next level of genome assembly, in which contigs are linked together using paired-end reads or other long-range genomic information to form larger structures called scaffolds. Scaffolds provide information about the order and orientation of contigs but may still contain gaps between them.

Chromosome level: the highest level of genome assembly, in which the genome is assembled into chromosomes. Chromosome-level assemblies provide the most complete and accurate representation of the genome, with few gaps and accurate order and orientation of genomic elements. These assemblies typically require multiple sources of genomic information and sophisticated computational tools to produce.
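The contiguity statistics mentioned in this section can be computed from nothing more than a list of sequence lengths. The following is a minimal sketch, not the implementation used by any particular tool; the function names `fasta_lengths` and `nx_lx` and the toy contig lengths are assumptions made for illustration.

```shell
# fasta_lengths [FILE...]: print the length of every sequence in a FASTA
# file (or standard input), one length per line.
fasta_lengths() {
  awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
       { n += length($0) }
       END { if (seen) print n }' "$@"
}

# nx_lx PERCENT: read sequence lengths (one per line) from standard input
# and print "N<PERCENT> L<PERCENT>", e.g. N50 and L50 for PERCENT=50.
nx_lx() {
  sort -rn | awk -v pct="$1" '
    { len[NR] = $1; total += $1 }
    END {
      target = total * pct / 100
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        # The first sequence that pushes the running sum past the target
        # gives the Nx value; its rank gives the Lx value.
        if (cum >= target) { print len[i], i; exit }
      }
    }'
}

# Toy assembly with five contigs (made-up lengths, in bp)
printf '%s\n' 100 80 60 40 20 | nx_lx 50   # prints "80 2"
printf '%s\n' 100 80 60 40 20 | nx_lx 90   # prints "40 4"
```

On a real assembly the two functions can be chained, for example `fasta_lengths assembly.fasta | nx_lx 50`, where `assembly.fasta` is a placeholder file name.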

12.1.2 Genome profile analysis

Gathering metrics on genome properties before initiating a de novo genome assembly project is very helpful in setting expectations for the assembly. In the past, DNA flow cytometry was commonly used to estimate genome size, but computational approaches have become the preferred method in recent times [39]. Currently, genome profiling is based on k-mer frequency analysis, which not only provides information on the genome’s complexity, such as its size and levels of heterozygosity and repeat content, but also on the quality of the data.

k-mer spectra can be generated with Meryl, which builds a k-mer profile by decomposing the sequencing reads into all substrings of length k and counting how many times each distinct k-mer occurs.

# Create a k-mer database (meryl count uses canonical k-mers by default)
meryl count k=31 threads=4 output reads.meryl \
     reads_1.fastq reads_2.fastq

# Generate a k-mer spectrum (histogram of k-mer multiplicities)
meryl histogram reads.meryl > reads.hist
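To make the idea of a k-mer spectrum concrete, the sketch below reproduces the two steps (counting k-mers, then tabulating how many distinct k-mers share each multiplicity) on a toy input. It is an illustration only: unlike Meryl it counts forward-strand k-mers rather than canonical ones, it keeps everything in memory, and the function name `kmer_spectrum` is an assumption.

```shell
# kmer_spectrum K: read sequences (one per line) from standard input,
# count every forward-strand k-mer, and print a two-column histogram:
# multiplicity <TAB> number of distinct k-mers with that multiplicity.
kmer_spectrum() {
  awk -v k="$1" '
    {
      seq = toupper($0)
      # Slide a window of length k along the sequence
      for (i = 1; i + k - 1 <= length(seq); i++) count[substr(seq, i, k)]++
    }
    END {
      for (kmer in count) hist[count[kmer]]++
      for (m in hist) print m "\t" hist[m]
    }' | sort -n
}

# Two toy reads: ACG and CGT each occur three times, GTA and TAC twice,
# so the spectrum has two rows, "2<TAB>2" and "3<TAB>2".
printf '%s\n' ACGTACGT CGTACG | kmer_spectrum 3
```

The function expects bare sequences. For a four-line FASTQ file the sequence lines can be extracted first, for example with `awk 'NR % 4 == 2' reads_1.fastq | kmer_spectrum 31`, although a dedicated counter such as Meryl is the sensible choice for real data.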

The k-mer histogram produced by Meryl can be used to deduce genome properties with the help of GenomeScope2. This tool utilises a nonlinear least-squares optimisation to fit a combination of negative binomial distributions, providing estimates for genome size, repetitiveness, and heterozygosity rates [40].

genomescope2 -k 31 -i reads.hist -o reads_genomescope
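As a sanity check on the GenomeScope2 output, a rough genome size can be derived directly from the histogram by dividing the total number of k-mers by the k-mer coverage at the main peak. The sketch below is a deliberate simplification and not the model that GenomeScope2 fits: it assumes a two-column histogram (multiplicity, number of distinct k-mers), discards multiplicities below a cut-off as sequencing errors, and assumes the tallest remaining peak is the homozygous one, which will not hold for highly heterozygous genomes. The function name and the toy histogram values are assumptions.

```shell
# estimate_genome_size [MIN_FREQ]: read a k-mer histogram from standard
# input and print total k-mers / peak multiplicity, ignoring rows whose
# multiplicity is below MIN_FREQ (default 5; likely sequencing errors).
estimate_genome_size() {
  awk -v min="${1:-5}" '
    $1 >= min {
      total += $1 * $2                       # k-mers at this multiplicity
      if ($2 > peak_count) { peak_count = $2; peak = $1 }
    }
    END { if (peak) printf "%.0f\n", total / peak }'
}

# Made-up histogram: an error tail at low multiplicity and a peak at 10x
printf '1\t5000\n2\t300\n9\t50\n10\t100\n11\t50\n20\t10\n' |
  estimate_genome_size 5    # prints "220"
```

With real data the call would look like `estimate_genome_size 5 < reads.hist`. A large discrepancy with the GenomeScope2 estimate is a hint that the cut-off or the peak assumption does not suit the data set.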

12.1.3 Genome assembly using hifiasm

Hifiasm is a powerful de novo assembler specifically developed for PacBio HiFi reads. One of the key advantages of hifiasm is that it allows us to resolve near-identical, but not exactly identical, sequences, such as repeats and segmental duplications [41]. Hifiasm can be run in multiple modes depending on data availability:

Solo mode

The solo mode generates a pseudohaplotype assembly, resulting in a primary and an alternate assembly solely using HiFi reads.

Hi-C-phased mode

The Hi-C-phased mode generates a hap1 assembly and a hap2 assembly, which are phased using the Hi-C reads from the same individual.

Trio mode

The trio mode requires PacBio HiFi reads from the child and Illumina short reads from both parents. It generates a maternal assembly and a paternal assembly, which are phased using the reads from the parents.
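The three modes differ mainly in the extra inputs passed to hifiasm. The wrappers below sketch one possible invocation per mode, following the usage shown in the hifiasm documentation. All function names, file names, output prefixes and the thread count are placeholders, and the trio helper assumes that each parent's short reads are available as a single file.

```shell
THREADS=16   # placeholder; adjust to the available CPU cores

# Solo mode: HiFi reads only
assemble_solo() {          # usage: assemble_solo PREFIX HIFI_READS
  hifiasm -o "$1" -t "$THREADS" "$2"
}

# Hi-C-phased mode: HiFi reads plus paired Hi-C reads of the same individual
assemble_hic() {           # usage: assemble_hic PREFIX HIFI_READS HIC_R1 HIC_R2
  hifiasm -o "$1" -t "$THREADS" --h1 "$3" --h2 "$4" "$2"
}

# Trio mode, step 1: build a yak k-mer database per parent
count_parent_kmers() {     # usage: count_parent_kmers OUT.yak PARENT_READS
  yak count -k31 -b37 -t"$THREADS" -o "$1" "$2"
}

# Trio mode, step 2: assemble the child's HiFi reads with both databases
assemble_trio() {          # usage: assemble_trio PREFIX HIFI_READS PAT.yak MAT.yak
  hifiasm -o "$1" -t "$THREADS" -1 "$3" -2 "$4" "$2"
}

# hifiasm writes assembly graphs in GFA format; print the segment (S)
# records of a GFA file as FASTA.
gfa_to_fasta() {           # usage: gfa_to_fasta ASSEMBLY.gfa > ASSEMBLY.fasta
  awk '/^S/ { print ">" $2; print $3 }' "$1"
}
```

A Hi-C-phased run could then be started with `assemble_hic sample reads.hifi.fastq.gz hic_R1.fastq.gz hic_R2.fastq.gz`, where all file names are hypothetical.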

12.1.4 Assembly evaluation

Assemblies can be evaluated using a variety of approaches that assess different parameters of the assembled genomes.

gfastats can be used for summary statistics (e.g., contig count, N50 and NG50).

BUSCO assesses genome completeness from an evolutionary perspective, based on expected gene content. BUSCO genes are expected to be present as a single copy per haplotype across a particular clade, so their presence, absence, or duplication can help researchers determine whether an assembly is missing significant regions or contains redundant copies of them, which may necessitate purging [42].
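A BUSCO run in genome mode, together with a quick look at the duplicated fraction, could be sketched as follows. The function names, the thread count and the example summary line are assumptions; the summary is assumed to follow the one-line `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation that BUSCO prints in its short summary, and the values shown are made up.

```shell
# run_busco ASSEMBLY LINEAGE OUT_NAME: assess an assembly in genome mode
run_busco() {
  busco -i "$1" -l "$2" -m genome -o "$3" -c 8
}

# busco_duplicated: read a BUSCO one-line summary from standard input and
# print the percentage of duplicated BUSCOs (the D: value).
busco_duplicated() {
  sed -n 's/.*D:\([0-9.]*\)%.*/\1/p'
}

# Made-up summary line, for illustration only
echo 'C:98.6%[S:97.9%,D:0.7%],F:0.5%,M:0.9%,n:3354' | busco_duplicated   # prints "0.7"
```

A run on a vertebrate assembly might look like `run_busco assembly.fasta vertebrata_odb10 assembly_busco`. A high duplicated percentage in a primary or haplotype assembly suggests retained haplotigs that may need purging.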

Merqury performs a reference-free assessment of assembly completeness and phasing based on k-mers. Merqury compares k-mers in the reads to the k-mers found in the assemblies, as well as the copy number (CN) of each k-mer in the assemblies [43].
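Merqury reuses the Meryl database built from the reads, and one of its outputs is a consensus quality value (QV). As described in the Merqury publication [43], the QV is derived from the number of k-mers found only in the assembly (and not in the reads) relative to all k-mers in the assembly. The sketch below wraps the command and reproduces that calculation by hand; the function names and the example counts are assumptions.

```shell
# run_merqury READ_DB OUT_PREFIX ASSEMBLY [ASSEMBLY2]
run_merqury() {
  db=$1; out=$2; shift 2
  merqury.sh "$db" "$@" "$out"
}

# merqury_qv K ASM_ONLY_KMERS TOTAL_ASM_KMERS
# Per-base error rate E = 1 - (1 - asm_only/total)^(1/k); QV = -10*log10(E)
merqury_qv() {
  awk -v k="$1" -v only="$2" -v total="$3" 'BEGIN {
    if (only == 0) { print "inf"; exit }     # no erroneous k-mers observed
    p = (1 - only / total) ^ (1 / k)         # probability that a base is correct
    e = 1 - p
    printf "%.2f\n", -10 * log(e) / log(10)  # awk log() is the natural logarithm
  }'
}

# Made-up counts: 31 assembly-only 31-mers out of one million assembly k-mers
merqury_qv 31 31 1000000   # prints "60.00"
```

A QV of 60 corresponds to roughly one error per million bases. For a phased assembly both haplotypes can be evaluated together, for example `run_merqury reads.meryl sample_merqury hap1.fasta hap2.fasta`, with hypothetical file names.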

12.1.5 Assembly scaffolding

The next step in the process is to join contigs into scaffolds, i.e., to order, orient and connect contigs separated by gaps. While this has traditionally been done using paired-end short-read data with long insert sizes, the VGP pipeline currently scaffolds using two more advanced technologies: Bionano optical maps and Hi-C data.

Scaffolding using Bionano optical maps

Content to be added.

Scaffolding using Hi-C data

Content to be added.

12.1.6 Final genome evaluation

Content to be added.

Contents of this section were created by Antton Alberdi.

12.1.7 Reference genome annotation

Content to be added.

References

38. Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021;592:737–46.

39. Wang H, Liu B, Zhang Y, Jiang F, Ren Y, Yin L, et al. Estimation of genome size using k-mer frequencies from corrected long reads. 2020.

40. Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020;11:1432.

41. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18:170–5.

42. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.

43. Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21:245.