Genomes of eukaryotic organisms are generally complex, because they carry multiple copies of the same genome, genomes contain duplications, repetitive sequences, mobile elements, etc. In consequence, generating a high-quality reference genome that represents all this complexity is a complex effort, that today, requires multiple complementary molecular techniques to be merged. Although multiple genome assembly protocols exist, in this guidebook we will focus on the one employed in the Vertebrates Genomes Project, the largest consortium aiming at generating animal reference genomes in a standardised way . The VGP assembly pipeline uses data generated by a variety of technologies, including PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.
Before advancing with genome generation procedures, it is important to acknowledge that reference genomes can have different qualities. Quality is measured by assembly statistics, such as the N50 and L90 metrics, which provide an overview of the completeness and accuracy of the genome. Based on those metrics, eukaryotic genomes are usually categorised in three levels:
Contig level: Contig level refers to the lowest level of genome assembly, where the genome is fragmented into small pieces called contigs. Contigs are contiguous sequences of DNA that are typically hundreds to thousands of base pairs in length. Contig-level genome assemblies lack information about the order and orientation of the contigs and may contain gaps between them. Scaffold level: Scaffold level is the next level of genome assembly, where contigs are linked together using paired-end reads or other genomic information to form larger structures called scaffolds. Scaffolds provide information about the order and orientation of contigs but may still contain gaps between them. Chromosome level: Chromosome level is the highest level of genome assembly, where the genome is fully assembled into chromosomes. Chromosome-level assemblies provide the most complete and accurate representation of the genome, with few gaps and accurate order and orientation of genomic elements. These assemblies typically require multiple sources of genomic information and sophisticated computational tools to produce.
Gathering metrics on genome properties before initiating a de novo genome assembly project is very helpful in setting expectations for the assembly. In the past, DNA flow cytometry was commonly used to estimate genome size, but computational approaches have become the preferred method in recent times . Currently, genome profiling is based on k-mer frequency analysis, which not only provides information on the genome’s complexity, such as its size and levels of heterozygosity and repeat content, but also on the quality of the data.
k-mer spectra can be generated with Meryl, which generates k-mer profile by decomposing the sequencing data into k-length substrings, counting the occurrence of each k-mer and determining its frequency.
The k-mer histogram produced by Meryl can be used to deduce genome properties with the help of GenomeScope2. This tool utilises a nonlinear least-squares optimisation to fit a combination of negative binomial distributions, providing estimates for genome size, repetitiveness, and heterozygosity rates .
Hifiasm is a powerful de novo assembler specifically developed for PacBio HiFi reads. One of the key advantages of hifiasm is that it allows us to resolve near-identical, but not exactly identical, sequences, such as repeats and segmental duplications . Hifiasm can be run in multiple modes depending on data availability:
The solo mode generates a pseudohaplotype assembly, resulting in a primary and an alternate assembly solely using HiFi reads.
The Hi-C-phased mode generates a hap1 assembly and a hap2 assembly, which are phased using the Hi-C reads from the same individual.
Assemblies can be evaluated using a variety of approaches that assess different parameters of the assembled genomes.
gfastats can be used or summary statistics (e.g., contig count, N50, NG50, etc.)
BUSCO assesses genome completeness based on an evolutionary functional perspective. BUSCO genes are anticipated to exist in a single-copy haplotype for a particular clade, and their presence, absence, or duplication can help researchers determine whether an assembly is deficient in significant regions or has multiple copies, which may necessitate purging .
Merqury performs a reference-free assessment of assembly completeness and phasing based on k-mers. Merqury compares k-mers in the reads to the k-mers found in the assemblies, as well as the copy number (CN) of each k-mer in the assemblies [43,].
The following step in the process is to assemble contigs into scaffolds, i.e., to connect contigs interspaced with gaps. While traditionally, this process has been performed using paired-end short-read data with long insert-sizes, the VGP pipeline currently scaffolds using two more advanced technologies: Bionano optical maps and Hi-C data.