12.2 Host genome resequencing

Once a reference genome is available, short-read sequencing data can be used to generate single nucleotide polymorphism (SNP) data. Although multiple options exist, the pipeline below describes a typical workflow that uses Bowtie2 for read mapping, Picard for marking duplicates, and GATK for variant calling. The resulting SNP data can be used for a wide range of downstream analyses, such as identifying genetic variants associated with diseases, studying population genetics, and performing genome-wide association studies (GWAS). The pipeline is customisable: tool parameters can be changed and additional analysis steps incorporated to suit the specific needs of each study.

The first step is to map the reads against the reference genome:

bowtie2 -x reference_genome_index \
    -1 forward_reads.fq \
    -2 reverse_reads.fq \
    -S mapped_reads.sam
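
Bowtie2 needs an index built from the reference before mapping, and GATK tools expect read-group information in the alignments. The following is a minimal sketch of both, assuming the reference FASTA is named reference_genome.fa (the name used later in this section). The function names, the sample identifier, the thread count and the PL:ILLUMINA platform tag are illustrative assumptions, not part of the original workflow.

```shell
# Build the Bowtie2 index once per reference genome.
build_index() {
    bowtie2-build reference_genome.fa reference_genome_index
}

# Map one paired-end sample and write read-group tags to the SAM header.
# $1 = sample name (hypothetical), $2 = forward reads, $3 = reverse reads
map_sample() {
    local sample="$1" fwd="$2" rev="$3"
    # --rg-id must be set for the @RG header line to be written;
    # PL:ILLUMINA assumes Illumina data.
    bowtie2 -x reference_genome_index \
        -p 4 \
        --rg-id "$sample" \
        --rg "SM:$sample" \
        --rg "PL:ILLUMINA" \
        -1 "$fwd" \
        -2 "$rev" \
        -S "${sample}.sam"
}

# Usage (not run here):
# build_index
# map_sample sample1 forward_reads.fq reverse_reads.fq
```

Setting the read group at mapping time avoids having to add it to the BAM file in a separate step later.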

If the mapping output is saved as an uncompressed SAM file, it should be converted to the compressed BAM format and sorted for downstream analyses.

samtools view -bS mapped_reads.sam > mapped_reads.bam
samtools sort mapped_reads.bam -o sorted_mapped_reads.bam
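
The intermediate SAM and unsorted BAM files can be avoided by piping the Bowtie2 output straight into samtools sort. The sketch below assumes the same file names as the commands above; the function name and the thread count are illustrative.

```shell
# Map and coordinate-sort in one step, without writing a SAM file to disk.
# Bowtie2 writes SAM to standard output when -S is omitted, and the
# trailing "-" tells samtools sort to read from standard input.
map_and_sort() {
    bowtie2 -x reference_genome_index \
        -1 forward_reads.fq \
        -2 reverse_reads.fq |
    samtools sort -@ 4 -o sorted_mapped_reads.bam -
}

# Usage (not run here):
# map_and_sort
```

This produces the same sorted_mapped_reads.bam file while saving disk space, which matters for large resequencing data sets.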

Picard can then be used to mark duplicates in the sorted BAM file.

java -jar picard.jar MarkDuplicates \
      INPUT=sorted_mapped_reads.bam \
      OUTPUT=dedup_sorted_mapped_reads.bam \
      METRICS_FILE=metrics.txt VALIDATION_STRINGENCY=LENIENT

The BAM file with marked duplicates must be indexed before starting the variant calling. Note that, by default, MarkDuplicates flags duplicate reads rather than removing them from the file.

samtools index dedup_sorted_mapped_reads.bam
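
Before moving on, it can be worth confirming that the BAM file is intact and checking how many reads were flagged as duplicates. The following is a minimal sketch using two standard samtools subcommands; the function name check_bam is hypothetical.

```shell
# Validate a BAM file and print summary statistics.
# $1 = path to the BAM file
check_bam() {
    local bam="$1"
    # quickcheck exits with a non-zero status if the file looks
    # truncated or lacks a valid header or end-of-file block.
    samtools quickcheck "$bam" || {
        echo "BAM check failed: $bam" >&2
        return 1
    }
    # flagstat reports read counts per flag category, including duplicates.
    samtools flagstat "$bam"
}

# Usage (not run here):
# check_bam dedup_sorted_mapped_reads.bam
```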

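Local realignment around indels is then performed with IndelRealigner. This tool belongs to GATK3 and was removed in GATK4, so it is run here from the GATK3 jar (assumed to be named GenomeAnalysisTK.jar). The step can be skipped if variants are later called with HaplotypeCaller, which reassembles reads locally around candidate variants; in that case the BAM file with marked duplicates is passed directly to the next step.

java -Xmx4g -jar GenomeAnalysisTK.jar \
      -T IndelRealigner \
      -R reference_genome.fa \
      -I dedup_sorted_mapped_reads.bam \
      -targetIntervals intervals.list \
      -o realigned_reads.bam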
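
IndelRealigner reads its target regions from intervals.list, which none of the commands above create. In GATK3, where both tools live (neither is part of GATK4), that file is produced by RealignerTargetCreator. GATK also expects the reference FASTA to be accompanied by a .fai index and a .dict sequence dictionary. The sketch below covers both preparations; the function names and the jar path GenomeAnalysisTK.jar are assumptions.

```shell
# Create the FASTA index and sequence dictionary that GATK requires.
prepare_reference() {
    samtools faidx reference_genome.fa
    java -jar picard.jar CreateSequenceDictionary \
        R=reference_genome.fa \
        O=reference_genome.dict
}

# Identify regions that need realignment (GATK3 syntax).
# GenomeAnalysisTK.jar is an assumed path to the GATK3 jar.
create_realigner_targets() {
    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T RealignerTargetCreator \
        -R reference_genome.fa \
        -I dedup_sorted_mapped_reads.bam \
        -o intervals.list
}

# Usage (not run here):
# prepare_reference
# create_realigner_targets
```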

Then, a base quality score recalibration table is built with GATK4 from a set of known variant sites.

gatk --java-options "-Xmx4g" BaseRecalibrator \
      -R reference_genome.fa \
      -I realigned_reads.bam \
      --known-sites known_snps.vcf \
      -O recal_data.table

Subsequently, the base quality score recalibration is applied to the realigned BAM file.

gatk --java-options "-Xmx4g" ApplyBQSR \
    -R reference_genome.fa \
    -I realigned_reads.bam \
    --bqsr-recal-file recal_data.table \
    -O recal_reads.bam