
23. Variant calling and annotation

1. Introduction

Variant calling and annotation are core processes in modern genomics and bioinformatics. They allow researchers and clinicians to identify and interpret genetic variation within an individual’s genome, providing insights into disease susceptibility, drug response, and evolutionary patterns. For a bioinformatics student, understanding these processes is essential preparation for work in genomics research, personalized medicine, and related fields.

This guide covers the technical aspects of variant calling and annotation: the underlying principles, methodologies, and tools used in the field. It also walks through several use cases to illustrate how these techniques are applied in practice.

2. Fundamentals of Genomic Variation

Before diving into the specifics of variant calling and annotation, it’s crucial to understand the types of genomic variations we’re dealing with:

  1. Single Nucleotide Polymorphisms (SNPs): These are single-base changes in the DNA sequence and are the most common type of genetic variation. The term single-nucleotide variant (SNV) is often used instead when no population frequency is implied.

  2. Insertions and Deletions (Indels): These involve the addition or removal of one or more nucleotides from the DNA sequence.

  3. Copy Number Variations (CNVs): These are segments of the genome, traditionally defined as 1 kb or larger, that are duplicated or deleted, changing the number of copies present.

  4. Structural Variants (SVs): These are large-scale rearrangements, conventionally 50 bp or longer, such as inversions, translocations, and large deletions or duplications. CNVs are often treated as a subclass of SVs.

Understanding these variations is crucial because different types of variants require different approaches for accurate detection and annotation.
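As a rough illustration of how small variants can be told apart, the sketch below classifies a variant from its reference and alternate alleles by comparing their lengths. The function name `classify_variant` and the returned labels are illustrative choices, not part of any standard tool, and symbolic alleles such as `<DEL>` are only flagged, not interpreted.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a REF/ALT allele pair by comparing allele lengths."""
    if alt.startswith("<"):
        # Symbolic alleles (e.g. <DEL>, <DUP>) describe larger CNVs/SVs.
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"  # single-base substitution (a SNP when common in a population)
    if len(ref) == len(alt):
        return "MNV"  # multi-nucleotide substitution of equal length
    if len(ref) < len(alt):
        return "insertion"
    return "deletion"


for ref, alt in [("A", "G"), ("A", "ATC"), ("ATC", "A"), ("AT", "GC"), ("N", "<DEL>")]:
    print(ref, alt, classify_variant(ref, alt))
```

Length comparison is enough for simple alleles, but detecting CNVs and SVs generally needs read-depth, split-read, or paired-end evidence rather than allele strings alone.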

3. The Variant Calling Pipeline

Variant calling is a multi-step process that involves analyzing sequencing data to identify genetic variations. Let’s break down the typical variant calling pipeline:

3.1. Quality Control and Preprocessing

Before variant calling can begin, raw sequencing data must be preprocessed:

  • Quality Assessment: Tools like FastQC are used to assess the quality of raw sequencing reads.
  • Adapter Trimming: Adapters used in sequencing are removed using tools like Trimmomatic or Cutadapt.
  • Quality Trimming: Low-quality bases at the ends of reads are removed to improve overall data quality.
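To make the quality-trimming step concrete, here is a minimal sketch that decodes a FASTQ quality string (assuming the common Phred+33 encoding) and removes trailing bases below a threshold. The function names and the default cutoff of Q20 are assumptions for illustration; tools like Trimmomatic and Cutadapt use more refined algorithms.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> tuple:
    """Drop trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


# 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0 under Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq, qual)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means roughly a 1% chance that the base call is wrong.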

3.2. Alignment to Reference Genome

Preprocessed reads are aligned to a reference genome:

  • Alignment Algorithms: Tools like BWA (Burrows-Wheeler Aligner) or Bowtie2 are commonly used for this step.
  • SAM/BAM Files: Alignments are typically stored in SAM (Sequence Alignment/Map) or its binary version, BAM format.
  • Duplicate Marking: PCR duplicates are marked or removed to prevent bias in variant calling.
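Duplicate marking is recorded in the FLAG field of each SAM/BAM record, which packs several yes/no properties into one integer. The sketch below decodes that bit field using the flag values defined in the SAM specification; the helper names and the particular exclusion mask in `usable_for_calling` are illustrative assumptions, since each variant caller applies its own read filters.

```python
# Bit values as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def usable_for_calling(flag: int) -> bool:
    """Illustrative filter: mapped, primary, QC-passing, non-duplicate reads."""
    excluded = 0x4 | 0x100 | 0x200 | 0x400 | 0x800
    return (flag & excluded) == 0


for flag in (99, 1123):
    print(flag, decode_flag(flag), usable_for_calling(flag))
```

Here 1123 is 99 plus the duplicate bit (0x400), so the same read pair is reported as unusable once it has been marked as a duplicate.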

3.3. Variant Calling

This is the core step where genetic variations are identified:

  • Probabilistic Models: Most variant callers use probabilistic models to determine the likelihood of a variant at each position.
  • Bayesian Approach: Tools like GATK’s HaplotypeCaller and FreeBayes use a Bayesian approach to calculate the probability of each possible genotype given the observed reads.
  • Local De Novo Assembly: Some tools, including GATK’s HaplotypeCaller and Platypus, reassemble reads locally around potential variant sites, which mainly improves accuracy for indels.
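The Bayesian idea can be sketched with a deliberately simplified model: a diploid, biallelic site where each read base is an independent observation with an error rate given by its Phred quality. This is not the model used by any specific caller; the function name and the prior values are placeholders chosen for illustration.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, priors=(0.998, 0.001, 0.001)):
    """Posterior probability of 0/0, 0/1 and 1/1 at one biallelic site.

    bases/quals: pileup bases and their Phred qualities.
    priors: placeholder prior probabilities for 0, 1 and 2 ALT copies.
    """
    log_post = []
    for alt_copies, prior in zip((0, 1, 2), priors):
        frac_alt = alt_copies / 2  # chance a read was drawn from the ALT allele
        log_p = math.log(prior)
        for base, q in zip(bases, quals):
            # Cap the error rate at 0.75, where a base call is uninformative.
            err = min(10 ** (-q / 10), 0.75)
            p_if_ref = 1 - err if base == ref else err / 3
            p_if_alt = 1 - err if base == alt else err / 3
            log_p += math.log((1 - frac_alt) * p_if_ref + frac_alt * p_if_alt)
        log_post.append(log_p)
    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post)
    weights = [math.exp(x - top) for x in log_post]
    total = sum(weights)
    return {gt: w / total for gt, w in zip(("0/0", "0/1", "1/1"), weights)}


post = genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G")
print(max(post, key=post.get), post)
```

With four reference and four alternate bases at Q30, the heterozygous genotype dominates even though its prior is low, because the homozygous genotypes would each need four sequencing errors to explain the data.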

3.4. Variant Filtering

Raw variant calls often contain false positives and need to be filtered:

  • Hard Filtering: Simple threshold-based filters are applied based on metrics like read depth, mapping quality, and strand bias.
  • VQSR: GATK’s Variant Quality Score Recalibration fits a Gaussian mixture model to the annotation profiles of known, trusted variants, then scores each call by how closely it matches that profile.
  • Ensemble Approach: Some pipelines use multiple variant callers and combine results to improve accuracy.
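Hard filtering is simple enough to sketch directly. The example below flags a SNP call whose annotations cross fixed thresholds; the cutoffs mirror commonly cited GATK starting points for SNPs (QD < 2.0, FS > 60.0, MQ < 40.0, SOR > 3.0), but they are assumptions here and should be tuned to the dataset. The function name and rule layout are illustrative.

```python
# Each rule: (filter name, INFO key, comparison, threshold).
SNP_HARD_FILTERS = [
    ("QD2", "QD", "lt", 2.0),     # quality normalised by depth
    ("FS60", "FS", "gt", 60.0),   # Fisher strand bias
    ("MQ40", "MQ", "lt", 40.0),   # RMS mapping quality
    ("SOR3", "SOR", "gt", 3.0),   # strand odds ratio
]


def hard_filter(info: dict) -> str:
    """Return 'PASS' or a semicolon-joined list of failed filter names."""
    failed = []
    for name, key, op, threshold in SNP_HARD_FILTERS:
        value = info.get(key)
        if value is None:
            continue  # missing annotation: the rule cannot be evaluated
        if (op == "lt" and value < threshold) or (op == "gt" and value > threshold):
            failed.append(name)
    return ";".join(failed) if failed else "PASS"


print(hard_filter({"QD": 12.5, "FS": 1.2, "MQ": 60.0, "SOR": 0.8}))
print(hard_filter({"QD": 1.5, "FS": 75.0, "MQ": 60.0}))
```

Recording which filters failed, instead of discarding the call, keeps the decision auditable and matches how the VCF FILTER column is meant to be used.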

4. Variant Annotation

Once variants are called, they need to be annotated to provide biological context. Annotation can be broadly categorized into three types:

4.1. Functional Annotation

This involves predicting the impact of variants on gene function:

  • Genomic Context: Identifying whether a variant is in a coding region, intron, UTR, or intergenic region.
  • Protein Impact: For coding variants, predicting whether they cause amino acid changes, frameshifts, or premature stop codons.
  • Regulatory Impact: Identifying variants in regulatory regions like promoters or enhancers.
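Protein impact prediction for a coding SNV comes down to translating the affected codon before and after the change. The sketch below does this with the standard genetic code; it assumes the coding sequence is already spliced, in frame, and on the coding strand, and the function names and consequence labels are illustrative. Real annotators such as VEP or SnpEff also handle transcripts, strand, splice sites, and start-codon changes.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snv_consequence(cds: str, cds_pos: int, alt: str) -> str:
    """Consequence of substituting `alt` at 0-based position `cds_pos`."""
    offset = cds_pos % 3
    start = cds_pos - offset
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


def indel_consequence(ref: str, alt: str) -> str:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return "inframe" if (len(alt) - len(ref)) % 3 == 0 else "frameshift"


cds = "ATGGAATGGTAA"  # codons: ATG GAA TGG TAA -> M E W *
print(snv_consequence(cds, 5, "G"))  # GAA -> GAG
print(snv_consequence(cds, 3, "A"))  # GAA -> AAA
print(snv_consequence(cds, 8, "A"))  # TGG -> TGA
print(indel_consequence("A", "ATC"), indel_consequence("A", "ATCG"))
```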

4.2. Structural Annotation

This focuses on the physical characteristics of the variant:

  • Variant Type: Classifying variants as SNPs, indels, CNVs, or structural variants.
  • Genomic Coordinates: Providing precise locations of variants in the genome.
  • Allele Frequency: Determining how common the variant is in different populations.
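Coordinates, alleles, and annotations such as allele frequency are typically carried in VCF records. As a minimal sketch, the parser below splits one tab-separated VCF data line into its eight fixed columns and turns the INFO column into a dictionary. The example record is made up for illustration, and the key that holds a population frequency depends on the annotation source; production code would normally use a dedicated VCF library instead.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of a VCF data line."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        # Flag-type INFO entries (no '=') are stored as True.
        info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),           # 1-based position
        "id": vid,
        "ref": ref,
        "alts": alt.split(","),    # multi-allelic sites list several ALT alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Hypothetical record, for illustration only.
record = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;AF=0.25,0.10;DB")
print(record["chrom"], record["pos"], record["alts"], record["info"])
```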

4.3. Clinical Annotation

This links variants to clinical information:

  • Disease Associations: Identifying variants associated with specific diseases or phenotypes.
  • Pharmacogenomics: Linking variants to drug response or adverse effects.
  • Clinical Significance: Classifying variants as pathogenic, likely pathogenic, benign, etc., based on guidelines like those from ACMG (American College of Medical Genetics and Genomics).

5. Tools and Software

A wide range of tools is available for variant calling and annotation. Here are some of the most commonly used ones:

5.1. Variant Calling Tools

  1. GATK (Genome Analysis Toolkit): Developed by the Broad Institute, GATK is one of the most widely used toolkits for variant discovery and genotyping.

  2. SAMtools/BCFtools: These tools provide a suite of programs for manipulating alignments and calling variants.

  3. FreeBayes: A Bayesian genetic variant detector designed to find small polymorphisms.

  4. Strelka2: A fast caller for germline and somatic small variants, widely used for tumor-normal analysis in cancer genomics.

  5. DeepVariant: A deep learning-based variant caller developed by Google.

5.2. Annotation Tools

  1. Ensembl VEP (Variant Effect Predictor): A powerful tool for annotating and prioritizing variants.

  2. SnpEff: Annotates and predicts the effects of genetic variants on genes and proteins.

  3. ANNOVAR: A tool for functionally annotating genetic variants detected from diverse genomes.

  4. VarSome: A platform that integrates multiple annotation sources and provides clinical interpretations.

  5. dbNSFP: A database for functional prediction and annotation of all potential non-synonymous single-nucleotide variants in the human genome.

6. Use Cases and Applications

Variant calling and annotation have numerous applications in genomics and precision medicine. Let’s explore some key use cases:

6.1. Cancer Genomics

In cancer research and treatment, variant calling and annotation are crucial for:

  • Tumor Profiling: Identifying somatic mutations specific to tumor cells.
  • Driver Mutation Detection: Distinguishing driver mutations that contribute to cancer progression from passenger mutations.
  • Treatment Selection: Identifying mutations that may respond to targeted therapies.
  • Monitoring: Tracking tumor evolution and treatment response over time.

Example workflow:

  1. Sequence tumor and matched normal samples.
  2. Align reads to the reference genome.
  3. Call somatic variants using tools like Mutect2 or Strelka2.
  4. Annotate variants with tools like Oncotator (or its GATK successor, Funcotator) and with cancer mutation databases like COSMIC.
  5. Prioritize variants based on their potential as driver mutations or therapeutic targets.
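The core intuition behind tumor-normal comparison can be sketched with variant allele fractions (VAFs): a somatic candidate is supported in the tumor but essentially absent from the matched normal. The function names and all thresholds below are hypothetical; callers such as Mutect2 and Strelka2 use statistical models that account for sequencing error, contamination, and tumor purity.

```python
def vaf(alt_reads: int, depth: int) -> float:
    """Variant allele fraction: share of reads supporting the ALT allele."""
    return alt_reads / depth if depth else 0.0


def is_candidate_somatic(
    tumor_alt: int,
    tumor_depth: int,
    normal_alt: int,
    normal_depth: int,
    min_depth: int = 10,            # hypothetical thresholds, tune per assay
    min_tumor_vaf: float = 0.05,
    max_normal_vaf: float = 0.01,
) -> bool:
    """Naive filter: present in the tumor, absent from the matched normal."""
    if tumor_depth < min_depth or normal_depth < min_depth:
        return False  # too little coverage to judge either sample
    return (
        vaf(tumor_alt, tumor_depth) >= min_tumor_vaf
        and vaf(normal_alt, normal_depth) <= max_normal_vaf
    )


print(is_candidate_somatic(12, 60, 0, 50))   # tumor-only signal
print(is_candidate_somatic(12, 60, 20, 50))  # also in normal: likely germline
```

Requiring adequate depth in the normal sample matters as much as in the tumor, because a germline variant can look somatic when the normal is simply under-covered.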

6.2. Rare Disease Diagnosis

For diagnosing rare genetic disorders:

  • Trio Analysis: Analyzing variants in a child and both parents to identify de novo mutations.
  • Filtering Strategies: Applying various filters to prioritize potentially causative variants.
  • Pathway Analysis: Identifying affected biological pathways based on variant annotations.

Example workflow:

  1. Perform whole exome or whole genome sequencing on the patient and parents.
  2. Call variants using GATK or similar tools.
  3. Annotate variants with tools like VEP or ANNOVAR.
  4. Filter variants based on inheritance patterns, frequency in population databases, and predicted functional impact.
  5. Prioritize candidate variants for further investigation.
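The inheritance-pattern filter for de novo variants can be sketched from VCF-style genotype strings: the child carries a non-reference allele while both parents are homozygous reference. The helper names are illustrative, and the check ignores genotype quality and read depth, which a real trio analysis would need in order to control false positives.

```python
import re
from typing import Optional


def alt_allele_count(genotype: str) -> Optional[int]:
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", genotype)
    if "." in alleles:
        return None  # missing call
    return sum(allele != "0" for allele in alleles)


def is_candidate_de_novo(child: str, mother: str, father: str) -> bool:
    """True when the child has an ALT allele that neither parent carries."""
    counts = [alt_allele_count(gt) for gt in (child, mother, father)]
    if None in counts:
        return False  # cannot assess inheritance with a missing genotype
    child_n, mother_n, father_n = counts
    return child_n > 0 and mother_n == 0 and father_n == 0


print(is_candidate_de_novo("0/1", "0/0", "0/0"))  # candidate de novo
print(is_candidate_de_novo("0/1", "0/1", "0/0"))  # inherited from mother
```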

6.3. Population Genetics

In studying genetic diversity and evolution:

  • Allele Frequency Calculation: Determining the frequency of variants in different populations.
  • Selection Analysis: Identifying variants under positive or negative selection.
  • Demographic Inference: Using variant patterns to infer population history and migration patterns.

Example workflow:

  1. Sequence or genotype individuals from multiple populations.
  2. Call variants jointly across samples, for example with GATK’s HaplotypeCaller in GVCF mode followed by GenotypeGVCFs.
  3. Annotate variants with population frequency data from resources like gnomAD.
  4. Test for selection using tools like SweepFinder or haplotype-based statistics like iHS (integrated haplotype score).
  5. Examine population structure using tools like ADMIXTURE or methods like principal component analysis (PCA).
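Allele frequency calculation follows directly from genotype counts at a biallelic site, and the same counts can be compared with Hardy-Weinberg expectations. The sketch below shows both; the function names are illustrative and the population counts are made-up numbers, not real data.

```python
def alt_allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """ALT allele frequency from diploid genotype counts at a biallelic site."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotyped individuals")
    # Each heterozygote carries one ALT allele, each ALT homozygote two.
    return (n_het + 2 * n_hom_alt) / (2 * n)


def hwe_expected_counts(n_hom_ref: int, n_het: int, n_hom_alt: int) -> tuple:
    """Expected genotype counts under Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    q = alt_allele_frequency(n_hom_ref, n_het, n_hom_alt)
    p = 1 - q
    return (n * p * p, 2 * n * p * q, n * q * q)


# Made-up genotype counts (hom-ref, het, hom-alt) for two populations.
populations = {"pop_A": (64, 32, 4), "pop_B": (30, 20, 50)}
for name, counts in populations.items():
    print(name, alt_allele_frequency(*counts), hwe_expected_counts(*counts))
```

A large gap between observed and expected heterozygote counts, as in the second made-up population, can point to population substructure, selection, or a genotyping artifact, so it is worth checking before downstream tests.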

6.4. Pharmacogenomics

For predicting drug response and optimizing treatment:

  • Drug Metabolism: Identifying variants in genes involved in drug metabolism (e.g., CYP450 enzymes).
  • Drug Target Analysis: Detecting variations in genes encoding drug targets.
  • Adverse Effect Prediction: Identifying variants associated with increased risk of adverse drug reactions.

Example workflow:

  1. Sequence or genotype patient samples.
  2. Call variants using standard pipelines.
  3. Annotate variants with pharmacogenomic databases like PharmGKB.
  4. Generate reports highlighting variants relevant to specific drugs or drug classes.
  5. Integrate findings into clinical decision support systems.

7. Challenges and Considerations

While variant calling and annotation have come a long way, several challenges remain:

  1. Complex Regions: Repetitive regions, pseudogenes, and highly polymorphic areas of the genome remain difficult for accurate variant calling.

  2. Structural Variants: Detecting and annotating large structural variants is still challenging, especially with short-read sequencing data.

  3. Non-Coding Variants: Interpreting the functional impact of variants in non-coding regions is often difficult due to our limited understanding of regulatory elements.

  4. Variant Interpretation: Determining the clinical significance of many variants remains challenging, leading to a high proportion of variants of uncertain significance (VUS).

  5. Computational Resources: Variant calling and annotation, especially for whole genome data, require significant computational resources.

  6. Data Integration: Integrating variant data with other omics data (transcriptomics, proteomics, etc.) for comprehensive interpretation is an ongoing challenge.

  7. Ethical Considerations: Handling incidental findings and ensuring patient privacy in genomic data analysis are important ethical considerations.

8. Future Directions

The field of variant calling and annotation is rapidly evolving. Some exciting future directions include:

  1. Long-Read Sequencing: Technologies like PacBio and Oxford Nanopore are improving our ability to detect structural variants and resolve complex genomic regions.

  2. Graph Genomes: Moving beyond linear reference genomes to graph-based representations that better capture human genetic diversity.

  3. Machine Learning and AI: Developing more sophisticated algorithms for variant calling, filtering, and interpretation using machine learning approaches.

  4. Single-Cell Genomics: Applying variant calling and annotation techniques to single-cell sequencing data for understanding cellular heterogeneity.

  5. Functional Validation: High-throughput methods for experimentally validating the functional impact of variants.

  6. Clinical Integration: Improved integration of genomic data into electronic health records and clinical decision support systems.

  7. Pan-Genome Analysis: Analyzing variants in the context of a pan-genome that represents the genetic diversity of a species, rather than a single reference genome.

9. Conclusion

Variant calling and annotation are fundamental processes in modern genomics, enabling us to uncover the genetic basis of diseases, guide treatment decisions, and understand human evolution. For a bioinformatics student, mastering these techniques is crucial preparation for work in genomics research and precision medicine.

The field is rapidly evolving, with new tools, methodologies, and applications emerging regularly. Staying up-to-date with the latest developments, understanding the underlying principles, and gaining hands-on experience with various tools and pipelines will be key to your success in this exciting field.

As you continue your studies, consider exploring some of the use cases and tools mentioned in this guide in more depth. Practice working with real genomic datasets, and don’t hesitate to dive into the primary literature to understand the latest advances in variant calling and annotation.

10. References

  1. Van der Auwera GA, et al. (2013). From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics.

  2. McLaren W, et al. (2016). The Ensembl Variant Effect Predictor. Genome Biology.

  3. Koboldt DC, et al. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research.

  4. Cingolani P, et al. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly.

  5. Poplin R, et al. (2018). A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology.

  6. Richards S, et al. (2015). Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine.

  7. 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature.

  8. Whiffin N, et al. (2017). Using high-resolution variant frequencies to empower clinical genome interpretation. Genetics in Medicine.

  9. Karczewski KJ, et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature.

  10. Zook JM, et al. (2019). An open resource for accurately benchmarking small variant and reference calls. Nature Biotechnology.