BLAST and its variants
In the rapidly evolving field of bioinformatics, few tools have had as profound an impact as the Basic Local Alignment Search Tool, commonly known as BLAST. Developed by Altschul et al. in 1990, BLAST has become an indispensable algorithm for comparing primary biological sequence information, such as amino acid sequences of proteins or the nucleotides of DNA and RNA sequences.
This article aims to provide a comprehensive overview of BLAST and its variants, tailored for students interested in bioinformatics.
1. Understanding BLAST: Basic Principles
At its core, BLAST is a heuristic algorithm designed to find regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
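Because BLAST is heuristic, it does not exhaustively align the query against every database sequence. Instead, it divides the query into short subsequences ("words"), looks for matches to those words in the database, and extends only the most promising word matches into alignments.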
1.1 BLAST Algorithm
The BLAST algorithm can be broken down into several key steps:
- Seeding: The query sequence is broken down into short, overlapping contiguous subsequences (words). For protein sequences, these are typically 3 amino acids long; for nucleotide sequences, the classic blastn default is 11 bases (megablast uses a longer default of 28).
- Word Matching: These words are then matched against the database sequences. Matches are used as seeds for possible alignments.
- Extension: Seeds are extended in both directions to create a local alignment. Extension continues while the alignment score keeps improving and stops once the score drops a set amount below the best score seen so far.
- Evaluation: The algorithm evaluates the score of each alignment using a substitution matrix (like BLOSUM62 for proteins) and a gap penalty for insertions and deletions.
- Significance Assessment: The statistical significance of each alignment is calculated, typically expressed as an expect value (E-value).

Understanding this process is crucial for interpreting BLAST results and optimizing searches for specific research needs.
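The seeding, word-matching, and extension steps can be illustrated with a small Python sketch. This is a simplified toy, not the actual BLAST implementation: it uses exact word matches only (real protein searches also consider similar "neighborhood" words), ungapped extension, and illustrative values for `match`, `mismatch`, and `x_drop`. All function names are hypothetical.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word match occurs."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Ungapped extension in one direction; return (best gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, qi, sj, word_size,
                match=2, mismatch=-3, x_drop=5):
    """Extend a seed both ways; return (score, q_start, q_end, s_start)."""
    right, right_len = extend_one_way(query, subject, qi + word_size,
                                      sj + word_size, 1, match, mismatch, x_drop)
    left, left_len = extend_one_way(query, subject, qi - 1, sj - 1,
                                    -1, match, mismatch, x_drop)
    score = word_size * match + left + right
    return score, qi - left_len, qi + word_size + right_len, sj - left_len
```

For example, `find_seeds("GATTACA", "CCGATTACATT", 4)` finds four overlapping seeds, and extending the first one recovers the full seven-base match.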
3. BLAST Variants
BLAST has evolved into several specialized variants, each designed for specific types of sequence comparisons. Here, we’ll explore the most commonly used BLAST programs.
3.1 BLASTN
- Purpose: Compares a nucleotide query sequence against a nucleotide sequence database.
- Use Case: Identifying similar DNA sequences across species or within a genome.
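- Technical Note: The classic blastn task uses a word size of 11 by default (megablast, which is optimized for highly similar sequences, uses 28). Smaller word sizes give more sensitive but slower searches.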
3.2 BLASTP
- Purpose: Compares an amino acid query sequence against a protein sequence database.
- Use Case: Identifying homologous proteins, predicting protein structure and function.
- Technical Note: Uses the BLOSUM62 scoring matrix by default, which can be changed based on the evolutionary distance between sequences.
3.3 BLASTX
- Purpose: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- Use Case: Identifying potential protein-coding regions in a DNA sequence, especially useful for analyzing ESTs or genomic sequences.
- Technical Note: Performs six-frame translation of the query sequence, which can be computationally intensive for large sequences.
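As a rough illustration of what six-frame translation involves, the following Python sketch translates a DNA string in the three forward and three reverse-complement frames using the standard genetic code. It is a minimal sketch, not BLASTX itself; the function names are hypothetical, and codons containing ambiguous bases are simply rendered as `X`.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(dna):
    """Map frame number (+1..+3, -1..-3) to its protein translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

Each nucleotide query therefore yields six protein sequences to search, which is why translated searches cost noticeably more than a plain BLASTN or BLASTP run.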
3.4 TBLASTN
- Purpose: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- Use Case: Finding new members of a protein family in unannotated genomic DNA.
- Technical Note: Useful for identifying genes in species where protein databases are limited but genomic sequences are available.
3.5 TBLASTX
- Purpose: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
- Use Case: Comparing nucleotide sequences, such as two or more genomes, at the translated protein level when the location of protein-coding regions is unknown.
- Technical Note: Computationally intensive due to the multiple translations required.
3.6 PSI-BLAST (Position-Specific Iterated BLAST)
- Purpose: Uses an iterative search in which sequences found in one round of search are used to build a score model for the next round.
- Use Case: Detecting distant evolutionary relationships between proteins.
- Technical Note: Creates a position-specific scoring matrix (PSSM) from multiple sequence alignments, which can capture sequence patterns specific to protein families.
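To make the idea of a PSSM concrete, here is a deliberately simplified Python sketch that builds log-odds scores per alignment column from a set of equal-length aligned sequences. It assumes a uniform background frequency and a fixed pseudocount, both of which are simplifications; PSI-BLAST itself uses sequence weighting and more careful pseudocount estimation. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:  # gaps and unknown residues are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Unlike a general matrix such as BLOSUM62, the score for a given residue here depends on its position, so a residue conserved in one column is rewarded there and penalized elsewhere.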
3.7 PHI-BLAST (Pattern-Hit Initiated BLAST)
- Purpose: Combines pattern matching with local alignment to find sequences that both contain a pattern and are homologous to a query sequence.
- Use Case: Identifying proteins with both sequence similarity and a specific motif.
- Technical Note: Requires a predefined pattern in PROSITE format, which can significantly speed up searches in large databases.
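For readers unfamiliar with PROSITE-style patterns, the following Python sketch converts a pattern into a regular expression and searches a sequence with it. It covers only the common syntax elements (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` / `>` anchors at the ends of the pattern); it is an illustration of the pattern language, not of how PHI-BLAST processes patterns internally. The function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        element = element.replace("{", "[^").replace("}", "]")  # forbidden set
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(prefix + element + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return the (start, end) span of the first match, or None."""
    match = re.search(prosite_to_regex(pattern), sequence)
    return match.span() if match else None
```

For instance, the C2H2 zinc finger style pattern `C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H` becomes the regular expression `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`.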
3.8 DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST)
- Purpose: Searches a database of pre-constructed position-specific scoring matrices (PSSMs) before searching a protein sequence database.
- Use Case: Improving the accuracy of protein sequence similarity searches, especially for detecting remote homologs.
- Technical Note: Constructs a PSSM using the query and conserved domains, which is then used for the database search.
4. Technical Considerations
To effectively use BLAST in bioinformatics research, it’s crucial to understand and optimize various technical aspects:

4.1 Scoring Matrices
Scoring matrices are fundamental to sequence alignment algorithms, including BLAST. They assign a score to every possible pairing of residues (matches and mismatches); gaps are scored separately through gap opening and extension penalties.
- For Proteins: BLOSUM (BLOcks SUbstitution Matrix) and PAM (Point Accepted Mutation) matrices are commonly used.
  - BLOSUM62 is the default for BLASTP and is suitable for most purposes.
  - Higher numbered matrices (e.g., BLOSUM80) are better for closely related sequences.
  - Lower numbered matrices (e.g., BLOSUM45) are better for more distantly related sequences.
- For Nucleotides: Simple match/mismatch scoring is typically used.
  - The classic blastn task defaults to +2 for a match and -3 for a mismatch; megablast defaults to +1 and -2.
Technical Note: Choosing the appropriate matrix can significantly affect the sensitivity and specificity of your BLAST search.
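The effect of match, mismatch, and gap scores on a raw alignment score can be seen in a short Python sketch. It scores an already-aligned nucleotide pair, with `-` marking gaps, and charges each gap an opening cost plus a per-position extension cost. The default values below are illustrative assumptions, and the function name is hypothetical.

```python
def score_alignment(aligned_query, aligned_subject,
                    match=2, mismatch=-3, gap_open=5, gap_extend=2):
    """Raw score of a gapped pairwise alignment ('-' marks a gap)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            # A new gap pays the opening cost; every gap position pays extension.
            score -= gap_extend if side == gap_side else gap_open + gap_extend
            gap_side = side
        else:
            score += match if q == s else mismatch
            gap_side = None
    return score
```

With these values, six matches interrupted by one two-base gap score 6 × 2 − (5 + 2 × 2) = 3, which shows how quickly gaps and mismatches erode the score of a short alignment.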
4.2 E-value and Bit Score
Understanding these statistical measures is crucial for interpreting BLAST results:
- E-value (Expect value): Represents the number of alignments expected by chance with a score equal to or better than the alignment score. Lower E-values indicate more significant alignments.
- Bit Score: A normalized score that allows comparisons across different searches and databases. Higher bit scores indicate better alignments.
Technical Approach:
- Use E-values for assessing the statistical significance of an alignment.
- Use bit scores when comparing alignments from different searches.
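- Be cautious with short sequences: even a perfect match to a short query can only reach a modest score, so its E-value may look unimpressive, and a stricter or looser E-value threshold than usual may be needed.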
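The relationship between raw score, bit score, and E-value can be sketched in Python using the standard Karlin-Altschul formulas: the bit score is S' = (λS − ln K) / ln 2, and the E-value is E = m · n · 2^(−S'), where m and n are the query and database lengths. The λ and K values in the usage note are illustrative placeholders (they depend on the scoring matrix and gap costs), and real BLAST applies additional length corrections that this sketch ignores.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)
```

For example, `e_value(bit_score(100, 0.267, 0.041), 250, 10**8)` estimates the E-value of a raw score of 100 for a 250-residue query against a database of 10^8 residues. Each additional bit halves the E-value, while doubling the database size doubles it, which is why bit scores rather than E-values are the better choice for comparing hits across databases of different sizes.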
4.3 Database Selection
Choosing the right database is crucial for meaningful BLAST results:
- nr/nt: nr is the non-redundant protein database and nt is the nucleotide collection; both are comprehensive, but the volume of hits can be overwhelming.
- RefSeq: Curated, non-redundant set of reference sequences.
- SwissProt: Manually annotated and reviewed protein sequences.
- PDB: For searching against known protein structures.
Technical Note: Consider using taxon-specific databases when working with particular organisms or groups.
4.4 Parameters Optimization
Adjusting BLAST parameters can significantly impact search results:
- Word Size: Smaller word sizes increase sensitivity but decrease speed.
- Gap Costs: Adjusting gap opening and extension penalties can affect alignment length and accuracy.
- Filters: Low-complexity filters like SEG for proteins or DUST for nucleotides can prevent spurious matches.
- Compositional Adjustments: Can help when comparing sequences with different nucleotide or amino acid compositions.
Technical Approach:
- Start with default parameters and adjust based on specific needs.
- Use smaller word sizes and more permissive gap costs for finding distant homologs.
- Consider turning off filters when searching for short motifs or low-complexity regions.
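When running the command-line BLAST+ programs, these parameters map onto options such as `-word_size`, `-gapopen`, `-gapextend`, `-matrix`, `-seg`, and `-evalue`. The Python sketch below only assembles an argument list for `blastp` and does not execute it; the helper name, file name, and database name are hypothetical. Note that BLAST accepts only certain gap cost combinations for each matrix, so arbitrary values may be rejected.

```python
def build_blastp_command(query, db, evalue=10.0, word_size=None,
                         gap_open=None, gap_extend=None,
                         matrix=None, seg=None):
    """Assemble a blastp argument list; options left as None keep defaults."""
    cmd = ["blastp", "-query", query, "-db", db,
           "-evalue", str(evalue), "-outfmt", "6"]  # 6 = tabular output
    optional = [("-word_size", word_size), ("-gapopen", gap_open),
                ("-gapextend", gap_extend), ("-matrix", matrix),
                ("-seg", seg)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical example: a more sensitive search for distant homologs.
command = build_blastp_command("query.fasta", "my_proteins",
                               evalue=1e-3, word_size=2,
                               matrix="BLOSUM45", seg="no")
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ is installed and the database exists. Starting from a call with only `query` and `db` set reproduces the "defaults first" approach described above.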
5. Use Cases in Bioinformatics
BLAST and its variants have a wide range of applications in bioinformatics research. Here are some key use cases:
5.1 Gene Identification
BLAST is instrumental in identifying genes in newly sequenced genomes. By comparing unknown sequences to databases of known genes, researchers can predict the location and function of genes in a new organism.
Technical Approach:
- Use BLASTX to compare genomic DNA sequences against protein databases.
- Employ TBLASTN to search for protein-coding genes using known protein sequences as queries against a genomic database.
- Utilize BLASTN for identifying non-coding RNA genes by comparing against RNA databases.
5.2 Functional Annotation
Once genes are identified, BLAST helps in predicting their functions based on similarities to known genes or proteins.
Technical Approach:
- Use BLASTP to compare predicted protein sequences against databases like UniProt or RefSeq.
- Employ PSI-BLAST for detecting distant homologs that might share functional similarities.
- Use conserved domain databases (CDD) in conjunction with RPS-BLAST to identify functional domains within proteins.
5.3 Evolutionary Studies
BLAST is crucial for comparative genomics and studying evolutionary relationships between species.
Technical Approach:
- Use BLASTN or TBLASTX for whole-genome comparisons between species.
- Employ BLASTP or PSI-BLAST to identify orthologs and paralogs across different species.
- Utilize BLAST results as input for phylogenetic tree construction tools.
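One common heuristic for proposing candidate orthologs is the reciprocal best hit: protein A in species 1 and protein B in species 2 are paired if each is the other's top-scoring hit. The Python sketch below applies that heuristic to two BLAST result sets in the default tabular format (`-outfmt 6`), where the E-value is the 11th column and the bit score the 12th. The function names and the E-value cutoff are assumptions, and reciprocal best hits are only a first approximation of orthology.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Map each query ID to its highest-bit-score subject ID."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bits = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # ignore hits that are not significant enough
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    backward = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items()
                  if backward.get(b) == a)
```

Here `a_vs_b` and `b_vs_a` are the text contents of the two result files, one from searching species A proteins against species B and one from the reverse search.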
5.4 Primer Design
In molecular biology, BLAST is used to ensure the specificity of PCR primers.
Technical Approach:
- Use BLASTN to check potential primer sequences against the target genome to ensure uniqueness.
- Employ BLAST against databases of other organisms to avoid cross-species amplification.
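The idea behind a primer specificity check can be shown with a toy Python scan that looks for near-exact primer binding sites on both strands of a target sequence. This is a naive stand-in for a BLASTN search, practical only for small sequences; the function names and the mismatch tolerance are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def primer_binding_sites(primer, target, max_mismatches=1):
    """Return sorted (position, strand, mismatches) for candidate sites."""
    sites = []
    for strand, pattern in (("+", primer), ("-", reverse_complement(primer))):
        for pos in range(len(target) - len(pattern) + 1):
            window = target[pos:pos + len(pattern)]
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                sites.append((pos, strand, mismatches))
    return sorted(sites)
```

A primer that yields more than one site within the mismatch tolerance is a candidate for off-target amplification and is worth redesigning or checking more carefully.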
5.5 Metagenomics
BLAST is essential in analyzing complex microbial communities through metagenomic sequencing.
Technical Approach:
- Use BLASTX to compare short read sequences against protein databases for taxonomic classification.
- Employ BLASTN for comparing assembled contigs against nucleotide databases to identify species present in the sample.
- Utilize specialized reference databases like SILVA, formatted as BLAST databases, for 16S rRNA gene analysis in microbial community profiling.
6. Conclusion
BLAST and its variants remain cornerstone tools in bioinformatics, essential for a wide range of genomic and proteomic analyses. As a bioinformatics student, mastering BLAST is crucial for your future work in genomics, proteomics, and computational biology.
Understanding the technical aspects of BLAST - from its basic algorithm to advanced parameter optimization - will enable you to effectively leverage this powerful tool in your research. As the field evolves, staying informed about new developments and