8. Hands-on with NCBI Tools
Introduction
The National Center for Biotechnology Information (NCBI) provides a wealth of resources and tools that are essential for anyone venturing into the field of bioinformatics. As a student interested in this rapidly evolving discipline, mastering NCBI tools will give you a significant advantage in your studies and future career. This guide walks you through the most important NCBI tools and their applications, and provides hands-on examples to help you gain practical experience.
Understanding NCBI: An Overview
The National Center for Biotechnology Information (NCBI) is a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). Established in 1988, NCBI has become an indispensable resource for researchers, students, and professionals in the fields of molecular biology, genetics, and bioinformatics.
NCBI’s mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. To achieve this, NCBI:
- Conducts research on fundamental biomedical problems at the molecular level using mathematical and computational methods.
- Maintains and distributes the GenBank DNA sequence database.
- Creates automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics.
- Facilitates the use of databases and software by researchers and medical care personnel.
- Coordinates efforts to gather biotechnology information worldwide.
As a bioinformatics student, understanding the scope and capabilities of NCBI tools is crucial for your academic and professional development.
Navigating the NCBI Website
The NCBI website (https://www.ncbi.nlm.nih.gov/) can be overwhelming at first glance due to the sheer number of resources available. Here’s a quick guide to help you navigate the site efficiently:
- Homepage: The main page provides quick access to popular resources and a search bar for all NCBI databases.
- All Resources: This dropdown menu lists all available NCBI tools and databases, categorized by their function.
- Literature: Access to PubMed, PubMed Central, and other literature-related resources.
- Health: Information on diseases, drugs, and genetic testing.
- Genomes: Access to genome-related databases and tools.
- Genes: Gene-specific information and analysis tools.
- Proteins: Protein sequence and structure databases and analysis tools.
- Chemicals: Information on chemical compounds and their biological activities.
Familiarize yourself with this layout, as you’ll be using it frequently throughout your bioinformatics journey.
Essential NCBI Databases
GenBank
GenBank is the NIH genetic sequence database, containing an annotated collection of all publicly available DNA sequences. As a bioinformatics student, you’ll often find yourself accessing GenBank for various projects and analyses.
Key Features:
- Comprehensive: Contains sequences from more than 420,000 formally described species.
- Regular Updates: New sequences are added daily.
- Linked Data: Sequences are linked to related literature and other NCBI resources.
Hands-on Example: Let’s retrieve a DNA sequence from GenBank:
- Go to the NCBI homepage and select “Nucleotide” from the search dropdown.
- Enter an accession number (e.g., NM_000546 for human p53 mRNA).
- Click on the result to view the full record, including sequence, annotations, and related information.
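The same record can also be retrieved without the browser. The following is a minimal sketch, assuming you use NCBI's E-utilities EFetch endpoint with the nuccore (Nucleotide) database; the function names build_efetch_url and fetch_record are made up for this example and are not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL for a single sequence record.

    rettype="fasta" asks for FASTA; rettype="gb" asks for the GenBank flat file.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession, db="nuccore", rettype="fasta"):
    """Download the record as text. Requires network access."""
    with urlopen(build_efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


# Build (but do not yet request) the URL for the human p53 mRNA record.
url = build_efetch_url("NM_000546")
print(url)
```

Calling fetch_record("NM_000546") would perform the actual download. If you script many requests, check NCBI's current usage guidelines for request rates first.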
PubMed
PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. As a student, you’ll use PubMed extensively for literature reviews and staying updated on current research.
Key Features:
- Comprehensive Coverage: Over 32 million citations and abstracts.
- Advanced Search: Allows for complex queries using Boolean operators and field tags.
- My NCBI: Personalized experience with saved searches and alerts.
Hands-on Example: Let’s perform a literature search on CRISPR gene editing:
- Go to PubMed (https://pubmed.ncbi.nlm.nih.gov/).
- In the search bar, enter: CRISPR[Title/Abstract] AND "gene editing"[Title/Abstract] (the straight double quotes make PubMed search for "gene editing" as an exact phrase).
- Use the filters on the left to refine your search (e.g., publication date, article type).
- Click on relevant articles to read abstracts or access full texts when available.
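A search like this can also be scripted. Here is a small sketch, assuming the E-utilities ESearch endpoint with JSON output; the helper names are hypothetical, and the sample response below is made up purely to show the shape of the data, not real search results.

```python
import json
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query, retmax=20):
    """Build an ESearch URL that returns up to retmax PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


def extract_pmids(response_text):
    """Return (total hit count, list of PMIDs) from an ESearch JSON response."""
    result = json.loads(response_text)["esearchresult"]
    return int(result["count"]), result["idlist"]


# The phrase is quoted so that it is searched as a unit.
query = 'CRISPR[Title/Abstract] AND "gene editing"[Title/Abstract]'
print(build_pubmed_search_url(query))

# Made-up response text, only to illustrate the parsing step.
sample = '{"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}'
print(extract_pmids(sample))
```

The PMIDs returned by ESearch can then be passed to other E-utilities calls to retrieve abstracts or summaries.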
BLAST
BLAST (Basic Local Alignment Search Tool) is perhaps the most widely used bioinformatics algorithm for sequence similarity searches. It’s essential for identifying homologous sequences, predicting gene function, and studying evolutionary relationships.
Key Features:
- Multiple BLAST Programs: Nucleotide BLAST, Protein BLAST, blastx, tblastn, etc.
- Customizable Parameters: Adjust scoring matrices, gap penalties, and more.
- Batch Searches: Submit multiple sequences at once.
We’ll dive deeper into BLAST in the next section.
Sequence Analysis Tools
BLAST (Basic Local Alignment Search Tool)
BLAST is a fundamental tool in bioinformatics for comparing primary biological sequences, such as the amino acid sequences of proteins or the nucleotide sequences of DNA and RNA.
Types of BLAST:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares an amino acid query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
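The list above boils down to a simple decision based on the type of your query and the type of the database. The sketch below encodes that decision in Python; the function name and its arguments are illustrative only and do not correspond to any BLAST API.

```python
def choose_blast_program(query_type, database_type, compare_translations=False):
    """Pick a BLAST program from the query and database sequence types.

    query_type and database_type are "nucleotide" or "protein".
    compare_translations only matters when both are nucleotide: it selects
    tblastx (six-frame translations on both sides) instead of blastn.
    """
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if compare_translations else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    if key not in programs:
        raise ValueError(f"Unsupported combination: {key}")
    return programs[key]


print(choose_blast_program("nucleotide", "protein"))
```

A useful rule of thumb: when either side has to be translated, the comparison itself happens at the protein level, which tends to be more sensitive for distantly related coding sequences.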
Hands-on Example: Let’s perform a blastp search:
- Go to BLAST homepage (https://blast.ncbi.nlm.nih.gov/Blast.cgi).
- Select “Protein BLAST” under the “Basic BLAST” section.
- Enter a protein sequence (e.g., the human p53 protein sequence) in FASTA format.
- Choose the database (e.g., Non-redundant protein sequences (nr)).
- Click “BLAST” and wait for the results.
- Analyze the results, focusing on E-values, percent identity, and query coverage.
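To make the last step concrete, here is a rough sketch of filtering hits by E-value, percent identity, and query coverage. It assumes you have saved the results as a tab-separated hit table with the standard 12 BLAST tabular columns; the thresholds and the sample rows are made up for illustration. Coverage is computed per alignment here, which can differ from the "Query Cover" column on the web page, since that value combines all alignments to a subject.

```python
import csv
import io

# Standard column order of the 12-column BLAST tabular format.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(table_text, query_length, max_evalue=1e-5,
                min_identity=30.0, min_coverage=50.0):
    """Keep hits that pass all three thresholds.

    Returns a list of (subject id, % identity, % query coverage, E-value).
    """
    kept = []
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        identity = float(hit["pident"])
        evalue = float(hit["evalue"])
        aligned = int(hit["qend"]) - int(hit["qstart"]) + 1
        coverage = 100.0 * aligned / query_length
        if (evalue <= max_evalue and identity >= min_identity
                and coverage >= min_coverage):
            kept.append((hit["sseqid"], identity, round(coverage, 1), evalue))
    return kept


# Made-up hit table for a 393-residue query (the length of human p53).
SAMPLE = (
    "query1\tsubjA\t98.50\t393\t6\t0\t1\t393\t1\t393\t0.0\t800\n"
    "query1\tsubjB\t45.00\t120\t60\t3\t10\t129\t5\t124\t2e-20\t95.1\n"
    "query1\tsubjC\t28.00\t50\t36\t0\t200\t249\t1\t50\t0.5\t30.2\n"
)
hits = filter_hits(SAMPLE, query_length=393)
print(hits)
```

In this sample, the second hit fails the coverage threshold and the third fails the E-value threshold, so only the first is kept. Sensible cutoffs depend on your question, so treat the defaults as placeholders.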
Primer-BLAST
Primer-BLAST is a tool for designing target-specific primers for your PCR experiments. It combines BLAST with a global alignment algorithm to ensure the primers are specific to the intended target sequence.
Key Features:
- Specificity Checking: Ensures primers only amplify the intended target.
- Customizable Parameters: Adjust primer length, GC content, melting temperature, etc.
- Multiple Template Support: Design primers for multiple related sequences simultaneously.
Hands-on Example: Let’s design primers for the human BRCA1 gene:
- Go to Primer-BLAST (https://www.ncbi.nlm.nih.gov/tools/primer-blast/).
- Enter a BRCA1 RefSeq mRNA accession (e.g., NM_007294) in the “Enter accession, gi, or FASTA sequence” field. This field expects a sequence identifier rather than a Gene ID such as 672.
- Adjust parameters as needed (e.g., PCR product size, primer melting temperatures).
- Click “Get Primers”.
- Review the resulting primer pairs, considering factors like specificity and self-complementarity.
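When reviewing primer pairs, it helps to understand what the reported numbers mean. The sketch below computes GC content and a rough melting temperature using the simple Wallace rule (2 °C per A/T, 4 °C per G/C). This is only a back-of-the-envelope approximation for short oligos and is not how Primer-BLAST calculates Tm, so expect its reported values to differ. The primer sequence is invented for the example.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement, useful for writing the reverse primer 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Hypothetical 20-nt primer, not taken from any real design.
primer = "AGCTGACCTGAAGTCCATGG"
print(gc_content(primer), wallace_tm(primer), reverse_complement(primer))
```

As a sanity check on a pair, the forward and reverse primers should have similar melting temperatures and moderate GC content.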
ORFfinder
ORFfinder (Open Reading Frame Finder) is a graphical analysis tool for finding open reading frames (ORFs) in DNA sequences. This is crucial for identifying potential protein-coding regions in newly sequenced DNA.
Key Features:
- Multiple ORF Detection: Identifies all possible ORFs in six frames.
- Customizable Parameters: Adjust minimum ORF length, start codon, etc.
- Integration with BLAST: Easily BLAST detected ORFs against protein databases.
Hands-on Example: Let’s find ORFs in a DNA sequence:
- Go to ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/).
- Enter a nucleotide sequence or accession number.
- Set parameters (e.g., minimum ORF length, genetic code).
- Click “Submit”.
- Analyze the graphical output and table of ORFs.
- Select interesting ORFs and use the “BLAST ORF” option to search for similar proteins.
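To see what a six-frame ORF search does under the hood, here is a simplified sketch. It assumes ATG as the only start codon and the three standard stop codons; ORFfinder itself offers alternative start codons and genetic codes, so this is an illustration of the idea, not a reimplementation of the tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(sequence, min_length=30):
    """Find ATG-to-stop ORFs in all six reading frames.

    Returns tuples of (strand, frame, start, end, orf_sequence), where
    start and end are 0-based positions on the scanned strand and the
    ORF length (end - start) includes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in range(3):
            start = None  # position of the first ATG in the current ORF
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    end = pos + 3
                    if end - start >= min_length:
                        orfs.append((strand, frame + 1, start, end, seq[start:end]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence with a single short ORF on the forward strand.
orfs = find_orfs("CCATGAAATTTGGGTAACC", min_length=9)
print(orfs)
```

Because the scan keeps the first ATG it meets in each frame, it reports the longest ORF ending at each stop codon. ORFs that run off the end of the sequence without a stop codon are ignored in this sketch.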
Structural Analysis Tools
Cn3D
Cn3D (pronounced “see in 3D”) is a visualization tool for biomolecular structures, sequences, and sequence alignments. It’s particularly useful for understanding the three-dimensional structure of proteins and nucleic acids.
Key Features:
- 3D Structure Viewing: Rotate, zoom, and highlight specific residues or regions.
- Sequence-Structure Linkage: See how sequence features relate to 3D structure.
- Multiple Structure Alignment: Compare structures of related proteins.
Hands-on Example: Let’s visualize a protein structure:
- Download and install Cn3D from the NCBI website.
- Go to the NCBI Structure database and search for a protein (e.g., hemoglobin).
- Click on a structure and select “View in Cn3D”.
- Use the software to explore the 3D structure, highlighting specific domains or residues.
iCn3D
iCn3D is a web-based 3D structure viewer, an evolution of Cn3D that doesn’t require software installation. It offers advanced features for analyzing and annotating molecular structures.
Key Features:
- Web-Based: No installation required, works in modern web browsers.
- Advanced Visualization: Custom coloring, surface representations, distance measurements.
- Shareable Views: Generate URLs for specific views to share with colleagues.
Hands-on Example: Let’s analyze a protein-ligand interaction:
- Go to iCn3D (https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html).
- In the “Load Structure” menu, enter a PDB ID (e.g., 1HSG for HIV-1 protease with an inhibitor).
- Use the “Select” menu to highlight the ligand and nearby residues.
- Use the “Style” menu to change representations (e.g., stick model for the ligand).
- Use the “Analysis” menu to measure distances between atoms.
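The distance measurement in the last step is just the Euclidean distance between two sets of atomic coordinates, given in ångströms. The sketch below shows the calculation and a simple "nearby atoms" selection. The atom labels and coordinates are made up, and the 4.0 Å cutoff is an assumption chosen for illustration rather than a value taken from iCn3D.

```python
import math


def atoms_within(ligand_atoms, protein_atoms, cutoff=4.0):
    """Return protein atoms within `cutoff` angstroms of any ligand atom.

    Both arguments map an atom label to an (x, y, z) coordinate tuple.
    The result is a list of (label, nearest distance), closest first.
    """
    close = []
    for label, coords in protein_atoms.items():
        nearest = min(math.dist(coords, lig) for lig in ligand_atoms.values())
        if nearest <= cutoff:
            close.append((label, round(nearest, 2)))
    return sorted(close, key=lambda item: item[1])


# Hypothetical coordinates, not taken from any real structure.
ligand = {"LIG:C1": (0.0, 0.0, 0.0)}
protein = {
    "residue_A:O": (3.0, 0.0, 0.0),
    "residue_B:N": (10.0, 0.0, 0.0),
}
print(atoms_within(ligand, protein))
```

Distances of roughly 2.5 to 3.5 Å between suitable donor and acceptor atoms are often read as potential hydrogen bonds, which is why measuring them is a common first step when inspecting a binding site.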
Genomic Analysis Tools
Genome Data Viewer
The Genome Data Viewer (GDV) is a browser for viewing and analyzing eukaryotic genomes. It’s an essential tool for exploring genomic context, variations, and annotations.
Key Features:
- Multiple Data Tracks: View genes, variations, expression data, and more.
- Customizable Display: Add or remove tracks, adjust zoom levels.
- Data Export: Download sequence data or images for further analysis or publication.
Hands-on Example: Let’s explore the human BRCA1 gene region:
- Go to GDV (https://www.ncbi.nlm.nih.gov/genome/gdv/).
- Choose “Human” from the organism list.
- In the search box, enter “BRCA1”.
- Explore the gene region, noting nearby genes, variations, and other features.
- Use the “Tracks” button to add or remove data tracks as needed.
Gene
The Gene database provides a unified view of genes and their associated information across multiple species. It’s crucial for understanding gene function, expression, and evolution.
Key Features:
- Comprehensive Gene Information: Sequences, variations, expression, pathways, and more.
- Cross-Species Comparisons: Explore orthologs and gene evolution.
- Links to Related Resources: Easy access to literature, structures, and other NCBI databases.
Hands-on Example: Let’s investigate the p53 gene:
- Go to the Gene database (https://www.ncbi.nlm.nih.gov/gene/).
- Search for “TP53” (the official symbol for p53).
- Explore the gene summary, genomic context, and expression data.
- Check the “Orthologs” section to compare p53 across species.
- Use the links to explore related proteins, variations, and literature.
Literature Analysis Tools
PubMed Central
PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature. It’s an invaluable resource for accessing complete research articles.
Key Features:
- Full-Text Access: Many articles are freely available in their entirety.
- Advanced Search: Similar to PubMed, allowing for complex queries.
- Article Formats: View articles in web, PDF, or EPUB formats.
Hands-on Example: Let’s find and read a full-text article on CRISPR:
- Go to PMC (https://www.ncbi.nlm.nih.gov/pmc/).
- Search for “CRISPR gene editing review”.
- Use filters to narrow results (e.g., “Review” article type, last 5 years).
- Select an article and explore different viewing options (HTML, PDF).
- Use the article’s reference list to find related papers.
MeSH Database
Medical Subject Headings (MeSH) is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed. Understanding MeSH can greatly improve your literature search strategies.
Key Features:
- Hierarchical Structure: Broader and narrower terms for refining searches.
- Subheadings: Allow for more specific aspects of a topic.
- Automatic Term Mapping: PubMed uses MeSH to interpret your search terms.
Hands-on Example: Let’s use MeSH to improve a literature search:
- Go to the MeSH Database (https://www.ncbi.nlm.nih.gov/mesh/).
- Search for “gene editing”.
- Explore the MeSH tree structure and related terms.
- Click “Add to search builder” for relevant terms.
- Use the search builder to construct a PubMed query with MeSH terms.
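The search builder essentially assembles a query string from tagged terms. The sketch below mimics that by hand, assuming the PubMed field tags [MeSH Terms] and [MeSH Major Topic] and the heading/subheading syntax; the helper functions are hypothetical and exist only for this example.

```python
def mesh_term(heading, subheading=None, major_topic=False):
    """Format one MeSH heading, optionally with a subheading, for PubMed."""
    tag = "MeSH Major Topic" if major_topic else "MeSH Terms"
    term = f"{heading}/{subheading}" if subheading else heading
    return f'"{term}"[{tag}]'


def combine(terms, operator="AND"):
    """Join tagged terms with a Boolean operator inside parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"Unknown operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Articles indexed under Gene Editing (methods) and CRISPR-Cas Systems.
query = combine([
    mesh_term("Gene Editing", subheading="methods"),
    mesh_term("CRISPR-Cas Systems"),
])
print(query)
```

Pasting the printed query into the PubMed search bar should behave like a query built in the MeSH search builder. Keep in mind that very recent articles may not yet carry MeSH terms, so a MeSH-only search can miss the newest literature.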
Programmatic Access to NCBI Resources
For bioinformatics students, learning to access NCBI resources programmatically is crucial for handling large-scale data analysis and automating repet