Primary nucleic acid databases store nucleotide sequences submitted directly by researchers.
GenBank, EMBL, and DDBJ are the three most important primary databases, providing public access to raw sequence data.
GenBank is hosted by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (USA).
DDBJ is located in Japan.
EMBL is located in Europe.
These three databases exchange data daily, so each one holds the most up-to-date information.
They are considered primary databases because they contain the original sequencing data.
4.1 EMBL
EMBL Nucleotide Sequence Collection: This is a database maintained by the European Bioinformatics Institute (EBI).
Purpose: It stores primary nucleotide sequences.
Sources: The data originates from various sources, including individual scientists, gene sequencing facilities, and patent offices.
4.2 GenBank
GenBank is a publicly accessible database containing all publicly available nucleotide sequences and their protein translations.
It is maintained by the NCBI (National Center for Biotechnology Information) as part of the International Nucleotide Sequence Database Collaboration (INSDC).
GenBank provides access to DNA sequences from more than 100,000 species, generated in labs worldwide.
It has become a vital resource for researchers across the biological sciences.
GenBank is expanding rapidly, with an exponential growth rate since its inception.
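As an illustration of how a GenBank record can be requested programmatically, the sketch below builds a URL for NCBI's E-utilities EFetch service. The helper name genbank_record_url is hypothetical, and the accession U49845 is used only as an example; the code constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build an EFetch URL asking for one nucleotide record as GenBank flat-file text.

    The function name is illustrative; only the URL parameters come from E-utilities.
    """
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example accession, chosen only for illustration.
    print(genbank_record_url("U49845"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. For anything beyond occasional requests, NCBI's usage guidelines for E-utilities should be consulted.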
4.3 DDBJ
DDBJ is a nucleotide sequence database located at the National Institute of Genetics (NIG) in Shizuoka, Japan.
It is the only nucleotide sequence data bank in Asia.
While it primarily serves Japanese researchers, DDBJ accepts submissions from researchers in any country.
5 Primary protein databases
Primary protein databases store information about protein sequences and structures.
Sequence databases exist for both proteins and nucleic acids.
Structure databases are specific to proteins.
The Protein Data Bank (PDB) was established in 1971 and now houses over 200,000 macromolecular structures.
SWISS-PROT is a protein sequence database containing approximately 70,000 sequences from over 5,000 species.
These databases are openly accessible to researchers and contribute to scientific research and data analysis.
5.1 PDB
PDB is the primary source for structural information of biological macromolecules. It’s a widely used library containing both structural data and visual representations.
PDB serves as the foundation for many derived databases in structural bioinformatics.
The PDB archive stores 3D coordinates and related information about biological molecules. This data includes details about the atoms that make up proteins and their spatial arrangement.
PDB entries are available in several formats: legacy PDB, mmCIF, and XML.
PDB files contain a header section with extensive information. This includes a summary of the protein, citation information, and details of how the structure was solved.
PDB files also include a list of atoms and their coordinates.
The data in PDB files is determined through experimental methods. These methods are documented within the files.
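The list of atoms and coordinates mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming the fixed-column layout of ATOM and HETATM records in the legacy PDB format; the Atom type, the parse_atoms function, and the sample coordinates are illustrative and not taken from any real entry.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atom records from the text of a legacy PDB-format file."""
    atoms = []
    for line in pdb_text.splitlines():
        # Coordinate lines start with ATOM (polymer atoms) or HETATM (ligands, water).
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        atoms.append(Atom(
            serial=int(line[6:11]),        # columns 7-11
            name=line[12:16].strip(),      # columns 13-16
            res_name=line[17:20].strip(),  # columns 18-20
            chain_id=line[21],             # column 22
            res_seq=int(line[22:26]),      # columns 23-26
            x=float(line[30:38]),          # columns 31-38
            y=float(line[38:46]),          # columns 39-46
            z=float(line[46:54]),          # columns 47-54
        ))
    return atoms


# Made-up sample records laid out in the fixed-column format.
SAMPLE = """\
HEADER    EXAMPLE
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
HETATM    3  O   HOH A 101      10.000  11.500  -3.250  1.00 20.00           O
END
"""

if __name__ == "__main__":
    for atom in parse_atoms(SAMPLE):
        print(atom)
```

This sketch ignores alternate locations, insertion codes, and multi-model files. For real work, an established parser or the mmCIF format is the safer choice.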
RCSB Protein Data Bank supplements the data with tools and resources for research and education in various fields. These fields include molecular biology, structural biology, and computational biology.
5.2 SWISS-PROT
SWISS-PROT database: This database stores amino acid sequences and links them to relevant biological knowledge.
Comprehensive overview: SWISS-PROT provides a comprehensive overview of each protein entry by combining experimental findings from the literature with computationally predicted features.
Multidisciplinary data: SWISS-PROT offers a multidisciplinary overview of protein data, cross-referencing many other databases.
Focus on human and model organisms: SWISS-PROT prioritizes annotation of human and other model organisms, but includes annotations for all species.
Comprehensive family annotation: SWISS-PROT aims to provide adequate annotation for all protein families, with annotation for some families supplied through the HAMAP project.
Continuous updates: Protein families and groups are regularly analyzed to ensure accuracy and keep up with new discoveries.
TrEMBL’s role: TrEMBL supplements SWISS-PROT with automatically annotated entries for proteins not yet covered in SWISS-PROT, with the aim of covering all known protein sequences.
6 Secondary protein databases
Secondary protein databases store information derived from primary databases. They contain data like conserved protein sequences, active site residues, and signature sequences.
Entries from the Protein Data Bank (PDB) can be organized into secondary structure databases. Each groups proteins by structural class (e.g., all-alpha proteins, all-beta proteins) and provides information about their folds and secondary structural motifs.
Examples of secondary databases include SCOP (Cambridge University), PROSITE (Swiss Institute of Bioinformatics), CATH (University College London), and eMOTIF (Stanford).
6.1 CATH
CATH is a database that categorizes protein domains based on their folding structure.
It uses both manual and automated processes to categorize domains from the Protein Data Bank (PDB).
CATH offers an online interface for searching and downloading data.
The categorization is hierarchical, with four main levels:
Class: Based on secondary structure content (over 90% automatically assigned).
Architecture: Describes the fundamental arrangement of secondary structures, excluding connectivity (currently manually assigned).
Topology: Groups structures based on their topological connections and number of secondary structures.
Homologous superfamilies: Groups of protein domains with highly similar structures and functions.
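The four levels listed above are reflected in CATH's dotted numeric codes, such as 3.40.50.300, where each number identifies one level. The following is a minimal sketch of splitting such a code into its levels. The type and function names are hypothetical, and the class names in the lookup table are stated from memory of the CATH website and should be checked against the current release.

```python
from typing import List, NamedTuple

# Names of the main CATH classes, keyed by class number (assumed, see lead-in).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathCode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a code such as '3.40.50.300' into its four hierarchy levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def lineage(code: str) -> List[str]:
    """Return the code of every ancestor level, from class down to superfamily."""
    parts = [str(number) for number in parse_cath_code(code)]
    return [".".join(parts[:depth]) for depth in range(1, 5)]


if __name__ == "__main__":
    parsed = parse_cath_code("3.40.50.300")
    print(parsed)
    print(CATH_CLASSES.get(parsed.cath_class, "unknown class"))
    print(lineage("3.40.50.300"))
```

Keeping the levels as separate integers makes it easy to group domains at any depth, for example by comparing only the first two fields to find domains that share an architecture.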
6.2 SCOP
SCOP Database Purpose: To provide comprehensive information about structural and evolutionary relationships between proteins with known 3D structures deposited in the Protein Data Bank.
SCOP’s Aim: To enhance understanding of protein evolution.
"Family" Level: Groups of closely related proteins with clear evidence of a common evolutionary origin. Modern sequence comparison tools like BLAST, PSI-BLAST, and HMMER effectively identify these relationships.
"Superfamily" Level: Protein domains with more distant relationships. Their similarity lies mainly in shared structural features, such as conserved active or binding sites and similar modes of oligomerization, which suggest a probable common evolutionary origin.
6.3 PROSITE
PROSITE is a database that contains information about protein domains, families, and functions.
PROSITE includes patterns and profiles used to distinguish between different protein types.
ProRule is a set of rules that complements PROSITE by providing additional information about functionally or structurally important amino acids.
This additional information increases the discriminatory power of the patterns and profiles.
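PROSITE patterns are written in a compact notation, for example N-{P}-[ST]-{P} for N-glycosylation sites, that maps closely onto regular expressions. The following is a minimal sketch of that translation, assuming only the common syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition, and < or > for sequence ends). The function name prosite_to_regex is hypothetical, and rarer constructs are rejected rather than handled.

```python
import re

# One pattern element: a residue, x, [allowed] or {excluded}, with optional (n) or (n,m).
_ELEMENT = re.compile(
    r"^(?P<core>x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?$"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        anchored_start = element.startswith("<")  # must match at the N-terminus
        anchored_end = element.endswith(">")      # must match at the C-terminus
        element = element.lstrip("<").rstrip(">")
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = match.group("core")
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any amino acid except those listed
        # Single letters and [..] groups are already valid regex syntax.
        repeat = ""
        if match.group("lo"):
            high = match.group("hi")
            repeat = "{" + match.group("lo") + ("," + high if high else "") + "}"
        parts.append(
            ("^" if anchored_start else "") + core + repeat + ("$" if anchored_end else "")
        )
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex)
    print(re.search(regex, "MKNGSAL"))
```

Profiles, unlike patterns, are position-specific scoring matrices and cannot be expressed as regular expressions, so this approach applies only to the pattern entries.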
7 Composite sequence databases
Composite sequence databases simplify searches by combining numerous important data sources.
Each composite database draws on a different set of primary databases and applies different criteria in its search algorithms.
These databases streamline and consolidate various search avenues.
The National Center for Biotechnology Information (NCBI) maintains these nucleotide and protein databases, offering researchers free access through its extensive server network.
7.1 Meta-databases
Meta-databases are knowledge bases that gather information from various data sets to generate new data.
Purpose: They can combine data from diverse sources to create new, more insightful presentations or focus on specific areas like diseases or organisms.
Examples:
BioGraph: Integrates over 20 databases for information discovery.
Neuroscience Information Framework: Combines hundreds of neuroscience-related resources.
ConsensusPathDB: Consolidates data from 12 databases to understand molecular interactions.
Entrez (NCBI): A search and retrieval system that spans NCBI's biological databases.
8 Genomics and proteomics databases
Genomic databases: These databases store genetic variations, focusing on specific traits (single or multiple) or on specific populations or ethnic groups. They are central to human genome informatics, helping to identify the genetic causes of diseases and to document genomic variation.
Importance of genomic databases: These resources organize genetic data and variations so that they can serve as tools for molecular diagnostics, clinicians, and researchers. They aim to streamline access to this information for users.
Proteome databases: These databases collect and curate protein information from various sources, including publicly available sequences and scientific literature. They are searchable by species and provide easy access to protein data via the internet.
Example: ProteomicsDB, hosted by the Technical University of Munich (TUM), is an example of a proteome database.
8.1 The search engines for literature
The Internet has revolutionized access to the medical literature: It provides easy access to a wide range of materials, including journals (print and electronic), databases, dictionaries, and textbooks.
Online search engines facilitate research: Popular search engines like Google and Yahoo and specialized medical databases like MEDLINE and PubMed are crucial tools for accessing medical information.
Commercial web resources add to what is available: Websites like Medscape, MedConnect, and MedicineNet provide valuable medical information.
Online libraries act as meta-sites: Libraries like Medical Matrix and Emory Libraries offer links to various health resources worldwide.
Specialized dermatology websites expand resources: Websites like DermIS, DermNet, and Genamics JournalSeek cater to dermatological information needs.
Choosing the right search tool is crucial: Scientists should consider the advantages and disadvantages of the search engine or database they use, including factors like the type of content, user interface, reliability, and coverage period.
9.1 Humans
The Human Genome Project has significantly impacted genetics research and will soon affect biology and medicine.
The project aimed to map all human genes and decode the entire DNA sequence.
It has led to accelerated identification of genetic disease factors and advanced DNA technologies.
The project has helped identify genes responsible for many common and some rare genetic diseases.
It has uncovered disease-causing mutations, improving diagnosis and revealing new genetic pathways.
The project enables the study of genotype-phenotype correlation, examining the relationship between molecular defects and cellular malfunctions.
The Genome Database (GDB) provides public access to data on human genes, clones, polymorphisms, and maps.
GDB integrates data with scientific literature and other databases, including sequence databases, OMIM, and the Mouse Genome Database.
GDB features Comprehensive Maps for positional searches and visual presentations.
It includes a map viewer for printing maps and displaying query results graphically.
GDB collaborates with the HUGO Nomenclature Committee to maintain gene symbols and associated data.
As research shifts from mapping to sequencing and functional analysis, the GDB schema is expanding.
9.2 Animals
Genomics in animal agriculture: Genomics is important for developing new agricultural practices that make animal production more efficient and sustainable.
Benefits of using genomic information: Genomics can lead to healthier, faster-growing animals with improved disease resistance and stress tolerance, resulting in higher-quality products, reduced costs for farmers, and improved consumer satisfaction.
Challenges of animal genome research: Animal genome research requires significant time, money, and resources. Access to technology and expertise may be limited, especially for individual researchers or smaller institutions.
Collaboration in animal genome research: Researchers need to collaborate to advance the field and find practical applications for genomic data.
The Animal Genome Size Database: This database provides valuable information on the genome sizes of over 6,000 animal species, including taxonomic details, estimation methods, and links to additional resources.
9.3 Fungi
Understanding fungal interactions is crucial for harnessing their potential for human benefit. This includes areas like industrial production, energy production, and climate management.
A 5-year project is underway to sequence the genomes of 1,000 different fungal species. This initiative aims to create a comprehensive "Fungal Tree of Life" by sequencing at least two reference genomes from each of the over 500 known fungal families.
The project aims to collect data for future research on plant-microbe interactions, microbial greenhouse gas release and absorption, and environmental metagenomics.
The FungiDB database serves as a resource for functional genomics of pan-fungal genomes. It offers a user-friendly interface similar to EuPathDB, enabling complex and integrated searches.
FungiDB currently contains genomic sequences and annotations for 18 representative fungal species. These include species from Basidiomycota, Ascomycota, and Mucormycotina lineages.
FungiDB provides additional data from cell cycle microarray experiments, hyphal growth RNA-seq experiments, and yeast two-hybrid interaction studies.
9.4 Microorganisms
Microorganisms are incredibly diverse and abundant, with an estimated 12,000 known species and millions more yet to be discovered.
Bacteria can survive in a wide range of environments, from extreme heat and cold to high salt concentrations and even seven miles below the ocean surface.
The study of microorganisms has provided valuable insights into evolution, microbial biology and ecology.
MBGD (Microbial Genome Database for Comparative Analysis) facilitates the comparative analysis of completely sequenced microbial genomes by generating ortholog classification tables.
MBGD builds these tables from all-against-all similarity relationships between genes in different genomes.
The categorization tables generated by MBGD can be used to narrow down searches for organisms within specific taxonomic categories.
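To make the idea of deriving ortholog candidates from all-against-all similarity scores concrete, the sketch below uses reciprocal best hits between two genomes. This is a simpler, widely used heuristic that is swapped in for illustration; it is not MBGD's own clustering procedure, and the gene names and scores are made up.

```python
from typing import Dict, List, Tuple

# (gene in genome A, gene in genome B) -> similarity score, e.g. an alignment score.
Scores = Dict[Tuple[str, str], float]


def reciprocal_best_hits(scores: Scores) -> List[Tuple[str, str]]:
    """Return gene pairs that are each other's highest-scoring match."""
    best_for_a: Dict[str, str] = {}
    best_for_b: Dict[str, str] = {}
    for (gene_a, gene_b), score in scores.items():
        # Track the best partner in genome B for each gene in genome A.
        if gene_a not in best_for_a or score > scores[(gene_a, best_for_a[gene_a])]:
            best_for_a[gene_a] = gene_b
        # Track the best partner in genome A for each gene in genome B.
        if gene_b not in best_for_b or score > scores[(best_for_b[gene_b], gene_b)]:
            best_for_b[gene_b] = gene_a
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_for_a.items()
        if best_for_b.get(gene_b) == gene_a
    )


if __name__ == "__main__":
    # Hypothetical scores from an all-against-all comparison of two small genomes.
    example_scores: Scores = {
        ("a1", "b1"): 950.0,
        ("a1", "b2"): 120.0,
        ("a2", "b1"): 300.0,
        ("a2", "b2"): 870.0,
        ("a3", "b2"): 400.0,
    }
    print(reciprocal_best_hits(example_scores))
```

In this example a3 has no reciprocal partner, because its best hit b2 matches a2 more strongly. Extending the idea to many genomes and to gene families with duplications requires clustering methods beyond this pairwise rule.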
9.5 Plant and crop genomic database
PlantGDB is a database that stores molecular sequences of sequenced plant species.
EST sequences are assembled into "contigs," each of which tentatively represents an individual gene.
PlantGDB aims to identify gene groups shared across all plants or unique to specific species.
It integrates bioinformatics tools for gene prediction and cross-species comparison.
PlantGDB displays genomes of species with large-scale sequencing projects, combining EST and cDNA evidence for gene models.
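The assembly of EST sequences into contigs can be illustrated with a toy example. The following is a minimal sketch of greedy overlap merging, assuming error-free reads on the same strand. It is not PlantGDB's actual assembly pipeline, and the function names and reads are invented for illustration.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b, or 0 if below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Greedily merge the pair of sequences with the longest overlap until none remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: each sequence is a contig
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Made-up reads: the first three overlap, the fourth stands alone.
    reads = ["ATGGCGTAC", "GTACCTTGA", "TTGACCA", "GGGTTTAAACCC"]
    print(assemble(reads))
```

Real EST assemblers also handle sequencing errors, reverse-complemented reads, and reads contained entirely within others, none of which this sketch attempts.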