What is a database?

1 Introduction

  • What is a database? A database is a large, organized collection of data designed to be durable and accessible.
  • DBMS Role: Database Management Systems (DBMS) like MySQL, Oracle, and PostgreSQL manage data storage, access, and modification.
  • Biological Databases: These databases focus on data from natural sciences, specifically molecular science and bioinformatics.
  • Importance of Biological Databases:
    • Handling Big Data: They help manage the vast amounts of data generated by advancements in molecular research and sequencing.
    • Personalized Medicine: Contribute to developing tailored treatments and drug prescriptions.
    • Genetic Disease Treatment: Support the modification of DNA for treating genetic disorders.
    • Bioinformatics Hub: Serve as central repositories for bioinformatics data, allowing for efficient retrieval and analysis.
    • Diverse Applications: Support applications in bioweapons development, evolutionary research, agriculture, and food science.
    • Knowledge Discovery: Machine learning and natural language processing (NLP) enable automated curation and the uncovering of hidden insights within raw data.
    • Accessibility: Provide multiple access points for retrieving publicly available data.
    • Improved Indexing & Reduced Redundancy: Enhance data accessibility through indexing and minimize data duplication through computational and manual curation.

1.1 Characteristics of biological data

  • Complexity: Biological data is highly complex, requiring models that can capture intricate substructures and relationships to avoid information loss.
  • Data Model Flexibility: Biological data models need adaptability to handle diverse data types and values due to exceptions and overlaps in data collected for various species and genome projects.
  • Dynamic Schemas: Biological databases undergo constant schema modifications. Traditional database systems struggle to accommodate these changes, leading to frequent re-releases of entire databases instead of incremental updates.
  • Versioning and History: Biological data requires mechanisms to track and access previous versions of a record. GenBank’s accession and GI numbers are examples of identifiers used for this kind of version control.
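
To make the versioning point concrete, here is a minimal Python sketch that assumes identifiers written in the "ACCESSION.VERSION" style. The identifiers and the helper names (`parse_accession`, `latest`) are made up for illustration and are not part of any database's API.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str  # stable identifier for the record
    version: int    # version number of that record


def parse_accession(identifier: str) -> VersionedAccession:
    """Split an 'ACCESSION.VERSION' string into its two parts."""
    accession, sep, version = identifier.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {identifier!r}")
    return VersionedAccession(accession, int(version))


def latest(identifiers: list[str]) -> dict[str, int]:
    """Map each accession to the highest version number seen."""
    newest: dict[str, int] = {}
    for identifier in identifiers:
        record = parse_accession(identifier)
        newest[record.accession] = max(newest.get(record.accession, 0), record.version)
    return newest


# Hypothetical identifiers, used only to exercise the functions.
history = ["EX000001.1", "EX000001.3", "EX000002.1"]
newest_versions = latest(history)
```

Keeping the accession stable while the version changes lets older versions remain addressable after a record is updated.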

2.1 Primary database

  • Primary databases, also known as archival databases, are mainly composed of datasets generated through experiments. These datasets include nucleotide and protein sequences as well as information on how macromolecules are assembled.
  • Primary databases are augmented with functional annotations, references, and links to other databases.
  • The information in primary databases is entered by the researchers themselves and assigned a unique accession number.
  • Types of primary databases include:
    • Primary nucleotide sequence databases: GenBank, EMBL, and DDBJ are the three major databases that store raw nucleic acid sequences.
    • Microarray/Functional genomics databases: These databases focus on experiments using high-throughput technologies to analyze transcripts, proteins, and metabolites.
    • Protein sequences and structure databases: PIR-PSD and SWISS-PROT store protein sequences. PIR-PSD is a well-annotated, comprehensive database, implemented in an object-relational database management system, that organizes protein sequences into groups using the concept of "superfamilies."

2.2 Secondary database

  • Secondary databases store information about protein families, conserved sequences, and active site residues. These databases are created by aligning related protein sequences. Examples include SCOP, CATH, PROSITE, and eMOTIF.
  • Protein databases are essential for modern biology research. They contain information about protein structures, functions, and sequences. Searching these databases allows scientists to compare proteins and learn about relationships between them.
  • Functional information databases store data needed for a specific function. This information is useful for planning and managing work.
  • Nucleotide sequence and annotation databases help scientists understand the structure and function of genes and proteins. Annotation involves analyzing raw sequence data to identify genes, their functions, and other important information.

2.3 Composite database

  • Composite databases integrate information from various major databases, eliminating the need to search multiple locations.
  • Each composite database utilizes a specific primary database and unique search criteria. This allows for a variety of search methods within the composite database.
  • The National Center for Biotechnology Information (NCBI) provides unrestricted access to nucleotide and protein databases, hosted on their high-performance servers.
  • NCBI also links to the Online Mendelian Inheritance in Man (OMIM) database, which offers information on proteins associated with inherited diseases.
  • UniProt can function as both a primary and a secondary database. It accepts primary peptide sequences and also derives protein clusters from them, integrating data from TrEMBL and Swiss-Prot.

3 Models of databases

  • Database models are the frameworks used for storing data within a DBMS.
  • The choice of model significantly impacts data retrieval and storage effectiveness.
  • Early models were simple, often using two-dimensional tables or single files.
  • Modern models are more complex and interconnected due to the growth of data.
  • Common models include Flat File, Hierarchic, Network, Entity-Relationship, and Relational.
  • The ideal model depends on the relationships within the data and on application requirements (speed, accuracy, usability, adaptability, cost).
  • Most DBMS are designed around a specific model, though they may support multiple models.
  • Models provide structure for data organization and outline actions that can be performed on the data.
  • Actions like "select" and "join" are fundamental building blocks for query languages, even when a model does not name them explicitly.
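
As a rough illustration of "select" and "join" as model-level actions, the sketch below implements both over plain Python lists of dictionaries. The sample rows and field names are hypothetical, and a real DBMS would implement these operations far more efficiently.

```python
def select(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Pair every left row with every right row sharing the same key value."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]


# Hypothetical sample data.
genes = [
    {"gene": "geneA", "organism_id": 1},
    {"gene": "geneB", "organism_id": 2},
    {"gene": "geneC", "organism_id": 1},
]
organisms = [
    {"organism_id": 1, "name": "Organism one"},
    {"organism_id": 2, "name": "Organism two"},
]

joined = join(genes, organisms, "organism_id")
from_organism_one = select(joined, lambda row: row["name"] == "Organism one")
gene_names = [row["gene"] for row in from_organism_one]
```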

3.1 Flat file

  • Flat File Databases: The most basic database structure is the flat file, which organizes information in a table-like format with columns (fields) and rows (records).
  • Organization: Columns define different fields (data categories), while rows contain data for a single record, all sharing a common ID.
  • Evolution: The relational database model emerged from the flat file model.
  • Biological Example: GenBank entries illustrate the flat file paradigm in biological databases, with fields and values representing data elements.
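
The field-and-value idea can be sketched with a small parser. The record below is a simplified, made-up entry in the spirit of a GenBank flat file (keyword at the start of a line, indented continuation lines, "//" as the end-of-record marker); it is not a complete treatment of the real format.

```python
def parse_flat_record(text: str) -> dict[str, str]:
    """Turn one flat-file record into a {field: value} dictionary."""
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.strip() == "//":      # end-of-record marker
            break
        if not line.strip():          # skip blank lines
            continue
        if line[0].isspace():         # continuation of the previous field
            if current is not None:
                fields[current] += " " + line.strip()
            continue
        keyword, _, value = line.partition(" ")
        current = keyword
        fields[current] = value.strip()
    return fields


# Hypothetical record used only for illustration.
record_text = """
LOCUS       EXAMPLE01    12 bp    DNA
DEFINITION  Hypothetical example record,
            continued on a second line.
ACCESSION   EXAMPLE01
//
"""

record = parse_flat_record(record_text)
```

Each keyword plays the role of a column name and each parsed dictionary corresponds to one row of the flat table.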

3.2 Hierarchical model

  • Hierarchical model: Data is organized in a tree-like structure with a root node and child nodes connected to parent nodes.
  • Root Node: The hierarchy starts with the root data, representing the top level of the structure.
  • Child Nodes: Additional nodes are added as children of existing parent nodes, expanding the tree.
  • One-to-Many Relationship: A parent node can have many child nodes, but each child node is connected to only one parent node.
  • Tree-like Structure: Data is arranged in a hierarchical fashion, reflecting relationships between different categories.
  • Example: In biological sciences, the cell type acts as the root node in a biological function network database. Organelles, cytoskeletal elements, pathways, and functions are all linked as child nodes, forming a hierarchical structure.
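
A minimal sketch of the hierarchical model, assuming a hypothetical `Node` class in which every node keeps a single parent reference. The node names follow the cell example above and are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    name: str
    parent: Optional["Node"] = None               # exactly one parent (None for the root)
    children: list["Node"] = field(default_factory=list)

    def add_child(self, name: str) -> "Node":
        """Create a child node attached to this node and return it."""
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        """Names from the root down to this node."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# Hypothetical hierarchy: cell type -> organelle -> pathway.
root = Node("cell type")
organelle = root.add_child("organelle")
pathway = organelle.add_child("pathway")
pathway_path = pathway.path()
```

Because each node has only one parent, there is exactly one path from the root to any record, which is what makes the structure a tree.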

3.3 Network model

  • Network Model: This model is a network-based extension of the hierarchical model.
  • Graph-like Structure: Data is structured like a graph, allowing nodes to have multiple parent nodes.
  • Interconnected Data: The network model features a high degree of interconnected data, making access easier and faster.
  • Many-to-Many Relationships: This model supports many-to-many relationships between data entities.
  • Prevalence Before Relational Databases: The network model was widely used before the advent of relational databases.
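
The difference from the hierarchical model can be sketched by letting a node have several parents. The `Network` class and the node names below are hypothetical and only illustrate the many-to-many idea.

```python
from collections import defaultdict


class Network:
    """Directed links in which a node may have any number of parents."""

    def __init__(self):
        self.children = defaultdict(set)  # parent -> set of children
        self.parents = defaultdict(set)   # child  -> set of parents

    def link(self, parent: str, child: str) -> None:
        """Record a parent-to-child link in both directions."""
        self.children[parent].add(child)
        self.parents[child].add(parent)


# Hypothetical data: one protein takes part in two pathways,
# and one pathway contains two proteins (many-to-many).
net = Network()
net.link("pathwayA", "proteinX")
net.link("pathwayB", "proteinX")
net.link("pathwayA", "proteinY")

parents_of_x = net.parents["proteinX"]
members_of_a = net.children["pathwayA"]
```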

3.4 Entity relationship model

  • Entity relationship model: This model describes a domain in terms of its entities (the things being recorded) and their attributes (the properties of those things).
  • Linking: Relationships are established between different entities.
  • Conversion: Designs using this model can be converted into tables for use in the relational model.

3.5 Relational database model

  • Key Features:
    • Organizes data in two-dimensional tables.
    • Uses common fields (primary and foreign keys) to link tables.
    • Primary Key: A unique identifier within a table, ensuring that no two rows share the same key value. Examples: accession number, index number.
    • Foreign Key: A field connecting one table to another (often the primary key of the linked table).
  • Origin: Developed by E.F. Codd in 1970 to make database management more application-independent.
  • Core Concepts:
    • Relations: Represented as tables with rows and columns.
    • Attributes: Columns representing data categories.
    • Tuples: Rows representing individual records.
  • Prevalence: Widely adopted as the most common database model due to its flexibility and efficiency.
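
The key features above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and rows are hypothetical, chosen only to show a primary key, a foreign key, and a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign-key enforcement off by default

conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,      -- primary key: unique per row
    name        TEXT NOT NULL
);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,         -- primary key: unique per row
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),  -- foreign key
    length      INTEGER NOT NULL
);
""")

conn.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(1, "Organism one"), (2, "Organism two")],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("EX000001", 1, 1200), ("EX000002", 1, 800), ("EX000003", 2, 450)],
)

# Select rows from one table and join them to the other through the shared key.
rows = conn.execute("""
    SELECT s.accession, o.name
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    WHERE s.length > 500
    ORDER BY s.accession
""").fetchall()
```

Inserting a second row with an existing `accession` raises `sqlite3.IntegrityError`, which is how the primary key prevents duplicate records.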

3.6 Other models

  • Inverted File Model: Indexes data content as keys in a lookup table, with pointers to the locations of the items that contain each key.
  • Dimensional Model: An adaptation of the relational model for representing data in data warehouses, allowing for easy summarization through OLAP queries.
  • Graph Model: Offers a more flexible structure than the network model, enabling connections between any nodes.
  • Multivalue Model: Similar to the relational model, but allows an attribute to hold a list of values rather than a single value, which gives records greater depth.
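
As an illustration of the inverted file idea from the list above, the sketch below builds a lookup table from words to the records that contain them. The record identifiers and descriptions are hypothetical, and real systems use more careful tokenization.

```python
from collections import defaultdict


def build_inverted_index(records: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of record IDs whose text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return dict(index)


# Hypothetical records: ID -> free-text description.
records = {
    "EX1": "kinase binding protein",
    "EX2": "membrane transport protein",
}

index = build_inverted_index(records)
protein_hits = index["protein"]
kinase_hits = index["kinase"]
```

A keyword lookup then becomes a single dictionary access instead of a scan over every record.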