
1 Introduction

  • Bioinformatics is an interdisciplinary field that combines biology and computer science. It focuses on collecting, analyzing, and interpreting biological data, particularly the large datasets generated by genomic research.
  • The field is vast and encompasses a wide range of disciplines. This includes molecular biology, biochemistry, biophysics, statistics, and computer science.
  • Bioinformatics plays a crucial role in modern biology. It enables scientists to understand complex biological phenomena, analyze genetic information, and develop new treatments and therapies.
  • The field is rapidly evolving. It is transitioning from applied to fundamental research and moving from tool creation to hypothesis generation.
  • Bioinformatics, computational biology, and bio-information infrastructure are closely related. They share a common goal of using computational methods to study biological systems.

2 History

  • Early Pioneers (1960s):
    • The term "bioinformatics" was coined by Paulien Hogeweg and Ben Hesper in 1970, but the field’s roots trace back to the 1960s.
    • Margaret Oakley Dayhoff, considered the "mother and father of bioinformatics," led the creation of the first "Protein Information Resource" (PIR) in the 1970s, which organized protein sequence data.
    • Russell F. Doolittle and Walter M. Fitch also made significant early contributions.
  • Expansion and Database Development (1970s-1990s):
    • Elvin A. Kabat’s analysis of antibody sequences in the 1970s further advanced the field.
    • George Bell and his associates initiated the collection of DNA sequences that would become GenBank in 1974.
    • The DNA Data Bank of Japan (DDBJ) and the EMBL Nucleotide Sequence Data Library were established in 1984 and 1980, respectively.
    • Swiss bioinformaticians developed software for sequence comparison, protein structure modeling, and databases during the 1980s.
    • The Swiss Institute of Bioinformatics (SIB) was founded in 1998.
    • The development of web-based searching algorithms, like GENEINFO and BLAST, revolutionized database accessibility in the 1990s.
  • Modern Bioinformatics (1990s - Present):
    • The establishment of NCBI (the National Center for Biotechnology Information) in 1988, along with major databases such as PubMed (1997) and the human genome databases (1999), significantly impacted the field.
    • The integration of bioinformatics into biology education is increasing.
    • New fields like synthetic biology, systems biology, and whole-cell modeling are emerging due to the convergence of computer science and biology.

3 Biological databases

  • What is a biological database? A biological database is a structured collection of biological data in a computer-readable format, designed to improve search speed and retrieval. Such databases were created in response to the vast amount of data generated by DNA sequencing technologies.
  • Types of databases:
    • Primary databases: Contain raw, experimentally generated data like protein and nucleotide sequences. Examples include GenBank, DDBJ, EMBL, Swiss-Prot, PIR, and the Protein Data Bank (PDB).
    • Secondary databases: Compile and analyze data from primary databases, offering more sophisticated biological information. Examples include UniProtKB, motif databases, and InterPro.
    • Composite or derived databases: Combine data from primary and secondary databases. Examples include OMIM and Swiss-Prot.
  • Purpose of biological databases: To store, manage, and retrieve biological information, making it accessible to researchers, scientists, and students.
  • Data retrieval: Biological databases allow users to access a wide range of information, including binding sites, molecular actions, biological sequences, metabolic interactions, motifs, protein families, and homologous and functional relationships (a minimal retrieval sketch follows this list).
  • Database accessibility: Many databases are publicly available, allowing for efficient communication and collaboration among researchers.
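
As a concrete illustration of programmatic data retrieval from a public database, the sketch below fetches one GenBank nucleotide record through NCBI's Entrez interface using Biopython. It is a minimal example under stated assumptions: Biopython is installed, an internet connection is available, and the contact e-mail and accession number are placeholders.

```python
# Minimal sketch: retrieving a nucleotide record from GenBank via NCBI Entrez.
# Assumes Biopython is installed; the e-mail and accession below are placeholders.
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"   # NCBI requires a contact address

handle = Entrez.efetch(db="nucleotide", id="NM_000518",
                       rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()

print(record.id)          # accession of the retrieved sequence
print(len(record.seq))    # sequence length in bases
print(record.seq[:60])    # first 60 bases
```

The same pattern works for other Entrez databases (for example db="protein") by changing the database name and return type.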

4 Algorithms in computational biology

  • Computational biology and bioinformatics are interdisciplinary fields focused on using computers to address biological problems. While there’s a slight distinction, they’re often used interchangeably.
  • Bioinformatics emphasizes developing and using computational tools for biological data analysis.
  • Computational biology focuses on creating algorithms to solve biologically relevant problems.
  • Recent advancements in technology have generated a massive amount of biological data, requiring computational approaches for analysis and understanding.
  • Algorithms are crucial for managing, analyzing, and interpreting this data.
  • Developing a computational biology algorithm involves two key steps:
    • Formulating a biologically relevant question and building a model that translates it into a computational problem.
    • Creating an algorithm to solve the formulated problem.
  • The quality of an algorithm is assessed based on its space complexity, running time, and the biological relevance of the results.
  • Understanding basic computational algorithms is essential for bioinformaticians and researchers.
  • Expertise in developing novel algorithms provides a strategic advantage in both academia and industry.
  • Common algorithms in computational biology include:
    • Dynamic programming (Needleman-Wunsch and Smith-Waterman for sequence alignment; see the alignment sketch after this list)
    • Hidden Markov Models (HMM) for sequence modeling
    • Principal component analysis (PCA) and clustering
    • Phylogenetic tree construction
    • Machine learning applications (SVM, neural networks)
    • Microarray data analysis
    • Protein secondary structure prediction
  • Global and local sequence alignment helps understand protein relationships across different organisms.
  • HMMs use a probabilistic finite state machine to model DNA sequences, where the probability of an event depends on the previous state.
  • Gene regulation networks are formed by protein interactions within an organism, influencing cell type determination.
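
To make the dynamic-programming idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment. The scoring parameters (match +1, mismatch -1, gap -2) and the two short sequences are illustrative choices, not canonical values.

```python
# Minimal sketch of Needleman-Wunsch global alignment with simple scoring.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback to recover one optimal alignment.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, aligned_a, aligned_b = needleman_wunsch("GATTACA", "GCATGCU")
print(score)
print(aligned_a)
print(aligned_b)
```

Smith-Waterman local alignment uses the same recurrence but clamps cell scores at zero and starts the traceback from the highest-scoring cell.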

5 Genetic variation and bioinformatics

  • Genetic Variation is Fundamental: Genetic variations, changes in chromosome sequences, drive evolution, allowing organisms to adapt to their environment. These variations can be beneficial, harmful, or neutral.
  • Sources of Variation: The primary sources of genetic variation are mutations, which are permanent changes in DNA sequences, and recombination, where genetic material from parents combines during reproduction.
  • Single Nucleotide Polymorphisms (SNPs): SNPs are common variations in a single DNA base (A, G, C, T) and are crucial in understanding individual differences and disease susceptibility.
  • Bioinformatics and Genetic Variation: Bioinformatics plays a critical role in analyzing genetic variations, particularly SNPs. Algorithms help classify variations as disease-causing or neutral, with resources such as VEP, SIFT, and dbSNP providing valuable insights (a minimal variant-counting sketch follows this list).
  • Bioinformatics Applications in Genetic Variation: Bioinformatics enables:
    • Development of algorithms and software for analyzing genetic variations.
    • Analyzing genetic variations in the genome, including SNPs.
    • Studying large-scale datasets to understand disease prevalence and genetic factors.
    • Identifying genetic variations, annotating their functions, and simulating their effects on pathways.
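
As a small illustration of working with variant data, the sketch below counts single-nucleotide variants per chromosome in a VCF file using only the Python standard library. The file name variants.vcf is a hypothetical placeholder; real analyses would typically rely on dedicated libraries or the tools named above.

```python
# Minimal sketch: counting single-nucleotide variants (SNPs) per chromosome
# in a VCF file. "variants.vcf" is a hypothetical example path.
from collections import Counter

snp_counts = Counter()
with open("variants.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):          # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alts = fields[0], fields[3], fields[4].split(",")
        # Count as a SNP: single-base REF and at least one single-base ALT allele.
        if len(ref) == 1 and any(len(alt) == 1 and alt != "." for alt in alts):
            snp_counts[chrom] += 1

for chrom, count in sorted(snp_counts.items()):
    print(f"{chrom}\t{count}")
```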

6 Structural bioinformatics

  • Structural bioinformatics focuses on predicting and analyzing the 3D structures of macromolecules like DNA, RNA, and proteins (a minimal structure-parsing sketch follows this list).
  • Understanding protein structure is crucial because it determines function, is more conserved than sequence, and allows for the design and modification of proteins for medical and industrial applications.
  • Protein structure visualization is essential in structural bioinformatics, with common methods including cartoon, lines, surface, and sticks representations.
  • The field of protein structure prediction has advanced significantly, with computer calculations now able to predict secondary and tertiary structures with varying degrees of accuracy.
  • High-throughput methods have provided the knowledge needed to link protein structures to their functions, leading to applications in medical science.
  • Despite advancements in 3D structure prediction, it’s important to remember that proteins are dynamic systems, not static structures.
  • Molecular dynamics simulations, though computationally demanding, are becoming more accessible with increasing computing power, offering insights into protein dynamics.
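
As a minimal illustration of programmatic structure analysis, the sketch below loads a PDB file with Biopython and summarizes its chains. Here example.pdb is a hypothetical local file, and the summary is deliberately simple.

```python
# Minimal sketch: loading a protein structure and summarizing chains and residues.
# Assumes Biopython is installed; "example.pdb" is a hypothetical local file.
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("example", "example.pdb")

model = structure[0]                      # first model (NMR files may contain several)
for chain in model:
    # Hetero flag " " keeps standard residues and excludes waters and ligands.
    residues = [res for res in chain if res.id[0] == " "]
    atoms = sum(len(res) for res in residues)
    print(f"Chain {chain.id}: {len(residues)} residues, {atoms} atoms")
```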

7 High-throughput technology

  • High-throughput sequencing (HTS) has revolutionized molecular biology: It allows for large-scale sequencing of DNA and RNA, enabling comprehensive studies of gene expression, genome sequencing, protein interactions, and other biological processes.
  • HTS is used for RNA-seq and genome sequencing: RNA-seq analyzes RNA sequences, while genome sequencing focuses on assembling the complete genomic DNA.
  • HTS is a powerful alternative to microarrays: It offers wider applicability to non-model organisms and can detect rare genetic variations.
  • HTS generates complex, high-dimensional data: This requires specialized bioinformatics tools for analysis, storage, and integration with other data sources (a minimal FASTQ-summary sketch follows this list).
  • Network-based approaches are valuable for integrating HTS data: These approaches can combine data from various sources to provide a comprehensive picture of biological processes.
  • Multidisciplinary collaborations are essential: Combining expertise from biologists, physicians, and bioinformaticians is crucial for translating HTS data into meaningful insights for healthcare.
  • High-performance computing (HPC) is critical for HTS analysis: HPC resources, including GPUs, clusters, and cloud computing platforms, provide the computational power needed for complex analyses.
  • HTS applications span a wide range: Examples include read mapping, exome analysis, and large-scale genomic projects like the 1000 Genomes Project and the International Cancer Genome Consortium.
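
As a small example of handling raw HTS output, the sketch below streams a FASTQ file and reports basic read statistics. The path reads.fastq is a hypothetical placeholder, and the parser assumes the common 4-line-per-record layout with Phred+33 quality encoding.

```python
# Minimal sketch: streaming a FASTQ file and computing read count, mean read length,
# and mean Phred base quality. "reads.fastq" is a hypothetical example path.

def fastq_records(path):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            fh.readline()                 # separator line starting with '+'
            qual = fh.readline().rstrip()
            yield header, seq, qual

reads = total_bases = total_quality = 0
for _, seq, qual in fastq_records("reads.fastq"):
    reads += 1
    total_bases += len(seq)
    # Phred+33 encoding: quality score = ASCII code - 33.
    total_quality += sum(ord(ch) - 33 for ch in qual)

if reads:
    print(f"{reads} reads, mean length {total_bases / reads:.1f}, "
          f"mean base quality {total_quality / total_bases:.1f}")
```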

8 Drug informatics

Drug Informatics: A Bridge Between Data and Patient Care

  • Definition: Drug Informatics combines computer techniques and pharmaceutical expertise to study and understand drugs, their mechanisms, and structures. It focuses on enhancing medication awareness and improving patient outcomes.
  • Problem: The sheer volume and complexity of drug and disease data, fueled by the genomic revolution, makes it impossible for healthcare professionals to manually manage all relevant information for safe and effective care.
  • Solution: Drug Informatics provides the tools and techniques to manage this data effectively, leading to better drug use in clinical, commercial, and research settings.
  • Data Lifecycle: Drug Informatics encompasses all data generated throughout a drug’s lifecycle, from lab research to patient use.
  • Information Sources: Drug informatics data comes from:
    • Primary: Research labs, pharmaceutical companies, and clinical observations. This includes published and unpublished data, excluding review articles and editorials.
    • Secondary: Data generated from non-clinical uses of drug information, like pharmacy benefit management. This includes sources that abstract or index primary literature.
    • Tertiary: Compiled and condensed overviews of the subject, such as journal review papers, textbooks, and general internet information.
  • Evolution:
    • Early Stages (1960s): Information technology focused on clerical and financial systems.
    • Growth (1980s): Network technology and personal computers led to the development of clinically oriented computing systems for healthcare.
  • Current State: Drug Informatics is a rapidly developing field, leveraging information technology and computer science to manage, analyze, and organize drug-related data.
  • Key Aims:
    • Disseminate Knowledge-Based Information: Healthcare scientific literature.
    • Disseminate Patient-Specific Information: Data generated during patient care.
  • Future Direction: Efforts are underway to improve healthcare system safety, quality, and efficiency while reducing costs.

9 Systems and network biology

  • Network and Systems Biology is a field focused on analyzing complex biological systems through a comprehensive approach.
  • This field seeks to understand the intricate interactions between numerous biological molecules.
  • The rise of powerful computer tools and the generation of massive biological data (genomes, proteomes, transcriptomes) have fueled the growth of Network and Systems Biology.
  • This field integrates various disciplines, including computer science, biology, chemistry, physics, and statistics.
  • The goal of Network Biology is to understand cells and organisms as complete units, encompassing their mechanisms and functions.
  • Systems biology confronts the challenge of analyzing vast biological networks and molecular data.
  • Types of data in systems biology: DNA sequences, molecular structures of DNA and RNA, gene expression data, protein-protein interaction data, and metabolic pathway data.
  • Network generation: Different data types are integrated to create networks (a minimal network-analysis sketch follows this list).
  • Multivariate analysis: Techniques such as regression analysis, PCA, and clustering are used to analyze these networks.
  • Applications of Network Algorithms in Systems Biology:
    • Function prediction: Determining the functions of unknown entities through network creation.
    • Protein complex detection: Utilizing methods like yeast two-hybrid (Y2H) screening and affinity purification-mass spectrometry (AP-MS).
    • Analyzing evolution, drug development, disease diagnosis, and interaction prediction.
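
To illustrate the network side, here is a minimal sketch that builds a small protein-protein interaction graph with NetworkX and ranks nodes by degree centrality. The listed interactions are invented placeholders, not real data.

```python
# Minimal sketch: building a small protein-protein interaction network and ranking
# proteins by degree centrality. The interactions below are invented placeholders.
import networkx as nx

interactions = [
    ("ProteinA", "ProteinB"),
    ("ProteinA", "ProteinC"),
    ("ProteinB", "ProteinD"),
    ("ProteinC", "ProteinD"),
    ("ProteinD", "ProteinE"),
]

ppi = nx.Graph()
ppi.add_edges_from(interactions)

centrality = nx.degree_centrality(ppi)
for protein, score in sorted(centrality.items(), key=lambda item: item[1], reverse=True):
    print(f"{protein}\t{score:.2f}")
```

In practice the edge list would come from an interaction database or from Y2H / AP-MS experiments rather than being written by hand.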

10 Machine learning in bioinformatics

  • Machine learning analyzes large datasets to create predictive tools. It differs from traditional algorithms by learning from input-output relationships, allowing it to predict outputs for new data.
  • Machine learning is crucial in bioinformatics due to the rapid growth of high-throughput technologies. The field is embracing an AI-based approach to handle the deluge of data.
  • Machine learning algorithms fall into three categories: supervised learning (learning from known input-output pairs), unsupervised learning (clustering data based on similarities), and reinforcement learning (choosing actions based on reward and penalty signals); a supervised-learning sketch follows this list.
  • Machine learning is widely applied in bioinformatics, making it an essential tool for bioinformaticians.
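
As a concrete supervised-learning example, the sketch below trains a support vector machine on scikit-learn's bundled breast cancer dataset and reports held-out accuracy. The kernel and parameters are standard defaults chosen for illustration, not a tuned model.

```python
# Minimal sketch of supervised learning: training an SVM classifier on a bundled
# toy dataset (scikit-learn's breast cancer data) and checking held-out accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = SVC(kernel="rbf", C=1.0, gamma="scale")   # standard default configuration
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(f"Held-out accuracy: {accuracy_score(y_test, predictions):.3f}")
```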

11 Bioinformatics workflow management systems

  • Bioinformatics workflow management systems are essential for compiling and implementing sequences of bioinformatics steps, ensuring reproducibility of results.
  • Pipelines are divided into components, each built with standardized inputs and outputs so they can be integrated independently (a minimal component-chaining sketch follows this list).
  • Workflow frameworks offer visual interfaces, allowing users to create complex programs without extensive programming knowledge.
  • Popular workflow management systems:
    • KNIME (Konstanz Information Miner): Free, open-source data processing framework with modular components.
    • Online HPS: Offers high-performance computing and job flow services.
    • Galaxy: Helps scientists without programming skills manage computational biology results.
    • UGENE: Software application for bioinformatics, operating on various platforms, allowing analysis of genetic data.
    • GenePattern: Publicly accessible, open-source software kit for replicating genomic analyses.
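
To illustrate the component idea in plain Python, the sketch below chains three toy steps that each accept and return a dictionary, so any step can be developed, tested, or swapped independently. The step logic, names, and sequences are illustrative placeholders rather than any particular system's API.

```python
# Minimal sketch of the "components with standardized inputs/outputs" idea:
# each step takes and returns a plain dict, so steps can be chained or swapped freely.
# The step logic and the sequences below are illustrative placeholders.

def load_sequences(params):
    # In a real pipeline this would read the FASTA/FASTQ file named in params.
    return {"sequences": ["ATGCGT", "ATGAAA", "TTTCGT"]}

def filter_short(data, min_length=6):
    kept = [s for s in data["sequences"] if len(s) >= min_length]
    return {"sequences": kept}

def gc_content(data):
    stats = {s: (s.count("G") + s.count("C")) / len(s) for s in data["sequences"]}
    return {"gc_content": stats}

# Chain the components; each consumes the previous component's output.
result = gc_content(filter_short(load_sequences({"input": "example.fasta"})))
print(result)
```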

12 Applications of bioinformatics

  • Drug Designing: Bioinformatics tools allow for the analysis of protein structures, leading to more targeted and effective drug development.

  • Personalized Medicine: By analyzing individual genetic profiles, bioinformatics enables physicians to prescribe more personalized treatments tailored to a patient’s specific needs.

  • Gene Therapy: Bioinformatics facilitates the use of gene manipulation for disease treatment, prevention, and cure.

  • Microbial Genome Applications: Bioinformatics aids in understanding complex microbial ecosystems, which has implications for various fields like energy, industry, and health.

  • Evolutionary Studies: Bioinformatics plays a crucial role in phylogenetic studies, helping to reconstruct the tree of life by analyzing genome sequences.

  • Nextflow is a workflow framework and domain-specific language (DSL) developed by the Comparative Bioinformatics Group at CRG. It allows scientists to use software containers to create scalable and reusable scientific workflows.

  • Nextflow can be run locally or on AWS. For resource-intensive procedures, AWS is recommended, but instances need to be terminated after use.

  • One example is an RNA-Seq analysis workflow built with Nextflow. This workflow has accumulated over 3,700 changes and uses various programs; reusing it offers significant time savings compared to starting from scratch.

  • Nextflow can handle millions of samples with sufficient computing resources. 23andMe uses Nextflow for its genetic data analysis.

  • For industry-scale data processing, bioinformatics workflow managers may not be the best option. Ginkgo Bioworks uses Airflow, Celery, and AWS Batch for terabyte-scale data processing.

  • Nextflow is well-suited for biotechnology companies and university labs.

  • A key advantage of Nextflow is the separation of workflow implementation from execution platform configuration. This makes workflows portable and adaptable to different computing environments.