
Statistical Analysis

10 Platforms employed for statistical analysis

  • Platforms for Statistical Analysis:
    • Linux is the most commonly used operating system, typically alongside the R programming language and editors or IDEs such as RStudio, Jupyter Notebook, and Vim.
    • Galaxy is the main user-friendly option for analyzing Next Generation Sequencing (NGS) data.
    • DNAnexus and AIR (artificial intelligence-based RNA-seq) are cloud-based platforms with graphical interfaces, available by subscription.
    • Cloud services like Amazon Web Services and Microsoft Azure can handle massive genetic datasets but are largely command-line driven.

10.1 Downstream analysis and visualization

  • Downstream analysis is performed after the significantly expressed genes in a study have been identified.
  • This analysis provides biological context for the identified genes.
  • Two main types of downstream analysis are done on transcriptome datasets.

11 Gene ontology & pathway analysis

  • Gene Ontology (GO) analysis helps understand how genes function in a specific biological system.
  • GO classifies genes into three categories:
    • Cellular components
    • Molecular functions
    • Biological processes
  • Gene Enrichment Analysis (GEA) is a type of GO study that can be categorized into three groups: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), and modular enrichment analysis (MEA).

11.1 Singular enrichment analysis (SEA)

  • Singular Enrichment Analysis (SEA): This technique examines gene lists from high-throughput experiments (like NGS or microarrays) to identify functional categories that are over-represented within those lists.
  • Input Data: The input data for SEA is typically a set of genes, often defined by the user based on specific criteria.
  • Classification: The genes are categorized into the three major GO functional areas: cellular components, molecular functions, and biological processes.
  • Statistical Analysis: SEA uses statistical methods like Fisher’s exact test, EASE score, or Chi-square test to determine if there’s a significant association between the genes and their functional classifications.
  • Tools: Commonly used tools for SEA include DAVID, GOstat, and BiNGO.
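
The over-representation test behind SEA can be illustrated with a one-sided Fisher's exact test, computed here as the upper tail of the hypergeometric distribution. This is a minimal sketch using only the Python standard library; the function names, gene identifiers, and term labels are hypothetical toy values, and tools such as DAVID apply their own variants (for example the EASE score) and multiple-testing corrections that are not shown.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k: genes in the user list annotated with the term
    n: size of the user list
    K: genes in the background annotated with the term
    N: size of the background
    Returns P(X >= k) under random sampling without replacement.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def singular_enrichment(gene_list, background, term_to_genes):
    """Test every term independently against the same gene list."""
    background = set(background)
    genes = set(gene_list) & background
    results = {}
    for term, members in term_to_genes.items():
        members = set(members) & background
        overlap = len(genes & members)
        results[term] = enrichment_p_value(
            overlap, len(genes), len(members), len(background)
        )
    return results


# Hypothetical toy data: 10 background genes, 2 made-up terms.
toy_background = [f"g{i}" for i in range(1, 11)]
toy_terms = {
    "term_A": ["g1", "g2", "g3", "g4", "g5"],
    "term_B": ["g1", "g6", "g7", "g8", "g9"],
}
toy_list = ["g1", "g2", "g3", "g4", "g5"]
toy_results = singular_enrichment(toy_list, toy_background, toy_terms)

for term, p in sorted(toy_results.items(), key=lambda item: item[1]):
    print(f"{term}: p = {p:.4g}")
```

In this toy case all five listed genes fall in term_A, so its p-value is small, while term_B shares only one gene with the list and is not enriched. Real analyses test thousands of terms, so the raw p-values would normally be adjusted for multiple comparisons.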

11.2 Gene set enrichment analysis (GSEA)

  • GSEA uses all genes from a high-throughput study rather than a user-selected subset. This avoids the bias that an arbitrary selection cutoff can introduce.
  • Because no cutoff is applied, genes with small differential expression still contribute to the analysis, which makes it more comprehensive.
  • GSEA calculates a Maximum Enrichment Score (MES) for each annotation class from the positions of that class's genes in the ranked gene list. The score indicates the level of enrichment.
  • The p-value is determined by comparing the observed MES with the values expected by chance. This statistical test assesses the significance of the enrichment.
  • Tools like ErmineJ and FatiScan can be used for GSEA.
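
The ranking-based score and its significance test can be sketched as follows. This is an illustrative, unweighted running-sum statistic with a gene-set permutation test, written with the Python standard library only; it is an assumption about the general approach, not the exact algorithm implemented by ErmineJ or FatiScan. The function names and gene identifiers are hypothetical.

```python
import random


def max_enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up on set members and down
    otherwise; return the running sum with the largest absolute value."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores of random sets of equal size."""
    rng = random.Random(seed)
    observed = max_enrichment_score(ranked_genes, gene_set)
    size = len(set(gene_set) & set(ranked_genes))
    as_extreme = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if abs(max_enrichment_score(ranked_genes, random_set)) >= abs(observed):
            as_extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)


# Hypothetical ranking: g1 is the most up-regulated gene, g20 the least.
toy_ranking = [f"g{i}" for i in range(1, 21)]
toy_set = ["g1", "g2", "g3", "g4"]
toy_score, toy_p = permutation_p_value(toy_ranking, toy_set)
print(f"score = {toy_score:.2f}, permutation p = {toy_p:.4f}")
```

Because the whole ranked list is used, a set whose members cluster near the top can score highly even if no single gene would pass a differential-expression cutoff on its own.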

11.3 Modular enrichment analysis (MEA)

  • Modular enrichment analysis (MEA) combines SEA-type enrichment analysis with network discovery techniques that take term-to-term relationships into account.
  • MEA uses Kappa statistics to measure how strongly annotation terms agree in the genes they share, and may leave out genes that are not strongly associated with neighboring terms.
  • Platforms like ADGO, DAVID, and GeneCodis can perform MEA.
  • MEA integrates information from various domains, such as KEGG for pathway assessment, Pfam for protein domains, and TRANSFAC for transcriptional regulation.
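
The agreement measure mentioned above can be illustrated with Cohen's kappa computed between two annotation terms, treating each gene in the list as either annotated or not annotated by each term. This is a minimal sketch under that assumption; the function name and gene identifiers are hypothetical, and the clustering step that MEA tools build on top of such scores is not shown.

```python
def term_kappa(term_a, term_b, universe):
    """Cohen's kappa for the agreement of two terms over a gene universe.

    Each gene is a binary observation per term: annotated or not.
    Returns a value in [-1, 1]; 0 means agreement at chance level.
    """
    universe = set(universe)
    a = set(term_a) & universe
    b = set(term_b) & universe
    n = len(universe)
    both = len(a & b)
    neither = n - len(a | b)
    observed = (both + neither) / n
    p_a, p_b = len(a) / n, len(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined when both terms are constant over the
        # universe; report no agreement beyond chance.
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 10 genes and three made-up terms.
toy_universe = [f"g{i}" for i in range(1, 11)]
toy_term_1 = ["g1", "g2", "g3", "g4"]
toy_term_2 = ["g1", "g2", "g3", "g5"]   # shares 3 genes with term 1
toy_term_3 = ["g7", "g8", "g9", "g10"]  # shares none with term 1

kappa_similar = term_kappa(toy_term_1, toy_term_2, toy_universe)
kappa_disjoint = term_kappa(toy_term_1, toy_term_3, toy_universe)
print(f"similar terms:  kappa = {kappa_similar:.3f}")
print(f"disjoint terms: kappa = {kappa_disjoint:.3f}")
```

Terms with a high pairwise kappa are candidates for grouping into one module, so redundant annotations are reported together rather than as separate hits.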

11.4 Correlation networks

  • Correlation Networks: Analyzing gene lists for statistically significant associations is crucial for understanding gene interactions.
  • GeneMANIA: This program offers detailed information on gene interactions, including co-expression, co-localization, and physical interactions, providing a more comprehensive view than gene function alone.
  • BioGRID: A repository of genetic, chemical, and protein-protein interaction data, curated from published findings and updated regularly.
  • STRING: A database focusing on protein interactions, aiding in understanding how proteins encoded by genes work together.
  • WGCNA: An R package that uses microarray or RNA-sequencing data to construct correlation networks between genes in a specific study.
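
A correlation network of the kind described above can be sketched in a few lines. The example below computes pairwise Pearson correlations between gene expression profiles and raises their absolute values to a power, in the spirit of the soft-thresholded unsigned networks used by WGCNA. It is an illustration in standard-library Python, not a substitute for the R package; the function names, the `beta` and `min_weight` defaults, and the expression values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def correlation_network(expression, beta=6, min_weight=0.5):
    """Return weighted edges {(gene1, gene2): |r| ** beta} above a cutoff.

    expression: dict mapping gene name to a list of per-sample values.
    beta: soft-threshold power that suppresses weak correlations.
    min_weight: edges below this weight are dropped (illustrative choice).
    """
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        weight = abs(pearson(expression[g1], expression[g2])) ** beta
        if weight >= min_weight:
            edges[(g1, g2)] = weight
    return edges


# Hypothetical expression values for three genes across five samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with geneA
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],   # r = -0.8 with geneA
}
toy_edges = correlation_network(toy_expression)
for (g1, g2), weight in toy_edges.items():
    print(f"{g1} -- {g2}: weight = {weight:.3f}")
```

Raising the correlation to a power keeps strong relationships close to 1 while pushing moderate ones toward 0, which is why geneC (|r| = 0.8) drops out of this toy network at the chosen cutoff.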

12 Future prospects and conclusion

  • Statisticians are crucial to bioinformatics: They develop advanced models and analysis methods to extract meaningful biological insights from vast genomic data.
  • Integrative research is essential: Combining data from multiple systems is key to a deeper understanding of cellular biology, requiring innovative approaches that balance statistical rigor, scalability, and interpretability.
  • Statistical evaluation of omics studies is challenging: Agreement on the best methods is difficult, requiring further research to validate techniques and improve data integration.
  • Combining clinical and genetic information is a major challenge: Fully evaluating hypotheses and making results useful to the public remains a significant hurdle in bioinformatics.
  • Statistics has a unique opportunity: By providing researchers with tools to analyze large datasets, statistics can significantly contribute to scientific progress in bioscience and healthcare.