Week 4: Basic Statistics for Bioinformatics in R
Welcome to Week 4! Before we harness the predictive power of machine learning, we must master the statistical foundations that govern biological data. Understanding these concepts is the key to differentiating a true biological discovery from random noise.
1. The Biological Problem
Imagine you’ve sequenced cells from 50 breast cancer patients and compared them to samples from 50 healthy individuals. You notice that one specific gene seems to be expressed at a much higher level in the cancer samples than in the healthy ones.
How do you show that this difference is a real biological phenomenon and not just the result of random chance or minor variations in how the samples were processed? This is a fundamental challenge of bioinformatics: separating the biological “signal” from the “noise,” whether that noise comes from natural variation between individuals or from the experiment itself. Statistical tests allow us to quantify our confidence, moving beyond guesswork to rigorous evidence.
2. Intuition & Theory
To analyze biological data, we use three core concepts:
- Mean: The average value. It tells us the “center” of our data.
- Variance: How spread out our data points are. Do all our samples look similar, or are they wildly different?
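- Statistical Significance (p-values): The p-value is the probability of seeing a difference at least as large as the one we observed if there were truly no difference between the groups (the “null hypothesis”). A small p-value means our data would be surprising if chance alone were at work.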
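As a minimal sketch of the first two concepts, the snippet below computes the mean, variance, and standard deviation of a small vector in R. The expression values are made up purely for illustration and do not come from any real dataset.

```r
# Hypothetical expression values (arbitrary units) for one gene.
# These numbers are invented for illustration only.
expression <- c(5.1, 4.8, 6.2, 5.5, 5.9, 4.7)

gene_mean <- mean(expression)  # the "center" of the data
gene_var  <- var(expression)   # sample variance (divides by n - 1)
gene_sd   <- sd(expression)    # standard deviation, in the same units as the data

print(gene_mean)
print(gene_var)

# The standard deviation is the square root of the variance
print(all.equal(gene_sd, sqrt(gene_var)))
```

The standard deviation is often easier to interpret than the variance because it is expressed in the same units as the measurements themselves.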
Understanding the t-test and p-values
A “t-test” is essentially a way to compare the means of two groups while accounting for their variance and sample size. It asks: “Are the centers of these two groups far enough apart, given how much the groups overlap, that the underlying means are likely to be different?”
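To make the idea of "difference in means relative to spread" concrete, here is a small sketch that computes the t statistic by hand and compares it with the value reported by R's built-in `t.test()`. The two groups are simulated with `rnorm()`; the group sizes, means, and standard deviations are assumptions chosen only for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated expression values (assumed parameters, not real data)
cancer  <- rnorm(50, mean = 8, sd = 1.5)
healthy <- rnorm(50, mean = 7, sd = 1.5)

# Numerator: how far apart are the two group means?
mean_diff <- mean(cancer) - mean(healthy)

# Denominator: the standard error of that difference,
# built from each group's variance and sample size
std_error <- sqrt(var(cancer) / length(cancer) +
                  var(healthy) / length(healthy))

t_manual  <- mean_diff / std_error
t_builtin <- unname(t.test(cancer, healthy)$statistic)

print(c(manual = t_manual, builtin = t_builtin))
```

The two values should agree, because this is the form of the statistic that `t.test()` uses by default when it is given two samples (Welch's version, which does not assume equal variances). A larger absolute t value means the means are further apart relative to the noise.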
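In biology, we use the p-value as our benchmark for evidence. A p-value of less than 0.05 is the conventional (though not universal) threshold for “significance.” It means that if there were truly no difference between the groups, a difference at least as large as the one observed would show up less than 5% of the time. A p-value below 0.05 is therefore evidence that chance alone is an unlikely explanation. It does not tell us how large or biologically important the difference is, and when thousands of genes are tested at once, some will fall below 0.05 by chance alone.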
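One way to build intuition for the 0.05 threshold is to simulate many experiments in which there is no real difference between the groups and count how often the t-test still reports p < 0.05. The sketch below does this with simulated data; the number of experiments, group sizes, and distribution parameters are assumptions chosen for illustration.

```r
set.seed(1)  # make the simulation reproducible

n_experiments <- 2000

# Each "experiment" draws two groups from the SAME distribution,
# so any difference between them is pure chance.
p_values <- replicate(n_experiments, {
  group_a <- rnorm(50, mean = 7, sd = 1.5)
  group_b <- rnorm(50, mean = 7, sd = 1.5)
  t.test(group_a, group_b)$p.value
})

# Fraction of experiments that look "significant" despite no real effect
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

The printed fraction should land close to 0.05. In other words, the threshold controls how often we are fooled when nothing is going on, which is why a single p-value just under 0.05 should be read as evidence rather than proof.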
3. Visual Breakdown
To understand how p-values work, watch this excellent explanation from StatQuest:
4. Translating Theory to Code
In R, performing these tests is incredibly straightforward. Here are the core functions you will use in Thursday’s lab to start quantifying your data:
# --- Statistical Analysis Snippets ---
# 1. Calculating the mean of a group
gene_mean <- mean(cancer_group_data)

# 2. Calculating the standard deviation (the square root of the variance)
gene_sd <- sd(cancer_group_data)

# 3. Running a basic two-sample t-test comparing two groups
#    (e.g., cancer_group_data vs healthy_group_data).
#    Note: t.test() runs Welch's t-test by default; add var.equal = TRUE
#    for the classic Student's t-test.
result <- t.test(cancer_group_data, healthy_group_data)

# 4. View your results
print(result)
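The snippets above assume that `cancer_group_data` and `healthy_group_data` already exist as numeric vectors. The sketch below fills them with simulated values so the whole workflow can be run from start to finish; the simulation parameters are assumptions for illustration, not real expression data. It also shows how to pull individual numbers out of the result, and how R's default (Welch's t-test) differs from the classic Student's t-test requested with `var.equal = TRUE`.

```r
set.seed(123)  # make the simulated data reproducible

# Simulated expression values for one gene (assumed parameters)
cancer_group_data  <- rnorm(50, mean = 8, sd = 1.5)
healthy_group_data <- rnorm(50, mean = 7, sd = 1.5)

# Default: Welch's t-test (does not assume equal variances)
welch_result <- t.test(cancer_group_data, healthy_group_data)

# Classic Student's t-test (assumes both groups share one variance)
student_result <- t.test(cancer_group_data, healthy_group_data,
                         var.equal = TRUE)

# The result is a list, so individual pieces can be extracted by name
print(welch_result$method)
print(welch_result$p.value)
print(student_result$p.value)

# 95% confidence interval for the difference in means
print(welch_result$conf.int)
```

When the two groups have similar sizes and variances, the two versions usually give nearly identical p-values. Welch's version is generally the safer default when you are not sure the variances are equal.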