Week 5: Exploratory Data Analysis & Visualization (Part 1)
Welcome to Week 5! Before we apply complex machine learning models, we must “interview” our data to understand its shape, quality, and potential errors. This process is known as Exploratory Data Analysis (EDA). It is the crucial first step where we let the data speak to us before we make any assumptions.
1. The Biological Problem
Imagine you just received a spreadsheet containing RNA-seq expression counts for 20,000 genes across 10 biological samples. What if one of the sequencing machines malfunctioned during the run and produced complete garbage data for one of your samples?
With 200,000 data cells, you cannot read through the spreadsheet row by row to find the problem. We need mathematical and visual summaries that can quickly reveal outliers, distribution shapes, and experiment-wide failures. EDA allows us to safeguard our downstream analyses by spotting these errors early.
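As a rough illustration of the idea above, the sketch below builds a small simulated count matrix, deliberately breaks one sample, and then finds it with a single summary number per sample. Everything here is an assumption made for illustration: the data are simulated (not real RNA-seq), the names `counts` and `suspect_samples` are hypothetical, and the "less than half the typical median" cutoff is an arbitrary threshold, not a standard rule.

```r
# Simulated toy data (NOT real RNA-seq): 1,000 genes x 10 samples.
set.seed(42)
n_genes <- 1000
n_samples <- 10
counts <- matrix(
  rnbinom(n_genes * n_samples, size = 2, mu = 100),
  nrow = n_genes, ncol = n_samples,
  dimnames = list(paste0("Gene", seq_len(n_genes)),
                  paste0("Sample", seq_len(n_samples)))
)

# Pretend Sample7 failed: overwrite its column with near-zero noise.
counts[, "Sample7"] <- rpois(n_genes, lambda = 0.5)

# One number per sample instead of reading every cell.
sample_medians <- apply(counts, 2, median)
print(sample_medians)

# Flag samples whose median is far below the typical sample median.
# The 0.5 factor is an arbitrary, illustrative threshold.
typical_median <- median(sample_medians)
suspect_samples <- names(sample_medians)[sample_medians < 0.5 * typical_median]
print(suspect_samples)
```

Ten medians are easy to scan by eye, whereas the raw cells are not. The same per-sample comparison is what a side-by-side boxplot shows graphically.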
2. Intuition & Theory
EDA consists of two main parts: initial numeric inspection and visual exploration.
Numeric Inspection
Before drawing any plots, we want to look at the general range of our values. Are they mostly small numbers or massive values? What is the average?
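A minimal sketch of this kind of numeric inspection, using a made-up vector of expression values (the numbers and the name `expr_values` are hypothetical, chosen only to show a skewed case):

```r
# Hypothetical expression values for one sample (illustrative only).
expr_values <- c(0, 3, 5, 8, 12, 15, 21, 40, 95, 1200)

value_range  <- range(expr_values)   # smallest and largest value
value_mean   <- mean(expr_values)    # arithmetic average
value_median <- median(expr_values)  # middle value, less sensitive to extremes

print(value_range)
print(value_mean)
print(value_median)
```

Here the single value of 1200 pulls the mean far above the median. A large gap between the two is an early hint that the data are skewed or contain extreme values.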
1-Dimensional Visualizations
To see the distribution of our biological data, we use two main graphics:
- Histogram: This counts how frequently values fall into specific ranges (bins). It shows us the overall shape and skewness of the distribution.
- Boxplot: This divides our dataset into quartiles, four parts that each hold a quarter of the values. It provides a visual summary of the median, the spread, and extreme values (outliers).
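The numbers behind both graphics can be inspected without drawing anything. The sketch below uses a hypothetical toy vector and two base R functions: `hist()` with `plot = FALSE` to get the bin counts, and `boxplot.stats()` to get the values a boxplot would draw. The bin width of 10 is an arbitrary choice for illustration.

```r
# Hypothetical toy values with one extreme point (illustrative only).
expr_values <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 50)

# Histogram: count how many values fall into each bin, without plotting.
# Bins are (0,10], (10,20], ..., (40,50]; the lowest bin also includes 0.
h <- hist(expr_values, breaks = seq(0, 50, by = 10), plot = FALSE)
print(h$counts)  # 9 0 0 0 1

# Boxplot: lower whisker, lower hinge, median, upper hinge, upper whisker,
# plus any points that would be drawn individually as outliers.
b <- boxplot.stats(expr_values)
print(b$stats)
print(b$out)     # 50
```

The histogram counts show nine values piled into the first bin and one far to the right, which is what right skew looks like. The boxplot statistics tell the same story: the value 50 lies beyond the upper whisker and is reported separately.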
3. Visual Breakdown
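To understand how boxplots are constructed and how to interpret them, check out this breakdown: the box spans the first to the third quartile, the line inside it marks the median, the whiskers extend to the most extreme values that lie within 1.5 times the box height (the interquartile range, which is R's default), and any points beyond the whiskers are drawn individually as outliers.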
4. Translating Theory to Code
In R, we can inspect our data and construct basic visualizations with just a few lines of code:
# --- Exploratory Data Analysis & Visualizations ---
# Assumes `expression_data` is already loaded as a data frame of numeric
# expression values that includes a column named GeneA.

# 1. Take a peek at the first few rows of your genetic data matrix
head(expression_data)

# 2. Get a quick statistical summary (Min, Max, Median, Mean, etc.)
summary(expression_data)

# 3. Create a histogram to see the distribution of gene expression
hist(expression_data$GeneA, main="Distribution of Gene A Expression", xlab="Expression Level", col="skyblue")

# 4. Generate a boxplot to compare distributions side-by-side
boxplot(expression_data, main="Sample Distributions", ylab="Expression Value", col="tomato")