Week 5: Exploratory Data Analysis & Visualization (Part 1)

Welcome to Week 5! Before we apply complex machine learning models, we must “interview” our data to understand its shape, quality, and potential errors. This process is known as Exploratory Data Analysis (EDA). It is the crucial first step where we let the data speak to us before we make any assumptions.

1. The Biological Problem

Imagine you just received a spreadsheet containing RNA-seq expression counts for 20,000 genes across 10 biological samples. What if one of the sequencing machines malfunctioned during the run and produced complete garbage data for one of your samples?

With 200,000 data cells, you cannot read through the spreadsheet row by row to find the problem. We need mathematical and visual summaries that can quickly reveal outliers, distribution shapes, and experiment-wide failures. EDA allows us to safeguard our downstream analyses by spotting these errors early.
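
As a minimal sketch of what such a summary could look like, the R snippet below simulates a count matrix of 20,000 genes by 10 samples, deliberately corrupts one sample, and then finds it from ten per-sample medians rather than from 200,000 individual cells. The simulated data, the names `counts` and `Sample7`, and the "less than half or more than double the typical median" rule are all illustrative assumptions, not a standard quality-control threshold.

```r
set.seed(42)  # make the simulation reproducible

n_genes   <- 20000
n_samples <- 10

# Simulated counts: genes in rows, samples in columns (an assumed layout).
counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 1),
                 nrow = n_genes, ncol = n_samples,
                 dimnames = list(paste0("Gene", seq_len(n_genes)),
                                 paste0("Sample", seq_len(n_samples))))

# Pretend Sample7 failed: overwrite it with near-zero noise.
counts[, "Sample7"] <- rpois(n_genes, lambda = 0.5)

# One number per sample instead of 20,000 values per sample.
sample_medians <- apply(counts, 2, median)
print(sample_medians)

# Compare each sample's median to the typical (median) sample.
# The cutoffs 0.5 and 2 are arbitrary choices for this illustration.
ratio   <- sample_medians / median(sample_medians)
suspect <- names(ratio)[ratio < 0.5 | ratio > 2]
print(suspect)
```

On real data the failure is rarely this clean, so a numeric flag like this is best treated as a prompt to look at the plots, not as a verdict.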

2. Intuition & Theory

EDA consists of two main parts: initial numeric inspection and visual exploration.

Numeric Inspection

Before drawing any plots, we want to look at the general range of our values. Are they mostly small numbers or massive values? What is the average?
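
To make these questions concrete, here is a small sketch on a made-up vector of ten expression values for a single gene. The vector `gene_a` is invented for illustration; only base R functions are used.

```r
# A small, made-up vector of expression values for one gene (illustrative only).
gene_a <- c(12, 15, 9, 14, 11, 13, 250, 10, 16, 12)

value_range  <- range(gene_a)   # smallest and largest value: 9 and 250
value_mean   <- mean(gene_a)    # arithmetic average: 36.2, pulled upward by the 250
value_median <- median(gene_a)  # middle value: 12.5, barely affected by the 250

print(value_range)
print(value_mean)
print(value_median)

# summary() reports min, quartiles, median, mean, and max in one call.
print(summary(gene_a))
```

A large gap between the mean and the median, as seen here, is often the first numeric hint that the data are skewed or contain an extreme value.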

1-Dimensional Visualizations

To see the distribution of our biological data, we use two main graphics:

  • Histogram: This counts how frequently values fall into specific ranges (bins). It shows us the continuous shape and skewness of our data.
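  • Boxplot: This divides our dataset into quartiles (statistical quarters). The box spans the middle 50% of the values with a line at the median, the whiskers show the spread of the remaining typical values, and points beyond the whiskers mark extreme values (outliers).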
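
The binning idea behind a histogram can be sketched without drawing anything. The snippet below uses made-up values and asks `hist()` only for its counts by passing `plot = FALSE`. The vector `values` and the bin width of 2 are illustrative assumptions.

```r
# Made-up expression values (illustrative only).
values <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Bin edges at 0, 2, 4, 6, 8, 10. By default each bin includes its
# right edge, so the value 2 falls in the first bin (0 to 2).
h <- hist(values, breaks = seq(0, 10, by = 2), plot = FALSE)

print(h$breaks)  # the bin edges
print(h$counts)  # values per bin: 3 5 1 0 1
```

Most values pile up in the first two bins while a single value sits far to the right, which is the numeric signature of a right-skewed distribution.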

Figure: Anatomy of a Boxplot (Source: Wikimedia Commons)
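
The pieces of a boxplot can also be computed by hand. The sketch below uses a made-up vector and the common convention that points more than 1.5 times the interquartile range (IQR) beyond the quartiles are drawn as outliers. Note that R's `boxplot()` uses Tukey's hinges rather than `quantile()`, so its box edges can differ slightly from the quartiles computed here.

```r
# Made-up expression values with one extreme value (illustrative only).
values <- c(9, 10, 11, 12, 12, 13, 14, 15, 16, 250)

# Quartiles: the box runs from the 25% to the 75% quantile,
# with a line at the median (50%).
q   <- quantile(values, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(values)  # height of the box

# Fences at 1.5 * IQR beyond the quartiles; points outside are outliers.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers    <- values[values < lower_fence | values > upper_fence]

print(q)
print(c(lower_fence, upper_fence))
print(outliers)

# The same outlier, as identified by the routine boxplot() relies on.
print(boxplot.stats(values)$out)
```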

3. Visual Breakdown

[Embedded media: a visual breakdown of how boxplots are constructed and how to interpret them.]

4. Translating Theory to Code

In R, we can inspect our data and build basic visualizations with just a few lines of code:

# --- Exploratory Data Analysis & Visualizations ---
# Assumes expression_data is a numeric data frame that is already loaded.

# 1. Take a peek at the first few rows of your expression data
head(expression_data)

# 2. Get a quick statistical summary (Min, Max, Median, Mean, etc.)
summary(expression_data)

# 3. Create a histogram to see the distribution of one gene's expression
hist(expression_data$GeneA,
     main = "Distribution of Gene A Expression",
     xlab = "Expression Level",
     col = "skyblue")

# 4. Generate boxplots to compare distributions side-by-side.
#    Note: boxplot() on a data frame draws one box per column. If your columns
#    are genes (as GeneA above suggests), use t(expression_data) to get one box per sample.
boxplot(expression_data,
        main = "Sample Distributions",
        ylab = "Expression Value",
        col = "tomato")
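
The commands above expect an object called `expression_data` to exist already. For readers who want to try them without a real dataset, the sketch below builds a small, entirely hypothetical stand-in: 10 samples in rows and 5 genes in columns, filled with simulated counts. The dimensions, the gene names, and the negative binomial parameters are assumptions chosen only for illustration.

```r
set.seed(1)  # make the simulated values reproducible

# Hypothetical toy data: 10 samples (rows) by 5 genes (columns).
expression_data <- as.data.frame(
  matrix(rnbinom(10 * 5, mu = 100, size = 2),
         nrow = 10, ncol = 5,
         dimnames = list(paste0("Sample", 1:10),
                         paste0("Gene", LETTERS[1:5])))
)

# Confirm the layout: 10 observations of 5 numeric variables.
str(expression_data)
```

With this layout, `expression_data$GeneA` is a vector of 10 values, and `boxplot(expression_data)` draws one box per gene, since each column is a gene.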

Topics Covered

bioinformatics EDA, exploratory data analysis R, data visualization biology, genomic data inspection, R histograms boxplots