Week 6: Advanced EDA & Visualization with ggplot2 (Part 2)

Welcome to Week 6! While basic R plots are great for quick diagnostic checks, showcasing your findings in a research paper requires highly polished, layered, and customizable graphics. Today, we will master the industry standard for scientific visualization in R: the ggplot2 package.

1. The Biological Problem

Imagine you’ve run a differential expression analysis comparing healthy cells to disease cells. For each of 20,000 genes, you have calculated the statistical significance (p-value) and the fold change (its size and direction).

How do you communicate to a reader exactly which genes are significantly up-regulated or down-regulated in a single, intuitive, and highly professional image? Scanning a massive table of numbers is impractical. We need publication-grade visualizations like Volcano Plots and Heatmaps to instantly highlight key biological targets.

2. Intuition & Theory

The cornerstone of modern visualization in R is a concept called the “Grammar of Graphics”, implemented by the ggplot2 package. Think of it like building with LEGO bricks. Instead of creating a finished plot with one rigid tool, you build your plot in layers:

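  • The Dataset (Base Layer): The underlying data frame that you want to plot, with one row per observation and one column per variable. ggplot2 expects a data frame, so a matrix must be converted first.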
  • Aesthetics (aes): Mapping columns of your data to visual channels like the x-axis, y-axis, color, size, or shape.
  • Geometries (geom): The actual shapes rendered on the page, such as dots (geom_point), bars (geom_bar), or lines (geom_line).
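
The three layers above can be sketched in a few lines. This is a minimal illustration, not part of the lesson's dataset: the data frame toy_genes and its values are made up for the example.

```r
library(ggplot2)

# Hypothetical toy data: five genes with made-up expression values
toy_genes <- data.frame(
  gene    = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  control = c(2.1, 5.3, 3.8, 7.2, 4.4),
  treated = c(2.4, 8.9, 3.5, 3.1, 4.6)
)

# Layers 1 and 2: the dataset, plus aesthetics mapping columns to the axes
base_layer <- ggplot(toy_genes, aes(x = control, y = treated))

# Layer 3: a geometry that draws one point per row
scatter <- base_layer + geom_point(size = 3)

# Swapping the geometry reuses the same data and aesthetics
labelled <- base_layer + geom_text(aes(label = gene))

print(scatter)
```

Note that base_layer on its own draws only empty axes. Nothing is rendered for the data until a geometry is added.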

Volcano Plots and Heatmaps

  • Volcano Plot: A scatter plot where the x-axis shows the size and direction of change ($\log_2$ fold change) and the y-axis shows statistical significance ($-\log_{10}$ of the p-value). Genes that are both highly significant and strongly changed appear toward the top-left (strongly down-regulated) and top-right (strongly up-regulated), giving the plot the shape of an erupting volcano.
  • Heatmap: A grid where columns are samples, rows are genes, and color intensity represents expression level. When rows and columns are ordered by hierarchical clustering, similar samples and genes sit next to each other, which reveals patterns that are hard to see in a table.
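
The heatmap idea can be sketched with ggplot2's geom_tile. This is an illustrative sketch under stated assumptions: the table expr_long, its gene and sample names, and its values are simulated for the example and are not real expression data.

```r
library(ggplot2)

set.seed(1)  # reproducible simulated values

# Hypothetical long-format table: one row per gene/sample pair
expr_long <- expand.grid(
  gene   = paste0("gene", 1:6),
  sample = c("healthy_1", "healthy_2", "disease_1", "disease_2")
)
expr_long$expression <- rnorm(nrow(expr_long))  # simulated, not real data

# geom_tile draws one rectangle per row; fill encodes the expression value
heatmap_plot <- ggplot(expr_long, aes(x = sample, y = gene, fill = expression)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(x = "Sample", y = "Gene", fill = "Simulated expression") +
  theme_minimal()

print(heatmap_plot)
```

This sketch does no clustering: rows and columns appear in factor-level order. Reordering them by similarity, for example with hclust(), would be a separate step before plotting.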

Figure: Example Volcano Plot. Source: [Wikimedia Commons/Differential Expression Volcano Plot]

3. Visual Breakdown

To grasp how the Grammar of Graphics works and how easily you can construct complex plots, watch this intuitive tutorial:

4. Translating Theory to Code

Let’s translate this layering concept into code. In ggplot2, we initialize our canvas and add layers on top using the + operator:

# --- Advanced EDA & Visualization with ggplot2 ---
# Ensure ggplot2 is installed, e.g. install.packages("ggplot2"), then load it
library(ggplot2)

# NOTE: 'gene_data' is assumed to be an existing data frame with the columns
# 'log2FoldChange', 'minus_log10_p', and 'threshold_status'

# 1. Initialize the plot canvas with data and basic axis mapping
# We map 'log2FoldChange' to X, and the statistical significance 'minus_log10_p' to Y
# (named 'base_plot' rather than 'plot' to avoid masking base R's plot() function)
base_plot <- ggplot(gene_data, aes(x = log2FoldChange, y = minus_log10_p))

# 2. Add a geometry layer to render one point per gene
# Points are colored by whether each gene meets our significance threshold
plot_with_points <- base_plot + geom_point(aes(color = threshold_status))

# 3. Add labels and a clean theme for publication quality
final_volcano_plot <- plot_with_points +
  labs(title = "Volcano Plot of Differential Gene Expression",
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-Value") +
  theme_minimal()

# 4. Display the finalized ggplot
print(final_volcano_plot)
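
The code above assumes that gene_data already exists with the columns log2FoldChange, minus_log10_p, and threshold_status. The following self-contained sketch shows one way such a table might be prepared and plotted. The values are simulated, the column padj is a hypothetical name for the adjusted p-value, and the cutoffs (adjusted p-value below 0.05, absolute log2 fold change above 1) are assumptions chosen for illustration, not thresholds stated in this lesson.

```r
library(ggplot2)

set.seed(42)  # reproducible simulated values
n_genes <- 1000

# Simulated results table; a real analysis would supply these columns
gene_data <- data.frame(
  log2FoldChange = rnorm(n_genes, mean = 0, sd = 1.5),
  padj           = runif(n_genes, min = 1e-10, max = 1)
)

# Y-axis: small p-values become large positive numbers
gene_data$minus_log10_p <- -log10(gene_data$padj)

# Assumed cutoffs for illustration only
p_cutoff  <- 0.05
fc_cutoff <- 1

# Classify each gene against both cutoffs
is_sig <- gene_data$padj < p_cutoff
gene_data$threshold_status <- "Not significant"
gene_data$threshold_status[is_sig & gene_data$log2FoldChange >  fc_cutoff] <- "Up-regulated"
gene_data$threshold_status[is_sig & gene_data$log2FoldChange < -fc_cutoff] <- "Down-regulated"

# Same layering as before, plus dashed guide lines at the cutoffs
volcano <- ggplot(gene_data, aes(x = log2FoldChange, y = minus_log10_p)) +
  geom_point(aes(color = threshold_status), alpha = 0.6) +
  geom_hline(yintercept = -log10(p_cutoff), linetype = "dashed") +
  geom_vline(xintercept = c(-fc_cutoff, fc_cutoff), linetype = "dashed") +
  scale_color_manual(values = c("Up-regulated"    = "firebrick",
                                "Down-regulated"  = "steelblue",
                                "Not significant" = "grey60")) +
  labs(x = "Log2 Fold Change", y = "-Log10 Adjusted P-Value", color = "Status") +
  theme_minimal()

print(volcano)
```

The dashed lines make the cutoffs visible on the plot itself, so a reader can see why a given point received its color. To save the figure for a manuscript, ggsave("volcano.pdf", volcano, width = 6, height = 5) is one option; the file name and dimensions here are placeholders.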

Topics Covered

ggplot2 bioinformatics, R volcano plot, gene expression heatmap, grammar of graphics R, advanced data visualization