
Programming with R – The Computational Biologist’s Toolkit

Welcome to Week 3! Today, we are opening the door to a powerful new way of working with biological data. We are diving into R and RStudio for Bioinformatics, the essential toolkit for turning raw genomic data into meaningful biological insights.

In this guide, we will step through why R is preferred over Excel or Python for statistical biology, dissect our coding workspace, learn how to command the computer using functions, and explore the logical building blocks used to translate messy biological reality into clean, reproducible science.


1. R vs. Python: Choosing the Right Tool for the Job

When you step into the world of bioinformatics, you will immediately hear active debates about Python vs. R. Both are magnificent, but they were built with fundamentally different philosophies:

  • Python was designed by computer scientists as a general-purpose programming language. It is exceptionally versatile for building websites, automating servers, and running deep learning networks.
  • R was built specifically by and for statisticians and data scientists to calculate, analyze, and visualize data.

“The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.”
Bo Cowgill

When staring at a massive matrix of genetic information (such as 25,000 genes across 100 clinical tumor samples), you do not want to write complex, nested computer science algorithms just to run a correlation test. R has these advanced statistical capabilities built directly into its core language.
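As a rough illustration of that point, the sketch below runs a correlation test in a single call to cor.test(), which ships with base R. The two "genes" are simulated with rnorm(), so the variable names and numbers are invented for the example and are not real expression data.

```r
# Simulate expression values for two hypothetical genes across 100 samples
set.seed(42)                                   # make the random numbers repeatable
n_samples <- 100
gene_a <- rnorm(n_samples, mean = 10, sd = 2)  # made-up expression values
gene_b <- gene_a + rnorm(n_samples, mean = 0, sd = 1)  # gene_b tracks gene_a plus noise

# One built-in call runs the full Pearson correlation test
test_result <- cor.test(gene_a, gene_b, method = "pearson")

# The result object carries the estimate and the p-value
print(test_result$estimate)
print(test_result$p.value)
```

No extra packages are loaded here; cor.test() lives in the stats package that is attached whenever R starts.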

Why R Dominates Modern Bioinformatics

  1. Bioconductor Ecosystem: Built on R, Bioconductor is a large, curated, open-source software repository with thousands of packages engineered specifically for high-throughput genomic data. For RNA-Seq, single-cell analysis, or epigenetics, many of the standard analysis tools are R packages distributed through Bioconductor.
  2. Unmatched Visualization: Packages like ggplot2 allow biologists to create publication-quality figures representing complex multi-dimensional datasets with minimal, elegant syntax.
  3. Widely Adopted: From academic labs to biomanufacturing and pharmaceutical companies, R is widely used for clinical trial statistics, statistical analysis, and bioinformatics workflows.

Video Resource: Comparing Languages

To visualize why R dominates statistical biology and how it compares to Python for data science, explore this excellent breakdown by Luke Barousse:

Source: Luke Barousse (YouTube)


2. Base R vs. RStudio: The Engine vs. The Dashboard

To write code effectively, we need a functional workspace. We will use two different interfaces, and it is crucial to understand how they rely on each other.

| Feature | Base R (The Engine) | RStudio (The IDE) |
|---|---|---|
| The Analogy | A bare-bones motorcycle engine. It has massive raw power and runs perfectly, but lacks mirrors, a dashboard, comfortable seats, or headlights. | A high-end luxury sports car built around the engine. It adds the dashboard, navigation system, climate controls, and ergonomic seats. |
| The Interface | A simple black-and-white command-line interface (CLI) with a blinking > prompt waiting for commands. | A highly organized, 4-pane Integrated Development Environment (IDE) with files, visual plots, and documentation helpers. |
| When to Use It | When logging into remote cloud servers, supercomputing clusters, or High-Performance Computing (HPC) nodes where full graphics would crash or slow down the connection. | For your daily desktop analysis, writing scripts, organizing reproducible research projects, and rendering high-resolution scientific plots. |

Touring the RStudio Interface (The 4-Pane Layout)

When you open RStudio, you are greeted by an organized dashboard divided into four primary quadrants. Mastering this layout makes writing code feel intuitive:

+-----------------------------------+-----------------------------------+
|                                   |                                   |
| 1. SOURCE EDITOR                  | 3. ENVIRONMENT / HISTORY          |
| - Write and save your scripts     | - View loaded datasets            |
| - Real-time syntax highlighting   | - Monitor variables & objects     |
| - Step-by-step code execution     | - Track command history           |
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
| 2. CONSOLE PANEL                  | 4. FILES / PLOTS / HELP           |
| - The active "Engine" terminal    | - Directory structure browser     |
| - Executes code immediately       | - Displays rendered figures       |
| - Shows errors and output logs    | - Full help documentation search  |
|                                   |                                   |
+-----------------------------------+-----------------------------------+

RStudio Interface Layout Source: Official Posit Documentation (https://posit.co/)

Video Walkthrough: Exploring RStudio

For a visual walkthrough of these panels in real time, watch the following reference:

Source: RStudio for Beginners (YouTube)


3. Functions in R: Casting Computational Spells

In R, you will frequently command the computer using instructions called Functions.

An easy way to understand a function is to think of a magic spell from Harry Potter. If you are standing in front of a heavy, locked oak door, you wave your wand and call Alohomora:

  1. The Input (Argument): The locked door.
  2. The Spell (Function): An invisible, highly complex sequence of events occurs in the background.
  3. The Output (Result): An unlocked, wide-open door.

In R, if you need your computer to parse a massive biological CSV file, you don’t write hundreds of lines of code to scan text; you simply cast the function read.csv(). If you need the mathematical average of 10,000 gene expression observations, you cast mean().

The general syntax is simple:

result <- function_name(argument1 = input1, argument2 = input2)

Functions represent your incantations to command data effortlessly.
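To make the input, spell, and output pattern concrete, here is a minimal sketch that uses the two functions named above. The CSV contents are invented for illustration and are written to a temporary file so the example runs on any machine.

```r
# Write a tiny, made-up expression table to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("gene,expression", "TP53,12.5", "BRCA1,7.5", "EGFR,10.0"),
  csv_path
)

# Input: a file path. Output: a data frame holding the parsed table.
expression_table <- read.csv(file = csv_path, header = TRUE)

# Input: a numeric vector. Output: a single number, its average.
average_expression <- mean(x = expression_table$expression)
print(average_expression)

# A second argument changes how the function behaves:
# na.rm = TRUE tells mean() to skip missing values (NA)
average_without_missing <- mean(x = c(1, NA, 3), na.rm = TRUE)
print(average_without_missing)
```

Each call follows the general syntax shown above: a function name, named arguments in parentheses, and a result stored with the arrow.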


4. Storing Data: The Assignment Operator (<-) and Objects

Once your function executes and returns a brand-new result, where does it go? In R, the core power comes from storing data values in objects. Think of an object as a sturdy, custom-labeled box in memory.

Beginners are frequently tempted to use a standard equals sign (=) to save values. While R does accept = for assignment, the idiomatic and most widely recommended way to store data in R is with a unique arrow called the assignment operator (<-). The equals sign is best kept for naming arguments inside function calls.

# We are taking the numeric value 25 and pointing it inside a box labeled "patient_age"
patient_age <- 25
# Store the length (in base pairs) of our gene sequence fragment
sequence_length <- 1420
# Now, we can perform math directly on these labeled boxes
double_age <- patient_age * 2
print(double_age) # Prints [1] 50

Tip: You are literally drawing an arrow pointing your data into its designated storage container!
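One practical reason to prefer the arrow is that = and <- behave differently inside a function call. The sketch below uses a hypothetical helper, double_it(), written only for this illustration: with =, the word on the left is just the name of an argument, while <- creates a real object in your workspace.

```r
# A hypothetical helper function, defined only for this example
double_it <- function(value) value * 2

# Inside a call, "=" names the argument; no object called "value" is created
result_named <- double_it(value = 21)
created_by_equals <- exists("value", inherits = FALSE)
print(created_by_equals)  # FALSE in a fresh session

# Inside a call, "<-" creates an object called "value" AND passes it along
result_arrow <- double_it(value <- 21)
created_by_arrow <- exists("value", inherits = FALSE)
print(created_by_arrow)   # TRUE
```

In everyday scripts the simple habit is: <- to store things, = to name arguments.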


5. The Building Blocks: Core Data Types in R

Before constructing complex multi-omics databases, we must understand the core ingredients that make up all data. Every single value in R belongs to a specific data type. The three most vital types for genomic workflows are:

1. Numeric (Double/Integer)

Any numerical value, with or without decimal points.

  • Biological Examples: pH level, sequencing depth, temperature.
  • Sample Values: 98.6, 7.4, 1000000

2. Character (String)

Text data. In R, characters must always be wrapped in quotation marks ("..." or '...') so the R engine knows it represents a string literal rather than a variable name.

  • Biological Examples: Gene names, amino acid sequences, diagnostic states.
  • Sample Values: "TP53", "Tumor", "MALE", "ATCGGACT"

3. Logical (Boolean)

A simple truth value used for boolean logic, comparison testing, and data filtering. It must be written in ALL CAPS without quotes.

  • Biological Examples: Has mutation present? Patient survived?
  • Sample Values: TRUE or FALSE
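You can ask R which type a value has with class() and typeof(). The short sketch below stores one value of each type; the object names are made up for illustration.

```r
# One example value per data type (names are illustrative)
body_temperature <- 98.6      # numeric, stored as a double
read_count <- 1000000L        # the L suffix asks R for an integer
gene_name <- "TP53"           # character: note the quotation marks
mutation_present <- TRUE      # logical: all caps, no quotes

# class() reports the general type
print(class(body_temperature))   # "numeric"
print(class(gene_name))          # "character"
print(class(mutation_present))   # "logical"

# typeof() reveals how a number is stored internally
print(typeof(body_temperature))  # "double"
print(typeof(read_count))        # "integer"
```

Without the L suffix, a whole number such as 1000000 is still stored as a double, which is fine for most analyses.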

6. Structuring Biological Chaos: Vectors and Data Frames

Biological reality is wonderfully chaotic. To analyze it computationally, we must force it into structured digital containers. In R, we do this using specialized data structures:

Vectors (1-Dimensional Collections)

Imagine you need to record the heart rates of 5 patients. Instead of creating 5 independent variables (hr1, hr2, …), R allows you to lock them together in a single, ordered collection called a Vector. (R also has a separate structure named "list", so we avoid that word here.)

Vectors are built using the c() function, which stands for combine:

# Creating a numeric vector of patient heart rates
heart_rates <- c(72, 85, 68, 92, 78)
# Creating a character vector of gene markers
genes <- c("BRCA1", "TP53", "EGFR", "MYC")
# NOTE: A vector must be homogeneous (every single item must be of the identical data type)
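The homogeneity rule has a practical consequence worth seeing once: if you mix types inside c(), R does not raise an error but silently converts everything to a single common type. A minimal sketch, reusing the heart rate values from above:

```r
# Mixing a number, a string, and a logical: everything becomes character
mixed <- c(72, "BRCA1", TRUE)
print(class(mixed))   # "character"
print(mixed)          # "72" "BRCA1" "TRUE"

# Mixing logicals and numbers: TRUE becomes 1, FALSE becomes 0
flags_and_numbers <- c(TRUE, FALSE, 3)
print(class(flags_and_numbers))  # "numeric"

# A clean numeric vector supports math on all items at once
heart_rates <- c(72, 85, 68, 92, 78)
print(mean(heart_rates))   # 79
print(heart_rates[2])      # 85, the second item (R counts from 1)
print(heart_rates > 80)    # FALSE TRUE FALSE TRUE FALSE
```

If a column of numbers unexpectedly behaves like text, a stray character value hiding in the vector is a common cause.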

Data Frames (2-Dimensional Tables)

Clinical trials and genomic analyses usually record more than one variable per subject. What if you want to store patient IDs, their gene expression levels, and whether they possess a specific gene mutation?

For this, we use a Data Frame—R’s native equivalent of a spreadsheet or SQL table.

# Building a structured biological cohort from three equal-length vectors
clinical_cohort <- data.frame(
  patient_id = c("P001", "P002", "P003", "P004"),
  expression_level = c(12.4, 45.1, 2.8, 18.9),
  mutation_detected = c(TRUE, FALSE, TRUE, TRUE)
)

The Golden Rules of Data Frames:

  • Rows = Observations: Each horizontal row represents a single biological subject or experiment (e.g., a specific patient or tissue sample).
  • Columns = Variables: Each vertical column represents a specific, single physical parameter measured across all subjects (e.g., age, weight, expression level).
  • Uniform Length: Every column in a single data frame must contain the exact same number of items, maintaining a perfect rectangular structure.
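The rules above can be checked directly in code. This sketch rebuilds the same cohort, inspects its shape, pulls out one column, filters rows, and then shows what happens when the uniform-length rule is broken. The tryCatch() wrapper is only there so the example keeps running after the deliberate error.

```r
# Rebuild the example cohort so this block runs on its own
clinical_cohort <- data.frame(
  patient_id = c("P001", "P002", "P003", "P004"),
  expression_level = c(12.4, 45.1, 2.8, 18.9),
  mutation_detected = c(TRUE, FALSE, TRUE, TRUE)
)

# Rows are observations, columns are variables
print(nrow(clinical_cohort))  # 4 patients
print(ncol(clinical_cohort))  # 3 variables
str(clinical_cohort)          # compact summary of each column and its type

# The $ operator extracts one column as a vector
expression_values <- clinical_cohort$expression_level

# A logical column can be used to keep only matching rows
mutated_patients <- clinical_cohort[clinical_cohort$mutation_detected, ]
print(mutated_patients$patient_id)

# Breaking the uniform-length rule: 4 IDs but only 3 measurements
length_error <- tryCatch(
  data.frame(patient_id = c("P001", "P002", "P003", "P004"),
             expression_level = c(12.4, 45.1, 2.8)),
  error = function(e) conditionMessage(e)
)
print(length_error)  # R's error message, captured as text
```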

7. The Working Directory & RStudio Projects: Avoiding “File Not Found” Pitfalls

When loading genomic datasets into R, you will inevitably hit the most common beginner frustration: Error in file(file, "rt") : cannot open the connection (No such file or directory)

This happens due to a misunderstanding of the Working Directory.

Think of the R engine as a laboratory worker sitting at a physical desk in one specific room. If you tell them to “grab the file patient_data.csv”, they will search only that room. If your raw spreadsheet is located in the Downloads/ directory, but R is currently sitting in Documents/, the worker will look, fail to find it, and throw a connection error.

# Check where R is currently sitting in your file system
getwd()
# Manually changing directories (often leads to broken paths when sharing code)
setwd("/Users/bioinformatician/Documents/lab_project")
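Before reading a file, you can ask R whether it can actually see it from where it is sitting. The sketch below is one possible way to do that check; patient_data.csv is the example file name from the analogy above and is not expected to exist on your machine.

```r
# Where is R sitting right now?
current_dir <- getwd()

# The example file name from the text (hypothetical)
data_file <- "patient_data.csv"

# file.exists() returns TRUE or FALSE instead of throwing an error
found <- file.exists(data_file)

if (found) {
  patients <- read.csv(data_file)
} else {
  message("Cannot find '", data_file, "' in: ", current_dir)
  message("Check the folder contents with list.files() or fix the path.")
}
```

A quick check like this turns a cryptic connection error into a message that tells you exactly which folder R searched.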

Best Practice: RStudio Projects (.Rproj)

Rather than micro-managing file paths with manual, absolute commands like setwd(), modern professional bioinformatics relies on RStudio Projects.

A Project file creates a self-contained “workshop” folder inside your workspace:

  1. RStudio automatically locks R’s active working directory directly to that project’s folder.
  2. All files, datasets, scripts, and output plots are housed together.
  3. When you send your project folder to another researcher, paths written relative to the project folder keep working on their machine, which avoids most “file not found” errors.
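Here is a small sketch of what relative paths look like in practice. Because an .Rproj file cannot be opened from inside a code example, the block simulates a project with a temporary folder and a temporary setwd() call; the folder name lab_project, the data/ subfolder, and the file contents are all assumptions made for illustration. In a real RStudio Project you would skip the setwd() lines, since opening the project sets the working directory for you.

```r
# Stand-in for a project folder (in real life: the folder holding your .Rproj)
project_root <- file.path(tempdir(), "lab_project")
dir.create(file.path(project_root, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Put a tiny, made-up dataset inside the project's data/ subfolder
write.csv(
  data.frame(patient_id = c("P001", "P002"),
             expression_level = c(12.4, 45.1)),
  file = file.path(project_root, "data", "patient_data.csv"),
  row.names = FALSE
)

# Simulate what opening the project does: start R in the project root
old_dir <- setwd(project_root)

# The path is relative to the project root, so it works on any computer
patients <- read.csv(file.path("data", "patient_data.csv"))
print(patients)

# Restore the previous working directory
setwd(old_dir)
```

file.path() builds the path with the correct separator for the operating system, so the same script runs on Windows, macOS, and Linux.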

8. Deepen Your Learning: Curated Resources

Before jumping into your next active coding session, review these curated educational reference links to cement your understanding:


Topics Covered

bioinformatics, R programming, RStudio setup, genomic data analysis, reproducible research