
Week 8: Introduction to Machine Learning in Bioinformatics

Welcome back from your mid-term break! Hopefully you are refreshed and ready for the next phase of our journey. Over the first half of this semester, we focused on describing and summarizing data using exploratory statistics and visualization. Now we shift from describing the data we already have to predicting outcomes for samples we have not yet seen, using machine learning in bioinformatics.

1. The Biological Problem

Imagine you are studying a rare, aggressive form of tumor. You have gathered tissue samples from 100 cancer patients and measured, by sequencing, the expression levels of 5,000 distinct genes in each sample. Half of these tumors were diagnosed by pathologists as benign, and the other half as malignant.

Now, a new patient arrives at your clinic, and you sequence their tumor tissue. Instead of waiting weeks for detailed pathological testing, can we build a mathematical rule that takes these 5,000 expression levels, picks up on subtle combinations of genes, and automatically tells us whether the new tumor is benign or malignant? Doing this by eye is impractical: no human can weigh 5,000 measurements simultaneously. This is where machine learning comes to the rescue.
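To make the idea of a "mathematical rule" concrete, here is a minimal sketch in base R. It is not the method used later in this lesson: it uses a simple nearest-centroid rule on simulated data. All values are made up, and the assumption that the first 50 genes are elevated in malignant tumors exists only so the example has a signal to find.

```r
# Simulated data only: 100 patients x 5,000 genes (all values are made up).
set.seed(42)
n_patients <- 100
n_genes <- 5000
expr_matrix <- matrix(rnorm(n_patients * n_genes), nrow = n_patients, ncol = n_genes)
diagnosis <- factor(rep(c("benign", "malignant"), each = n_patients / 2))

# Assumption for illustration: the first 50 genes are shifted upward in malignant tumors.
is_malignant <- diagnosis == "malignant"
expr_matrix[is_malignant, 1:50] <- expr_matrix[is_malignant, 1:50] + 3

# The "rule": compute the average expression profile of each class ...
centroid_benign <- colMeans(expr_matrix[!is_malignant, ])
centroid_malignant <- colMeans(expr_matrix[is_malignant, ])

# ... and assign a new tumor to the class whose average profile is closer
# (Euclidean distance across all 5,000 genes at once).
classify_tumor <- function(profile) {
  dist_benign <- sqrt(sum((profile - centroid_benign)^2))
  dist_malignant <- sqrt(sum((profile - centroid_malignant)^2))
  if (dist_malignant < dist_benign) "malignant" else "benign"
}

# A hypothetical new patient whose first 50 genes are elevated.
new_patient <- rnorm(n_genes)
new_patient[1:50] <- new_patient[1:50] + 3
new_patient_call <- classify_tumor(new_patient)
print(new_patient_call)
```

The point is not this particular rule but that the computer compares all 5,000 measurements in one step, which is exactly what a person cannot do.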

2. Intuition & Theory

To understand machine learning, it helps to divide it into two primary philosophies based on the kind of questions we are asking:

Supervised Learning: Guided Learning

In Supervised Learning, we act as the teacher. We feed our machine learning algorithm data where the correct answers (known as labels) are already provided. The model’s job is to learn a rule that maps our inputs (the gene expression profiles) to the target outputs (labels like “benign” or “malignant”).

  • Classification: Predicting group categories (e.g., diseased vs. healthy tissue, tumor grade 1 vs. 2).
  • Regression: Predicting continuous, numerical values (e.g., a patient’s life expectancy in months, or viral load concentration).
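The difference between the two tasks shows up in what the model returns. The sketch below uses base R on simulated data; the variable names (gene_a, status, viral_load) and all numbers are hypothetical, and logistic regression and a linear model are chosen only because they ship with R.

```r
set.seed(1)
n <- 200
gene_a <- rnorm(n)  # simulated expression of one hypothetical gene

# Classification target: a two-level label that depends on gene_a plus noise.
status <- factor(ifelse(gene_a + rnorm(n) > 0, "diseased", "healthy"),
                 levels = c("healthy", "diseased"))

# Regression target: a continuous, made-up "viral load" that depends on gene_a.
viral_load <- 10 + 3 * gene_a + rnorm(n)

new_samples <- data.frame(gene_a = c(-2, 2))

# Classification: logistic regression returns the probability of the second
# factor level ("diseased"), which we threshold to obtain a category.
clf <- glm(status ~ gene_a, family = binomial)
prob_diseased <- predict(clf, newdata = new_samples, type = "response")
predicted_class <- ifelse(prob_diseased > 0.5, "diseased", "healthy")

# Regression: a linear model returns a number on the scale of the target.
reg <- lm(viral_load ~ gene_a)
predicted_load <- predict(reg, newdata = new_samples)

print(predicted_class)
print(predicted_load)
```

The classifier answers "which group?", while the regression model answers "how much?".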

Unsupervised Learning: Undirected Discovery

In Unsupervised Learning, we do not provide labels. We give our algorithm a heap of data and say: “Find the hidden structures and similarity clusters in this group.” It is incredibly useful in biology for discovering new subtypes of disease that pathologists might not be able to identify under a microscope.
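As a rough illustration of discovery without labels, the sketch below runs k-means clustering (from base R's stats package) on simulated expression data. The two hidden subtypes, the matrix size, and the choice of k-means with two clusters are assumptions made for the example, not a recommendation for real data.

```r
set.seed(7)
# Simulated data: 60 samples x 50 genes containing two hidden subtypes.
# The subtype labels are NOT shown to the algorithm.
hidden_subtype <- rep(c(1, 2), each = 30)
expr <- matrix(rnorm(60 * 50), nrow = 60)
expr[hidden_subtype == 2, 1:10] <- expr[hidden_subtype == 2, 1:10] + 3

# k-means sees only the expression matrix and groups samples by similarity.
clusters <- kmeans(expr, centers = 2, nstart = 25)

# Cross-tabulate afterwards, only to check how well the discovered clusters
# line up with the hidden subtypes (cluster numbers themselves are arbitrary).
print(table(cluster = clusters$cluster, subtype = hidden_subtype))
```

In a real study there would be no hidden_subtype column to check against, so clusters have to be validated by other means, such as clinical outcome or known marker genes.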

Figure: Classification vs. Clustering. Source: [Wikimedia Commons/Supervised vs Unsupervised learning]

3. Visual Breakdown

To grasp the foundations of how machine learning models learn and make predictions, watch the introductory StatQuest guide that accompanies this lesson.

4. Translating Theory to Code

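Let’s look at the standard workflow of training and validating a machine learning model. In R, libraries like caret or tidymodels streamline this process. Here is the conceptual workflow you’ll be using, written with caret. It assumes a data frame named dataset with one row per patient, one column per gene, and a Diagnosis column: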

# --- Machine Learning Pipeline Steps ---
library(caret)  # method = "rf" also needs the randomForest package installed
set.seed(123)   # make the random split and model fit reproducible
# caret expects the class label to be a factor
dataset$Diagnosis <- factor(dataset$Diagnosis)

# 1. Split your genomic dataset into Training and Testing sets
# This ensures we test our model on data it has never seen before!
train_index <- createDataPartition(dataset$Diagnosis, p = 0.8, list = FALSE)
training_data <- dataset[train_index, ]
testing_data <- dataset[-train_index, ]

# 2. Train a classification model (here a Random Forest, method = "rf")
# Here, we predict the "Diagnosis" label using all other gene columns (.)
model <- train(Diagnosis ~ ., data = training_data, method = "rf")

# 3. Use the trained model to make predictions on the brand-new, unseen test data
predictions <- predict(model, newdata = testing_data)

# 4. Evaluate the model: compare predicted labels against the true labels
confusionMatrix(predictions, testing_data$Diagnosis)
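To see what the final evaluation step summarizes, the sketch below builds a confusion matrix by hand in base R and derives three common metrics from it. The ten predictions and true labels are invented for illustration, and "malignant" is treated as the positive class here by choice. Note that caret's confusionMatrix() uses the first factor level as the positive class unless its positive argument is set.

```r
# Made-up true labels and predictions for 10 test patients (illustration only).
truth <- factor(c("benign", "benign", "benign", "benign", "benign",
                  "malignant", "malignant", "malignant", "malignant", "malignant"),
                levels = c("benign", "malignant"))
predicted <- factor(c("benign", "benign", "benign", "benign", "malignant",
                      "malignant", "malignant", "malignant", "benign", "benign"),
                    levels = c("benign", "malignant"))

# Rows are the model's predictions, columns are the true diagnoses.
cm <- table(Prediction = predicted, Reference = truth)
print(cm)

# Accuracy: fraction of all patients classified correctly (the diagonal).
accuracy <- sum(diag(cm)) / sum(cm)

# Treating "malignant" as the positive class:
# sensitivity = malignant tumors correctly flagged / all malignant tumors
sensitivity <- cm["malignant", "malignant"] / sum(cm[, "malignant"])
# specificity = benign tumors correctly cleared / all benign tumors
specificity <- cm["benign", "benign"] / sum(cm[, "benign"])

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

In this toy example the model is right for 7 of 10 patients, yet it misses 2 of the 5 malignant tumors. In a clinical setting that may matter more than the overall accuracy, which is why it helps to look beyond a single number.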

Topics Covered

machine learning bioinformatics, supervised learning, unsupervised learning, AI in biology, genomic prediction models