Week 8: Introduction to Machine Learning in Bioinformatics
Welcome back from your mid-term break! Hopefully you are refreshed and ready for the next phase of our journey. Over the first half of this semester, we focused on describing and summarizing data using exploratory statistics and visualization. Now we are shifting our focus from describing the data we already have to predicting outcomes for data we have not yet seen, using Machine Learning in Bioinformatics.
1. The Biological Problem
Imagine you are studying a rare, aggressive form of tumor. You have gathered tissue samples from 100 patients and measured the expression levels of 5,000 distinct genes in each sample. Half of these tumors were diagnosed by pathologists as benign, and the other half as malignant.
Now, a new patient arrives at your clinic, and you sequence their tumor tissue. Instead of waiting weeks for detailed pathological testing, can we build a mathematical rule that scans these 5,000 expression levels, picks up on subtle combinations of genes, and automatically tells us whether this new tumor is benign or malignant? Doing this manually is impossible: no human can examine 5,000 dimensions simultaneously. This is where machine learning comes to the rescue.
2. Intuition & Theory
To understand machine learning, it helps to divide it into two primary philosophies based on the kind of questions we are asking:
Supervised Learning: Guided Learning
In Supervised Learning, we act as the teacher. We feed our machine learning algorithm data where the correct answers (known as labels) are already provided. The model’s job is to learn the mapping that links our inputs (the gene expression profiles) to the target outputs (labels like “benign” or “malignant”).
- Classification: Predicting group categories (e.g., diseased vs. healthy tissue, tumor grade 1 vs. 2).
- Regression: Predicting continuous, numerical values (e.g., predicting a patient’s life expectancy in months, or estimating viral load concentration).
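To make the difference between the two tasks concrete, here is a minimal base R sketch on simulated data. Everything in it is an assumption made for illustration: the single `expression` variable, the coefficients used to simulate the labels, and the choice of logistic regression (`glm`) and linear regression (`lm`) as example models.

```r
# Simulated toy data (hypothetical): 100 patients, one gene's expression level.
set.seed(42)
n <- 100
expression <- rnorm(n, mean = 5, sd = 2)

# Classification target: a two-level label whose probability rises with expression.
prob_malignant <- plogis(1.5 * (expression - 5))
diagnosis <- factor(ifelse(runif(n) < prob_malignant, "malignant", "benign"),
                    levels = c("benign", "malignant"))

# Regression target: a continuous value that falls as expression rises.
survival_months <- 60 - 4 * expression + rnorm(n, sd = 5)

toy <- data.frame(expression, diagnosis, survival_months)
new_patient <- data.frame(expression = 8)

# Classification: logistic regression outputs a probability, which we turn into a category.
clf <- glm(diagnosis ~ expression, data = toy, family = binomial)
pred_prob <- predict(clf, newdata = new_patient, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "malignant", "benign")

# Regression: linear regression outputs a number directly.
reg <- lm(survival_months ~ expression, data = toy)
pred_months <- predict(reg, newdata = new_patient)

print(pred_class)
print(pred_months)
```

Both models receive the same input, but `pred_class` is a category while `pred_months` is a number. That difference in the target is what separates classification from regression.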
Unsupervised Learning: Undirected Discovery
In Unsupervised Learning, we do not provide labels. We give our algorithm a heap of data and say: “Find the hidden structures and similarity clusters in this group.” It is incredibly useful in biology for discovering new subtypes of disease that pathologists might not be able to identify under a microscope.
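As a rough illustration of this idea, the sketch below simulates an expression matrix containing two hidden subtypes and asks k-means clustering to find them without ever seeing the labels. The data, the group sizes, the size of the expression shift, and the choice of k-means with two clusters are all assumptions made for the example; real studies often use other methods, such as hierarchical clustering.

```r
set.seed(1)
n_per_group <- 30
n_genes <- 50

# Hypothetical hidden subtypes: subtype B has higher expression in the first 10 genes.
subtype_a <- matrix(rnorm(n_per_group * n_genes), nrow = n_per_group)
subtype_b <- matrix(rnorm(n_per_group * n_genes), nrow = n_per_group)
subtype_b[, 1:10] <- subtype_b[, 1:10] + 3
expr <- rbind(subtype_a, subtype_b)

# Kept aside only to check the result afterwards; the algorithm never sees it.
hidden_subtype <- rep(c("A", "B"), each = n_per_group)

# k-means receives only the expression matrix and the number of clusters to look for.
km <- kmeans(expr, centers = 2, nstart = 10)

# Cross-tabulate the discovered clusters against the hidden subtypes.
agreement <- table(cluster = km$cluster, subtype = hidden_subtype)
print(agreement)
```

Note that the cluster numbers themselves are arbitrary. What matters is whether samples from the same hidden subtype end up in the same cluster.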
3. Visual Breakdown
To grasp the foundations of how machine learning models learn and make predictions, watch the introductory guide from StatQuest.
4. Translating Theory to Code
Let’s look at the standard workflow of training and validating a machine learning model. In R, packages like caret or tidymodels streamline this process. Here is the conceptual code workflow you’ll be using, written with caret:
# --- Machine Learning Pipeline Steps ---
library(caret)  # provides createDataPartition(), train(), and confusionMatrix()

# 1. Split your genomic dataset into Training and Testing sets
# This ensures we test our model on data it has never seen before!
train_index <- createDataPartition(dataset$Diagnosis, p = 0.8, list = FALSE)
training_data <- dataset[train_index, ]
testing_data <- dataset[-train_index, ]

# 2. Train a classification model (e.g., Random Forest or Logistic Regression)
# Here, we predict the "Diagnosis" label using all other gene columns (.)
# method = "rf" selects a Random Forest (requires the randomForest package);
# Diagnosis should be stored as a factor for classification
model <- train(Diagnosis ~ ., data = training_data, method = "rf")

# 3. Use the trained model to make predictions on the brand-new, unseen test data
predictions <- predict(model, newdata = testing_data)

# 4. Evaluate the model's accuracy
confusionMatrix(predictions, testing_data$Diagnosis)
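To see what these four steps do under the hood, here is a self-contained sketch in base R that runs without caret. It rests on several assumptions: the dataset is simulated, it uses only 5 hypothetical gene columns instead of 5,000, logistic regression (`glm`) is swapped in for the Random Forest, and the stratified split and confusion matrix are written by hand rather than with caret's helpers.

```r
set.seed(123)
n <- 100
n_genes <- 5  # small hypothetical stand-in for the 5,000 genes in the scenario

# Simulated expression values; gene1 and gene2 drive the diagnosis (an assumption).
genes <- matrix(rnorm(n * n_genes), nrow = n,
                dimnames = list(NULL, paste0("gene", 1:n_genes)))
score <- 2 * genes[, "gene1"] - 1.5 * genes[, "gene2"]
Diagnosis <- factor(ifelse(score + rnorm(n) > 0, "malignant", "benign"),
                    levels = c("benign", "malignant"))
dataset <- data.frame(Diagnosis, genes)

# 1. Stratified 80/20 split: sample 80% of the row numbers within each class.
rows_by_class <- split(seq_len(n), dataset$Diagnosis)
train_index <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}), use.names = FALSE)
training_data <- dataset[train_index, ]
testing_data <- dataset[-train_index, ]

# 2. Train a classifier on the training rows only.
model <- glm(Diagnosis ~ ., data = training_data, family = binomial)

# 3. Predict on the held-out rows and convert probabilities into labels.
prob <- predict(model, newdata = testing_data, type = "response")
predictions <- factor(ifelse(prob > 0.5, "malignant", "benign"),
                      levels = levels(dataset$Diagnosis))

# 4. Evaluate: a confusion matrix counts predicted vs. actual labels;
#    accuracy is the share of counts on the diagonal.
conf_mat <- table(Predicted = predictions, Actual = testing_data$Diagnosis)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Sampling within each class keeps the benign/malignant ratio similar in the training and testing sets. caret's `createDataPartition()` does the same when the outcome is a factor.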