Week 10: ML Model Building and Evaluation

Welcome to Week 10! You have learned how to clean your genomic datasets, fill in the blanks with imputation, scale features down to a standard range, and even train robust machine learning models like Random Forests.

But here is the million-dollar question: How do we know our machine learning model actually works? Imagine you train a model to detect tumor cells, and it reports a seemingly perfect 100% accuracy score on your computer. Excitedly, you hand it over to a local hospital, and… it completely fails to diagnose new patients. Why? Let’s dive in to understand how we can fairly and rigorously evaluate predictive models.

1. The Biological Problem

The core danger in machine learning is called overfitting, which is the mathematical equivalent of memorization.

Imagine you are a biology student preparing for a final exam. If your instructor gives you a practice test and you simply memorize the exact answer key, you might score 100% on that practice test. But if the actual exam has entirely new questions testing the same concepts, and you only memorized the specific answers rather than learning the underlying biological rules, you will likely fail.

Similarly, if we evaluate a cancer-predicting model using the exact same genomic samples that it used to learn, the model can appear to perform almost flawlessly because it has already seen the “answers.” We need a fair way to evaluate whether our models have learned genuine biological signals, or whether they have simply memorized the training data.
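To make the memorization problem concrete, here is a minimal sketch in base R. Everything in it is a made-up assumption for illustration: the `expression` matrix is pure random noise, the `status` labels are assigned at random, and `predict_nearest` is a hypothetical helper that simply copies the label of the closest training sample. Because there is no real signal to learn, any high score can only come from memorization.

```r
# Hypothetical simulation: features are pure noise, so there is nothing real to learn
set.seed(1)
n_samples <- 200
n_genes <- 20
expression <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
status <- sample(c("Cancer", "Healthy"), n_samples, replace = TRUE)

# Hold out 60 of the 200 samples that the model never sees while "learning"
train_idx <- sample(seq_len(n_samples), size = 140)
train_x <- expression[train_idx, , drop = FALSE]
train_y <- status[train_idx]
test_x <- expression[-train_idx, , drop = FALSE]
test_y <- status[-train_idx]

# A "memorizing" model: copy the label of the single closest training sample
predict_nearest <- function(train_x, train_y, new_x) {
  apply(new_x, 1, function(sample_row) {
    squared_dist <- colSums((t(train_x) - sample_row)^2)
    train_y[which.min(squared_dist)]
  })
}

# Grade the model on the samples it memorized, then on the held-out samples
train_accuracy <- mean(predict_nearest(train_x, train_y, train_x) == train_y)
test_accuracy <- mean(predict_nearest(train_x, train_y, test_x) == test_y)

cat("Accuracy on training samples:", train_accuracy, "\n")
cat("Accuracy on unseen samples:  ", test_accuracy, "\n")
```

On the training samples every point is its own nearest neighbour, so the score is perfect. On the held-out samples the score should hover around coin-flip level, which is the honest answer for data with no signal.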

2. Intuition & Theory

To solve this validation dilemma, we use several key concepts and metrics:

The Train/Test Split

The simplest way to prevent cheating is the Train/Test Split. We segment our biological data into two independent pools:

  • Training Set (e.g., 80% of data): The textbook we feed to the algorithm to let it learn patterns.
  • Testing Set (e.g., 20% of data): Housed in a diagnostic “vault” that the model is never allowed to see or interact with during its preparation phase. We only expose the model to this test set to grade its performance at the very end.

The Confusion Matrix

Once the model predicts labels for our unseen test set, we compare its predictions side-by-side with the true, clinical diagnoses. This comparison is structured in a 2x2 grid called a Confusion Matrix:

  • True Positive (TP): Model predicts “Cancer,” and the patient has cancer. (Success!)
  • True Negative (TN): Model predicts “Healthy,” and the patient is healthy. (Success!)
  • False Positive (FP): Model predicts “Cancer,” but the patient is healthy. (Type I Error)
  • False Negative (FN): Model predicts “Healthy,” but the patient actually has cancer. (Type II Error - highly dangerous!)

Figure: Confusion Matrix Anatomy. Source: Wikimedia Commons (Precision and Recall concepts).
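The four cells can be counted directly from two label vectors. The sketch below uses ten invented patients (the `actual` and `predicted` vectors are illustrative, not real data) and treats “Cancer” as the positive class.

```r
# Hypothetical labels for ten patients (illustrative only, not real data)
actual    <- c("Cancer", "Cancer", "Cancer", "Healthy", "Healthy",
               "Healthy", "Healthy", "Healthy", "Cancer", "Healthy")
predicted <- c("Cancer", "Healthy", "Cancer", "Healthy", "Cancer",
               "Healthy", "Healthy", "Healthy", "Cancer", "Healthy")

# Count each cell of the grid, with "Cancer" as the positive class
true_positives  <- sum(predicted == "Cancer"  & actual == "Cancer")
true_negatives  <- sum(predicted == "Healthy" & actual == "Healthy")
false_positives <- sum(predicted == "Cancer"  & actual == "Healthy")
false_negatives <- sum(predicted == "Healthy" & actual == "Cancer")

# Arrange the counts as a 2x2 grid (rows = predicted, columns = actual)
confusion <- matrix(
  c(true_positives, false_negatives, false_positives, true_negatives),
  nrow = 2,
  dimnames = list(Predicted = c("Cancer", "Healthy"),
                  Actual = c("Cancer", "Healthy"))
)
print(confusion)
```

For these ten patients the grid holds 3 true positives, 5 true negatives, 1 false positive, and 1 false negative.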

Why Accuracy is Often a Trap in Biology

Suppose you are designing a diagnostic test for an extremely rare genetic disease that only affects 1 in 1,000 patients. If your model simply guesses “Healthy” for every single person who walks through the clinic, it will be 99.9% accurate! However, it has failed to find a single sick patient.
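The arithmetic behind this trap is easy to check. The sketch below assumes a hypothetical clinic of exactly 1,000 patients, one of whom is sick, and a “model” that labels everyone as healthy.

```r
# Hypothetical clinic: 1,000 patients, exactly one of whom has the disease
n_patients <- 1000
actual <- c("Sick", rep("Healthy", n_patients - 1))

# A lazy "model" that labels every patient as healthy
predicted <- rep("Healthy", n_patients)

# Accuracy is the fraction of labels that match
accuracy <- mean(predicted == actual)

# Number of sick patients the model actually caught
sick_found <- sum(predicted == "Sick" & actual == "Sick")

cat("Accuracy:", accuracy, "\n")
cat("Sick patients found:", sick_found, "\n")
```

The model gets 999 of 1,000 labels right, yet the one patient who needed the test is missed.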

To handle this, we measure other metrics alongside accuracy:

  • Precision: Out of all patients the model predicted as “Cancer,” how many actually had it? ($TP / (TP + FP)$)
  • Recall (Sensitivity): Out of all the patients who actually have cancer, how many did the model successfully identify? ($TP / (TP + FN)$)
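Both formulas translate directly into one-line R functions. The helper names `precision` and `recall` and the counts below are assumptions chosen for illustration, not output from a real model.

```r
# Hypothetical helpers that mirror the two formulas above
precision <- function(tp, fp) tp / (tp + fp)
recall <- function(tp, fn) tp / (tp + fn)

# Invented counts: 40 true positives, 10 false positives, 20 false negatives
example_precision <- precision(tp = 40, fp = 10)  # 40 / 50
example_recall <- recall(tp = 40, fn = 20)        # 40 / 60

cat("Precision:", example_precision, "\n")
cat("Recall:   ", round(example_recall, 3), "\n")

# The "always Healthy" model from the rare-disease example:
# it never predicts the disease, so tp = 0 and fp = 0, and it misses the 1 sick patient
lazy_recall <- recall(tp = 0, fn = 1)
lazy_precision <- precision(tp = 0, fp = 0)  # 0 / 0 evaluates to NaN in R

cat("Lazy model recall:   ", lazy_recall, "\n")
cat("Lazy model precision:", lazy_precision, "\n")
```

A recall of 0 exposes the lazy model immediately, even though its accuracy looks excellent. Its precision is undefined because it never makes a positive prediction, so a real analysis would need to decide how to report that case.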

3. Visual Breakdown

To understand how to read a Confusion Matrix and why balancing Precision and Recall is so vital for biological models, watch this video:

4. Translating Theory to Code

In Thursday’s lab session, we will write R code to split datasets and calculate performance metrics. Here are the core code structures:

# --- Model Partitioning and Performance Evaluation ---
# Assumes 'tumor_dataset' is a data frame that is already loaded, with one row per sample

# Set a random seed so our random split is reproducible
set.seed(42)

# 1. Create indices for a 70% Train and 30% Test split
sample_size <- floor(0.70 * nrow(tumor_dataset))
train_indices <- sample(seq_len(nrow(tumor_dataset)), size = sample_size)

# Split the dataset
train_set <- tumor_dataset[train_indices, ]
test_set <- tumor_dataset[-train_indices, ]

# 2. Obtain model predictions on the testing set
# Here we assume the predictions are already saved in a vector 'predicted_labels'
# and the true clinical results are in 'test_set$clinical_status'

# 3. Form a basic confusion matrix
evaluation_matrix <- table(Predicted = predicted_labels, Actual = test_set$clinical_status)

# Show the resulting 2x2 matrix
print(evaluation_matrix)
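The snippet above depends on objects that are not defined here (`tumor_dataset` and `predicted_labels`). For practice before the lab, the following self-contained sketch fills those gaps with stand-ins. The data are simulated, the column names `marker_a` and `marker_b` are hypothetical, and a base R logistic regression (`glm`) is used as a simple substitute for whichever model you actually train.

```r
set.seed(42)

# Simulated stand-in for 'tumor_dataset' (hypothetical columns, not real data):
# marker_a is shifted upward in cancer samples, marker_b is pure noise
n_samples <- 300
clinical_status <- factor(
  sample(c("Healthy", "Cancer"), n_samples, replace = TRUE),
  levels = c("Healthy", "Cancer")
)
tumor_dataset <- data.frame(
  marker_a = rnorm(n_samples, mean = ifelse(clinical_status == "Cancer", 2, 0)),
  marker_b = rnorm(n_samples),
  clinical_status = clinical_status
)

# 70% Train / 30% Test split, as in the lab snippet
sample_size <- floor(0.70 * nrow(tumor_dataset))
train_indices <- sample(seq_len(nrow(tumor_dataset)), size = sample_size)
train_set <- tumor_dataset[train_indices, ]
test_set <- tumor_dataset[-train_indices, ]

# Fit a logistic regression on the training set only
model <- glm(clinical_status ~ marker_a + marker_b,
             data = train_set, family = binomial)

# Predict the probability of "Cancer" for the held-out samples, then threshold at 0.5
predicted_prob <- predict(model, newdata = test_set, type = "response")
predicted_labels <- factor(
  ifelse(predicted_prob > 0.5, "Cancer", "Healthy"),
  levels = c("Healthy", "Cancer")
)

# Confusion matrix on the test set
evaluation_matrix <- table(Predicted = predicted_labels,
                           Actual = test_set$clinical_status)
print(evaluation_matrix)

# Pull out the cells, treating "Cancer" as the positive class
tp <- evaluation_matrix["Cancer", "Cancer"]
fp <- evaluation_matrix["Cancer", "Healthy"]
fn <- evaluation_matrix["Healthy", "Cancer"]
tn <- evaluation_matrix["Healthy", "Healthy"]

accuracy <- (tp + tn) / sum(evaluation_matrix)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)

cat("Accuracy: ", round(accuracy, 3), "\n")
cat("Precision:", round(precision, 3), "\n")
cat("Recall:   ", round(recall, 3), "\n")
```

Fixing the factor levels on both the predictions and the true labels keeps the table 2x2 even if the model happens to predict only one class on the test set.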

Topics Covered

machine learning evaluation, confusion matrix, precision vs recall, train test split bioinformatics, cross-validation