Week 9: Feature Engineering & Data Pre-processing
Welcome to Week 9! Today, we are discussing the absolute golden rule of machine learning: “Garbage In, Garbage Out.” No matter how advanced your machine learning algorithm is, it cannot learn anything useful if you feed it noisy, messy, or incorrectly formatted data. In fact, most of a bioinformatician’s time is spent preparing, cleaning, and transforming data—a process called data pre-processing and feature engineering.
1. The Biological Problem
Imagine you are building a machine learning model to predict how breast cancer patients will respond to a new chemotherapy drug. You collect clinical records and blood panels, but you immediately run into several real-world issues:
- Missing Information: The clinical spreadsheet is incomplete; some hospital staff forgot to record patient ages or body mass index (BMI) values.
- Irrelevant Data: Some of your measured genes have zero variance. For example, they are completely silent (expression of 0) in every sample from every patient. A feature that never changes cannot help a model tell patients apart, so it provides no classification value.
- Non-Numeric Types: Some records contain categorical values like blood type (A, B, AB, O) or patient sex (Male, Female). Computers don’t speak biological categories; they only understand numbers.
How do we convert these messy medical files into a neat, standardized mathematical grid that an algorithm can use to predict drug efficacy?
2. Intuition & Theory
To solve these issues, we apply three foundational pre-processing techniques:
Imputation: Handling Missing Data
We cannot always simply throw away rows with missing data because clinical samples are incredibly expensive and rare. Imputation is the process of filling in those gaps with educated guesses. For example, we might replace a missing patient age with the median age of all patients in that study group.
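To make this concrete, here is a minimal sketch of median imputation in base R. The toy table, its column names (age, bmi), and the helper impute_median are made up for illustration. For simplicity it uses the median of the whole column rather than the median within each study group.

```r
# Hypothetical toy clinical table; names and values are illustrative only
clinical <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4", "P5"),
  age        = c(34, NA, 58, 61, NA),
  bmi        = c(22.1, 27.4, NA, 30.2, 24.8)
)

# Replace every NA in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to each numeric column that may contain gaps
numeric_cols <- c("age", "bmi")
clinical[numeric_cols] <- lapply(clinical[numeric_cols], impute_median)

print(clinical)
```

The median is often preferred over the mean here because a few extreme values (a very old patient, for instance) pull it around much less.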
One-Hot Encoding: Translating Words to Numbers
Algorithms do mathematical computations. How do we turn Blood Type A or O into numbers? We use One-Hot Encoding, which creates a binary column for each potential category. If a patient is blood type A, they get a 1 under the column BloodType_A and 0 under all other blood type columns.
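One possible way to sketch this in base R is with model.matrix(). The toy patients table below is invented for illustration, and the column renaming is only there to match the BloodType_A naming used above.

```r
# Hypothetical toy table; patient IDs and blood types are illustrative only
patients <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4"),
  BloodType  = factor(c("A", "O", "AB", "B"), levels = c("A", "B", "AB", "O"))
)

# "~ BloodType - 1" removes the intercept so every level gets its own 0/1 column
one_hot <- model.matrix(~ BloodType - 1, data = patients)

# model.matrix names columns "BloodTypeA", ...; insert an underscore for readability
colnames(one_hot) <- sub("^BloodType", "BloodType_", colnames(one_hot))

# Attach the binary columns back to the patient identifiers
encoded <- cbind(patients["patient_id"], as.data.frame(one_hot))
print(encoded)
```

Each row contains exactly one 1 across the four blood type columns, which is where the name "one-hot" comes from.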

Scaling/Normalization: Leveling the Playing Field
Imagine you are measuring two features: Age (ranging from 10 to 90) and blood platelet count (ranging in the hundreds of thousands). If you plug these values straight into an algorithm that relies on distances or gradient descent (k-nearest neighbors, for example), the raw magnitude of the platelet count will overpower the patient’s age. We use Scaling or Normalization to put all measurements on a shared playing field, such as rescaling them to lie between 0 and 1, or standardizing them to have a mean of 0 and a standard deviation of 1.
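To see the "overpowering" effect in numbers, here is a small sketch that compares Euclidean distances between patients before and after standardizing. The five patients and their values are invented for illustration.

```r
# Hypothetical patients; P1 and P2 are 80 years apart but have similar platelets
patients <- data.frame(
  age       = c(10, 90, 50, 30, 70),
  platelets = c(300000, 301000, 150000, 450000, 220000),
  row.names = paste0("P", 1:5)
)

# Euclidean distances on the raw values: platelet differences dominate
raw_dist <- as.matrix(dist(patients))

# Euclidean distances after standardizing each column (mean 0, sd 1)
scaled_dist <- as.matrix(dist(scale(patients)))

# Distances from P1 to every other patient, before and after scaling
print(round(raw_dist["P1", -1]))
print(round(scaled_dist["P1", -1], 2))
```

On the raw data, P2 looks like P1's closest match even though they are 80 years apart, because a gap of 1,000 platelets is tiny next to the other platelet gaps. After scaling, the age gap counts just as much, and P2 becomes the patient farthest from P1.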
3. Visual Breakdown
To understand why data scaling and pre-processing are so vital for neural networks and machine learning models, watch this crystal-clear walkthrough:
4. Translating Theory to Code
In R, pre-processing and feature scaling can be accomplished with built-in commands or packages like caret. Here are the core transformations you will practice performing in Thursday’s lab:
# --- Data Pre-processing Snippets ---

# 1. Clean the dataset by omitting rows with missing values (NA)
cleaned_data <- na.omit(raw_data)

# 2. Rescale raw numeric gene columns using the scale() function
# This standardizes each column (Mean = 0, Standard Deviation = 1)
numeric_columns <- cleaned_data[, c("Gene_A", "Gene_B", "Gene_C")]
scaled_numeric_data <- scale(numeric_columns)

# 3. Quick min-max scaling function to scale data between 0 and 1
min_max_scale <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
normalized_gene_a <- min_max_scale(cleaned_data$Gene_A)
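The zero-variance genes from Section 1 deserve one more step before scaling. Here is a rough sketch, assuming a toy expression table (the values are invented, and Gene_B is deliberately silent) that drops constant columns first. This matters because scale() divides by the standard deviation, so a zero-variance column turns into NaN.

```r
# Hypothetical expression table; Gene_B is silent (0) in every sample
expr <- data.frame(
  Gene_A = c(2.1, 5.4, 3.3, 4.8),
  Gene_B = c(0, 0, 0, 0),
  Gene_C = c(10.2, 9.7, 11.5, 8.9)
)

# Compute the variance of each gene column
gene_variance <- vapply(expr, var, numeric(1))

# Keep only genes whose values actually change across samples
keep <- gene_variance > 0
filtered <- expr[, keep, drop = FALSE]

# Standardizing the filtered table is now safe (no division by zero)
scaled <- scale(filtered)
print(colnames(filtered))
```

If you are using caret, its nearZeroVar() function offers a related filter that also flags columns that are almost constant.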