Final Capstone Project: Predictive Genomic Modeling
Congratulations on making it to this point in the BIOL4000L course! You have journeyed remarkably far: from your very first steps in installing R, setting up library paths, and managing workspace environments, to executing exploratory visualizations, engineering raw genomic inputs, and building predictive machine learning models in high-performance cloud notebook instances.
To demonstrate your mastery of computational biology and predictive medicine, you will complete this Final Capstone Project in Predictive Genomic Modeling. This comprehensive project is worth 20% of your final semester grade and serves as the ultimate laboratory capstone where you showcase your ability to design a robust, clean, and fully reproducible bioinformatics pipeline.
1. The Biological Scenario
For this capstone, you will put yourself in the shoes of a Lead Computational Biologist working in a premier research hospital’s oncology department. Your clinical collaborators have recently performed RNA-sequencing (RNA-seq) on bulk tumor tissue harvested from 150 patients diagnosed with tumor growths.
The resulting high-dimensional dataset records the normalized expression profiles of 500 candidate cancer genes across these 150 individuals. Of the 150 patients, a subset has been diagnosed with a benign tumor, while the remaining individuals have been confirmed to have a malignant tumor.
Your clinic’s director has handed this spreadsheet to your team with a core mandate: Design. Preprocess. Predict. You must build an end-to-end machine learning and statistics pipeline in R that can parse this gene expression matrix, clean its metadata, identify key biological drivers of malignancy, and train a supervised binary classifier to predict the benign or malignant patient diagnosis using only genomic inputs.
2. Project Milestones (The Pipeline)
Your pipeline must be systematically organized into three distinct, chronological development milestones:
Phase 1: Exploratory Data Analysis & Statistics
Before training any model, a skilled bioinformatician must understand the texture of their biological inputs:
- Data Validation: Read the provided RNA-seq dataset into the R environment. Scan for missing value indicators (`NA`), infinite values, or corrupt records. Correct these manually or programmatically.
- Highly Variable Gene (HVG) Selection: Write an R script to compute the variance of every gene’s normalized expression across all 150 patients. Sort the results and identify the 3 most variably expressed genes.
- Statistical Evaluation: Conduct a two-sample t-test (or non-parametric alternative where appropriate) to verify whether the mean expression levels of these top 3 genes differ significantly across benign and malignant patient tumor groups.
- Publication-Quality Visuals: Design customized, publication-ready `ggplot2` visualizations (e.g., grouped boxplots, density plots, or an interactive volcano plot) documenting these expression profiles. Ensure they are adorned with high-contrast color palettes, clear labels, legend titles, and proper axis formatting.
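The Phase 1 steps above can be sketched roughly as follows. This is only a minimal illustration on simulated data, not the expected solution: the matrix `expr`, the `diagnosis` factor, the gene and patient names, and the choice of median imputation are all assumptions made for the example, since the handout does not specify the real file's name or layout. The plot is built only if `ggplot2` is installed.

```r
# Simulated stand-in for the course dataset. The real file name and column
# layout are not given in this handout, so everything here is hypothetical.
set.seed(42)
n_patients <- 150
n_genes <- 500
expr <- matrix(
  rnorm(n_patients * n_genes),
  nrow = n_patients,
  dimnames = list(paste0("patient_", seq_len(n_patients)),
                  paste0("gene_", seq_len(n_genes)))
)
diagnosis <- factor(rep(c("benign", "malignant"), length.out = n_patients))

# Make three simulated genes noisier and shifted in malignant samples,
# then inject two bad records to clean up.
is_malignant <- diagnosis == "malignant"
expr[is_malignant, 1:3] <- expr[is_malignant, 1:3] * 2 + 3
expr[5, 10] <- NA
expr[7, 20] <- Inf

# 1. Data validation: flag NA, NaN, and +/-Inf in one pass, then impute
#    each affected gene with its median (one possible choice among several).
bad_cells <- !is.finite(expr)
n_bad <- sum(bad_cells)
expr[bad_cells] <- NA
for (j in which(colSums(is.na(expr)) > 0)) {
  expr[is.na(expr[, j]), j] <- median(expr[, j], na.rm = TRUE)
}

# 2. HVG selection: per-gene variance across patients, sorted descending.
gene_var <- apply(expr, 2, var)
top3 <- names(sort(gene_var, decreasing = TRUE))[1:3]

# 3. Welch two-sample t-test (the default of t.test) for each top gene.
p_values <- sapply(top3, function(g) {
  t.test(expr[is_malignant, g], expr[!is_malignant, g])$p.value
})

# 4. Grouped boxplot of the top genes, only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  plot_df <- data.frame(
    expression = as.vector(expr[, top3]),          # column-major order
    gene       = rep(top3, each = n_patients),
    diagnosis  = rep(diagnosis, times = length(top3))
  )
  p <- ggplot(plot_df, aes(x = gene, y = expression, fill = diagnosis)) +
    geom_boxplot() +
    scale_fill_manual(values = c(benign = "#0072B2", malignant = "#D55E00"),
                      name = "Diagnosis") +
    labs(title = "Expression of the 3 most variable genes",
         x = "Gene", y = "Normalized expression") +
    theme_minimal()
}
```

If the expression values look clearly non-normal, `wilcox.test()` takes the same two-vector form and can stand in as the non-parametric alternative.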
Phase 2: Feature Engineering
Transforming raw data into predictive, stable mathematical properties is essential for high-fidelity modeling:
- Input Scaling: Explain why normalized expression values often still need standardization. Programmatically center and scale the gene expression matrix using R’s `scale()` function or a custom workflow.
- Data Cleaning: Handle the intentional “dirty data” points left in the clinical spreadsheet. This includes parsing and encoding categorical metadata (converting it into numeric integers or cleanly structured R `factor` variables).
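As a rough illustration of both Phase 2 tasks, the sketch below cleans a handful of inconsistent labels and standardizes a small expression matrix. The label spellings, the abbreviations `"b"` and `"m"`, and the gene names are invented for the example; the dirty values in the actual spreadsheet may look different.

```r
# Hypothetical examples of inconsistent diagnosis labels.
raw_labels <- c(" Malignant", "benign", "MALIGNANT ", "Benign", "malignant", "B")

# Normalize whitespace and case, expand abbreviations, then encode.
clean <- tolower(trimws(raw_labels))
clean[clean == "b"] <- "benign"
clean[clean == "m"] <- "malignant"
diagnosis <- factor(clean, levels = c("benign", "malignant"))
diagnosis_numeric <- as.integer(diagnosis) - 1L   # 0 = benign, 1 = malignant

# Two simulated genes on very different scales.
set.seed(1)
expr <- cbind(gene_a = rnorm(6, mean = 100, sd = 20),
              gene_b = rnorm(6, mean = 5,   sd = 0.5))

# scale() centers each column to mean 0 and rescales it to sd 1.
expr_scaled <- scale(expr, center = TRUE, scale = TRUE)

# The fitted parameters are stored as attributes, so the same transform
# can be reapplied to new samples without recomputing it from them.
centers <- attr(expr_scaled, "scaled:center")
sds     <- attr(expr_scaled, "scaled:scale")
new_sample <- cbind(gene_a = 120, gene_b = 4.5)
new_scaled <- scale(new_sample, center = centers, scale = sds)
```

Reusing the stored `centers` and `sds` is what lets you scale a test set with training-set statistics only, which matters for the leakage criterion in the grading rubric.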
Phase 3: Machine Learning & Evaluation
Train, test, and extract reliable insights from your diagnostic modeling framework:
- Train/Test Split: Implement a hold-out validation setup by partitioning your patient cohort into an 80/20 train/test split. Set a random seed (e.g., with `set.seed()`) so the split is completely reproducible.
- Supervised Model Training: Using only the scaled training subset, fit at least one binary classification model (such as a Logistic Regression model or a Random Forest classifier) using patient diagnoses as the dependent target variable and gene expressions as inputs.
- Confusion Matrix Generation: Evaluate your fitted classifier on the held-out 20% test cohort. Construct a confusion matrix comparing actual pathological diagnoses against model predictions.
- Metrics Extraction: Write code that explicitly computes your classifier’s Precision and Recall. Interpret what these metrics mean biologically, specifically clarifying which type of diagnostic risk increases when a model yields low recall versus low precision in cancer prediction.
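One possible shape for the Phase 3 workflow is sketched below on simulated data with two invented genes. It assumes "malignant" is the positive class and uses logistic regression with a 0.5 probability cutoff; both are illustrative choices, not requirements. With that convention, precision is TP / (TP + FP) and recall is TP / (TP + FN).

```r
# Simulated cohort: two hypothetical genes that differ between groups.
set.seed(2024)
n <- 150
diagnosis <- factor(rep(c("benign", "malignant"), length.out = n),
                    levels = c("benign", "malignant"))
dat <- data.frame(
  diagnosis = diagnosis,
  gene_1 = rnorm(n, mean = ifelse(diagnosis == "malignant", 1.5, 0)),
  gene_2 = rnorm(n, mean = ifelse(diagnosis == "malignant", -1, 0))
)

# Seeded 80/20 hold-out split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit the scaling on the training set only, then apply it to both sets.
gene_cols <- c("gene_1", "gene_2")
train_scaled <- scale(train[, gene_cols])
ctr <- attr(train_scaled, "scaled:center")
scl <- attr(train_scaled, "scaled:scale")
train[gene_cols] <- as.data.frame(train_scaled)
test[gene_cols]  <- as.data.frame(scale(test[, gene_cols],
                                        center = ctr, scale = scl))

# Logistic regression: with a factor response, glm models the probability
# of the second level ("malignant").
fit  <- glm(diagnosis ~ gene_1 + gene_2, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "malignant", "benign"),
               levels = levels(dat$diagnosis))

# Confusion matrix: rows are predictions, columns are true diagnoses.
cm <- table(Predicted = pred, Actual = test$diagnosis)
tp <- cm["malignant", "malignant"]
fp <- cm["malignant", "benign"]
fn <- cm["benign", "malignant"]

precision <- tp / (tp + fp)   # of predicted malignant, how many truly are
recall    <- tp / (tp + fn)   # of truly malignant, how many were caught
```

Fixing the factor levels on `pred` keeps the confusion matrix 2 x 2 even if the model happens to predict only one class on the test set.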
3. The Bias-Variance Trade-off Analysis
As part of your project deliverable, you must include a written conceptual reflection addressing the Bias-Variance Trade-off within your specific modeling workflow.
Write a 300-word critical evaluation answering the following prompt:
Explain the specific strategies, decisions, and preprocessing steps you integrated into your R pipeline to prevent your trained classification model from overfitting to the training subset. How did you balance the bias-variance trade-off so that your model generalizes well when evaluated against entirely novel, unseen patient datasets from different clinical labs?
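One simple way to gather evidence for this reflection is to compare training and test accuracy for models of different complexity. The sketch below is an illustration on simulated data, not part of the required pipeline: it fits one logistic regression on 2 informative genes and another on those 2 plus 38 pure-noise genes. All names and effect sizes are invented for the example.

```r
# Simulated cohort: genes 1-2 carry signal, genes 3-40 are pure noise.
set.seed(7)
n <- 150
y <- rep(c(0, 1), length.out = n)                 # 0 = benign, 1 = malignant
signal <- matrix(rnorm(n * 2), n, 2) + cbind(1.2 * y, -0.8 * y)
noise  <- matrix(rnorm(n * 38), n, 38)
dat <- data.frame(y = y, signal, noise)
names(dat) <- c("y", paste0("gene_", 1:40))

# Seeded 80/20 hold-out split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fraction of samples classified correctly at a 0.5 probability cutoff.
accuracy <- function(fit, data) {
  prob <- suppressWarnings(predict(fit, newdata = data, type = "response"))
  mean((prob > 0.5) == (data$y == 1))
}

# Simple model (2 genes) versus flexible model (all 40 genes).
# Warnings are suppressed because the larger fit may separate the
# training data almost perfectly.
fit_small <- glm(y ~ gene_1 + gene_2, data = train, family = binomial)
fit_large <- suppressWarnings(glm(y ~ ., data = train, family = binomial))

results <- data.frame(
  model     = c("2 genes", "40 genes"),
  train_acc = c(accuracy(fit_small, train), accuracy(fit_large, train)),
  test_acc  = c(accuracy(fit_small, test),  accuracy(fit_large, test))
)
print(results)
```

A wide gap between `train_acc` and `test_acc` for the larger model would be a typical sign of high variance (overfitting), while low accuracy on both sets would point toward high bias.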
4. Grading Rubric
Your final submission will be evaluated out of a total of 100 points, allocated across the following strict laboratory criteria:
| Category | Points | Core Criteria for Max Points |
|---|---|---|
| Code Quality & Reproducibility | 20 pts | The script contains complete code comments, defines clear functions, avoids hardcoding, and runs from start to finish without generating fatal syntax or path errors. |
| EDA & Visualization | 25 pts | Data checks are thorough, the t-test is set up and interpreted correctly, and all ggplot2 graphs are clean, fully labeled, styled with legible color palettes, and biologically intuitive. |
| Model Building & Accuracy | 25 pts | The data split is randomized and seeded, scaling and feature encoding are executed without leaks, and the predictive model successfully surpasses null/random baseline performance. |
| Evaluation Metrics | 15 pts | The Confusion Matrix is formatted correctly. The calculations for precision and recall are mathematically flawless and interpreted correctly in a biological context. |
| Bias-Variance Analysis | 15 pts | The student’s written analysis is comprehensive, well-structured, approximately 300 words, and demonstrates a mature understanding of bias, variance, and model generalization techniques. |
5. Submission Guidelines
To complete your capstone requirements, you must package and submit the following two files via your AUW student portal (or your designated Google Classroom assignment page) before the final deadline:
- Your Complete R Markdown / Script File (`.R` or `.ipynb`): This file must contain your complete, end-to-end code from raw data ingestion to modeling output. Ensure your code is thoroughly documented with R comments (`#`) so that any user can run it and obtain identical outputs.
- Your Compiled Lab Report (`.pdf`): A cleanly formatted, compiled PDF containing all generated `ggplot2` visualizations, statistical summaries, confusion matrices, and your 300-word written analysis of the bias-variance trade-off.