Week 12: Introduction to Scripting in Google Colab
So far this semester, we have performed all of our statistical analyses and sequencing-data visualizations on our own laptops using RStudio. But what happens when you need to analyze a 500-gigabyte clinical dataset, or train a machine learning model so large that it would cause your personal computer’s processor to melt? We move to the cloud.
1. The Biological Problem
Imagine you are accepted into a prestigious international research consortium tasked with analyzing the whole-genome sequences of 10,000 newly discovered deep-sea organisms. The raw genomic files are massive, running into the terabytes.
You cannot simply email a dataset that large to your collaborators in Europe. Even with a superfast internet connection, their personal laptops might run different software versions or lack the memory (RAM) needed to load the dataset. To collaborate on global biological threats or diverse ecological projects, we need a single, shared, high-powered computer in the sky where everyone can look at the exact same dataset, run the exact same scripts, and achieve identical, reproducible results.
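To put the scale in perspective, here is a back-of-the-envelope sketch of how long a file transfer would take. The 1 TB file size and 100 Mbit/s connection speed are illustrative assumptions, not figures from the consortium scenario, and the function name `transfer_hours` is made up for this example.

```python
# Rough transfer-time estimate. All numbers are illustrative assumptions.

def transfer_hours(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Return the hours needed to move a dataset over a network link."""
    bits = size_terabytes * 1e12 * 8               # 1 TB = 10^12 bytes, 8 bits per byte
    seconds = bits / (link_megabits_per_s * 1e6)   # 1 Mbit/s = 10^6 bits per second
    return seconds / 3600

hours = transfer_hours(size_terabytes=1, link_megabits_per_s=100)
print(f"1 TB at 100 Mbit/s: about {hours:.1f} hours")
```

Under these assumptions, a single terabyte already takes most of a day to move, and this ignores interruptions and protocol overhead. Keeping the data in one shared place avoids the transfer entirely.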
2. Intuition & Theory
To resolve these computational and collaboration challenges, we use two interlinked technologies:
Cloud Computing: High-Efficiency Supercomputers
Instead of buying a $10,000 computer rig for every research lab, cloud computing allows us to rent virtual computing power hosted in Google or Amazon data centers. When you launch a cloud workspace, you are connected within moments to a remote server running in one of those data centers, and your browser simply displays what that server computes.
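Because the code runs on a rented machine rather than your laptop, it can be useful to check what that machine looks like. The following is a minimal sketch using only Python's standard library; the helper name `describe_machine` is hypothetical, and the values it reports will differ from one session or provider to the next.

```python
import os
import platform
import sys

def describe_machine() -> dict:
    """Collect a few basic facts about the machine running this code."""
    return {
        "operating_system": platform.system(),      # e.g. "Linux" on most cloud servers
        "python_version": sys.version.split()[0],   # version of the Python interpreter
        "cpu_cores": os.cpu_count(),                # logical CPU cores (may be None)
    }

for name, value in describe_machine().items():
    print(f"{name}: {value}")
```

Running the same cell on your laptop and in a cloud notebook is a quick way to see that the two environments really are different computers.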
The Jupyter Notebook Paradigm
In RStudio, scripts and their output plots are displayed in separate panels. Google Colaboratory (or “Colab”) is built on the Jupyter Notebook model, which instead interleaves two types of cells in a single document:
- Text Cells: Use a simple formatting style called Markdown to write formatted paragraphs, typeset equations, embed images, and add notes that explain the science.
- Code Cells: Contain actual interactive snippets of code (typically Python or R) that you can execute live right inside your browser window.
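Under the hood, a notebook file (`.ipynb`) is stored as JSON: essentially a list of cells, each labeled with its type. The sketch below builds a simplified two-cell notebook as a Python dictionary to make that structure visible. The cell contents and the helper `count_cells` are invented for illustration, and real notebooks saved by Colab carry additional metadata not shown here.

```python
import json

# A simplified notebook: one text (Markdown) cell followed by one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Ocean Genome Notes\n", "GC content looks **high**."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,   # not yet run
            "outputs": [],             # results appear here after execution
            "source": ["print('hello from a code cell')"],
        },
    ],
}

def count_cells(nb: dict) -> dict:
    """Count how many cells of each type a notebook contains."""
    counts = {}
    for cell in nb["cells"]:
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return counts

print(count_cells(notebook))
notebook_json = json.dumps(notebook, indent=2)   # the text that would be saved to disk
```

Because text and code live in the same file, sharing the notebook shares the explanation and the analysis together.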

3. Visual Breakdown
To see how to set up, format, and execute your very first cloud notebooks, watch this step-by-step tutorial:
4. Translating Theory to Code
Writing in Google Colab is designed to be highly interactive. Let’s look at how text and code live side-by-side:
Writing Notes in a Text Cell
To format clear descriptions to send to your co-authors, you use basic Markdown headers and bold styles:
    # Section Header: Analyzing Ocean Genome #5

    Here is a list of my **initial observations** of the mutated sequences:

    * Observation A: High GC content detected.
    * Observation B: Mutation located on Chromosome 3.

Running Scripts in a Code Cell
Directly below that text, you add a code cell, type your code, and click the “Play” icon. The cloud server executes the code and prints the results directly beneath the cell:
    # --- Python Code Cell Example ---
    # Calculate the GC content of sample sequence
    dna_sequence = "GCTATCGATCGA"
    gc_count = dna_sequence.count("G") + dna_sequence.count("C")
    gc_percentage = (gc_count / len(dna_sequence)) * 100
    print(f"Genomic GC Content: {gc_percentage}%")
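Once a calculation like this works for one sequence, a natural next step is to wrap it in a function so it can be reused across many samples. The following is one possible sketch, not the only way to do it. The function name `gc_content`, the sample names, and the second sequence are hypothetical. Treating lowercase letters as valid bases and returning 0.0 for an empty sequence are design assumptions made here.

```python
def gc_content(sequence: str) -> float:
    """Return the GC percentage of a DNA sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()        # accept lowercase input such as "atgc"
    if not sequence:
        return 0.0                     # avoid dividing by zero
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence) * 100

# Hypothetical samples; the first reuses the sequence from the cell above.
samples = {
    "sample_1": "GCTATCGATCGA",
    "sample_2": "atgcgc",
}

for name, seq in samples.items():
    print(f"{name}: {gc_content(seq):.1f}% GC")
```

Formatting with `:.1f` rounds the printed value to one decimal place, which keeps the output tidy when the percentage is not a round number.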