
Week 12: Introduction to Scripting in Google Colab

So far this semester, we have run all of our statistical analyses and sequencing data visualizations on our own laptops using RStudio. But what happens when you need to analyze a 500-gigabyte clinical dataset, or train a machine learning model far too large for your personal computer’s processor and memory? We move to the Cloud.

1. The Biological Problem

Imagine you are accepted into a prestigious international research consortium tasked with analyzing the whole-genome sequences of 10,000 newly discovered deep-sea organisms. The raw genomic files are massive, running into the terabytes.

You cannot simply email a dataset that large to your collaborators in Europe. Even with a very fast internet connection, their personal laptops might run different software versions or lack the RAM needed to load the dataset. To collaborate on global biological threats or diverse ecological projects, we need a single, shared, high-powered computer in the cloud where everyone can look at the exact same dataset, run the exact same scripts, and achieve identical, reproducible results.
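One small piece of the "different software versions" problem can be made concrete: a script can record which interpreter and operating system it ran on, so collaborators can compare environments. The following is a minimal sketch using only the Python standard library. The function name `environment_report` is invented for illustration and is not part of Colab.

```python
import platform


def environment_report():
    """Return a small dictionary describing the machine this code runs on."""
    return {
        "python_version": platform.python_version(),          # e.g. "3.x.y"
        "implementation": platform.python_implementation(),   # e.g. "CPython"
        "operating_system": platform.system(),                # e.g. "Linux"
        "machine": platform.machine(),                        # CPU architecture
    }


# Hypothetical usage: print the report at the top of a shared notebook.
report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If two collaborators run this and see different values, that difference is one possible reason their results do not match.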

2. Intuition & Theory

To resolve these computational and collaboration challenges, we use two interlinked technologies:

Cloud Computing: High-Efficiency Supercomputers

Instead of buying a $10,000 computer rig for every research lab, Cloud Computing allows us to rent virtual computing power hosted in Google or Amazon data centers. When you launch a cloud workspace, your browser connects to a remote server in one of those data centers, and your code runs there rather than on your laptop.
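Because the code runs on a rented machine, it can be useful to check what that machine actually offers. The sketch below uses only the Python standard library to report the visible CPU cores and disk space. The function name `describe_machine` is an illustrative assumption, and the numbers you see will depend on whichever machine you happen to be given.

```python
import os
import shutil


def describe_machine(path="."):
    """Summarize CPU count and disk space for the machine running this code."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    gib = 1024 ** 3                 # bytes per gibibyte
    return {
        "cpu_count": os.cpu_count(),        # may be None if it cannot be determined
        "disk_total_gib": disk.total / gib,
        "disk_free_gib": disk.free / gib,
    }


info = describe_machine()
print(f"CPU cores visible: {info['cpu_count']}")
print(f"Disk: {info['disk_free_gib']:.1f} GiB free of {info['disk_total_gib']:.1f} GiB")
```

Running the same cell on your laptop and in a cloud notebook is a quick way to compare the two machines.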

The Jupyter Notebook Paradigm

In RStudio, code script files and visual output charts are displayed in separate panels. Google Colaboratory (or “Colab”) is built on the Jupyter Notebook model, which seamlessly alternates between two types of spaces:

  • Text Cells: Use a simple formatting style called Markdown to write paragraphs, format equations, embed images, and write notes to explain the science.
  • Code Cells: Contain actual interactive snippets of code (typically Python or R) that you can execute live right inside your browser window.
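Under the hood, a Jupyter notebook file (`.ipynb`) is a JSON document containing a list of cells, each tagged as text (`markdown`) or `code`. The sketch below builds a simplified two-cell notebook by hand to show that structure. The cell contents are made up for illustration, and real notebooks saved by Colab carry additional metadata.

```python
import json

# Hypothetical two-cell notebook in the Jupyter notebook format (version 4).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",   # a text cell
            "metadata": {},
            "source": ["# GC content notes\n", "Explain the analysis here."],
        },
        {
            "cell_type": "code",       # a code cell
            "metadata": {},
            "execution_count": None,   # the cell has not been run yet
            "outputs": [],             # filled in when the cell is executed
            "source": ["print('hello from a code cell')"],
        },
    ],
}

# Serialize to the text that would be stored in an .ipynb file, then read it back.
notebook_text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(notebook_text)["cells"]]
print(cell_types)  # ['markdown', 'code']
```

Because the explanation, the code, and its outputs are stored together in one file, sharing the notebook shares all three.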

Figure: Google Colab notebook interface. Source: Wikimedia Commons, "Jupyter Notebook Anatomy".

3. Visual Breakdown

To see how to set up, format, and execute your very first cloud notebooks, watch this step-by-step tutorial:

4. Translating Theory to Code

Writing in Google Colab is designed to be highly interactive. Let’s look at how text and code live side-by-side:

Writing Notes in a Text Cell

To format clear descriptions to send to your co-authors, you use basic Markdown headers and bold styles:

# Section Header: Analyzing Ocean Genome #5
Here is a list of my **initial observations** of the mutated sequences:
* Observation A: High GC content detected.
* Observation B: Mutation located on Chromosome 3.

Running Scripts in a Code Cell

Directly below that text, you enter a code cell and click the “Play” icon. The cloud server executes the code and prints the results directly beneath the cell:

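# --- Python Code Cell Example ---
# Calculate the GC content of a sample sequence
dna_sequence = "GCTATCGATCGA"
# Count the G and C bases, then express them as a percentage of all bases
gc_count = dna_sequence.count("G") + dna_sequence.count("C")
gc_percentage = (gc_count / len(dna_sequence)) * 100
# Round to one decimal place so longer sequences print cleanly
print(f"Genomic GC Content: {gc_percentage:.1f}%")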
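Real sequence data is often messier than the sample above: it may be lowercase or contain ambiguous bases such as N. One possible way to handle this is to wrap the calculation in a reusable function. This is a sketch under stated assumptions. The name `gc_content` and the choice to skip non-ACGT characters are illustrative decisions, not a standard.

```python
def gc_content(sequence):
    """Return the GC percentage of a DNA sequence, ignoring case.

    Only A, C, G, and T are counted; any other character (such as N)
    is skipped. Raises ValueError if no valid bases are present.
    """
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total * 100


print(f"{gc_content('GCTATCGATCGA'):.1f}%")  # 50.0%
print(f"{gc_content('ggccNNat'):.1f}%")      # 66.7%
```

Dividing by the number of valid bases rather than `len(sequence)` keeps ambiguous positions from lowering the percentage.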

Topics Covered

google colab bioinformatics, cloud computing biology, jupyter notebooks, reproducible research, data science cloud