Data handling using Python

1 Introduction

  • Data handling is the process of collecting, storing, and analyzing data. It is a core skill for biologists, whose conclusions depend on how well experimental data are managed.
  • Statistical analysis is often used to determine statistical significance and the sample size needed for accurate conclusions.
  • Data management ensures that raw and processed data can be efficiently organized to investigate specific questions related to the experiment.
  • Python is a popular programming language in bioinformatics research due to its versatility, ease of use, and powerful libraries.
  • Python libraries like Pandas and NumPy facilitate data manipulation and statistical operations.
  • Python 2 was discontinued in 2020, with Python 3 introducing significant changes, including a focus on consistency and ease of use for beginners.
  • Python is a community-driven language with a gentle learning curve, making it a good first language.

2.1 Datatypes

  • Data Structures: Specialized methods for storing and arranging information in programming, offering various ways to handle data depending on the type.
  • Common Python Datatypes:
    • int: Integers or whole numbers (e.g., 1, 3, 4, 0)
    • float: Decimal numbers or floating-point numbers (e.g., 1.0, 3.14, 2.33)
    • bool: Boolean values, representing True or False, used for creating conditions.
    • str: Strings, collections of characters like text, frequently used in biology to represent sequences and names.
  • String Importance: Strings are crucial for biologists due to the prevalence of DNA, RNA, protein sequences, and names being text-based.
  • String Representation: String data is always enclosed in quotes, for instance, "MKSGSGGGSP" is a Python string representing a peptide.
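
The four datatypes listed above can be inspected with the built-in type() function. This is a minimal sketch; the variable names and values are hypothetical and chosen only for illustration.

```python
# Hypothetical example values, one per basic datatype.
exon_count = 4            # int: a whole number
gc_fraction = 0.52        # float: a decimal number
is_coding = True          # bool: True or False
peptide = "MKSGSGGGSP"    # str: text enclosed in quotes

# type() reports the datatype of each value.
for value in (exon_count, gc_fraction, is_coding, peptide):
    print(value, type(value).__name__)
```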

2.2 Operators

  • Common Operators:
    • +: Addition
    • -: Subtraction
    • *: Multiplication
    • /: Division
    • =: Assignment
    • **: Exponent
  • Data Type Interactions:
    • Integer + Float = Float
    • Integer + Integer = Integer (except for division)
    • Division (/) always returns a Float
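  • Integer (Floor) Division:
    • The // operator performs floor division: it discards the remainder and rounds the result down. With two integers the result is an integer (7 // 2 is 3); if either operand is a float, the result is a float.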
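
The type rules above can be checked directly. A short sketch with arbitrary numbers:

```python
int_sum = 3 + 4          # int + int gives an int
mixed_sum = 3 + 1.5      # int + float gives a float
quotient = 8 / 2         # / always gives a float, even when the result is whole
floor_quotient = 7 // 2  # // drops the remainder
negative_floor = -7 // 2 # // rounds down, so the result is -4, not -3
power = 2 ** 10          # exponent

print(int_sum, mixed_sum, quotient, floor_quotient, negative_floor, power)
```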

3 Variables

  • Variables in Python: Similar to mathematical variables, they store values and are defined using an assignment operator (=).
  • Components of Variables: Variables have a name and a value.
  • Variable Usage: Variables make code more readable and reusable by storing data that can be accessed and modified later.
  • Reassignment: Variables can be assigned new values at any time, overwriting the previous value.
  • Data Type Reassignment: Python allows variables to be assigned different data types (e.g., integer to text) without requiring explicit type declaration.
  • Case Sensitivity: Variable names are case sensitive, so "gene_symbol" is different from "Gene_Symbol". Names cannot contain spaces.
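
A brief sketch of assignment, reassignment, and case sensitivity. The names and values are hypothetical.

```python
sample = "leaf_01"     # a variable has a name (sample) and a value ("leaf_01")
first_type = type(sample).__name__

sample = "leaf_02"     # reassignment overwrites the previous value
sample = 7             # the same name can later hold a different datatype
second_type = type(sample).__name__

gene_symbol = "BRCA1"
Gene_Symbol = "TP53"   # different capitalization, so this is a separate variable

print(first_type, second_type, gene_symbol, Gene_Symbol)
```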

4 Strings

  • Strings in Programming: Strings are fundamental data structures used to represent collections of characters, typically text, and are crucial for bioinformatics tasks.
  • String Creation in Python: Strings are created in Python by enclosing characters within single quotes (' '), double quotes (" "), triple single quotes (''' ''') or triple double quotes (""" """).
  • Single/Double vs. Triple Quotes: Single or double quotes create single-line strings, while triple quotes allow for multi-line strings.
  • Consistency in Quotes: A string must be closed with the same type of quote mark that opened it (single or double).

4.1 String indexing

  • String indexing allows for extracting individual characters or portions of a string using their indexes.
  • Indexes start at 0 for the leftmost character and increment sequentially.
  • Backward indexing starts at -1 for the rightmost character and decreases towards the left (-2, -3, and so on).
  • To extract a portion of a string, use the format string_name[start:end], where start is the index of the first character to include and end is the index at which the slice stops (the character at end is not included).
  • Examples:
    • word[1:3] extracts characters from index 1 to 2 (excluding 3).
    • word[3:] extracts characters from index 3 to the end.
    • word[:] extracts the entire string.
    • word[1:10] extracts characters from index 1 to 9. If the end index is larger than the string length, the slice simply stops at the end of the string.
    • word[:-2] extracts characters up to but not including the last two characters.
    • word[-2:] extracts the last two characters.
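
The slicing examples above can be tried on a concrete string. Here word is assumed to be "chromosome" (10 characters, indexes 0 to 9); the source does not say which string it used.

```python
word = "chromosome"       # assumed example string

first_char = word[0]      # forward indexing starts at 0
last_char = word[-1]      # backward indexing starts at -1

middle = word[1:3]        # indexes 1 and 2
from_three = word[3:]     # index 3 to the end
whole = word[:]           # the entire string
truncated = word[1:100]   # an oversized end index is truncated, not an error
without_tail = word[:-2]  # everything except the last two characters
tail = word[-2:]          # the last two characters

print(first_char, last_char, middle, from_three, whole, truncated, without_tail, tail)
```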

4.2 Operations on strings

  • String concatenation: Use the plus symbol (+) to join strings together.
  • Data type consistency: All elements being concatenated must be strings. Convert numbers to strings using the str() function before combining them.
  • String repetition: Use the asterisk (*) operator and an integer to repeat a string multiple times.
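
A small sketch of concatenation and repetition, using made-up values:

```python
gene_family = "BRCA"
member = 2

# Numbers must be converted with str() before they can be joined to a string;
# gene_family + member would raise a TypeError.
gene_name = gene_family + str(member)

# Repeating a string with * and an integer.
linker = "GGS" * 3

print(gene_name, linker)
```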

4.3 Methods in strings

  • String Handling Methods: The text highlights several methods for manipulating strings in Python, including:

    • count(): Counts occurrences of a substring within a string.
    • find(): Returns the index of the first occurrence of a substring within a string, or -1 if the substring is not present.
    • len(): A built-in function (not a string method) that returns the length (number of characters) of a string.
    • str.split(): Divides a string into a list of substrings based on a specified delimiter. This is particularly useful for parsing data from delimited files like CSV and TSV.
  • CSV and TSV File Formats: The text explains the concepts of Comma-Separated Values (CSV) and Tab-Separated Values (TSV) file formats, where data is organized into columns separated by commas or tabs, respectively.

  • Extracting Data from CSV Files: The text demonstrates how to use the str.split() method to extract data from a CSV file. It explains the concept of a header row (containing column names) and how to assign individual values to variables using the split method.

  • Lists in Python: The text introduces the concept of lists, which are Python data structures used to store collections of items.
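
The methods above can be sketched on the peptide string from section 2.1 and on a CSV-style row. The header and first_row contents are hypothetical; the source does not show its actual data.

```python
peptide = "MKSGSGGGSP"

glycine_count = peptide.count("G")   # number of "G" characters
first_gs = peptide.find("GS")        # index of the first "GS"
missing = peptide.find("W")          # -1 because "W" does not occur
peptide_length = len(peptide)        # built-in function, not a method

# Hypothetical CSV lines: a header row of column names and one data row.
header = "gene,chromosome,length"
first_row = "geneA,chr1,1200"

columns = header.split(",")                   # list of column names
gene, chromosome, length = first_row.split(",")  # one variable per value
length = int(length)                          # split() always returns strings

print(glycine_count, first_gs, missing, peptide_length)
print(columns, gene, chromosome, length)
```

For a TSV line the same call would use "\t" as the delimiter.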

5 Python lists and tuples

  • Lists are data structures used to store multiple values of any type, similar to arrays in other programming languages.

  • Key Features of Lists:

    • Maintain Order: Lists keep track of the order in which items are inserted.
    • Index Access: Individual elements in a list can be accessed using their index.
    • Diverse Contents: Lists can hold numbers, strings, and even other lists.
    • Mutable: Lists can be modified by adding, removing, or changing elements.
  • Example: The text mentions a string variable "peptide" and uses string methods like count(), find(), and len().

5.1 Accessing values in list

  • Accessing List Elements: List items, similar to strings, have indexes starting from 0 for forward access and -1 for backward access. You can use square brackets ([]) and the index to retrieve individual elements within a list.
  • Slicing Lists: Slicing allows you to access a portion of a list. The syntax is similar to string slicing ([start:stop:step]). Omitting the start index begins the slice at the beginning, omitting the end index extends the slice to the end, and omitting both creates a copy of the entire list.
  • List Concatenation and Repetition: The + operator combines two lists, and the * operator repeats a list a specified number of times.
  • String Splitting (split()): The split() method in Python allows you to break a string into a list of substrings based on a delimiter.
    • The example demonstrates splitting a string (first_row) using a comma (,) as the delimiter, storing the resulting substrings into individual variables, and then printing them.
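
Indexing, slicing, concatenation, and repetition on a list can be sketched as follows. The list contents are hypothetical.

```python
plants = ["Moss", "Fern", "Conifer", "Orchid"]

first = plants[0]          # forward index
last = plants[-1]          # backward index
middle = plants[1:3]       # items at indexes 1 and 2
every_other = plants[::2]  # step of 2
copy_of_plants = plants[:] # omitting start and stop copies the whole list

combined = plants + ["Alga"]  # + builds a new, longer list
placeholders = ["NA"] * 3     # * repeats the list

print(first, last, middle, every_other, combined, placeholders)
```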

5.2 Methods with lists

List Methods

  • count(item): Returns the number of times an item appears in the list.

    • Example:
      my_list = [1, 2, 2, 3, 4, 4, 4]
      occurrences = my_list.count(4) # occurrences will be 3
  • index(item): Returns the index of the first occurrence of item in the list. If the item is not found, a ValueError is raised.

    • Example:
      my_list = ["apple", "banana", "cherry"]
      index_of_banana = my_list.index("banana") # index_of_banana will be 1
  • append(item): Adds item to the end of the list.

    • Example:
      my_list = [1, 2, 3]
      my_list.append(4) # my_list becomes [1, 2, 3, 4]
  • remove(item): Removes the first occurrence of item from the list. If the item is not found, a ValueError is raised.

    • Example:
      my_list = [1, 2, 3, 2]
      my_list.remove(2) # my_list becomes [1, 3, 2]
  • pop(index=None): Removes and returns the item at the specified index. If no index is provided, it removes and returns the last item.

    • Example:
      my_list = [1, 2, 3]
      removed_item = my_list.pop(1) # removed_item will be 2, my_list becomes [1, 3]
  • min(list): A built-in function that returns the smallest item in the list. The items must be comparable with each other (for example, all numbers or all strings).

    • Example:
      my_list = [10, 5, 20, 1]
      smallest_number = min(my_list) # smallest_number will be 1
  • max(list): A built-in function that returns the largest item in the list. The items must be comparable with each other (for example, all numbers or all strings).

    • Example:
      my_list = [10, 5, 20, 1]
      largest_number = max(my_list) # largest_number will be 20
  • sum(list): A built-in function that returns the sum of all values in the list. The list must contain only numbers.

    • Example:
      my_list = [1, 2, 3, 4]
      total_sum = sum(my_list) # total_sum will be 10
  • len(list): A built-in function that returns the number of items in the list.

    • Example:
      my_list = ["apple", "banana", "cherry"]
      list_length = len(my_list) # list_length will be 3
  • sort(): Sorts the list in ascending order (for numerical lists) or alphabetically (for lists of strings).

    • Example:
      my_list = [3, 1, 2]
      my_list.sort() # my_list becomes [1, 2, 3]

    You can sort in descending order by using the reverse=True parameter:

    my_list = [3, 1, 2]
    my_list.sort(reverse=True) # my_list becomes [3, 2, 1]

Key Points

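  • append(), remove(), pop(), and sort() modify the original list in place, meaning they change the list directly. count() and index(), and the built-in functions min(), max(), sum(), and len(), only read the list.
  • Use caution with methods like remove() and pop(), as they permanently change the list’s contents.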

5.3 Tuples

  • Tuples are one of Python’s four built-in data types for storing collections.
  • Tuples are ordered sequences and immutable, meaning their contents cannot be changed after creation.
  • Tuples are otherwise similar to lists: they support indexing and can hold elements of any data type. The key difference is that lists are mutable and tuples are not.
  • Slicing with index ranges works on tuples as it does on lists and returns a new tuple.
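
A minimal sketch of tuple behavior, with a hypothetical three-letter codon as the content:

```python
codon = ("A", "T", "G")   # tuples are written with parentheses

first_base = codon[0]     # indexing works as with lists
last_two = codon[1:]      # slicing returns a new tuple

# Tuples are immutable: item assignment raises a TypeError.
try:
    codon[0] = "C"
    changed = True
except TypeError:
    changed = False

print(first_base, last_two, changed)
```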

6 Dictionary in Python

  • Dictionaries in Python: Python dictionaries are similar to hash tables or hashmaps in other languages. They store key-value pairs.
  • Creating Dictionaries: Dictionaries are created using curly braces {}. An empty dictionary is declared as {}.
  • Dictionary Properties:
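    • Ordering: Dictionaries are not accessed by position. Since Python 3.7 they preserve the order in which keys were inserted; in older versions the order was not guaranteed.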
    • Mutable: Dictionaries can be modified by adding, removing, or changing key-value pairs.
    • Key-Based Access: Values are accessed using their corresponding keys, not numerical indexes.
    • Unique Keys: Each key in a dictionary must be unique.
    • Immutable Keys: Keys must be hashable, which in practice means immutable data types like strings, integers, or tuples. A list cannot be used as a key.
    • Values Can Be Varied: Values can be any data type, including numbers, strings, lists, or even other dictionaries (nested dictionaries).
  • Example: The provided code creates a dictionary named crop with key-value pairs related to wheat. It demonstrates printing the dictionary and its data type.
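
The source's crop dictionary is not reproduced in these notes, so the keys and values below are assumptions chosen only to illustrate the properties listed above.

```python
# Hypothetical key-value pairs describing wheat.
crop = {"name": "wheat", "genus": "Triticum", "season": "winter"}

print(crop)
print(type(crop))            # <class 'dict'>

genus = crop["genus"]        # values are accessed by key, not by position
crop["height_cm"] = 90       # adding a new key-value pair
crop["season"] = "spring"    # changing an existing value
del crop["height_cm"]        # removing a pair

# get() returns None (or a chosen default) instead of raising KeyError.
missing = crop.get("yield")

print(genus, crop, missing)
```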

7 Conditional statements

  • Conditional statements let a program make decisions based on conditions.
  • Computers operate on a binary system of True or False, similar to a light switch being On or Off.
  • Booleans represent these True/False values in Python.
  • Conditions are defined by comparisons, using operators like:
    • Equal: a == b
    • Not Equal: a != b
    • Less than: a < b
    • Less than or equal to: a <= b
    • Greater than: a > b
    • Greater than or equal to: a >= b
  • These comparisons always result in a Boolean value (True or False).

7.1 Logical operators

  • Logical operators (and, or, not) in Python are used to combine and evaluate multiple conditions.
  • The and operator returns True only if both conditions are True.
  • The or operator returns True if at least one of the conditions is True.
  • The not operator inverts the truth value of a condition (True becomes False, False becomes True).
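
Comparisons and logical operators can be combined as in this sketch. The expression values are hypothetical.

```python
control = 2.0
treated = 5.5

is_higher = treated > control   # comparison yields a Boolean
is_equal = treated == control

both = (treated > control) and (treated > 5)    # True only if both are True
either = (treated < control) or (control < 1)   # True if at least one is True
negated = not is_equal                          # inverts the truth value

print(is_higher, is_equal, both, either, negated)
```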

7.2 If and else statements

  • If and else statements: These statements are used to execute code conditionally based on whether a certain condition is true or false.
  • If statement: Executes a block of code if the specified condition is true.
  • Else statement: Executes a block of code if the condition in the preceding ‘if’ statement is false.
  • Indentation in Python: Indentation is crucial in Python to define blocks of code within ‘if’ and ‘else’ statements.
  • Example of gene expression analysis: The text illustrates how ‘if’ and ‘else’ statements can be used to compare gene expression levels in controlled and treated environments.
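
The gene expression comparison might look like the following sketch. The variable names, values, and messages are assumptions; the source's own code is not shown in these notes.

```python
control_expression = 2.0   # hypothetical expression level, control sample
treated_expression = 5.5   # hypothetical expression level, treated sample

# The indented block under "if" runs when the condition is True;
# otherwise the indented block under "else" runs.
if treated_expression > control_expression:
    result = "expression is higher in the treated sample"
else:
    result = "expression is not higher in the treated sample"

print(result)
```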

8 Loops in Python

  • Loops repeat a block of code, either for as long as a condition remains true or once for each item in a sequence. Python has two main types:
    • While Loop: This loop repeats a block of code as long as a condition remains true.
    • For Loop: This loop iterates through a sequence (like a list or range of numbers) and executes a block of code for each element.

8.1 While loop

  • While Loops in Python: A while loop repeatedly executes a set of statements until a specific condition becomes false. The loop’s syntax includes a stop condition that determines when the loop should terminate.
  • Example of a while Loop: The text provides an example of a loop printing numbers from 0 to 5, demonstrating how the loop iterates based on the value of a variable (a) and a stop condition (a < 6).
  • Infinite Loops: If a while loop lacks a stop condition or its condition never becomes false, the loop becomes infinite, potentially requiring the program to be restarted.
  • Importance of Stop Conditions: The text emphasizes the importance of a stop condition in while loops. The example illustrates how the stop condition is used to control the loop’s termination, preventing it from running indefinitely.
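
The example described above (printing 0 to 5 with the stop condition a < 6) can be reconstructed roughly as follows. The printed list is an addition used here only to record what was printed.

```python
a = 0
printed = []

while a < 6:          # stop condition: the loop ends once a reaches 6
    print(a)
    printed.append(a)
    a += 1            # without this update the condition never becomes
                      # False and the loop would run forever
```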

8.2 “For” loop

  • "For" loops are definite loops: They iterate over a set of items, such as words in a text, items in a list, or keys and values in a dictionary.
  • Purpose of "for" loops: Used for tasks that require iterating through a sequence of data.
  • Syntax of "for" loops: Similar to "while" loops, they have a "for" statement ending in a colon, followed by an indented block of code.
  • Iteration: The "for" loop processes each item in the sequence, executing the code block for each item.
  • Variable: In the "for" statement, a variable (e.g., "plant") represents the current item being processed.
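
A minimal sketch of a for loop over a list and over a dictionary. The loop variable plant follows the source; the list and dictionary contents are hypothetical.

```python
plants = ["Moss", "Fern", "Conifer"]
name_lengths = []

for plant in plants:               # plant holds the current item
    print(plant)
    name_lengths.append(len(plant))

# Looping over the keys and values of a dictionary with items().
heights_cm = {"Moss": 2, "Fern": 60}
tall_plants = []
for name, height in heights_cm.items():
    if height > 10:
        tall_plants.append(name)

print(name_lengths, tall_plants)
```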

8.3 Breaking a loop

  • Purpose of "break": The "break" command allows you to prematurely exit a loop, whether it’s a while loop or a for loop.
  • Behavior of "break": Once "break" is encountered within a loop, the loop terminates immediately. The program then continues executing the code that comes after the loop.
  • Example with "break" in a "for" loop: The text highlights an example where a loop iterates over a list of plant names. When the variable "temp" (presumably within the loop’s code) reaches the value "Thallophyte", the "break" statement is executed, preventing the printing of "Conifer" from the list.
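
A possible reconstruction of the example, assuming temp is the loop variable and that the check happens before printing. The list contents other than "Thallophyte" and "Conifer" are made up.

```python
plants = ["Bryophyte", "Pteridophyte", "Thallophyte", "Conifer"]
printed = []

for temp in plants:
    if temp == "Thallophyte":
        break                 # leave the loop immediately
    print(temp)
    printed.append(temp)

# Execution continues here after the break; "Conifer" was never reached.
print("Loop finished")
```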

9 File handling in Python

  • File Handling in Python:
    • Python uses the open() function to interact with files.
    • Files are stored on secondary memory, like hard drives, and persist even when the computer is turned off.
    • Files are essential for storing and sharing data, especially in research.
  • Accessing Files:
    • f = open('myfile.txt'): This opens a file named ‘myfile.txt’ in the same directory as the Python script.
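    • f = open(r'C:\Python33\Scripts\myfile.txt'): This opens a file with the specified full path. The r prefix makes it a raw string, so the backslashes in the Windows path are not treated as escape characters.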
  • Importance in Research:
    • Biological data is often stored in files with specific formats (PDB, networks, sequence files).
    • The ability to read and manipulate these files is crucial for scientific work.

9.1 Specify file mode

  • File Modes in Python: The text describes the file modes used for file operations in Python, each with a specific purpose:
    • r (Read): Opens a file for reading only. This is the default.
    • w (Write): Opens a file for writing only, overwriting existing content.
    • a (Append): Opens a file for writing only, adding content to the end.
    • r+ (Read and Write): Opens an existing file for both reading and writing.
    • x (Create): Creates a new file for writing and fails if the file already exists.
    • t (Text): Reads and writes strings. This is the default mode.
    • b (Binary): Reads and writes bytes objects. Used for non-text files like images.
  • FASTA File Format: The text uses the FASTA file format as an example. FASTA files are used to store genetic sequences and their identifiers.
    • The identifier line starts with ">".
    • The sequence follows the identifier line.
  • Reading and Writing Files:
    • The read() method reads the entire content of a file.
    • The readlines() method returns a list of all lines in a file.
    • The write() method writes content to a file.
    • It’s important to close files after use using the close() method.
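
The steps above can be sketched by writing a small FASTA file and reading it back. The file name, identifier, and temporary directory are assumptions made so the example is self-contained. The with statement is used instead of an explicit close() call because it closes the file automatically when the block ends.

```python
import os
import tempfile

# Hypothetical FASTA record: an identifier line starting with ">" and a sequence.
fasta_text = ">seq1 hypothetical peptide\nMKSGSGGGSP\n"
path = os.path.join(tempfile.mkdtemp(), "example.fasta")

# "w" mode creates the file (or overwrites it) and write() stores the text.
with open(path, "w") as handle:
    handle.write(fasta_text)

# "r" mode opens the file for reading; readlines() returns a list of lines.
with open(path, "r") as handle:
    lines = handle.readlines()

identifier = lines[0][1:].strip()                        # drop the leading ">"
sequence = "".join(line.strip() for line in lines[1:])   # join sequence lines

print(identifier, sequence, handle.closed)
```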

10 Importing functions

  • Python Modules:
    • Python modules are files with the .py extension containing Python code.
    • Modules help organize code into reusable components, making programs easier to maintain and understand.
    • You use the import keyword to access modules within your Python scripts.
  • Python Packages:
    • Packages are collections of modules that provide functionality for specific tasks.
    • The Python Package Index (PyPI) hosted over 227,000 packages at the time of writing, covering diverse areas like web development, data science, and machine learning.
    • You can install packages using the pip installer (or conda within Anaconda distributions).
  • T-Tests:
    • T-tests are statistical tools used to compare the means of two groups.
    • They help determine if observed differences between groups are statistically significant or likely due to random chance.
    • T-tests are used when data has a normal distribution but unknown variances.
  • Example: Drug Trial:
    • A pharmaceutical company testing a new drug uses a t-test to analyze the effectiveness of the treatment compared to a placebo.
    • The t-test helps determine if the observed difference in life expectancy between the treatment and control groups is statistically significant or likely a random variation.
  • Example: Plant Growth Experiment:
    • Researchers investigate the effect of a nutrient on plant growth by comparing a treatment group to a control group.
    • A t-test helps determine if the difference in plant height between the groups is statistically significant or a result of natural variation.
    • A p-value is generated. By convention, if it’s less than 0.05, the difference is considered statistically significant.
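
As a sketch of both ideas in this section, the block below imports functions from standard-library modules and computes a t statistic by hand for a plant growth experiment. The heights are invented, and the formula used is Welch's version of the t-test (unequal variances), which is an assumption about which variant the source intends. Only the statistic is computed here; in practice scipy.stats.ttest_ind(treated, control, equal_var=False) returns both the statistic and the p-value.

```python
import math                              # import a whole module
from statistics import mean, variance   # import specific functions

# Hypothetical plant heights in cm.
control = [10.1, 9.8, 10.4, 10.0, 9.7]
treated = [11.2, 11.0, 11.6, 10.9, 11.3]

def welch_t(a, b):
    """Return Welch's t statistic for two independent samples."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t(treated, control)
print(round(t_stat, 2))
```

A large absolute t statistic means the difference between the group means is large relative to the variation within the groups.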

10.1 Running a simple linear regression in Python

  • The scipy.stats module offers a variety of statistical analyses.
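  • The pearsonr function within this module computes the Pearson correlation coefficient and its p-value. To fit the regression line itself, the same module provides linregress.
  • Pearson correlation measures the strength of the linear relationship between two sets of continuous numerical data.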
  • It can be used to determine if there’s a correlation between variables, such as frog size and call length.
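
To show what the coefficient measures, here is a hand-written sketch using only the standard library. The frog measurements are invented. With SciPy installed, scipy.stats.pearsonr(frog_size, call_length) should give the same coefficient together with a p-value.

```python
import math

# Hypothetical measurements: body size (cm) and call length (s).
frog_size = [2.1, 2.5, 3.0, 3.4, 4.0]
call_length = [0.8, 1.0, 1.1, 1.4, 1.6]

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)

r = pearson_r(frog_size, call_length)
print(round(r, 3))
```

Values of r near 1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.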

11 Data handling

  • Pandas is a powerful Python library for data manipulation and analysis. It’s free and easy to use, making it popular among developers for data science projects.
  • Pandas DataFrames offer a convenient way to work with data. You can add/delete columns, slice, index, and handle missing values effortlessly.
  • Pandas simplifies data cleaning, modification, and analysis. It lets you analyze data from CSV files, perform statistical analysis, identify relationships between rows, and detect anomalies in data.
  • You can visualize data with Matplotlib. Create various plots, like scatterplots, lines, bars, bubbles, and histograms, to gain insights from your data.
  • The importance of learning programming is emphasized. It can lead to better career opportunities and help tackle modern challenges.