Data handling using Python
1 Introduction
- Data handling is the process of collecting, storing, and analyzing data, and it is a key skill for biologists.
- Statistical analysis is often used to determine statistical significance and the sample size needed for accurate conclusions.
- Data management ensures that raw and processed data can be efficiently organized to investigate specific questions related to the experiment.
- Python is a popular programming language in bioinformatics research due to its versatility, ease of use, and powerful libraries.
- Python libraries like Pandas and NumPy facilitate data manipulation and statistical operations.
- Python 2 was discontinued in 2020, with Python 3 introducing significant changes, including a focus on consistency and ease of use for beginners.
- Python is a community-driven language with a relatively gentle learning curve, making it an excellent choice for beginners.
2.1 Datatypes
- Data Structures: Specialized methods for storing and arranging information in programming, offering various ways to handle data depending on the type.
- Common Python Datatypes:
- int: Integers or whole numbers (e.g., 1, 3, 4, 0)
- float: Decimal numbers or floating-point numbers (e.g., 1.0, 3.14, 2.33)
- bool: Boolean values, representing True or False, used for creating conditions.
- str: Strings, collections of characters like text, frequently used in biology to represent sequences and names.
- String Importance: Strings are crucial for biologists due to the prevalence of DNA, RNA, protein sequences, and names being text-based.
- String Representation: String data is always enclosed in quotes; for instance, "MKSGSGGGSP" is a Python string representing a peptide.
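As a minimal sketch of the four datatypes listed above, the snippet below stores one value of each type and prints its type name. The variable names and numeric values are illustrative assumptions; only the peptide string comes from the text.

```python
# Illustrative values; only the peptide string appears in the text above.
gene_count = 4            # int: a whole number
gc_content = 0.52         # float: a decimal number
is_coding = True          # bool: True or False
peptide = "MKSGSGGGSP"    # str: text enclosed in quotes

values = (gene_count, gc_content, is_coding, peptide)
for value in values:
    # type(value).__name__ gives the short type name, e.g. "int"
    print(value, type(value).__name__)

type_names = [type(value).__name__ for value in values]
```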
2.2 Operators
- Common Operators:
- `+`: Addition
- `-`: Subtraction
- `*`: Multiplication
- `/`: Division
- `=`: Assignment
- `**`: Exponent
- Data Type Interactions:
- Integer + Float = Float
- Integer + Integer = Integer (except for division)
- Division (`/`) always returns a Float
- Integral Division:
- The `//` operator performs floor division: it discards the remainder and returns an integer when both operands are integers.
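The type rules above can be checked directly. This is a small sketch with arbitrary numbers; the variable names are chosen only for illustration.

```python
int_sum = 7 + 2        # int + int gives an int
mixed_sum = 7 + 2.0    # int + float gives a float
true_div = 7 / 2       # "/" always gives a float
even_div = 8 / 2       # still a float, even though 8 divides evenly by 2
floor_div = 7 // 2     # "//" on two ints gives an int, remainder dropped
power = 2 ** 3         # "**" is the exponent operator

for result in (int_sum, mixed_sum, true_div, even_div, floor_div, power):
    print(result, type(result).__name__)
```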
3 Variables
- Variables in Python: Similar to mathematical variables, they store values and are defined using an assignment operator (`=`).
- Components of Variables: Variables have a name and a value.
- Variable Usage: Variables make code more readable and reusable by storing data that can be accessed and modified later.
- Reassignment: Variables can be assigned new values at any time, overwriting the previous value.
- Data Type Reassignment: Python allows variables to be assigned different data types (e.g., integer to text) without requiring explicit type declaration.
- Case Sensitivity: Variable names are case sensitive, so `gene_symbol` is different from `Gene_Symbol`.
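A short sketch of assignment, reassignment, and case sensitivity follows. The gene names and expression values are hypothetical and serve only as placeholders.

```python
gene_symbol = "TP53"     # hypothetical value: a name bound to a string
expression = 12          # an int for now
expression = 12.5        # reassignment overwrites the previous value
expression = "high"      # reassignment may also change the data type

Gene_Symbol = "BRCA1"    # a separate variable: names are case sensitive

print(gene_symbol, Gene_Symbol, expression)
```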
4 Strings
- Strings in Programming: Strings are fundamental data structures used to represent collections of characters, typically text, and are crucial for bioinformatics tasks.
- String Creation in Python: Strings are created in Python by enclosing characters within single quotes (' '), double quotes (" "), triple single quotes (''' ''') or triple double quotes (""" """).
- Single/Double vs. Triple Quotes: Single or double quotes create single-line strings, while triple quotes allow for multi-line strings.
- Consistency in Quotes: A string must be closed with the same type of quote mark (single or double) that opened it.
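The quoting styles can be compared in a few lines. The sequence and the record name `seq1` are made-up examples, not taken from the text.

```python
single = 'ATGC'      # single quotes
double = "ATGC"      # double quotes create the same kind of string

# Triple quotes let one string span several lines.
fasta_record = """>seq1
ATGCGT
TTAGCA"""

same_text = single == double                   # quote style does not change the value
line_count = len(fasta_record.splitlines())    # number of lines in the triple-quoted string
print(same_text, line_count)
```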
4.1 String indexing
- String indexing allows for extracting individual characters or portions of a string using their indexes.
- Indexes start at 0 for the leftmost character and increment sequentially.
- Backward indexing starts at -1 for the rightmost character and decreases towards the left (-2, -3, and so on).
- To extract a portion of a string, use the format `string_name[start:end]`, where `start` is the starting index and `end` is the index at which the slice stops (the character at `end` is not included).
- Examples:
  - `word[1:3]` extracts characters from index 1 to 2 (excluding 3).
  - `word[3:]` extracts characters from index 3 to the end.
  - `word[:]` extracts the entire string.
  - `word[1:10]` extracts characters from index 1 to 9 (truncated if the index is too large).
  - `word[:-2]` extracts characters up to but not including the last two characters.
  - `word[-2:]` extracts the last two characters.
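The indexing and slicing rules above can be tried on the ten-character peptide string used earlier in these notes. This is only a sketch; the variable names are illustrative.

```python
peptide = "MKSGSGGGSP"         # indexes 0..9 from the left, -1..-10 from the right

first = peptide[0]             # leftmost character
last = peptide[-1]             # rightmost character
middle = peptide[1:3]          # indexes 1 and 2; index 3 is excluded
tail = peptide[3:]             # index 3 to the end
all_but_last_two = peptide[:-2]
last_two = peptide[-2:]
clipped = peptide[1:100]       # an end index past the string is simply truncated

print(first, last, middle, tail, all_but_last_two, last_two, clipped)
```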
4.2 Operations on strings
- String concatenation: Use the plus symbol (+) to join strings together.
- Data type consistency: All elements being concatenated must be strings. Convert numbers to strings using the `str()` function before combining them.
- String repetition: Use the asterisk (*) operator and an integer to repeat a string multiple times.
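A brief sketch of concatenation and repetition, using hypothetical labels and sequences:

```python
gene = "BRCA" + "1"               # concatenation joins two strings
sample_number = 3
label = "Sample " + str(sample_number)   # numbers must be converted with str() first
poly_a = "A" * 5                  # repetition: the string repeated five times

# Concatenating a string and an int directly raises a TypeError.
try:
    broken = "Sample " + sample_number
except TypeError:
    broken = None

print(gene, label, poly_a)
```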
4.3 Methods in strings
- String Handling Methods: The text highlights several tools for manipulating strings in Python, including:
  - `count()`: Counts occurrences of a substring within a string.
  - `find()`: Locates the first occurrence of a substring within a string.
  - `len()`: A built-in function (not a string method) that returns the length (number of characters) of a string.
  - `str.split()`: Divides a string into a list of substrings based on a specified delimiter. This is particularly useful for parsing data from delimited files like CSV and TSV.
- CSV and TSV File Formats: The text explains the concepts of Comma-Separated Values (CSV) and Tab-Separated Values (TSV) file formats, where data is organized into columns separated by commas or tabs, respectively.
- Extracting Data from CSV Files: The text demonstrates how to use the `str.split()` method to extract data from a CSV file. It explains the concept of a header row (containing column names) and how to assign individual values to variables using the split method.
- Lists in Python: The text introduces the concept of lists, which are Python data structures used to store collections of items.
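The string tools above might be used as in the following sketch. The header and data row are hypothetical CSV lines written for this example (they are not the ones used in the text), and the variable names are assumptions.

```python
peptide = "MKSGSGGGSP"

glycine_count = peptide.count("G")    # how many times "G" occurs
first_serine = peptide.find("S")      # index of the first "S"
missing = peptide.find("W")           # find() returns -1 when nothing matches
length = len(peptide)                 # built-in function, not a method

# Hypothetical CSV content: a header row followed by one data row.
header = "gene,organism,length"
first_row = "BRCA1,Homo sapiens,1863"

columns = header.split(",")                       # list of column names
gene, organism, length_aa = first_row.split(",")  # one variable per column

print(columns)
print(gene, organism, length_aa)   # every value is still a string after split()
```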
5 Python lists and tuples
- Lists are data structures used to store multiple values of any type, similar to arrays in other programming languages.
- Key Features of Lists:
- Maintain Order: Lists keep track of the order in which items are inserted.
- Index Access: Individual elements in a list can be accessed using their index.
- Diverse Contents: Lists can hold numbers, strings, and even other lists.
- Mutable: Lists can be modified by adding, removing, or changing elements.
- Example: The text mentions a string variable "peptide" and applies `count()`, `find()`, and `len()` to it.
5.1 Accessing values in list
- Accessing List Elements: List items, similar to strings, have indexes starting from 0 for forward access and -1 for backward access. You can use square brackets (`[]`) and the index to retrieve individual elements within a list.
- Slicing Lists: Slicing allows you to access a portion of a list. The syntax is similar to string slicing (`[start:stop:step]`). Omitting the start index begins the slice at the beginning, omitting the end index extends the slice to the end, and omitting both creates a copy of the entire list.
- List Concatenation and Repetition: The `+` operator combines two lists, and the `*` operator repeats a list a specified number of times.
- String Splitting (`split()`): The `split()` method in Python allows you to break a string into a list of substrings based on a delimiter.
  - The example demonstrates splitting a string (`first_row`) using a comma (`,`) as the delimiter, storing the resulting substrings into individual variables, and then printing them.
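List indexing, slicing, concatenation, and repetition can be sketched as follows. The list of plant names is a hypothetical example.

```python
plants = ["Moss", "Fern", "Conifer", "Orchid"]   # hypothetical list

first = plants[0]            # forward indexing starts at 0
last = plants[-1]            # backward indexing starts at -1
middle = plants[1:3]         # items at indexes 1 and 2
every_other = plants[::2]    # step of 2: indexes 0 and 2
copy_of_plants = plants[:]   # omitting start and stop copies the whole list

combined = plants + ["Grass"]    # + builds a new, longer list
repeated = ["Moss"] * 3          # * repeats the list contents

print(first, last, middle, every_other, combined, repeated)
```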
5.2 Methods with lists
List Methods
- `count(item)`: Returns the number of times an `item` appears in the list.
  - Example: `my_list = [1, 2, 2, 3, 4, 4, 4]; occurrences = my_list.count(4)  # occurrences will be 3`
- `index(item)`: Returns the index of the first occurrence of `item` in the list. If the item is not found, a `ValueError` is raised.
  - Example: `my_list = ["apple", "banana", "cherry"]; index_of_banana = my_list.index("banana")  # index_of_banana will be 1`
- `append(item)`: Adds `item` to the end of the list.
  - Example: `my_list = [1, 2, 3]; my_list.append(4)  # my_list becomes [1, 2, 3, 4]`
- `remove(item)`: Removes the first occurrence of `item` from the list. If the item is not found, a `ValueError` is raised.
  - Example: `my_list = [1, 2, 3, 2]; my_list.remove(2)  # my_list becomes [1, 3, 2]`
- `pop(index)`: Removes and returns the item at the specified `index`. If no `index` is provided, it removes and returns the last item.
  - Example: `my_list = [1, 2, 3]; removed_item = my_list.pop(1)  # removed_item will be 2, my_list becomes [1, 3]`
- `min(list)`: Built-in function that returns the minimum value in the list. The items must be comparable with one another (for example, all numbers).
  - Example: `my_list = [10, 5, 20, 1]; smallest_number = min(my_list)  # smallest_number will be 1`
- `max(list)`: Built-in function that returns the maximum value in the list. The items must be comparable with one another (for example, all numbers).
  - Example: `my_list = [10, 5, 20, 1]; largest_number = max(my_list)  # largest_number will be 20`
- `sum(list)`: Built-in function that returns the sum of all values in the list. The list must contain only numbers.
  - Example: `my_list = [1, 2, 3, 4]; total_sum = sum(my_list)  # total_sum will be 10`
- `len(list)`: Built-in function that returns the number of items in the list.
  - Example: `my_list = ["apple", "banana", "cherry"]; list_length = len(my_list)  # list_length will be 3`
- `sort()`: Sorts the list in ascending order (for numerical lists) or alphabetically (for lists of strings).
  - Example: `my_list = [3, 1, 2]; my_list.sort()  # my_list becomes [1, 2, 3]`
  - You can sort in descending order by using the `reverse=True` parameter: `my_list = [3, 1, 2]; my_list.sort(reverse=True)  # my_list becomes [3, 2, 1]`
Key Points
- Methods such as `append()`, `remove()`, `pop()`, and `sort()` modify the original list in place, meaning they change the list directly; `count()` and `index()` only read it.
- Use caution with methods like `remove()` and `pop()` to avoid unexpected results, as they modify the list's contents.
5.3 Tuples
- Tuples are one of Python’s four built-in data types for storing collections.
- Tuples are sequential and immutable, meaning their contents cannot be changed after creation.
- Tuples are similar to lists in many ways, but tuples are immutable while lists are mutable.
- Tuples can hold elements of any data type, just like lists.
- Slicing works on tuples as it does on lists: index ranges extract specific portions.
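The difference between mutable lists and immutable tuples can be seen in a small sketch. The codon values are illustrative.

```python
codon_list = ["A", "T", "G"]     # list: square brackets, mutable
codon_tuple = ("A", "T", "G")    # tuple: parentheses, immutable

codon_list[0] = "C"              # allowed: lists can be changed in place

# Assigning to a tuple element raises a TypeError.
try:
    codon_tuple[0] = "C"
    assignment_failed = False
except TypeError:
    assignment_failed = True

first_two = codon_tuple[:2]      # indexing and slicing still work on tuples
print(codon_list, codon_tuple, first_two, assignment_failed)
```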
6 Dictionary in Python
- Dictionaries in Python: Python dictionaries are similar to hash tables or hashmaps in other languages. They store key-value pairs.
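- Creating Dictionaries: Dictionaries are created using curly braces `{}`. An empty dictionary is declared as `{}`.
- Dictionary Properties:
- Not Indexed by Position: Key-value pairs have no numerical positions. Since Python 3.7, dictionaries do preserve the order in which keys were inserted; in older versions the order was arbitrary.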
- Mutable: Dictionaries can be modified by adding, removing, or changing key-value pairs.
- Key-Based Access: Values are accessed using their corresponding keys, not numerical indexes.
- Unique Keys: Each key in a dictionary must be unique.
- Immutable Keys: Keys should be of immutable data types like strings, integers, or tuples.
- Values Can Be Varied: Values can be any data type, including numbers, strings, lists, or even other dictionaries (nested dictionaries).
- Example: The provided code creates a dictionary named `crop` with key-value pairs related to wheat. It demonstrates printing the dictionary and its data type.
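A dictionary like the `crop` example described above might look like the sketch below. The original code is not reproduced in these notes, so the keys and values here are assumptions chosen for illustration.

```python
# Hypothetical key-value pairs; the original example's contents are not shown.
crop = {
    "name": "wheat",
    "genus": "Triticum",
    "chromosomes": 42,
}

print(crop)          # the whole dictionary
print(type(crop))    # its data type

genus = crop["genus"]            # values are looked up by key, not by position
crop["season"] = "winter"        # adding a new key-value pair
crop["name"] = "bread wheat"     # changing the value of an existing key
key_count = len(crop)
```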
7 Conditional statements
- Conditional statements let a program make decisions based on conditions.
- Computers operate on a binary system of True or False, similar to a light switch being On or Off.
- Booleans represent these True/False values in Python.
- Conditions are defined by comparisons, using operators like:
- Equal: a == b
- Not Equal: a != b
- Less than: a < b
- Less than or equal to: a <= b
- Greater than: a > b
- Greater than or equal to: a >= b
- These comparisons always result in a Boolean value (`True` or `False`).
7.1 Logical operators
- Logical operators (`and`, `or`, `not`) in Python are used to combine and evaluate multiple conditions.
- The `and` operator returns True only if both conditions are True.
- The `or` operator returns True if at least one of the conditions is True.
- The `not` operator inverts the truth value of a condition (True becomes False, False becomes True).
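Comparisons and logical operators can be combined as in this sketch, which uses two arbitrary numbers.

```python
a, b = 3, 5                        # arbitrary illustrative values

is_equal = a == b                  # False: 3 is not equal to 5
is_not_equal = a != b              # True
is_less = a < b                    # True

both = (a < b) and (b < 10)        # True: both conditions hold
either = (a > b) or (b == 5)       # True: the second condition holds
negated = not (a == b)             # True: inverts False

print(is_equal, is_not_equal, is_less, both, either, negated)
```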
7.2 If and else statements
- If and else statements: These statements are used to execute code conditionally based on whether a certain condition is true or false.
- If statement: Executes a block of code if the specified condition is true.
- Else statement: Executes a block of code if the condition in the preceding ‘if’ statement is false.
- Indentation in Python: Indentation is crucial in Python to define blocks of code within ‘if’ and ‘else’ statements.
- Example of gene expression analysis: The text illustrates how ‘if’ and ‘else’ statements can be used to compare gene expression levels in controlled and treated environments.
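A comparison of gene expression levels along the lines described above could be sketched like this. The expression values, variable names, and result labels are hypothetical, since the original example is not reproduced here.

```python
# Hypothetical expression levels for one gene under two conditions.
control_expression = 10.2
treated_expression = 15.8

# The indented lines form the block that belongs to each branch.
if treated_expression > control_expression:
    result = "higher in treated sample"
else:
    result = "not higher in treated sample"

print(result)
```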
8 Loops in Python
- Loops in Python are used to repeat a block of code. There are two main types of loops: while loops and for loops.
- While Loop: This loop repeats a block of code as long as a condition remains true.
- For Loop: This loop iterates through a sequence (like a list or range of numbers) and executes a block of code for each element.
8.1 While loop
- While Loops in Python: A `while` loop repeatedly executes a set of statements until a specific condition becomes false. The loop's syntax includes a stop condition that determines when the loop should terminate.
- Example of a `while` Loop: The text provides an example of a loop printing numbers from 0 to 5, demonstrating how the loop iterates based on the value of a variable (`a`) and a stop condition (`a < 6`).
- Infinite Loops: If a `while` loop lacks a stop condition or its condition never becomes false, the loop becomes infinite, potentially requiring the program to be restarted.
- Importance of Stop Conditions: The text emphasizes the importance of a stop condition in `while` loops. The example illustrates how the stop condition is used to control the loop's termination, preventing it from running indefinitely.
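The loop described above, printing the numbers 0 to 5 with the stop condition `a < 6`, can be reconstructed roughly as follows. The `printed` list is an addition made here only to record what the loop produced.

```python
a = 0
printed = []           # added for illustration: records each printed value

while a < 6:           # stop condition: the loop ends once a reaches 6
    print(a)
    printed.append(a)
    a += 1             # without this update the condition would never become false
```

Leaving out the `a += 1` line would keep `a < 6` true forever, which is the infinite-loop case mentioned above.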
8.2 “For” loop
- `for` loops are definite loops: They iterate over a set of items, such as words in a text, items in a list, or keys and values in a dictionary.
- Purpose of `for` loops: Used for tasks that require iterating through a sequence of data.
- Syntax of `for` loops: Similar to `while` loops, they have a `for` statement followed by a block of code.
- Iteration: The `for` loop processes each item in the sequence, executing the code block for each item.
- Variable: In the `for` statement, a variable (e.g., `plant`) represents the current item being processed.
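A sketch of a `for` loop over a list and over a dictionary follows. The loop variable `plant` is taken from the text; the list contents and the `name_lengths` dictionary are assumptions.

```python
plants = ["Moss", "Fern", "Conifer"]   # hypothetical list of plant names

name_lengths = {}
for plant in plants:                   # plant holds the current item on each pass
    name_lengths[plant] = len(plant)

# Iterating over the keys and values of a dictionary.
for name, length in name_lengths.items():
    print(name, length)
```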
8.3 Breaking a loop
- Purpose of `break`: The `break` command allows you to prematurely exit a loop, whether it's a `while` loop or a `for` loop.
- Behavior of `break`: Once `break` is encountered within a loop, the loop terminates immediately. The program then continues executing the code that comes after the loop.
- Example with `break` in a `for` loop: The text highlights an example where a loop iterates over a list of plant names. When the variable `temp` (presumably within the loop's code) reaches the value "Thallophyte", the `break` statement is executed, preventing the printing of "Conifer" from the list.
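The `break` example described above might be reconstructed as below. Only the names `temp`, "Thallophyte", and "Conifer" come from the text; the other list entries, and the choice to stop before printing "Thallophyte" itself, are assumptions.

```python
# Hypothetical list; only "Thallophyte" and "Conifer" are named in the text.
plants = ["Bryophyte", "Pteridophyte", "Thallophyte", "Conifer"]

visited = []
for temp in plants:
    if temp == "Thallophyte":
        break                  # leave the loop immediately
    print(temp)
    visited.append(temp)

print("Loop finished")         # execution continues here after break
```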
9 File handling in Python
- File Handling in Python:
  - Python uses the `open()` function to interact with files.
  - Files are stored on secondary memory, like hard drives, and persist even when the computer is turned off.
  - Files are essential for storing and sharing data, especially in research.
- Accessing Files:
  - `f = open('myfile.txt')`: This opens a file named 'myfile.txt' in the current working directory (usually the directory the script is run from).
  - `f = open(r'C:\Python33\Scripts\myfile.txt')`: This opens a file with the specified full path. The `r` prefix makes it a raw string, so the backslashes in the Windows path are not treated as escape characters.
- Importance in Research:
  - Biological data is often stored in files with specific formats (PDB, networks, sequence files).
  - The ability to read and manipulate these files is crucial for scientific work.
9.1 Specify file mode
- File Modes in Python: The text describes the file modes used for file operations in Python, each with a specific purpose:
- r (Read): Opens a file for reading only.
- w (Write): Opens a file for writing only, overwriting existing content.
- a (Append): Opens a file for writing only, adding content to the end.
- r+ (Read and Write): Opens a file for both reading and writing.
- x (Create): Creates a new file for writing and fails if the file already exists.
- t (Text): Reads and writes strings. This is the default mode.
- b (Binary): Reads and writes bytes objects. Used for non-text files like images.
- FASTA File Format: The text uses the FASTA file format as an example. FASTA files are used to store genetic sequences and their identifiers.
- The identifier line starts with ">".
- The sequence follows the identifier line.
- File Handling in Python:
  - The `open()` function is used to open files in Python.
  - The `read()` method reads the entire content of a file.
  - The `readlines()` method returns a list of all lines in a file.
  - The `write()` method writes content to a file.
  - It's important to close files after use using the `close()` method.
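The file operations above can be combined into a small round trip: write a FASTA record, read it back, then append a second record. This is a sketch under stated assumptions: the sequences and identifiers are invented, the file is created in a temporary directory so nothing existing is overwritten, and the `with` statement (not mentioned in the text) is shown as a common way to have the file closed automatically.

```python
import os
import tempfile

# Invented FASTA content: an identifier line starting with ">" and a sequence line.
fasta_text = ">seq1 hypothetical example\nATGCGTAC\n"
path = os.path.join(tempfile.mkdtemp(), "example.fasta")

f = open(path, "w")        # "w": write mode, creates or overwrites the file
f.write(fasta_text)
f.close()                  # close the file when done

f = open(path, "r")        # "r": read mode
lines = f.readlines()      # list with one string per line
f.close()

identifier = lines[0].strip()   # strip() removes the trailing newline
sequence = lines[1].strip()

# "a": append mode; the with statement closes the file automatically.
with open(path, "a") as handle:
    handle.write(">seq2 hypothetical example\nGGCC\n")

with open(path) as handle:      # default mode is "r" (text)
    content = handle.read()     # the whole file as one string

print(identifier, sequence)
```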
10 Importing functions
- Python Modules:
  - Python modules are files with the `.py` extension containing Python code.
  - Modules help organize code into reusable components, making programs easier to maintain and understand.
  - You use the `import` keyword to access modules within your Python scripts.
- Python Packages:
  - Packages are collections of modules that provide functionality for specific tasks.
  - The Python Package Index (PyPI) hosts over 227,000 packages, covering diverse areas like web development, data science, and machine learning.
  - You can install packages using the `pip` installer (or `conda` within Anaconda distributions).
- T-Tests:
  - T-tests are statistical tools used to compare the means of two groups.
  - They help determine if observed differences between groups are statistically significant or likely due to random chance.
  - T-tests are used when data has a normal distribution but unknown variances.
- Example: Drug Trial:
  - A pharmaceutical company testing a new drug uses a t-test to analyze the effectiveness of the treatment compared to a placebo.
  - The t-test helps determine if the observed difference in life expectancy between the treatment and control groups is statistically significant or likely a random variation.
- Example: Plant Growth Experiment:
  - Researchers investigate the effect of a nutrient on plant growth by comparing a treatment group to a control group.
  - A t-test helps determine if the difference in plant height between the groups is statistically significant or a result of natural variation.
  - A p-value is generated. If it's less than 0.05, the difference is considered statistically significant.
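To illustrate both importing and the idea behind a t-test, the sketch below imports two standard-library modules and computes the t statistic for a plant growth comparison by hand. The height measurements are invented, and the unequal-variance (Welch) form of the statistic is an assumption, since the text does not say which variant it uses. Turning the statistic into a p-value requires the t distribution, which is typically done with a statistics package such as SciPy (`scipy.stats.ttest_ind`) rather than by hand.

```python
import statistics            # import a whole module
from math import sqrt        # import a single function from a module

# Invented plant heights in cm for the two groups.
control = [10.1, 9.8, 10.4, 10.0, 9.7]
treated = [11.2, 10.9, 11.5, 11.0, 11.4]

mean_control = statistics.mean(control)
mean_treated = statistics.mean(treated)
var_control = statistics.variance(control)   # sample variance (divides by n - 1)
var_treated = statistics.variance(treated)

# Standard error of the difference between the two means (Welch form).
standard_error = sqrt(var_control / len(control) + var_treated / len(treated))

# t statistic: difference in means measured in units of its standard error.
t_statistic = (mean_treated - mean_control) / standard_error

print(mean_control, mean_treated, t_statistic)
```

A large absolute value of `t_statistic` means the difference between the group means is large relative to the variation within the groups.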
10.1 Running a simple linear regression in Python
- The `scipy.stats` module offers a variety of statistical analyses.
- The `pearsonr` function within this module computes the Pearson correlation coefficient and its p-value.
- Pearson correlation assesses the strength of the linear relationship between two sets of continuous numerical data.
- It can be used to determine if there's a correlation between variables, such as frog size and call length.
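To show what the Pearson coefficient measures, the sketch below computes it from its definition using only the standard library. The frog measurements are invented, and `pearson_r` is a helper written for this example, not a SciPy function. In practice `scipy.stats.pearsonr(x, y)` returns the same coefficient together with a p-value.

```python
from math import sqrt

# Invented measurements for five frogs.
frog_size = [2.1, 2.5, 3.0, 3.4, 4.0]      # body size
call_length = [1.0, 1.3, 1.5, 1.9, 2.2]    # call duration


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


r = pearson_r(frog_size, call_length)
print(round(r, 3))   # values near 1 indicate a strong positive linear relationship
```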
11 Data handling
- Pandas is a powerful Python library for data manipulation and analysis. It’s free and easy to use, making it popular among developers for data science projects.
- Pandas DataFrames offer a convenient way to work with data. You can add/delete columns, slice, index, and handle missing values effortlessly.
- Pandas simplifies data cleaning, modification, and analysis. It lets you analyze data from CSV files, perform statistical analysis, identify relationships between rows, and detect anomalies in data.
- You can visualize data with Matplotlib. Create various plots, like scatterplots, lines, bars, bubbles, and histograms, to gain insights from your data.
- The importance of learning programming is emphasized. It can lead to better career opportunities and help tackle modern challenges.