Basics of R and RStudio
# Setting Up Your Workspace
Start by installing R and RStudio. R is the programming language, and RStudio is a user-friendly interface that makes working with R much easier. Just Google “RStudio download” and follow the installation instructions. It’s completely free!
Installation
- Download and Install R: Head to https://cran.r-project.org/ and download the R distribution for your operating system. Follow the installation instructions.
- Download and Install RStudio: Visit https://www.rstudio.com/products/rstudio/download/ and download the free RStudio Desktop version. Install it after you’ve installed R.
RStudio interface
When you open RStudio, you’ll see an interface with four key panes:
- Script/Source Window (Top Left): This is where you’ll write and edit your R code.
- Console Window (Bottom Left): Execute commands directly and view output.
- Environment Pane (Top Right): This pane displays the objects (variables, data frames) you create in your R session.
- Files/Plots/Help Pane (Bottom Right): Manage your R scripts, data files, and project files; view graphs generated by your code; and read documentation for functions and data sets, with access to online resources.
# Basic R Syntax
- R as a Calculator: R can perform basic arithmetic operations:

      7 + 7

- Variable Assignment: Assign values to variables using the arrow symbol (<-). For example:

      my_variable <- "Hello, world!"

- Printing Values: Use the print() function to display the value of a variable or any expression:

      print(my_variable)
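As a minimal sketch of the calculator and assignment ideas above, the following lines try a few more arithmetic operators and store a result for reuse. The names radius and area are illustrative and do not come from the original text.

```r
# Basic arithmetic operators
7 + 7      # addition
7 - 2      # subtraction
7 * 2      # multiplication
7 / 2      # division
7 ^ 2      # exponentiation
7 %% 2     # remainder after division
7 %/% 2    # integer division

# Store a result in a variable and reuse it
radius <- 3
area <- pi * radius^2
print(area)
```

Typing a variable name on its own line in the console also displays its value, so print() is mostly needed inside scripts, loops, and functions.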
Working with Data Structures
R offers various data types and structures for storing and manipulating information.
Data Types:
- Numeric: Decimal numbers (doubles) and whole numbers (integers). A number typed without a suffix is stored as a double; add L (for example 5L) to get an integer.
- Character: Text strings
- Logical: TRUE/FALSE values
- Complex: Numbers with real and imaginary parts
- Raw: Binary data
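To see these types in practice, typeof() reports how R stores a value internally. This is a small sketch; the literal values are arbitrary examples chosen for illustration.

```r
typeof(3.14)         # "double"
typeof(5)            # "double": plain numbers are doubles by default
typeof(5L)           # "integer": the L suffix requests an integer
typeof("apple")      # "character"
typeof(TRUE)         # "logical"
typeof(2 + 3i)       # "complex"
typeof(as.raw(255))  # "raw"

# is.*() functions test for a specific type
is.numeric(5L)       # TRUE: integers and doubles both count as numeric
is.character("5")    # TRUE
```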
Data Structures:
- Vectors: A sequence of elements of the same data type. Create vectors using the c() function:

      my_vector <- c(1, 2, 3, 4, 5) # Numeric vector
      my_vector_char <- c("apple", "banana", "cherry") # Character vector

- Lists: Collections of elements of potentially different data types:

      my_list <- list(1, "hello", TRUE) # Contains a number, string, and boolean

- Matrices: Two-dimensional arrays of elements of the same data type:

      my_matrix <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2) # Create a 2x2 matrix

- Arrays: Multi-dimensional data structures:

      my_array <- array(c(1, 2, 3, 4, 5, 6), dim = c(2, 3)) # Create a 2x3 array

- Factors: Represent categorical variables, treating data as groups rather than individual values:

      my_factor <- factor(c("red", "green", "blue", "red")) # Create a factor

- Data Frames: Organize data in a tabular format, with columns representing different data types and rows representing individual observations. Create data frames using the data.frame() function:

      my_df <- data.frame(name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 28))
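Once the structures above exist, a few inspection functions help confirm what was created. The sketch below rebuilds some of the same objects so it runs on its own; the three-dimensional array called cube is an extra illustration that is not in the original text.

```r
my_matrix <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
my_factor <- factor(c("red", "green", "blue", "red"))
my_df <- data.frame(name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 28))
cube <- array(1:24, dim = c(2, 3, 4))   # hypothetical array with three dimensions

dim(my_matrix)      # 2 2
my_matrix[1, 2]     # 3: matrix() fills column by column
dim(cube)           # 2 3 4
levels(my_factor)   # "blue" "green" "red": levels are sorted alphabetically
table(my_factor)    # counts per level
str(my_df)          # compact summary of columns and their types
```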
**Coercion (Changing data type)**
Manual Coercion: Use functions like as.integer(), as.numeric(), as.data.frame() to convert data types explicitly.
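The conversion functions named above can be tried on small values. This is a sketch with arbitrary inputs; the last part also shows implicit coercion, which the text does not cover but which explains why explicit conversion is often needed.

```r
as.integer("42")         # 42, stored as an integer
as.numeric("3.14")       # 3.14
as.integer(3.9)          # 3: the fractional part is dropped
as.character(10)         # "10"
as.logical(c(0, 1, 2))   # FALSE TRUE TRUE

# A matrix becomes a data frame with default column names V1 and V2
as.data.frame(matrix(1:4, nrow = 2))

# Text that is not a number becomes NA (R also raises a warning)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)      # TRUE

# Implicit coercion: mixing types in c() converts everything to character
mixed <- c(1, "a", TRUE)
class(mixed)             # "character"
```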
Entering Data Manually
Sometimes, you might need to enter small amounts of data directly into R.
Methods for Entering Data:
- Colon Operator: x <- 0:10 (Creates a sequence from 0 to 10)
- seq() Function: x <- seq(1, 10, by = 2) (Creates a sequence starting at 1 and stepping by 2 without going past 10, giving 1, 3, 5, 7, 9)
- c() Function: x <- c(1, 4, 7, 2) (Combines individual values; note the lowercase c, since R is case-sensitive)
- scan() Function: x <- scan() (Allows interactive data entry: type each value followed by Enter, and finish by pressing Enter on an empty line)
- rep() Function: x <- rep(TRUE, 5) (Repeats a value 5 times)
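The entry methods listed above can be compared side by side. In this sketch the variable names are illustrative, and scan() is given a text argument so the example runs without waiting for keyboard input.

```r
x_colon <- 0:10                       # 0 1 2 ... 10
x_seq   <- seq(1, 10, by = 2)         # 1 3 5 7 9
x_len   <- seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
x_c     <- c(1, 4, 7, 2)
x_rep   <- rep(TRUE, 5)
x_each  <- rep(c("a", "b"), each = 2) # "a" "a" "b" "b"

# Non-interactive use of scan(): read numbers from a string
x_scan  <- scan(text = "3 5 8", quiet = TRUE)

length(x_colon)  # 11
x_scan           # 3 5 8
```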
Help on Functions and Datasets
Use a question mark followed by the function or dataset name to access help documentation:

    ?mean
    ?mpg   # mpg ships with ggplot2, so load ggplot2 first

# Power Up R with Packages
R packages are collections of functions and tools that expand R’s capabilities.
Install Packages
- Option 1: By using the install.packages() function:
  - install.packages("dplyr")
- Option 2: By using pacman:
  - Install the pacman package: install.packages("pacman")
  - Load pacman: library(pacman)
  - Install (if needed) and load a set of packages: p_load(dplyr, tidyr, stringr, lubridate, ggplot2, readr)
To check if a package has been installed:
You can use the installed.packages() function. Here’s a simple way to check:

    is_installed <- "package_name" %in% rownames(installed.packages())

Replace “package_name” with the name of the package you want to check. This will return TRUE if the package is installed, and FALSE if it’s not.
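The installation check can be wrapped in a small helper so a script installs a package only when it is missing. This is a sketch: is_installed() and has_pkg() are hypothetical helper names defined here, not functions from any package, and the install line is commented out because it needs an internet connection.

```r
# Hypothetical helper: TRUE if `pkg` appears among the installed packages
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

# Alternative check: requireNamespace() returns TRUE or FALSE and,
# when the package is found, also loads its namespace
has_pkg <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # TRUE: stats ships with every R installation
has_pkg("stats")        # TRUE

# Install only when missing
# if (!is_installed("dplyr")) install.packages("dplyr")
```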
Load Packages
Once installed, you can load packages using library():
- library(dplyr)

**To check if a package has been loaded:** You can use the isNamespaceLoaded() function or check the search() list. Here are two methods:

    # Method 1
    is_loaded <- isNamespaceLoaded("package_name")
    # Method 2
    is_loaded <- "package:package_name" %in% search()

Again, replace “package_name” with the name of the package you want to check. The two methods answer slightly different questions: Method 1 returns TRUE as soon as the package’s namespace is loaded, even if it was only pulled in as a dependency of another package, while Method 2 returns TRUE only if the package has been attached, for example with library().
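The difference between a loaded namespace and an attached package can be seen with a package that ships with R. This sketch assumes the tools package has not already been attached in the session; loadNamespace() loads it without attaching it.

```r
# Load the namespace of tools without attaching the package
loadNamespace("tools")

ns_loaded <- isNamespaceLoaded("tools")          # Method 1
attached  <- "package:tools" %in% search()       # Method 2

ns_loaded   # TRUE: the namespace is now in memory
attached    # FALSE unless tools was attached earlier with library(tools)

# Functions from a loaded but unattached package need the :: prefix
tools::file_ext("report.csv")   # "csv"
```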
Check the version of an installed package
To check the version of an installed package in R, you can use the packageVersion() function. Here’s how to do it:

    packageVersion("package_name")

Replace “package_name” with the name of the package you want to check. This function will return the version number of the package.
For example, to check the version of the “dplyr” package:

    packageVersion("dplyr")

Or,

    print(packageVersion("dplyr"), quote = FALSE)

Discover Useful Packages
- CRAN (Comprehensive R Archive Network): The official repository for R packages, with task views that group packages by topic (e.g., Bayesian inference, chemometrics).
- CRANtastic: A site listing recently updated and popular R packages.
- GitHub: A platform where developers share and collaborate on R packages.
# Data Manipulation using Tidyverse and Dplyr
- Tidyverse: A Collection of Packages: The Tidyverse is a set of packages designed to work together seamlessly for consistent data analysis using the tidy data format.
- Dplyr: The Data Manipulation Master: Dplyr is a key Tidyverse package that lets you filter, transform, and summarize data with ease.
The Pipe Operator (%>%)
The pipe operator (%>%) chains operations, passing the output of one function as the first argument of the next. For example:

    my_df %>%
      filter(age > 28) %>%            # Filter for rows where age is greater than 28
      mutate(new_column = age * 2)    # Create a new column doubling the age
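To make the role of the pipe concrete, the sketch below writes the same filter-then-mutate steps twice, once as nested calls and once with %>%, and checks that the results match. It assumes dplyr is installed and reuses the small my_df data frame from earlier in this tutorial.

```r
library(dplyr)

my_df <- data.frame(name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 28))

# Nested calls: read from the inside out
nested <- mutate(filter(my_df, age > 28), new_column = age * 2)

# Piped calls: read from top to bottom
piped <- my_df %>%
  filter(age > 28) %>%
  mutate(new_column = age * 2)

identical(nested, piped)   # TRUE: both forms do the same work
piped                      # one row: Bob, 30, 60
```

The piped form is usually preferred once a chain has more than two steps, because each line states a single operation.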
Let’s filter by City mileage and then mutate a column in one step:

    library(ggplot2)  # provides the mpg dataset
    mpg_filtered_and_mutated <- mpg %>%
      filter(cty >= 20) %>%
      mutate(cty_metric = cty * 0.425144)  # miles per gallon to kilometres per litre
    # Output: A new dataset with both filtering and mutation performed
    View(mpg_filtered_and_mutated)

Subsetting Data
The filter() command allows you to select rows based on specific conditions:
- Filtering by Condition: Get cars with City mileage of at least 20 miles per gallon:

      mpg_efficient <- mpg %>% filter(cty >= 20)
      # Output: A new dataset called "mpg_efficient" with only cars that meet the condition
      View(mpg_efficient)

- Filtering by Variable Value: Get cars manufactured by Ford:

      mpg_ford <- mpg %>% filter(manufacturer == "ford")
      # Output: A new dataset called "mpg_ford" with only Ford cars
      View(mpg_ford)

Grouped Summaries
- Grouping and Summarizing: Calculate the average City mileage for each vehicle class:

      mpg_grouped_summary <- mpg %>%
        group_by(class) %>%
        summarize(avg_cty = mean(cty))
      # Output: A dataset showing the average City mileage for each vehicle class
      View(mpg_grouped_summary)

- Multiple Summaries: Calculate both average and median City mileage:

      mpg_grouped_summary <- mpg %>%
        group_by(class) %>%
        summarize(avg_cty = mean(cty), median_cty = median(cty))
      # Output: A dataset showing both the average and median City mileage for each vehicle class
      View(mpg_grouped_summary)

# Importing Data from Files
The most common way to get data into R is by importing it from files.
Common Data File Formats:
- CSV (Comma Separated Values): Plain text version of a spreadsheet.
- TXT (Text File): Simple text files.
- XLSX (Excel Spreadsheet): Excel files.
- JSON (JavaScript Object Notation): Data format often used for web data.
Step 1: Load the readr Package
- library(readr)
Step 2: Import Data
- CSV: data <- read_csv("your_file.csv")
- TXT: data <- read_delim("your_file.txt", delim = "\t")
- XLSX: Import using the readxl package: install.packages("readxl"); library(readxl); data <- read_excel("your_file.xlsx")
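The import steps above can be tried without any existing files by writing a small table to a temporary location and reading it back. This is a sketch that assumes readr is installed; the temporary paths stand in for your_file.csv and your_file.txt, and the people data is made up for illustration.

```r
library(readr)

# Temporary files stand in for real data files
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

people <- data.frame(name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 28))
write_csv(people, csv_path)                  # comma-separated
write_delim(people, txt_path, delim = "\t")  # tab-separated

# Declaring column types up front avoids relying on readr's guesses
types <- cols(name = col_character(), age = col_double())
from_csv <- read_csv(csv_path, col_types = types)
from_txt <- read_delim(txt_path, delim = "\t", col_types = types)

from_csv
identical(from_csv$name, from_txt$name)   # TRUE: both files hold the same data
```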
Practice
This tutorial will guide you through the fundamental data structures in R, specifically focusing on those essential for working with Bioconductor.
Vectors: Building Blocks of Data
Vectors are the foundation of many data structures in R. They hold elements of the same data type, making them efficient for storing and manipulating data.
Example:

    # Create a vector of numbers
    numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    # Print the vector
    numbers
    # Get the class of the vector
    class(numbers)

Subsetting Vectors: You can access specific elements within a vector using indexing and names.
Example:

    # Give names to the elements
    names(numbers) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
    # Access elements by index
    numbers[1]                 # Get the first element
    numbers[3:5]               # Get elements from index 3 to 5
    # Access elements by name
    numbers["a"]               # Get the element named "a"
    numbers[c("b", "d", "f")]  # Get elements named "b", "d", and "f"

Important Note: When using names for subsetting, be aware that non-unique names can lead to confusion. Only the first match will be returned.

    # Example with non-unique names, using a copy of the first three elements
    dup <- numbers[1:3]
    names(dup) <- c("a", "a", "b")
    dup["a"]  # Returns only the first element named "a"

Matrices: Organizing Data in Rows and Columns
Matrices are two-dimensional data structures with rows and columns. They are useful for representing tables or arrays of data.
Example:

    # Create a matrix
    matrix <- matrix(1:9, nrow = 3, ncol = 3)
    matrix
    # Get the dimensions of the matrix
    dim(matrix)
    # Access elements by indices
    matrix[1, 2]    # Access element in the first row, second column
    matrix[1:2, 3]  # Access elements in the first and second row, third column
    # Add row names
    rownames(matrix) <- c("R1", "R2", "R3")
    # Add column names
    colnames(matrix) <- c("C1", "C2", "C3")
    matrix

Subsetting Matrices: You can use both numeric indices and names for subsetting matrices, considering their two-dimensional nature.
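Subsetting by name works the same way as subsetting by position, with one index for rows and one for columns. The sketch below uses a separate object called m (an illustrative name) so it does not depend on the earlier example; leaving an index empty selects every row or column.

```r
m <- matrix(1:9, nrow = 3, ncol = 3,
            dimnames = list(c("R1", "R2", "R3"), c("C1", "C2", "C3")))

m["R1", "C2"]             # 4: same element as m[1, 2]
m[c("R1", "R3"), "C1"]    # named vector with R1 = 1 and R3 = 3
m[2, ]                    # whole second row: 2 5 8
m[, "C3", drop = FALSE]   # drop = FALSE keeps the result as a 3 x 1 matrix
```

Without drop = FALSE, selecting a single row or column returns a plain vector, which can surprise code that expects a matrix.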
Lists: Holding Diverse Data
Lists are flexible data structures capable of holding various types of objects, even of different classes.
Example:

    # Create a list with different elements
    my_list <- list(numbers = numbers, letters = letters[1:5], fun = mean)
    # Print the list
    my_list
    # Access elements by name
    my_list$numbers
    my_list$letters
    my_list$fun

Subsetting Lists: Similar to vectors, you can subset lists using indexing and names.
Example:

    my_list[1:2]   # Access the first two elements
    my_list[1]     # Access the first element (returns a list with one element)
    my_list[[1]]   # Access the first element (returns the element itself)

Important Note: Double brackets ([[ ]]) are crucial for accessing the elements directly within a list, as single brackets ([ ]) always return a list, even when only one element is selected.
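The practical consequence of the bracket distinction shows up as soon as a function needs the element itself. In this sketch the list shopping and its contents are made up for illustration.

```r
shopping <- list(prices = c(2.5, 4, 1.5), items = c("tea", "milk", "bread"))

class(shopping[1])      # "list": single brackets keep the list wrapper
class(shopping[[1]])    # "numeric": double brackets return the element

# sum() works on the extracted vector
total <- sum(shopping[["prices"]])
total                   # 8

# The same call on a one-element list fails, so the error is caught here
attempt <- tryCatch(sum(shopping["prices"]),
                    error = function(e) "sum() cannot add up a list")
attempt
```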
Data Frames: Organizing Data for Analysis
Data frames are essential for data analysis, storing observations of different types in columns. Each column represents a variable, and each row represents an observation.
Example:

    # Create a data frame with two variables
    my_df <- data.frame(sex = c("M", "F", "M"), age = c(25, 30, 28))
    # Print the data frame
    my_df
    # Access columns
    my_df$sex
    my_df$age
    # Access rows using subsetting
    my_df[1:2, ]  # Access the first two rows

Data Frame Characteristics:
- Column Orientation: Data frames are column-oriented, allowing easy access to individual variables.
- Unique Row Names: Row names in data frames are required to be unique, ensuring clear identification of observations.
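Both characteristics can be checked on the small data frame from the example. In this sketch the row names s1, s2, s3 are illustrative labels, and the attempt to assign duplicate row names is wrapped in tryCatch() so the script keeps running after R rejects it.

```r
my_df <- data.frame(sex = c("M", "F", "M"), age = c(25, 30, 28))

# Column orientation: each column is a vector that can be pulled out by name
my_df[["age"]]           # 25 30 28
mean(my_df$age)

# Row names identify observations and can be used for subsetting
rownames(my_df) <- c("s1", "s2", "s3")
my_df["s2", ]

# Duplicate row names are refused with an error
result <- tryCatch(
  suppressWarnings({
    rownames(my_df) <- c("s1", "s1", "s3")
    "duplicates accepted"
  }),
  error = function(e) "duplicates rejected"
)
result                   # "duplicates rejected"
```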
Converting Objects: Changing Data Types
R provides functions for converting between different data types.
Example:

    # Convert data frame to matrix
    as.matrix(my_df)
    # Convert matrix to list
    as.list(matrix)
    # Convert a vector to a list
    as.list(numbers)

General Conversion Function: The as() function in the methods package offers a general way to convert objects of various types.
Example:

    # Use the as function to convert an object
    as(my_df, "matrix")

This tutorial has provided you with a solid foundation in essential R objects for Bioconductor. You’ve learned about vectors, matrices, lists, and data frames, understanding how to create, manipulate, and convert them. This knowledge will serve you well as you explore the exciting world of Bioconductor.