
ML in Bioinfo

1 Introduction to machine learning

  • Machine learning algorithms are statistical methods used to find hidden patterns in datasets and make predictions. They learn from examples and identify similar patterns in new data.
  • Machine learning is commonly used in areas like social media suggestions, online shopping recommendations, and spam filters.
  • Applications of machine learning in biology include gene prediction, protein function prediction, cell image recognition, and drug molecule prediction.
  • A key aspect of machine learning is feature selection, which involves choosing the most descriptive and relevant information about the problem.
  • Machine learning algorithms are categorized based on the problem type, and different algorithms are used for different tasks.
  • Training and evaluation are essential steps in the machine learning process to assess the performance of the models.
  • Machine learning is transforming life sciences by addressing complex issues and opening new opportunities for the future.

2 Types of machine learning systems

  • Machine learning has become a crucial factor in determining success in various fields.
  • Machine learning algorithms are challenging but valuable.
  • Machine learning systems are categorized by how they learn from data:
    • Supervised learning: learns from labeled examples to predict continuous values (e.g., temperature, pH) or categories (e.g., positive/negative, warm/cold).
    • Unsupervised learning: finds structure in unlabeled data, for example by grouping examples based on similarities and differences in their features.
    • Reinforcement learning: learns to make decisions by trial and error, guided by rewards and penalties.

2.1 Supervised learning

  • Supervised learning is the most common machine learning method, analogous to using flashcards to teach.
  • In supervised learning, algorithms are trained on labeled data, learning to predict labels for unseen data.
  • Regression is a type of supervised learning used to predict continuous values (e.g., predicting a child’s height based on parent’s data).
  • Classification is another type of supervised learning used to predict categories or labels (e.g., classifying a child’s height as "Above Average" or "Below Average").

2.2.1 Linear regression

  • Linear Regression: A supervised learning algorithm used for predicting continuous variables like sales, age, or temperature.
  • Linear Equation Representation: The relationship between an independent variable (x) and a dependent variable (y) is expressed by a linear equation: y = mx + b.
  • Finding the Best Fit: Linear regression aims to find the optimal values for slope (m) and y-intercept (b) that minimize the average difference (error) between data points and the line.
  • Error Measurement: The error between each data point and the line is squared and summed to give the sum of squared errors. The average of this sum is the mean squared error (MSE), which measures how well the model fits the data.
  • Multivariate Linear Regression: Extends to multiple variables (more than two), where the line becomes a plane or hyperplane, represented by the equation: y = m1x1 + m2x2 + … + mnxn + b.
  • Coefficient Interpretation: The coefficients (m1, m2, etc.) reveal the contribution of each feature in determining the dependent variable (y).
  • Approximate Model Creation: Linear regression can be used to derive approximate mathematical models for problems by providing approximate linear equations.
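
As a concrete illustration of the ideas above, here is a minimal sketch of fitting a line and computing the MSE. It assumes scikit-learn is available, and the parent/child height values are invented for the example.

```python
# Minimal sketch: fitting a linear model y = m*x + b and measuring its error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[160], [165], [170], [175], [180]])   # parent height (cm), toy values
y = np.array([162, 166, 171, 174, 181])             # child height (cm), toy values

model = LinearRegression().fit(X, y)
print("slope m:", model.coef_[0], "intercept b:", model.intercept_)
print("MSE:", mean_squared_error(y, model.predict(X)))
```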

2.3 Logistic regression

  • Classification Problems: These involve categorizing data into two or more classes. Examples include spam/ham detection, tumor diagnosis, and disease prediction.
  • Binary Classification: This type involves two classes, often represented by "0" and "1".
  • Limitations of Linear Regression: Linear regression is unsuitable for classification because its output can range from negative infinity to positive infinity, while classification requires discrete outputs ("0" or "1").
  • Sigmoid Function: The sigmoid function, with its "S" shape, is used for classification because it produces outputs between 0 and 1, making it suitable for representing the probability of belonging to a class (see the sketch below).
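
A minimal sketch of the sigmoid and of logistic regression as a binary classifier, assuming scikit-learn; the tumor-size data below are invented for illustration.

```python
# Minimal sketch: sigmoid function and a logistic regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # e.g. tumor size (toy data)
y = np.array([0, 0, 0, 1, 1, 1])                          # 0 = benign, 1 = malignant

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[3.5]]))   # class probabilities, each between 0 and 1
```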

2.4 K-nearest neighbor

  • K-nearest neighbor (KNN) algorithm: This algorithm uses the principle of "a man is known by the company he keeps," meaning it classifies new data based on its similarity to known data points.
  • How KNN works: It finds the K closest data points to a new data point and assigns it to the majority class among those neighbors.
  • Importance of K value: Choosing the optimal K value is crucial for balancing accuracy and computational efficiency. Low K values can lead to noise sensitivity, while high K values can increase computational cost.
  • Distance measures: The most common distance measure is Euclidean distance, but other measures like Hamming distance can be used.
  • Dimensionality issue: In high-dimensional spaces, all data points tend to be far apart, posing challenges for KNN.
  • Advantages of KNN: Simplicity, adaptability for both classification and regression, and reliability with large training sets.
  • Disadvantages of KNN: Prediction is computationally expensive because distances to all training points must be calculated, the cost grows with the size of the training set and the number of features, and performance is poor with categorical features.
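
A minimal sketch of KNN classification on the iris dataset, trying several K values; it assumes scikit-learn and is only meant to illustrate how the choice of K affects performance.

```python
# Minimal sketch: KNN classification with different K values.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k}: test accuracy = {knn.score(X_test, y_test):.2f}")
```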

2.5 Decision trees

  • Decision trees are a tree-like model used for making decisions. They have applications in various fields like engineering, law, and machine learning.
  • Decision trees consist of nodes and branches. Nodes represent tests or conditions, while branches represent the outcomes of those tests.
  • Key components of a decision tree:
    • Root Node: The starting point where the data is split.
    • Decision Node: A node that further divides into sub-nodes.
    • Leaf Node: The final outcome of a decision path.
  • Decision trees work by splitting data based on purity or impurity gains.
  • Overfitting can occur in decision trees. This is when the tree becomes too complex and learns the training data too well, leading to poor performance on new data.
  • Tree pruning is used to prevent overfitting, for example by limiting the depth of the tree or removing branches that contribute little to predictive accuracy.
  • Advantages of decision trees:
    • Easy to understand and visualize.
    • Minimal data processing required.
    • Can handle both numerical and categorical data.
  • Disadvantages of decision trees:
    • Prone to overfitting.
    • May not be globally optimal.
    • Often require multiple trees for optimal results.
  • Random forests are an ensemble of decision trees. They combine multiple trees trained on random subsets of the data, improving accuracy and robustness.
  • Random forests are more accurate than single decision trees because:
    • They are not pruned, allowing for finer divisions of the data.
    • Each tree is trained on a random subset of features, promoting diversity.
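
A minimal sketch comparing a depth-limited (pruned) decision tree with a random forest, using the breast cancer dataset bundled with scikit-learn; the depth and number of trees are arbitrary choices for illustration.

```python
# Minimal sketch: a single pruned decision tree vs. a random forest ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("decision tree test accuracy :", tree.score(X_test, y_test))
print("random forest test accuracy :", forest.score(X_test, y_test))
```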

2.6 Support vector machines

  • Support Vector Machines (SVMs):
    • Are supervised learning algorithms used for both classification and regression tasks.
    • Are linear classifiers, meaning they create linear decision boundaries.
    • Aim to find the optimal hyperplane (decision boundary) that best separates data into different classes.
  • Hyperplanes:
    • Are flat decision boundaries with N-1 dimensions for an N-dimensional dataset.
    • In 2D, they are lines, and in 3D, they are planes.
  • Linearly Separable Data:
    • Data that can be separated by a line or plane.
    • SVM finds the optimal hyperplane for such data.
  • Support Vectors:
    • Data points that are closest to the hyperplane, essentially defining the margin.
    • Changes to these points will affect the hyperplane’s location.
  • Margins:
    • The distance between the hyperplane and the support vectors.
    • Hard Margin: All data points are correctly classified with a clear separation.
    • Soft Margin: Allows for some misclassifications to handle real-world data, maximizing the margin while minimizing errors.
    • ""C"" Parameter: Controls the tolerance of misclassifications in scikit-learn’s SVM implementation. Lower ""C"" means less tolerance, higher ""C"" means more tolerance.

2.6.1 Kernel trick

  • Kernel Trick for Non-Linearly Separable Data: SVMs are traditionally used for linearly separable data. However, real-world data often lacks a clear linear boundary. The kernel trick addresses this by transforming low-dimensional data into a higher-dimensional space where it becomes linearly separable.
  • Example: A one-dimensional dataset with two classes (green and blue) is not linearly separable. By adding a new feature (the square of the original data), the data can be plotted in two dimensions, making the classes linearly separable.
  • Kernel Functions: These functions create new features that map data to higher dimensions. The radial basis function (RBF) is a popular kernel that measures similarity between data points: k(xi, xj) = exp(-γ‖xi - xj‖²), where γ (gamma) controls how far the influence of a single training point reaches in shaping the decision boundary.
  • Tuning: Similar to the ""C"" parameter, gamma needs to be tuned for optimal SVM performance.
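
A minimal sketch of the one-dimensional example above, using an RBF-kernel SVM rather than manually adding the squared feature; the data points and the gamma value are invented for illustration.

```python
# Minimal sketch: RBF kernel handles data that is not linearly separable in 1-D.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-3], [-2], [2], [3],      # one class, far from zero
              [-0.5], [0.0], [0.5]])     # other class, near zero
y = np.array([0, 0, 0, 0, 1, 1, 1])

rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)   # gamma controls point influence
print(rbf_svm.predict([[0.2], [2.5]]))   # should predict class 1, then class 0
```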

2.7 Neural networks

  • Artificial Neural Networks (ANNs) are a supervised learning algorithm designed to mimic the human brain’s neural network.
  • ANNs are primarily used for statistical analysis and modeling of data, acting as a complement to traditional nonlinear regression models.
  • They are often used to solve regression and classification problems.
  • Neural networks have been researched for over 60 years and have applications in fields such as speech and image recognition, text recognition, medical diagnosis, and fraud detection.

2.8 Neural networks architecture

  • Neural networks have three main layers: input, hidden, and output.
  • The input layer receives the data features.
  • The hidden layer performs complex mathematical calculations (the "black box").
  • The number of hidden layers determines the network’s depth; deep neural networks have multiple hidden layers.
  • The output layer provides the network’s results.
  • Neural networks can handle both linear and nonlinear datasets due to their flexible structure.
  • Overfitting is a potential issue with neural networks, requiring careful parameter tuning.
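
A minimal sketch of a small feed-forward network with two hidden layers, using scikit-learn's MLPClassifier; the layer sizes are arbitrary choices for illustration.

```python
# Minimal sketch: input layer -> two hidden layers (32 and 16 neurons) -> output layer.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
).fit(X_train, y_train)

print("test accuracy:", net.score(X_test, y_test))
```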

2.9 Convolutional neural network

  • Convolutional Neural Networks (CNNs) are specialized for image and spatial data. They learn by combining local pixel features.
  • CNNs use filters to extract features like edges and textures. These filters are trained through back-propagation.
  • Pooling layers simplify the data by reducing its size. They do this through techniques like max pooling (choosing the highest intensity pixel) or average pooling.
  • Flattening converts 2D feature maps into 1D vectors. This allows the features to be fed into an Artificial Neural Network (ANN) for classification.
  • CNNs extract features from images, which are then passed to ANNs for classification in a single combined architecture.
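
A minimal sketch of the convolution → pooling → flatten → dense pipeline described above. It assumes TensorFlow/Keras is installed; the input shape (e.g., 64×64 grayscale cell images) and layer sizes are arbitrary.

```python
# Minimal sketch: a small CNN that feeds extracted features into dense (ANN) layers.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),            # e.g. 64x64 grayscale images
    layers.Conv2D(16, 3, activation="relu"),   # filters extract local features
    layers.MaxPooling2D(2),                    # pooling reduces spatial size
    layers.Flatten(),                          # 2-D feature maps -> 1-D vector
    layers.Dense(32, activation="relu"),       # fully connected (ANN) part
    layers.Dense(2, activation="softmax"),     # two output classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```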

2.10 Unsupervised learning

  • Unsupervised learning differs from supervised learning by using unlabeled data, allowing the machine to interpret data independently.
  • Clustering is a prominent method in unsupervised learning, where data is sorted into groups (clusters) based on similarities and differences.
  • Gene expression analysis is an example of clustering, where genes are grouped according to their expression patterns.
  • Outlier detection identifies data points that deviate from a general pattern, similar to noticing a red car amongst white cars.
  • Applications of outlier detection include disease diagnosis, variation analysis, and detection of human data entry errors.

2.11 K-means clustering

  • Unsupervised Learning: This type of learning differs from supervised learning by not relying on labeled data. Instead, it seeks to find hidden patterns and group similar data points into clusters.
  • K-Means Clustering: This is a popular unsupervised learning algorithm that aims to separate unlabeled data into clusters.
  • The Process:
    • Choosing K: The user specifies the desired number of clusters (K).
    • Random Centroids: K centroids are randomly placed within the dataset.
    • Assignment: Data points are assigned to the closest centroid based on distance.
    • Centroid Update: Centroids are moved to the center of their respective clusters.
    • Iteration: Steps 3 and 4 are repeated until the centroids stop moving significantly.
  • Applications in Biology: K-means clustering finds use in various biological applications, including gene expression analysis, drug repurposing, and organism/protein classification.
  • Benefits: The process reveals hidden patterns within data by grouping similar data points, leading to valuable insights and knowledge discovery.
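
A minimal sketch of K-means on toy two-dimensional data with K = 2, assuming scikit-learn; the two Gaussian blobs stand in for, say, two groups of expression profiles.

```python
# Minimal sketch: K-means clustering with K chosen by the user.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(20, 2)),    # toy cluster A
               rng.normal(loc=5.0, scale=0.5, size=(20, 2))])   # toy cluster B

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("cluster labels:", km.labels_)
```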

2.12 Reinforcement learning

Reinforcement Learning: A Key to Decision-Making

  • The Goal: Reinforcement learning empowers machine learning models to make decisions in complex environments.
  • Trial and Error: The process relies on trial and error, with the model being rewarded or penalized for its actions, aiming to maximize rewards over time.
  • No Pre-Defined Solutions: Unlike supervised learning, reinforcement learning doesn’t provide explicit guidance on how to solve problems. The model must discover solutions through exploration.
  • Key Components:
    • Agent: The ML model making decisions.
    • Environment: The context in which the agent operates.
    • Actions: The choices the agent makes that impact the environment.
    • Rewards: The feedback system that evaluates the agent’s actions.
  • Distinctive Approach: Reinforcement learning stands apart from supervised and unsupervised learning by focusing on dynamic decision-making sequences, unlike the static pattern recognition of the other approaches.
  • Balancing Exploration and Exploitation: Successful reinforcement learning agents must balance exploring new paths to find rewards with exploiting already known rewarding strategies.
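
A minimal sketch of the trial-and-error loop and the exploration/exploitation balance, using a made-up three-armed bandit; the reward probabilities are invented and this is not a full reinforcement learning framework.

```python
# Minimal sketch: epsilon-greedy agent learning which of three actions pays off best.
import random

true_reward_prob = [0.2, 0.5, 0.8]   # hidden from the agent (toy environment)
estimates = [0.0, 0.0, 0.0]          # agent's running estimate of each action's value
counts = [0, 0, 0]
epsilon = 0.1                        # exploration rate

for step in range(1000):
    if random.random() < epsilon:                    # explore a random action
        action = random.randrange(3)
    else:                                            # exploit the best-known action
        action = estimates.index(max(estimates))
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print("estimated action values:", [round(v, 2) for v in estimates])
```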

3 Evaluation of machine learning models

  • Importance of Model Evaluation: After training a machine learning model, it’s crucial to evaluate its performance to understand how well it can generalize to new data.
  • Test Set for Evaluation: A separate "test set" (typically 20% of the data) is used to assess the model. This data is kept hidden from the model during training.
  • Model Performance on Unseen Data: The trained model is then applied to the test set, and its predictions are compared to the actual labels.
  • Confusion Matrix: A confusion matrix is a tool to visualize the model’s performance by comparing predicted and actual values.
  • Metrics for Evaluation: Mathematical metrics, such as accuracy, precision, and recall, are derived from the confusion matrix to quantify model performance.

3.1 Accuracy

  • Accuracy: The percentage of correctly predicted values out of all instances in the test set. Calculated as (TP + TN) / (TP + FP + TN + FN).
  • Precision: The proportion of correctly identified positive instances out of all instances the model predicted as positive. Calculated as TP / (TP + FP).
  • Recall (Sensitivity): The proportion of correctly identified positive instances out of all actual positive instances in the dataset. Calculated as TP / (TP + FN).
  • F1 Score: A single value that balances precision and recall, representing their harmonic mean. Calculated as 2 * (Precision * Recall) / (Precision + Recall).
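
A minimal sketch computing the confusion matrix and the metrics above with scikit-learn; the true and predicted labels are invented for illustration.

```python
# Minimal sketch: confusion matrix and derived metrics.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (toy)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (toy)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy :", accuracy_score(y_true, y_pred))     # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))           # harmonic mean of the two
```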

3.2 Receiver Operating Characteristic (ROC) curve

  • ROC Curve: A graphical representation used to assess the performance of classifiers.
  • ROC Curve Components:
    • True Positive Rate (TPR): Sensitivity, the proportion of actual positives correctly identified.
    • False Positive Rate (FPR): 1 - Specificity, the proportion of actual negatives incorrectly identified as positives.
  • Ideal ROC Curve: A curve with a large area under the curve (AUC) indicates a more accurate model, showing a significant increase in true positives relative to false positives.
  • Regression Model Evaluation: Regression models use different metrics due to their continuous data nature.
  • Common Regression Metrics:
    • Explained variance
    • Mean Squared Error
    • R-squared Coefficient
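
A minimal sketch of an ROC curve and AUC computed from predicted scores, assuming scikit-learn; the labels and scores are invented for illustration.

```python
# Minimal sketch: ROC curve points (FPR, TPR) and area under the curve.
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # model's probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))
```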

3.3 Cross-validation

  • Cross-validation is a technique for obtaining a more reliable estimate of a model’s performance and for comparing two or more models.
  • It addresses the drawback of splitting data into training and test sets, which can lead to loss of information and poor model training.
  • Cross-validation divides the dataset into subsets and trains and evaluates the model using various combinations of these subsets.
  • Two common cross-validation methods are K-fold validation and Leave One Out Cross Validation (LOOCV).
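
A minimal sketch of 5-fold cross-validation and leave-one-out cross-validation with scikit-learn on the iris dataset; the choice of logistic regression as the model is arbitrary.

```python
# Minimal sketch: K-fold and leave-one-out cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
loocv_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("5-fold mean accuracy:", kfold_scores.mean())
print("LOOCV mean accuracy :", loocv_scores.mean())
```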

3.4 Testing and validating

  • Data Division for Machine Learning: When working with large datasets, it’s common to split the data into three distinct sets: training, validation, and testing.
  • Training Set: This set forms the bulk of the data and is used to teach the machine learning model.
  • Validation Set: This set is kept separate from the training data and is used for:
    • Evaluating the model’s performance during training.
    • Fine-tuning the model’s parameters.
    • Identifying the best model configurations through experimentation.
  • Testing Set: This set is completely hidden from the model during training and validation. It’s used for a final, unbiased evaluation of the model’s performance.
  • Avoiding Bias: Using the same data for both validation and testing can lead to bias, as the model may become overly tailored to the validation set.
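
A minimal sketch of a 60/20/20 train/validation/test split using two calls to scikit-learn's train_test_split; the proportions are just one common choice.

```python
# Minimal sketch: splitting data into training, validation, and test sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the final test set, then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%
```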

4 Optimization of models

  • Model optimization is crucial for improving the accuracy of predictions.
  • Fine-tuning involves selecting the best model and parameter values.
  • Optimization involves iteratively configuring and validating models with different parameter combinations.
  • Continuous training and validation ensure model optimization and performance.

4.1 Parameter searching

  • Parameter Optimization Methods: Two common methods for finding the best combinations of hyperparameters are:

    • Grid Search: This method exhaustively tries all possible combinations of hyperparameters within a defined range. It’s effective but can be computationally expensive.
    • Random Search: Instead of trying every combination, this method randomly samples points from the space of hyperparameters. It can be more efficient than grid search, especially when dealing with high-dimensional parameter spaces.
  • Grid Search Details:

    • It requires defining a range of values for each hyperparameter.
    • It’s a general technique applicable to various machine learning models.
    • Domain knowledge can help reduce the search space, making it more efficient.
  • Random Search Details:

    • Introduced by Bergstra and Bengio (2012).
    • It’s a more efficient alternative to grid search when dealing with many hyperparameters.
    • It samples hyperparameter combinations randomly, reducing the computational cost.
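
A minimal sketch of grid search and random search over SVM hyperparameters with scikit-learn; the parameter ranges are arbitrary examples.

```python
# Minimal sketch: exhaustive grid search vs. random sampling of hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}, cv=5)
grid.fit(X, y)
print("grid search best params  :", grid.best_params_)

rand = RandomizedSearchCV(SVC(),
                          {"C": [0.1, 1, 10, 100], "gamma": [0.0001, 0.001, 0.01, 0.1]},
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print("random search best params:", rand.best_params_)
```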

4.2 Ensemble methods

  • Ensemble methods combine multiple models to achieve better results than a single model.
  • Voting and averaging are simple ensemble techniques.
  • Voting is used for classification and involves combining the predictions of multiple models.
  • Averaging is used for regression and involves averaging the predictions of multiple models.
  • Base models can be created using various methods, such as different splits of the same dataset, different algorithms, or any other approach.
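
A minimal sketch of hard voting for classification and simple averaging for regression, assuming scikit-learn; the base models and the averaged predictions are arbitrary examples.

```python
# Minimal sketch: voting (classification) and averaging (regression) ensembles.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

vote = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
], voting="hard").fit(X, y)
print("voting prediction:", vote.predict(X[:3]))

# Averaging for regression: take the mean of the base models' predictions.
preds = np.array([[2.1, 2.4], [2.3, 2.2], [1.9, 2.6]])   # three models, two samples (toy)
print("averaged prediction:", preds.mean(axis=0))
```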

5 Main challenges of machine learning

  • Machine learning offers advantages over traditional methods in decision-making, but it also presents its own challenges.
  • Despite rapid advancements, machine learning technology still has a long way to go.
  • Developing machine learning solutions involves overcoming numerous obstacles and issues.

5.1 Insufficient quantity of training data

  • Insufficient training data can hinder machine learning project success.
  • Lack of data can prevent algorithms from accurately representing real-world scenarios.
  • Data collection is a crucial step in any machine learning project.
  • The amount of data needed depends on the complexity of the problem and the chosen algorithm.
  • While high-throughput technologies increase available data, integrating diverse data types remains a challenge in biology.
  • Despite massive data generation, comprehending biological system mechanisms like phenotypes is difficult due to data integration challenges.

5.2 Non-representative training data

  • Representative training data is crucial: Training data should accurately reflect the population it’s intended to represent.
  • Sample size matters: Insufficient sample size can lead to sampling noise (non-representative data).
  • Sampling procedure impacts accuracy: Even large samples can be biased if the sampling method is flawed.
  • Bias-variance trade-off: Reducing a model’s bias tends to increase its variance, making it less generalizable; conversely, reducing variance tends to increase bias.

5.3 Quality of data

  • Machine learning algorithms are highly sensitive to data quality.
  • High-quality data is crucial for accurate training and testing.
  • Inconsistencies in data can lead to over or underestimation by the algorithm.
  • Even small variations in training data can significantly impact the algorithm’s output.
  • Complex problems require not only large datasets but also diverse and informative data.
  • Missing, misinterpreted, or inaccurate values can negatively affect data quality.

5.4 Irrelevant features

  • Feature Selection is Crucial: Choosing relevant features for machine learning algorithms is essential for optimal performance.
  • Irrelevant Features are Problematic: They mislead algorithms, increase data size, and add unnecessary complexity.
  • Redundant Features are Inefficient: Highly correlated features provide little additional value for training.
  • Dimensionality Reduction Techniques: Methods like PCA can help reduce feature dimensionality while retaining variance.
  • More Features Can Introduce Noise: Using feature selection techniques can help models perform better by reducing noise.
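
A minimal sketch of dimensionality reduction with PCA, assuming scikit-learn; reducing 30 features to 5 components is an arbitrary choice for illustration.

```python
# Minimal sketch: PCA keeps most of the variance while shrinking the feature space.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)     # 30 original features
X_scaled = StandardScaler().fit_transform(X)   # scale features before PCA

pca = PCA(n_components=5).fit(X_scaled)
print("variance retained by 5 components:", pca.explained_variance_ratio_.sum())
```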

5.5 Overfitting or underfitting on training data

  • Overfitting: A model performs well on training data but poorly on test data due to memorizing the training data’s noise and details. This hinders the model’s ability to generalize to real-world situations. Nonlinear algorithms with many parameters are prone to overfitting. Solutions include fine-tuning parameters.
  • Underfitting: A model fails to capture relationships between features and outputs. This indicates an unsuitable algorithm for the dataset, often due to insufficient data. Solutions include increasing data availability or exploring alternative algorithms.