Introduction

Abstract

Genetic algorithms (GAs) are optimization techniques inspired by natural selection and genetics. They evolve a population of candidate solutions over successive generations, with each individual representing a potential solution to the optimization problem at hand. By applying genetic operators such as selection, crossover, and mutation, a genetic algorithm iteratively improves the population and converges towards optimal or near-optimal solutions.

In genomics, where data sets are often large, complex, and high-dimensional, genetic algorithms offer a promising approach to optimization problems such as feature selection, parameter tuning, and model optimization. Because they explore the solution space broadly, they can identify informative features and tune model parameters, which can improve the accuracy and interpretability of genomic data analysis.

The BioGA package brings genetic algorithms to genomic data analysis, providing a suite of functions for handling high-throughput genomic data. Implemented in C++ for performance, BioGA offers algorithms for tasks such as feature selection, classification, and clustering. Because it integrates with the Bioconductor ecosystem, researchers and analysts can use genetic algorithms within their existing genomics workflows to extract biological insights from large-scale genomic data sets.


Getting Started

In this vignette, we illustrate the usage of BioGA for genetic algorithm optimization in the context of high-throughput genomic data analysis. We showcase its interoperability with Bioconductor classes, demonstrating how genetic algorithm optimization can be integrated into existing genomics pipelines.

The worked example selects a combination of genes for predicting a trait, such as disease susceptibility, using the package's functions for population initialization, fitness evaluation, selection, crossover, mutation, and replacement.

Overview

Genomic data refers to the genetic information stored in an organism’s DNA. It includes the sequence of nucleotides (adenine, thymine, cytosine, and guanine) that make up the DNA molecules. Genomic data can provide valuable insights into various biological processes, such as gene expression, genetic variation, and evolutionary relationships.

Genomic data in this context could consist of gene expression profiles measured across different individuals (e.g., patients).

Here’s an example of genomic data:

      Sample 1   Sample 2   Sample 3   Sample 4
Gene1    0.1        0.2        0.3        0.4
Gene2    1.2        1.3        1.4        1.5
Gene3    2.3        2.2        2.1        2.0

In this example, each row represents a gene (or genomic feature), and each column represents a sample. The values in the matrix represent some measurement of gene expression, such as mRNA levels or protein abundance, in each sample.
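As a minimal sketch, the table above can be rebuilt as a base R matrix with genes as rows and samples as columns. The object name `expr` and the sample labels without spaces are assumptions made for this illustration only.

```r
# Rebuild the example table: rows are genes, columns are samples
expr <- matrix(
    c(
        0.1, 0.2, 0.3, 0.4,
        1.2, 1.3, 1.4, 1.5,
        2.3, 2.2, 2.1, 2.0
    ),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        paste0("Gene", 1:3),
        paste0("Sample", 1:4)
    )
)

# Expression of Gene2 across all samples
gene2 <- expr["Gene2", ]

# Mean expression of each gene across samples
gene_means <- rowMeans(expr)
```

Row and column names make it possible to index by gene or sample label rather than by position, which is the same layout used for the simulated count matrix later in this vignette.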

Example Scenario

Consider an example scenario of using genetic algorithm optimization to select the best combination of genes for predicting a certain trait, such as disease susceptibility.

# Load necessary packages
library(BioGA)
library(SummarizedExperiment)

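# Define dimensions of the simulated data set
num_genes <- 1000
num_samples <- 10

# Define parameters for genetic algorithm
# (only `generations` is used below; the later calls pass their own
# population size and mutation rate explicitly)
population_size <- 100
generations <- 20
mutation_rate <- 0.1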

# Generate example genomic data using SummarizedExperiment
counts <- matrix(rpois(num_genes * num_samples, lambda = 10),
    nrow = num_genes
)
rownames(counts) <- paste0("Gene", 1:num_genes)
colnames(counts) <- paste0("Sample", 1:num_samples)

# Create SummarizedExperiment object
se <- SummarizedExperiment(assays = list(counts = counts))

# Extract the count matrix from the SummarizedExperiment for use with BioGA
genomic_data <- assay(se)

In this example, counts is a matrix of simulated gene expression counts across samples. Each row corresponds to a gene, and each column corresponds to a sample. We store the data in a SummarizedExperiment object, a common Bioconductor class for representing rectangular feature x sample data, such as RNA-seq count matrices or microarray data.
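Because the counts are drawn at random with rpois(), they differ from run to run. One way to make the simulated data reproducible is to fix the random seed before generating it. The sketch below uses base R only; the seed value 42 and the names `counts_a` and `counts_b` are arbitrary choices for illustration, and the output shown in this vignette was not produced with this seed.

```r
num_genes <- 1000
num_samples <- 10

# Fix the seed, then simulate Poisson counts (genes x samples)
set.seed(42) # arbitrary seed, chosen only for illustration
counts_a <- matrix(rpois(num_genes * num_samples, lambda = 10),
    nrow = num_genes
)

# Resetting the same seed reproduces exactly the same matrix
set.seed(42)
counts_b <- matrix(rpois(num_genes * num_samples, lambda = 10),
    nrow = num_genes
)

same_data <- identical(counts_a, counts_b)
```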

head(genomic_data)
#>       Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
#> Gene1      12       8      13       9       7      15      10       6       9
#> Gene2       5      13       7      11       7       7      10      11      11
#> Gene3      13       6       9       5       2      12       9      14       8
#> Gene4      15       9       6      13       9      15       5       7      10
#> Gene5      16      20      10      11      14       7       6       6      12
#> Gene6      10       8       7      13       8      10       8       7      10
#>       Sample10
#> Gene1       11
#> Gene2        7
#> Gene3        6
#> Gene4       12
#> Gene5        6
#> Gene6        9

Initialization

# Initialize a population of 5 individuals
# (note: not the population_size of 100 defined earlier)
population <- BioGA::initialize_population_cpp(genomic_data,
    population_size = 5
)

The population represents a set of candidate combinations of genes that could be predictive of the trait. Each individual in the population is represented by a binary vector indicating the presence or absence of each gene. For example, an individual in the population might be represented as [1, 0, 1], indicating the presence of Gene1 and Gene3 but the absence of Gene2. The population undergoes genetic algorithm operations such as selection, crossover, mutation, and replacement to evolve towards individuals with higher predictive power for the trait.
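The binary encoding described above can be sketched in base R as follows. This only illustrates the presence/absence representation from the text; it is not a claim about what initialize_population_cpp() returns, and the names `expr`, `individual`, and `population_masks` are hypothetical.

```r
# Hypothetical toy expression matrix: 3 genes x 4 samples
expr <- matrix(1:12,
    nrow = 3,
    dimnames = list(paste0("Gene", 1:3), paste0("Sample", 1:4))
)

# One individual: Gene1 and Gene3 present, Gene2 absent
individual <- c(1, 0, 1)

# Use the binary vector as a mask to keep only the selected genes
selected <- expr[individual == 1, , drop = FALSE]
selected_genes <- rownames(selected)

# A random population of 5 such individuals (one row per individual)
set.seed(1) # arbitrary seed for reproducibility
population_masks <- matrix(rbinom(5 * nrow(expr), size = 1, prob = 0.5),
    nrow = 5
)
```

Using `drop = FALSE` keeps the result a matrix even when a mask selects a single gene, so downstream code does not need a special case.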

Genetic Algorithm Optimization

# Initialize fitness history
fitness_history <- list()

# Initialize time progress
start_time <- Sys.time()

# Run genetic algorithm optimization
generation <- 0
while (TRUE) {
    generation <- generation + 1

    # Evaluate fitness
    fitness <- BioGA::evaluate_fitness_cpp(genomic_data, population)
    fitness_history[[generation]] <- fitness

    # Check termination condition
    if (generation == generations) { # defined number of generations
        break
    }

    # Selection
    selected_parents <- BioGA::selection_cpp(population,
        fitness,
        num_parents = 2
    )

    # Crossover and Mutation
    offspring <- BioGA::crossover_cpp(selected_parents, offspring_size = 2)
    # (no mutation in this example)
    mutated_offspring <- BioGA::mutation_cpp(offspring, mutation_rate = 0)

    # Replacement
    population <- BioGA::replacement_cpp(population, mutated_offspring,
        num_to_replace = 1
    )

    # Calculate time progress
    elapsed_time <- difftime(Sys.time(), start_time, units = "secs")

    # Print time progress
    cat(
        "\rGeneration:", generation, "- Elapsed Time:",
        format(elapsed_time, units = "secs"), "     "
    )
}
#> Generation: 1 - Elapsed Time: 0.01662779 secs
#> Generation: 2 - Elapsed Time: 0.02079034 secs
#> Generation: 3 - Elapsed Time: 0.02122736 secs
#> Generation: 4 - Elapsed Time: 0.02160215 secs
#> Generation: 5 - Elapsed Time: 0.02198648 secs
#> Generation: 6 - Elapsed Time: 0.02236009 secs
#> Generation: 7 - Elapsed Time: 0.02273679 secs
#> Generation: 8 - Elapsed Time: 0.02312016 secs
#> Generation: 9 - Elapsed Time: 0.02349114 secs
#> Generation: 10 - Elapsed Time: 0.02387404 secs
#> Generation: 11 - Elapsed Time: 0.02424622 secs
#> Generation: 12 - Elapsed Time: 0.0596683 secs
#> Generation: 13 - Elapsed Time: 0.06014085 secs
#> Generation: 14 - Elapsed Time: 0.06048226 secs
#> Generation: 15 - Elapsed Time: 0.06115675 secs
#> Generation: 16 - Elapsed Time: 0.06146169 secs
#> Generation: 17 - Elapsed Time: 0.06178665 secs
#> Generation: 18 - Elapsed Time: 0.06209946 secs
#> Generation: 19 - Elapsed Time: 0.06238627 secs
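The loop above delegates each step to BioGA's C++ functions. For intuition, the same four steps can be sketched in base R. Everything below is an assumption made for illustration: the function names are hypothetical, the operators (truncation selection, single-point crossover, Gaussian mutation, replace-the-worst) are common textbook choices rather than a description of BioGA's internals, and the toy fitness measures squared distance to a made-up target vector.

```r
# Illustrative base-R operators; none of these functions are part of BioGA.

# Truncation selection: keep the individuals with the lowest fitness
select_parents <- function(population, fitness, num_parents) {
    population[order(fitness)[seq_len(num_parents)], , drop = FALSE]
}

# Single-point crossover between two randomly ordered parents
crossover <- function(parents, offspring_size) {
    n_genes <- ncol(parents)
    offspring <- matrix(0, nrow = offspring_size, ncol = n_genes)
    for (i in seq_len(offspring_size)) {
        pair <- sample(nrow(parents), 2)
        point <- sample(n_genes - 1, 1)
        offspring[i, ] <- c(
            parents[pair[1], seq_len(point)],
            parents[pair[2], (point + 1):n_genes]
        )
    }
    offspring
}

# Gaussian mutation: perturb each value with probability mutation_rate
mutate <- function(offspring, mutation_rate) {
    hits <- matrix(runif(length(offspring)) < mutation_rate,
        nrow = nrow(offspring)
    )
    offspring[hits] <- offspring[hits] + rnorm(sum(hits))
    offspring
}

# Replacement: overwrite the individuals with the highest fitness
replace_worst <- function(population, fitness, offspring) {
    worst <- order(fitness, decreasing = TRUE)[seq_len(nrow(offspring))]
    population[worst, ] <- offspring
    population
}

# Toy run: 5 individuals, 8 genes, fitness = squared distance to a target
set.seed(123) # arbitrary seed for reproducibility
population <- matrix(rpois(5 * 8, lambda = 10), nrow = 5)
target <- rep(10, 8)
fitness_fn <- function(pop) rowSums(sweep(pop, 2, target)^2)

best <- numeric(20)
for (generation in seq_len(20)) {
    fitness <- fitness_fn(population)
    best[generation] <- min(fitness)
    parents <- select_parents(population, fitness, num_parents = 2)
    children <- mutate(crossover(parents, offspring_size = 2),
        mutation_rate = 0.1
    )
    population <- replace_worst(population, fitness, children)
}
```

Because only the two worst individuals are overwritten each generation, the best fitness in this sketch can never get worse from one generation to the next.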

Fitness Calculation

The fitness function used in the code above measures the dissimilarity between the gene expression profile of each individual in the population and the genomic data. This dissimilarity score, or “fitness”, quantifies how closely an individual's gene expression profile matches the genomic data; a lower score indicates a closer match.

Mathematically, the fitness calculation can be represented as follows:

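Let:

\(g_{ijk}\) be the expression level of gene \(j\) in sample \(k\) of the genomic data, as compared against individual \(i\);
\(p_{ij}\) be the expression level of gene \(j\) in individual \(i\) of the population;
\(G\) be the number of genes;
\(S\) be the number of samples.

Then, the fitness \(F_i\) for individual \(i\) in the population can be calculated as the sum of squared differences between the gene expression levels of individual \(i\) and the corresponding gene expression levels in the genomic data, across all genes and samples:

\[ F_i = \sum_{j=1}^{G} \sum_{k=1}^{S} (g_{ijk} - p_{ij})^2 \]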

The genetic algorithm aims to minimize this dissimilarity between the gene expression profiles of individuals in the population and the genomic data. Individuals with lower fitness scores have gene expression profiles that are more similar to the genomic data and are therefore more likely to be selected for further optimization.
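The sum of squared differences can be written in a few lines of base R. This is a sketch of the formula as stated in the text, not the source of evaluate_fitness_cpp(); the function name `sum_squared_fitness` and the toy inputs are hypothetical.

```r
# Hypothetical toy inputs
# genomic_data: 3 genes x 4 samples
genomic_data <- matrix(
    c(
        1, 2, 3, 4,
        2, 2, 2, 2,
        0, 1, 0, 1
    ),
    nrow = 3, byrow = TRUE
)
# population: 2 individuals x 3 genes
population <- matrix(
    c(
        2, 2, 0,
        1, 2, 1
    ),
    nrow = 2, byrow = TRUE
)

# For each individual (row p), sum (data[j, k] - p[j])^2 over genes and samples.
# Subtracting the length-G vector p from the G x S matrix recycles p down
# each column, so p[j] is subtracted from every sample of gene j.
sum_squared_fitness <- function(genomic_data, population) {
    stopifnot(ncol(population) == nrow(genomic_data))
    apply(population, 1, function(p) sum((genomic_data - p)^2))
}

fitness <- sum_squared_fitness(genomic_data, population)
fitness
#> [1]  8 16
```

The first individual has the lower score, so under the minimization described above it would be the preferred candidate.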

# Plot fitness change over generations
BioGA::plot_fitness_history(fitness_history)
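The same history can also be summarized by hand with base R, for example to track the best and mean fitness per generation. The sketch below assumes, as in the loop above, that the history is a list with one numeric vector of fitness values per generation; the values themselves are made up for illustration.

```r
# Hypothetical history: one vector of fitness values per generation
fitness_history <- list(
    c(30, 25, 40),
    c(28, 25, 22),
    c(20, 25, 22)
)

# Best (lowest) and mean fitness in each generation
best_per_generation <- vapply(fitness_history, min, numeric(1))
mean_per_generation <- vapply(fitness_history, mean, numeric(1))

# Plot both summaries against the generation index
generations_seen <- seq_along(fitness_history)
plot(generations_seen, best_per_generation,
    type = "b",
    ylim = range(c(best_per_generation, mean_per_generation)),
    xlab = "Generation", ylab = "Fitness (lower is better)"
)
lines(generations_seen, mean_per_generation, type = "b", lty = 2)
legend("topright", legend = c("Best", "Mean"), lty = c(1, 2))
```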

This vignette demonstrates how genetic algorithm optimization can be applied to select the best combination of genes for predicting a certain trait using the BioGA package. It showcases the integration of genetic algorithms with genomic data analysis and highlights the potential of genetic algorithms for feature selection in genomics.

Session Info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R Under development (unstable) (2024-03-18 r86148)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-04-02
#>  pandoc   2.7.3 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version     date (UTC) lib source
#>  abind                  1.4-5       2016-07-21 [3] CRAN (R 4.4.0)
#>  animation              2.7         2021-10-07 [3] CRAN (R 4.4.0)
#>  Biobase              * 2.63.1      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  BiocGenerics         * 0.49.1      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  BiocManager            1.30.22     2023-08-08 [2] CRAN (R 4.4.0)
#>  biocViews              1.71.1      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  BioGA                * 0.99.1      2024-04-02 [1] Bioconductor
#>  bitops                 1.0-7       2021-04-24 [3] CRAN (R 4.4.0)
#>  bslib                  0.7.0       2024-03-29 [3] CRAN (R 4.4.0)
#>  cachem                 1.0.8       2023-05-01 [3] CRAN (R 4.4.0)
#>  cli                    3.6.2       2023-12-11 [3] CRAN (R 4.4.0)
#>  colorspace             2.1-0       2023-01-23 [3] CRAN (R 4.4.0)
#>  crayon                 1.5.2       2022-09-29 [3] CRAN (R 4.4.0)
#>  DelayedArray           0.29.9      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  digest                 0.6.35      2024-03-11 [3] CRAN (R 4.4.0)
#>  dplyr                  1.1.4       2023-11-17 [3] CRAN (R 4.4.0)
#>  evaluate               0.23        2023-11-01 [3] CRAN (R 4.4.0)
#>  fansi                  1.0.6       2023-12-08 [3] CRAN (R 4.4.0)
#>  farver                 2.1.1       2022-07-06 [3] CRAN (R 4.4.0)
#>  fastmap                1.1.1       2023-02-24 [3] CRAN (R 4.4.0)
#>  generics               0.1.3       2022-07-05 [3] CRAN (R 4.4.0)
#>  GenomeInfoDb         * 1.39.10     2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  GenomeInfoDbData       1.2.12      2024-03-27 [3] Bioconductor
#>  GenomicRanges        * 1.55.4      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  ggplot2                3.5.0       2024-02-23 [3] CRAN (R 4.4.0)
#>  glue                   1.7.0       2024-01-09 [3] CRAN (R 4.4.0)
#>  graph                  1.81.0      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  gtable                 0.3.4       2023-08-21 [3] CRAN (R 4.4.0)
#>  highr                  0.10        2022-12-22 [3] CRAN (R 4.4.0)
#>  htmltools              0.5.8       2024-03-25 [3] CRAN (R 4.4.0)
#>  IRanges              * 2.37.1      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  jquerylib              0.1.4       2021-04-26 [3] CRAN (R 4.4.0)
#>  jsonlite               1.8.8       2023-12-04 [3] CRAN (R 4.4.0)
#>  knitr                  1.45        2023-10-30 [3] CRAN (R 4.4.0)
#>  labeling               0.4.3       2023-08-29 [3] CRAN (R 4.4.0)
#>  lattice                0.22-6      2024-03-20 [4] CRAN (R 4.4.0)
#>  lifecycle              1.0.4       2023-11-07 [3] CRAN (R 4.4.0)
#>  magrittr               2.0.3       2022-03-30 [3] CRAN (R 4.4.0)
#>  Matrix                 1.7-0       2024-03-22 [4] CRAN (R 4.4.0)
#>  MatrixGenerics       * 1.15.0      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  matrixStats          * 1.2.0       2023-12-11 [3] CRAN (R 4.4.0)
#>  munsell                0.5.0       2018-06-12 [3] CRAN (R 4.4.0)
#>  pillar                 1.9.0       2023-03-22 [3] CRAN (R 4.4.0)
#>  pkgconfig              2.0.3       2019-09-22 [3] CRAN (R 4.4.0)
#>  R6                     2.5.1       2021-08-19 [3] CRAN (R 4.4.0)
#>  RBGL                   1.79.0      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  Rcpp                   1.0.12      2024-01-09 [3] CRAN (R 4.4.0)
#>  RCurl                  1.98-1.14   2024-01-09 [3] CRAN (R 4.4.0)
#>  rlang                  1.1.3       2024-01-10 [3] CRAN (R 4.4.0)
#>  rmarkdown              2.26        2024-03-05 [3] CRAN (R 4.4.0)
#>  RUnit                  0.4.33      2024-02-22 [3] CRAN (R 4.4.0)
#>  S4Arrays               1.3.6       2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  S4Vectors            * 0.41.5      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  sass                   0.4.9       2024-03-15 [3] CRAN (R 4.4.0)
#>  scales                 1.3.0       2023-11-28 [3] CRAN (R 4.4.0)
#>  sessioninfo            1.2.2       2021-12-06 [3] CRAN (R 4.4.0)
#>  SparseArray            1.3.4       2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  SummarizedExperiment * 1.33.3      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  tibble                 3.2.1       2023-03-20 [3] CRAN (R 4.4.0)
#>  tidyselect             1.2.1       2024-03-11 [3] CRAN (R 4.4.0)
#>  utf8                   1.2.4       2023-10-22 [3] CRAN (R 4.4.0)
#>  vctrs                  0.6.5       2023-12-01 [3] CRAN (R 4.4.0)
#>  withr                  3.0.0       2024-01-16 [3] CRAN (R 4.4.0)
#>  xfun                   0.43        2024-03-25 [3] CRAN (R 4.4.0)
#>  XML                    3.99-0.16.1 2024-01-22 [3] CRAN (R 4.4.0)
#>  XVector                0.43.1      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#>  yaml                   2.3.8       2023-12-11 [2] CRAN (R 4.4.0)
#>  zlibbioc               1.49.3      2024-04-01 [3] Bioconductor 3.19 (R 4.4.0)
#> 
#>  [1] /tmp/RtmpRRWYsO/Rinst2e6cb660386bb
#>  [2] /home/pkgbuild/packagebuilder/workers/jobs/3315/R-libs
#>  [3] /home/biocbuild/bbs-3.19-bioc/R/site-library
#>  [4] /home/biocbuild/bbs-3.19-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────