4  Algorithms of Enrichment Analysis

Functional enrichment analysis is a staple in bioinformatics for interpreting lists of genes identified from omics experiments. Behind the scenes of our ecosystem (clusterProfiler, DOSE, etc.), the heavy lifting is powered by the enrichit package.

We built enrichit with a clear philosophy: minimal dependencies, maximum performance, and elegant extensibility. It provides extremely fast, C++-based implementations of classical methods (via Rcpp), while introducing modern, advanced algorithms to address long-standing blind spots in the field.

In this chapter, we will walk through the algorithmic backbone of enrichment analysis, moving from the standard ORA and GSEA to our newly introduced Network-based, multi-omics, and Weighted enrichment frameworks.

At the package architecture level, enrichit serves as the algorithm engine, while clusterProfiler provides the knowledge-base-aware high-level entry points. In practice, this means that topology-aware workflows are now available through wrappers such as nseGO(), mnseGO(), nseKEGG(), and mnseKEGG(), while multi-omics helpers such as aggregate_omics() and aggregate_enrichment() are also re-exported by clusterProfiler for downstream biological interpretation.

4.1 Over-Representation Analysis (ORA)

Over Representation Analysis (ORA) (Boyle et al. 2004) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

enrichit implements ORA using the hypergeometric distribution (one-sided Fisher’s exact test). The p-value is calculated as the probability of observing at least \(k\) genes from the specific gene set in the selected list of \(n\) genes, given a background population (universe) of \(N\) genes containing \(M\) genes from that set.

\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} \]

Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. From these, 2641 are annotated to gene set of interest. Among the differentially expressed genes, 28 belong to the gene set1.

d <- data.frame(non_DE_gene=c(2613, 15310), DE_gene=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d
                non_DE_gene DE_gene
In_category            2613      28
not_in_category       15310      29

Whether the gene set of interest is significantly over-represented in the differentially expressed genes can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test.

fisher.test(d, alternative = "greater")

    Fisher's Exact Test for Count Data

data:  d
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.110242      Inf
sample estimates:
odds ratio 
 0.1767937 

While mathematically sound, this classic model makes a naive assumption: it assumes that every gene has an equal probability of being detected. We will see how to fix this systematic bias later in the Weighted Enrichment Analysis section.

4.2 Gene Set Enrichment Analysis (GSEA)

Unlike ORA, which requires an arbitrary threshold to define “significant” genes and ignores the rest, GSEA uses the entire ranked list of genes. It calculates an Enrichment Score (ES) that reflects the degree to which a gene set is over-represented at the top or bottom of a ranked list.

There are three key elements of the GSEA method:

  • Calculation of an Enrichment Score.
    • The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not encountered. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov(KS)-like statistic (Subramanian et al. 2005).
  • Estimation of Significance Level of ES.
    • The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permutated data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
  • Adjustment for Multiple Hypothesis Testing.
    • When the entire gene sets are evaluated, the estimated significance level is adjusted to account for multiple hypothesis testing and also q-values are calculated for FDR control.

We implemented the GSEA algorithm proposed by Subramanian (Subramanian et al. 2005). enrichit offers a fast C++ implementation of GSEA. The default method (method = "multilevel") uses an adaptive multi-level splitting Monte Carlo approach (derived from fgsea (Korotkevich et al. 2019)) to estimate low p-values efficiently with high accuracy.

library(enrichit)

# Generate synthetic ranked gene list
set.seed(42)
geneList <- sort(rnorm(1000), decreasing = TRUE)
names(geneList) <- paste0("Gene", 1:1000)

gene_sets <- list(
  PathwayTop = names(geneList)[1:50],
  PathwayBottom = names(geneList)[951:1000]
)

# Run GSEA using the multilevel method
gsea_result <- gsea(
  geneList = geneList,
  gene_sets = gene_sets,
  method = "multilevel"
)

4.2.1 Leading edge analysis and core enriched genes

Leading edge analysis reports Tags to indicate the percentage of genes contributing to the enrichment score, List to indicate where in the list the enrichment score is attained and Signal for enrichment signal strength.

It would also be very interesting to get the core enriched genes that contribute to the enrichment. Our packages (clusterProfiler, DOSE, meshes and ReactomePA) support leading edge analysis and report core enriched genes in GSEA analysis.

4.3 Network-based Set Enrichment Analysis (NSEA)

Traditional enrichment methods treat pathways as independent “bags of genes,” completely ignoring the biological interactions between them. If you only look at the overlap, you are wasting a massive amount of network topological information.

Some previous methods (like EnrichNet) attempted to solve this by calculating network distances. However, they suffered from a fatal flaw: they only provided a relative score without rigorous p-values, and calculating empirical p-values via network permutation was excruciatingly slow. Life is too short to wait for permutations.

To break this bottleneck, we introduced Network-based Set Enrichment Analysis (NSEA), a true Network-based GSEA.

Our core philosophy is to leverage existing strengths seamlessly:

  1. Propagation: We take your input gene scores and diffuse them across a biological network using a Random Walk with Restart (RWR). To squeeze out every drop of performance, the RWR is powered by RcppEigen sparse matrix multiplication in C++. Even networks with hundreds of thousands of edges converge in a flash.
  2. Enrichment: After diffusion, every node in the network receives a new “propagation score” containing global topological context. We take this fully ranked list and feed it directly into our extreme-speed multilevel GSEA engine.

One RWR diffusion + one GSEA test. You get not only network-aware scores but also rigorous p-values and FDRs, with blazing speed.

library(clusterProfiler)

# 1. Fetch the full background PPI network for Human (taxID 9606)
# Default output is "data.frame" (edge list), which nsea() natively supports
net <- get_ppi_network(9606)

# 2. Run NSEA
# (Assuming 'stats' is a named numeric vector of non-negative evidence scores)
nsea_result <- nsea(
    geneList = stats,
    network = net,
    gene_sets = pathways
)

4.3.1 Direction-aware Propagation (Signed Mode)

Historically, RWR mathematically only accepts non-negative energy. This means you could only pass absolute statistics (like -log10(p)), and the result would only tell you if a pathway was “hit”, but not whether it was up-regulated or down-regulated.

To solve this, enrichit introduces a dual-track propagation architecture (mode = "signed"). When you pass signed statistics (like -log10(p) * sign(logFC)), the algorithm automatically splits the input into up-regulated and down-regulated vectors. It runs two independent RWRs on the same network, and then subtracts the two diffused energy vectors to reconstruct a fully signed, network-aware score vector. This seamlessly bridges bidirectional network diffusion with dual-tail GSEA.

# Using signed statistics (e.g., -log10(p) * sign(logFC))
res_signed <- nsea(
    geneList = signed_stats, 
    network = net, 
    gene_sets = pathways, 
    mode = "signed"
)

4.3.2 Multi-layer Network-based Enrichment (MNSEA)

Single-layer propagation is already useful, but real multi-omics systems are rarely confined to one graph. Transcriptomics, proteomics, and phosphoproteomics often live on related yet non-identical network layers. Forcing everything into a single collapsed graph throws away layer-specific topology before the analysis even begins.

To address this, enrichit introduces multi-layer NSEA through mnsea(). The algorithm builds a supra-graph from layer-specific networks plus explicit inter-layer couplings, performs Random Walk with Restart on the combined topology, and then collapses layer-wise diffusion scores into a unified ranked vector for GSEA. In other words, it is still “propagate first, enrich later”, but now the propagation itself respects the multi-layer structure.

# Small multi-layer example with RNA and Protein layers
networks <- list(
  RNA = data.frame(
    from = c("Gene1", "Gene2", "Gene3"),
    to = c("Gene2", "Gene3", "Gene4"),
    weight = c(1, 1, 1)
  ),
  PROT = data.frame(
    from = c("Gene1", "Gene3", "Gene4"),
    to = c("Gene3", "Gene4", "Gene1"),
    weight = c(1, 1, 1)
  )
)

couplings <- data.frame(
  from_layer = c("RNA", "RNA", "RNA"),
  from_id = c("Gene1", "Gene3", "Gene4"),
  to_layer = c("PROT", "PROT", "PROT"),
  to_id = c("Gene1", "Gene3", "Gene4"),
  weight = c(1, 1, 1)
)

seed_list <- list(
  RNA = c(Gene1 = 1.5, Gene2 = 0.8, Gene4 = -0.9),
  PROT = c(Gene1 = 0.5, Gene3 = 1.2, Gene4 = -0.4)
)

ml_gene_sets <- list(
  SignalPath = c("Gene1", "Gene2", "Gene3"),
  DownPath = c("Gene4")
)

mnsea_res <- mnsea(
  seed_list = seed_list,
  networks = networks,
  couplings = couplings,
  gene_sets = ml_gene_sets,
  mode = "signed",
  collapse = "weighted_mean",
  layer_weights = c(RNA = 1, PROT = 1.5),
  minGSSize = 1,
  maxGSSize = 10,
  method = "sample",
  nPerm = 30,
  verbose = FALSE
)

The same topology-aware engine can now be accessed from clusterProfiler using knowledge-base-aware wrappers. For example, the following high-level workflow reuses a single shared example object and performs GO- and KEGG-aware network enrichment without manually constructing the annotation backend.

library(clusterProfiler)
library(org.Hs.eg.db)

demo <- clusterprofiler_enrichit_demo()

go_nse <- nseGO(
  geneList = demo$geneList_evidence,
  network = demo$network,
  OrgDb = org.Hs.eg.db,
  keyType = "ENTREZID",
  ont = "BP",
  minGSSize = 5,
  maxGSSize = 500,
  threshold = 1e-6,
  maxIter = 50,
  verbose = FALSE,
  pvalueCutoff = 1,
  method = "sample",
  nPerm = 30
)

kegg_nse <- nseKEGG(
  geneList = demo$geneList_evidence,
  network = demo$network,
  organism = demo$kegg_gson,
  minGSSize = 5,
  maxGSSize = 500,
  threshold = 1e-6,
  maxIter = 50,
  verbose = FALSE,
  pvalueCutoff = 1,
  method = "sample",
  nPerm = 30
)

4.3.3 Explanation-ready Outputs for Multi-layer Results

Once you start doing topology-aware enrichment, the next question is obvious: which layer actually drove the hit? Instead of baking plotting logic into the engine, enrichit prepares explanation-ready tables that downstream packages such as enrichplot can consume directly.

The mnseaResult object stores precomputed pathway-level and feature-level contribution caches, and also supports extracting a pathway-specific explanation subnetwork.

# Pathway-level contribution table
pathway_tbl <- get_mnsea_contribution(
  mnsea_res,
  level = "pathway"
)

# Feature-level contribution table for one pathway
feature_tbl <- get_mnsea_contribution(
  mnsea_res,
  pathway_id = pathway_tbl$ID[1],
  level = "feature"
)

# Extract an explanation-ready subnetwork
subnet <- extract_mnsea_subnetwork(
  mnsea_res,
  pathway_id = pathway_tbl$ID[1]
)

4.4 Weighted Enrichment Analysis

Have you ever noticed a weird phenomenon? No matter what odd phenotype you study, the enriched pathways always seem to feature the same “usual suspects”—like the ribosome, spliceosome, or broad immune responses.

This is driven by a systematic bias. In real biological networks, genes are not created equal. Hub genes participate in numerous functions and are easily perturbed. If you don’t account for this, traditional ORA will be easily fooled by these highly active Hub genes, yielding false-positive significance for broad pathways.

To correct this, enrichit introduces Weighted Enrichment Analysis, supporting both ORA and GSEA.

4.4.1 Weighted ORA: De-biasing with Wallenius Distribution

How do we correct the bias in ORA? We bring in the Wallenius’ noncentral hypergeometric distribution.

By passing a weight vector (e.g., network centrality), the model calculates an Odds ratio for each pathway. If a pathway is full of high-weight core genes, the model applies a penalty to the p-value. The false significance driven by Hub genes is brought back to reality, allowing truly specific pathways to stand out.

Note: In the early days of RNA-seq, goseq used gene length as a weight to correct ORA. However, in the era of 3’ poly-A sequencing and 10x single-cell, length bias is mostly obsolete. Our implementation is universal—your weights are dictated by your scientific question, not hardcoded to gene length.

4.4.2 Weighted GSEA: Fusing Weights with Statistics

For GSEA, we seamlessly fuse your ranked statistics (like \(\log FC\)) with your external weights before passing them to the C++ engine. This amplifies the signals of core nodes while attenuating unreliable ones, without reinventing the wheel.

library(igraph)

# Fetch network as an igraph object to calculate topological weights
net_ig <- get_ppi_network(9606, output = "igraph")

# Calculate PageRank centrality as weights
pr_weights <- page_rank(net_ig)$vector

# Run Weighted ORA
res_wora <- ora(gene = de_genes, 
                gene_sets = pathways, 
                universe = all_genes, 
                weight = pr_weights)

# Run Weighted GSEA
res_wgsea <- gsea(geneList = stats, 
                  gene_sets = pathways, 
                  weight = pr_weights)

The beauty of weighted enrichment lies in its infinite possibilities. Your weight doesn’t have to be network centrality. It can be a tissue specificity score, an evolutionary conservation score, or a multi-omics confidence score. If you have prior knowledge that “these genes are more important”, simply pass them to the weight parameter and let enrichit handle the rest.

4.5 Bayesian Term Selection

Finally, for ORA results, enrichit provides a Bayesian framework (bayes_enrich()) to estimate the posterior probability of explanatory terms. This helps to select the most likely active pathways while automatically penalizing redundancy among highly overlapping GO terms.

# After obtaining an ORA result object
bayes_res <- bayes_enrich(ora_result)
summary_res <- bayes_summary(bayes_res)

With enrichit acting as the high-performance backend, tools like clusterProfiler can focus on biological interpretation and visualization, providing a seamless and powerful knowledge mining experience.

4.6 Multi-omics Early Integration

With the rapid development of sequencing technologies, it is very common to perform functional analysis on multi-omics data (e.g., Transcriptomics + Proteomics). A naive approach is to perform enrichment analysis on each omics separately and then intersect the enriched pathways. However, this late-integration strategy completely ignores the fact that different omics might provide weak but consistent signals for the same pathway, which would be filtered out by strict thresholds in single-omics analysis.

To solve this, enrichit introduces an Early Integration pipeline. The core philosophy is simple: Integrate first, enrich later. Decouple the integration layer from the enrichment layer.

4.6.1 1. Statistical Aggregation (aggregate_omics)

Instead of finding the intersection of pathways, we aggregate the evidence at the feature level. aggregate_omics() takes multi-source statistics (e.g., p-values or signed scores) and merges them into a unified evidence score. For p-values, it supports classic methods like fisher or stouffer, and internally converts them to a monotonically increasing score suitable for ranking. It also supports Brown’s method for correlation-aware p-value aggregation and weighted_mean for signed statistics with explicit layer weights.

library(enrichit)

# Create mock p-values for RNA and Protein data across 1000 genes
set.seed(123)
omics_gene_ids <- paste0("Gene", 1:1000)

rna_pvals <- runif(1000, 0, 1)
names(rna_pvals) <- omics_gene_ids

prot_pvals <- runif(1000, 0, 1)
names(prot_pvals) <- omics_gene_ids

# Inject consistent but weak signals into the first 20 genes (pathway of interest)
rna_pvals[1:20] <- runif(20, 0.01, 0.08)
prot_pvals[1:20] <- runif(20, 0.01, 0.08)

# Aggregate p-values from RNA and Protein data using Fisher's method
agg_fisher_res <- aggregate_omics(
    x = list(RNA = rna_pvals, PROT = prot_pvals),
    method = "fisher",
    input = "pvalue",
    feature_type = "gene"
)

# Look at the top aggregated scores
head(sort(agg_fisher_res$score, decreasing = TRUE))
 Gene911  Gene254    Gene9  Gene478   Gene74   Gene20 
2.712320 2.551594 2.446322 2.394249 2.366397 2.273862 

4.6.2 1a. Correlation-aware Aggregation with Brown’s Method

Fisher’s method is simple and powerful, but it silently assumes that the evidences are independent. Real multi-omics layers are often correlated: RNA and protein do not live on separate planets. If you pretend they are independent, you are inviting double counting.

Brown’s method fixes this by adjusting the degrees of freedom of Fisher’s statistic using an empirical or user-supplied covariance structure. In enrichit, this is exposed directly through aggregate_omics(method = "brown").

set.seed(123)
omics_pmat <- cbind(RNA = rna_pvals[1:80], PROT = prot_pvals[1:80])
rownames(omics_pmat) <- omics_gene_ids[1:80]

cov_mat <- matrix(
  c(4, 1.2,
    1.2, 4),
  nrow = 2
)

agg_brown_res <- aggregate_omics(
  x = omics_pmat,
  method = "brown",
  input = "pvalue",
  cov_matrix = cov_mat
)

head(sort(agg_brown_res$score, decreasing = TRUE))
   Gene9   Gene74   Gene20    Gene8    Gene5    Gene2 
2.102805 2.038788 1.964581 1.935310 1.897638 1.887833 

4.6.3 2. ID Harmonization (harmonize_ids)

Proteomics data often use UniProt IDs or protein groups, while your pathway annotations are usually gene-based. Directly feeding protein-level data to gene set enrichment will cause a huge ID mismatch.

harmonize_ids() acts as a dedicated ID mapping layer. It maps the protein-level aggregated results to the gene-level, and handles many-to-one mappings gracefully using folding strategies like min_p or max_abs.

# Suppose our proteomics data used UniProt IDs (e.g., Q9Y6N7)
# and we need to map them back to standard Gene Symbols
id_mapping_df <- data.frame(
    source_id = c("P12345", "P12346", "Q9Y6N7"),
    target_id = c("Gene1", "Gene1", "Gene2") # P12345 and P12346 both map to Gene1
)

# For demonstration, we create a mock protein-level aggregated result
protein_agg_res <- aggregate_omics(
    x = list(PROT = prot_pvals[1:3]),
    method = "fisher",
    input = "pvalue",
    feature_type = "protein"
)
names(protein_agg_res$score) <- c("P12345", "P12346", "Q9Y6N7")
protein_agg_res$feature_id <- c("P12345", "P12346", "Q9Y6N7")

# Map and collapse protein-level features to gene-level
harmonized_res <- harmonize_ids(
    x = protein_agg_res, # Assuming this is an aggregated result from proteomics
    mapping = id_mapping_df,
    from = "protein",
    to = "gene",
    collapse = "min_p" # If multiple proteins map to one gene, take the most significant one
)

4.6.4 3. Handling Directional Conflicts (conflict_policy)

When integrating signed statistics across omics (e.g., transcriptomics and proteomics), you often encounter conflicting directions—a gene might be significantly up-regulated in RNA but down-regulated in Protein.

Instead of naively averaging them out to zero, aggregate_omics() introduces a conflict_policy parameter when input = "signed_score". You can choose "strict" to completely drop conflicting features, or "penalty" to halve their aggregated scores. This actively filters out biological noise and reduces false positives before enrichment.

# Example with conflicting directions between RNA and PROT
rna_signed <- -log10(rna_pvals) * rep(c(1, -1), length.out = length(rna_pvals))
prot_signed <- -log10(prot_pvals) * rep(c(1, 1, -1, -1), length.out = length(prot_pvals))
names(rna_signed) <- omics_gene_ids
names(prot_signed) <- omics_gene_ids

agg_strict <- aggregate_omics(
    x = list(RNA = rna_signed, PROT = prot_signed),
    input = "signed_score",
    method = "mean",
    conflict_policy = "strict" # Drops conflicting genes automatically
)

4.6.5 4. Distribution to Enrichment Backends

Once the multi-omics data is aggregated and harmonized, you obtain a unified object containing a ranking score.

For GSEA or NSEA, you can directly extract the sorted score vector and feed it to the engine:

omics_pathways <- list(TargetPathway = omics_gene_ids[1:20])

geneList <- sort(agg_fisher_res$score, decreasing = TRUE)

gsea_res <- gsea(
    geneList = geneList,
    gene_sets = omics_pathways
)

For ORA, which requires discrete gene lists, enrichit provides an adapter function select_features_for_ora(). It takes the aggregated object and a cutoff, and returns the selected gene list and background universe, perfectly bridging the gap between continuous multi-omics scores and traditional hypergeometric tests.

# Define a mock pathway containing the 20 genes we injected signal into
selected_ora_inputs <- select_features_for_ora(
    x = agg_fisher_res, 
    cutoff = 0.05, 
    by = "pvalue"
)

# Only genes with significant aggregated evidence will be selected
early_ora_res <- ora(
    gene = selected_ora_inputs$gene,
    universe = selected_ora_inputs$universe,
    gene_sets = omics_pathways
)

By decoupling ID mapping, statistical integration, and pathway enrichment, enrichit provides a clean, robust, and highly extensible pipeline for multi-omics functional analysis.

These workflow helpers are also available from clusterProfiler, which means the aggregated result can be sent directly to high-level functions such as enrichGO(), gseGO(), or nseGO() without changing the underlying multi-omics logic.

4.6.6 5. Late Fusion at the Pathway Level

Early fusion is elegant when your features align cleanly. But sometimes that assumption collapses immediately: transcriptomics might have a background of 20,000 genes, proteomics might only detect 4,000 proteins, and region-based omics may not even live in gene space before annotation. In these cases, forcing everything into one feature matrix is a great way to manufacture missing values and mapping noise.

This is where Late Fusion comes in. Each omics is enriched independently using its own background and preferred backend; then aggregate_enrichment() merges the pathway-level p-values across results. The mathematics is the same family as feature-level aggregation, but the biology is much cleaner because each layer is allowed to speak in its native space first.

late_fusion_pathways <- list(
  SharedPath = omics_gene_ids[1:8],
  RNAPath = omics_gene_ids[9:16],
  ProtPath = omics_gene_ids[17:24]
)

rna_sig <- c(omics_gene_ids[1:6], omics_gene_ids[9:13])
prot_sig <- c(omics_gene_ids[c(1:4, 7:8)], omics_gene_ids[17:21])

rna_ora_res <- ora(
  gene = rna_sig,
  universe = omics_gene_ids,
  gene_sets = late_fusion_pathways
)

prot_ora_res <- ora(
  gene = prot_sig,
  universe = omics_gene_ids[c(1:60)],
  gene_sets = late_fusion_pathways
)

late_fusion_res <- aggregate_enrichment(
  res_list = list(RNA = rna_ora_res, PROT = prot_ora_res),
  method = "brown"
)

as.data.frame(late_fusion_res)[, c("ID", "pvalue", "p.adjust", "Count")]
                   ID     pvalue   p.adjust Count
SharedPath SharedPath 0.01646848 0.03719786     8
RNAPath       RNAPath 0.02479857 0.03719786     5
ProtPath     ProtPath 0.04647113 0.04647113     5

4.6.7 6. Contribution Tracing

After running a multi-omics integration, you may want to know whether a significant pathway was driven by the transcriptome, the proteome, or their combination. enrichit provides tools to trace the contribution back to the original omics data.

First, you can classify the pattern of each pathway by comparing the merged results with single-omics results:

# Run ORA for individual omics
rna_ora_res <- ora(
    gene = omics_gene_ids[rna_pvals < 0.05],
    universe = omics_gene_ids,
    gene_sets = omics_pathways
)
prot_ora_res <- ora(
    gene = omics_gene_ids[prot_pvals < 0.05],
    universe = omics_gene_ids,
    gene_sets = omics_pathways
)

# Classify contribution patterns
classified_ora_res <- classify_omics_pattern(
    merged_res = early_ora_res,
    single_res = list(RNA = rna_ora_res, PROT = prot_ora_res),
    p_cutoff = 0.05
)

head(as.data.frame(classified_ora_res)[, c("ID", "pvalue", "Omics_Pattern")])

The Omics_Pattern column will label the pathway as Shared, RNA-Specific, PROT-Specific, or Enhanced (1+1>2).

Then, you can extract the gene-level contribution for a specific pathway to see the original statistics of the core enrichment genes:

# Get original multi-omics statistics for genes in the pathway
contribution_df <- get_omics_contribution(
    res = classified_ora_res, 
    agg = agg_fisher_res, 
    pathway_id = "TargetPathway"
)

head(contribution_df, 3)

This data frame is extremely useful for downstream visualization, such as drawing multi-omics heatmaps for the core enriched genes.

References

Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, et al. 2004. GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18): 3710–15. https://doi.org/10.1093/bioinformatics/bth456.
Korotkevich, Gennady, Vladimir Sukhov, and Alexey Sergushichev. 2019. “Fast Gene Set Enrichment Analysis.” bioRxiv, October 22, 060012. https://doi.org/10.1101/060012.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

  1. example adopted from https://guangchuangyu.github.io/2012/04/enrichment-analysis/↩︎