ego2 <- bayes_enrich(ego, candidate = "top", n_terms = 200)
ego3 <- bayes_enrich(ego, candidate = "significant")
ego4 <- bayes_enrich(ego, candidate = "all")
ego5 <- bayes_enrich(ego, candidate = c("GO:0006955", "GO:0002376"))16 Bayesian term selection for enrichment results
Traditional over-representation analysis (ORA) often returns a long list of significant terms. Because each term is tested independently, groups of overlapping GO terms or pathways may all be reported as significant even when they describe the same underlying biological program.
bayes_enrich() adds a Bayesian term-selection layer on top of ORA results. Instead of asking only which terms are individually enriched, it asks which terms are most likely to represent the active biological programs that jointly explain the observed gene list. This procedure does not replace the ORA p-value and does not convert p.adjust into a posterior probability. It reuses the information already stored in an enrichResult object and builds a model based on term-gene coverage.
16.1 Input requirements
bayes_enrich() accepts an enrichResult object and extracts the information it needs directly from that object. In particular, it reuses the observed genes, the background universe, the term-to-gene mapping, and the enrichment table. Users therefore do not need to provide gene or universe again.
Internally, this means the Bayesian layer starts from the same enrichment result that users already trust, but asks a different question: among the candidate enriched terms, which ones provide the best joint explanation for the observed genes?
16.2 Candidate term space
The candidate argument defines the term space used by Bayesian inference. It is not a posterior filter. Instead, it determines which terms are allowed to compete for active-term selection.
candidate = "top"sortsx@resultbybyand takes the firstn_termsrows. By default,by = "p.adjust", so the Bayesian model starts from the most significant ORA terms.candidate = "significant"usesas.data.frame(x), which respects the cutoffs stored in the enrichment object and therefore works on the significant terms that users usually inspect.candidate = "all"uses the fullx@resultslot.- Users can also provide a character vector of term IDs to define a custom candidate set.
Using a restricted candidate space is often helpful in practice because it keeps the model focused, runs faster, and reduces interference from weakly related terms.
16.3 The model
Let \(G\) be the background gene universe and \(T\) the candidate term set. For each gene \(g \in G\), let \(y_g\) denote whether the gene appears in the observed gene list. For each candidate term \(t \in T\), let \(z_t\) denote whether the term is active, and let \(A_t\) be the set of genes covered by that term.
Each candidate term has a latent binary state:
\[ z_t = \begin{cases} 1, & \text{term } t \text{ is active} \\ 0, & \text{term } t \text{ is inactive} \end{cases} \]
The prior is
\[ P(z_t = 1) = \text{prior}, \qquad P(z_t = 0) = 1 - \text{prior}. \]
A gene is considered covered if it belongs to at least one active term:
\[ \mathrm{covered}_g = \mathbb{I}\left(\exists t \in T \text{ such that } z_t = 1 \text{ and } g \in A_t \right). \]
If a gene is covered, the model assumes that it is observed with probability \(1 - \text{false\_negative}\). If it is not covered, it may still appear in the observed gene list with probability false_positive:
\[ P(y_g = 1 \mid \mathrm{covered}_g = 1) = 1 - \text{false\_negative} \]
\[ P(y_g = 1 \mid \mathrm{covered}_g = 0) = \text{false\_positive}. \]
This leads to the posterior quantity of interest:
\[ P(z_t = 1 \mid \text{observed genes, candidate terms, term-gene mapping}), \]
which represents the posterior support that term \(t\) is an active explanatory term under the current candidate space and model parameters.
16.4 MCMC inference
bayes_enrich() uses a lightweight Metropolis-Hastings sampler. In each iteration, one candidate term is selected at random, its active state is flipped, and only the genes covered by that term are updated. The acceptance probability is
\[ \min\left(1, \exp\left[\log p(\text{proposal}) - \log p(\text{current})\right]\right). \]
After burn-in, the sampler records active states every thin iterations. The posterior probability of a term is then estimated as the proportion of saved samples in which that term is active. Because the implementation updates only the genes affected by the flipped term, the computational cost of one iteration is driven mainly by the size of that term rather than the size of the whole universe.
16.5 Why this helps
ORA tests each term independently. In ontologies such as GO, many significant terms overlap heavily. For example, terms such as “immune response”, “immune system process”, “lymphocyte activation”, and “T cell activation” may all be enriched because they share many of the same genes.
bayes_enrich() performs joint explanation instead of independent scoring. If one specific term already explains most of the observed genes, another highly overlapping term may no longer be necessary. Because every active term pays a prior cost, redundant terms compete with one another during MCMC sampling. Terms with high posterior support therefore tend to be the ones that explain a large portion of the observed genes while remaining worth keeping after sparsity is taken into account.
16.6 A worked example
The following example uses a smaller number of iterations than the default settings so that the book can render quickly. For interactive analysis, users can increase n_iter and burnin for more stable posterior estimates.
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichit)
data(geneList, package = "DOSE")
de <- names(geneList)[abs(geneList) > 2]
ego <- enrichGO(
gene = de,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "BP",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2
)
ego2 <- bayes_enrich(
ego,
candidate = "top",
n_terms = 100,
n_iter = 1000,
burnin = 200,
thin = 2,
seed = 1
)
bayes_summary(ego2, n = 10) ID Description
1 GO:1901970 positive regulation of mitotic sister chromatid separation
2 GO:0033045 regulation of sister chromatid segregation
3 GO:0051231 spindle elongation
4 GO:2000816 negative regulation of mitotic sister chromatid separation
5 GO:0007059 chromosome segregation
6 GO:0000022 mitotic spindle elongation
7 GO:0007052 mitotic spindle organization
8 GO:0045841 negative regulation of mitotic metaphase/anaphase transition
9 GO:0051255 spindle midzone assembly
10 GO:0045839 negative regulation of mitotic nuclear division
GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
1 7/194 17/11525 0.4117647 24.461795 12.66643 5.796310e-09
2 15/194 94/11525 0.1595745 9.479875 10.80139 4.160561e-11
3 7/194 11/11525 0.6363636 37.804592 15.97915 1.071608e-10
4 10/194 48/11525 0.2083333 12.376503 10.33436 5.435313e-09
5 33/194 310/11525 0.1064516 6.323994 12.43332 1.187757e-17
6 6/194 10/11525 0.6000000 35.644330 14.34064 4.181987e-09
7 16/194 110/11525 0.1454545 8.641050 10.53610 3.904210e-11
8 10/194 48/11525 0.2083333 12.376503 10.33436 5.435313e-09
9 6/194 10/11525 0.6000000 35.644330 14.34064 4.181987e-09
10 12/194 54/11525 0.2222222 13.201604 11.75930 6.805963e-11
p.adjust qvalue
1 2.884779e-07 8.482451e-08
2 5.383766e-09 1.583051e-09
3 1.113735e-08 3.274844e-09
4 2.758155e-07 8.110124e-08
5 4.391308e-15 1.291227e-15
6 2.352822e-07 6.918277e-08
7 5.317945e-09 1.563697e-09
8 2.758155e-07 8.110124e-08
9 2.352822e-07 6.918277e-08
10 8.006287e-09 2.354182e-09
geneID
1 55143/9787/220134/11065/991/9212/332
2 891/55143/9787/220134/7272/983/11065/1062/10403/991/9212/332/4085/10460/9319
3 55143/9055/3832/9212/332/9493/24137
4 891/55143/220134/7272/10403/991/9212/332/4085/9319
5 891/146909/55143/64151/81930/4751/9787/220134/55355/7272/983/11065/1062/9232/23397/10403/9055/3832/991/81620/9212/332/4085/7153/10460/9493/51203/24137/55839/22974/79019/3833/9319
6 55143/9055/9212/332/9493/24137
7 891/55143/9787/7272/1062/10403/9055/3832/9212/332/10460/9493/6790/24137/22974/3833
8 891/55143/220134/7272/10403/991/9212/332/4085/9319
9 55143/9055/9212/332/9493/24137
10 891/55143/220134/7272/652/10403/1111/991/9212/332/4085/9319
Count posterior posterior_odds bayes_rank bayes_active
1 7 0.5425 1.1857923 1 TRUE
2 15 0.4175 0.7167382 2 FALSE
3 7 0.4125 0.7021277 3 FALSE
4 10 0.3450 0.5267176 4 FALSE
5 33 0.2750 0.3793103 5 FALSE
6 6 0.2550 0.3422819 6 FALSE
7 16 0.2475 0.3289037 7 FALSE
8 10 0.2175 0.2779553 8 FALSE
9 6 0.2050 0.2578616 9 FALSE
10 12 0.2000 0.2500000 10 FALSE
bayes_covered_gene
1 332/991/9212/9787/11065/55143/220134
2 332/891/983/991/1062/4085/7272/9212/9319/9787/10403/10460/11065/55143/220134
3 332/3832/9055/9212/9493/24137/55143
4 332/891/991/4085/7272/9212/9319/10403/55143/220134
5 332/891/983/991/1062/3832/3833/4085/4751/7153/7272/9055/9212/9232/9319/9493/9787/10403/10460/11065/22974/23397/24137/51203/55143/55355/55839/64151/79019/81620/81930/146909/220134
6 332/9055/9212/9493/24137/55143
7 332/891/1062/3832/3833/6790/7272/9055/9212/9493/9787/10403/10460/22974/24137/55143
8 332/891/991/4085/7272/9212/9319/10403/55143/220134
9 332/9055/9212/9493/24137/55143
10 332/652/891/991/1111/4085/7272/9212/9319/10403/55143/220134
bayes_covered_count
1 7
2 15
3 7
4 10
5 33
6 6
7 16
8 10
9 6
10 12
We can also focus on terms that pass a posterior cutoff:
bayes_summary(ego2, active = TRUE) ID Description
1 GO:1901970 positive regulation of mitotic sister chromatid separation
GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
1 7/194 17/11525 0.4117647 24.4618 12.66643 5.79631e-09
p.adjust qvalue geneID Count
1 2.884779e-07 8.482451e-08 55143/9787/220134/11065/991/9212/332 7
posterior posterior_odds bayes_rank bayes_active
1 0.5425 1.185792 1 TRUE
bayes_covered_gene bayes_covered_count
1 332/991/9212/9787/11065/55143/220134 7
16.7 Output columns
bayes_enrich() returns the original enrichment object and adds several columns to the result table:
posterior, the posterior probability that a term is active.posterior_odds, calculated asposterior / (1 - posterior).bayes_rank, the rank induced by ordering terms by posterior support.bayes_active, whetherposterior >= posterior_cutoff.bayes_covered_gene, the observed genes covered by the term.bayes_covered_count, the number of covered observed genes.
These columns make it easy to keep working with the same enrichment object while gaining a second ranking criterion that emphasizes explanation rather than marginal significance.
16.8 Relationship to simplify()
simplify() and bayes_enrich() address different kinds of redundancy.
simplify()uses semantic similarity among GO terms and keeps one representative term from a redundant semantic cluster.bayes_enrich()uses term-gene coverage and asks which terms are jointly needed to explain the observed genes.
These approaches can be used together. For example, one can first compute posterior support and then use it as the selection criterion in simplify():
ego <- enrichGO(gene = de, universe = names(geneList), OrgDb = org.Hs.eg.db, ont = "BP")
ego <- bayes_enrich(ego)
ego <- simplify(ego, by = "posterior", select_fun = max)16.9 Scope and interpretation
The current implementation is designed primarily for ORA and other analyses that operate on a binary observed gene list. Ranked-list methods such as GSEA require an additional step to define which genes count as “observed”, for example by using leading-edge genes or another thresholding rule. For this reason, bayes_enrich() should be viewed as a posterior term-selection layer for ORA-style enrichment rather than a direct replacement for GSEA statistics.
posterior should be interpreted as posterior support for a term being an active explanatory program under the chosen candidate space and model parameters. It is not a Bayesian-adjusted p-value, not direct experimental proof of pathway activation, and not an absolute probability over the entire ontology.
In practice, the main value of bayes_enrich() is that it compresses a long list of overlapping enriched terms into a smaller set of high-posterior programs, which is often much easier to interpret and report.