16 Bayesian term selection for enrichment results

Traditional over-representation analysis (ORA) often returns a long list of significant terms. Because each term is tested independently, groups of overlapping GO terms or pathways may all be reported as significant even when they describe the same underlying biological program.

bayes_enrich() adds a Bayesian term-selection layer on top of ORA results. Instead of asking only which terms are individually enriched, it asks which terms are most likely to represent the active biological programs that jointly explain the observed gene list. This procedure does not replace the ORA p-value and does not convert p.adjust into a posterior probability. It reuses the information already stored in an enrichResult object and builds a model based on term-gene coverage.

16.1 Input requirements

bayes_enrich() accepts an enrichResult object and extracts the information it needs directly from that object. In particular, it reuses the observed genes, the background universe, the term-to-gene mapping, and the enrichment table. Users therefore do not need to provide gene or universe again.

Internally, this means the Bayesian layer starts from the same enrichment result that users already trust, but asks a different question: among the candidate enriched terms, which ones provide the best joint explanation for the observed genes?

16.2 Candidate term space

The candidate argument defines the term space used by Bayesian inference. It is not a posterior filter. Instead, it determines which terms are allowed to compete for active-term selection.

candidate = "top" sorts x@result by by and takes the first n_terms rows. By default, by = "p.adjust", so the Bayesian model starts from the most significant ORA terms.
candidate = "significant" uses as.data.frame(x), which respects the cutoffs stored in the enrichment object and therefore works on the significant terms that users usually inspect.
candidate = "all" uses the full x@result slot.
Users can also provide a character vector of term IDs to define a custom candidate set.

ego2 <- bayes_enrich(ego, candidate = "top", n_terms = 200)
ego3 <- bayes_enrich(ego, candidate = "significant")
ego4 <- bayes_enrich(ego, candidate = "all")
ego5 <- bayes_enrich(ego, candidate = c("GO:0006955", "GO:0002376"))

Using a restricted candidate space is often helpful in practice because it keeps the model focused, runs faster, and reduces interference from weakly related terms.

16.3 The model

Let \(G\) be the background gene universe and \(T\) the candidate term set. For each gene \(g \in G\), let \(y_g\) denote whether the gene appears in the observed gene list. For each candidate term \(t \in T\), let \(z_t\) denote whether the term is active, and let \(A_t\) be the set of genes covered by that term.

Each candidate term has a latent binary state:

\[ z_t = \begin{cases} 1, & \text{term } t \text{ is active} \\ 0, & \text{term } t \text{ is inactive} \end{cases} \]

The prior is

\[ P(z_t = 1) = \text{prior}, \qquad P(z_t = 0) = 1 - \text{prior}. \]

A gene is considered covered if it belongs to at least one active term:

\[ \mathrm{covered}_g = \mathbb{I}\left(\exists t \in T \text{ such that } z_t = 1 \text{ and } g \in A_t \right). \]

If a gene is covered, the model assumes that it is observed with probability \(1 - \text{false\_negative}\). If it is not covered, it may still appear in the observed gene list with probability false_positive:

\[ P(y_g = 1 \mid \mathrm{covered}_g = 1) = 1 - \text{false\_negative} \]

\[ P(y_g = 1 \mid \mathrm{covered}_g = 0) = \text{false\_positive}. \]

This leads to the posterior quantity of interest:

\[ P(z_t = 1 \mid \text{observed genes, candidate terms, term-gene mapping}), \]

which represents the posterior support that term \(t\) is an active explanatory term under the current candidate space and model parameters.

16.4 MCMC inference

bayes_enrich() uses a lightweight Metropolis-Hastings sampler. In each iteration, one candidate term is selected at random, its active state is flipped, and only the genes covered by that term are updated. The acceptance probability is

\[ \min\left(1, \exp\left[\log p(\text{proposal}) - \log p(\text{current})\right]\right). \]

After burn-in, the sampler records active states every thin iterations. The posterior probability of a term is then estimated as the proportion of saved samples in which that term is active. Because the implementation updates only the genes affected by the flipped term, the computational cost of one iteration is driven mainly by the size of that term rather than the size of the whole universe.

16.5 Why this helps

ORA tests each term independently. In ontologies such as GO, many significant terms overlap heavily. For example, terms such as “immune response”, “immune system process”, “lymphocyte activation”, and “T cell activation” may all be enriched because they share many of the same genes.

bayes_enrich() performs joint explanation instead of independent scoring. If one specific term already explains most of the observed genes, another highly overlapping term may no longer be necessary. Because every active term pays a prior cost, redundant terms compete with one another during MCMC sampling. Terms with high posterior support therefore tend to be the ones that explain a large portion of the observed genes while remaining worth keeping after sparsity is taken into account.

16.6 A worked example

The following example uses a smaller number of iterations than the default settings so that the book can render quickly. For interactive analysis, users can increase n_iter and burnin for more stable posterior estimates.

library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichit)

data(geneList, package = "DOSE")
de <- names(geneList)[abs(geneList) > 2]

ego <- enrichGO(
    gene = de,
    universe = names(geneList),
    OrgDb = org.Hs.eg.db,
    ont = "BP",
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.2
)

ego2 <- bayes_enrich(
    ego,
    candidate = "top",
    n_terms = 100,
    n_iter = 1000,
    burnin = 200,
    thin = 2,
    seed = 1
)

bayes_summary(ego2, n = 10)

           ID                                                  Description
1  GO:1901970   positive regulation of mitotic sister chromatid separation
2  GO:0033045                   regulation of sister chromatid segregation
3  GO:0051231                                           spindle elongation
4  GO:2000816   negative regulation of mitotic sister chromatid separation
5  GO:0007059                                       chromosome segregation
6  GO:0000022                                   mitotic spindle elongation
7  GO:0007052                                 mitotic spindle organization
8  GO:0045841 negative regulation of mitotic metaphase/anaphase transition
9  GO:0051255                                     spindle midzone assembly
10 GO:0045839              negative regulation of mitotic nuclear division
   GeneRatio   BgRatio RichFactor FoldEnrichment   zScore       pvalue
1      7/194  17/11525  0.4117647      24.461795 12.66643 5.796310e-09
2     15/194  94/11525  0.1595745       9.479875 10.80139 4.160561e-11
3      7/194  11/11525  0.6363636      37.804592 15.97915 1.071608e-10
4     10/194  48/11525  0.2083333      12.376503 10.33436 5.435313e-09
5     33/194 310/11525  0.1064516       6.323994 12.43332 1.187757e-17
6      6/194  10/11525  0.6000000      35.644330 14.34064 4.181987e-09
7     16/194 110/11525  0.1454545       8.641050 10.53610 3.904210e-11
8     10/194  48/11525  0.2083333      12.376503 10.33436 5.435313e-09
9      6/194  10/11525  0.6000000      35.644330 14.34064 4.181987e-09
10    12/194  54/11525  0.2222222      13.201604 11.75930 6.805963e-11
       p.adjust       qvalue
1  2.884779e-07 8.482451e-08
2  5.383766e-09 1.583051e-09
3  1.113735e-08 3.274844e-09
4  2.758155e-07 8.110124e-08
5  4.391308e-15 1.291227e-15
6  2.352822e-07 6.918277e-08
7  5.317945e-09 1.563697e-09
8  2.758155e-07 8.110124e-08
9  2.352822e-07 6.918277e-08
10 8.006287e-09 2.354182e-09
                                                                                                                                                                               geneID
1                                                                                                                                                55143/9787/220134/11065/991/9212/332
2                                                                                                        891/55143/9787/220134/7272/983/11065/1062/10403/991/9212/332/4085/10460/9319
3                                                                                                                                                 55143/9055/3832/9212/332/9493/24137
4                                                                                                                                  891/55143/220134/7272/10403/991/9212/332/4085/9319
5  891/146909/55143/64151/81930/4751/9787/220134/55355/7272/983/11065/1062/9232/23397/10403/9055/3832/991/81620/9212/332/4085/7153/10460/9493/51203/24137/55839/22974/79019/3833/9319
6                                                                                                                                                      55143/9055/9212/332/9493/24137
7                                                                                                  891/55143/9787/7272/1062/10403/9055/3832/9212/332/10460/9493/6790/24137/22974/3833
8                                                                                                                                  891/55143/220134/7272/10403/991/9212/332/4085/9319
9                                                                                                                                                      55143/9055/9212/332/9493/24137
10                                                                                                                        891/55143/220134/7272/652/10403/1111/991/9212/332/4085/9319
   Count posterior posterior_odds bayes_rank bayes_active
1      7    0.5425      1.1857923          1         TRUE
2     15    0.4175      0.7167382          2        FALSE
3      7    0.4125      0.7021277          3        FALSE
4     10    0.3450      0.5267176          4        FALSE
5     33    0.2750      0.3793103          5        FALSE
6      6    0.2550      0.3422819          6        FALSE
7     16    0.2475      0.3289037          7        FALSE
8     10    0.2175      0.2779553          8        FALSE
9      6    0.2050      0.2578616          9        FALSE
10    12    0.2000      0.2500000         10        FALSE
                                                                                                                                                                   bayes_covered_gene
1                                                                                                                                                332/991/9212/9787/11065/55143/220134
2                                                                                                        332/891/983/991/1062/4085/7272/9212/9319/9787/10403/10460/11065/55143/220134
3                                                                                                                                                 332/3832/9055/9212/9493/24137/55143
4                                                                                                                                  332/891/991/4085/7272/9212/9319/10403/55143/220134
5  332/891/983/991/1062/3832/3833/4085/4751/7153/7272/9055/9212/9232/9319/9493/9787/10403/10460/11065/22974/23397/24137/51203/55143/55355/55839/64151/79019/81620/81930/146909/220134
6                                                                                                                                                      332/9055/9212/9493/24137/55143
7                                                                                                  332/891/1062/3832/3833/6790/7272/9055/9212/9493/9787/10403/10460/22974/24137/55143
8                                                                                                                                  332/891/991/4085/7272/9212/9319/10403/55143/220134
9                                                                                                                                                      332/9055/9212/9493/24137/55143
10                                                                                                                        332/652/891/991/1111/4085/7272/9212/9319/10403/55143/220134
   bayes_covered_count
1                    7
2                   15
3                    7
4                   10
5                   33
6                    6
7                   16
8                   10
9                    6
10                  12

We can also focus on terms that pass a posterior cutoff:

bayes_summary(ego2, active = TRUE)

          ID                                                Description
1 GO:1901970 positive regulation of mitotic sister chromatid separation
  GeneRatio  BgRatio RichFactor FoldEnrichment   zScore      pvalue
1     7/194 17/11525  0.4117647        24.4618 12.66643 5.79631e-09
      p.adjust       qvalue                               geneID Count
1 2.884779e-07 8.482451e-08 55143/9787/220134/11065/991/9212/332     7
  posterior posterior_odds bayes_rank bayes_active
1    0.5425       1.185792          1         TRUE
                    bayes_covered_gene bayes_covered_count
1 332/991/9212/9787/11065/55143/220134                   7

16.7 Output columns

bayes_enrich() returns the original enrichment object and adds several columns to the result table:

posterior, the posterior probability that a term is active.
posterior_odds, calculated as posterior / (1 - posterior).
bayes_rank, the rank induced by ordering terms by posterior support.
bayes_active, whether posterior >= posterior_cutoff.
bayes_covered_gene, the observed genes covered by the term.
bayes_covered_count, the number of covered observed genes.

These columns make it easy to keep working with the same enrichment object while gaining a second ranking criterion that emphasizes explanation rather than marginal significance.

16.8 Relationship to `simplify()`

simplify() and bayes_enrich() address different kinds of redundancy.

simplify() uses semantic similarity among GO terms and keeps one representative term from a redundant semantic cluster.
bayes_enrich() uses term-gene coverage and asks which terms are jointly needed to explain the observed genes.

These approaches can be used together. For example, one can first compute posterior support and then use it as the selection criterion in simplify():

ego <- enrichGO(gene = de, universe = names(geneList), OrgDb = org.Hs.eg.db, ont = "BP")
ego <- bayes_enrich(ego)
ego <- simplify(ego, by = "posterior", select_fun = max)

16.9 Scope and interpretation

The current implementation is designed primarily for ORA and other analyses that operate on a binary observed gene list. Ranked-list methods such as GSEA require an additional step to define which genes count as “observed”, for example by using leading-edge genes or another thresholding rule. For this reason, bayes_enrich() should be viewed as a posterior term-selection layer for ORA-style enrichment rather than a direct replacement for GSEA statistics.

posterior should be interpreted as posterior support for a term being an active explanatory program under the chosen candidate space and model parameters. It is not a Bayesian-adjusted p-value, not direct experimental proof of pathway activation, and not an absolute probability over the entire ontology.

In practice, the main value of bayes_enrich() is that it compresses a long list of overlapping enriched terms into a smaller set of high-posterior programs, which is often much easier to interpret and report.

16.1 Input requirements

16.2 Candidate term space

16.3 The model

16.4 MCMC inference

16.5 Why this helps

16.6 A worked example

16.7 Output columns

16.8 Relationship to simplify()

16.9 Scope and interpretation

16.8 Relationship to `simplify()`