15 AI-Assisted Biological Interpretation
Traditional enrichment analysis typically results in a list of significant pathways or GO terms. While statistically sound, these lists often leave researchers asking, “So what?” What is the underlying biological mechanism? Who are the key drivers? Is this a pro-survival or pro-death signal?
To bridge the gap between statistical results and biological insights, clusterProfiler introduces an AI-powered interpretation module. By leveraging Large Language Models (LLMs) and a multi-agent system, clusterProfiler can now act as a virtual bioinformatician, converting dry enrichment lists into coherent, evidence-based biological narratives.
15.1 The interpret Function
The core function for this feature is interpret(). It accepts enrichment results (e.g., from enrichKEGG, enrichGO, or compareCluster) and uses an LLM to generate a structured report.
To use this feature, you need to configure an API key for a supported LLM provider (e.g., DeepSeek).
#| eval: false
library(clusterProfiler)
# Basic usage
# 'edo' is your enrichment result object
res <- interpret(edo)
print(res)
15.1.1 Tasks and Inputs
interpret() is not just for explaining enrichment results. It breaks down LLM capabilities into three distinct tasks:
- task = "interpretation": (Default) Converts enrichment results into a mechanistic narrative suitable for publication (What -> So What).
- task = "annotation": Performs cell type annotation for single-cell clusters using both marker genes and enrichment terms as evidence.
- task = "phenotyping": Assigns a “state/phenotype label” to a group (e.g., “Pro-inflammatory” or “Senescent-like”).
To strengthen the evidence, interpret() supports “evidence synthesis” from multiple sources:
- Single Object: enrichResult, gseaResult, or compareClusterResult.
- Multiple Objects: A list() of results (e.g., list(kegg_res, go_res) or list(cellmarker_res, go_res)).
- Batch Processing: If the input is a compareCluster result, it automatically splits by cluster and generates a report for each.
The key features of interpret() include:
- Prompt Skeleton: A fixed structure to guide the LLM.
- Structured Output: Enforced structure for parsing, comparison, and batch processing.
- Reasoning First: Encourages “deduction before writing” to avoid merely listing pathway names.
15.1.2 Cell Type Annotation
For example, we can use Seurat to identify marker genes for each cluster in a single-cell RNA-seq dataset. Then we can use compareCluster to perform enrichment analysis for each cluster. Finally, we can use interpret to annotate cell types based on the enrichment results and marker genes.
library(Seurat)
dir = "data/filtered_gene_bc_matrices/hg19"
pbmc.data <- Read10X(data.dir = dir)
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k",
                           min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc,
  subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5
)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize",
                      scale.factor = 10000)
pbmc <- ScaleData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst",
                             nfeatures = 2000)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
library(dplyr)
# Keep the first n markers per cluster with avg_log2FC > 1
topN_marker <- function(markers, n) {
  markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = n) %>%
    ungroup()
}
top20 <- topN_marker(pbmc.markers, 20)
# downloaded from: http://www.bio-bigdata.center/CellMarker_download_files/file/Cell_marker_Human.xlsx
cm <- rio::import("Cell_marker_Human.xlsx")
x <- compareCluster(gene~cluster, data=top20, fun=enricher, TERM2GENE=cm[,c("cell_name", "marker")])
y <- interpret(x, task="annotation")
The output y is a list of interpretation results, one for each cluster. We can extract the inferred cell types.
> sapply(y, \(x) x$cell_type)
0
"Naive T Cell"
1
"Classical Monocyte"
2
"CD4+ T cell"
3
"Follicular B cell"
4
"CD8+ Cytotoxic T Cell"
5
"CD16+ monocyte (Non-classical monocyte)"
6
"Natural Killer (NK) cell"
7
"Plasmacytoid Dendritic Cell (pDC)"
8
"Megakaryocyte"
This result is highly consistent with the manual annotation from the Seurat pbmc3k tutorial:
| Cluster ID | Markers | Cell Type |
|---|---|---|
| 0 | IL7R, CCR7 | Naive CD4+ T |
| 1 | CD14, LYZ | CD14+ Mono |
| 2 | IL7R, S100A4 | Memory CD4+ |
| 3 | MS4A1 | B |
| 4 | CD8A | CD8+ T |
| 5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
| 6 | GNLY, NKG7 | NK |
| 7 | FCER1A, CST3 | DC |
| 8 | PPBP | Platelet |
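To make the comparison easier to scan, the two sets of labels can be placed side by side. The following is a minimal sketch in base R; the label vectors are copied by hand from the output and table above, whereas in practice `inferred` would come from `sapply(y, \(x) x$cell_type)`.

```r
# Labels returned by interpret() in the example above, keyed by cluster ID
inferred <- c(
  "0" = "Naive T Cell",
  "1" = "Classical Monocyte",
  "2" = "CD4+ T cell",
  "3" = "Follicular B cell",
  "4" = "CD8+ Cytotoxic T Cell",
  "5" = "CD16+ monocyte (Non-classical monocyte)",
  "6" = "Natural Killer (NK) cell",
  "7" = "Plasmacytoid Dendritic Cell (pDC)",
  "8" = "Megakaryocyte"
)

# Manual labels from the Seurat pbmc3k tutorial table
manual <- c(
  "0" = "Naive CD4+ T", "1" = "CD14+ Mono", "2" = "Memory CD4+",
  "3" = "B", "4" = "CD8+ T", "5" = "FCGR3A+ Mono",
  "6" = "NK", "7" = "DC", "8" = "Platelet"
)

# Align the two vectors by cluster ID
comparison <- data.frame(
  cluster  = names(inferred),
  inferred = unname(inferred),
  manual   = unname(manual[names(inferred)]),
  stringsAsFactors = FALSE
)
print(comparison)
```

Read this way, most clusters agree at the lineage level, while clusters 7 and 8 differ in granularity or naming (pDC vs. DC, Megakaryocyte vs. Platelet) and may deserve a closer look at the supporting evidence.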
The full report provides detailed reasoning, confidence levels, and supporting evidence (markers/pathways) for each cluster assignment, offering transparency and explainability that simple label transfer methods lack.
print(y)
15.2 The Multi-Agent System
Instead of relying on a single prompt, clusterProfiler employs a Multi-Agent System (MAS) to ensure accuracy and depth. This system consists of three specialized agents that work in a pipeline:
- Agent Cleaner: Acts as a curator. It filters out “housekeeping” pathways (e.g., Ribosome, Spliceosome) that may be statistically significant but irrelevant to the specific biological context (e.g., tumor immunology), reducing noise.
- Agent Detective: Acts as a systems biologist. It analyzes the gene list, looks for Hub Genes in Protein-Protein Interaction (PPI) networks, and combines this with Fold Change data to identify Key Drivers and infer regulatory mechanisms.
- Agent Storyteller: Acts as a scientific writer. It synthesizes the findings from the Cleaner and Detective into a logical narrative, distinguishing between observations (“What”), mechanisms (“How”), and implications (“So What”).
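The three-stage pipeline can be pictured as plain function composition. The sketch below is a conceptual mock in base R, not clusterProfiler's implementation: it makes no LLM calls, the function names (`agent_cleaner`, `agent_detective`, `agent_storyteller`) are hypothetical, and the enrichment table and fold changes are made-up toy values.

```r
# Toy enrichment table (made-up values, for illustration only)
enrich_tbl <- data.frame(
  Description = c("Ribosome", "T cell receptor signaling pathway",
                  "Spliceosome",
                  "PD-L1 expression and PD-1 checkpoint pathway in cancer"),
  p.adjust    = c(1e-8, 1e-5, 1e-4, 1e-3),
  geneID      = c("RPL3/RPS6", "CD8A/ZAP70/LCK", "SF3B1/SNRPA", "PDCD1/CD274"),
  stringsAsFactors = FALSE
)
# Toy log2 fold changes (made-up values)
fold_change <- c(CD8A = 2.5, ZAP70 = 0.4, LCK = 0.9, PDCD1 = 1.8, CD274 = -0.3)

# Stage 1 (Cleaner): drop terms on a hypothetical housekeeping list
agent_cleaner <- function(tbl, housekeeping = c("Ribosome", "Spliceosome")) {
  tbl[!tbl$Description %in% housekeeping, , drop = FALSE]
}

# Stage 2 (Detective): rank genes from the kept terms by |fold change|
agent_detective <- function(tbl, fc, n = 2) {
  genes <- unique(unlist(strsplit(tbl$geneID, "/", fixed = TRUE)))
  fc <- fc[intersect(genes, names(fc))]
  names(sort(abs(fc), decreasing = TRUE))[seq_len(min(n, length(fc)))]
}

# Stage 3 (Storyteller): assemble a short What / How summary string
agent_storyteller <- function(tbl, drivers) {
  paste0("What: ", paste(tbl$Description, collapse = "; "),
         ". How: candidate drivers ", paste(drivers, collapse = ", "), ".")
}

kept    <- agent_cleaner(enrich_tbl)
drivers <- agent_detective(kept, fold_change)
report  <- agent_storyteller(kept, drivers)
cat(report, "\n")
```

The point of the staging is that each step narrows the input for the next: the writer only sees terms that survived curation and genes that were singled out as candidate drivers.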
You can activate this deep mode using the interpret_agent() function.
#| eval: false
# Provide biological context to help Agent Cleaner
context <- "scRNA-seq analysis of CD8+ T cells in Tumor Microenvironment, comparing Exhausted vs. Naive states."
res <- interpret_agent(edo, context = context)
15.3 Knowledge-Guided Interpretation
Enrichment analysis often treats genes as a “bag of words,” ignoring their interactions and expression changes. The Knowledge-Guided Interpretation mode injects external knowledge to empower the AI’s reasoning.
15.3.1 1. PPI Networks and Hub Genes
By setting add_ppi = TRUE, the system fetches protein-protein interaction data (from STRING). The AI can then identify functional modules (e.g., a TCR signaling complex) rather than just isolated genes.
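To illustrate what a "hub gene" means in this setting, the sketch below counts node degree in a small edge list using base R. The edge list is a hand-written toy, not data retrieved from STRING, and degree is only one simple notion of centrality; how the package itself scores hubs is not specified here.

```r
# Toy undirected PPI edge list (illustrative, not retrieved from STRING)
edges <- data.frame(
  from = c("ZAP70", "ZAP70", "ZAP70", "LCK",  "GZMB"),
  to   = c("LCK",   "CD3E",  "LAT",   "CD3E", "PRF1"),
  stringsAsFactors = FALSE
)

# Degree = number of edges touching each gene
genes  <- c(edges$from, edges$to)
degree <- vapply(split(genes, genes), length, integer(1))
degree <- sort(degree, decreasing = TRUE)
print(degree)

# Treat the highest-degree gene(s) as hub candidates
hubs <- names(degree)[degree == max(degree)]
hubs
```

In this toy network the ZAP70/LCK/CD3E/LAT genes form one connected module, while GZMB and PRF1 form a separate pair, which is the kind of grouping that a flat gene list cannot express.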
15.3.2 2. Expression Trends
By providing gene_fold_change, the AI can infer the direction of pathway activity. For example, if the “Apoptosis” pathway is enriched but pro-apoptotic genes are downregulated while anti-apoptotic genes are upregulated, the AI will correctly interpret this as “Apoptosis Resistance.”
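The named vector expected by gene_fold_change can be derived from a differential-expression table. The sketch below uses base R with made-up values; the column names `gene` and `avg_log2FC` are assumptions modeled on Seurat-style output. The direction check at the end is a deliberately crude stand-in for the reasoning described above, not what the LLM actually computes.

```r
# Toy differential-expression table (made-up values); column names are
# assumptions modeled on Seurat-style marker tables
de <- data.frame(
  gene       = c("BAX", "BAK1", "CASP3", "BCL2", "MCL1"),
  avg_log2FC = c(-1.2, -0.8, -1.5, 2.1, 1.7),
  stringsAsFactors = FALSE
)

# Named numeric vector: names are gene symbols, values are log2 fold changes
gene_list <- setNames(de$avg_log2FC, de$gene)

# Hand-picked example gene groups for a crude direction check
pro_apoptotic  <- c("BAX", "BAK1", "CASP3")
anti_apoptotic <- c("BCL2", "MCL1")

pro_mean  <- mean(gene_list[pro_apoptotic])
anti_mean <- mean(gene_list[anti_apoptotic])

direction <- if (pro_mean < 0 && anti_mean > 0) {
  "apoptosis resistance"
} else if (pro_mean > 0 && anti_mean < 0) {
  "apoptosis activation"
} else {
  "ambiguous"
}
direction
```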
#| eval: false
# Prepare a named vector of fold changes
gene_list <- c("CD8A" = 2.5, "PDCD1" = 1.8, "GZMB" = 3.2)
res <- interpret(edo,
  task = "cell_type",
  add_ppi = TRUE, # Enable PPI network analysis
  gene_fold_change = gene_list # Inject expression data
)
15.4 Reference-Guided Interpretation
For cell type annotation, LLMs can sometimes hallucinate. To prevent this, clusterProfiler supports Reference-Guided Interpretation.
15.4.1 Prior Knowledge Injection
You can provide “prior knowledge” (e.g., results from SingleR, scGPT, or manual rough annotation) to the AI. The AI acts as a validator and refiner:
- Validation: Checks if pathway evidence supports the prior label.
- Refinement: Refines a broad label (e.g., “T cell”) into a specific state (e.g., “Proliferating CD8+ T cell”) based on pathway activity.
- Correction: Flags potential misannotations if the evidence contradicts the prior.
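Reference-based tools usually label individual cells, whereas the prior is given per cluster. One simple way to bridge the two is a majority vote within each cluster, sketched below in base R. The metadata is a made-up toy, and the `cluster` and `label` column names are assumptions rather than the output format of any particular tool.

```r
# Toy per-cell metadata (made-up); `label` stands in for per-cell calls
# from a reference-based annotation tool
cells <- data.frame(
  cluster = c("Cluster1", "Cluster1", "Cluster1", "Cluster2", "Cluster2"),
  label   = c("T cells", "T cells", "NK cells", "B cells", "B cells"),
  stringsAsFactors = FALSE
)

# Majority vote: the most frequent label within each cluster
majority <- function(x) names(which.max(table(x)))
my_priors <- vapply(split(cells$label, cells$cluster), majority, character(1))
my_priors
```

The result is a named character vector keyed by cluster, the same shape as the prior used in the example that follows.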
#| eval: false
# Prior knowledge from SingleR
my_priors <- c("Cluster1" = "T cells")
res <- interpret(edo, prior = my_priors, task = "cell_type")
15.4.2 Hierarchical Interpretation
For complex datasets, interpret_hierarchical() mimics the human thought process of annotating major lineages first (e.g., Myeloid) and then subtypes (e.g., M1 Macrophage). It enforces lineage constraints to prevent impossible annotations (e.g., a T cell subtype appearing within a Myeloid cluster).
This approach is also highly applicable to Single-cell Trajectory Inference. Developmental processes inherently follow a hierarchical structure (e.g., Stem Cell -> Progenitor -> Terminally Differentiated Cell). By utilizing this hierarchical relationship, interpret_hierarchical() can provide context-aware interpretations that respect the biological differentiation path, ensuring that downstream states are interpreted within the context of their upstream progenitors.
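The minor-to-major mapping does not have to be typed by hand. Assuming the cell metadata already records both levels, it can be derived and sanity-checked as in the base R sketch below; the data frame and its `minor` and `major` column names are hypothetical.

```r
# Toy cluster metadata (made-up): one row per cell
meta <- data.frame(
  minor = c("SubCluster1_1", "SubCluster1_1", "SubCluster1_2", "SubCluster2_1"),
  major = c("MajorCluster1", "MajorCluster1", "MajorCluster1", "MajorCluster2"),
  stringsAsFactors = FALSE
)

# One row per distinct minor/major pair
pairs <- unique(meta[, c("minor", "major")])

# Each minor cluster must belong to exactly one major cluster
stopifnot(!anyDuplicated(pairs$minor))

# Named vector: names are minor clusters, values are their major cluster
cluster_mapping <- setNames(pairs$major, pairs$minor)
cluster_mapping
```

The uniqueness check matters because a minor cluster assigned to two lineages would make any lineage constraint ambiguous.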
#| eval: false
# Mapping between minor and major clusters
cluster_mapping <- c(
  "SubCluster1_1" = "MajorCluster1",
  "SubCluster1_2" = "MajorCluster1",
  "SubCluster2_1" = "MajorCluster2"
)
# Hierarchical interpretation
res_hier <- interpret_hierarchical(
  x_minor = enrich_minor,
  x_major = enrich_major,
  mapping = cluster_mapping
)
15.5 Gene-Based Fallback Mode
In real-world research, enrichment analysis sometimes fails to return significant pathways due to small gene sets or background noise.
clusterProfiler introduces a Gene-Based Fallback Mode. When no enriched pathways are found, the agent does not simply error out. Instead, it:
1. Directly analyzes the function of the input genes.
2. Retrieves PPI networks for these genes.
3. Infers biological function based on gene connectivity and function, providing a “Medium Confidence” report instead of an empty result.
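The control flow behind such a fallback can be sketched in a few lines of base R. This is an illustration of the idea only: `choose_evidence` is a hypothetical helper, the enrichment result is represented as a plain data frame, and the package's internal logic may differ.

```r
# Decide which evidence to reason from. `enrich_tbl` is a data frame of
# significant terms (possibly with zero rows); `genes` is the input gene set.
choose_evidence <- function(enrich_tbl, genes) {
  if (!is.null(enrich_tbl) && nrow(enrich_tbl) > 0) {
    list(mode = "pathway", evidence = enrich_tbl$Description)
  } else {
    # Fallback: reason from the genes themselves, with lower confidence
    list(mode = "gene_fallback", evidence = genes, confidence = "Medium")
  }
}

# An enrichment run that returned no significant terms
empty_tbl <- data.frame(Description = character(0), p.adjust = numeric(0))
fallback <- choose_evidence(empty_tbl, c("CD8A", "PDCD1", "GZMB"))
fallback$mode
```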
#| eval: false
# When enrichment fails, this still works
res <- interpret_agent(tough_genes, add_ppi = TRUE)
15.6 Visualization: From Story to Figure
Text reports are great, but graphical representations are often preferred for presentations. The plot() method for interpretation results uses ggtangle (a grammar of graphics for networks) to visualize the AI-inferred regulatory network.
The resulting plot highlights:
- Key Drivers: Central nodes identified by the AI.
- Activation (Green) / Inhibition (Red): Regulatory relationships inferred from data.
- Interactions (Grey): Physical associations.
#| eval: false
# Visualize the interpretation result
plot(res)
This feature creates a closed loop: from Enrichment (Statistics) to Interpretation (Insight) and finally to Visualization (Communication).