15  AI-Assisted Biological Interpretation

Traditional enrichment analysis typically results in a list of significant pathways or GO terms. While statistically sound, these lists often leave researchers asking, “So what?” What is the underlying biological mechanism? Who are the key drivers? Is this a pro-survival or pro-death signal?

To bridge the gap between statistical results and biological insights, clusterProfiler introduces an AI-powered interpretation module. By leveraging Large Language Models (LLMs) and a multi-agent system, clusterProfiler can now act as a virtual bioinformatician, converting dry enrichment lists into coherent, evidence-based biological narratives.

15.1 The interpret Function

The core function for this feature is interpret(). It accepts enrichment results (e.g., from enrichKEGG, enrichGO, or compareCluster) and uses an LLM to generate a structured report.

To use this feature, you need to configure an API key for a supported LLM provider (e.g., DeepSeek).
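
One common way to keep the key out of scripts is to store it in an environment variable and read it at run time. The sketch below assumes a variable named LLM_API_KEY and a helper get_api_key(); both names are hypothetical, and how interpret() actually receives the key should be checked in its help page.

```r
# Hypothetical variable name; not a name defined by clusterProfiler.
key_var <- "LLM_API_KEY"

# Usually set once in ~/.Renviron rather than inside a script:
# Sys.setenv(LLM_API_KEY = "your-key-here")

# Read the key and fail early with a clear message if it is missing.
get_api_key <- function(var = key_var) {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}
```

Failing early avoids a long enrichment run that only errors once the LLM request is sent.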

#| eval: false
library(clusterProfiler)

# Basic usage
# 'edo' is your enrichment result object
res <- interpret(edo)
print(res)

15.1.1 Tasks and Inputs

interpret() is not just for explaining enrichment results. It breaks down LLM capabilities into three distinct tasks:

  • task = "interpretation": (Default) Converts enrichment results into a mechanistic narrative suitable for publication (What -> So What).
  • task = "annotation": Performs cell type annotation for single-cell clusters using both marker genes and enrichment terms as evidence.
  • task = "phenotyping": Assigns a “state/phenotype label” to a group (e.g., “Pro-inflammatory” or “Senescent-like”).

To strengthen the evidence, interpret() supports “evidence synthesis” from multiple sources:

  • Single Object: enrichResult, gseaResult, or compareClusterResult.
  • Multiple Objects: A list() of results (e.g., list(kegg_res, go_res) or list(cellmarker_res, go_res)).
  • Batch Processing: If the input is a compareCluster result, it automatically splits by cluster and generates a report for each.
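
The batch behaviour described for compareCluster results amounts to a split-apply pattern. The following sketch illustrates that pattern on a toy data frame with Cluster, Description and p.adjust columns; the values are invented, and picking the top term is only a stand-in for the per-cluster report that the real function would request from the LLM.

```r
# Toy stand-in for the table behind a compareCluster result (values invented).
cc_df <- data.frame(
  Cluster = c("0", "0", "1", "1", "1"),
  Description = c("Naive CD4+ T cell", "T cell",
                  "Classical monocyte", "Myeloid cell", "Macrophage"),
  p.adjust = c(1e-8, 1e-6, 1e-7, 1e-9, 1e-5),
  stringsAsFactors = FALSE
)

# One sub-table per cluster; each sub-table would feed one report.
per_cluster <- split(cc_df, cc_df$Cluster)

# Stand-in for the per-cluster step: the term with the smallest adjusted p-value.
top_terms <- vapply(
  per_cluster,
  function(d) d$Description[which.min(d$p.adjust)],
  character(1)
)
print(top_terms)
```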

The key features of interpret() include:

  • Prompt Skeleton: A fixed structure to guide the LLM.
  • Structured Output: Enforced structure for parsing, comparison, and batch processing.
  • Reasoning First: Encourages “deduction before writing” to avoid merely listing pathway names.

15.1.2 Cell Type Annotation

For example, we can use Seurat to identify marker genes for each cluster in a single-cell RNA-seq dataset. Then we can use compareCluster to perform enrichment analysis for each cluster. Finally, we can use interpret to annotate cell types based on the enrichment results and marker genes.

library(Seurat)
dir = "data/filtered_gene_bc_matrices/hg19"
pbmc.data <- Read10X(data.dir = dir)
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", 
                          min.cells=3, min.features=200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc,
  subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5
)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize",
                      scale.factor = 10000)
pbmc <- ScaleData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst",
                             nfeatures = 2000)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)

library(dplyr)
topN_marker <- function(markers, n) {
    markers %>%
        group_by(cluster) %>%
        dplyr::filter(avg_log2FC > 1) %>%
        slice_head(n = n) %>%
        ungroup()
}
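# Keep up to 20 markers per cluster
top20 <- topN_marker(pbmc.markers, 20)
# downloaded from: http://www.bio-bigdata.center/CellMarker_download_files/file/Cell_marker_Human.xlsx
cm <- rio::import("Cell_marker_Human.xlsx")
# Enrich each cluster's markers against the CellMarker cell-type gene sets
x <- compareCluster(gene~cluster, data=top20, fun=enricher,
                    TERM2GENE=cm[,c("cell_name", "marker")])
# Annotate each cluster from its enrichment terms and marker genes
y <- interpret(x, task="annotation")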

The output y is a list of interpretation results, one for each cluster. We can extract the inferred cell types.

> sapply(y, \(x) x$cell_type)
                                        0 
                           "Naive T Cell"
                                        1
                     "Classical Monocyte"
                                        2
                            "CD4+ T cell"
                                        3
                      "Follicular B cell"
                                        4
                  "CD8+ Cytotoxic T Cell" 
                                        5
"CD16+ monocyte (Non-classical monocyte)"
                                        6
               "Natural Killer (NK) cell"
                                        7
      "Plasmacytoid Dendritic Cell (pDC)"
                                        8 
                          "Megakaryocyte"

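This result agrees closely with the manual annotation from the Seurat pbmc3k tutorial, although a few labels differ in granularity or identity (e.g., cluster 7: pDC vs. DC; cluster 8: Megakaryocyte vs. Platelet):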

Cluster ID   Markers          Cell Type
0            IL7R, CCR7       Naive CD4+ T
1            CD14, LYZ        CD14+ Mono
2            IL7R, S100A4     Memory CD4+
3            MS4A1            B
4            CD8A             CD8+ T
5            FCGR3A, MS4A7    FCGR3A+ Mono
6            GNLY, NKG7       NK
7            FCER1A, CST3     DC
8            PPBP             Platelet
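
To check agreement cluster by cluster, the two label sets can be placed side by side. The sketch below copies the labels from the output and the tutorial table in this section into named vectors; the object names are illustrative.

```r
# Labels returned by interpret(), copied from the output shown earlier.
llm <- c(
  "0" = "Naive T Cell", "1" = "Classical Monocyte", "2" = "CD4+ T cell",
  "3" = "Follicular B cell", "4" = "CD8+ Cytotoxic T Cell",
  "5" = "CD16+ monocyte (Non-classical monocyte)",
  "6" = "Natural Killer (NK) cell",
  "7" = "Plasmacytoid Dendritic Cell (pDC)", "8" = "Megakaryocyte"
)

# Manual labels from the Seurat pbmc3k tutorial table.
manual <- c(
  "0" = "Naive CD4+ T", "1" = "CD14+ Mono", "2" = "Memory CD4+",
  "3" = "B", "4" = "CD8+ T", "5" = "FCGR3A+ Mono",
  "6" = "NK", "7" = "DC", "8" = "Platelet"
)

# Align both vectors by cluster ID.
comparison <- data.frame(
  cluster = names(llm),
  llm = unname(llm),
  seurat_tutorial = unname(manual[names(llm)]),
  stringsAsFactors = FALSE
)
print(comparison, row.names = FALSE)
```

Matching on the cluster ID rather than on position keeps the comparison correct even if one of the vectors is reordered.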

The full report provides detailed reasoning, confidence levels, and supporting evidence (markers/pathways) for each cluster assignment, offering transparency and explainability that simple label transfer methods lack.

print(y)
Enrichment Interpretation / Annotation Report

Cell Type Annotation

Cluster: 0

Cell Type: Naive T Cell

Confidence: High

Reasoning: The cluster is definitively identified as a naive T cell based on the co-expression of canonical pan-T cell markers (CD3D, CD3E) and the master regulator of naive T cell identity, TCF7 (cited in top terms: Naive CD8+ T cell, Naive CD4+ T cell, etc.). The high expression of CCR7, a critical homing receptor for naive and central memory T cells, and LEF1, another Wnt-pathway TF co-operating with TCF7, further solidifies this identity. The enrichment list is dominated by naive and central memory T cell subtypes, with effector/cytotoxic terms ranking lower and lacking their specific markers (e.g., GZMB, PRF1). The presence of both CD4+ and CD8+ associated terms suggests a mixed population or a shared naive state before lineage commitment, but the core identity is naive T cell.

Supporting Markers/Pathways:

  • CD3D
  • CD3E
  • TCF7
  • CCR7
  • LEF1
  • NOSIP
  • MAL


Cell Type Annotation

Cluster: 1

Cell Type: Classical Monocyte

Confidence: High

Reasoning: The enrichment list contains many related myeloid cell types, but the specific marker gene profile is definitive. The cluster expresses the core classical monocyte signature: high expression of CD14, S100A8, S100A9, FCN1, and LYZ (Top Specific/Marker Genes). While ‘Myeloid cell’ and ‘Macrophage’ are top-ranked by p-value, they are broad categories. The specific ‘Classical monocyte’ term (GeneRatio: 5/20, p.adjust: 1.637326e-07) is strongly supported by its gene list (S100A9/FCN1/CD14/S100A8/LYZ), which perfectly matches the top markers. The cluster lacks definitive markers to distinguish it as a Dendritic Cell (e.g., no FLT3, CD1C, CLEC9A), Macrophage (e.g., low/absent MRC1/CD163), or Neutrophil (e.g., absent MPO, ELANE). The presence of FCN1 and CD14 together is a hallmark of classical monocytes, and the absence of FCGR3A (CD16) argues against non-classical monocytes.

Supporting Markers/Pathways:

  • CD14
  • S100A8
  • S100A9
  • FCN1
  • LYZ
  • CST3
  • TYROBP
  • MS4A6A


Cell Type Annotation

Cluster: 2

Cell Type: CD4+ T cell

Confidence: High

Reasoning: The cluster is definitively a T cell, as the top enriched term is ‘T cell’ (p.adjust: 2.86e-17) and the marker list includes the core T-cell receptor complex genes CD3D, CD3E, CD3G, and CD247 (LAT). Among T-cell subtypes, the evidence strongly favors a CD4+ lineage over CD8+. The second most significant term is ‘CD4+ T cell’ (p.adjust: 2.40e-14), and its gene list (IL32, CD3E, IL7R, CD27, CD3D, TNFRSF4, MAL, CD2, LTB, CD40LG, CD3G) is almost entirely contained within the top ‘T cell’ markers. Key CD4+ T-cell markers IL7R and CD27 are among the top specific genes. While ‘CD8+ T cell’ is also enriched, its signature genes (like AQP3) are present but lower in the marker list, and definitive cytotoxic CD8+ markers (e.g., GZMB, PRF1) are absent. The presence of CD40LG and TNFRSF4 (OX40), which are associated with CD4+ T helper and regulatory functions, further supports this assignment. The cluster lacks exclusive markers for NK cells (e.g., NCAM1, KLR genes) or Tregs (FOXP3), though it shows some regulatory association.

Supporting Markers/Pathways:

  • CD3D
  • CD3E
  • CD3G
  • IL7R
  • CD27
  • CD40LG
  • IL32
  • LTB
  • TNFRSF4


Cell Type Annotation

Cluster: 3

Cell Type: Follicular B cell

Confidence: High

Reasoning: The top enriched term is ‘Follicular B cell’ (p.adjust: 2.95e-16), and its gene list contains definitive B cell lineage markers (CD79A, CD79B, MS4A1, BANK1, FCER2, TCL1A) that are also present in the cluster’s top specific genes. While other top terms like ‘Secretory cell’ and ‘Classical monocyte’ are enriched, they are driven almost exclusively by MHC Class II genes (HLA-DRA, HLA-DRB1, etc.), which are not specific to those cell types but are also expressed by antigen-presenting B cells. The presence of core B cell receptor components (CD79A/B) and mature B cell markers (MS4A1, FCER2, TCL1A) that are absent from monocyte/dendritic cell definitions, combined with the lack of specific monocyte (e.g., CD14, FCGR3A) or secretory cell markers, confirms the identity as a Follicular B cell.

Supporting Markers/Pathways:

  • CD79A
  • MS4A1
  • CD79B
  • TCL1A
  • FCER2
  • BANK1
  • CD37
  • HLA-DRA
  • HLA-DRB1
  • CD74


Cell Type Annotation

Cluster: 4

Cell Type: CD8+ Cytotoxic T Cell

Confidence: High

Reasoning: The top enriched terms are a mixture of ‘Natural killer cell’ and various T cell subtypes, indicating shared cytotoxic function. However, the specific marker gene list is definitive. It includes the core T cell receptor complex genes CD3D, CD8A, and CD8B (present in ‘CD8+ T cell’ and ‘Cytotoxic T cell’ enrichments), which are lineage-defining for CD8+ T cells and absent from NK cells. While NKG7, PRF1, GZMA, GZMK, and GZMH are shared cytotoxic molecules, the co-expression of CD3D with CD8A/CD8B specifically identifies a cytotoxic T cell lineage. The absence of definitive NK-specific markers (e.g., NCAM1/CD56, KLRD1/CD94, FCGR3A/CD16) from the top marker list, and the presence of the T cell-specific signaling adaptor HCST (DAP10), supports a T cell identity. The ‘Cytotoxic CD8+ T cell’ enrichment term (GeneRatio: 8/20) provides the most precise functional and lineage match.

Supporting Markers/Pathways:

  • CD3D
  • CD8A
  • CD8B
  • GZMK
  • NKG7
  • CCL5
  • PRF1
  • GZMA
  • GZMH
  • CST7
  • LAG3
  • KLRG1


Cell Type Annotation

Cluster: 5

Cell Type: CD16+ monocyte (Non-classical monocyte)

Confidence: High

Reasoning: The top enriched term is ‘CD1C-CD141- dendritic cell’ (p.adjust: 3.68e-23), but this is likely a misannotation due to shared myeloid markers. The gene list for this term (LST1, CKB, HCK, CSF1R, IFITM3, MS4A7, SERPINA1, LILRB1, CDKN1C, PILRA, FCGR3A, HMOX1, RHOC, LRRC25, SIGLEC10, MS4A4A) is a composite of pan-myeloid and monocyte-specific genes, and lacks definitive dendritic cell markers (e.g., CD1C, CLEC9A, BATF3). The cluster’s specific marker list is dominated by canonical markers for non-classical monocytes: FCGR3A (CD16) is the defining marker, supported by MS4A7, CSF1R, and HES4 (enriched in ‘CD16+ monocyte’ and ‘Non-classical monocyte’ terms). The absence of T/NK/B cell markers and the presence of macrophage/pan-myeloid genes (e.g., CST3, CTSL) confirm a myeloid lineage, while the high expression of FCGR3A, MS4A7, and HES4 specifically distinguishes the non-classical monocyte subset from classical monocytes, macrophages, and dendritic cells.

Supporting Markers/Pathways:

  • FCGR3A (CD16)
  • MS4A7
  • HES4
  • CSF1R
  • LST1
  • IFITM3
  • SIGLEC10
  • PILRA


Cell Type Annotation

Cluster: 6

Cell Type: Natural Killer (NK) cell

Confidence: High

Reasoning: The top enriched terms include both ‘Natural killer cell’ (17/20 genes, p=2.79e-27) and ‘Cytotoxic CD4+ T cell’ (14/20 genes, p=5.24e-28). While the latter has a slightly better p-value, the discriminatory marker analysis strongly favors NK cells. The cluster’s top specific genes include canonical NK markers FGFBP2, SPON2, XCL2, XCL1, SH2D1B, and FCGR3A (CD16a), which are not specific to T cells. Critically, the cluster lacks definitive T-cell lineage markers (e.g., CD3D, CD3E, CD4, CD8A). The shared cytotoxic genes (PRF1, GNLY, GZMB, GZMA, NKG7) are expressed by both NK cells and cytotoxic T cells, but the presence of NK-specific markers and absence of T-cell receptor complex genes confirms NK cell identity.

Supporting Markers/Pathways:

  • FGFBP2
  • SPON2
  • XCL2
  • XCL1
  • SH2D1B
  • FCGR3A (CD16a)
  • KLRD1 (CD94)
  • GNLY
  • PRF1
  • GZMB
  • GZMA
  • NKG7


Cell Type Annotation

Cluster: 7

Cell Type: Plasmacytoid Dendritic Cell (pDC)

Confidence: High

Reasoning: The enrichment list contains two distinct dendritic cell lineages: conventional/myeloid DCs (cDC2, CD1C+ DCs) and plasmacytoid DCs (pDC). While the top-ranked term is ’CD1C+_A dendritic cell’ (a cDC2 subtype), the cluster’s specific marker genes are definitive for pDC identity. The pDC-specific markers LILRA4, CLEC4C (BDCA-2), SERPINF1, and P2RY6 are present in the top marker list and are the defining genes for the ‘Plasmacytoid dendritic cell(pDC)’ and ‘Plasmacytoid dendritic cell’ enrichment terms. Critically, the cluster lacks the core, non-overlapping markers for cDC2s: while it expresses CD1C and FCER1A (which can be expressed at low levels in some pDCs), it does NOT express the definitive cDC2 markers CLEC10A (in the cDC2 enrichment term but not in the pDC-specific marker set from the top genes) and CD1C at high specificity relative to pDC markers. The rule of exclusion applies: the top term is a cDC2 type, but the specific marker gene list is dominated by pDC markers and lacks exclusive cDC2 commitment.

Supporting Markers/Pathways:

  • LILRA4 (ILT7)
  • CLEC4C (BDCA-2)
  • SERPINF1
  • P2RY6
  • CLIC2
  • SCT
  • LRRC26


Cell Type Annotation

Cluster: 8

Cell Type: Megakaryocyte

Confidence: High

Reasoning: The assignment is based on the definitive convergence of enrichment terms and marker genes. The top enriched term is ‘Megakaryocyte’ (p.adjust: 3.58e-10) with genes SPARC, GNG11, PF4, GP9, ITGA2B, GP1BA. The related terms ‘Platelet’ and ‘Progenitor cell’ are lower-ranked and share subsets of these genes (e.g., PF4, GP9, ITGA2B), which is expected as platelets are anucleate fragments of megakaryocytes. The top specific marker gene list is dominated by canonical megakaryocyte/platelet markers (GP9, ITGA2B, GP1BA, PF4, ITGB3, SPARC, GNG11) and lacks definitive markers for other hematopoietic lineages that could challenge this identity (e.g., no CD3, CD19, CD14, ELANE). The ‘Progenitor cell’ term is likely reflective of the shared SPARC and PF4 expression in some progenitor states, but the presence of terminal differentiation markers like GP1BA and ITGA2B confirms a mature megakaryocyte identity.

Supporting Markers/Pathways:

  • GP9
  • ITGA2B
  • GP1BA
  • PF4
  • SPARC
  • GNG11
  • ITGB3
  • CLDN5
  • CMTM5
  • SDPR


15.2 The Multi-Agent System

Instead of relying on a single prompt, clusterProfiler employs a Multi-Agent System (MAS) to ensure accuracy and depth. This system consists of three specialized agents that work in a pipeline:

  1. Agent Cleaner: Acts as a curator. It filters out “housekeeping” pathways (e.g., Ribosome, Spliceosome) that may be statistically significant but irrelevant to the specific biological context (e.g., tumor immunology), reducing noise.
  2. Agent Detective: Acts as a systems biologist. It analyzes the gene list, looks for Hub Genes in Protein-Protein Interaction (PPI) networks, and combines this with Fold Change data to identify Key Drivers and infer regulatory mechanisms.
  3. Agent Storyteller: Acts as a scientific writer. It synthesizes the findings from the Cleaner and Detective into a logical narrative, distinguishing between observations (“What”), mechanisms (“How”), and implications (“So What”).
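
The data flow through the three agents can be pictured as a chain of functions. The sketch below is a toy illustration, not the package's implementation: the function names, the housekeeping list and all values are invented, the Detective stage is reduced to counting pathway membership with fold change as a tie-breaker (a stand-in for PPI-based hub detection), and plain string pasting replaces the LLM calls.

```r
# Toy enrichment table and fold changes (values invented for illustration).
enrich_df <- data.frame(
  Description = c("Ribosome", "T cell receptor signaling pathway",
                  "Spliceosome", "PD-1 checkpoint pathway"),
  p.adjust = c(1e-10, 1e-6, 1e-5, 1e-4),
  geneID = c("RPL3/RPS6", "CD3E/LCK/ZAP70", "SF3B1/SNRPD1", "PDCD1/LCK/ZAP70"),
  stringsAsFactors = FALSE
)
fold_change <- c(CD3E = 1.2, LCK = 2.1, ZAP70 = 1.8, PDCD1 = 3.0)

# Stage 1 (Cleaner): drop pathways that appear on a housekeeping list.
agent_cleaner <- function(df, housekeeping = c("Ribosome", "Spliceosome")) {
  df[!df$Description %in% housekeeping, , drop = FALSE]
}

# Stage 2 (Detective): rank genes by the number of kept pathways containing
# them, breaking ties by absolute fold change.
agent_detective <- function(df, fc) {
  genes <- unlist(strsplit(df$geneID, "/", fixed = TRUE))
  counts <- table(genes)
  out <- data.frame(gene = names(counts), n_pathways = as.integer(counts),
                    stringsAsFactors = FALSE)
  out$abs_fc <- abs(unname(fc[out$gene]))
  out[order(-out$n_pathways, -out$abs_fc), ]
}

# Stage 3 (Storyteller): combine observations and candidate drivers into text.
agent_storyteller <- function(df, drivers, n = 2) {
  paste0("What: ", paste(df$Description, collapse = "; "),
         ". How: candidate drivers ",
         paste(head(drivers$gene, n), collapse = ", "), ".")
}

kept <- agent_cleaner(enrich_df)
drivers <- agent_detective(kept, fold_change)
report <- agent_storyteller(kept, drivers)
cat(report, "\n")
```

Keeping each stage as a separate function means the output of one stage can be inspected before it is passed to the next.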

You can activate this deep mode using the interpret_agent() function.

#| eval: false
# Provide biological context to help Agent Cleaner
context <- "scRNA-seq analysis of CD8+ T cells in Tumor Microenvironment, comparing Exhausted vs. Naive states."

res <- interpret_agent(edo, context = context)

15.3 Knowledge-Guided Interpretation

Enrichment analysis often treats genes as a “bag of words,” ignoring their interactions and expression changes. The Knowledge-Guided Interpretation mode injects external knowledge to empower the AI’s reasoning.

15.3.1 PPI Networks and Hub Genes

Setting add_ppi = TRUE makes the system fetch protein-protein interaction data from STRING. The AI can then identify functional modules (e.g., a TCR signaling complex) rather than just isolated genes.
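
To make the idea of a hub gene concrete, the sketch below computes node degree from a small undirected edge list in base R. The edge list is illustrative rather than fetched from STRING, and the degree cutoff is arbitrary; how the package itself scores hubs is not specified here.

```r
# Toy undirected PPI edge list (illustrative pairs, not a STRING download).
ppi <- data.frame(
  from = c("CD3E", "CD3E", "CD3E", "LCK", "ZAP70"),
  to   = c("CD3D", "LCK", "ZAP70", "ZAP70", "LAT"),
  stringsAsFactors = FALSE
)

# Degree = number of edges touching each gene.
degree <- sort(table(c(ppi$from, ppi$to)), decreasing = TRUE)

# Call a gene a hub when its degree reaches a chosen cutoff (arbitrary here).
hub_cutoff <- 3
hubs <- names(degree)[degree >= hub_cutoff]
print(hubs)
```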

15.4 Reference-Guided Interpretation

For cell type annotation, LLMs can sometimes hallucinate labels that the data do not support. To reduce this risk, clusterProfiler supports Reference-Guided Interpretation.

15.4.1 Prior Knowledge Injection

You can provide “prior knowledge” (e.g., results from SingleR, scGPT, or manual rough annotation) to the AI. The AI acts as a validator and refiner:

  • Validation: Checks if pathway evidence supports the prior label.
  • Refinement: Refines a broad label (e.g., “T cell”) into a specific state (e.g., “Proliferating CD8+ T cell”) based on pathway activity.
  • Correction: Flags potential misannotations if the evidence contradicts the prior.

#| eval: false
# Prior knowledge from SingleR
my_priors <- c("Cluster1" = "T cells")

res <- interpret(edo, prior = my_priors, task = "annotation")

15.4.2 Hierarchical Interpretation

For complex datasets, interpret_hierarchical() mimics the human thought process of annotating major lineages first (e.g., Myeloid) and then subtypes (e.g., M1 Macrophage). It enforces lineage constraints to prevent impossible annotations (e.g., a T cell subtype appearing within a Myeloid cluster).

This approach is also highly applicable to Single-cell Trajectory Inference. Developmental processes inherently follow a hierarchical structure (e.g., Stem Cell -> Progenitor -> Terminally Differentiated Cell). By utilizing this hierarchical relationship, interpret_hierarchical() can provide context-aware interpretations that respect the biological differentiation path, ensuring that downstream states are interpreted within the context of their upstream progenitors.

#| eval: false
# Mapping between minor and major clusters
cluster_mapping <- c(
  "SubCluster1_1" = "MajorCluster1", 
  "SubCluster1_2" = "MajorCluster1",
  "SubCluster2_1" = "MajorCluster2"
)

# Hierarchical interpretation
res_hier <- interpret_hierarchical(
    x_minor = enrich_minor, 
    x_major = enrich_major, 
    mapping = cluster_mapping
)
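
The lineage constraint can be pictured as a lookup that rejects subtype labels falling outside the parent lineage. The sketch below is a minimal illustration in base R: the allowed-label table, the lineage calls and the candidate labels are all hypothetical, and it does not reproduce how interpret_hierarchical() works internally.

```r
# Hypothetical table of subtype labels permitted under each major lineage.
allowed <- list(
  Myeloid  = c("Classical Monocyte", "Non-classical Monocyte", "M1 Macrophage"),
  Lymphoid = c("Naive T Cell", "CD8+ Cytotoxic T Cell", "NK cell")
)

# Lineage call per major cluster, and the minor-to-major mapping.
major_label <- c(MajorCluster1 = "Myeloid", MajorCluster2 = "Lymphoid")
cluster_mapping <- c(
  SubCluster1_1 = "MajorCluster1",
  SubCluster1_2 = "MajorCluster1",
  SubCluster2_1 = "MajorCluster2"
)

# Candidate subtype labels proposed for each minor cluster.
candidate <- c(
  SubCluster1_1 = "M1 Macrophage",
  SubCluster1_2 = "Naive T Cell",  # conflicts with the Myeloid parent
  SubCluster2_1 = "NK cell"
)

# Look up each minor cluster's parent lineage, then test the candidate label.
lineage <- unname(major_label[cluster_mapping[names(candidate)]])
is_consistent <- mapply(
  function(lab, lin) lab %in% allowed[[lin]],
  candidate, lineage
)
print(is_consistent)
```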

15.5 Gene-Based Fallback Mode

In real-world research, enrichment analysis sometimes fails to return significant pathways due to small gene sets or background noise.

clusterProfiler introduces a Gene-Based Fallback Mode. When no enriched pathways are found, the agent does not simply error out. Instead, it:

  1. Directly analyzes the function of the input genes.
  2. Retrieves PPI networks for these genes.
  3. Infers biological function based on gene connectivity and function, providing a “Medium Confidence” report instead of an empty result.

#| eval: false
# When enrichment fails, this still works
res <- interpret_agent(tough_genes, add_ppi = TRUE)
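
The decision behind the fallback can be sketched as a simple check on the enrichment table. The helper below, choose_mode(), is hypothetical and only illustrates the branching; the cutoff and the returned labels are assumptions, not values taken from the package.

```r
# Hypothetical helper: pick the interpretation mode from an enrichment table.
choose_mode <- function(enrich_df, cutoff = 0.05) {
  has_terms <- !is.null(enrich_df) &&
    nrow(enrich_df) > 0 &&
    any(enrich_df$p.adjust < cutoff, na.rm = TRUE)
  if (has_terms) "pathway-based" else "gene-based fallback"
}

# An empty result triggers the fallback; a significant term does not.
empty_res <- data.frame(Description = character(0), p.adjust = numeric(0))
some_res  <- data.frame(Description = "T cell activation", p.adjust = 0.01)

choose_mode(empty_res)
choose_mode(some_res)
```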

15.6 Visualization: From Story to Figure

Text reports are great, but graphical representations are often preferred for presentations. The plot() method for interpretation results uses ggtangle (a grammar of graphics for networks) to visualize the AI-inferred regulatory network.

The resulting plot highlights:

  • Key Drivers: Central nodes identified by the AI.
  • Activation (Green) / Inhibition (Red): Regulatory relationships inferred from data.
  • Interactions (Grey): Physical associations.

#| eval: false
# Visualize the interpretation result
plot(res)

This feature creates a closed loop: from Enrichment (Statistics) to Interpretation (Insight) and finally to Visualization (Communication).