12  Biological theme comparison

The clusterProfiler package was developed for biological theme comparison (Yu et al. 2012; Wu et al. 2021), and it provides a function, compareCluster, to automatically calculate enriched functional profiles of each gene clusters and aggregate the results into a single object. Comparing functional profiles can reveal functional consensus and differences among different experiments and helps in identifying differential functional modules in omics datasets.

12.1 Comparing multiple gene lists

The compareCluster() function applies selected function (via the fun parameter) to perform enrichment analysis for each gene list.

data(gcSample)
str(gcSample) 
List of 8
 $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
 $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
 $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
 $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
 $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
 $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
 $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
 $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...

Users can use a named list of gene IDs as the input that passed to the geneCluster parameter.

ck <- compareCluster(geneCluster = gcSample, fun = enrichGO, 
                    OrgDb = org.Hs.eg.db, ont="BP")
ck <- setReadable(ck, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
head(ck) 
  Cluster         ID                               Description GeneRatio
1      X2 GO:0140014                  mitotic nuclear division    36/761
2      X2 GO:0000070      mitotic sister chromatid segregation    28/761
3      X2 GO:0045787         positive regulation of cell cycle    34/761
4      X2 GO:0000819              sister chromatid segregation    29/761
5      X2 GO:0090068 positive regulation of cell cycle process    29/761
6      X2 GO:0042113                         B cell activation    28/761
    BgRatio RichFactor FoldEnrichment   zScore       pvalue     p.adjust
1 281/18842  0.1281139       3.172039 7.525809 9.189373e-10 5.403351e-06
2 205/18842  0.1365854       3.381789 7.034372 1.678353e-08 4.934359e-05
3 312/18842  0.1089744       2.698154 6.205112 1.559847e-07 3.057299e-04
4 246/18842  0.1178862       2.918806 6.214725 2.445215e-07 3.594466e-04
5 252/18842  0.1150794       2.849311 6.063223 4.079352e-07 3.995299e-04
6 238/18842  0.1176471       2.912886 6.092666 4.100013e-07 3.995299e-04
        qvalue
1 5.403351e-06
2 4.934359e-05
3 3.057299e-04
4 3.594466e-04
5 3.995299e-04
6 3.995299e-04
                                                                                                                                                                                                                    geneID
1 BIRC5/BUB1B/NCAPG/PSRC1/CLASP2/CENPA/BORA/CUL3/KIF2C/NCAPG2/SMC2/CDC20/KIF18B/MKI67/CUL9/KIF11/CHEK2/BUB1/ZWILCH/APC/IGF1/PIBF1/NDE1/NCAPD2/SH2B1/KIF4A/SPDL1/SPAG5/CDC14B/PKMYT1/DLGAP5/KNTC1/KIF14/NCAPD3/NUSAP1/ESPL1
2                                               BIRC5/BUB1B/NCAPG/PSRC1/CLASP2/CENPA/CUL3/KIF2C/NCAPG2/SMC2/CDC20/KIF18B/KIF11/CHEK2/BUB1/ZWILCH/APC/PIBF1/NCAPD2/KIF4A/SPDL1/SPAG5/DLGAP5/KNTC1/KIF14/NCAPD3/NUSAP1/ESPL1
3                      BIRC5/NCAPG/CUL3/NCAPG2/TBX2/SMC2/CDC20/MSX1/RAD51AP1/CDC7/CCNE2/CCPG1/BUB1/DBF4/IGF1/RARA/EZH2/EGFR/SH2B1/GPSM2/CDC6/SPAG5/EXOC7/CDC14B/DLGAP5/PLSCR1/SMC5/HSF1/KIF14/TP63/NUSAP1/STIL/MEIS2/ESPL1
4                                          BIRC5/BUB1B/NCAPG/PSRC1/CLASP2/CENPA/CUL3/KIF2C/NCAPG2/SMC2/CDC20/KIF18B/KIF11/CHEK2/BUB1/ZWILCH/APC/PIBF1/NCAPD2/CDC6/KIF4A/SPDL1/SPAG5/DLGAP5/KNTC1/KIF14/NCAPD3/NUSAP1/ESPL1
5                                                 BIRC5/NCAPG/CUL3/NCAPG2/TBX2/SMC2/CDC20/RAD51AP1/CDC7/CCNE2/BUB1/DBF4/IGF1/EZH2/EGFR/SH2B1/GPSM2/CDC6/SPAG5/EXOC7/CDC14B/DLGAP5/PLSCR1/SMC5/KIF14/TP63/NUSAP1/STIL/ESPL1
6                                                       HHEX/LYN/SLAMF8/IL27RA/POU2AF1/MALT1/SMC2/TNFAIP3/LAX1/MEF2C/THEMIS2/PRLR/CD86/RABL3/CDKN2A/ZBTB7A/PARP3/EZH2/CD27/NBN/SYK/CD79A/CD40/EPHB2/RIF1/IL6ST/HDAC4/SFRP1
  Count
1    36
2    28
3    34
4    29
5    29
6    28

12.2 Formula interface of compareCluster

As an alternative to using named list, the compareCluster() function also supports passing a formula to describe more complicated experimental designs (e.g., \(Gene \sim time + treatment\)).

data(geneList, package='DOSE')
mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"

formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, 
                            fun=enrichGO, OrgDb = org.Hs.eg.db, ont="BP")
  
head(formula_res)
          Cluster         group othergroup         ID
1 downregulated.A downregulated          A GO:0030198
2 downregulated.A downregulated          A GO:0043062
3 downregulated.A downregulated          A GO:0045229
4 downregulated.A downregulated          A GO:0001501
5 downregulated.A downregulated          A GO:0008015
6 downregulated.A downregulated          A GO:0030199
                                    Description GeneRatio   BgRatio RichFactor
1             extracellular matrix organization    38/631 278/18842  0.1366906
2          extracellular structure organization    38/631 279/18842  0.1362007
3 external encapsulating structure organization    38/631 280/18842  0.1357143
4                   skeletal system development    35/631 341/18842  0.1026393
5                             blood circulation    42/631 480/18842  0.0875000
6                  collagen fibril organization    14/631  65/18842  0.2153846
  FoldEnrichment   zScore       pvalue     p.adjust       qvalue
1       4.081656 9.635419 1.617110e-13 3.986987e-10 3.986987e-10
2       4.067027 9.607167 1.814206e-13 3.986987e-10 3.986987e-10
3       4.052502 9.579047 2.034177e-13 3.986987e-10 3.986987e-10
4       3.064865 7.162604 4.392006e-09 6.456248e-06 6.456248e-06
5       2.612797 6.662545 1.479521e-08 1.739917e-05 1.739917e-05
6       6.431501 8.165131 2.435878e-08 2.387161e-05 2.387161e-05
                                                                                                                                                                                                            geneID
1              1805/1281/10266/2331/2200/2621/8292/4035/1294/1513/3908/2006/7373/7049/7177/7075/3913/4856/1842/11117/4319/1471/10218/10631/1300/1296/11096/4016/1289/4313/1290/3249/50509/1287/165/2192/81035/2191
2              1805/1281/10266/2331/2200/2621/8292/4035/1294/1513/3908/2006/7373/7049/7177/7075/3913/4856/1842/11117/4319/1471/10218/10631/1300/1296/11096/4016/1289/4313/1290/3249/50509/1287/165/2192/81035/2191
3              1805/1281/10266/2331/2200/2621/8292/4035/1294/1513/3908/2006/7373/7049/7177/7075/3913/4856/1842/11117/4319/1471/10218/10631/1300/1296/11096/4016/1289/4313/1290/3249/50509/1287/165/2192/81035/2191
4                           1281/81029/5125/2200/2487/4653/2202/5364/3952/1746/2121/658/4856/7481/4982/2697/3479/284266/2690/1300/63923/1909/1289/7704/55112/1009/5744/7227/4488/50509/6926/116039/85477/6424/1462
5 5348/5138/8864/5350/26509/775/18/10266/2946/7122/5125/2200/4922/9607/185/1359/3778/3952/2006/3357/6833/6505/7349/7220/6444/9365/857/3751/11117/6387/23171/2697/9590/6863/290/1909/55800/4313/23327/477/63895/150
6                                                                                                                                           1805/1281/2331/2200/1294/7373/4856/11117/4016/1289/1290/50509/1287/165
  Count
1    38
2    38
3    38
4    35
5    42
6    14

12.3 Functional analysis of single-cell marker genes

The compareCluster() function is highly versatile and can be directly integrated into single-cell analysis workflows. For example, after identifying marker genes for each cluster using Seurat, the results can be directly used for functional enrichment analysis.

12.3.1 Using Seurat results

Typically, FindAllMarkers() from Seurat returns a data frame where rows are genes and columns include cluster information and statistical metrics.

library(Seurat)
library(SeuratData)
library(dplyr)
library(clusterProfiler)

# Load example data
data("pbmc3k")
sce <- pbmc3k.final

# Identify marker genes
sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, 
                              min.pct = 0.25, 
                              thresh.use = 0.25)

# Filter markers
markers <- sce.markers |> 
    group_by(cluster) |> 
    filter(p_val_adj < 0.001) |> 
    ungroup()

# ID conversion (if needed)
gid <- bitr(unique(markers$gene), 'SYMBOL', 'ENTREZID', OrgDb = 'org.Hs.eg.db')
markers <- full_join(markers, gid, by = c('gene' = 'SYMBOL'))

# Perform comparison using formula interface
x <- compareCluster(ENTREZID ~ cluster, data = markers, fun = 'enrichKEGG')

# Visualization
dotplot(x, label_format = 40) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

12.3.2 Using COSG results

Methods like COSG return marker genes as a list or data frame where columns represent clusters. Since a data frame is essentially a list of equal-length vectors, it can be passed directly to compareCluster().

library(COSG)

# Identify markers using COSG
marker_cosg <- cosg(
    sce,
    groups = 'all',
    assay = 'RNA',
    slot = 'data',
    mu = 1,
    n_genes_user = 100
)

# The first element is a data frame of gene symbols
# Columns correspond to clusters
head(marker_cosg[[1]])

# Directly use the data frame for enrichment
y <- compareCluster(marker_cosg[[1]], 
                    fun = 'enrichGO', 
                    OrgDb = 'org.Hs.eg.db', 
                    keyType = 'SYMBOL', 
                    ont = "MF")

# Visualization
dotplot(y, label_format = 60) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

This demonstrates that compareCluster() can seamlessly handle various data structures commonly produced by single-cell analysis tools, simplifying the downstream functional interpretation.

12.4 Visualization of functional profile comparison

12.4.1 Dot plot

We can visualize the result using the dotplot() method.

dotplot(ck)
dotplot(formula_res)
Figure 12.1: Comparing enrichment results of multiple gene lists. (A) Using a named list of gene clusters, the results were displayed as multiple columns with each one represents an enrichment result of a gene cluster. (B) Using formula interface, the columns represent gene clusters defined by the formula.

The formula interface allows more complicated gene cluster definition. In Figure 12.1(B), the gene clusters were defined by two variables (i.e. group that divides genes into upregulated and downregulated and othergroup that divides the genes into two categories of A and B.). The dotplot() function allows us to use one variable to divide the result into different facet and plot the result with other variables in each facet panel (Figure 12.2).

dotplot(formula_res, x="group") + facet_grid(~othergroup)
Figure 12.2: Comparing enrichment results of multiple gene lists defined by multiple variable. The results were visualized as a dot plot with an x-axis representing one level of conditions (the group variable) and a facet panel indicating another level of conditions (the othergroup variable).

By default, only the top 5 (most significant) categories of each cluster are shown. Users can change showCategory to specify how many categories should be displayed for each cluster, and if showCategory is set to NULL, the whole result will be plotted. The showCategory parameter also accepts a character vector of selected categories to plot pathways of interest.

dotplot() tries to make the comparison among different clusters more informative and reasonable. After extracting, for example, 10 categories for each cluster, clusterProfiler collects overlap of these categories among clusters. For example, term A may be enriched in all gene clusters (e.g., g1 and g2) and ranked among the 10 most significant categories of g1 but not of g2. In this case, clusterProfiler will still include term A in the g2 cluster to make the comparison more reasonable.

This feature ensures that if a term is significant and selected for one cluster, its statistics in other clusters are also displayed, allowing for a valid comparison. Consequently, the number of categories shown for some clusters might exceed the showCategory limit (e.g., 15 categories shown for g2 while showCategory is 10). Disabling this by setting includeAll = FALSE in dotplot() may result in a plot that suggests zero overlap between clusters, which can be misleading (see Figure 12.3 A vs B) and is not recommended.

The same selection logic is used by cnetplot() for compareCluster() results. In both functions, a numeric showCategory selects the top terms within each cluster by default, while includeAll controls whether matched terms from other clusters should be retained for comparison.

p1 <- dotplot(ck, showCategory=5, includeAll=FALSE) + ggtitle("includeAll=FALSE")
p2 <- dotplot(ck, showCategory=5) + ggtitle("default")
plot_list(p1, p2, ncol=2, tag_levels='A')
Figure 12.3: Comparison of dotplot with and without includeAll parameter. (A) includeAll=FALSE produces a plot that strictly follows showCategory but may hide overlapping terms. (B) The default behavior (includeAll=TRUE) recovers the missing overlap information, making the comparison more reasonable.

The dotplot() function accepts a parameter size for setting the scale of dot sizes. The default parameter size is set to geneRatio, which corresponds to the GeneRatio column of the output. If it is set to count, the comparison will be based on gene counts, while if set to rowPercentage, the dot sizes will be normalized by count/(sum of each row). Users can also map the dot size to other variables or derived variables (see Chapter 16).

To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when by is set to rowPercentage and number of gene clusters in each cluster label (numbers in parentheses) when by is set to geneRatio, as shown in Figure 12.1.

The p-values indicate which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding adjusted p-values. Color gradient ranging from red to blue corresponds to the order of increasing adjusted p-values. That is, red indicates low p-values (high enrichment), and blue indicates high p-values (low enrichment). Adjusted p-values were filtered out by the threshold given by the parameter pvalueCutoff, and FDR can be estimated by qvalue.

12.4.2 Gene-Concept Network

The cnetplot also works with compareCluster() result. For compareClusterResult objects, enriched terms are drawn as pies so that the composition of each term across gene clusters can be displayed directly in the network. Like dotplot(), a numeric showCategory selects the top enriched terms within each cluster by default, and includeAll = TRUE keeps comparable terms from other clusters in the plot.

cluster_cols <- setNames(
    c("#0072B2", "#D55E00")[seq_along(unique(as.data.frame(ck)$Cluster))],
    unique(as.data.frame(ck)$Cluster)
)

p1 <- cnetplot(ck, showCategory = 3)
p2 <- cnetplot(ck, showCategory = 3, includeAll = FALSE)
p3 <- cnetplot(ck, showCategory = 3, categorySizeBy = ~ -log10(p.adjust))
p4 <- cnetplot(ck, showCategory = 3, categorySizeBy = 2) +
    scale_fill_manual(values = cluster_cols)

plot_list(p1, p2, p3, p4, ncol = 2, tag_levels = "A")
Figure 12.4: cnetplot() for comparing functional profiles of multiple gene clusters. Default per-cluster selection (A), includeAll = FALSE (B), categorySizeBy = ~ -log10(p.adjust) (C), and custom pie colors (D).

Compared with ordinary enrichment results, there are several useful parameters to keep in mind for compareCluster() output:

  • showCategory = 3 means that the top 3 enriched terms are selected within each cluster by default.
  • showCategory can also be a character vector if you want to display a manually selected set of terms.
  • includeAll = TRUE keeps comparable terms that also appear in other clusters; set includeAll = FALSE if you only want the directly selected terms.
  • categorySizeBy controls the size of category pies and supports expressions such as ~itemNum, ~p.adjust, or ~ -log10(p.adjust).
  • Pie colors are mapped to Cluster, so custom cluster colors can be supplied with ggplot2::scale_fill_manual().

As in dotplot(), the split parameter can be used when term selection should be carried out within another grouping variable rather than only by cluster. For example, when GO enrichment is performed with ont = "ALL", users can select the top terms within each ontology:

ck_all <- compareCluster(
    geneCluster = gcSample,
    fun = enrichGO,
    OrgDb = org.Hs.eg.db,
    ont = "ALL"
)
ck_all <- setReadable(ck_all, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")

cnetplot(ck_all, showCategory = 3, split = "ONTOLOGY")

12.5 Summary

The comparison function was designed as a framework for comparing gene clusters of any kind of ontology associations, not only groupGO, enrichGO, enrichKEGG, enrichMKEGG, enrichWP and enricher that were provided in this package, but also other biological and biomedical ontologies, including but not limited to enrichPathway, enrichDO, enrichNCG, enrichDGN and enrichMeSH.

In (Yu et al. 2012), we analyzed the publicly available expression dataset of breast tumor tissues from 200 patients (GSE11121, Gene Expression Omnibus) (Schmidt et al. 2008). We identified 8 gene clusters from differentially expressed genes, and used the compareCluster() function to compare these gene clusters by their enriched biological process. In (Wu et al. 2021), we analyzed the GSE8057 dataset which contains expression data from ovarian cancer cells at multiple time points and under two treatment conditions. Eight groups of DEG lists were analyzed simultaneously using compareCluster() with WikiPathways. The result indicates that the two drugs have distinct effects at the beginning but consistent effects in the later stages (Fig. 4 of (Wu et al. 2021)).

References

Schmidt, Marcus, Daniel B?hm, Christian von T?rne, et al. 2008. “The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer.” Cancer Research 68 (13): 5405–13. https://doi.org/10.1158/0008-5472.CAN-07-5206.
Wu, Tianzhi, Erqiang Hu, Shuangbin Xu, et al. 2021. clusterProfiler 4.0: A Universal Enrichment Tool for Interpreting Omics Data.” The Innovation 2 (3): 100141. https://doi.org/10.1016/j.xinn.2021.100141.
Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.