Updating the KEGG category data

Dear clusterProfiler maintainers,

First, thank you for developing and maintaining this incredibly useful package! I’ve been using clusterProfiler extensively for enrichment analysis and it’s been invaluable for my research.

I’m encountering an issue with KEGG enrichment analysis where some pathways show NA values for both category and subcategory fields, even though the pathways themselves are valid and statistically significant.

Description

When performing KEGG enrichment analysis using enrichKEGG followed by setReadable, some pathways show NA values for both category and subcategory fields, even though the pathways themselves are valid and significant.

Problem

The pathways mmu04082 (Neuroactive ligand signaling) and mmu04517 (IgSF CAM signaling) return NA for the category and subcategory columns in the results, while other pathways have proper classification information.

Questions

Is this a known issue with specific KEGG pathways?

Is there a way to automatically retrieve the missing category information?

Should the category information be populated from a different source or database?

I’ve verified that these pathways exist in the KEGG database and are valid, but the classification metadata seems to be missing in the enrichment results.

Any guidance on how to resolve this would be greatly appreciated. Thank you for your time and continued support of this excellent package!

Best regards,

Sophia

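library(clusterProfiler)
library(org.Mm.eg.db)

# GeneClusterDF: data frame with ENTREZID and cluster columns
formula <- compareCluster(ENTREZID~cluster, data=GeneClusterDF,
                          fun='enrichKEGG',
                          organism = "mmu",  # KEGG organism code for mouse
                          pvalueCutoff=0.05,
                          pAdjustMethod = "BH")
# Map Entrez IDs in the geneID column to gene symbols
formula <- setReadable(formula,
                       OrgDb = "org.Mm.eg.db",
                       keyType = "ENTREZID")
head(as.data.frame(formula))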
> head(as.data.frame(formula))
  Cluster cluster                             category         subcategory       ID                  Description GeneRatio   BgRatio RichFactor FoldEnrichment
1       0       0                                 <NA>                <NA> mmu04082 Neuroactive ligand signaling    14/113 196/10632 0.07142857       6.720607
2       0       0                                 <NA>                <NA> mmu04517           IgSF CAM signaling    16/113 300/10632 0.05333333       5.018053
3       0       0                   Organismal Systems      Nervous system mmu04724        Glutamatergic synapse    10/113 117/10632 0.08547009       8.041752
4       0       0 Environmental Information Processing Signal transduction mmu04020    Calcium signaling pathway    13/113 255/10632 0.05098039       4.796668
5       0       0                   Organismal Systems      Nervous system mmu04727            GABAergic synapse     8/113  91/10632 0.08791209       8.271516
6       0       0 Environmental Information Processing Signal transduction mmu04024       cAMP signaling pathway    11/113 224/10632 0.04910714       4.620417
    zScore       pvalue     p.adjust       qvalue                                                                                              geneID Count
1 8.378035 1.822353e-08 4.355424e-06 3.356966e-06          Adcy2/Gabbr2/Gabra2/Gabrb1/Gnao1/Gria2/Grin2c/Grm3/Hrh1/Plcb1/Slc1a2/Slc1a3/Slc6a1/Slc6a11    14
2 7.316833 1.016818e-07 1.215097e-05 9.365424e-06 Actr3b/Ank2/Cables1/Cadm1/Cadm2/Cntn1/Kcnq3/Kirrel3/Lrrc4c/Mapk10/Ncam1/Nrcam/Nrp1/Srgap1/Tjp1/Vav3    16
3 7.937944 4.238134e-07 3.376380e-05 2.602363e-05                                        Adcy2/Glul/Gnao1/Gria2/Grin2c/Grm3/Itpr2/Plcb1/Slc1a2/Slc1a3    10
4 6.360275 2.911406e-06 1.739565e-04 1.340779e-04                     Adcy2/Atp2b2/Camk1d/Camk2g/Fgf1/Fgfr3/Grin2c/Hrh1/Itpr2/Phka1/Phkg1/Plcb1/Vegfa    13
5 7.220110 5.181671e-06 2.476839e-04 1.909037e-04                                                Adcy2/Gabbr2/Gabra2/Gabrb1/Glul/Gnao1/Slc6a1/Slc6a11     8
6 5.675943 2.505014e-05 9.978306e-04 7.690833e-04                               Adcy2/Atp1a2/Atp2b2/Camk2g/Gabbr2/Gli3/Gria2/Grin2c/Mapk10/Ptch1/Vav3    11
> sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22000)
 
Matrix products: default
  LAPACK version 3.12.1
 
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8  LC_CTYPE=Chinese (Simplified)_China.utf8    LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                                LC_TIME=Chinese (Simplified)_China.utf8    
 
time zone: Asia/Shanghai
tzcode source: internal
 
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
 
other attached packages:
 [1] dplyr_1.1.4            ReactomePA_1.54.0      GseaVis_0.0.5          pathview_1.50.0        DOSE_4.4.0             enrichplot_1.30.0     
 [7] ggplot2_4.0.1          msigdbr_25.1.1         org.Mm.eg.db_3.22.0    AnnotationDbi_1.72.0   IRanges_2.44.0         S4Vectors_0.48.0      
[13] Biobase_2.70.0         BiocGenerics_0.56.0    generics_0.1.4         clusterProfiler_4.18.0
 
loaded via a namespace (and not attached):
  [1] pheatmap_1.0.13         pak_0.9.0               DBI_1.2.3               httr_1.4.7              BiocParallel_1.44.0     yulab.utils_0.2.1      
  [7] ggplotify_0.1.3         babelgene_22.9          pillar_1.11.1           Rgraphviz_2.54.0        R6_2.6.1                mime_0.13              
 [13] reticulate_1.44.1       viridis_0.6.5           ROCR_1.0-11             graphite_1.56.0         S7_0.2.1                parallelly_1.45.1      
 [19] GlobalOptions_0.1.2     polyclip_1.10-7         htmltools_0.5.8.1       remotes_2.5.0           ggrepel_0.9.6           fgsea_1.36.0           
 [25] forcats_1.0.1           spatstat.utils_3.2-0    fitdistrplus_1.2-4      tidyselect_1.2.1        utf8_1.2.6              RSQLite_2.4.4          
 [31] cowplot_1.2.0           scattermore_1.2         sessioninfo_1.2.3       spatstat.data_3.1-9     gridExtra_2.3           fs_1.6.6               
 [37] sctransform_0.4.2       RColorBrewer_1.1-3      future.apply_1.20.0     graph_1.88.0            R.oo_1.27.1             RcppHNSW_0.6.0         
 [43] reactome.db_1.94.0      Rtsne_0.17              lazyeval_0.2.2          scales_1.4.0            treeio_1.34.0           R.utils_2.13.0         
 [49] KEGGgraph_1.70.0        bitops_1.0-9            R.methodsS3_1.8.2       KEGGREST_1.50.0         promises_1.5.0          shape_1.4.6.1          
 [55] zoo_1.8-14              RSpectra_0.16-2         assertthat_0.2.1        tools_4.5.2             ape_5.8-1               shiny_1.11.1           
 [61] rlang_1.1.6             ggridges_0.5.7          evaluate_1.0.5          otel_0.2.0              reshape2_1.4.5          devtools_2.4.6         
 [67] colorspace_2.1-2        ellipsis_0.3.2          data.table_1.17.8       withr_3.0.2             tibble_3.3.0            RCurl_1.98-1.17        
 [73] xtable_1.8-4            plyr_1.8.9              aplot_0.2.9             systemfonts_1.3.1       httpuv_1.6.16           MASS_7.3-65            
 [79] stringr_1.6.0           openxlsx_4.2.8.1        GO.db_3.22.0            vctrs_0.6.5             lifecycle_1.0.4         codetools_0.2-20       
 [85] fastDummies_1.7.5       nlme_3.1-168            Seqinfo_1.0.0           future_1.68.0           pkgload_1.4.1           Rcpp_1.1.0             
 [91] rstudioapi_0.17.1       patchwork_1.3.2         stringi_1.8.7           pbapply_1.7-4           cachem_1.1.0            BiocManager_1.30.27    
 [97] tidytree_0.4.6          listenv_0.10.0          XVector_0.50.0          plotly_4.11.0           ggtree_4.0.1            pkgbuild_1.4.8         
[103] ggfun_0.2.0             ggtangle_0.0.8          htmlwidgets_1.6.4       memoise_2.0.1           crayon_1.5.3            gridGraphics_0.5-1     
[109] rappdirs_0.3.3          GOSemSim_2.36.0         png_0.1-8               progressr_0.18.0        fastmap_1.2.0           tidygraph_1.3.1        
[115] tidyr_1.3.1             pkgconfig_2.0.3         cli_3.6.5               ggforce_0.5.0           ggiraph_0.9.2           lmtest_0.9-40          
[121] usethis_3.2.1           RcppAnnoy_0.0.22        gdtools_0.4.4           viridisLite_0.4.2       splines_4.5.2           blob_1.2.4             
[127] XML_3.99-0.20           globals_0.18.0          knitr_1.50              ica_1.0-3               spam_2.11-1             dichromat_2.0-0.1      
[133] compiler_4.5.2          grid_4.5.2              bit_4.6.0               ggpp_0.5.9              glue_1.8.0              sp_2.2-0               
[139] digest_0.6.38           irlba_2.3.5.1           graphlayouts_1.2.2      fontLiberation_0.1.0    fontBitstreamVera_0.1.1 dotCall64_1.2          
[145] tweenr_2.0.3            lattice_0.22-7          ggraph_2.2.2            gson_0.1.0              igraph_2.2.1            ggnewscale_0.5.2       
[151] qvalue_2.42.0           later_1.4.4             parallel_4.5.2          fontquiver_0.2.1        miniUI_0.1.2            gtable_0.3.6           
[157] xfun_0.54               Biostrings_2.78.0       curl_7.0.0              org.Hs.eg.db_3.22.0     KernSmooth_2.23-26      survival_3.8-3         
[163] jsonlite_2.0.0          magrittr_2.0.4          purrr_1.2.0             matrixStats_1.5.0       Matrix_1.7-4            SeuratObject_5.2.0     
[169] fastmatch_1.1-6         RANN_2.6.2              circlize_0.4.16         polynom_1.4-1           bit64_4.6.0-1           cluster_2.1.8.1        
[175] farver_2.1.2            zip_2.3.3

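A soul-searching interrogation from a user: three questions in a row, so this person has clearly done some digging first. I took a look. The KEGG category mapping is stored inside the package, and when the new Bioconductor 3.22 release came out, I forgot to update the data.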

KEGG data is fetched online anyway, so why pre-store this piece? Wouldn't it be nicer to scrape it online as well? I do have my reasons.

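First, the category classification is the same no matter which species you work with, so it is universal. Everyone uses the same table, which makes it a good fit for storing a single copy.
Second, I scrape the category information with a crawler written with rvest, and I don't want to put that into the clusterProfiler code, because doing so would add another dependency.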
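
To illustrate why one stored table can serve every organism, here is a minimal sketch of a lookup keyed on the numeric part of the pathway ID. The table `kegg_category_demo` and the helper `add_kegg_category()` are hypothetical names made up for this example, not clusterProfiler internals; the category values are copied from the output shown in this post.

```r
# Hypothetical, hand-written stand-in for the stored category table.
# Keys are the organism-independent numeric pathway IDs.
kegg_category_demo <- data.frame(
  id          = c("04724", "04020"),
  category    = c("Organismal Systems", "Environmental Information Processing"),
  subcategory = c("Nervous system", "Signal transduction"),
  stringsAsFactors = FALSE
)

# Attach category/subcategory to a result table by stripping the
# organism prefix (e.g. "mmu", "hsa") from each pathway ID.
add_kegg_category <- function(result, category_table) {
  key <- sub("^[a-z]+", "", result$ID)
  idx <- match(key, category_table$id)
  result$category    <- category_table$category[idx]
  result$subcategory <- category_table$subcategory[idx]
  result
}

res <- data.frame(ID = c("mmu04724", "hsa04020", "mmu04082"),
                  stringsAsFactors = FALSE)
annotated <- add_kegg_category(res, kegg_category_demo)
print(annotated)
```

In this sketch a pathway that is absent from the stored table, such as mmu04082, comes back with NA in both columns, which matches the symptom the user reported when the stored copy was out of date.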

My reply to the user:

The KEGG pathway information is cached within the package. I apologize for not updating it before the latest release. Thank you for bringing this to my attention; the issue has now been resolved.

> head(x, 2)
                                     category                         subcategory       ID                  Description GeneRatio   BgRatio RichFactor FoldEnrichment   zScore       pvalue     p.adjust       qvalue
mmu04082 Environmental Information Processing Signaling molecules and interaction mmu04082 Neuroactive ligand signaling     30/30 196/10632  0.1530612       54.24490 40.02145 9.349530e-54 1.084545e-51 4.527141e-52
mmu05032                       Human Diseases                Substance dependence mmu05032           Morphine addiction     13/30  93/10632  0.1397849       49.53978 25.00883 7.801411e-20 4.524818e-18 1.888763e-18
                                                                                                                                                                                             geneID Count
mmu04082 216227/11423/14678/14701/11549/14654/18750/57385/11539/15559/14683/11515/11541/104111/18442/242425/14806/13491/21334/63993/13488/213788/57014/108015/14396/21337/53623/110886/11513/210044    30
mmu05032

The message was left for me last night, and the wish was granted by morning.

I take the blame for this one.

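As mentioned above, there are reasons for pre-storing the data, but that also brings the problem of having to update it regularly. Sure enough, not long after this update, someone reported again that the data was out of date. Updating by hand like this wears me out and would wreck my hands in no time.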

As the saying goes, where there is oppression, there is resistance. We can't let it push us around.

My solution is GitHub Actions: write a workflow to do the job.

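Trigger: every Sunday at 00:00 UTC (cron: '0 0 * * 0'), or manually (workflow_dispatch).
Environment setup: automatically installs R and the required dependencies (including rvest, tibble, tidyr, and clusterProfiler itself).
Update step: after installing the current package, runs exactly the make updatedata logic I had already written.
Automatic PR: if any files change (updated .rda files), automatically creates a branch named update-kegg-data and opens a PR.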
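
The four points above could be sketched as a workflow file along these lines. This is an illustrative config, not the actual file in the clusterProfiler repository: the file path, workflow name, job name, PR title, and action versions are assumptions, and it relies on the community actions `r-lib/actions` and `peter-evans/create-pull-request`. Only the schedule, the package list, `make updatedata`, and the branch name come from the description.

```yaml
# .github/workflows/update-kegg-data.yml (hypothetical path and name)
name: update-kegg-data

on:
  schedule:
    - cron: '0 0 * * 0'   # every Sunday at 00:00 UTC
  workflow_dispatch:       # allow manual runs

permissions:
  contents: write
  pull-requests: write

jobs:
  update:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-r@v2

      # Install the scraper's dependencies plus the package itself
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rvest, any::tibble, any::tidyr, local::.

      # Regenerate the stored data with the existing Makefile target
      - name: Update data
        run: make updatedata

      # Opens a PR only when the working tree has changed
      - uses: peter-evans/create-pull-request@v6
        with:
          branch: update-kegg-data
          title: Update KEGG category data
          commit-message: Update KEGG category data
```

Because the last step does nothing when no files changed, a week without KEGG updates produces no PR at all.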

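I triggered it manually to see whether it works. Clearly it does: the pull request showed up as expected, so all I have to do is merge. The data now gets checked every week and, whenever something has changed, updated automatically.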

Well, there is always a way out in the end.

Follow-up

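In the end I'm still lazy. KEGG updates fairly often, and I don't want to merge by hand and then push to Bioconductor. So I separated the data from the package: the data is still pre-stored, but it lives on GitHub rather than inside the package, and GitHub Actions keeps it updated automatically. The package ships without the data and always fetches it from GitHub, so eventually I don't have to look after it at all.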
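
One way a package could load such externally hosted data is sketched below. The function `get_remote_rda()` is a hypothetical helper, not clusterProfiler's API, and the local cache with a fallback to the last downloaded copy is my own assumption added for robustness; the post only says the data is fetched from GitHub.

```r
# Sketch: load the objects stored in a remote .rda file, keeping a local copy.
# `url` is wherever the data is hosted; `max_age_days` controls re-download.
get_remote_rda <- function(url,
                           cache_dir = tools::R_user_dir("kegg-category-demo", "cache"),
                           max_age_days = 7) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))

  # Re-download when there is no cached copy or the copy is too old
  stale <- TRUE
  if (file.exists(dest)) {
    age_days <- as.numeric(difftime(Sys.time(), file.mtime(dest), units = "days"))
    stale <- age_days > max_age_days
  }

  if (stale) {
    tmp <- tempfile(fileext = ".rda")
    ok <- tryCatch(
      utils::download.file(url, tmp, mode = "wb", quiet = TRUE) == 0,
      error = function(e) FALSE,
      warning = function(w) FALSE
    )
    if (ok) {
      file.copy(tmp, dest, overwrite = TRUE)
    } else if (!file.exists(dest)) {
      stop("Could not download ", url, " and no cached copy exists")
    }
    # If the download failed but an old copy exists, fall through and use it
  }

  # Load into a private environment and return the objects as a named list
  env <- new.env()
  load(dest, envir = env)
  as.list(env)
}

# Usage with a placeholder URL (not a real location):
# kegg <- get_remote_rda("https://example.org/path/to/kegg_category.rda")
```

Returning a list rather than loading into the global environment keeps the helper free of side effects on the user's workspace.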

YGC
