WikiPathways is a continuously updated pathway database curated by a community of researchers and pathway enthusiasts. WikiPathways produces monthly releases of GMT files for supported organisms at data.wikipathways.org. The clusterProfiler package (Yu et al. 2012) supports enrichment analysis (either ORA or GSEA) for WikiPathways using the enrichWP() and gseWP() functions. These functions will automatically download and parse the latest WikiPathways GMT file for the selected organism.
#
# over-representation test
#
#...@organism Homo sapiens
#...@ontology WikiPathways
#...@keytype ENTREZID
#...@gene chr [1:207] "4312" "8318" "10874" "55143" "55388" "991" "6280" "2305" ...
#...pvalues adjusted by 'BH' with cutoff < 0.05
#...7 enriched terms found
'data.frame': 7 obs. of 12 variables:
$ ID : chr "WP2446" "WP2361" "WP3942" "WP179" ...
$ Description : chr "Retinoblastoma gene in cancer" "Gastric cancer network 1" "PPAR signaling" "Cell cycle" ...
$ GeneRatio : chr "11/122" "6/122" "7/122" "10/122" ...
$ BgRatio : chr "89/9008" "28/9008" "50/9008" "120/9008" ...
$ RichFactor : num 0.1236 0.2143 0.14 0.0833 0.2857 ...
$ FoldEnrichment: num 9.13 15.82 10.34 6.15 21.1 ...
$ zScore : num 9.03 9.2 7.76 6.66 8.82 ...
$ pvalue : num 2.67e-08 1.61e-06 4.33e-06 4.74e-06 2.89e-05 ...
$ p.adjust : num 2.13e-05 6.42e-04 9.45e-04 9.45e-04 4.60e-03 ...
$ qvalue : num 2.13e-05 6.42e-04 9.45e-04 9.45e-04 4.60e-03 ...
$ geneID : chr "6241/891/8318/7272/983/9133/1111/24137/890/81620/7153" "6286/4605/6790/11065/22974/7153" "5346/9415/5105/4312/2167/3158/9370" "891/8318/7272/9232/983/9133/4174/1111/991/890" ...
$ Count : int 11 6 7 10 4 7 8
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
6.1.2 WikiPathways gene set enrichment analysis
gseWP(geneList, organism ="Homo sapiens")
#
# Gene Set Enrichment Analysis
#
#...@organism Homo sapiens
#...@setType WikiPathways
#...@keytype ENTREZID
#...@geneList Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm 1000
#...pvalues adjusted by 'BH' with cutoff < 0.05
#...158 enriched terms found
'data.frame': 158 obs. of 11 variables:
$ ID : chr "WP2446" "WP179" "WP466" "WP2361" ...
$ Description : chr "Retinoblastoma gene in cancer" "Cell cycle" "DNA replication" "Gastric cancer network 1" ...
$ setSize : int 84 111 42 23 62 41 78 36 14 63 ...
$ enrichmentScore: num 0.731 0.663 0.792 0.837 0.665 ...
$ NES : num 2.88 2.75 2.74 2.52 2.52 ...
$ pvalue : num 3.09e-12 2.99e-12 2.81e-12 4.97e-09 3.01e-09 ...
$ p.adjust : num 8.20e-10 8.20e-10 8.20e-10 6.60e-07 4.79e-07 ...
$ qvalue : num 6.64e-10 6.64e-10 6.64e-10 5.35e-07 3.88e-07 ...
$ rank : int 1333 1234 1002 854 1111 1750 1627 2454 360 1961 ...
$ leading_edge : chr "tags=51%, list=11%, signal=46%" "tags=40%, list=10%, signal=36%" "tags=55%, list=8%, signal=51%" "tags=52%, list=7%, signal=49%" ...
$ core_enrichment: chr "8318/9133/7153/6241/890/983/81620/7272/1111/891/24137/993/898/4998/10733/9134/4175/4173/6502/5984/994/7298/3015"| __truncated__ "8318/991/9133/890/983/7272/1111/891/4174/9232/4171/993/990/5347/9700/898/23594/4998/9134/4175/4173/10926/6502/9"| __truncated__ "8318/55388/81620/4174/4171/990/23594/4998/4175/4173/10926/5984/5111/51053/8317/5427/23649/4176/5982/5557/5558/4172/5424" "4605/7153/11065/22974/6286/6790/1894/56992/4173/1063/9585/8607" ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
If your input gene ID type is not Entrez gene ID, you can use the bitr() function to convert gene ID. If you want to convert the gene IDs in output result to gene symbols, you can use the setReadable() function, see also Section 17.2.
6.2 DAVID functional analysis
DAVID is a popular bioinformatics resource for functional annotation and enrichment analysis. Although we recommend using enrichGO and enrichKEGG which use up-to-date data maintained by Bioconductor, clusterProfiler provides enrichDAVID to support DAVID analysis for users who prefer it.
Users need to register an account on DAVID website and use the email address to access the web service.
As highlighted in a Nature Methods study (Wadi et al. (2016)), DAVID has a very low update frequency, often going more than 5 years between updates, leading to a continuous decline in knowledge completeness. This means analysis results may only reflect outdated biological knowledge, which could introduce biases in current research. As shown in previous comparisons, using clusterProfiler with Bioconductor annotation data often yields more comprehensive results (more annotated genes and enriched pathways) compared to DAVID.
6.3 Generalized Enrichment Analysis with GSON format
The gson (Gene Set Object Notation) format provides a standardized way to store gene set information along with metadata, making it easier to share and use custom gene sets for enrichment analysis. The gson package defines the GSON class to store this information.
6.3.1 Helper functions for GSON
The clusterProfiler and related packages provide a series of helper functions to download and convert data from various sources into GSON objects. These functions typically start with the gson_ prefix. Users can use tab completion in R (e.g., typing gson_ and pressing Tab) to explore available functions. Some common examples include:
gson_KEGG(): Download KEGG pathway/module data.
gson_WP(): Download WikiPathways data.
gson_GO(): Download Gene Ontology data.
Additionally, users can use the gson::gson() function to manually create GSON objects from their own data tables.
6.3.2 Analyzing multiple knowledge bases with gsonList
One of the powerful features of the GSON format is the ability to perform enrichment analysis on multiple knowledge bases simultaneously. The gsonList() function allows users to combine multiple GSON objects into a list, which can then be passed to the enricher() function.
library(gson)library(clusterProfiler)# Example: Prepare GSON objects (assuming you have downloaded or created them)# kegg_gson <- gson_KEGG(species = "hsa")# wp_gson <- gson_WP(species = "Homo sapiens")# go_gson <- gson_GO(OrgDb = org.Hs.eg.db, keyType = "ENTREZID", ont = "BP")# Combine them into a list# gsons <- gsonList(kegg = kegg_gson, wp = wp_gson, go = go_gson)# Perform enrichment analysis# res <- enricher(gene = geneList, gson = gsons)
The result of enricher() with a gsonList input is effectively a list of enrichment results, where each component corresponds to one of the input knowledge bases. You can access individual results by name or index.
6.3.3 Visualization with autofacet
To visualize the results from multiple knowledge bases together, enrichplot provides the autofacet() function. This function can be added to a dotplot to automatically facet the results by knowledge base.
# dotplot(res) + autofacet()
This approach allows for a comprehensive view of enrichment results across different databases, facilitating comparison and interpretation. For example, you might observe that GO terms cover a broader range of biological processes due to higher annotation coverage, while KEGG pathways might provide more specific mechanistic insights.
References
Wadi, Lina, Mona Meyer, Joel Weiser, Lincoln D. Stein, and Jüri Reimand. 2016. “Impact of Outdated Gene Annotations on Pathway Enrichment Analysis.”Nature Methods 13 (9): 705–6. https://doi.org/10.1038/nmeth.3963.
Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.”OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.