6  Analysis with Other Knowledge Bases

6.1 WikiPathways analysis

WikiPathways is a continuously updated pathway database curated by a community of researchers and pathway enthusiasts. WikiPathways produces monthly releases of GMT files for supported organisms at data.wikipathways.org. The clusterProfiler package (Yu et al. 2012) supports enrichment analysis (either ORA or GSEA) for WikiPathways using the enrichWP() and gseWP() functions. These functions will automatically download and parse the latest WikiPathways GMT file for the selected organism.

Supported organisms can be listed by:

library(clusterProfiler)
get_wp_organisms()
 [1] "Anopheles gambiae"        "Arabidopsis thaliana"    
 [3] "Bos taurus"               "Caenorhabditis elegans"  
 [5] "Canis familiaris"         "Danio rerio"             
 [7] "Drosophila melanogaster"  "Equus caballus"          
 [9] "Gallus gallus"            "Homo sapiens"            
[11] "Mus musculus"             "Pan troglodytes"         
[13] "Populus trichocarpa"      "Rattus norvegicus"       
[15] "Saccharomyces cerevisiae" "Solanum lycopersicum"    
[17] "Sus scrofa"               "Zea mays"                

6.1.1 WikiPathways over-representation analysis

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

enrichWP(gene, organism = "Homo sapiens") 
#
# over-representation test
#
#...@organism    Homo sapiens 
#...@ontology    WikiPathways 
#...@keytype     ENTREZID 
#...@gene    chr [1:207] "4312" "8318" "10874" "55143" "55388" "991" "6280" "2305" ...
#...pvalues adjusted by 'BH' with cutoff < 0.05
#...7 enriched terms found
'data.frame':   7 obs. of  12 variables:
 $ ID            : chr  "WP2446" "WP2361" "WP3942" "WP179" ...
 $ Description   : chr  "Retinoblastoma gene in cancer" "Gastric cancer network 1" "PPAR signaling" "Cell cycle" ...
 $ GeneRatio     : chr  "11/122" "6/122" "7/122" "10/122" ...
 $ BgRatio       : chr  "89/9008" "28/9008" "50/9008" "120/9008" ...
 $ RichFactor    : num  0.1236 0.2143 0.14 0.0833 0.2857 ...
 $ FoldEnrichment: num  9.13 15.82 10.34 6.15 21.1 ...
 $ zScore        : num  9.03 9.2 7.76 6.66 8.82 ...
 $ pvalue        : num  2.67e-08 1.61e-06 4.33e-06 4.74e-06 2.89e-05 ...
 $ p.adjust      : num  2.13e-05 6.42e-04 9.45e-04 9.45e-04 4.60e-03 ...
 $ qvalue        : num  2.13e-05 6.42e-04 9.45e-04 9.45e-04 4.60e-03 ...
 $ geneID        : chr  "6241/891/8318/7272/983/9133/1111/24137/890/81620/7153" "6286/4605/6790/11065/22974/7153" "5346/9415/5105/4312/2167/3158/9370" "891/8318/7272/9232/983/9133/4174/1111/991/890" ...
 $ Count         : int  11 6 7 10 4 7 8
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320 

6.1.2 WikiPathways gene set enrichment analysis

gseWP(geneList, organism = "Homo sapiens")
#
# Gene Set Enrichment Analysis
#
#...@organism    Homo sapiens 
#...@setType     WikiPathways 
#...@keytype     ENTREZID 
#...@geneList    Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
 - attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm    1000 
#...pvalues adjusted by 'BH' with cutoff < 0.05
#...158 enriched terms found
'data.frame':   158 obs. of  11 variables:
 $ ID             : chr  "WP2446" "WP179" "WP466" "WP2361" ...
 $ Description    : chr  "Retinoblastoma gene in cancer" "Cell cycle" "DNA replication" "Gastric cancer network 1" ...
 $ setSize        : int  84 111 42 23 62 41 78 36 14 63 ...
 $ enrichmentScore: num  0.731 0.663 0.792 0.837 0.665 ...
 $ NES            : num  2.88 2.75 2.74 2.52 2.52 ...
 $ pvalue         : num  3.09e-12 2.99e-12 2.81e-12 4.97e-09 3.01e-09 ...
 $ p.adjust       : num  8.20e-10 8.20e-10 8.20e-10 6.60e-07 4.79e-07 ...
 $ qvalue         : num  6.64e-10 6.64e-10 6.64e-10 5.35e-07 3.88e-07 ...
 $ rank           : int  1333 1234 1002 854 1111 1750 1627 2454 360 1961 ...
 $ leading_edge   : chr  "tags=51%, list=11%, signal=46%" "tags=40%, list=10%, signal=36%" "tags=55%, list=8%, signal=51%" "tags=52%, list=7%, signal=49%" ...
 $ core_enrichment: chr  "8318/9133/7153/6241/890/983/81620/7272/1111/891/24137/993/898/4998/10733/9134/4175/4173/6502/5984/994/7298/3015"| __truncated__ "8318/991/9133/890/983/7272/1111/891/4174/9232/4171/993/990/5347/9700/898/23594/4998/9134/4175/4173/10926/6502/9"| __truncated__ "8318/55388/81620/4174/4171/990/23594/4998/4175/4173/10926/5984/5111/51053/8317/5427/23649/4176/5982/5557/5558/4172/5424" "4605/7153/11065/22974/6286/6790/1894/56992/4173/1063/9585/8607" ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320 

If your input gene ID type is not Entrez gene ID, you can use the bitr() function to convert gene ID. If you want to convert the gene IDs in output result to gene symbols, you can use the setReadable() function, see also Section 17.2.

6.2 DAVID functional analysis

DAVID is a popular bioinformatics resource for functional annotation and enrichment analysis. Although we recommend using enrichGO and enrichKEGG which use up-to-date data maintained by Bioconductor, clusterProfiler provides enrichDAVID to support DAVID analysis for users who prefer it.

Users need to register an account on DAVID website and use the email address to access the web service.

require(clusterProfiler)
data(geneList, package="DOSE")
gene = names(geneList)[abs(geneList) > 2]
david = enrichDAVID(gene = gene, 
                    idType="ENTREZ_GENE_ID", 
                    listType="Gene", 
                    annotation="KEGG_PATHWAY",
                    david.user = "clusterProfiler@hku.hk")

The result can be visualized using the same functions as other enrichment results.

barplot(david)
cnetplot(david, foldChange=geneList)

6.2.1 DAVID Web Service Limitations

DAVID Web Service has the following limitations:

  • A job with more than 3000 genes to generate gene or term cluster report will not be handled by DAVID due to resource limit.
  • No more than 200 jobs in a day from one user or computer.
  • DAVID Team reserves right to suspend any improper uses of the web service without notice.

For more details, please refer to http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html.

6.2.2 Custom Background

The enrichDAVID function also supports a custom background (universe) using the universe parameter.

enrichDAVID(gene = gene,
            universe = names(geneList),
            idType = "ENTREZ_GENE_ID",
            listType = "Gene",
            annotation = "KEGG_PATHWAY")

6.2.3 Comparing DAVID Functional Profiles

Comparison of functional profiles among different gene clusters is also supported via compareCluster.

data(gcSample)
x=compareCluster(gcSample, fun="enrichDAVID", annotation="KEGG_PATHWAY")
plot(x)

6.2.4 Note on Data Updates

As highlighted in a Nature Methods study (Wadi et al. (2016)), DAVID has a very low update frequency, often going more than 5 years between updates, leading to a continuous decline in knowledge completeness. This means analysis results may only reflect outdated biological knowledge, which could introduce biases in current research. As shown in previous comparisons, using clusterProfiler with Bioconductor annotation data often yields more comprehensive results (more annotated genes and enriched pathways) compared to DAVID.

6.3 Generalized Enrichment Analysis with GSON format

The gson (Gene Set Object Notation) format provides a standardized way to store gene set information along with metadata, making it easier to share and use custom gene sets for enrichment analysis. The gson package defines the GSON class to store this information.

6.3.1 Helper functions for GSON

The clusterProfiler and related packages provide a series of helper functions to download and convert data from various sources into GSON objects. These functions typically start with the gson_ prefix. Users can use tab completion in R (e.g., typing gson_ and pressing Tab) to explore available functions. Some common examples include:

  • gson_KEGG(): Download KEGG pathway/module data.
  • gson_WP(): Download WikiPathways data.
  • gson_GO(): Download Gene Ontology data.

Additionally, users can use the gson::gson() function to manually create GSON objects from their own data tables.

6.3.2 Analyzing multiple knowledge bases with gsonList

One of the powerful features of the GSON format is the ability to perform enrichment analysis on multiple knowledge bases simultaneously. The gsonList() function allows users to combine multiple GSON objects into a list, which can then be passed to the enricher() function.

library(gson)
library(clusterProfiler)

# Example: Prepare GSON objects (assuming you have downloaded or created them)
# kegg_gson <- gson_KEGG(species = "hsa")
# wp_gson <- gson_WP(species = "Homo sapiens")
# go_gson <- gson_GO(OrgDb = org.Hs.eg.db, keyType = "ENTREZID", ont = "BP")

# Combine them into a list
# gsons <- gsonList(kegg = kegg_gson, wp = wp_gson, go = go_gson)

# Perform enrichment analysis
# res <- enricher(gene = geneList, gson = gsons)

The result of enricher() with a gsonList input is effectively a list of enrichment results, where each component corresponds to one of the input knowledge bases. You can access individual results by name or index.

6.3.3 Visualization with autofacet

To visualize the results from multiple knowledge bases together, enrichplot provides the autofacet() function. This function can be added to a dotplot to automatically facet the results by knowledge base.

# dotplot(res) + autofacet()

This approach allows for a comprehensive view of enrichment results across different databases, facilitating comparison and interpretation. For example, you might observe that GO terms cover a broader range of biological processes due to higher annotation coverage, while KEGG pathways might provide more specific mechanistic insights.

References

Wadi, Lina, Mona Meyer, Joel Weiser, Lincoln D. Stein, and Jüri Reimand. 2016. “Impact of Outdated Gene Annotations on Pathway Enrichment Analysis.” Nature Methods 13 (9): 705–6. https://doi.org/10.1038/nmeth.3963.
Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.