The KEGG FTP service has not been freely available for academic use since 2012, and there are many software packages using outdated KEGG annotation data. The clusterProfiler package supports downloading the latest online version of KEGG data using the KEGG website, which is freely available for academic users. Both KEGG pathways and modules are supported in clusterProfiler.
5.1 Supported organisms
The clusterProfiler package supports all organisms that have KEGG annotation data available in the KEGG database. Users should pass the KEGG organism code, an abbreviation of the scientific name (e.g., hsa for Homo sapiens), to the organism parameter. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html. KEGG Orthology (KO) Database is also supported by specifying organism = "ko".
The clusterProfiler package provides the search_kegg_organism() function to help search for supported organisms.
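As a short illustration, the calls below search the organism table by KEGG code and by scientific name. This is a sketch that assumes a recent clusterProfiler release in which the by argument accepts "kegg_code" and "scientific_name"; the search strings are only examples.

```r
library(clusterProfiler)

# Look up organisms whose KEGG code matches "ece"
ece <- search_kegg_organism("ece", by = "kegg_code")
head(ece)

# Look up organisms by (part of) their scientific name
ecoli <- search_kegg_organism("Escherichia coli", by = "scientific_name")
head(ecoli)
```

The first column of the returned table holds the code to pass to the organism parameter.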
5.2 Finding correct parameters: A case study with Rice
The enrichKEGG() function requires three key parameters: gene (gene IDs), organism (species code), and keyType (type of gene ID). Many users struggle with determining the correct values for organism and keyType, especially for non-model organisms. Here, we use rice (Oryza sativa) as an example to demonstrate how to find these parameters.
5.2.1 Determine the organism code
The organism parameter accepts a KEGG organism code (e.g., ‘hsa’ for human, ‘mmu’ for mouse). To find the code for your species, you can search the KEGG Organism Catalog or use the search_kegg_organism() function as mentioned above.
For rice, searching for “rice” or “Oryza sativa” in the catalog reveals two main entries. Both refer to Oryza sativa japonica (Japanese rice) and differ in the source of the gene annotation:
* osa: RefSeq annotation
* dosa: RAP-DB annotation
Choose the one that matches your data. In this example, we use dosa.
5.2.2 Determine the keyType
Once the organism is determined, we need to know what gene ID types (keyType) are supported for that organism. A practical trick is to use enrichKEGG() as a query tool: call it with a dummy gene ID and a candidate keyType. If the keyType is supported but the gene cannot be mapped, the function prints the expected ID format; if the keyType is not supported for that organism, it stops with an error.
# Try 'kegg' as keyType with a dummy gene
enrichKEGG(gene = "abc", organism = "dosa", keyType = "kegg")
# Output:
# --> No gene can be mapped....
# --> Expected input gene ID: Os06t0664200-01,Os02t0534400-01,Os06t0185100-00...
# --> return NULL...

# Try 'ncbi-geneid'
enrichKEGG(gene = "abc", organism = "dosa", keyType = "ncbi-geneid")
# Output:
# Error in KEGG_convert("kegg", keyType, species) :
#   ncbi-geneid is not supported for dosa ...

# Try 'uniprot'
enrichKEGG(gene = "abc", organism = "dosa", keyType = "uniprot")
# Output:
# Error in KEGG_convert("kegg", keyType, species) :
#   uniprot is not supported for dosa ...
From the output, we learn that for dosa, the supported keyType is kegg, and the expected gene ID format looks like Os06t0664200-01 (RAP-ID). For osa, the gene IDs are typically NCBI Gene IDs (e.g., 3131385).
5.2.3 ID Conversion
If your gene list uses a different ID type (e.g., LOC4334374 or LOC_Os01g01010.1), you need to convert them to the supported KEGG ID.
For LOC4334374, you can use AnnotationHub to retrieve the rice OrgDb and convert symbols to Entrez IDs (which correspond to osa KEGG IDs).
library(AnnotationHub)
ah <- AnnotationHub()
# Query for Oryza sativa
query(ah, 'oryza')
oryza <- ah[['AH55775']]  # Example AH ID, check for latest
# Check keys
head(keys(oryza, keytype = "SYMBOL"))
# Convert SYMBOL to ENTREZID
# ...
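For placeholder symbols of the form LOC followed by digits, NCBI builds the symbol from the Gene ID itself, so a simple string operation may already be enough. The helper below is a hypothetical convenience, not part of clusterProfiler; genes with a proper symbol still need the OrgDb lookup.

```r
# Hypothetical helper: derive Entrez Gene IDs from NCBI placeholder symbols
# such as "LOC4334374". Symbols that do not match the pattern yield NA.
loc_to_entrez <- function(symbols) {
  is_loc <- grepl("^LOC[0-9]+$", symbols)
  ifelse(is_loc, sub("^LOC", "", symbols), NA_character_)
}

# The second ID does not follow the LOC<digits> pattern and is returned as NA
loc_to_entrez(c("LOC4334374", "LOC_Os01g01010.1"))
```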
For RAP-ID conversion (e.g., LOC_Os01g01010.1 to Os01t0100100-01), you might need to download mapping files from the Rice Annotation Project Database (RAP-DB) or Oryzabase and perform the conversion manually or using scripts.
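Once a mapping file is available, the conversion itself is a table lookup. A minimal sketch, assuming a two-column mapping table: the column names msu and rap are hypothetical, and the single row simply reuses the example pair mentioned above (a real mapping file would contain one row per transcript).

```r
# Toy mapping table; in practice, read the downloaded file with read.delim()
id_map <- data.frame(
  msu = "LOC_Os01g01010.1",
  rap = "Os01t0100100-01",
  stringsAsFactors = FALSE
)

# Convert a vector of LOC_Os IDs to RAP IDs; unmapped IDs become NA
msu_to_rap <- function(ids, map) {
  map$rap[match(ids, map$msu)]
}

# The second ID is made up to show how unmapped entries behave
gene_list_msu <- c("LOC_Os01g01010.1", "LOC_Os99g99999.1")
gene_list_rap <- msu_to_rap(gene_list_msu, id_map)
gene_list_rap <- gene_list_rap[!is.na(gene_list_rap)]  # drop unmapped IDs
gene_list_rap
```

It is worth checking how many IDs were dropped before running the enrichment, because a low mapping rate usually points to a version mismatch between the two annotations.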
5.2.4 Perform Enrichment Analysis
With the correct organism, keyType, and converted gene IDs, you can now run the analysis:
# For osa (using Entrez IDs)
kk <- enrichKEGG(gene = gene_list_entrez,
                 organism = "osa",
                 keyType = "kegg",  # or 'ncbi-geneid' if supported
                 pvalueCutoff = 0.05)

# For dosa (using RAP-IDs)
kk <- enrichKEGG(gene = gene_list_rap,
                 organism = "dosa",
                 keyType = "kegg",
                 pvalueCutoff = 0.05)
This approach of identifying organism code and verifying keyType via error messages applies to any species available in the KEGG database.
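The probing step can be wrapped in a small loop. The helper below is a hypothetical sketch, not a clusterProfiler function: it tries each candidate keyType with a dummy gene and records whether the call raised an error. The enrich_fun argument defaults to enrichKEGG() but can be swapped for a stand-in function, which makes the helper easy to try without network access.

```r
# Hypothetical helper: probe which keyType values a KEGG organism accepts.
# `enrich_fun` must accept the arguments gene, organism and keyType.
probe_key_types <- function(organism,
                            key_types = c("kegg", "ncbi-geneid",
                                          "ncbi-proteinid", "uniprot"),
                            enrich_fun = clusterProfiler::enrichKEGG) {
  status <- vapply(key_types, function(kt) {
    tryCatch({
      # A dummy gene is enough: we only care whether the call errors
      enrich_fun(gene = "abc", organism = organism, keyType = kt)
      "accepted"
    }, error = function(e) paste("error:", conditionMessage(e)))
  }, character(1))
  data.frame(keyType = key_types, status = unname(status),
             stringsAsFactors = FALSE)
}

# Usage (requires internet access):
# probe_key_types("dosa")
```

A keyType reported as accepted only means that the call did not fail; the expected ID format is still printed as a message by enrichKEGG() itself.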
5.3 KEGG Data Localization
The clusterProfiler package retrieves KEGG data online via the KEGG API, which ensures that users analyze their data with the latest knowledge. However, this dependency on internet access can be a limitation if the network is unstable or unavailable (e.g., in some cloud environments). Furthermore, for the sake of reproducibility, it is often desirable to freeze the KEGG data version used in an analysis.
To address these issues, we provide two methods to perform KEGG analysis locally.
5.3.1 Using gson
The gson_KEGG() function, exported by clusterProfiler, downloads the latest KEGG pathway and module data for a specific organism and stores it in a GSON object, the gene set format defined by the gson package. This object contains all necessary information for enrichment analysis and can be saved to a file for offline use.
library(clusterProfiler)  # provides gson_KEGG()
library(gson)             # provides read.gson() and write.gson()
# download KEGG data for human
kk_gson <- gson_KEGG(species = "hsa")
# save to a file
write.gson(kk_gson, file = "kegg_hsa.gson")
# read from file
kk_gson <- read.gson("kegg_hsa.gson")
The GSON object or the file can be used in enricher() and GSEA() functions (via gson parameter or by parsing it). For more detailed information on using the GSON format, including how to perform generalized enrichment analysis with multiple knowledge bases using gsonList, please refer to Section 6.3.
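As a sketch of the offline workflow, the hypothetical wrapper below reads a previously saved GSON file and passes it to enricher() through the gson parameter. The function name and its default file name are illustrative assumptions; the gene IDs must be of the same type as those stored in the GSON object.

```r
# Hypothetical wrapper: KEGG over-representation analysis from a saved GSON file.
# `gson_file` is a file written earlier with write.gson().
enrich_kegg_offline <- function(gene, gson_file = "kegg_hsa.gson", ...) {
  kegg_gson <- gson::read.gson(gson_file)
  # Further arguments (e.g. pvalueCutoff) are forwarded to enricher()
  clusterProfiler::enricher(gene, gson = kegg_gson, ...)
}

# Usage, with `gene` being a character vector of KEGG gene IDs:
# kk_offline <- enrich_kegg_offline(gene, pvalueCutoff = 0.05)
```

Keeping the GSON file alongside the analysis scripts freezes the KEGG version used for the results.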
5.3.2 Using createKEGGdb
Another approach is to create a local KEGG.db package containing the KEGG data. While the official KEGG.db has not been updated since 2011, we can generate a new one using the createKEGGdb package.
First, install createKEGGdb from GitHub:
remotes::install_github("YuLab-SMU/createKEGGdb")
Then, use create_kegg_db() to download data and package it. You can specify one or more species, or even “all” to download data for all supported species.
library(createKEGGdb)
# create KEGG.db for human and Arabidopsis
create_kegg_db(c("hsa", "ath"))
This command will generate a KEGG.db_1.0.tar.gz file in the current directory. Install it as a standard R package:
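A sketch of the installation step, assuming the tarball name reported above. Afterwards, setting use_internal_data = TRUE tells enrichKEGG() to read from the installed KEGG.db instead of querying the KEGG API. The geneList example data from the DOSE package and the fold-change cutoff of 2 are illustrative choices only.

```r
# Install the locally built package from the source tarball
install.packages("KEGG.db_1.0.tar.gz", repos = NULL, type = "source")

library(KEGG.db)
library(clusterProfiler)

# Example input: Entrez IDs of genes with an absolute fold change above 2
data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]

# Use the installed KEGG.db rather than the online KEGG API
kk_local <- enrichKEGG(gene, organism = "hsa", use_internal_data = TRUE)
head(kk_local)
```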
This ensures that the analysis runs offline and is fully reproducible using the installed KEGG.db version.
5.4 KEGG ID Conversion
The clusterProfiler package provides the bitr_kegg() function to support ID conversion via the KEGG API. This is particularly useful for species that do not have an OrgDb object but are supported by KEGG.
5.4.1 Basic Usage
Here is an example of converting KEGG IDs to NCBI Protein IDs and then to UniProt IDs for human genes.
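The conversion can be sketched as follows, using the gcSample example dataset shipped with clusterProfiler as input (any vector of human KEGG gene IDs would work the same way). Both calls query the KEGG API and therefore need internet access.

```r
library(clusterProfiler)

# gcSample is a list of gene clusters; take the first cluster as input
data(gcSample)
hg <- gcSample[[1]]
head(hg)

# KEGG gene IDs -> NCBI Protein IDs
eg2np <- bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid",
                   organism = "hsa")
head(eg2np)

# NCBI Protein IDs (second column of the previous result) -> UniProt IDs
np2up <- bitr_kegg(eg2np[, 2], fromType = "ncbi-proteinid", toType = "uniprot",
                   organism = "hsa")
head(np2up)
```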
The ID type (both fromType and toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’, or ‘uniprot’. The ‘kegg’ ID is the primary ID used in the KEGG database. A rule of thumb is that ‘kegg’ ID corresponds to entrezgene ID for eukaryote species and Locus ID for prokaryotes.
For many prokaryote species, entrezgene IDs are not available. For example, ece:Z5100 (E. coli O157:H7 EDL933) has NCBI-ProteinID and UniProt links, but not NCBI-GeneID. Attempting to convert Z5100 to ncbi-geneid will result in an error stating that ncbi-geneid is not supported. However, conversion to ncbi-proteinid or uniprot is possible.
# Example for prokaryote ID conversion
bitr_kegg("Z5100", fromType = "kegg", toType = 'ncbi-proteinid', organism = 'ece')
A common confusion exists between K numbers (KEGG Orthology) and ko numbers (KEGG Pathway maps). K numbers represent orthologous groups, while ko numbers represent pathway maps. Users often want to map K numbers to pathways.
To retrieve the names of the pathways (e.g., “Glutathione metabolism” instead of “ko00480”), we can define a simple helper function ko2name to query the KEGG API.
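One possible definition of such a helper is sketched below; it is an illustration, not necessarily the implementation used to produce the output shown in this section. It assumes that the KEGG REST list operation returns one tab-delimited line per reference pathway (map identifier, then name) and relies on reference (map) and KO (ko) pathways sharing the same five-digit number. The name parse_kegg_pathway_list is hypothetical.

```r
# Parse tab-delimited lines of the form "<pathway id>\t<name>" into a
# two-column data.frame with ko identifiers.
parse_kegg_pathway_list <- function(lines) {
  parts <- strsplit(lines, "\t", fixed = TRUE)
  id <- vapply(parts, `[`, character(1), 1)
  name <- vapply(parts, `[`, character(1), 2)
  # Reference pathways are listed as map<number>, optionally prefixed "path:"
  id <- sub("^(path:)?map", "ko", id)
  data.frame(ko = id, name = name, stringsAsFactors = FALSE)
}

# Look up pathway names for a vector of ko identifiers (requires internet)
ko2name <- function(ko, url = "https://rest.kegg.jp/list/pathway") {
  all_pathways <- parse_kegg_pathway_list(readLines(url))
  all_pathways[all_pathways$ko %in% ko, , drop = FALSE]
}
```

The result has the columns ko and name, so it can be merged with the output of bitr_kegg() on the pathway identifier.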
x <- bitr_kegg("K00799", "kegg", "Path", "ko")
y <- ko2name(x$Path)
merge(x, y, by.x = 'Path', by.y = 'ko')
Path kegg name
1 ko00480 K00799 Glutathione metabolism
2 ko00980 K00799 Metabolism of xenobiotics by cytochrome P450
3 ko00982 K00799 Drug metabolism - cytochrome P450
4 ko00983 K00799 Drug metabolism - other enzymes
5 ko01100 K00799 Metabolic pathways
6 ko01524 K00799 Platinum drug resistance
7 ko04212 K00799 Longevity regulating pathway - worm
8 ko05200 K00799 Pathways in cancer
9 ko05204 K00799 Chemical carcinogenesis - DNA adducts
10 ko05207 K00799 Chemical carcinogenesis - receptor activation
11 ko05208 K00799 Chemical carcinogenesis - reactive oxygen species
12 ko05225 K00799 Hepatocellular carcinoma
13 ko05418 K00799 Fluid shear stress and atherosclerosis
Input ID type can be kegg, ncbi-geneid, ncbi-proteinid or uniprot (see also Section 5.4, KEGG ID Conversion). Unlike enrichGO(), enrichKEGG() has no readable parameter. However, users can apply the setReadable() function to the result (see also Section 17.2) if an OrgDb is available for the species.
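A typical over-representation call could look as follows. This sketch assumes the geneList example dataset from the DOSE package, whose names are Entrez gene IDs, and an arbitrary fold-change cutoff of 2 to define the genes of interest; both are illustrative choices.

```r
library(clusterProfiler)

# Example data: named numeric vector of fold changes, names are Entrez IDs
data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]

# For human, 'kegg' IDs correspond to Entrez gene IDs
kk <- enrichKEGG(gene = gene, organism = "hsa", pvalueCutoff = 0.05)
head(kk)
```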
5.6 Translating Gene IDs to Names
For GO analysis, the readable parameter controls whether to translate the IDs to human-readable gene names. This parameter is not available for KEGG analysis. However, the setReadable() function can translate input gene IDs to gene names if the corresponding OrgDb object is available.
library(org.Hs.eg.db)
# 'kk' is the enrichKEGG result from the previous section
kk <- setReadable(kk, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")
head(kk)
KEGG pathways are organized into a hierarchical structure with top-level categories such as “Metabolism”, “Genetic Information Processing”, “Environmental Information Processing”, “Cellular Processes”, “Organismal Systems”, “Human Diseases”, and “Drug Development”. This classification helps users interpret enrichment results by providing higher-level biological context. For instance, users might be interested specifically in signaling pathways (“Environmental Information Processing”) or metabolic pathways (“Metabolism”) while excluding disease-related pathways.
The clusterProfiler package integrates this classification information directly into the enrichment results. The output data.frame contains category and subcategory columns, which can be used to filter or group the results.
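Filtering on these columns is ordinary data.frame subsetting. The sketch below uses a small hand-made table as a stand-in for as.data.frame(kk), containing only the columns needed here, so that it runs without network access.

```r
# Toy stand-in for as.data.frame(kk); a real result has more columns
res <- data.frame(
  ID = c("hsa00480", "hsa05200"),
  Description = c("Glutathione metabolism", "Pathways in cancer"),
  category = c("Metabolism", "Human Diseases"),
  stringsAsFactors = FALSE
)

# Keep everything except disease-related pathways
non_disease <- res[res$category != "Human Diseases", ]
non_disease$Description

# Count enriched pathways per top-level category
table(res$category)
```

The same subsetting applies to the subcategory column when a finer grouping is needed.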
5.8.1 Data Update Strategy
While clusterProfiler queries the KEGG API for species-specific gene sets to ensure the latest data is used, the pathway classification structure is relatively stable and universal across species. To optimize performance and avoid unnecessary dependencies (e.g., web scraping packages), clusterProfiler caches this classification data within the package.
To prevent this cached data from becoming outdated, the package utilizes GitHub Actions for automated maintenance. A workflow is triggered weekly to crawl the latest pathway classification from the KEGG website. If any discrepancies are found between the online data and the cached data, the workflow automatically creates a Pull Request to update the package. This automation ensures that the classification data in clusterProfiler remains synchronized with KEGG updates efficiently and reliably.
5.9 KEGG module over-representation analysis
KEGG Module is a collection of manually defined functional units. In some situations, KEGG Modules have a more straightforward interpretation.
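Module enrichment follows the same pattern as pathway enrichment, with enrichMKEGG() in place of enrichKEGG(). The sketch below again assumes the geneList example data from DOSE and a fold-change cutoff of 2; the cutoffs of 1 are chosen only so that all tested modules are returned.

```r
library(clusterProfiler)

data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]

# Cutoffs of 1 return all tested modules; tighten them for real analyses
mkk <- enrichMKEGG(gene = gene,
                   organism = "hsa",
                   pvalueCutoff = 1,
                   qvalueCutoff = 1)
head(mkk)
```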
The enrichplot package implements several methods to visualize enriched terms. Most of them are general methods that can be used on GO, KEGG, MSigDb, and other gene set annotations. Here, we introduce the clusterProfiler::browseKEGG() and pathview::pathview() functions to help users explore enriched KEGG pathways with genes of interest.
To view the KEGG pathway, users can use the browseKEGG() function, which will open a web browser and highlight enriched genes. See Figure 5.1 for a screenshot example.
browseKEGG(kk, 'hsa04110')
Figure 5.1: Explore selected KEGG pathway. Differentially expressed genes that are enriched in the selected pathway will be highlighted.
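For pathview(), a call along the following lines maps expression values onto the pathway graph. This is a sketch assuming the geneList fold-change vector from DOSE as input; the symmetric color limit is an illustrative choice.

```r
library(pathview)

# Named numeric vector of fold changes; names are Entrez gene IDs
data(geneList, package = "DOSE")

# Downloads the pathway and writes hsa04110.pathview.png to the working directory
hsa04110 <- pathview(gene.data  = geneList,
                     pathway.id = "hsa04110",
                     species    = "hsa",
                     limit      = list(gene = max(abs(geneList)), cpd = 1))
```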
Figure 5.2: Visualize selected KEGG pathway by pathview(). Gene expression values can be mapped to gradient color scale.
References
Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (July): 1830–31. https://doi.org/10.1093/bioinformatics/btt285.
Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.