17  Useful utilities

17.1 Translating Biological ID

17.1.1 bitr: Biological Id TranslatoR

The clusterProfiler package provides the bitr() and bitr_kegg() functions for converting ID types. Both bitr() and bitr_kegg() support many species including model and many non-model organisms.

x <- c("GPX3",  "GLRX",   "LBP",   "CRYAB", "DEFB1", "HCLS1",   "SOD2",   "HSPA2",
       "ORM1",  "IGFBP1", "PTHLH", "GPC3",  "IGFBP3","TOB1",    "MITF",   "NDRG1",
       "NR1H4", "FGFR3",  "PVR",   "IL6",   "PTPRM", "ERBB2",   "NID2",   "LAMB1",
       "COMP",  "PLS3",   "MCAM",  "SPP1",  "LAMC1", "COL4A2",  "COL4A1", "MYOC",
       "ANXA4", "TFPI2",  "CST6",  "SLPI",  "TIMP2", "CPM",     "GGT1",   "NNMT",
       "MAL",   "EEF1A2", "HGD",   "TCN2",  "CDA",   "PCCA",    "CRYM",   "PDXK",
       "STC1",  "WARS",  "HMOX1", "FXYD2", "RBP4",   "SLC6A12", "KDELR3", "ITM2B")
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(eg)
  SYMBOL ENTREZID
1   GPX3     2878
2   GLRX     2745
3    LBP     3929
4  CRYAB     1410
5  DEFB1     1672
6  HCLS1     3059

Users should provide an annotation package, and both fromType and toType can accept any supported types.

Users can use keytypes to list all supported types.

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     

We can translate from one type to other types.

library(clusterProfiler)
ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")
Warning in bitr(x, fromType = "SYMBOL", toType = c("UNIPROT", "ENSEMBL"), :
1.79% of input gene IDs are fail to map...
head(ids)
  SYMBOL UNIPROT         ENSEMBL
1   GPX3  P22352 ENSG00000211445
2   GPX3  O43787 ENSG00000211445
3   GPX3  Q86W78 ENSG00000211445
4   GPX3  Q9NZ74 ENSG00000211445
5   GPX3  Q9UEL1 ENSG00000211445
6   GLRX  B2R4L2 ENSG00000173221

For GO analysis, users don’t need to convert IDs, as all ID types provided by OrgDb can be used in groupGO, enrichGO, and gseGO by specifying the keyType parameter.

Users can use the bitr function to convert IDs using any ID types available in the OrgDb object. For example, users may want to know the genes in the background that belong to a specific GO term. Such information can be easily accessed using bitr.

m <- bitr("GO:0006805", fromType="GO", toType = "SYMBOL", OrgDb=org.Hs.eg.db)
dim(m)
[1] 170   4
head(m)
          GO EVIDENCE ONTOLOGY SYMBOL
1 GO:0006805      TAS       BP   NAT1
2 GO:0006805      TAS       BP   NAT2
3 GO:0006805      TAS       BP  AADAC
4 GO:0006805      TAS       BP    ADA
5 GO:0006805      IEA       BP    AHR
6 GO:0006805      TAS       BP    AHR

Note: If you want to extract genes in your input gene list that are belong to a specific term/pathway, you can use the geneInCategory function.

17.1.2 bitr_kegg: converting biological IDs using KEGG API

The bitr_kegg() function supports ID conversion via the KEGG API, which is useful for species supported by KEGG but lacking an OrgDb object. For detailed usage and examples, please refer to the KEGG ID Conversion section in the KEGG analysis chapter.

17.2 setReadable: translating gene IDs to human readable symbols

Some of the functions, especially those internally supported for DO, GO, and Reactome Pathway, support a parameter, readable. If readable = TRUE, all the gene IDs will be translated to gene symbols. The readable parameter is not available for enrichment analysis of KEGG or using user’s own annotation. KEGG analysis using enrichKEGG and gseKEGG, which internally query annotation information from KEEGG database and thus support all species if it is available in the KEGG database. However, KEGG database doesn’t provide gene ID to symbol mapping information. For analysis using user’s own annotation data, we even don’t know what species is in analyzed. Translating gene IDs to gene symbols is partly supported using the setReadable function if and only if there is an OrgDb available. The setReadable() function also works with compareCluster() output.

data(geneList, package="DOSE")
de <- names(geneList)[1:100]
x <- enrichKEGG(de)
## The geneID column is ENTREZID
head(x, 3)
                   category           subcategory       ID         Description
hsa04110 Cellular Processes Cell growth and death hsa04110          Cell cycle
hsa04218 Cellular Processes Cell growth and death hsa04218 Cellular senescence
hsa04814 Cellular Processes         Cell motility hsa04814      Motor proteins
         GeneRatio  BgRatio RichFactor FoldEnrichment    zScore       pvalue
hsa04110     12/60 158/9370 0.07594937      11.860759 11.052466 2.418534e-10
hsa04218      7/60 157/9370 0.04458599       6.962845  6.048535 5.946670e-05
hsa04814      7/60 194/9370 0.03608247       5.634880  5.236728 2.233892e-04
             p.adjust       qvalue
hsa04110 8.513240e-08 8.513240e-08
hsa04218 1.046614e-02 1.046614e-02
hsa04814 2.098581e-02 2.098581e-02
                                                             geneID Count
hsa04110 9212/81620/983/890/9319/10403/7272/991/9133/8318/1111/4085    12
hsa04218                          2305/983/890/4605/9133/1111/51806     7
hsa04814                     3833/10112/3832/9493/146909/81930/1062     7
y <- setReadable(x, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
## The geneID column is translated to symbol
head(y, 3)
                   category           subcategory       ID         Description
hsa04110 Cellular Processes Cell growth and death hsa04110          Cell cycle
hsa04218 Cellular Processes Cell growth and death hsa04218 Cellular senescence
hsa04814 Cellular Processes         Cell motility hsa04814      Motor proteins
         GeneRatio  BgRatio RichFactor FoldEnrichment    zScore       pvalue
hsa04110     12/60 158/9370 0.07594937      11.860759 11.052466 2.418534e-10
hsa04218      7/60 157/9370 0.04458599       6.962845  6.048535 5.946670e-05
hsa04814      7/60 194/9370 0.03608247       5.634880  5.236728 2.233892e-04
             p.adjust       qvalue
hsa04110 8.513240e-08 8.513240e-08
hsa04218 1.046614e-02 1.046614e-02
hsa04814 2.098581e-02 2.098581e-02
                                                                        geneID
hsa04110 AURKB/CDT1/CDK1/CCNA2/TRIP13/NDC80/TTK/CDC20/CCNB2/CDC45/CHEK1/MAD2L1
hsa04218                             FOXM1/CDK1/CCNA2/MYBL2/CCNB2/CHEK1/CALML5
hsa04814                          KIFC1/KIF20A/KIF11/KIF23/KIF18B/KIF18A/CENPE
         Count
hsa04110    12
hsa04218     7
hsa04814     7

For those functions that internally support readable parameter, user can also use setReadable for translating gene IDs.

17.3 Parsing GMT files

The GMT (Gene Matrix Transposed) file format is a tab delimited file format that is widely used to describe gene sets. Each row in the GMT format represents one gene set and each gene set is described by a name (or ID), a description and the genes in the gene set as illustrated in Figure 17.1.

The clusterProfiler package implemented a function, read.gmt(), to parse GMT file into a data.frame. The WikiPathway GMT file encodes information of version, wpid and species into the Description column. The clusterProfiler provides the read.gmt.wp() function to parse WikiPathway GMT file and supports parsing information that encoded in the Description column.

# use `wget -c` in `download.file`
wget::wget_set()
url <- "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=COVID-19_Related_Gene_Sets"
download.file(url, destfile = "COVID19_GeneSets.gmt")

covid19_gs <- read.gmt("COVID19_GeneSets.gmt")
head(covid19_gs)
                                    term    gene
1 COVID19-E protein host PPI from Krogan    BRD4
2 COVID19-E protein host PPI from Krogan    BRD2
3 COVID19-E protein host PPI from Krogan SLC44A2
4 COVID19-E protein host PPI from Krogan  ZC3H18
5 COVID19-E protein host PPI from Krogan   AP3B1
6 COVID19-E protein host PPI from Krogan   CWC27

Gene set resources:

There are many gene sets available online. After parsing by the read.gmt() function, the data can be directly used to perform enrichment analysis using enricher() and GSEA() functions (see also Chapter 12).

17.4 Data frame interface for accessing enriched results

Enrichment result objects in clusterProfiler support standard data frame operations including [, [[, and $ operators, allowing users to subset and access results using familiar data frame syntax.

17.4.1 Subsetting with [ operator

The [ operator can be used to subset enrichment results based on row indices or logical conditions:

# Subset first 5 rows
x_first5 <- x[1:5, ]

# Subset based on p-value threshold
x_sig <- x[x$pvalue < 0.05, ]

# Subset specific terms by ID
x_specific <- x[x$ID %in% c("hsa04110", "hsa04218"), ]

By default, the [ operator returns a data.frame. To preserve the enrichment result object structure for further analysis and visualization, use the asis = TRUE parameter:

# Preserve as enrichResult object
x_filtered <- x[x$pvalue < 0.05, asis = TRUE]
class(x_filtered)
[1] "enrichResult"
attr(,"package")
[1] "enrichit"

17.4.2 Accessing columns with [[ and $ operators

The $ operator provides convenient access to specific columns:

# Access Description column using $
descriptions <- x$Description
head(descriptions)
[1] "Cell cycle"                                                   
[2] "Cellular senescence"                                          
[3] "Motor proteins"                                               
[4] "Oocyte meiosis"                                               
[5] "IL-17 signaling pathway"                                      
[6] "Viral protein interaction with cytokine and cytokine receptor"
# Access p-value column using
pvalues <- x$pvalue
head(pvalues)
[1] 2.418534e-10 5.946670e-05 2.233892e-04 2.384751e-04 3.225419e-04
[6] 4.296270e-04
# Access geneID column
gene_ids <- x$geneID
head(gene_ids)
[1] "9212/81620/983/890/9319/10403/7272/991/9133/8318/1111/4085"
[2] "2305/983/890/4605/9133/1111/51806"                         
[3] "3833/10112/3832/9493/146909/81930/1062"                    
[4] "983/991/6790/9133/51806/4085"                              
[5] "6279/6280/4312/3627/6278"                                  
[6] "6362/6373/10563/4283/3627"                                 

17.4.3 Accessing genes of a specific term using [[ operator

When applied with a term/pathway ID, the [[ operator returns the vector of genes that belong to that specific category in the enrichment result.

# Genes annotated in a single term (e.g., hsa04218)
genes_in_term <- x[["hsa04218"]]
head(genes_in_term)
[1] "2305" "983"  "890"  "4605" "9133" "1111"
# Retrieve genes for multiple terms
ids <- c("hsa04218", "hsa04814")
genes_by_term <- geneInCategory(x)[ids]
str(genes_by_term)
List of 2
 $ hsa04218: chr [1:7] "2305" "983" "890" "4605" ...
 $ hsa04814: chr [1:7] "3833" "10112" "3832" "9493" ...

17.4.4 Practical filtering examples

Users can filter enrichment results based on various criteria:

# Filter by adjusted p-value
x_padj <- x[x$p.adjust < 0.05, asis = TRUE]

# Filter by gene count (minimum 5 genes per term)
x_min5 <- x[x$Count >= 5, asis = TRUE]

# Filter specific biological processes
x_bp <- x[grepl("process", x$Description, ignore.case = TRUE), asis = TRUE]

# Filter terms containing specific genes
x_containing_gene <- x[grepl("6280", x$geneID), asis = TRUE]

This data frame interface provides flexible filtering capabilities while maintaining compatibility with enrichplot visualization functions when using asis = TRUE.

17.5 Alternative filtering approaches with dplyr

While base R operators provide familiar syntax for filtering enrichment results, users may prefer the more readable and chainable approach offered by dplyr verbs. For comprehensive examples of using filter, arrange, select, mutate, and other dplyr functions with enrichment result objects, please refer to the dplyr verbs for manipulating enrichment result chapter.

The dplyr approach is particularly useful for complex filtering operations and pipeline-style data manipulation.

References