15  Useful utilities

15.1 Translating Biological ID

15.1.1 bitr: Biological Id TranslatoR

The clusterProfiler package provides the bitr() and bitr_kegg() functions for converting ID types. Both bitr() and bitr_kegg() support many species including model and many non-model organisms.

x <- c("GPX3",  "GLRX",   "LBP",   "CRYAB", "DEFB1", "HCLS1",   "SOD2",   "HSPA2",
       "ORM1",  "IGFBP1", "PTHLH", "GPC3",  "IGFBP3","TOB1",    "MITF",   "NDRG1",
       "NR1H4", "FGFR3",  "PVR",   "IL6",   "PTPRM", "ERBB2",   "NID2",   "LAMB1",
       "COMP",  "PLS3",   "MCAM",  "SPP1",  "LAMC1", "COL4A2",  "COL4A1", "MYOC",
       "ANXA4", "TFPI2",  "CST6",  "SLPI",  "TIMP2", "CPM",     "GGT1",   "NNMT",
       "MAL",   "EEF1A2", "HGD",   "TCN2",  "CDA",   "PCCA",    "CRYM",   "PDXK",
       "STC1",  "WARS",  "HMOX1", "FXYD2", "RBP4",   "SLC6A12", "KDELR3", "ITM2B")
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(eg)
  SYMBOL ENTREZID
1   GPX3     2878
2   GLRX     2745
3    LBP     3929
4  CRYAB     1410
5  DEFB1     1672
6  HCLS1     3059

User should provides an annotation package, both fromType and toType can accept any types that supported.

User can use keytypes to list all supporting types.

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     

We can translate from one type to other types.

library(clusterProfiler)
ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")
Warning in bitr(x, fromType = "SYMBOL", toType = c("UNIPROT", "ENSEMBL"), :
1.79% of input gene IDs are fail to map...
head(ids)
  SYMBOL UNIPROT         ENSEMBL
1   GPX3  P22352 ENSG00000211445
2   GPX3  O43787 ENSG00000211445
3   GPX3  Q86W78 ENSG00000211445
4   GPX3  Q9NZ74 ENSG00000211445
5   GPX3  Q9UEL1 ENSG00000211445
6   GLRX  B2R4L2 ENSG00000173221

For GO analysis, user don’t need to convert ID, all ID type provided by OrgDb can be used in groupGO, enrichGO and gseGO by specifying keyType parameter.

Users can use the bitr function to convert IDs using any ID types that available in the OrbDb object. For example, users may want to know the genes in background that are belong to a specific GO term. Such information can be easily accessed using bitr.

m <- bitr("GO:0006805", fromType="GO", toType = "SYMBOL", OrgDb=org.Hs.eg.db)
dim(m)
[1] 170   4
head(m)
          GO EVIDENCE ONTOLOGY SYMBOL
1 GO:0006805      TAS       BP   NAT1
2 GO:0006805      TAS       BP   NAT2
3 GO:0006805      TAS       BP  AADAC
4 GO:0006805      TAS       BP    ADA
5 GO:0006805      IEA       BP    AHR
6 GO:0006805      TAS       BP    AHR

Note: If you want to extract genes in your input gene list that are belong to a specific term/pathway, you can use the geneInCategory function.

15.1.2 bitr_kegg: converting biological IDs using KEGG API

data(gcSample)
hg <- gcSample[[1]]
head(hg)
[1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"
eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')
Warning in bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid", organism
= "hsa"): 4.17% of input gene IDs are fail to map...
head(eg2np)
   kegg ncbi-proteinid
1 10001      NP_005457
2 10209      NP_005792
3 10232      NP_037536
4 10324      NP_006054
5 10411   NP_001092001
6 10614      NP_006451

The ID type (both fromType & toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is entrezgene ID for eukaryote species and Locus ID for prokaryotes.

Many prokaryote species don’t have entrezgene ID available. For example we can check the gene information of ece:Z5100 in http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which have NCBI-ProteinID and UnitProt links in the Other DBs Entry, but not NCBI-GeneID.

If we try to convert Z5100 to ncbi-geneid, bitr_kegg will throw error of ncbi-geneid is not supported.

bitr_kegg("Z5100", fromType="kegg", toType='ncbi-geneid', organism='ece')

## Error in KEGG_convert(fromType, toType, organism) :
## ncbi-geneid is not supported for ece ...

We can of course convert it to ncbi-proteinid and uniprot:

bitr_kegg("Z5100", fromType="kegg", toType='ncbi-proteinid', organism='ece')
   kegg ncbi-proteinid
1 Z5100       AAG58814
bitr_kegg("Z5100", fromType="kegg", toType='uniprot', organism='ece')
   kegg    uniprot
1 Z5100 A0A9Q7EE73

15.2 setReadable: translating gene IDs to human readable symbols

Some of the functions, especially those internally supported for DO, GO, and Reactome Pathway, support a parameter, readable. If readable = TRUE, all the gene IDs will be translated to gene symbols. The readable parameter is not available for enrichment analysis of KEGG or using user’s own annotation. KEGG analysis using enrichKEGG and gseKEGG, which internally query annotation information from KEEGG database and thus support all species if it is available in the KEGG database. However, KEGG database doesn’t provide gene ID to symbol mapping information. For analysis using user’s own annotation data, we even don’t know what species is in analyzed. Translating gene IDs to gene symbols is partly supported using the setReadable function if and only if there is an OrgDb available. The setReadable() function also works with compareCluster() output.

data(geneList, package="DOSE")
de <- names(geneList)[1:100]
x <- enrichKEGG(de)
## The geneID column is ENTREZID
head(x, 3)
                   category           subcategory       ID         Description
hsa04110 Cellular Processes Cell growth and death hsa04110          Cell cycle
hsa04218 Cellular Processes Cell growth and death hsa04218 Cellular senescence
hsa04814 Cellular Processes         Cell motility hsa04814      Motor proteins
         GeneRatio  BgRatio RichFactor FoldEnrichment    zScore       pvalue
hsa04110     12/59 158/9500 0.07594937      12.229135 11.251512 1.679507e-10
hsa04218      7/59 157/9500 0.04458599       7.179100  6.171458 4.885886e-05
hsa04814      7/59 197/9500 0.03553299       5.721414  5.293582 2.029593e-04
             p.adjust       qvalue
hsa04110 2.116179e-08 1.944693e-08
hsa04218 3.078108e-03 2.828671e-03
hsa04814 6.610189e-03 6.074527e-03
                                                             geneID Count
hsa04110 8318/991/9133/10403/890/983/4085/81620/7272/9212/1111/9319    12
hsa04218                          2305/4605/9133/890/983/51806/1111     7
hsa04814                     9493/1062/81930/3832/3833/146909/10112     7
y <- setReadable(x, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
## The geneID column is translated to symbol
head(y, 3)
                   category           subcategory       ID         Description
hsa04110 Cellular Processes Cell growth and death hsa04110          Cell cycle
hsa04218 Cellular Processes Cell growth and death hsa04218 Cellular senescence
hsa04814 Cellular Processes         Cell motility hsa04814      Motor proteins
         GeneRatio  BgRatio RichFactor FoldEnrichment    zScore       pvalue
hsa04110     12/59 158/9500 0.07594937      12.229135 11.251512 1.679507e-10
hsa04218      7/59 157/9500 0.04458599       7.179100  6.171458 4.885886e-05
hsa04814      7/59 197/9500 0.03553299       5.721414  5.293582 2.029593e-04
             p.adjust       qvalue
hsa04110 2.116179e-08 1.944693e-08
hsa04218 3.078108e-03 2.828671e-03
hsa04814 6.610189e-03 6.074527e-03
                                                                        geneID
hsa04110 CDC45/CDC20/CCNB2/NDC80/CCNA2/CDK1/MAD2L1/CDT1/TTK/AURKB/CHEK1/TRIP13
hsa04218                             FOXM1/MYBL2/CCNB2/CCNA2/CDK1/CALML5/CHEK1
hsa04814                          KIF23/CENPE/KIF18A/KIF11/KIFC1/KIF18B/KIF20A
         Count
hsa04110    12
hsa04218     7
hsa04814     7

For those functions that internally support readable parameter, user can also use setReadable for translating gene IDs.

15.3 Parsing GMT files

The GMT (Gene Matrix Transposed) file format is a tab delimited file format that is widely used to describe gene sets. Each row in the GMT format represents one gene set and each gene set is described by a name (or ID), a description and the genes in the gene set as illustrated in Figure 15.1.

The clusterProfiler package implemented a function, read.gmt(), to parse GMT file into a data.frame. The WikiPathway GMT file encodes information of version, wpid and species into the Description column. The clusterProfiler provides the read.gmt.wp() function to parse WikiPathway GMT file and supports parsing information that encoded in the Description column.

# use `wget -c` in `download.file`
wget::wget_set()
url <- "https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=COVID-19_Related_Gene_Sets"
download.file(url, destfile = "COVID19_GeneSets.gmt")

covid19_gs <- read.gmt("COVID19_GeneSets.gmt")
head(covid19_gs)
                                    term    gene
1 COVID19-E protein host PPI from Krogan    BRD4
2 COVID19-E protein host PPI from Krogan    BRD2
3 COVID19-E protein host PPI from Krogan SLC44A2
4 COVID19-E protein host PPI from Krogan  ZC3H18
5 COVID19-E protein host PPI from Krogan   AP3B1
6 COVID19-E protein host PPI from Krogan   CWC27

Gene set resources:

There are many gene sets available online. After parsing by the read.gmt() function, the data can be directly used to perform enrichment analysis using enricher() and GSEA() functions (see also Chapter 12).

15.4 GO utilities

15.4.1 drop specific GO terms or level

The enrichGO() function tests the whole GO corpus and enriched result may contains very general terms. With dropGO function, user can remove specific GO terms or GO level from results obtained from both enrichGO() and compareCluster().

15.4.2 test GO at sepcific level

The enrichGO() doesn’t contain parameter to restrict the test at specific GO level. Instead, we provide a function, gofilter(), to restrict the result at specific GO level. It works with results obtained from both enrichGO() and compareCluster().

15.4.3 reduce redundancy of enriched GO terms

GO is organized in parent-child structure, thus a parent term can be overlapped with a large proportion with all its child terms. This can result in redundant findings. To solve this issue, clusterProfiler implements simplify method to reduce redundant GO terms from the outputs of enrichGO() and gseGO(). The function internally called GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and remove those highly similar terms by keeping one representative term. The simplify() method apply select_fun (which can be a user defined function) to feature by to select one representative term from redundant terms (which have similarity higher than cutoff).

In Figure 15.2(A), we can found that there are many redundant terms form a highly condense network. After removing redundant terms using the simplify() method, the result is more clear to view the whole story.

library(enrichplot)
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 2]
bp <- enrichGO(de, ont="BP", OrgDb = 'org.Hs.eg.db')
bp <- pairwise_termsim(bp)
bp2 <- simplify(bp, cutoff=0.7, by="p.adjust", select_fun=min)
p1 <- emapplot(bp)
p2 <- emapplot(bp2)
plot_list(p1, p2, ncol=2, tag_levels = 'A')
Figure 15.2: Visualize enriched terms by EnrichmentMap using the emapplot() function. (A) original result. (B) simplify result.

Alternatively, users can use slim version of GO and use the enricher() o r gseGO() functions to analyze.

15.5 Data frame interface for accessing enriched results

The [ and [[ (asis parameter).

References

Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An r Package for Measuring Semantic Similarity Among GO Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.