Chapter 14 Useful utilities
14.1 bitr: Biological Id TranslatoR
clusterProfiler provides bitr and bitr_kegg for converting between ID types. Both functions support many species, including model organisms and many non-model organisms.
library(clusterProfiler)
x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2",
       "ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3", "TOB1", "MITF", "NDRG1",
       "NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1",
       "COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC",
       "ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT",
       "MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK",
       "STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B")
# Map gene symbols to Entrez gene IDs using the human annotation package
eg <- bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(eg)
##   SYMBOL ENTREZID
## 1   GPX3     2878
## 2   GLRX     2745
## 3    LBP     3929
## 4  CRYAB     1410
## 5  DEFB1     1672
## 6  HCLS1     3059
Users should provide an annotation package (OrgDb); both fromType and toType accept any ID type that the package supports.
Use keytypes to list all supported types.
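The call behind the listing that follows is not shown in the text. A minimal sketch, assuming the org.Hs.eg.db annotation package is installed, would be:

```r
# Assumption: org.Hs.eg.db is installed. keytypes() is the AnnotationDbi
# generic that lists the ID types an OrgDb object can translate between.
library(org.Hs.eg.db)

kt <- keytypes(org.Hs.eg.db)
kt
```

The exact set of key types depends on the version of the annotation package, so the listing may differ slightly from one installation to another.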
## [1] "ACCNUM" "ALIAS" "ENSEMBL"
## [4] "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
## [7] "ENZYME" "EVIDENCE" "EVIDENCEALL"
## [10] "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [19] "PFAM" "PMID" "PROSITE"
## [22] "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
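We can translate from one ID type to several other types in a single call. Because the mapping can be one-to-many, a gene may appear in more than one row of the result.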
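A call of the following shape yields a table like the one shown next. This is only a sketch: the vector genes is a shortened, illustrative copy of the symbols used earlier, and the rows returned depend on the installed org.Hs.eg.db version.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Illustrative subset of the gene symbols used earlier in this section.
genes <- c("GPX3", "GLRX", "LBP", "CRYAB")

# toType accepts a character vector, so several target ID types can be
# requested at once. One-to-many mappings produce several rows per gene.
ids <- bitr(genes, fromType = "SYMBOL",
            toType = c("UNIPROT", "ENSEMBL"),
            OrgDb = org.Hs.eg.db)
head(ids)
```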
##   SYMBOL    UNIPROT         ENSEMBL
## 1   GPX3     P22352 ENSG00000211445
## 2   GLRX A0A024RAM2 ENSG00000173221
## 3   GLRX     P35754 ENSG00000173221
## 4    LBP     P18428 ENSG00000129988
## 5    LBP     Q8TCF0 ENSG00000129988
## 6  CRYAB     P02511 ENSG00000109846
For GO analysis, users don’t need to convert IDs: any ID type provided by the OrgDb can be used in groupGO, enrichGO and gseGO by specifying the keyType parameter.
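As a small illustration of the keyType parameter, the sketch below passes gene symbols directly to groupGO. The gene vector, the ontology and the level are arbitrary choices made for this example, not values taken from the text.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Illustrative gene symbols; no prior conversion to Entrez IDs is done.
genes <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2")

# keyType tells groupGO how to interpret the input IDs.
# ont = "CC" and level = 3 are arbitrary choices for this sketch.
ggo <- groupGO(gene = genes, OrgDb = org.Hs.eg.db,
               keyType = "SYMBOL", ont = "CC", level = 3)
head(ggo)
```

The same keyType argument is accepted by enrichGO and gseGO.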
14.1.1 bitr_kegg: converting biological IDs using KEGG API
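The code that generated the output in this subsection is not shown. The sketch below is one way to obtain output of this form, assuming the input is the first gene cluster of the gcSample example data shipped with clusterProfiler and that an internet connection is available for the KEGG query.

```r
library(clusterProfiler)

# gcSample is an example data set in clusterProfiler: a list of
# character vectors holding Entrez gene IDs.
data(gcSample)
hg <- gcSample[[1]]
head(hg)

# bitr_kegg queries the KEGG API, so this step needs internet access.
# For human ("hsa"), the 'kegg' ID is the Entrez gene ID.
eg2np <- bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid",
                   organism = "hsa")
head(eg2np)
```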
## [1] "4597" "7111" "5266" "2175" "755" "23046"
##    kegg ncbi-proteinid
## 1 10001      NP_005457
## 2 10209      NP_005792
## 3 10232      NP_037536
## 4 10324      NP_006054
## 5 10411   NP_001092002
## 6 10614      NP_006451
The ID type (both fromType and toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ ID is the primary ID used in the KEGG database, whose gene data come from NCBI. As a rule of thumb, the ‘kegg’ ID is the Entrez gene ID for eukaryotic species and the locus ID for prokaryotes.
Many prokaryotic species don’t have Entrez gene IDs available. For example, we can check the gene information of ece:Z5100 at http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which has NCBI-ProteinID and UniProt links in its Other DBs entry, but no NCBI-GeneID.
If we try to convert Z5100 to ncbi-geneid, bitr_kegg throws an error saying that ncbi-geneid is not supported.
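A sketch of the failing call is given here. It is wrapped in tryCatch so that a script keeps running and the message can be inspected. It needs internet access, and whether the error still occurs depends on what KEGG currently provides for this organism.

```r
library(clusterProfiler)

# Try to convert a prokaryotic locus ID to an NCBI gene ID.
# tryCatch returns the error message as a string instead of stopping.
res <- tryCatch(
  bitr_kegg("Z5100", fromType = "kegg", toType = "ncbi-geneid",
            organism = "ece"),
  error = function(e) conditionMessage(e)
)
res
```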
## Error in KEGG_convert(fromType, toType, organism) :
## ncbi-geneid is not supported for ece ...
We can of course convert it to ncbi-proteinid and uniprot:
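The two conversions can be sketched as follows. The calls mirror the failing one, with only toType changed, and both need internet access because they query the KEGG API.

```r
library(clusterProfiler)

# Convert the KEGG locus ID to an NCBI protein ID ...
np <- bitr_kegg("Z5100", fromType = "kegg", toType = "ncbi-proteinid",
                organism = "ece")
np

# ... and to a UniProt accession.
up <- bitr_kegg("Z5100", fromType = "kegg", toType = "uniprot",
                organism = "ece")
up
```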
##    kegg ncbi-proteinid
## 1 Z5100       AAG58814
##    kegg    uniprot
## 1 Z5100 A0A4Q2TPW7
14.2 setReadable: translating gene IDs to human readable symbols
Some functions, especially those with built-in support for DO, GO, and Reactome Pathway, accept a readable parameter. If readable = TRUE, all gene IDs are translated to gene symbols. The readable parameter is not available for KEGG enrichment analysis or for analyses that use the user’s own annotation. KEGG analyses with enrichKEGG and gseKEGG query annotation information from the KEGG database internally and thus support every species available in KEGG. However, the KEGG database doesn’t provide gene ID to symbol mappings. For analyses that use the user’s own annotation data, we don’t even know which species is being analyzed. In these cases, translating gene IDs to gene symbols is partly supported through the setReadable function, if and only if an OrgDb is available.
library(org.Hs.eg.db)
library(clusterProfiler)
data(geneList, package="DOSE")
de <- names(geneList)[1:100]
x <- enrichKEGG(de)
## The geneID column is ENTREZID
head(x, 3)
##                ID         Description GeneRatio  BgRatio
## hsa04110 hsa04110          Cell cycle      8/48 124/7932
## hsa04218 hsa04218 Cellular senescence      7/48 160/7932
## hsa04114 hsa04114      Oocyte meiosis      6/48 128/7932
##                pvalue     p.adjust       qvalue
## hsa04110 6.356283e-07 7.182599e-05 6.490099e-05
## hsa04218 4.377944e-05 2.473538e-03 2.235055e-03
## hsa04114 1.105828e-04 4.165285e-03 3.763695e-03
##                                        geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111     8
## hsa04218    2305/4605/9133/890/983/51806/1111     7
## hsa04114         991/9133/983/4085/51806/6790     6
y <- setReadable(x, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
## The geneID column is translated to symbol
head(y, 3)
##                ID         Description GeneRatio  BgRatio
## hsa04110 hsa04110          Cell cycle      8/48 124/7932
## hsa04218 hsa04218 Cellular senescence      7/48 160/7932
## hsa04114 hsa04114      Oocyte meiosis      6/48 128/7932
##                pvalue     p.adjust       qvalue
## hsa04110 6.356283e-07 7.182599e-05 6.490099e-05
## hsa04218 4.377944e-05 2.473538e-03 2.235055e-03
## hsa04114 1.105828e-04 4.165285e-03 3.763695e-03
##                                                 geneID
## hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1
## hsa04218     FOXM1/MYBL2/CCNB2/CCNA2/CDK1/CALML5/CHEK1
## hsa04114          CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA
##          Count
## hsa04110     8
## hsa04218     7
## hsa04114     6
For functions that support the readable parameter internally, users can also apply setReadable afterwards to translate gene IDs.
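The two routes can be compared with a short sketch. It uses enrichGO as an example of a function with a readable parameter, and the choice of ont = "CC" is arbitrary. The same gene list as above is assumed.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

data(geneList, package = "DOSE")
de <- names(geneList)[1:100]

# Route 1: ask enrichGO to translate gene IDs to symbols itself.
ego_direct <- enrichGO(de, OrgDb = org.Hs.eg.db, ont = "CC",
                       readable = TRUE)

# Route 2: run the analysis on Entrez IDs, then translate afterwards.
ego_raw  <- enrichGO(de, OrgDb = org.Hs.eg.db, ont = "CC")
ego_post <- setReadable(ego_raw, OrgDb = org.Hs.eg.db)

# Both routes are expected to give the same geneID column.
identical(as.data.frame(ego_direct)$geneID,
          as.data.frame(ego_post)$geneID)
```

Translating afterwards keeps the original Entrez IDs available in ego_raw, which can be convenient when later steps need the untranslated IDs.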