12 Universal enrichment analysis
The clusterProfiler package (Yu et al. 2012) supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.
The clusterProfiler package provides enricher()
function for hypergeometric test and GSEA()
function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE
and TERM2NAME
. As indicated in the parameter names, TERM2GENE
is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME
is a data.frame
with first column of term ID and second column of corresponding term name. TERM2NAME
is optional.
12.1 Input data
For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2 package).
For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList
which was derived from R
package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList
dataset. If you want to prepare your own geneList
, please refer to the FAQ.
We can load the sample data into R via:
## 4312 8318 10874 55143 55388 991
## 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
Suppose we define fold change greater than 2 as DEGs:
## [1] "4312" "8318" "10874" "55143" "55388" "991"
12.2 Cell Marker
cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
dplyr::select(cellName, geneID) %>%
dplyr::mutate(geneID = strsplit(geneID, ', ')) %>%
tidyr::unnest()
12.2.1 Cell Marker over-presentaton analysis
## ID Description GeneRatio
## Neural progenitor cell Neural progenitor cell Neural progenitor cell 33/153
## MKI67+ progenitor cell MKI67+ progenitor cell MKI67+ progenitor cell 20/153
## DCLK1+ progenitor cell DCLK1+ progenitor cell DCLK1+ progenitor cell 7/153
## BgRatio pvalue p.adjust qvalue
## Neural progenitor cell 176/11582 3.395466e-29 3.259648e-27 2.716373e-27
## MKI67+ progenitor cell 111/11582 1.426062e-17 6.845100e-16 5.704250e-16
## DCLK1+ progenitor cell 99/11582 3.298755e-04 1.055601e-02 8.796679e-03
## geneID
## Neural progenitor cell 991/9493/9833/9133/10403/7153/259266/6241/9787/11065/55355/55872/51203/83461/22974/10460/79019/55839/890/983/4085/5080/332/3832/9212/51659/9055/891/4174/9232/1602/2018/4137
## MKI67+ progenitor cell 991/9133/10403/7153/259266/6241/9787/11065/51203/22974/10460/890/983/332/9212/9055/891/9232/2167/51705
## DCLK1+ progenitor cell 1307/730/54829/10234/79901/57758/4969
## Count
## Neural progenitor cell 33
## MKI67+ progenitor cell 20
## DCLK1+ progenitor cell 7
12.2.2 Cell Marker gene set enrichment analysis
## ID
## Neural progenitor cell Neural progenitor cell
## FOXN4+ cell FOXN4+ cell
## Paneth cell Paneth cell
## Leydig precursor cell Leydig precursor cell
## Ciliated epithelial cell Ciliated epithelial cell
## FGFR1HighNME5- epithelial cell FGFR1HighNME5- epithelial cell
## Description setSize
## Neural progenitor cell Neural progenitor cell 149
## FOXN4+ cell FOXN4+ cell 63
## Paneth cell Paneth cell 261
## Leydig precursor cell Leydig precursor cell 184
## Ciliated epithelial cell Ciliated epithelial cell 217
## FGFR1HighNME5- epithelial cell FGFR1HighNME5- epithelial cell 131
## enrichmentScore NES pvalue
## Neural progenitor cell 0.7160571 3.059580 1.000000e-10
## FOXN4+ cell 0.7136090 2.666139 1.000000e-10
## Paneth cell 0.5381280 2.473211 1.000000e-10
## Leydig precursor cell -0.5710583 -2.305756 1.000000e-10
## Ciliated epithelial cell -0.5012031 -2.056223 1.888989e-10
## FGFR1HighNME5- epithelial cell -0.5739472 -2.205135 2.745857e-10
## p.adjust qvalues rank
## Neural progenitor cell 4.125000e-09 2.421053e-09 684
## FOXN4+ cell 4.125000e-09 2.421053e-09 1419
## Paneth cell 4.125000e-09 2.421053e-09 2187
## Leydig precursor cell 4.125000e-09 2.421053e-09 2609
## Ciliated epithelial cell 6.233665e-09 3.658674e-09 1440
## FGFR1HighNME5- epithelial cell 7.551107e-09 4.431910e-09 1404
## leading_edge
## Neural progenitor cell tags=47%, list=5%, signal=45%
## FOXN4+ cell tags=51%, list=11%, signal=45%
## Paneth cell tags=45%, list=18%, signal=38%
## Leydig precursor cell tags=45%, list=21%, signal=36%
## Ciliated epithelial cell tags=29%, list=12%, signal=27%
## FGFR1HighNME5- epithelial cell tags=39%, list=11%, signal=35%
## core_enrichment
## Neural progenitor cell 991/9493/9833/9133/10403/7153/259266/6241/9787/11065/55355/55872/51203/83461/22974/10460/79019/55839/890/983/4085/5080/332/3832/9212/51659/9055/891/4174/9232/4171/1033/1164/11004/51512/11130/79682/26227/57405/10615/3149/23594/56992/9768/11339/3070/4288/1163/4175/4173/2237/29028/3838/54478/3015/699/81611/1063/5111/26271/51053/10535/1719/8543/9289/4176/1841/7112/58516/3148
## FOXN4+ cell 79733/4751/79019/983/81620/4069/8836/3070/10733/81831/3932/10926/56938/7298/6491/262/1029/1719/8894/5902/10950/5496/2023/124222/10263/51142/11168/3638/51491/8751/6637/86
## Paneth cell 597/3627/8140/1844/4283/7850/2921/7941/4318/4069/3576/1075/10797/56833/713/1230/200315/712/51338/64581/57823/10437/8767/5341/91543/6351/3055/6614/10859/330/9332/313/51765/1051/22797/4067/80896/864/5226/9046/5336/8942/3162/4688/29887/5641/11314/79888/942/6916/1439/1116/10288/7127/4064/1794/483/963/3689/5788/3383/962/23250/3937/2207/3394/5552/54504/1514/79879/64092/409/7805/4360/3936/4689/5880/7128/10105/9308/51296/10148/10287/8477/6503/929/945/10261/4794/9775/8013/5912/7124/5329/10875/7097/7852/1378/5743/6398/7454/3119/951/6039/1240/5142/1536/3071/834/8676/7305/7133/3310/958/5724/728/11309/3059
## Leydig precursor cell 7169/120/2949/22871/79822/7070/3624/10761/3671/80114/6326/25959/2013/9902/8654/4017/4811/26010/1291/92689/126393/7474/7168/27303/51454/1306/678/8425/7837/590/5919/23635/8829/1831/30008/1277/7078/4747/6591/56944/8406/1281/11167/1290/9037/2946/283298/1009/4313/6320/3075/6387/5159/1289/1292/116039/3908/5101/1909/83716/5376/633/6857/10631/54674/54587/30846/83468/3487/8404/2200/1634/1287/3479/5348/2006/7373/730/10234/4239/79901/4969
## Ciliated epithelial cell 9666/1490/9284/7840/51281/126070/5021/55779/80823/79781/11001/81890/79645/79411/8777/4753/2701/2259/9854/8100/323/55152/51364/9940/23345/216/146542/22832/582/57728/4246/7220/55064/79819/54681/5002/79884/10610/26018/79659/6542/53832/7033/8938/23245/26960/55638/9576/114327/54585/1287/9358/55259/643314/64799/6038/80129/4857/80736/2066/79083/79846/11122/79838
## FGFR1HighNME5- epithelial cell 3931/6586/23284/7106/5919/316/8829/4060/10559/8642/2846/58494/79987/8835/8076/64221/5118/83700/79776/50509/1513/7058/4313/55314/91851/10536/2674/3305/5159/443/3908/5101/2690/55733/2550/6857/10631/81035/4330/83468/576/2200/5157/7049/51380/1307/196740/2532/1308/8483/4239
12.3 MSigDb analysis
Molecular Signatures Database is a collection of annotated gene sets. It contains 8 major collections:
- H: hallmark gene sets
- C1: positional gene sets
- C2: curated gene sets
- C3: motif gene sets
- C4: computational gene sets
- C5: GO gene sets
- C6: oncogenic signatures
- C7: immunologic signatures
Users can download GMT files from Broad Institute and use the read.gmt()
function to parse the file to be used in enricher()
and GSEA()
.
There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler (Yu et al. 2012).
It supports several specices:
## [1] "Anolis carolinensis" "Bos taurus"
## [3] "Caenorhabditis elegans" "Canis lupus familiaris"
## [5] "Danio rerio" "Drosophila melanogaster"
## [7] "Equus caballus" "Felis catus"
## [9] "Gallus gallus" "Homo sapiens"
## [11] "Macaca mulatta" "Monodelphis domestica"
## [13] "Mus musculus" "Ornithorhynchus anatinus"
## [15] "Pan troglodytes" "Rattus norvegicus"
## [17] "Saccharomyces cerevisiae" "Schizosaccharomyces pombe 972h-"
## [19] "Sus scrofa" "Xenopus tropicalis"
We can retrieve all human gene sets:
## gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
## 1 C3 MIR:MIR_Legacy AAACCAC_MIR140 ABCC4 10257 ENSG00000125257
## 2 C3 MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2 23172 ENSG00000165660
## human_gene_symbol human_entrez_gene human_ensembl_gene gs_id gs_pmid
## 1 ABCC4 10257 ENSG00000125257 M12609
## 2 ABRAXAS2 23172 ENSG00000165660 M12609
## gs_geoid gs_exact_source gs_url
## 1
## 2
## gs_description
## 1 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
## 2 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
Or specific collection. Here we use C6, oncogenic gene sets as an example:
m_t2g <- msigdbr(species = "Homo sapiens", category = "C6") %>%
dplyr::select(gs_name, entrez_gene)
head(m_t2g)
## # A tibble: 6 × 2
## gs_name entrez_gene
## <chr> <int>
## 1 AKT_UP_MTOR_DN.V1_DN 25864
## 2 AKT_UP_MTOR_DN.V1_DN 95
## 3 AKT_UP_MTOR_DN.V1_DN 137872
## 4 AKT_UP_MTOR_DN.V1_DN 134
## 5 AKT_UP_MTOR_DN.V1_DN 55326
## 6 AKT_UP_MTOR_DN.V1_DN 55326
12.3.1 MSigDb over-presentaton analysis
## ID Description GeneRatio
## RPS14_DN.V1_DN RPS14_DN.V1_DN RPS14_DN.V1_DN 22/183
## GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP 16/183
## PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_DN 15/183
## VEGF_A_UP.V1_DN VEGF_A_UP.V1_DN VEGF_A_UP.V1_DN 15/183
## RB_P107_DN.V1_UP RB_P107_DN.V1_UP RB_P107_DN.V1_UP 10/183
## E2F1_UP.V1_UP E2F1_UP.V1_UP E2F1_UP.V1_UP 12/183
## BgRatio pvalue p.adjust qvalue
## RPS14_DN.V1_DN 186/10922 4.657342e-13 7.870908e-11 6.422230e-11
## GCNP_SHH_UP_LATE.V1_UP 181/10922 5.765032e-08 4.871452e-06 3.974838e-06
## PRC2_EZH2_UP.V1_DN 192/10922 7.574546e-07 3.419660e-05 2.790255e-05
## VEGF_A_UP.V1_DN 193/10922 8.093869e-07 3.419660e-05 2.790255e-05
## RB_P107_DN.V1_UP 130/10922 6.394609e-05 2.013355e-03 1.642787e-03
## E2F1_UP.V1_UP 188/10922 7.477287e-05 2.013355e-03 1.642787e-03
## geneID
## RPS14_DN.V1_DN 10874/55388/991/9493/1062/4605/9133/23397/79733/9787/55872/83461/54821/51659/9319/9055/10112/4174/5105/2532/7021/79901
## GCNP_SHH_UP_LATE.V1_UP 55388/7153/79733/6241/9787/51203/983/9212/1111/9319/9055/3833/6790/4174/3169/1580
## PRC2_EZH2_UP.V1_DN 8318/55388/4605/23397/9787/55355/10460/6362/81620/2146/7272/9212/11182/3887/24137
## VEGF_A_UP.V1_DN 8318/9493/1062/9133/10403/6241/9787/4085/332/3832/7272/891/23362/2167/10234
## RB_P107_DN.V1_UP 8318/23397/79733/6241/4085/8208/9055/24137/4174/1307
## E2F1_UP.V1_UP 55388/7153/23397/79733/9787/2146/2842/9212/8208/1111/9055/3833
## Count
## RPS14_DN.V1_DN 22
## GCNP_SHH_UP_LATE.V1_UP 16
## PRC2_EZH2_UP.V1_DN 15
## VEGF_A_UP.V1_DN 15
## RB_P107_DN.V1_UP 10
## E2F1_UP.V1_UP 12
12.3.2 MSigDb gene set enrichment analysis
In over-presentaton analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motif using GSEA approach.
C3_t2g <- msigdbr(species = "Homo sapiens", category = "C3") %>%
dplyr::select(gs_name, entrez_gene)
head(C3_t2g)
## # A tibble: 6 × 2
## gs_name entrez_gene
## <chr> <int>
## 1 AAACCAC_MIR140 10257
## 2 AAACCAC_MIR140 23172
## 3 AAACCAC_MIR140 81
## 4 AAACCAC_MIR140 81
## 5 AAACCAC_MIR140 90
## 6 AAACCAC_MIR140 8754
## ID Description setSize
## HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES 401
## E2F1_Q3 E2F1_Q3 E2F1_Q3 187
## E2F1_Q6 E2F1_Q6 E2F1_Q6 185
## SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 129
## E2F_Q4 E2F_Q4 E2F_Q4 193
## E2F_Q6 E2F_Q6 E2F_Q6 190
## enrichmentScore NES pvalue p.adjust
## HSD17B8_TARGET_GENES 0.6368099 3.065382 1.000000e-10 1.503229e-07
## E2F1_Q3 0.4983913 2.222218 1.000000e-10 1.503229e-07
## E2F1_Q6 0.4906657 2.191042 1.362443e-10 1.503229e-07
## SGCGSSAAA_E2F1DP2_01 0.5408182 2.298186 4.030580e-10 3.335305e-07
## E2F_Q4 0.4685394 2.100798 1.297983e-09 8.592650e-07
## E2F_Q6 0.4672345 2.086952 4.380543e-09 1.318145e-06
## qvalues rank leading_edge
## HSD17B8_TARGET_GENES 1.161662e-07 1042 tags=36%, list=8%, signal=34%
## E2F1_Q3 1.161662e-07 2045 tags=36%, list=16%, signal=30%
## E2F1_Q6 1.161662e-07 1819 tags=34%, list=15%, signal=30%
## SGCGSSAAA_E2F1DP2_01 2.577450e-07 1797 tags=38%, list=14%, signal=33%
## E2F_Q4 6.640210e-07 1663 tags=29%, list=13%, signal=26%
## E2F_Q6 1.018634e-06 1663 tags=29%, list=13%, signal=25%
## core_enrichment
## HSD17B8_TARGET_GENES 55143/55388/991/2305/9493/1062/4605/9833/9133/10403/7153/23397/79733/259266/6241/55165/9787/11065/55355/9582/220134/55872/83461/10460/4751/10635/55839/890/9415/983/54821/4085/9837/5080/332/7272/64151/2842/9212/51659/9319/9055/3833/146909/10112/51514/4174/9232/1033/9928/3161/11004/993/4603/79801/990/5347/55215/701/55723/51512/55635/9156/11130/10024/57405/10615/1894/2491/8438/9700/5888/7083/898/3149/56992/9768/4288/10733/1163/4175/5307/2237/29899/55010/59336/56942/5984/29028/3838/1058/54478/3015/699/6491/81611/1063/27346/64785/9401/26271/51026/641/1869/10535/1029/28998/1763/8970/54892/55159/864/93594/1789/4176/7111/3148/116832/9585/9735/55732/81624/23310/1871/1031/79915/995/10051/30012/1104/80178/78995/7468/284403/5230/8549/7884/5558/4172/5424/79866/3182/56902/83990
## E2F1_Q3 79733/6241/983/5080/2146/8715/4171/993/990/56992/9768/4998/10733/7037/4175/54962/54954/29028/3015/2643/85236/5111/64785/26271/51053/1869/3925/114/5427/4176/7112/4436/1871/79915/9088/6632/284403/7290/2597/7398/4172/5424/124222/1633/10849/9824/4678/6646/29902/63967/56886/6059/23234/1786/204/51087/23636/2048/83463/10393/79173/688/3007/79677/9462/56946/9221
## E2F1_Q6 79733/6241/983/5080/81620/2146/4171/993/990/3159/9768/4998/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/29902/50/63967/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173/688
## SGCGSSAAA_E2F1DP2_01 79733/6241/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/5427/4176/5902/4436/1871/79915/9088/6632/284403/7290/4172/5424/124222/1633/7374/2189/4678/2289/50/63967/56886/23234/6839/1786/204/51087/23636/1174/83463/10492/79173
## E2F_Q4 8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7473/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636
## E2F_Q6 8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636