12 Universal enrichment analysis

The clusterProfiler package (Yu et al. 2012) supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.

The clusterProfiler package provides enricher() function for hypergeometric test and GSEA() function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.

12.1 Input data

For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2 package).

For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset. If you want to prepare your own geneList, please refer to the FAQ.

We can load the sample data into R via:

data(geneList, package="DOSE")
head(geneList)
##     4312     8318    10874    55143    55388      991 
## 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857

Suppose we define fold change greater than 2 as DEGs:

gene <- names(geneList)[abs(geneList) > 2]
head(gene)
## [1] "4312"  "8318"  "10874" "55143" "55388" "991"

12.2 Cell Marker

cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')

## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
    dplyr::select(cellName, geneID) %>%
    dplyr::mutate(geneID = strsplit(geneID, ', ')) %>%
    tidyr::unnest()

12.2.1 Cell Marker over-presentaton analysis

x <- enricher(gene, TERM2GENE = cells)
head(x)
##                                            ID            Description GeneRatio
## Neural progenitor cell Neural progenitor cell Neural progenitor cell    33/153
## MKI67+ progenitor cell MKI67+ progenitor cell MKI67+ progenitor cell    20/153
## DCLK1+ progenitor cell DCLK1+ progenitor cell DCLK1+ progenitor cell     7/153
##                          BgRatio       pvalue     p.adjust       qvalue
## Neural progenitor cell 176/11582 3.395466e-29 3.259648e-27 2.716373e-27
## MKI67+ progenitor cell 111/11582 1.426062e-17 6.845100e-16 5.704250e-16
## DCLK1+ progenitor cell  99/11582 3.298755e-04 1.055601e-02 8.796679e-03
##                                                                                                                                                                                              geneID
## Neural progenitor cell 991/9493/9833/9133/10403/7153/259266/6241/9787/11065/55355/55872/51203/83461/22974/10460/79019/55839/890/983/4085/5080/332/3832/9212/51659/9055/891/4174/9232/1602/2018/4137
## MKI67+ progenitor cell                                                                       991/9133/10403/7153/259266/6241/9787/11065/51203/22974/10460/890/983/332/9212/9055/891/9232/2167/51705
## DCLK1+ progenitor cell                                                                                                                                        1307/730/54829/10234/79901/57758/4969
##                        Count
## Neural progenitor cell    33
## MKI67+ progenitor cell    20
## DCLK1+ progenitor cell     7

12.2.2 Cell Marker gene set enrichment analysis

y <- GSEA(geneList, TERM2GENE = cells)
head(y)
##                                                            ID
## Neural progenitor cell                 Neural progenitor cell
## FOXN4+ cell                                       FOXN4+ cell
## Paneth cell                                       Paneth cell
## Leydig precursor cell                   Leydig precursor cell
## Ciliated epithelial cell             Ciliated epithelial cell
## FGFR1HighNME5- epithelial cell FGFR1HighNME5- epithelial cell
##                                                   Description setSize
## Neural progenitor cell                 Neural progenitor cell     149
## FOXN4+ cell                                       FOXN4+ cell      63
## Paneth cell                                       Paneth cell     261
## Leydig precursor cell                   Leydig precursor cell     184
## Ciliated epithelial cell             Ciliated epithelial cell     217
## FGFR1HighNME5- epithelial cell FGFR1HighNME5- epithelial cell     131
##                                enrichmentScore       NES       pvalue
## Neural progenitor cell               0.7160571  3.059580 1.000000e-10
## FOXN4+ cell                          0.7136090  2.666139 1.000000e-10
## Paneth cell                          0.5381280  2.473211 1.000000e-10
## Leydig precursor cell               -0.5710583 -2.305756 1.000000e-10
## Ciliated epithelial cell            -0.5012031 -2.056223 1.888989e-10
## FGFR1HighNME5- epithelial cell      -0.5739472 -2.205135 2.745857e-10
##                                    p.adjust      qvalues rank
## Neural progenitor cell         4.125000e-09 2.421053e-09  684
## FOXN4+ cell                    4.125000e-09 2.421053e-09 1419
## Paneth cell                    4.125000e-09 2.421053e-09 2187
## Leydig precursor cell          4.125000e-09 2.421053e-09 2609
## Ciliated epithelial cell       6.233665e-09 3.658674e-09 1440
## FGFR1HighNME5- epithelial cell 7.551107e-09 4.431910e-09 1404
##                                                  leading_edge
## Neural progenitor cell          tags=47%, list=5%, signal=45%
## FOXN4+ cell                    tags=51%, list=11%, signal=45%
## Paneth cell                    tags=45%, list=18%, signal=38%
## Leydig precursor cell          tags=45%, list=21%, signal=36%
## Ciliated epithelial cell       tags=29%, list=12%, signal=27%
## FGFR1HighNME5- epithelial cell tags=39%, list=11%, signal=35%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         core_enrichment
## Neural progenitor cell                                                                                                                                                                                                                                            991/9493/9833/9133/10403/7153/259266/6241/9787/11065/55355/55872/51203/83461/22974/10460/79019/55839/890/983/4085/5080/332/3832/9212/51659/9055/891/4174/9232/4171/1033/1164/11004/51512/11130/79682/26227/57405/10615/3149/23594/56992/9768/11339/3070/4288/1163/4175/4173/2237/29028/3838/54478/3015/699/81611/1063/5111/26271/51053/10535/1719/8543/9289/4176/1841/7112/58516/3148
## FOXN4+ cell                                                                                                                                                                                                                                                                                                                                                                                                                                                                   79733/4751/79019/983/81620/4069/8836/3070/10733/81831/3932/10926/56938/7298/6491/262/1029/1719/8894/5902/10950/5496/2023/124222/10263/51142/11168/3638/51491/8751/6637/86
## Paneth cell                    597/3627/8140/1844/4283/7850/2921/7941/4318/4069/3576/1075/10797/56833/713/1230/200315/712/51338/64581/57823/10437/8767/5341/91543/6351/3055/6614/10859/330/9332/313/51765/1051/22797/4067/80896/864/5226/9046/5336/8942/3162/4688/29887/5641/11314/79888/942/6916/1439/1116/10288/7127/4064/1794/483/963/3689/5788/3383/962/23250/3937/2207/3394/5552/54504/1514/79879/64092/409/7805/4360/3936/4689/5880/7128/10105/9308/51296/10148/10287/8477/6503/929/945/10261/4794/9775/8013/5912/7124/5329/10875/7097/7852/1378/5743/6398/7454/3119/951/6039/1240/5142/1536/3071/834/8676/7305/7133/3310/958/5724/728/11309/3059
## Leydig precursor cell                                                                                                                                                                                   7169/120/2949/22871/79822/7070/3624/10761/3671/80114/6326/25959/2013/9902/8654/4017/4811/26010/1291/92689/126393/7474/7168/27303/51454/1306/678/8425/7837/590/5919/23635/8829/1831/30008/1277/7078/4747/6591/56944/8406/1281/11167/1290/9037/2946/283298/1009/4313/6320/3075/6387/5159/1289/1292/116039/3908/5101/1909/83716/5376/633/6857/10631/54674/54587/30846/83468/3487/8404/2200/1634/1287/3479/5348/2006/7373/730/10234/4239/79901/4969
## Ciliated epithelial cell                                                                                                                                                                                                                                                          9666/1490/9284/7840/51281/126070/5021/55779/80823/79781/11001/81890/79645/79411/8777/4753/2701/2259/9854/8100/323/55152/51364/9940/23345/216/146542/22832/582/57728/4246/7220/55064/79819/54681/5002/79884/10610/26018/79659/6542/53832/7033/8938/23245/26960/55638/9576/114327/54585/1287/9358/55259/643314/64799/6038/80129/4857/80736/2066/79083/79846/11122/79838
## FGFR1HighNME5- epithelial cell                                                                                                                                                                                                                                                                                                                                            3931/6586/23284/7106/5919/316/8829/4060/10559/8642/2846/58494/79987/8835/8076/64221/5118/83700/79776/50509/1513/7058/4313/55314/91851/10536/2674/3305/5159/443/3908/5101/2690/55733/2550/6857/10631/81035/4330/83468/576/2200/5157/7049/51380/1307/196740/2532/1308/8483/4239

12.3 MSigDb analysis

Molecular Signatures Database is a collection of annotated gene sets. It contains 8 major collections:

  • H: hallmark gene sets
  • C1: positional gene sets
  • C2: curated gene sets
  • C3: motif gene sets
  • C4: computational gene sets
  • C5: GO gene sets
  • C6: oncogenic signatures
  • C7: immunologic signatures

Users can download GMT files from Broad Institute and use the read.gmt() function to parse the file to be used in enricher() and GSEA().

There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler (Yu et al. 2012).

It supports several specices:

##  [1] "Anolis carolinensis"             "Bos taurus"                     
##  [3] "Caenorhabditis elegans"          "Canis lupus familiaris"         
##  [5] "Danio rerio"                     "Drosophila melanogaster"        
##  [7] "Equus caballus"                  "Felis catus"                    
##  [9] "Gallus gallus"                   "Homo sapiens"                   
## [11] "Macaca mulatta"                  "Monodelphis domestica"          
## [13] "Mus musculus"                    "Ornithorhynchus anatinus"       
## [15] "Pan troglodytes"                 "Rattus norvegicus"              
## [17] "Saccharomyces cerevisiae"        "Schizosaccharomyces pombe 972h-"
## [19] "Sus scrofa"                      "Xenopus tropicalis"

We can retrieve all human gene sets:

m_df <- msigdbr(species = "Homo sapiens")
head(m_df, 2) %>% as.data.frame
##   gs_cat      gs_subcat        gs_name gene_symbol entrez_gene    ensembl_gene
## 1     C3 MIR:MIR_Legacy AAACCAC_MIR140       ABCC4       10257 ENSG00000125257
## 2     C3 MIR:MIR_Legacy AAACCAC_MIR140    ABRAXAS2       23172 ENSG00000165660
##   human_gene_symbol human_entrez_gene human_ensembl_gene  gs_id gs_pmid
## 1             ABCC4             10257    ENSG00000125257 M12609        
## 2          ABRAXAS2             23172    ENSG00000165660 M12609        
##   gs_geoid gs_exact_source gs_url
## 1                                
## 2                                
##                                                                                                                                                                                           gs_description
## 1 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
## 2 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).

Or specific collection. Here we use C6, oncogenic gene sets as an example:

m_t2g <- msigdbr(species = "Homo sapiens", category = "C6") %>% 
  dplyr::select(gs_name, entrez_gene)
head(m_t2g)
## # A tibble: 6 × 2
##   gs_name              entrez_gene
##   <chr>                      <int>
## 1 AKT_UP_MTOR_DN.V1_DN       25864
## 2 AKT_UP_MTOR_DN.V1_DN          95
## 3 AKT_UP_MTOR_DN.V1_DN      137872
## 4 AKT_UP_MTOR_DN.V1_DN         134
## 5 AKT_UP_MTOR_DN.V1_DN       55326
## 6 AKT_UP_MTOR_DN.V1_DN       55326

12.3.1 MSigDb over-presentaton analysis

em <- enricher(gene, TERM2GENE=m_t2g)
head(em)
##                                            ID            Description GeneRatio
## RPS14_DN.V1_DN                 RPS14_DN.V1_DN         RPS14_DN.V1_DN    22/183
## GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP    16/183
## PRC2_EZH2_UP.V1_DN         PRC2_EZH2_UP.V1_DN     PRC2_EZH2_UP.V1_DN    15/183
## VEGF_A_UP.V1_DN               VEGF_A_UP.V1_DN        VEGF_A_UP.V1_DN    15/183
## RB_P107_DN.V1_UP             RB_P107_DN.V1_UP       RB_P107_DN.V1_UP    10/183
## E2F1_UP.V1_UP                   E2F1_UP.V1_UP          E2F1_UP.V1_UP    12/183
##                          BgRatio       pvalue     p.adjust       qvalue
## RPS14_DN.V1_DN         186/10922 4.657342e-13 7.870908e-11 6.422230e-11
## GCNP_SHH_UP_LATE.V1_UP 181/10922 5.765032e-08 4.871452e-06 3.974838e-06
## PRC2_EZH2_UP.V1_DN     192/10922 7.574546e-07 3.419660e-05 2.790255e-05
## VEGF_A_UP.V1_DN        193/10922 8.093869e-07 3.419660e-05 2.790255e-05
## RB_P107_DN.V1_UP       130/10922 6.394609e-05 2.013355e-03 1.642787e-03
## E2F1_UP.V1_UP          188/10922 7.477287e-05 2.013355e-03 1.642787e-03
##                                                                                                                                        geneID
## RPS14_DN.V1_DN         10874/55388/991/9493/1062/4605/9133/23397/79733/9787/55872/83461/54821/51659/9319/9055/10112/4174/5105/2532/7021/79901
## GCNP_SHH_UP_LATE.V1_UP                                      55388/7153/79733/6241/9787/51203/983/9212/1111/9319/9055/3833/6790/4174/3169/1580
## PRC2_EZH2_UP.V1_DN                                          8318/55388/4605/23397/9787/55355/10460/6362/81620/2146/7272/9212/11182/3887/24137
## VEGF_A_UP.V1_DN                                                   8318/9493/1062/9133/10403/6241/9787/4085/332/3832/7272/891/23362/2167/10234
## RB_P107_DN.V1_UP                                                                         8318/23397/79733/6241/4085/8208/9055/24137/4174/1307
## E2F1_UP.V1_UP                                                                  55388/7153/23397/79733/9787/2146/2842/9212/8208/1111/9055/3833
##                        Count
## RPS14_DN.V1_DN            22
## GCNP_SHH_UP_LATE.V1_UP    16
## PRC2_EZH2_UP.V1_DN        15
## VEGF_A_UP.V1_DN           15
## RB_P107_DN.V1_UP          10
## E2F1_UP.V1_UP             12

12.3.2 MSigDb gene set enrichment analysis

In over-presentaton analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motif using GSEA approach.

C3_t2g <- msigdbr(species = "Homo sapiens", category = "C3") %>% 
  dplyr::select(gs_name, entrez_gene)
head(C3_t2g)
## # A tibble: 6 × 2
##   gs_name        entrez_gene
##   <chr>                <int>
## 1 AAACCAC_MIR140       10257
## 2 AAACCAC_MIR140       23172
## 3 AAACCAC_MIR140          81
## 4 AAACCAC_MIR140          81
## 5 AAACCAC_MIR140          90
## 6 AAACCAC_MIR140        8754
em2 <- GSEA(geneList, TERM2GENE = C3_t2g)
head(em2)
##                                        ID          Description setSize
## HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES     401
## E2F1_Q3                           E2F1_Q3              E2F1_Q3     187
## E2F1_Q6                           E2F1_Q6              E2F1_Q6     185
## SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01     129
## E2F_Q4                             E2F_Q4               E2F_Q4     193
## E2F_Q6                             E2F_Q6               E2F_Q6     190
##                      enrichmentScore      NES       pvalue     p.adjust
## HSD17B8_TARGET_GENES       0.6368099 3.065382 1.000000e-10 1.503229e-07
## E2F1_Q3                    0.4983913 2.222218 1.000000e-10 1.503229e-07
## E2F1_Q6                    0.4906657 2.191042 1.362443e-10 1.503229e-07
## SGCGSSAAA_E2F1DP2_01       0.5408182 2.298186 4.030580e-10 3.335305e-07
## E2F_Q4                     0.4685394 2.100798 1.297983e-09 8.592650e-07
## E2F_Q6                     0.4672345 2.086952 4.380543e-09 1.318145e-06
##                           qvalues rank                   leading_edge
## HSD17B8_TARGET_GENES 1.161662e-07 1042  tags=36%, list=8%, signal=34%
## E2F1_Q3              1.161662e-07 2045 tags=36%, list=16%, signal=30%
## E2F1_Q6              1.161662e-07 1819 tags=34%, list=15%, signal=30%
## SGCGSSAAA_E2F1DP2_01 2.577450e-07 1797 tags=38%, list=14%, signal=33%
## E2F_Q4               6.640210e-07 1663 tags=29%, list=13%, signal=26%
## E2F_Q6               1.018634e-06 1663 tags=29%, list=13%, signal=25%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             core_enrichment
## HSD17B8_TARGET_GENES 55143/55388/991/2305/9493/1062/4605/9833/9133/10403/7153/23397/79733/259266/6241/55165/9787/11065/55355/9582/220134/55872/83461/10460/4751/10635/55839/890/9415/983/54821/4085/9837/5080/332/7272/64151/2842/9212/51659/9319/9055/3833/146909/10112/51514/4174/9232/1033/9928/3161/11004/993/4603/79801/990/5347/55215/701/55723/51512/55635/9156/11130/10024/57405/10615/1894/2491/8438/9700/5888/7083/898/3149/56992/9768/4288/10733/1163/4175/5307/2237/29899/55010/59336/56942/5984/29028/3838/1058/54478/3015/699/6491/81611/1063/27346/64785/9401/26271/51026/641/1869/10535/1029/28998/1763/8970/54892/55159/864/93594/1789/4176/7111/3148/116832/9585/9735/55732/81624/23310/1871/1031/79915/995/10051/30012/1104/80178/78995/7468/284403/5230/8549/7884/5558/4172/5424/79866/3182/56902/83990
## E2F1_Q3                                                                                                                                                                                                                                                                                                                                                                                                                                                 79733/6241/983/5080/2146/8715/4171/993/990/56992/9768/4998/10733/7037/4175/54962/54954/29028/3015/2643/85236/5111/64785/26271/51053/1869/3925/114/5427/4176/7112/4436/1871/79915/9088/6632/284403/7290/2597/7398/4172/5424/124222/1633/10849/9824/4678/6646/29902/63967/56886/6059/23234/1786/204/51087/23636/2048/83463/10393/79173/688/3007/79677/9462/56946/9221
## E2F1_Q6                                                                                                                                                                                                                                                                                                                                                                                                                                                                          79733/6241/983/5080/81620/2146/4171/993/990/3159/9768/4998/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/29902/50/63967/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173/688
## SGCGSSAAA_E2F1DP2_01                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     79733/6241/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/5427/4176/5902/4436/1871/79915/9088/6632/284403/7290/4172/5424/124222/1633/7374/2189/4678/2289/50/63967/56886/23234/6839/1786/204/51087/23636/1174/83463/10492/79173
## E2F_Q4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7473/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636
## E2F_Q6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636