10  Universal enrichment analysis

The clusterProfiler package (Yu et al. 2012) supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.

The clusterProfiler package provides enricher() function for hypergeometric test and GSEA() function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.

10.1 Input data

For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2 package).

For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset. If you want to prepare your own geneList, please refer to the FAQ.

We can load the sample data into R via:

data(geneList, package="DOSE")
head(geneList)
    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 

Suppose we define fold change greater than 2 as DEGs:

gene <- names(geneList)[abs(geneList) > 2]
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  

10.2 Cell Marker

# not execcute, as the url is no longer valid

## cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
url <- "http://yikedaxue.slwshop.cn/Cell_marker_Human.xlsx"
f <- tempfile(fileext = ".xlsx")
download.file(url, f)

cell_marker_data <- readxl::read_excel(f, 1)


## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
    dplyr::select('Cell name', GeneID) %>%
    dplyr::mutate(GeneID = strsplit(GeneID, ', ')) %>%
    tidyr::unnest(cols = c(GeneID))

10.2.1 Cell Marker over-presentaton analysis

x <- enricher(gene, TERM2GENE = cells)
head(x)

10.2.2 Cell Marker gene set enrichment analysis

y <- GSEA(geneList, TERM2GENE = cells)
head(y)

10.3 MSigDb analysis

Molecular Signatures Database is a collection of annotated gene sets. It contains 8 major collections:

  • H: hallmark gene sets
  • C1: positional gene sets
  • C2: curated gene sets
  • C3: motif gene sets
  • C4: computational gene sets
  • C5: GO gene sets
  • C6: oncogenic signatures
  • C7: immunologic signatures

Users can download GMT files from Broad Institute and use the read.gmt() function to parse the file to be used in enricher() and GSEA().

There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler (Yu et al. 2012).

It supports several specices:

# A tibble: 20 × 2
   species_name                    species_common_name                          
   <chr>                           <chr>                                        
 1 Anolis carolinensis             Carolina anole, green anole                  
 2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
 3 Caenorhabditis elegans          <NA>                                         
 4 Canis lupus familiaris          dog, dogs                                    
 5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
 6 Drosophila melanogaster         fruit fly                                    
 7 Equus caballus                  domestic horse, equine, horse                
 8 Felis catus                     cat, cats, domestic cat                      
 9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
10 Homo sapiens                    human                                        
11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica           gray short-tailed opossum                    
13 Mus musculus                    house mouse, mouse                           
14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes                 chimpanzee                                   
16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
18 Schizosaccharomyces pombe 972h- <NA>                                         
19 Sus scrofa                      pig, pigs, swine, wild boar                  
20 Xenopus tropicalis              tropical clawed frog, western clawed frog    

We can retrieve all human gene sets:

m_df <- msigdbr(species = "Homo sapiens")
head(m_df, 2) |> as.data.frame()
  gene_symbol ncbi_gene    ensembl_gene db_gene_symbol db_ncbi_gene
1       ABCC4     10257 ENSG00000125257          ABCC4        10257
2    ABRAXAS2     23172 ENSG00000165660       ABRAXAS2        23172
  db_ensembl_gene source_gene  gs_id        gs_name gs_collection
1 ENSG00000125257       ABCC4 M12609 AAACCAC_MIR140            C3
2 ENSG00000165660    KIAA0157 M12609 AAACCAC_MIR140            C3
  gs_subcollection gs_collection_name
1   MIR:MIR_LEGACY         MIR_Legacy
2   MIR:MIR_LEGACY         MIR_Legacy
                                                                                                                                                                                          gs_description
1 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
2 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
  gs_source_species gs_pmid gs_geoid gs_exact_source gs_url db_version
1                HS                                          2025.1.Hs
2                HS                                          2025.1.Hs
  db_target_species
1                HS
2                HS

Or specific collection. Here we use C6, oncogenic gene sets as an example:

m_t2g <- m_df |>
  dplyr::filter(gs_collection == "C6") |>
  dplyr::select(gs_name, ncbi_gene)
head(m_t2g)
# A tibble: 6 × 2
  gs_name      ncbi_gene
  <chr>        <chr>    
1 AKT_UP.V1_DN 57007    
2 AKT_UP.V1_DN 22859    
3 AKT_UP.V1_DN 137872   
4 AKT_UP.V1_DN 249      
5 AKT_UP.V1_DN 271      
6 AKT_UP.V1_DN 51129    

10.3.1 MSigDb over-presentaton analysis

em <- enricher(gene, TERM2GENE=m_t2g)
head(em)
                                           ID            Description GeneRatio
RPS14_DN.V1_DN                 RPS14_DN.V1_DN         RPS14_DN.V1_DN    22/183
GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP    16/183
PRC2_EZH2_UP.V1_DN         PRC2_EZH2_UP.V1_DN     PRC2_EZH2_UP.V1_DN    15/183
VEGF_A_UP.V1_DN               VEGF_A_UP.V1_DN        VEGF_A_UP.V1_DN    15/183
RB_P107_DN.V1_UP             RB_P107_DN.V1_UP       RB_P107_DN.V1_UP    10/183
E2F1_UP.V1_UP                   E2F1_UP.V1_UP          E2F1_UP.V1_UP    12/183
                         BgRatio RichFactor FoldEnrichment    zScore
RPS14_DN.V1_DN         186/10927 0.11827957       7.062518 10.883293
GCNP_SHH_UP_LATE.V1_UP 181/10927 0.08839779       5.278266  7.574549
PRC2_EZH2_UP.V1_DN     192/10927 0.07812500       4.664874  6.686237
VEGF_A_UP.V1_DN        193/10927 0.07772021       4.640703  6.659725
RB_P107_DN.V1_UP       130/10927 0.07692308       4.593106  5.378527
E2F1_UP.V1_UP          188/10927 0.06382979       3.811301  5.074316
                             pvalue     p.adjust       qvalue
RPS14_DN.V1_DN         4.615656e-13 7.800459e-11 6.364747e-11
GCNP_SHH_UP_LATE.V1_UP 5.729300e-08 4.841259e-06 3.950202e-06
PRC2_EZH2_UP.V1_DN     7.531662e-07 3.400322e-05 2.774476e-05
VEGF_A_UP.V1_DN        8.048099e-07 3.400322e-05 2.774476e-05
RB_P107_DN.V1_UP       6.370479e-05 2.005313e-03 1.636225e-03
E2F1_UP.V1_UP          7.445084e-05 2.005313e-03 1.636225e-03
                                                                                                                                       geneID
RPS14_DN.V1_DN         10874/55388/991/9493/1062/4605/9133/23397/79733/9787/55872/83461/54821/51659/9319/9055/10112/4174/5105/2532/7021/79901
GCNP_SHH_UP_LATE.V1_UP                                      55388/7153/79733/6241/9787/51203/983/9212/1111/9319/9055/3833/6790/4174/3169/1580
PRC2_EZH2_UP.V1_DN                                          8318/55388/4605/23397/9787/55355/10460/6362/81620/2146/7272/9212/11182/3887/24137
VEGF_A_UP.V1_DN                                                   8318/9493/1062/9133/10403/6241/9787/4085/332/3832/7272/891/23362/2167/10234
RB_P107_DN.V1_UP                                                                         8318/23397/79733/6241/4085/8208/9055/24137/4174/1307
E2F1_UP.V1_UP                                                                  55388/7153/23397/79733/9787/2146/2842/9212/8208/1111/9055/3833
                       Count
RPS14_DN.V1_DN            22
GCNP_SHH_UP_LATE.V1_UP    16
PRC2_EZH2_UP.V1_DN        15
VEGF_A_UP.V1_DN           15
RB_P107_DN.V1_UP          10
E2F1_UP.V1_UP             12

10.3.2 MSigDb gene set enrichment analysis

In over-presentaton analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motif using GSEA approach.

C3_t2g <- m_df |>
  dplyr::filter(gs_collection == "C3") |>
  dplyr::select(gs_name, ncbi_gene)
head(C3_t2g)
# A tibble: 6 × 2
  gs_name        ncbi_gene
  <chr>          <chr>    
1 AAACCAC_MIR140 10257    
2 AAACCAC_MIR140 23172    
3 AAACCAC_MIR140 81       
4 AAACCAC_MIR140 90       
5 AAACCAC_MIR140 8754     
6 AAACCAC_MIR140 11096    
em2 <- GSEA(geneList, TERM2GENE = C3_t2g)
head(em2)
                                       ID          Description setSize
HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES     400
E2F1_Q3                           E2F1_Q3              E2F1_Q3     187
E2F1_Q6                           E2F1_Q6              E2F1_Q6     185
SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01     129
E2F_Q6                             E2F_Q6               E2F_Q6     190
E2F1DP1_01                     E2F1DP1_01           E2F1DP1_01     184
                     enrichmentScore      NES       pvalue     p.adjust
HSD17B8_TARGET_GENES       0.6375083 3.089450 1.000000e-10 1.650500e-07
E2F1_Q3                    0.4983913 2.232817 1.000000e-10 1.650500e-07
E2F1_Q6                    0.4906657 2.199247 3.310158e-10 3.642277e-07
SGCGSSAAA_E2F1DP2_01       0.5408182 2.267070 5.733592e-10 4.731647e-07
E2F_Q6                     0.4672345 2.100641 1.756396e-09 9.155890e-07
E2F1DP1_01                 0.4675989 2.096925 2.218937e-09 9.155890e-07
                           qvalue rank                   leading_edge
HSD17B8_TARGET_GENES 1.307895e-07 1042  tags=36%, list=8%, signal=34%
E2F1_Q3              1.307895e-07 2045 tags=36%, list=16%, signal=30%
E2F1_Q6              2.886225e-07 1819 tags=34%, list=15%, signal=30%
SGCGSSAAA_E2F1DP2_01 3.749467e-07 1797 tags=38%, list=14%, signal=33%
E2F_Q6               7.255341e-07 1663 tags=29%, list=13%, signal=25%
E2F1DP1_01           7.255341e-07 1797 tags=33%, list=14%, signal=28%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            core_enrichment
HSD17B8_TARGET_GENES 55143/55388/991/2305/9493/1062/4605/9833/9133/10403/7153/23397/79733/259266/6241/55165/9787/11065/55355/9582/220134/55872/83461/10460/4751/10635/55839/890/9415/983/54821/4085/9837/5080/332/7272/64151/2842/9212/51659/9319/9055/3833/146909/10112/51514/4174/9232/1033/9928/3161/11004/993/4603/79801/990/5347/55215/701/55723/51512/55635/9156/11130/10024/57405/10615/1894/2491/8438/9700/5888/7083/898/3149/56992/9768/4288/10733/1163/4175/5307/2237/29899/55010/59336/56942/5984/29028/3838/1058/54478/3015/699/6491/81611/1063/27346/64785/9401/26271/51026/641/1869/10535/1029/28998/1763/8970/54892/55159/864/93594/1789/4176/7111/3148/116832/9585/9735/55732/81624/23310/1871/1031/79915/995/10051/30012/1104/80178/78995/7468/284403/5230/8549/7884/5558/4172/5424/79866/3182/56902/83990
E2F1_Q3                                                                                                                                                                                                                                                                                                                                                                                                                                                 79733/6241/983/5080/2146/8715/4171/993/990/56992/9768/4998/10733/7037/4175/54962/54954/29028/3015/2643/85236/5111/64785/26271/51053/1869/3925/114/5427/4176/7112/4436/1871/79915/9088/6632/284403/7290/2597/7398/4172/5424/124222/1633/10849/9824/4678/6646/29902/63967/56886/6059/23234/1786/204/51087/23636/2048/83463/10393/79173/688/3007/79677/9462/56946/9221
E2F1_Q6                                                                                                                                                                                                                                                                                                                                                                                                                                                                          79733/6241/983/5080/81620/2146/4171/993/990/3159/9768/4998/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/29902/50/63967/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173/688
SGCGSSAAA_E2F1DP2_01                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     79733/6241/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/5427/4176/5902/4436/1871/79915/9088/6632/284403/7290/4172/5424/124222/1633/7374/2189/4678/2289/50/63967/56886/23234/6839/1786/204/51087/23636/1174/83463/10492/79173
E2F_Q6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636
E2F1DP1_01                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       79733/6241/983/5080/2146/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/50/63967/5457/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173

References

Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.