10  Universal enrichment analysis

The clusterProfiler package (Yu et al. 2012) supports both hypergeometric test and gene set enrichment analyses of many ontologies/pathways, but it’s still not enough as users may want to analyze their data with unsupported organisms, slim versions of GO, novel functional annotations (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways, or customized annotations.

The clusterProfiler package provides enricher() function for hypergeometric test and GSEA() function for gene set enrichment analysis that are designed to accept user-defined annotations. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with the first column of term ID and the second column of corresponding mapped genes, and TERM2NAME is a data.frame with the first column of term ID and the second column of corresponding term names. TERM2NAME is optional.

10.1 Input data

For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2 package).

For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset. If you want to prepare your own geneList, please refer to the FAQ.

We can load the sample data into R via:

data(geneList, package="DOSE")
head(geneList)
    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 

Suppose we define fold change greater than 2 as DEGs:

gene <- names(geneList)[abs(geneList) > 2]
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  

10.2 Cell Marker

# not execcute, as the url is no longer valid

## cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
url <- "http://yikedaxue.slwshop.cn/Cell_marker_Human.xlsx"
f <- tempfile(fileext = ".xlsx")
download.file(url, f)

cell_marker_data <- readxl::read_excel(f, 1)


## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
    dplyr::select('Cell name', GeneID) %>%
    dplyr::mutate(GeneID = strsplit(GeneID, ', ')) %>%
    tidyr::unnest(cols = c(GeneID))

10.2.1 Cell Marker over-representation analysis

x <- enricher(gene, TERM2GENE = cells)
head(x)

10.2.2 Cell Marker gene set enrichment analysis

y <- GSEA(geneList, TERM2GENE = cells)
head(y)

10.3 MSigDb analysis

Molecular Signatures Database is a collection of annotated gene sets. It contains 8 major collections:

  • H: hallmark gene sets
  • C1: positional gene sets
  • C2: curated gene sets
  • C3: motif gene sets
  • C4: computational gene sets
  • C5: GO gene sets
  • C6: oncogenic signatures
  • C7: immunologic signatures

Users can download GMT files from Broad Institute and use the read.gmt() function to parse the file to be used in enricher() and GSEA().

There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler (Yu et al. 2012).

It supports several species:

library(msigdbr)
msigdbr_species()
# A tibble: 20 × 2
   species_name                    species_common_name                          
   <chr>                           <chr>                                        
 1 Anolis carolinensis             Carolina anole, green anole                  
 2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
 3 Caenorhabditis elegans          <NA>                                         
 4 Canis lupus familiaris          dog, dogs                                    
 5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
 6 Drosophila melanogaster         fruit fly                                    
 7 Equus caballus                  domestic horse, equine, horse                
 8 Felis catus                     cat, cats, domestic cat                      
 9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
10 Homo sapiens                    human                                        
11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica           gray short-tailed opossum                    
13 Mus musculus                    house mouse, mouse                           
14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes                 chimpanzee                                   
16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
18 Schizosaccharomyces pombe 972h- <NA>                                         
19 Sus scrofa                      pig, pigs, swine, wild boar                  
20 Xenopus tropicalis              tropical clawed frog, western clawed frog    

We can retrieve human gene sets with msigdbr(). Since the complete human MSigDB table is large, it is usually better to request only the collection you need in analysis workflows.

msigdbr(species = "Homo sapiens", collection = "C6") |>
  head(2) |>
  as.data.frame()
  gene_symbol ncbi_gene    ensembl_gene db_gene_symbol db_ncbi_gene
1       ACKR3     57007 ENSG00000144476          ACKR3        57007
2      ADGRL1     22859 ENSG00000072071         ADGRL1        22859
  db_ensembl_gene source_gene gs_id      gs_name gs_collection gs_subcollection
1 ENSG00000144476       CXCR7 M2666 AKT_UP.V1_DN            C6                 
2 ENSG00000072071       LPHN1 M2666 AKT_UP.V1_DN            C6                 
   gs_collection_name
1 Oncogenic Signature
2 Oncogenic Signature
                                                                                                  gs_description
1 Genes down-regulated in mouse prostate by transgenic expression of human AKT1 gene [Gene ID=207]  vs controls.
2 Genes down-regulated in mouse prostate by transgenic expression of human AKT1 gene [Gene ID=207]  vs controls.
  gs_source_species  gs_pmid gs_geoid
1                MM 15156201  GSE1413
2                MM 15156201  GSE1413
                                               gs_exact_source gs_url
1 AKT_PLACEBO vs WT_PLACEBO; bottom 150 genes (diff. of means)       
2 AKT_PLACEBO vs WT_PLACEBO; bottom 150 genes (diff. of means)       
  db_version db_target_species
1  2026.1.Hs                HS
2  2026.1.Hs                HS

Or specific collection. Here we use C6, oncogenic gene sets as an example:

m_t2g <- msigdbr(species = "Homo sapiens", collection = "C6") |>
  dplyr::select(gs_name, ncbi_gene)
head(m_t2g)
# A tibble: 6 × 2
  gs_name      ncbi_gene
  <chr>        <chr>    
1 AKT_UP.V1_DN 57007    
2 AKT_UP.V1_DN 22859    
3 AKT_UP.V1_DN 137872   
4 AKT_UP.V1_DN 249      
5 AKT_UP.V1_DN 271      
6 AKT_UP.V1_DN 51129    

10.3.1 MSigDb over-representation analysis

em <- enricher(gene, TERM2GENE=m_t2g)
head(em)
                                           ID            Description GeneRatio
RPS14_DN.V1_DN                 RPS14_DN.V1_DN         RPS14_DN.V1_DN    22/183
GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP    16/183
PRC2_EZH2_UP.V1_DN         PRC2_EZH2_UP.V1_DN     PRC2_EZH2_UP.V1_DN    15/183
VEGF_A_UP.V1_DN               VEGF_A_UP.V1_DN        VEGF_A_UP.V1_DN    15/183
RB_P107_DN.V1_UP             RB_P107_DN.V1_UP       RB_P107_DN.V1_UP    10/183
E2F1_UP.V1_UP                   E2F1_UP.V1_UP          E2F1_UP.V1_UP    12/183
                         BgRatio RichFactor FoldEnrichment    zScore
RPS14_DN.V1_DN         186/10927 0.11827957       7.062518 10.883293
GCNP_SHH_UP_LATE.V1_UP 181/10927 0.08839779       5.278266  7.574549
PRC2_EZH2_UP.V1_DN     192/10927 0.07812500       4.664874  6.686237
VEGF_A_UP.V1_DN        193/10927 0.07772021       4.640703  6.659725
RB_P107_DN.V1_UP       130/10927 0.07692308       4.593106  5.378527
E2F1_UP.V1_UP          188/10927 0.06382979       3.811301  5.074316
                             pvalue     p.adjust       qvalue
RPS14_DN.V1_DN         4.615656e-13 8.723590e-11 8.723590e-11
GCNP_SHH_UP_LATE.V1_UP 5.729300e-08 5.414189e-06 5.414189e-06
PRC2_EZH2_UP.V1_DN     7.531662e-07 3.802727e-05 3.802727e-05
VEGF_A_UP.V1_DN        8.048099e-07 3.802727e-05 3.802727e-05
RB_P107_DN.V1_UP       6.370479e-05 2.242628e-03 2.242628e-03
E2F1_UP.V1_UP          7.445084e-05 2.242628e-03 2.242628e-03
                                                                                                                                       geneID
RPS14_DN.V1_DN         79733/10874/4605/9493/54821/51659/10112/4174/5105/23397/83461/1062/2532/79901/7021/9133/9055/55872/991/9319/55388/9787
GCNP_SHH_UP_LATE.V1_UP                                      6241/79733/6790/51203/1111/3833/4174/983/9055/1580/3169/9212/9319/7153/55388/9787
PRC2_EZH2_UP.V1_DN                                          55355/8318/4605/10460/7272/23397/3887/81620/9212/2146/24137/6362/11182/55388/9787
VEGF_A_UP.V1_DN                                                   6241/8318/23362/9493/7272/2167/1062/10234/10403/3832/891/9133/332/4085/9787
RB_P107_DN.V1_UP                                                                         6241/79733/8318/4174/8208/23397/9055/1307/24137/4085
E2F1_UP.V1_UP                                                                  79733/1111/3833/8208/23397/9055/2842/9212/2146/7153/55388/9787
                       Count
RPS14_DN.V1_DN            22
GCNP_SHH_UP_LATE.V1_UP    16
PRC2_EZH2_UP.V1_DN        15
VEGF_A_UP.V1_DN           15
RB_P107_DN.V1_UP          10
E2F1_UP.V1_UP             12

10.3.2 MSigDb gene set enrichment analysis

In over-representation analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motifs using the GSEA approach.

C3_t2g <- msigdbr(species = "Homo sapiens", collection = "C3") |>
  dplyr::select(gs_name, ncbi_gene)
head(C3_t2g)
# A tibble: 6 × 2
  gs_name        ncbi_gene
  <chr>          <chr>    
1 AAACCAC_MIR140 10257    
2 AAACCAC_MIR140 23172    
3 AAACCAC_MIR140 81       
4 AAACCAC_MIR140 90       
5 AAACCAC_MIR140 8754     
6 AAACCAC_MIR140 11096    
em2 <- GSEA(geneList, TERM2GENE = C3_t2g)
head(em2)
                                       ID          Description setSize
SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01     129
E2F1_Q3                           E2F1_Q3              E2F1_Q3     187
E2F1_Q6                           E2F1_Q6              E2F1_Q6     185
E2F_Q4                             E2F_Q4               E2F_Q4     193
E2F1DP1_01                     E2F1DP1_01           E2F1DP1_01     184
E2F1DP2_01                     E2F1DP2_01           E2F1DP2_01     184
                     enrichmentScore      NES       pvalue     p.adjust
SGCGSSAAA_E2F1DP2_01       0.5408183 2.309688 2.044459e-09 1.790789e-06
E2F1_Q3                    0.4983913 2.236086 3.044964e-11 9.743884e-08
E2F1_Q6                    0.4906657 2.181350 9.728438e-11 1.556550e-07
E2F_Q4                     0.4685394 2.110766 3.330912e-09 1.790789e-06
E2F1DP1_01                 0.4675989 2.078194 5.596214e-09 1.790789e-06
E2F1DP2_01                 0.4675989 2.078194 5.596214e-09 1.790789e-06
                           qvalue rank                   leading_edge
SGCGSSAAA_E2F1DP2_01 1.303167e-06 1797 tags=38%, list=14%, signal=33%
E2F1_Q3              7.090678e-08 2045 tags=36%, list=16%, signal=30%
E2F1_Q6              1.132710e-07 1819 tags=34%, list=15%, signal=30%
E2F_Q4               1.303167e-06 1663 tags=29%, list=13%, signal=26%
E2F1DP1_01           1.303167e-06 1797 tags=33%, list=14%, signal=28%
E2F1DP2_01           1.303167e-06 1797 tags=33%, list=14%, signal=28%
                                                                                                                                                                                                                                                                                                                                                                         core_enrichment
SGCGSSAAA_E2F1DP2_01                                                                                                  79733/6241/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/5427/4176/5902/4436/1871/79915/9088/6632/284403/7290/4172/5424/124222/1633/7374/2189/4678/2289/50/63967/56886/23234/6839/1786/204/51087/23636/1174/83463/10492/79173
E2F1_Q3              79733/6241/983/5080/2146/8715/4171/993/990/56992/9768/4998/10733/7037/4175/54962/54954/29028/3015/2643/85236/5111/64785/26271/51053/1869/3925/114/5427/4176/7112/4436/1871/79915/9088/6632/284403/7290/2597/7398/4172/5424/124222/1633/10849/9824/4678/6646/29902/63967/56886/6059/23234/1786/204/51087/23636/2048/83463/10393/79173/688/3007/79677/9462/56946/9221
E2F1_Q6                                       79733/6241/983/5080/81620/2146/4171/993/990/3159/9768/4998/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/29902/50/63967/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173/688
E2F_Q4                                                                              8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7473/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636
E2F1DP1_01                                                    79733/6241/983/5080/2146/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/50/63967/5457/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173
E2F1DP2_01                                                    79733/6241/983/5080/2146/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/50/63967/5457/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173

References

Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.