10 Universal enrichment analysis

The clusterProfiler package (Yu et al. 2012) supports both hypergeometric test and gene set enrichment analyses of many ontologies/pathways, but it’s still not enough as users may want to analyze their data with unsupported organisms, slim versions of GO, novel functional annotations (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways, or customized annotations.

The clusterProfiler package provides enricher() function for hypergeometric test and GSEA() function for gene set enrichment analysis that are designed to accept user-defined annotations. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with the first column of term ID and the second column of corresponding mapped genes, and TERM2NAME is a data.frame with the first column of term ID and the second column of corresponding term names. TERM2NAME is optional.

10.1 Input data

For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2 package).

For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset. If you want to prepare your own geneList, please refer to the FAQ.

We can load the sample data into R via:

data(geneList, package="DOSE")
head(geneList)

    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857

Suppose we define fold change greater than 2 as DEGs:

gene <- names(geneList)[abs(geneList) > 2]
head(gene)

[1] "4312"  "8318"  "10874" "55143" "55388" "991"

10.2 Cell Marker

# not execcute, as the url is no longer valid

## cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
url <- "http://yikedaxue.slwshop.cn/Cell_marker_Human.xlsx"
f <- tempfile(fileext = ".xlsx")
download.file(url, f)

cell_marker_data <- readxl::read_excel(f, 1)


## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
    dplyr::select('Cell name', GeneID) %>%
    dplyr::mutate(GeneID = strsplit(GeneID, ', ')) %>%
    tidyr::unnest(cols = c(GeneID))

10.2.1 Cell Marker over-representation analysis

x <- enricher(gene, TERM2GENE = cells)
head(x)

10.2.2 Cell Marker gene set enrichment analysis

y <- GSEA(geneList, TERM2GENE = cells)
head(y)

10.3 MSigDb analysis

Molecular Signatures Database is a collection of annotated gene sets. It contains 8 major collections:

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

Users can download GMT files from Broad Institute and use the read.gmt() function to parse the file to be used in enricher() and GSEA().

There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler (Yu et al. 2012).

It supports several species:

library(msigdbr)
msigdbr_species()

# A tibble: 20 × 2
   species_name                    species_common_name                          
   <chr>                           <chr>                                        
 1 Anolis carolinensis             Carolina anole, green anole                  
 2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
 3 Caenorhabditis elegans          <NA>                                         
 4 Canis lupus familiaris          dog, dogs                                    
 5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
 6 Drosophila melanogaster         fruit fly                                    
 7 Equus caballus                  domestic horse, equine, horse                
 8 Felis catus                     cat, cats, domestic cat                      
 9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
10 Homo sapiens                    human                                        
11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica           gray short-tailed opossum                    
13 Mus musculus                    house mouse, mouse                           
14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes                 chimpanzee                                   
16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
18 Schizosaccharomyces pombe 972h- <NA>                                         
19 Sus scrofa                      pig, pigs, swine, wild boar                  
20 Xenopus tropicalis              tropical clawed frog, western clawed frog

We can retrieve all human gene sets:

m_df <- msigdbr(species = "Homo sapiens")
head(m_df, 2) |> as.data.frame()

  gene_symbol ncbi_gene    ensembl_gene db_gene_symbol db_ncbi_gene
1       ABCC4     10257 ENSG00000125257          ABCC4        10257
2    ABRAXAS2     23172 ENSG00000165660       ABRAXAS2        23172
  db_ensembl_gene source_gene  gs_id        gs_name gs_collection
1 ENSG00000125257       ABCC4 M12609 AAACCAC_MIR140            C3
2 ENSG00000165660    KIAA0157 M12609 AAACCAC_MIR140            C3
  gs_subcollection gs_collection_name
1   MIR:MIR_LEGACY         MIR_Legacy
2   MIR:MIR_LEGACY         MIR_Legacy
                                                                                                                                                                                          gs_description
1 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
2 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
  gs_source_species gs_pmid gs_geoid gs_exact_source gs_url db_version
1                HS                                          2025.1.Hs
2                HS                                          2025.1.Hs
  db_target_species
1                HS
2                HS

Or specific collection. Here we use C6, oncogenic gene sets as an example:

m_t2g <- m_df |>
  dplyr::filter(gs_collection == "C6") |>
  dplyr::select(gs_name, ncbi_gene)
head(m_t2g)

# A tibble: 6 × 2
  gs_name      ncbi_gene
  <chr>        <chr>    
1 AKT_UP.V1_DN 57007    
2 AKT_UP.V1_DN 22859    
3 AKT_UP.V1_DN 137872   
4 AKT_UP.V1_DN 249      
5 AKT_UP.V1_DN 271      
6 AKT_UP.V1_DN 51129

10.3.1 MSigDb over-representation analysis

em <- enricher(gene, TERM2GENE=m_t2g)
head(em)

                                           ID            Description GeneRatio
RPS14_DN.V1_DN                 RPS14_DN.V1_DN         RPS14_DN.V1_DN    22/183
GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP    16/183
PRC2_EZH2_UP.V1_DN         PRC2_EZH2_UP.V1_DN     PRC2_EZH2_UP.V1_DN    15/183
VEGF_A_UP.V1_DN               VEGF_A_UP.V1_DN        VEGF_A_UP.V1_DN    15/183
RB_P107_DN.V1_UP             RB_P107_DN.V1_UP       RB_P107_DN.V1_UP    10/183
E2F1_UP.V1_UP                   E2F1_UP.V1_UP          E2F1_UP.V1_UP    12/183
                         BgRatio RichFactor FoldEnrichment    zScore
RPS14_DN.V1_DN         186/10927 0.11827957       7.062518 10.883293
GCNP_SHH_UP_LATE.V1_UP 181/10927 0.08839779       5.278266  7.574549
PRC2_EZH2_UP.V1_DN     192/10927 0.07812500       4.664874  6.686237
VEGF_A_UP.V1_DN        193/10927 0.07772021       4.640703  6.659725
RB_P107_DN.V1_UP       130/10927 0.07692308       4.593106  5.378527
E2F1_UP.V1_UP          188/10927 0.06382979       3.811301  5.074316
                             pvalue     p.adjust       qvalue
RPS14_DN.V1_DN         4.615656e-13 8.723590e-11 8.723590e-11
GCNP_SHH_UP_LATE.V1_UP 5.729300e-08 5.414189e-06 5.414189e-06
PRC2_EZH2_UP.V1_DN     7.531662e-07 3.802727e-05 3.802727e-05
VEGF_A_UP.V1_DN        8.048099e-07 3.802727e-05 3.802727e-05
RB_P107_DN.V1_UP       6.370479e-05 2.242628e-03 2.242628e-03
E2F1_UP.V1_UP          7.445084e-05 2.242628e-03 2.242628e-03
                                                                                                                                       geneID
RPS14_DN.V1_DN         79901/79733/5105/51659/9319/9787/55388/10112/4605/9055/9133/991/83461/54821/55872/1062/2532/23397/4174/10874/7021/9493
GCNP_SHH_UP_LATE.V1_UP                                      79733/7153/6241/1580/1111/9319/9787/55388/9055/9212/51203/3169/983/3833/4174/6790
PRC2_EZH2_UP.V1_DN                                          3887/7272/55355/9787/81620/11182/55388/4605/9212/6362/8318/23397/2146/24137/10460
VEGF_A_UP.V1_DN                                                   3832/7272/6241/891/10234/10403/9787/23362/9133/8318/332/4085/1062/2167/9493
RB_P107_DN.V1_UP                                                                         79733/6241/9055/8318/1307/4085/23397/4174/8208/24137
E2F1_UP.V1_UP                                                                  79733/7153/1111/9787/55388/2842/9055/9212/3833/23397/8208/2146
                       Count
RPS14_DN.V1_DN            22
GCNP_SHH_UP_LATE.V1_UP    16
PRC2_EZH2_UP.V1_DN        15
VEGF_A_UP.V1_DN           15
RB_P107_DN.V1_UP          10
E2F1_UP.V1_UP             12

10.3.2 MSigDb gene set enrichment analysis

In over-representation analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motifs using the GSEA approach.

C3_t2g <- m_df |>
  dplyr::filter(gs_collection == "C3") |>
  dplyr::select(gs_name, ncbi_gene)
head(C3_t2g)

# A tibble: 6 × 2
  gs_name        ncbi_gene
  <chr>          <chr>    
1 AAACCAC_MIR140 10257    
2 AAACCAC_MIR140 23172    
3 AAACCAC_MIR140 81       
4 AAACCAC_MIR140 90       
5 AAACCAC_MIR140 8754     
6 AAACCAC_MIR140 11096

em2 <- GSEA(geneList, TERM2GENE = C3_t2g)
head(em2)

                                       ID          Description setSize
SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01     129
E2F1_Q3                           E2F1_Q3              E2F1_Q3     187
E2F1_Q6                           E2F1_Q6              E2F1_Q6     185
E2F_Q4                             E2F_Q4               E2F_Q4     193
E2F1DP1_01                     E2F1DP1_01           E2F1DP1_01     184
E2F1DP2_01                     E2F1DP2_01           E2F1DP2_01     184
                     enrichmentScore      NES       pvalue     p.adjust
SGCGSSAAA_E2F1DP2_01       0.5408183 2.297678 2.447734e-10 2.610100e-07
E2F1_Q3                    0.4983913 2.211786 8.412761e-11 1.882495e-07
E2F1_Q6                    0.4906657 2.181104 1.176927e-10 1.882495e-07
E2F_Q4                     0.4685394 2.097110 1.203541e-09 8.801457e-07
E2F1DP1_01                 0.4675989 2.085233 2.718882e-09 8.801457e-07
E2F1DP2_01                 0.4675989 2.085233 2.718882e-09 8.801457e-07
                           qvalue rank                   leading_edge
SGCGSSAAA_E2F1DP2_01 1.905883e-07 1797 tags=38%, list=14%, signal=33%
E2F1_Q3              1.374589e-07 2045 tags=36%, list=16%, signal=30%
E2F1_Q6              1.374589e-07 1819 tags=34%, list=15%, signal=30%
E2F_Q4               6.426784e-07 1663 tags=29%, list=13%, signal=26%
E2F1DP1_01           6.426784e-07 1797 tags=33%, list=14%, signal=28%
E2F1DP2_01           6.426784e-07 1797 tags=33%, list=14%, signal=28%
                                                                                                                                                                                                                                                                                                                                                                         core_enrichment
SGCGSSAAA_E2F1DP2_01                                                                                                  79733/6241/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/5427/4176/5902/4436/1871/79915/9088/6632/284403/7290/4172/5424/124222/1633/7374/2189/4678/2289/50/63967/56886/23234/6839/1786/204/51087/23636/1174/83463/10492/79173
E2F1_Q3              79733/6241/983/5080/2146/8715/4171/993/990/56992/9768/4998/10733/7037/4175/54962/54954/29028/3015/2643/85236/5111/64785/26271/51053/1869/3925/114/5427/4176/7112/4436/1871/79915/9088/6632/284403/7290/2597/7398/4172/5424/124222/1633/10849/9824/4678/6646/29902/63967/56886/6059/23234/1786/204/51087/23636/2048/83463/10393/79173/688/3007/79677/9462/56946/9221
E2F1_Q6                                       79733/6241/983/5080/81620/2146/4171/993/990/3159/9768/4998/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/29902/50/63967/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173/688
E2F_Q4                                                                              8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7473/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636
E2F1DP1_01                                                    79733/6241/983/5080/2146/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/50/63967/5457/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173
E2F1DP2_01                                                    79733/6241/983/5080/2146/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/50/63967/5457/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173

References

Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.