4312 8318 10874 55143 55388 991
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
10 Universal enrichment analysis
The clusterProfiler package (Yu et al. 2012) supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.
The clusterProfiler package provides enricher() function for hypergeometric test and GSEA() function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.
10.1 Input data
For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2 package).
For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset. If you want to prepare your own geneList, please refer to the FAQ.
We can load the sample data into R via:
Suppose we define fold change greater than 2 as DEGs:
10.2 Cell Marker
# not execcute, as the url is no longer valid
## cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
url <- "http://yikedaxue.slwshop.cn/Cell_marker_Human.xlsx"
f <- tempfile(fileext = ".xlsx")
download.file(url, f)
cell_marker_data <- readxl::read_excel(f, 1)
## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
dplyr::select('Cell name', GeneID) %>%
dplyr::mutate(GeneID = strsplit(GeneID, ', ')) %>%
tidyr::unnest(cols = c(GeneID))10.2.1 Cell Marker over-presentaton analysis
x <- enricher(gene, TERM2GENE = cells)
head(x)10.2.2 Cell Marker gene set enrichment analysis
y <- GSEA(geneList, TERM2GENE = cells)
head(y)10.3 MSigDb analysis
Molecular Signatures Database is a collection of annotated gene sets. It contains 8 major collections:
- H: hallmark gene sets
- C1: positional gene sets
- C2: curated gene sets
- C3: motif gene sets
- C4: computational gene sets
- C5: GO gene sets
- C6: oncogenic signatures
- C7: immunologic signatures
Users can download GMT files from Broad Institute and use the read.gmt() function to parse the file to be used in enricher() and GSEA().
There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler (Yu et al. 2012).
It supports several specices:
# A tibble: 20 × 2
species_name species_common_name
<chr> <chr>
1 Anolis carolinensis Carolina anole, green anole
2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
3 Caenorhabditis elegans <NA>
4 Canis lupus familiaris dog, dogs
5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
6 Drosophila melanogaster fruit fly
7 Equus caballus domestic horse, equine, horse
8 Felis catus cat, cats, domestic cat
9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
10 Homo sapiens human
11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica gray short-tailed opossum
13 Mus musculus house mouse, mouse
14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes chimpanzee
16 Rattus norvegicus brown rat, Norway rat, rat, rats
17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
18 Schizosaccharomyces pombe 972h- <NA>
19 Sus scrofa pig, pigs, swine, wild boar
20 Xenopus tropicalis tropical clawed frog, western clawed frog
We can retrieve all human gene sets:
m_df <- msigdbr(species = "Homo sapiens")
head(m_df, 2) |> as.data.frame() gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene
1 ABCC4 10257 ENSG00000125257 ABCC4 10257
2 ABRAXAS2 23172 ENSG00000165660 ABRAXAS2 23172
db_ensembl_gene source_gene gs_id gs_name gs_collection
1 ENSG00000125257 ABCC4 M12609 AAACCAC_MIR140 C3
2 ENSG00000165660 KIAA0157 M12609 AAACCAC_MIR140 C3
gs_subcollection gs_collection_name
1 MIR:MIR_LEGACY MIR_Legacy
2 MIR:MIR_LEGACY MIR_Legacy
gs_description
1 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
2 Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-140 (v7.1 miRBase).
gs_source_species gs_pmid gs_geoid gs_exact_source gs_url db_version
1 HS 2025.1.Hs
2 HS 2025.1.Hs
db_target_species
1 HS
2 HS
Or specific collection. Here we use C6, oncogenic gene sets as an example:
# A tibble: 6 × 2
gs_name ncbi_gene
<chr> <chr>
1 AKT_UP.V1_DN 57007
2 AKT_UP.V1_DN 22859
3 AKT_UP.V1_DN 137872
4 AKT_UP.V1_DN 249
5 AKT_UP.V1_DN 271
6 AKT_UP.V1_DN 51129
10.3.1 MSigDb over-presentaton analysis
em <- enricher(gene, TERM2GENE=m_t2g)
head(em) ID Description GeneRatio
RPS14_DN.V1_DN RPS14_DN.V1_DN RPS14_DN.V1_DN 22/183
GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP 16/183
PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_DN 15/183
VEGF_A_UP.V1_DN VEGF_A_UP.V1_DN VEGF_A_UP.V1_DN 15/183
RB_P107_DN.V1_UP RB_P107_DN.V1_UP RB_P107_DN.V1_UP 10/183
E2F1_UP.V1_UP E2F1_UP.V1_UP E2F1_UP.V1_UP 12/183
BgRatio RichFactor FoldEnrichment zScore
RPS14_DN.V1_DN 186/10927 0.11827957 7.062518 10.883293
GCNP_SHH_UP_LATE.V1_UP 181/10927 0.08839779 5.278266 7.574549
PRC2_EZH2_UP.V1_DN 192/10927 0.07812500 4.664874 6.686237
VEGF_A_UP.V1_DN 193/10927 0.07772021 4.640703 6.659725
RB_P107_DN.V1_UP 130/10927 0.07692308 4.593106 5.378527
E2F1_UP.V1_UP 188/10927 0.06382979 3.811301 5.074316
pvalue p.adjust qvalue
RPS14_DN.V1_DN 4.615656e-13 7.800459e-11 6.364747e-11
GCNP_SHH_UP_LATE.V1_UP 5.729300e-08 4.841259e-06 3.950202e-06
PRC2_EZH2_UP.V1_DN 7.531662e-07 3.400322e-05 2.774476e-05
VEGF_A_UP.V1_DN 8.048099e-07 3.400322e-05 2.774476e-05
RB_P107_DN.V1_UP 6.370479e-05 2.005313e-03 1.636225e-03
E2F1_UP.V1_UP 7.445084e-05 2.005313e-03 1.636225e-03
geneID
RPS14_DN.V1_DN 10874/55388/991/9493/1062/4605/9133/23397/79733/9787/55872/83461/54821/51659/9319/9055/10112/4174/5105/2532/7021/79901
GCNP_SHH_UP_LATE.V1_UP 55388/7153/79733/6241/9787/51203/983/9212/1111/9319/9055/3833/6790/4174/3169/1580
PRC2_EZH2_UP.V1_DN 8318/55388/4605/23397/9787/55355/10460/6362/81620/2146/7272/9212/11182/3887/24137
VEGF_A_UP.V1_DN 8318/9493/1062/9133/10403/6241/9787/4085/332/3832/7272/891/23362/2167/10234
RB_P107_DN.V1_UP 8318/23397/79733/6241/4085/8208/9055/24137/4174/1307
E2F1_UP.V1_UP 55388/7153/23397/79733/9787/2146/2842/9212/8208/1111/9055/3833
Count
RPS14_DN.V1_DN 22
GCNP_SHH_UP_LATE.V1_UP 16
PRC2_EZH2_UP.V1_DN 15
VEGF_A_UP.V1_DN 15
RB_P107_DN.V1_UP 10
E2F1_UP.V1_UP 12
10.3.2 MSigDb gene set enrichment analysis
In over-presentaton analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motif using GSEA approach.
# A tibble: 6 × 2
gs_name ncbi_gene
<chr> <chr>
1 AAACCAC_MIR140 10257
2 AAACCAC_MIR140 23172
3 AAACCAC_MIR140 81
4 AAACCAC_MIR140 90
5 AAACCAC_MIR140 8754
6 AAACCAC_MIR140 11096
em2 <- GSEA(geneList, TERM2GENE = C3_t2g)
head(em2) ID Description setSize
HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES HSD17B8_TARGET_GENES 400
E2F1_Q3 E2F1_Q3 E2F1_Q3 187
E2F1_Q6 E2F1_Q6 E2F1_Q6 185
SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 SGCGSSAAA_E2F1DP2_01 129
E2F_Q6 E2F_Q6 E2F_Q6 190
E2F1DP1_01 E2F1DP1_01 E2F1DP1_01 184
enrichmentScore NES pvalue p.adjust
HSD17B8_TARGET_GENES 0.6375083 3.089450 1.000000e-10 1.650500e-07
E2F1_Q3 0.4983913 2.232817 1.000000e-10 1.650500e-07
E2F1_Q6 0.4906657 2.199247 3.310158e-10 3.642277e-07
SGCGSSAAA_E2F1DP2_01 0.5408182 2.267070 5.733592e-10 4.731647e-07
E2F_Q6 0.4672345 2.100641 1.756396e-09 9.155890e-07
E2F1DP1_01 0.4675989 2.096925 2.218937e-09 9.155890e-07
qvalue rank leading_edge
HSD17B8_TARGET_GENES 1.307895e-07 1042 tags=36%, list=8%, signal=34%
E2F1_Q3 1.307895e-07 2045 tags=36%, list=16%, signal=30%
E2F1_Q6 2.886225e-07 1819 tags=34%, list=15%, signal=30%
SGCGSSAAA_E2F1DP2_01 3.749467e-07 1797 tags=38%, list=14%, signal=33%
E2F_Q6 7.255341e-07 1663 tags=29%, list=13%, signal=25%
E2F1DP1_01 7.255341e-07 1797 tags=33%, list=14%, signal=28%
core_enrichment
HSD17B8_TARGET_GENES 55143/55388/991/2305/9493/1062/4605/9833/9133/10403/7153/23397/79733/259266/6241/55165/9787/11065/55355/9582/220134/55872/83461/10460/4751/10635/55839/890/9415/983/54821/4085/9837/5080/332/7272/64151/2842/9212/51659/9319/9055/3833/146909/10112/51514/4174/9232/1033/9928/3161/11004/993/4603/79801/990/5347/55215/701/55723/51512/55635/9156/11130/10024/57405/10615/1894/2491/8438/9700/5888/7083/898/3149/56992/9768/4288/10733/1163/4175/5307/2237/29899/55010/59336/56942/5984/29028/3838/1058/54478/3015/699/6491/81611/1063/27346/64785/9401/26271/51026/641/1869/10535/1029/28998/1763/8970/54892/55159/864/93594/1789/4176/7111/3148/116832/9585/9735/55732/81624/23310/1871/1031/79915/995/10051/30012/1104/80178/78995/7468/284403/5230/8549/7884/5558/4172/5424/79866/3182/56902/83990
E2F1_Q3 79733/6241/983/5080/2146/8715/4171/993/990/56992/9768/4998/10733/7037/4175/54962/54954/29028/3015/2643/85236/5111/64785/26271/51053/1869/3925/114/5427/4176/7112/4436/1871/79915/9088/6632/284403/7290/2597/7398/4172/5424/124222/1633/10849/9824/4678/6646/29902/63967/56886/6059/23234/1786/204/51087/23636/2048/83463/10393/79173/688/3007/79677/9462/56946/9221
E2F1_Q6 79733/6241/983/5080/81620/2146/4171/993/990/3159/9768/4998/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/29902/50/63967/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173/688
SGCGSSAAA_E2F1DP2_01 79733/6241/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/5427/4176/5902/4436/1871/79915/9088/6632/284403/7290/4172/5424/124222/1633/7374/2189/4678/2289/50/63967/56886/23234/6839/1786/204/51087/23636/1174/83463/10492/79173
E2F_Q6 8318/79733/6241/983/5080/81620/2146/4171/993/990/5888/9768/4175/4173/29028/3015/85236/5111/64785/26271/51053/1869/3925/5427/23649/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/7353/4678/6646/50/63967/56886/23234/1786/204/6427/51087/23636
E2F1DP1_01 79733/6241/983/5080/2146/4171/993/990/3159/9768/4175/4173/29028/3015/4609/85236/5111/64785/26271/51053/1869/3925/5427/4176/7112/5902/4436/1871/79915/9088/6632/284403/7290/2597/4172/5424/124222/1633/10849/7374/9824/2189/4678/2289/6646/50/63967/5457/56886/23234/6839/1786/204/51087/23636/11335/1174/83463/10492/79173