8  Disease enrichment analysis

We developed DOSE (Yu et al. 2015) package to promote the investigation of diseases. DOSE provides five methods for measuring semantic similarities among DO terms and gene products, hypergeometric model and gene set enrichment analysis (GSEA) for associating disease with gene list and extracting disease association insight from genome wide expression profiles.

8.1 Disease over-representation analysis

DOSE supports enrichment analysis of Disease Ontology (DO) (Schriml et al. 2011), Network of Cancer Gene (A. et al. 2016) and Disease Gene Network (DisGeNET) (Janet et al. 2015). In addition, several visualization methods were provided by enrichplot to help interpreting semantic and enrichment results.

8.1.1 Over-representation analysis for disease ontology

In the following example, we selected genes with fold change above 1.5 as the differential genes and analyzed their disease association.

library(DOSE)
data(geneList)
gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  
x <- enrichDO(gene          = gene,
              ont           = "HDO",
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              universe      = names(geneList),
              minGSSize     = 5,
              maxGSSize     = 500,
              qvalueCutoff  = 0.05,
              readable      = FALSE)
head(x)
                       ID              Description GeneRatio  BgRatio
DOID:0060306 DOID:0060306    Meier-Gorlin syndrome     6/279  10/6272
DOID:10534     DOID:10534           stomach cancer    25/279 210/6272
DOID:5041       DOID:5041        esophageal cancer    18/279 132/6272
DOID:820         DOID:820              myocarditis     8/279  31/6272
DOID:1107       DOID:1107     esophageal carcinoma    16/279 117/6272
DOID:2799       DOID:2799 bronchiolitis obliterans     7/279  25/6272
             RichFactor FoldEnrichment   zScore       pvalue    p.adjust
DOID:0060306  0.6000000      13.488172 8.526891 1.327073e-06 0.001944162
DOID:10534    0.1190476       2.676225 5.330676 5.434138e-06 0.003980506
DOID:5041     0.1363636       3.065494 5.174579 1.873233e-05 0.009147624
DOID:820      0.2580645       5.801364 5.781849 4.472203e-05 0.015511798
DOID:1107     0.1367521       3.074227 4.886339 5.294129e-05 0.015511798
DOID:2799     0.2800000       6.294480 5.722765 7.697571e-05 0.017376259
                  qvalue
DOID:0060306 0.001944162
DOID:10534   0.003980506
DOID:5041    0.009147624
DOID:820     0.015511798
DOID:1107    0.015511798
DOID:2799    0.017376259
                                                                                                                                       geneID
DOID:0060306                                                                                                   8318/81620/4174/4998/23594/990
DOID:10534   3572/2018/3576/4288/6352/3117/81930/2066/4312/6926/347902/563/8140/259266/2305/3131/2146/332/8061/3169/7031/8792/10403/2952/4318
DOID:5041                                            2018/3576/6890/771/2066/4312/7850/563/8140/2146/11122/3169/2921/3868/4321/7373/2952/4318
DOID:820                                                                                             2697/6280/3627/1493/4982/29851/8792/6279
DOID:1107                                                      2018/3576/6890/771/2066/4312/563/8140/2146/11122/3169/3868/4321/7373/2952/4318
DOID:2799                                                                                                  6352/4283/6373/3627/4582/3002/4318
             Count
DOID:0060306     6
DOID:10534      25
DOID:5041       18
DOID:820         8
DOID:1107       16
DOID:2799        7

The enrichDO() function requires an entrezgene ID vector as input, which is mostly the differential gene list from gene expression profile studies. Please refer to Section 17.1 if you need to convert other gene ID types to entrezgene ID.

The ont parameter can be “HDO” (Human Disease Ontology), “HPO” (Human Phenotype Ontology) or “MPO” (Mouse Phenotype Ontology). pvalueCutoff setting the cutoff value of p value and adjusted p value; pAdjustMethod setting the p value correction methods, include the Bonferroni correction (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”), Hommel (“hommel”), Benjamini & Hochberg (“BH”) and Benjamini & Yekutieli (“BY”) while qvalueCutoff is used to control q-values.

The universe sets the background gene universe for testing. If users do not explicitly set this parameter, enrichDO() will set the universe to all human genes that have DO annotation.

The minGSSize (and maxGSSize) indicates that only those DO terms that have more than minGSSize (and less than maxGSSize) annotated genes will be tested.

The readable is a logical parameter that indicates whether the entrezgene IDs will be mapped to gene symbols or not, see also Section 17.2.

8.1.2 Over-representation analysis for the network of cancer gene

Network of Cancer Gene (NCG) (A. et al. 2016) is a manually curated repository of cancer genes. NCG release 5.0 (Aug. 2015) collects 1,571 cancer genes from 175 published studies. DOSE supports analyzing gene list and determine whether they are enriched in genes known to be mutated in a given cancer type.

gene2 <- names(geneList)[abs(geneList) > 3]
ncg <- enrichNCG(gene2) 
head(ncg)
 [1] ID             Description    GeneRatio      BgRatio        RichFactor    
 [6] FoldEnrichment zScore         pvalue         p.adjust       qvalue        
[11] geneID         Count         
<0 rows> (or 0-length row.names)

8.1.3 Over-representation analysis for the disease gene network

DisGeNET(Janet et al. 2015) is an integrative and comprehensive resource of gene-disease associations from several public data sources and the literature. It contains gene-disease associations and snp-gene-disease associations.

The enrichment analysis of disease-gene associations is supported by the enrichDGN function and analysis of snp-gene-disease associations is supported by the enrichDGNv function.

dgn <- enrichDGN(gene) 
head(dgn)
               ID                              Description GeneRatio   BgRatio
C0010278 C0010278                         Craniosynostosis    43/497 488/21671
C0853879 C0853879             Invasive carcinoma of breast    42/497 473/21671
C4733092 C4733092 estrogen receptor-negative breast cancer    34/497 356/21671
C3642347 C3642347              Basal-Like Breast Carcinoma    28/497 245/21671
C3642345 C3642345               Luminal A Breast Carcinoma    22/497 153/21671
C0036202 C0036202                              Sarcoidosis    36/497 413/21671
         RichFactor FoldEnrichment    zScore       pvalue     p.adjust
C0010278 0.08811475       3.842122  9.728932 4.609534e-14 3.150086e-10
C0853879 0.08879493       3.871780  9.674768 7.105190e-14 3.150086e-10
C4733092 0.09550562       4.164391  9.223137 2.446675e-12 6.756634e-09
C3642347 0.11428571       4.983271  9.606353 3.047991e-12 6.756634e-09
C3642345 0.14379085       6.269802 10.021789 7.034749e-12 1.172052e-08
C0036202 0.08716707       3.800800  8.804448 7.930882e-12 1.172052e-08
               qvalue
C0010278 3.150086e-10
C0853879 3.150086e-10
C4733092 6.756634e-09
C3642347 6.756634e-09
C3642345 1.172052e-08
C0036202 1.172052e-08
                                                                                                                                                                                                                          geneID
C0010278 8318/2625/26279/81620/1300/3572/820/6424/3479/3576/6352/2152/3117/6890/1062/4312/4998/23594/9607/2697/6280/185/563/4069/990/2200/27299/1308/6278/3627/4330/1493/3667/3002/2146/4982/6362/29968/7031/1846/5327/6279/4318
C0853879        9582/1894/7153/2625/2173/51203/6664/890/3479/3576/5080/367/2066/4312/6926/701/8836/9232/53335/2151/4171/185/5241/10855/7980/1978/6278/1111/9787/4582/5304/2146/4982/4102/3169/983/8792/8842/2952/23158/1602/4318
C4733092                                         1894/2625/81620/3479/3576/367/7083/81930/6241/8839/79733/3620/5241/1408/4137/6278/214/2305/10551/2491/80129/4582/3667/5304/2146/4982/8061/29968/3169/1846/8792/27324/11004/5327
C3642347                                                                          6663/9833/29127/2625/2173/6664/3479/3576/11065/1062/4605/2568/3620/5241/10232/55765/18/2305/5163/2146/6790/3169/1846/7368/8792/27324/3159/4318
C3642345                                                                                                         6663/9833/7153/2625/2001/3576/4288/2152/185/4128/5241/3161/1111/2305/80129/4582/5304/3169/27324/55355/8614/4318
C0036202                               3479/3576/6352/4283/3117/6373/6890/4485/2167/4312/6863/6364/1524/7043/2053/23541/6280/185/4069/25802/27299/6355/3627/4582/3002/4982/5004/6362/29851/80736/4321/10403/2952/6279/26227/4318
         Count
C0010278    43
C0853879    42
C4733092    34
C3642347    28
C3642345    22
C0036202    36
snp <- c("rs1401296", "rs9315050", "rs5498", "rs1524668", "rs147377392",
         "rs841", "rs909253", "rs7193343", "rs3918232", "rs3760396",
         "rs2231137", "rs10947803", "rs17222919", "rs386602276", "rs11053646",
         "rs1805192", "rs139564723", "rs2230806", "rs20417", "rs966221")
dgnv <- enrichDGNv(snp)
head(dgnv)
               ID               Description GeneRatio    BgRatio RichFactor
C0010054 C0010054 Coronary Arteriosclerosis      6/17 440/194515 0.01363636
C0151744 C0151744       Myocardial Ischemia      4/17 103/194515 0.03883495
C0031099 C0031099             Periodontitis      4/17 116/194515 0.03448276
C0007785 C0007785       Cerebral Infarction      4/17 123/194515 0.03252033
C0003850 C0003850          Arteriosclerosis      4/17 267/194515 0.01498127
C0004153 C0004153           Atherosclerosis      4/17 281/194515 0.01423488
         FoldEnrichment   zScore       pvalue     p.adjust       qvalue
C0010054       156.0281 30.43647 1.568917e-12 5.895991e-09 5.895991e-09
C0151744       444.3518 42.07730 1.754840e-10 3.297345e-07 3.297345e-07
C0031099       394.5538 39.63952 2.839985e-10 3.381759e-07 3.381759e-07
C0007785       372.0995 38.48983 3.599531e-10 3.381759e-07 3.381759e-07
C0003850       171.4166 26.05143 8.145389e-09 6.122074e-06 6.122074e-06
C0004153       162.8763 25.38727 9.996713e-09 6.261274e-06 6.261274e-06
                                                            geneID Count
C0010054 rs1805192/rs5498/rs147377392/rs20417/rs2230806/rs11053646     6
C0151744                   rs1805192/rs5498/rs147377392/rs11053646     4
C0031099                         rs909253/rs1805192/rs5498/rs20417     4
C0007785                rs1805192/rs147377392/rs2230806/rs11053646     4
C0003850                        rs1805192/rs5498/rs20417/rs2230806     4
C0004153                        rs1805192/rs5498/rs20417/rs2230806     4

8.2 Disease gene set enrichment analysis

8.2.1 gseDO function

In the following example, in order to speed up the compilation of this document, only gene sets with size above 120 were tested and only 100 permutations were performed.

library(DOSE)
data(geneList)
y <- gseDO(geneList,
           minGSSize     = 120,
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",
           verbose       = FALSE)
head(y, 3)
                       ID                           Description setSize
DOID:612         DOID:612      primary immunodeficiency disease     234
DOID:2490       DOID:2490 congenital nervous system abnormality     114
DOID:0080599 DOID:0080599        Coronavirus infectious disease     131
             enrichmentScore      NES       pvalue     p.adjust       qvalue
DOID:612           0.4497659 2.052375 8.104981e-10 1.475107e-07 6.288567e-09
DOID:2490          0.4518077 1.879457 4.539146e-05 2.045595e-03 8.720634e-05
DOID:0080599       0.4090312 1.744238 5.873996e-05 2.045595e-03 8.720634e-05
             rank                   leading_edge
DOID:612     2521 tags=43%, list=20%, signal=35%
DOID:2490    2245 tags=35%, list=18%, signal=29%
DOID:0080599 1146  tags=23%, list=9%, signal=21%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       core_enrichment
DOID:612     55388/7153/9837/29851/9636/1503/1493/7037/4173/3932/3559/6772/51311/3507/3561/917/3574/3575/919/4860/915/22806/5693/4938/1535/3458/959/5336/11151/3702/925/4688/64135/28755/50615/974/1794/3689/5788/5424/916/7096/4068/3937/30009/5695/3394/10525/100/7374/3659/940/939/4689/5880/7128/6891/4210/6789/5699/930/6573/11322/204/6850/10095/7124/3569/7097/7852/8772/5692/64170/3119/1956/28985/1053/5971/1536/10125/8456/8625/3071/7293/4478/1380/958/5054/5591/9437/10379/54440/3570/3978/3593/10625/29927/3558/51371/735
DOID:2490                                                                                                                                                                                                                                                                                                   1062/23397/259266/55165/4085/4001/9928/23007/9918/699/6491/84823/6496/8543/7277/8941/9585/81624/23310/84790/284403/7545/16/203068/10381/27341/10383/1020/10376/7159/5129/1641/833/54517/55835/81027/1119/84107/23019/55262
DOID:0080599                                                                                                                                                                                                                                                                                                                                                                       4312/6279/3627/4283/3002/6355/2921/3576/6352/3934/59272/713/3559/1230/6772/6347/3561/6351/6590/3574/6354/3458/4783/925/1991/1234/916/6374/1088/2919

8.2.2 gseNCG function

ncg <- gseNCG(geneList,
              pvalueCutoff  = 0.5,
              pAdjustMethod = "BH",
              verbose       = FALSE)
ncg <- setReadable(ncg, 'org.Hs.eg.db')
head(ncg, 3) 
                                                           ID Description
pan-gynecological and breast     pan-gynecological and breast          NA
pan-gastric                                       pan-gastric          NA
breast_fibroepithelial_tumours breast_fibroepithelial_tumours          NA
                               setSize enrichmentScore       NES      pvalue
pan-gynecological and breast        43      -0.5263430 -1.693957 0.003106675
pan-gastric                         49      -0.4993804 -1.681018 0.004291327
breast_fibroepithelial_tumours      17      -0.6421576 -1.661797 0.011351646
                                p.adjust     qvalue rank
pan-gynecological and breast   0.1258789 0.06882972 2464
pan-gastric                    0.1258789 0.06882972 3280
breast_fibroepithelial_tumours 0.2497362 0.13655403 2700
                                                 leading_edge
pan-gynecological and breast   tags=40%, list=20%, signal=32%
pan-gastric                    tags=49%, list=26%, signal=36%
breast_fibroepithelial_tumours tags=53%, list=22%, signal=42%
                                                                                                                                                              core_enrichment
pan-gynecological and breast                                            NIPBL/SPOP/ARID1A/RASA1/RB1/RNF43/MAP2K4/NF1/CTNNB1/TP53/PIK3R1/CDKN1B/CCND1/ARID5B/MAP3K1/TBX3/GATA3
pan-gastric                    BCOR/SOX9/TCF7L2/ATM/CALD1/SEMG2/HTR7/ARID1A/RASA1/RB1/TTBK2/RNF43/CTNNB1/TP53/BCL9/SMAD3/APC/ZFP36L2/TGFBR2/MUC6/MAP3K1/CACNA1C/ATP8B1/CYP4B1
breast_fibroepithelial_tumours                                                                                               SETD2/RB1/PCNX4/NF1/TP53/RARA/SYNE1/MAP3K1/ERBB4

8.2.3 gseDGN function

dgn <- gseDGN(geneList,
              pvalueCutoff  = 0.2,
              pAdjustMethod = "BH",
              verbose       = FALSE)
dgn <- setReadable(dgn, 'org.Hs.eg.db')
head(dgn, 3) 
               ID                  Description setSize enrichmentScore      NES
C0151332 C0151332          Active tuberculosis      92       0.5924568 2.391317
C0024266 C0024266 Lymphocytic Choriomeningitis     120       0.5712593 2.389121
C1136382 C1136382         Sclerocystic Ovaries     106       0.5770160 2.374628
               pvalue     p.adjust       qvalue rank
C0151332 9.505300e-10 1.404408e-06 7.457258e-07 1687
C0024266 6.294850e-11 2.790192e-07 1.481563e-07 2579
C1136382 3.628877e-10 8.635218e-07 4.585209e-07  610
                           leading_edge
C0151332 tags=35%, list=14%, signal=30%
C0024266 tags=48%, list=21%, signal=38%
C1136382  tags=22%, list=5%, signal=21%
                                                                                                                                                                                                                                                                                                                                 core_enrichment
C0151332                                                                                                                                                            CXCL10/ISG20/CAMP/GZMB/LAG3/CXCL8/ISG15/CD38/CTLA4/CCL5/IL2RA/TLR8/CCL2/MUC16/CCL4/SIGLEC1/NCAPG2/CCN6/IFNG/DTYMK/HMOX1/IL32/CCR5/CCL22/IL15/LBP/CD27/EBI3/CD14/TNF/IL6/TLR2
C0024266 S100A9/CXCL10/CXCL9/EZH2/GZMB/ICOS/USP18/CXCL8/CTLA4/TREM1/PRF1/ADM/CA9/STAT1/CCL2/SELL/CDKN2A/IL7/IL7R/IFNG/CCR5/IL27RA/SH2D1A/FCER1G/CDK2AP2/CPVL/CD27/PSMB10/PTPN22/SLAMF1/KDM1A/TNF/IL6/FGL2/TLR2/RPAIN/NELFCD/PDCD1/WAS/HIF1A/ATP5F1B/FCGR2B/EGR2/STX11/CXCR3/TYROBP/YME1L1/SOSTDC1/PTPN2/TRAF1/HNF1A/IRF9/PML/NR0B2/IL2/TOX/AGFG1
C1136382                                                                                                                                                                                                     TOP2A/ASPM/RRM2/CEP55/PBK/MAD2L1/S100P/TTK/CCNB1/CKS2/HMMR/MYBL1/CDC6/RACGAP1/BCL11A/HELLS/TFRC/CCNE2/STAR/SRD5A1/TRPV6/RUNX3/CLDN4

References

A., Omer, Giovanni M. D., Thanos P. M., and Francesca D. C. 2016. “NCG 5.0: Updates of a Manually Curated Repository of Cancer Genes and Associated Properties from Cancer Mutational Screenings.” Nucleic Acids Research 44 (D1): D992–99. https://doi.org/10.1093/nar/gkv1123.
Janet, P., Núria Q. R., Àlex B., Jordi D. P., Anna B. M., Martin B., Ferran S., and Laura I. F. 2015. “DisGeNET: A Discovery Platform for the Dynamical Exploration of Human Diseases and Their Genes.” Database 2015 (March): bav028. https://doi.org/10.1093/database/bav028.
Schriml, L. M., C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe. 2011. “Disease Ontology: A Backbone for Disease Semantic Integration.” Nucleic Acids Research 40 (D1): D940–46. https://doi.org/10.1093/nar/gkr972.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. DOSE: An r/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.