8  Disease enrichment analysis

We developed DOSE (Yu et al. 2015) package to promote the investigation of diseases. DOSE provides five methods for measuring semantic similarities among DO terms and gene products, hypergeometric model and gene set enrichment analysis (GSEA) for associating disease with gene list and extracting disease association insight from genome wide expression profiles.

8.1 Disease over-representation analysis

DOSE supports enrichment analysis of Disease Ontology (DO) (Schriml et al. 2011), Network of Cancer Gene (A. et al. 2016) and Disease Gene Network (DisGeNET) (Janet et al. 2015). In addition, several visualization methods were provided by enrichplot to help interpreting semantic and enrichment results.

8.1.1 Over-representation analysis for disease ontology

In the following example, we selected genes with fold change above 1.5 as the differential genes and analyzed their disease association.

library(DOSE)
data(geneList)
gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  
x <- enrichDO(gene          = gene,
              ont           = "HDO",
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              universe      = names(geneList),
              minGSSize     = 5,
              maxGSSize     = 500,
              qvalueCutoff  = 0.05,
              readable      = FALSE)
head(x)
                       ID              Description GeneRatio  BgRatio
DOID:0060306 DOID:0060306    Meier-Gorlin syndrome     6/279  10/6272
DOID:10534     DOID:10534           stomach cancer    25/279 210/6272
DOID:5041       DOID:5041        esophageal cancer    18/279 132/6272
DOID:820         DOID:820              myocarditis     8/279  31/6272
DOID:1107       DOID:1107     esophageal carcinoma    16/279 117/6272
DOID:2799       DOID:2799 bronchiolitis obliterans     7/279  25/6272
             RichFactor FoldEnrichment   zScore       pvalue    p.adjust
DOID:0060306  0.6000000      13.488172 8.526891 1.327073e-06 0.001944162
DOID:10534    0.1190476       2.676225 5.330676 5.434138e-06 0.003980506
DOID:5041     0.1363636       3.065494 5.174579 1.873233e-05 0.009147624
DOID:820      0.2580645       5.801364 5.781849 4.472203e-05 0.015511798
DOID:1107     0.1367521       3.074227 4.886339 5.294129e-05 0.015511798
DOID:2799     0.2800000       6.294480 5.722765 7.697571e-05 0.017376259
                  qvalue
DOID:0060306 0.001944162
DOID:10534   0.003980506
DOID:5041    0.009147624
DOID:820     0.015511798
DOID:1107    0.015511798
DOID:2799    0.017376259
                                                                                                                                       geneID
DOID:0060306                                                                                                   8318/4998/81620/990/4174/23594
DOID:10534   3572/3576/4312/10403/8792/563/2952/3117/2018/3131/2305/3169/7031/2066/259266/332/2146/347902/8061/81930/4288/6352/6926/4318/8140
DOID:5041                                            6890/4321/11122/3576/4312/3868/7850/563/2952/2018/3169/2066/771/2146/7373/2921/4318/8140
DOID:820                                                                                             1493/6279/8792/3627/6280/2697/29851/4982
DOID:1107                                                      6890/4321/11122/3576/4312/3868/563/2952/2018/3169/2066/771/2146/7373/4318/8140
DOID:2799                                                                                                  4582/4283/3627/3002/6352/6373/4318
             Count
DOID:0060306     6
DOID:10534      25
DOID:5041       18
DOID:820         8
DOID:1107       16
DOID:2799        7

The enrichDO() function requires an entrezgene ID vector as input, which is mostly the differential gene list from gene expression profile studies. Please refer to Section 17.1 if you need to convert other gene ID types to entrezgene ID.

The ont parameter can be “HDO” (Human Disease Ontology), “HPO” (Human Phenotype Ontology) or “MPO” (Mouse Phenotype Ontology). pvalueCutoff setting the cutoff value of p value and adjusted p value; pAdjustMethod setting the p value correction methods, include the Bonferroni correction (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”), Hommel (“hommel”), Benjamini & Hochberg (“BH”) and Benjamini & Yekutieli (“BY”) while qvalueCutoff is used to control q-values.

The universe sets the background gene universe for testing. If users do not explicitly set this parameter, enrichDO() will set the universe to all human genes that have DO annotation.

The minGSSize (and maxGSSize) indicates that only those DO terms that have more than minGSSize (and less than maxGSSize) annotated genes will be tested.

The readable is a logical parameter that indicates whether the entrezgene IDs will be mapped to gene symbols or not, see also Section 17.2.

8.1.2 Over-representation analysis for the network of cancer gene

Network of Cancer Gene (NCG) (A. et al. 2016) is a manually curated repository of cancer genes. NCG release 5.0 (Aug. 2015) collects 1,571 cancer genes from 175 published studies. DOSE supports analyzing gene list and determine whether they are enriched in genes known to be mutated in a given cancer type.

gene2 <- names(geneList)[abs(geneList) > 3]
ncg <- enrichNCG(gene2) 
head(ncg)
 [1] ID             Description    GeneRatio      BgRatio        RichFactor    
 [6] FoldEnrichment zScore         pvalue         p.adjust       qvalue        
[11] geneID         Count         
<0 rows> (or 0-length row.names)

8.1.3 Over-representation analysis for the disease gene network

DisGeNET(Janet et al. 2015) is an integrative and comprehensive resource of gene-disease associations from several public data sources and the literature. It contains gene-disease associations and snp-gene-disease associations.

The enrichment analysis of disease-gene associations is supported by the enrichDGN function and analysis of snp-gene-disease associations is supported by the enrichDGNv function.

dgn <- enrichDGN(gene) 
head(dgn)
               ID                              Description GeneRatio   BgRatio
C0010278 C0010278                         Craniosynostosis    43/497 488/21671
C0853879 C0853879             Invasive carcinoma of breast    42/497 473/21671
C4733092 C4733092 estrogen receptor-negative breast cancer    34/497 356/21671
C3642347 C3642347              Basal-Like Breast Carcinoma    28/497 245/21671
C3642345 C3642345               Luminal A Breast Carcinoma    22/497 153/21671
C0036202 C0036202                              Sarcoidosis    36/497 413/21671
         RichFactor FoldEnrichment    zScore       pvalue     p.adjust
C0010278 0.08811475       3.842122  9.728932 4.609534e-14 3.150086e-10
C0853879 0.08879493       3.871780  9.674768 7.105190e-14 3.150086e-10
C4733092 0.09550562       4.164391  9.223137 2.446675e-12 6.756634e-09
C3642347 0.11428571       4.983271  9.606353 3.047991e-12 6.756634e-09
C3642345 0.14379085       6.269802 10.021789 7.034749e-12 1.172052e-08
C0036202 0.08716707       3.800800  8.804448 7.930882e-12 1.172052e-08
               qvalue
C0010278 3.150086e-10
C0853879 3.150086e-10
C4733092 6.756634e-09
C3642347 6.756634e-09
C3642345 1.172052e-08
C0036202 1.172052e-08
                                                                                                                                                                                                                          geneID
C0010278 820/8318/1493/6890/3572/4998/81620/2152/3576/990/6278/4312/6279/2200/1846/4069/563/5327/3627/3117/6280/2697/2625/9607/26279/4330/27299/29968/7031/4982/6362/1062/3667/2146/3002/3479/1300/23594/6352/6424/185/1308/4318
C0853879        5080/4582/1894/9582/23158/7980/2173/3576/983/6278/4312/9232/1111/7153/8792/1978/2952/5241/2625/53335/9787/3169/4171/4982/2066/51203/2146/3479/890/8836/701/367/2151/185/10855/4102/8842/5304/6926/1602/4318/6664
C4733092                                         4582/1894/81620/3576/6278/8792/1846/5241/5327/2625/1408/11004/4137/2305/29968/3169/4982/10551/2491/3667/7083/2146/3479/8061/81930/6241/3620/8839/367/80129/214/27324/5304/79733
C3642347                                                                          29127/18/2173/3576/8792/1846/5241/55765/2625/6790/2305/4605/3169/11065/1062/2568/2146/3479/3159/3620/9833/27324/7368/10232/5163/4318/6663/6664
C3642345                                                                                                         4582/2152/2001/3576/1111/7153/3161/8614/5241/4128/2625/55355/2305/3169/4288/185/80129/9833/27324/5304/4318/6663
C0036202                               4582/6890/4485/4321/3576/4312/6355/10403/6279/4283/4069/2952/3627/3117/6280/5004/1524/29851/27299/2167/4982/6863/2053/6362/25802/23541/3002/3479/6364/80736/7043/6352/26227/185/6373/4318
         Count
C0010278    43
C0853879    42
C4733092    34
C3642347    28
C3642345    22
C0036202    36
snp <- c("rs1401296", "rs9315050", "rs5498", "rs1524668", "rs147377392",
         "rs841", "rs909253", "rs7193343", "rs3918232", "rs3760396",
         "rs2231137", "rs10947803", "rs17222919", "rs386602276", "rs11053646",
         "rs1805192", "rs139564723", "rs2230806", "rs20417", "rs966221")
dgnv <- enrichDGNv(snp)
head(dgnv)
               ID               Description GeneRatio    BgRatio RichFactor
C0010054 C0010054 Coronary Arteriosclerosis      6/17 440/194515 0.01363636
C0151744 C0151744       Myocardial Ischemia      4/17 103/194515 0.03883495
C0031099 C0031099             Periodontitis      4/17 116/194515 0.03448276
C0007785 C0007785       Cerebral Infarction      4/17 123/194515 0.03252033
C0003850 C0003850          Arteriosclerosis      4/17 267/194515 0.01498127
C0004153 C0004153           Atherosclerosis      4/17 281/194515 0.01423488
         FoldEnrichment   zScore       pvalue     p.adjust       qvalue
C0010054       156.0281 30.43647 1.568917e-12 5.895991e-09 5.895991e-09
C0151744       444.3518 42.07730 1.754840e-10 3.297345e-07 3.297345e-07
C0031099       394.5538 39.63952 2.839985e-10 3.381759e-07 3.381759e-07
C0007785       372.0995 38.48983 3.599531e-10 3.381759e-07 3.381759e-07
C0003850       171.4166 26.05143 8.145389e-09 6.122074e-06 6.122074e-06
C0004153       162.8763 25.38727 9.996713e-09 6.261274e-06 6.261274e-06
                                                            geneID Count
C0010054 rs11053646/rs1805192/rs5498/rs20417/rs147377392/rs2230806     6
C0151744                   rs11053646/rs1805192/rs5498/rs147377392     4
C0031099                         rs909253/rs1805192/rs5498/rs20417     4
C0007785                rs11053646/rs1805192/rs147377392/rs2230806     4
C0003850                        rs1805192/rs5498/rs20417/rs2230806     4
C0004153                        rs1805192/rs5498/rs20417/rs2230806     4

8.2 Disease gene set enrichment analysis

8.2.1 gseDO function

In the following example, in order to speed up the compilation of this document, only gene sets with size above 120 were tested and only 100 permutations were performed.

library(DOSE)
data(geneList)
y <- gseDO(geneList,
           minGSSize     = 120,
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",
           verbose       = FALSE)
head(y, 3)
                   ID                           Description setSize
DOID:612     DOID:612      primary immunodeficiency disease     234
DOID:2490   DOID:2490 congenital nervous system abnormality     114
DOID:11054 DOID:11054                urinary bladder cancer     168
           enrichmentScore      NES       pvalue     p.adjust       qvalue rank
DOID:612         0.4497659 2.067220 3.118029e-10 5.674812e-08 3.274348e-09 2521
DOID:2490        0.4518077 1.901323 4.172011e-05 1.518612e-03 8.762340e-05 2245
DOID:11054       0.3925549 1.750664 3.521145e-05 1.518612e-03 8.762340e-05 1687
                             leading_edge
DOID:612   tags=43%, list=20%, signal=35%
DOID:2490  tags=35%, list=18%, signal=29%
DOID:11054 tags=27%, list=14%, signal=23%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     core_enrichment
DOID:612   55388/7153/9837/29851/9636/1503/1493/7037/4173/3932/3559/6772/51311/3507/3561/917/3574/3575/919/4860/915/22806/5693/4938/1535/3458/959/5336/11151/3702/925/4688/64135/28755/50615/974/1794/3689/5788/5424/916/7096/4068/3937/30009/5695/3394/10525/100/7374/3659/940/939/4689/5880/7128/6891/4210/6789/5699/930/6573/11322/204/6850/10095/7124/3569/7097/7852/8772/5692/64170/3119/1956/28985/1053/5971/1536/10125/8456/8625/3071/7293/4478/1380/958/5054/5591/9437/10379/54440/3570/3978/3593/10625/29927/3558/51371/735
DOID:2490                                                                                                                                                                                                                                                                                                 1062/23397/259266/55165/4085/4001/9928/23007/9918/699/6491/84823/6496/8543/7277/8941/9585/81624/23310/84790/284403/7545/16/203068/10381/27341/10383/1020/10376/7159/5129/1641/833/54517/55835/81027/1119/84107/23019/55262
DOID:11054                                                                                                                                                                                                                                                                                                4312/6280/6279/7153/6241/983/332/2146/6790/993/5347/4318/898/4288/2175/768/994/7298/4609/1869/1029/727897/4830/864/3458/2950/4436/2810/2956/1019/3383/239/11200/7428/6696/6285/7422/999/6659/3732/1786/1312/1890/1030/7097

8.2.2 gseNCG function

ncg <- gseNCG(geneList,
              pvalueCutoff  = 0.5,
              pAdjustMethod = "BH",
              verbose       = FALSE)
ncg <- setReadable(ncg, 'org.Hs.eg.db')
head(ncg, 3) 
                                                           ID Description
pan-gynecological and breast     pan-gynecological and breast          NA
breast_fibroepithelial_tumours breast_fibroepithelial_tumours          NA
pan-gastric                                       pan-gastric          NA
                               setSize enrichmentScore       NES      pvalue
pan-gynecological and breast        43      -0.5263430 -1.701061 0.002263971
breast_fibroepithelial_tumours      17      -0.6421576 -1.687765 0.009662272
pan-gastric                         49      -0.4993804 -1.650134 0.007854058
                                p.adjust     qvalue rank
pan-gynecological and breast   0.1713378 0.07782234 2464
breast_fibroepithelial_tumours 0.2125700 0.09655020 2700
pan-gastric                    0.2125700 0.09655020 3280
                                                 leading_edge
pan-gynecological and breast   tags=40%, list=20%, signal=32%
breast_fibroepithelial_tumours tags=53%, list=22%, signal=42%
pan-gastric                    tags=49%, list=26%, signal=36%
                                                                                                                                                              core_enrichment
pan-gynecological and breast                                            NIPBL/SPOP/ARID1A/RASA1/RB1/RNF43/MAP2K4/NF1/CTNNB1/TP53/PIK3R1/CDKN1B/CCND1/ARID5B/MAP3K1/TBX3/GATA3
breast_fibroepithelial_tumours                                                                                               SETD2/RB1/PCNX4/NF1/TP53/RARA/SYNE1/MAP3K1/ERBB4
pan-gastric                    BCOR/SOX9/TCF7L2/ATM/CALD1/SEMG2/HTR7/ARID1A/RASA1/RB1/TTBK2/RNF43/CTNNB1/TP53/BCL9/SMAD3/APC/ZFP36L2/TGFBR2/MUC6/MAP3K1/CACNA1C/ATP8B1/CYP4B1

8.2.3 gseDGN function

dgn <- gseDGN(geneList,
              pvalueCutoff  = 0.2,
              pAdjustMethod = "BH",
              verbose       = FALSE)
dgn <- setReadable(dgn, 'org.Hs.eg.db')
head(dgn, 3) 
               ID                                      Description setSize
C0024266 C0024266                     Lymphocytic Choriomeningitis     120
C2700553 C2700553                                   Omenn Syndrome      29
C0153014 C0153014 Non-arthropod borne lymphocytic choriomeningitis     112
         enrichmentScore      NES       pvalue     p.adjust       qvalue rank
C0024266       0.5712593 2.399254 5.968593e-11 2.645579e-07 1.442019e-07 2579
C2700553       0.7461280 2.372404 4.839696e-07 2.031012e-04 1.107039e-04 1123
C0153014       0.5688856 2.359965 3.679235e-10 9.752880e-07 5.315978e-07 2579
                           leading_edge
C0024266 tags=48%, list=21%, signal=38%
C2700553  tags=52%, list=9%, signal=47%
C0153014 tags=49%, list=21%, signal=39%
                                                                                                                                                                                                                                                                                                                                 core_enrichment
C0024266 S100A9/CXCL10/CXCL9/EZH2/GZMB/ICOS/USP18/CXCL8/CTLA4/TREM1/PRF1/ADM/CA9/STAT1/CCL2/SELL/CDKN2A/IL7/IL7R/IFNG/CCR5/IL27RA/SH2D1A/FCER1G/CDK2AP2/CPVL/CD27/PSMB10/PTPN22/SLAMF1/KDM1A/TNF/IL6/FGL2/TLR2/RPAIN/NELFCD/PDCD1/WAS/HIF1A/ATP5F1B/FCGR2B/EGR2/STX11/CXCR3/TYROBP/YME1L1/SOSTDC1/PTPN2/TRAF1/HNF1A/IRF9/PML/NR0B2/IL2/TOX/AGFG1
C2700553                                                                                                                                                                                                                                                      ISG20/CTLA4/TFRC/IL2RA/IL2RG/IL7R/PNP/CD3D/IFNG/CD40LG/CORO1A/IL21R/PTPRC/CD3E/ADA
C0153014            S100A9/CXCL10/EZH2/GZMB/ICOS/USP18/CXCL8/CTLA4/TREM1/PRF1/ADM/CA9/STAT1/CCL2/SELL/CDKN2A/IL7/IL7R/IFNG/IL27RA/SH2D1A/FCER1G/CDK2AP2/CPVL/CD27/PSMB10/PTPN22/SLAMF1/KDM1A/TNF/IL6/FGL2/TLR2/RPAIN/NELFCD/PDCD1/WAS/HIF1A/ATP5F1B/FCGR2B/EGR2/STX11/CXCR3/TYROBP/YME1L1/SOSTDC1/PTPN2/TRAF1/HNF1A/IRF9/PML/NR0B2/IL2/TOX/AGFG1

References

A., Omer, Giovanni M. D., Thanos P. M., and Francesca D. C. 2016. “NCG 5.0: Updates of a Manually Curated Repository of Cancer Genes and Associated Properties from Cancer Mutational Screenings.” Nucleic Acids Research 44 (D1): D992–99. https://doi.org/10.1093/nar/gkv1123.
Janet, P., Núria Q. R., Àlex B., Jordi D. P., Anna B. M., Martin B., Ferran S., and Laura I. F. 2015. “DisGeNET: A Discovery Platform for the Dynamical Exploration of Human Diseases and Their Genes.” Database 2015 (March): bav028. https://doi.org/10.1093/database/bav028.
Schriml, L. M., C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe. 2011. “Disease Ontology: A Backbone for Disease Semantic Integration.” Nucleic Acids Research 40 (D1): D940–46. https://doi.org/10.1093/nar/gkr972.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. DOSE: An r/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.