8 Disease enrichment analysis

We developed DOSE (Yu et al. 2015) package to promote the investigation of diseases. DOSE provides five methods for measuring semantic similarities among DO terms and gene products, hypergeometric model and gene set enrichment analysis (GSEA) for associating disease with gene list and extracting disease association insight from genome wide expression profiles.

8.1 Disease over-representation analysis

DOSE supports enrichment analysis of Disease Ontology (DO) (Schriml et al. 2011), Network of Cancer Gene (A. et al. 2016) and Disease Gene Network (DisGeNET) (Janet et al. 2015). In addition, several visualization methods were provided by enrichplot to help interpreting semantic and enrichment results.

8.1.1 Over-representation analysis for disease ontology

In the following example, we selected genes with fold change above 1.5 as the differential genes and analyzed their disease association.

library(DOSE)
data(geneList)
gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)

[1] "4312"  "8318"  "10874" "55143" "55388" "991"

x <- enrichDO(gene          = gene,
              ont           = "HDO",
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              universe      = names(geneList),
              minGSSize     = 5,
              maxGSSize     = 500,
              qvalueCutoff  = 0.05,
              readable      = FALSE)
head(x)

                       ID              Description GeneRatio  BgRatio
DOID:0060306 DOID:0060306    Meier-Gorlin syndrome     6/279  10/6272
DOID:10534     DOID:10534           stomach cancer    25/279 210/6272
DOID:5041       DOID:5041        esophageal cancer    18/279 132/6272
DOID:820         DOID:820              myocarditis     8/279  31/6272
DOID:1107       DOID:1107     esophageal carcinoma    16/279 117/6272
DOID:2799       DOID:2799 bronchiolitis obliterans     7/279  25/6272
             RichFactor FoldEnrichment   zScore       pvalue    p.adjust
DOID:0060306  0.6000000      13.488172 8.526891 1.327073e-06 0.001944162
DOID:10534    0.1190476       2.676225 5.330676 5.434138e-06 0.003980506
DOID:5041     0.1363636       3.065494 5.174579 1.873233e-05 0.009147624
DOID:820      0.2580645       5.801364 5.781849 4.472203e-05 0.015511798
DOID:1107     0.1367521       3.074227 4.886339 5.294129e-05 0.015511798
DOID:2799     0.2800000       6.294480 5.722765 7.697571e-05 0.017376259
                  qvalue
DOID:0060306 0.001944162
DOID:10534   0.003980506
DOID:5041    0.009147624
DOID:820     0.015511798
DOID:1107    0.015511798
DOID:2799    0.017376259
                                                                                                                                       geneID
DOID:0060306                                                                                                   990/4998/23594/8318/81620/4174
DOID:10534   2952/6926/3572/259266/3576/8792/7031/3169/2066/4312/4318/332/2305/4288/2018/3117/563/81930/6352/8140/2146/8061/3131/10403/347902
DOID:5041                                            2952/7850/4321/3576/3868/3169/2066/771/4312/4318/2018/563/8140/11122/2146/7373/2921/6890
DOID:820                                                                                             1493/8792/6279/2697/6280/4982/3627/29851
DOID:1107                                                      2952/4321/3576/3868/3169/2066/771/4312/4318/2018/563/8140/11122/2146/7373/6890
DOID:2799                                                                                                  4582/4283/4318/6352/3627/3002/6373
             Count
DOID:0060306     6
DOID:10534      25
DOID:5041       18
DOID:820         8
DOID:1107       16
DOID:2799        7

The enrichDO() function requires an entrezgene ID vector as input, which is mostly the differential gene list from gene expression profile studies. Please refer to Section 17.1 if you need to convert other gene ID types to entrezgene ID.

The ont parameter can be “HDO” (Human Disease Ontology), “HPO” (Human Phenotype Ontology) or “MPO” (Mouse Phenotype Ontology). pvalueCutoff setting the cutoff value of p value and adjusted p value; pAdjustMethod setting the p value correction methods, include the Bonferroni correction (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”), Hommel (“hommel”), Benjamini & Hochberg (“BH”) and Benjamini & Yekutieli (“BY”) while qvalueCutoff is used to control q-values.

The universe sets the background gene universe for testing. If users do not explicitly set this parameter, enrichDO() will set the universe to all human genes that have DO annotation.

The minGSSize (and maxGSSize) indicates that only those DO terms that have more than minGSSize (and less than maxGSSize) annotated genes will be tested.

The readable is a logical parameter that indicates whether the entrezgene IDs will be mapped to gene symbols or not, see also Section 17.2.

8.1.2 Over-representation analysis for the network of cancer gene

Network of Cancer Gene (NCG) (A. et al. 2016) is a manually curated repository of cancer genes. NCG release 5.0 (Aug. 2015) collects 1,571 cancer genes from 175 published studies. DOSE supports analyzing gene list and determine whether they are enriched in genes known to be mutated in a given cancer type.

gene2 <- names(geneList)[abs(geneList) > 3]
ncg <- enrichNCG(gene2) 
head(ncg)

 [1] ID             Description    GeneRatio      BgRatio        RichFactor    
 [6] FoldEnrichment zScore         pvalue         p.adjust       qvalue        
[11] geneID         Count         
<0 rows> (or 0-length row.names)

8.1.3 Over-representation analysis for the disease gene network

DisGeNET(Janet et al. 2015) is an integrative and comprehensive resource of gene-disease associations from several public data sources and the literature. It contains gene-disease associations and snp-gene-disease associations.

The enrichment analysis of disease-gene associations is supported by the enrichDGN function and analysis of snp-gene-disease associations is supported by the enrichDGNv function.

dgn <- enrichDGN(gene) 
head(dgn)

               ID                              Description GeneRatio   BgRatio
C0010278 C0010278                         Craniosynostosis    43/497 488/21671
C0853879 C0853879             Invasive carcinoma of breast    42/497 473/21671
C4733092 C4733092 estrogen receptor-negative breast cancer    34/497 356/21671
C3642347 C3642347              Basal-Like Breast Carcinoma    28/497 245/21671
C3642345 C3642345               Luminal A Breast Carcinoma    22/497 153/21671
C0036202 C0036202                              Sarcoidosis    36/497 413/21671
         RichFactor FoldEnrichment    zScore       pvalue     p.adjust
C0010278 0.08811475       3.842122  9.728932 4.609534e-14 3.150086e-10
C0853879 0.08879493       3.871780  9.674768 7.105190e-14 3.150086e-10
C4733092 0.09550562       4.164391  9.223137 2.446675e-12 6.756634e-09
C3642347 0.11428571       4.983271  9.606353 3.047991e-12 6.756634e-09
C3642345 0.14379085       6.269802 10.021789 7.034749e-12 1.172052e-08
C0036202 0.08716707       3.800800  8.804448 7.930882e-12 1.172052e-08
               qvalue
C0010278 3.150086e-10
C0853879 3.150086e-10
C4733092 6.756634e-09
C3642347 6.756634e-09
C3642345 1.172052e-08
C0036202 1.172052e-08
                                                                                                                                                                                                                          geneID
C0010278 1493/6278/26279/990/2200/3572/3576/2152/5327/4998/2625/23594/3479/6424/820/7031/29968/6279/8318/1846/4312/4318/2697/6280/81620/1308/4330/1300/3117/563/9607/6352/4982/4069/3627/185/3002/2146/1062/3667/27299/6890/6362
C0853879        2952/6278/6664/9232/6926/1894/4582/7153/4171/3576/2625/3479/890/5241/5304/8792/5080/3169/2151/2066/9787/4312/4318/1602/51203/7980/9582/983/1111/1978/10855/53335/367/4982/8842/185/2146/4102/8836/701/2173/23158
C4733092                                         6278/3620/1894/2491/4582/3576/5327/2625/80129/79733/7083/8839/3479/5241/5304/8792/29968/3169/214/1846/81620/6241/2305/1408/81930/367/4137/4982/27324/2146/8061/10551/3667/11004
C3642347                                                                          6664/3620/29127/4605/3576/2625/3479/5241/5163/9833/8792/3169/6663/7368/1846/4318/10232/2305/18/11065/27324/55765/2568/2146/6790/1062/2173/3159
C3642345                                                                                                         8614/4582/7153/3576/2152/2625/3161/80129/5241/4128/5304/9833/3169/6663/55355/4318/1111/2305/4288/27324/185/2001
C0036202                               2952/4582/2167/6863/4321/4283/23541/3576/26227/4485/7043/3479/2053/6279/80736/4312/4318/6280/3117/1524/6352/6355/25802/5004/4982/4069/3627/185/3002/29851/10403/6373/27299/6890/6364/6362
         Count
C0010278    43
C0853879    42
C4733092    34
C3642347    28
C3642345    22
C0036202    36

snp <- c("rs1401296", "rs9315050", "rs5498", "rs1524668", "rs147377392",
         "rs841", "rs909253", "rs7193343", "rs3918232", "rs3760396",
         "rs2231137", "rs10947803", "rs17222919", "rs386602276", "rs11053646",
         "rs1805192", "rs139564723", "rs2230806", "rs20417", "rs966221")
dgnv <- enrichDGNv(snp)
head(dgnv)

               ID               Description GeneRatio    BgRatio RichFactor
C0010054 C0010054 Coronary Arteriosclerosis      6/17 440/194515 0.01363636
C0151744 C0151744       Myocardial Ischemia      4/17 103/194515 0.03883495
C0031099 C0031099             Periodontitis      4/17 116/194515 0.03448276
C0007785 C0007785       Cerebral Infarction      4/17 123/194515 0.03252033
C0003850 C0003850          Arteriosclerosis      4/17 267/194515 0.01498127
C0004153 C0004153           Atherosclerosis      4/17 281/194515 0.01423488
         FoldEnrichment   zScore       pvalue     p.adjust       qvalue
C0010054       156.0281 30.43647 1.568917e-12 5.895991e-09 5.895991e-09
C0151744       444.3518 42.07730 1.754840e-10 3.297345e-07 3.297345e-07
C0031099       394.5538 39.63952 2.839985e-10 3.381759e-07 3.381759e-07
C0007785       372.0995 38.48983 3.599531e-10 3.381759e-07 3.381759e-07
C0003850       171.4166 26.05143 8.145389e-09 6.122074e-06 6.122074e-06
C0004153       162.8763 25.38727 9.996713e-09 6.261274e-06 6.261274e-06
                                                            geneID Count
C0010054 rs2230806/rs147377392/rs1805192/rs5498/rs11053646/rs20417     6
C0151744                   rs147377392/rs1805192/rs5498/rs11053646     4
C0031099                         rs909253/rs1805192/rs5498/rs20417     4
C0007785                rs2230806/rs147377392/rs1805192/rs11053646     4
C0003850                        rs2230806/rs1805192/rs5498/rs20417     4
C0004153                        rs2230806/rs1805192/rs5498/rs20417     4

8.2 Disease gene set enrichment analysis

8.2.1 `gseDO` function

In the following example, in order to speed up the compilation of this document, only gene sets with size above 120 were tested and only 100 permutations were performed.

library(DOSE)
data(geneList)
y <- gseDO(geneList,
           minGSSize     = 120,
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",
           verbose       = FALSE)
head(y, 3)

                 ID                           Description setSize
DOID:612   DOID:612      primary immunodeficiency disease     234
DOID:2490 DOID:2490 congenital nervous system abnormality     114
DOID:720   DOID:720                     normocytic anemia     103
          enrichmentScore      NES       pvalue     p.adjust       qvalue rank
DOID:612        0.4497659 2.050031 5.370831e-10 9.774912e-08 4.180620e-09 2521
DOID:2490       0.4518077 1.890084 2.421751e-05 2.203794e-03 9.425379e-05 2245
DOID:720        0.4228019 1.733016 3.424473e-04 7.199264e-03 3.079045e-04 2075
                            leading_edge
DOID:612  tags=43%, list=20%, signal=35%
DOID:2490 tags=35%, list=18%, signal=29%
DOID:720  tags=33%, list=17%, signal=28%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    core_enrichment
DOID:612  55388/7153/9837/29851/9636/1503/1493/7037/4173/3932/3559/6772/51311/3507/3561/917/3574/3575/919/4860/915/22806/5693/4938/1535/3458/959/5336/11151/3702/925/4688/64135/28755/50615/974/1794/3689/5788/5424/916/7096/4068/3937/30009/5695/3394/10525/100/7374/3659/940/939/4689/5880/7128/6891/4210/6789/5699/930/6573/11322/204/6850/10095/7124/3569/7097/7852/8772/5692/64170/3119/1956/28985/1053/5971/1536/10125/8456/8625/3071/7293/4478/1380/958/5054/5591/9437/10379/54440/3570/3978/3593/10625/29927/3558/51371/735
DOID:2490                                                                                                                                                                                                                                                                                                1062/23397/259266/55165/4085/4001/9928/23007/9918/699/6491/84823/6496/8543/7277/8941/9585/81624/23310/84790/284403/7545/16/203068/10381/27341/10383/1020/10376/7159/5129/1641/833/54517/55835/81027/1119/84107/23019/55262
DOID:720                                                                                                                                                                                                                                                                                                                                            9493/29127/55215/5888/1493/2175/5551/55159/2178/2821/3458/959/64135/81624/942/119/55027/3783/7398/5230/83990/2539/2072/23586/2189/57817/4683/7124/79132/3119/51700/672/8328/286

8.2.2 `gseNCG` function

ncg <- gseNCG(geneList,
              pvalueCutoff  = 0.5,
              pAdjustMethod = "BH",
              verbose       = FALSE)
ncg <- setReadable(ncg, 'org.Hs.eg.db')
head(ncg, 3)

                                                           ID Description
pan-gynecological and breast     pan-gynecological and breast          NA
breast_fibroepithelial_tumours breast_fibroepithelial_tumours          NA
pan-gastric                                       pan-gastric          NA
                               setSize enrichmentScore       NES      pvalue
pan-gynecological and breast        43      -0.5263430 -1.688958 0.004833800
breast_fibroepithelial_tumours      17      -0.6421576 -1.688493 0.011047600
pan-gastric                         49      -0.4993804 -1.655705 0.006104571
                                p.adjust    qvalue rank
pan-gynecological and breast   0.2430472 0.1245364 2464
breast_fibroepithelial_tumours 0.2430472 0.1245364 2700
pan-gastric                    0.2430472 0.1245364 3280
                                                 leading_edge
pan-gynecological and breast   tags=40%, list=20%, signal=32%
breast_fibroepithelial_tumours tags=53%, list=22%, signal=42%
pan-gastric                    tags=49%, list=26%, signal=36%
                                                                                                                                                              core_enrichment
pan-gynecological and breast                                            NIPBL/SPOP/ARID1A/RASA1/RB1/RNF43/MAP2K4/NF1/CTNNB1/TP53/PIK3R1/CDKN1B/CCND1/ARID5B/MAP3K1/TBX3/GATA3
breast_fibroepithelial_tumours                                                                                               SETD2/RB1/PCNX4/NF1/TP53/RARA/SYNE1/MAP3K1/ERBB4
pan-gastric                    BCOR/SOX9/TCF7L2/ATM/CALD1/SEMG2/HTR7/ARID1A/RASA1/RB1/TTBK2/RNF43/CTNNB1/TP53/BCL9/SMAD3/APC/ZFP36L2/TGFBR2/MUC6/MAP3K1/CACNA1C/ATP8B1/CYP4B1

8.2.3 `gseDGN` function

dgn <- gseDGN(geneList,
              pvalueCutoff  = 0.2,
              pAdjustMethod = "BH",
              verbose       = FALSE)
dgn <- setReadable(dgn, 'org.Hs.eg.db')
head(dgn, 3)

               ID                  Description setSize enrichmentScore      NES
C0024266 C0024266 Lymphocytic Choriomeningitis     120       0.5712593 2.397615
C1136382 C1136382         Sclerocystic Ovaries     106       0.5770160 2.388502
C0151332 C0151332          Active tuberculosis      92       0.5924568 2.382139
               pvalue     p.adjust       qvalue rank
C0024266 2.001566e-11 1.774388e-07 9.435879e-08 2579
C1136382 3.375143e-10 9.941242e-07 5.286575e-07  610
C0151332 1.546603e-09 1.722779e-06 9.161431e-07 1687
                           leading_edge
C0024266 tags=48%, list=21%, signal=38%
C1136382  tags=22%, list=5%, signal=21%
C0151332 tags=35%, list=14%, signal=30%
                                                                                                                                                                                                                                                                                                                                 core_enrichment
C0024266 S100A9/CXCL10/CXCL9/EZH2/GZMB/ICOS/USP18/CXCL8/CTLA4/TREM1/PRF1/ADM/CA9/STAT1/CCL2/SELL/CDKN2A/IL7/IL7R/IFNG/CCR5/IL27RA/SH2D1A/FCER1G/CDK2AP2/CPVL/CD27/PSMB10/PTPN22/SLAMF1/KDM1A/TNF/IL6/FGL2/TLR2/RPAIN/NELFCD/PDCD1/WAS/HIF1A/ATP5F1B/FCGR2B/EGR2/STX11/CXCR3/TYROBP/YME1L1/SOSTDC1/PTPN2/TRAF1/HNF1A/IRF9/PML/NR0B2/IL2/TOX/AGFG1
C1136382                                                                                                                                                                                                     TOP2A/ASPM/RRM2/CEP55/PBK/MAD2L1/S100P/TTK/CCNB1/CKS2/HMMR/MYBL1/CDC6/RACGAP1/BCL11A/HELLS/TFRC/CCNE2/STAR/SRD5A1/TRPV6/RUNX3/CLDN4
C0151332                                                                                                                                                            CXCL10/ISG20/CAMP/GZMB/LAG3/CXCL8/ISG15/CD38/CTLA4/CCL5/IL2RA/TLR8/CCL2/MUC16/CCL4/SIGLEC1/NCAPG2/CCN6/IFNG/DTYMK/HMOX1/IL32/CCR5/CCL22/IL15/LBP/CD27/EBI3/CD14/TNF/IL6/TLR2

References

A., Omer, Giovanni M. D., Thanos P. M., and Francesca D. C. 2016. “NCG 5.0: Updates of a Manually Curated Repository of Cancer Genes and Associated Properties from Cancer Mutational Screenings.” Nucleic Acids Research 44 (D1): D992–99. https://doi.org/10.1093/nar/gkv1123.

Janet, P., Núria Q. R., Àlex B., Jordi D. P., Anna B. M., Martin B., Ferran S., and Laura I. F. 2015. “DisGeNET: A Discovery Platform for the Dynamical Exploration of Human Diseases and Their Genes.” Database 2015 (March): bav028. https://doi.org/10.1093/database/bav028.

Schriml, L. M., C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe. 2011. “Disease Ontology: A Backbone for Disease Semantic Integration.” Nucleic Acids Research 40 (D1): D940–46. https://doi.org/10.1093/nar/gkr972.

Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An r/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.