3 DO semantic similarity analysis

Public health is an important driving force behind biological and medical research. A major challenge of the post-genomic era is bridging the gap between fundamental biological research and its clinical applications. Recent research has increasingly demonstrated that many seemingly dissimilar diseases have common molecular mechanisms. Understanding similarities among disease aids in early diagnosis and new drug development.

Formal knowledge representation of gene-disease association is demanded for this purpose. Ontologies, such as Gene Ontology (GO), have been successfully applied to represent biological knowledge, and many related techniques have been adopted to extract information. Disease Ontology (DO) (Schriml et al. 2011) was developed to create a consistent description of gene products with disease perspectives, and is essential for supporting functional genomics in disease context. Accurate disease descriptions can discover new relationships between genes and disease, and new functions for previous uncharacteried genes and alleles.

Disease Ontology (DO) (Schriml et al. 2011) aims to provide an open source ontology for the integration of biomedical data that is associated with human disease. Unlike other clinical vocabularies that defined disease related concepts disparately, DO is organized as a directed acyclic graph, laying the foundation for quantitative computation of disease knowledge.

We developed an R package DOSE (Yu et al. 2015) for analyzing semantic similarities among DO terms and gene products annotated with DO terms.

3.1 DO term semantic similarity measurement

Four methods determine the semantic similarity of two terms based on the Information Content of their common ancestor term were proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006). Wang (Wang et al. 2007) presented a method to measure the similarity based on the graph structure. Each of these methods has its own advantage and weakness. DOSE supports all these methods to compute semantic similarity among DO terms and gene products. For algorithm details, please refer to chapter 1.

In DOSE, we implemented doSim for calculating semantic similarity between two DO terms and two set of DO terms.

library(DOSE)

a <- c("DOID:14095", "DOID:5844", "DOID:2044", "DOID:8432", "DOID:9146",
       "DOID:10588", "DOID:3209", "DOID:848", "DOID:3341", "DOID:252")
b <- c("DOID:9409", "DOID:2491", "DOID:4467", "DOID:3498", "DOID:11256")
doSim(a[1], b[1], measure="Wang")
## [1] 0.07142995
doSim(a[1], b[1], measure="Resnik")
## [1] 0
doSim(a[1], b[1], measure="Lin")
## [1] 0
s <- doSim(a, b, measure="Wang")
s
##             DOID:9409  DOID:2491  DOID:4467  DOID:3498 DOID:11256
## DOID:14095 0.07142995 0.05714393 0.03676194 0.03676194 0.52749870
## DOID:5844  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:2044  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:8432  0.17347273 0.13877811 0.03676194 0.03676194 0.07142995
## DOID:9146  0.07142995 0.05714393 0.03676194 0.03676194 0.17347273
## DOID:10588 0.13240905 0.18401515 0.02208240 0.02208240 0.05452137
## DOID:3209  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:848   0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:3341  0.13240905 0.09998997 0.02208240 0.02208240 0.05452137
## DOID:252   0.06134327 0.04761992 0.02801328 0.02801328 0.06134327

The doSim function requires three parameter DOID1, DOID2 and measure. DOID1 and DOID2 should be a vector of DO terms, while measure should be one of Resnik, Jiang, Lin, Rel, and Wang.

We also implement a plot function simplot() to visualize the similarity result as demonstrated in Figure 3.1.

simplot(s,
        color.low="white", color.high="red",
        labs=TRUE, digits=2, labs.size=5,
        font.size=14, xlab="", ylab="")
Similarity matrix visualization. The simplot() function visualize similarity matrix as a heatmap.

Figure 3.1: Similarity matrix visualization. The simplot() function visualize similarity matrix as a heatmap.

Parameter color.low and colow.high are used to setting the color gradient; labs is a logical parameter indicating whether to show the similarity values or not, digits to indicate the number of decimal places to be used and labs.size control the font size of similarity values; font.size setting the font size of axis and label of the coordinate system.

3.2 Gene semantic similarity measurement

On the basis of semantic similarity between DO terms, DOSE can also compute semantic similarity among gene products. DOSE provides four methods which called max, avg, rcmax and BMA to combine semantic similarity scores of multiple DO terms (as described in session 1.3). The similarities among genes and gene clusters which annotated by multiple DO terms were also calculated by these combine methods.

DOSE implemented geneSim to measure semantic similarities among genes.

g1 <- c("84842", "2524", "10590", "3070", "91746")
g2 <- c("84289", "6045", "56999", "9869")

DOSE::geneSim(g1[1], g2[1], measure="Wang", combine="BMA")
## [1] 0.051
gs <- DOSE::geneSim(g1, g2, measure="Wang", combine="BMA")
gs
##       84289  6045 56999  9869
## 84842 0.051 0.135 0.355 0.103
## 2524  0.284 0.172 0.517 0.517
## 10590 0.150 0.173 0.242 0.262
## 3070  0.573 0.517 1.000 1.000
## 91746 0.351 0.308 0.527 0.496

The geneSim requires four parameter geneID1, geneID2, measure and combine. geneID1 and geneID2 should be a vector of entrez gene IDs; measure should be one of Resnik, Jiang, Lin, Rel, and Wang, while combine should be one of max, avg, rcmax and BMA.

3.3 Gene cluster semantic similarity measurement

DOSE also implemented clusterSim for calculating semantic similarity between two gene clusters and mclusterSim for calculating semantic similarities among multiple gene clusters.

DOSE::clusterSim(g1, g2, measure="Wang", combine="BMA")
## [1] 0.549
g3 <- c("57491", "6296", "51438", "5504", "27319", "1643")
clusters <- list(a=g1, b=g2, c=g3)
DOSE::mclusterSim(clusters, measure="Wang", combine="BMA")
##       a     b     c
## a 1.000 0.549 0.425
## b 0.549 1.000 0.645
## c 0.425 0.645 1.000