2 GO semantic similarity analysis

GOSemSim implemented all methods described in Chapter 1, including four IC-based methods and one graph-based method.

2.1 Semantic data

To measure semantic similarity, we need to prepare GO annotations including GO structure (i.e. GO term relationships) and gene to GO mapping. For IC-based methods, information of GO term is species specific. We need to calculate IC for all GO terms of a species before we measure semantic similarity.

GOSemSim provides the godata() function to prepare semantic data to support measuring GO and gene simiarlity. It internally used the GO.db package to obtain GO strucuture and OrgDb for gene to GO mapping.

library(GOSemSim)
hsGO <- godata('org.Hs.eg.db', ont="MF")

User can set computeIC=FALSE if they only want to use Wang’s method.

2.2 Supported organisms

GOSemSim supports all organisms that have an OrgDb object available.

Bioconductor have already provided OrgDb for about 20 species.

We can query OrgDb online via the AnnotationHub package. For example:

library(AnnotationHub)
hub <- AnnotationHub()
q <- query(hub, "Cricetulus")
id <- q$ah_id[length(q)]
Cgriseus <- hub[[id]]

If an organism is not supported by AnnotationHub, user can use the AnnotationForge package to build OrgDb manually.

Once we have OrgDb, we can build annotation data needed by GOSemSim via godata() function described previously.

2.3 GO semantic similarity measurement

The goSim() function calculates semantic similarity between two GO terms, while the mgoSim() function calculates semantic similarity between two sets of GO terms.

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")

## [1] 0.158

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")

## [1] 0.116

go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)

##            GO:0009055 GO:0005515
## GO:0004022      0.368      0.116
## GO:0004024      0.335      0.107
## GO:0004174      0.663      0.119

mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")

## [1] 0.43

2.4 Gene semantic similarity measurement

On the basis of semantic similarity between GO terms, GOSemSim can also compute semantic similarity among sets of GO terms, gene products, and gene clusters.

Suppose we have gene \(g_1\) annotated by GO terms sets \(GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}\) and \(g_2\) annotated by \(GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}\), GOSemSim implemented four combine methods, including max, avg, rcmax, and BMA, to aggregate semantic similarity scores of multiple GO terms (see also session 1.3). The similarities among gene products and gene clusters which annotated by multiple GO terms are also calculated by the these combine methods.

GOSemSim provides geneSim() to calculate semantic similarity between two gene products, and mgeneSim() to calculate semantic similarity among multiple gene products.

GOSemSim::geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")

## $geneSim
## [1] 0.149
## 
## $GO1
## [1] "GO:0004364" "GO:0004464" "GO:0004602" "GO:0005515" "GO:0047485"
## [6] "GO:0050544"
## 
## $GO2
## [1] "GO:0004035"

mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Wang",verbose=FALSE)

##        835  5261   241   994
## 835  1.000 0.478 0.451 0.578
## 5261 0.478 1.000 0.433 0.499
## 241  0.451 0.433 1.000 0.452
## 994  0.578 0.499 0.452 1.000

mgeneSim(genes=c("835", "5261","241", "994"),
       semData=hsGO, measure="Rel",verbose=FALSE)

##        835  5261   241   994
## 835  0.947 0.400 0.255 0.488
## 5261 0.400 0.908 0.325 0.462
## 241  0.255 0.325 0.941 0.273
## 994  0.488 0.462 0.273 0.907

By default, godata function use ENTREZID as keytype, and the input ID type is ENTREZID. User can use other ID types such as ENSEMBL, UNIPROT, REFSEQ, ACCNUM, SYMBOL et al.

Here as an example, we use SYMBOL as keytype and calculate semantic similarities among several genes by using their gene symbol as input.

hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE) 
genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1")
mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)

##       CDC45 MCM10 CDC20   NMU  MMP1
## CDC45 1.000 0.916 0.578 0.592 0.117
## MCM10 0.916 1.000 0.605 0.620 0.113
## CDC20 0.578 0.605 1.000 0.640 0.121
## NMU   0.592 0.620 0.640 1.000 0.108
## MMP1  0.117 0.113 0.121 0.108 1.000

Users can also use clusterProfiler::bitr to translate biological IDs.

2.5 Gene cluster semantic similarity measurement

GOSemSim also supports calculating semantic similarity between two gene clusters using clusterSim() function and measuring semantic similarity among multiple gene clusters using mclusterSim() function.

gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")

## [1] 0.621

library(org.Hs.eg.db)
x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed <- 123
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA")

##       a     b     c
## a 1.000 0.687 0.752
## b 0.687 1.000 0.797
## c 0.752 0.797 1.000

1 Overview of semantic similarity analysis

3 DO semantic similarity analysis