2 GO semantic similarity analysis
GOSemSim implemented all methods described in Chapter 1, including four IC-based methods and one graph-based method.
2.1 Semantic data
To measure semantic similarity, we need to prepare GO annotations including GO structure (i.e. GO term relationships) and gene to GO mapping. For IC-based methods, information of GO term is species specific. We need to calculate IC
for all GO terms of a species before we measure semantic similarity.
GOSemSim provides the godata()
function to prepare semantic data to support measuring GO and gene simiarlity. It internally used the GO.db package to obtain GO strucuture and OrgDb
for gene to GO mapping.
User can set computeIC=FALSE
if they only want to use Wang’s method.
2.2 Supported organisms
GOSemSim supports all organisms that have an OrgDb
object available.
Bioconductor have already provided OrgDb
for about 20 species.
We can query OrgDb
online via the AnnotationHub package. For example:
library(AnnotationHub)
hub <- AnnotationHub()
q <- query(hub, "Cricetulus")
id <- q$ah_id[length(q)]
Cgriseus <- hub[[id]]
If an organism is not supported by AnnotationHub, user can use the AnnotationForge package to build OrgDb
manually.
Once we have OrgDb
, we can build annotation data needed by GOSemSim via godata()
function described previously.
2.3 GO semantic similarity measurement
The goSim()
function calculates semantic similarity between two GO terms, while the mgoSim()
function calculates semantic similarity between two sets of GO terms.
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
## [1] 0.158
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
## [1] 0.116
go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
## GO:0009055 GO:0005515
## GO:0004022 0.368 0.116
## GO:0004024 0.335 0.107
## GO:0004174 0.663 0.119
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.43
2.4 Gene semantic similarity measurement
On the basis of semantic similarity between GO terms, GOSemSim can also compute semantic similarity among sets of GO terms, gene products, and gene clusters.
Suppose we have gene \(g_1\) annotated by GO terms sets \(GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}\) and \(g_2\) annotated by \(GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}\), GOSemSim implemented four combine methods, including max, avg, rcmax, and BMA, to aggregate semantic similarity scores of multiple GO terms (see also session 1.3). The similarities among gene products and gene clusters which annotated by multiple GO terms are also calculated by the these combine methods.
GOSemSim provides geneSim()
to calculate semantic similarity between two gene products, and mgeneSim()
to calculate semantic similarity among multiple gene products.
GOSemSim::geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")
## $geneSim
## [1] 0.149
##
## $GO1
## [1] "GO:0004364" "GO:0004464" "GO:0004602" "GO:0005515" "GO:0047485"
## [6] "GO:0050544"
##
## $GO2
## [1] "GO:0004035"
## 835 5261 241 994
## 835 1.000 0.478 0.451 0.578
## 5261 0.478 1.000 0.433 0.499
## 241 0.451 0.433 1.000 0.452
## 994 0.578 0.499 0.452 1.000
## 835 5261 241 994
## 835 0.947 0.400 0.255 0.488
## 5261 0.400 0.908 0.325 0.462
## 241 0.255 0.325 0.941 0.273
## 994 0.488 0.462 0.273 0.907
By default, godata
function use ENTREZID
as keytype, and the input ID type is ENTREZID
. User can use other ID types such as ENSEMBL
, UNIPROT
, REFSEQ
, ACCNUM
, SYMBOL
et al.
Here as an example, we use SYMBOL
as keytype
and calculate semantic similarities among several genes by using their gene symbol as input.
hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE)
genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1")
mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)
## CDC45 MCM10 CDC20 NMU MMP1
## CDC45 1.000 0.916 0.578 0.592 0.117
## MCM10 0.916 1.000 0.605 0.620 0.113
## CDC20 0.578 0.605 1.000 0.640 0.121
## NMU 0.592 0.620 0.640 1.000 0.108
## MMP1 0.117 0.113 0.121 0.108 1.000
Users can also use clusterProfiler::bitr
to translate biological IDs.
2.5 Gene cluster semantic similarity measurement
GOSemSim also supports calculating semantic similarity between two gene clusters using clusterSim()
function and measuring semantic similarity among multiple gene clusters using mclusterSim()
function.
gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.621
library(org.Hs.eg.db)
x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed <- 123
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA")
## a b c
## a 1.000 0.687 0.752
## b 0.687 1.000 0.797
## c 0.752 0.797 1.000