1 GO semantic similarity analysis

<environment: namespace:Rcpp>

GOSemSim implemented all methods described in Chapter 1, including four IC-based methods and one graph-based method.

1.1 Semantic data

To measure semantic similarity, we need to prepare GO annotations including GO structure (i.e. GO term relationships) and gene to GO mapping. For IC-based methods, information of GO term is species specific. We need to calculate IC for all GO terms of a species before we measure semantic similarity.

GOSemSim provides the godata() function to prepare semantic data to support measuring GO and gene similarity. It internally uses the GO.db package to obtain GO structure and OrgDb for gene to GO mapping.

library(AnnotationDbi)
library(GOSemSim)
hsGO <- godata(annoDb = 'org.Hs.eg.db', ont="MF")

User can set computeIC=FALSE if they only want to use Wang’s method.

1.2 Supported organisms

GOSemSim supports all organisms that have an OrgDb object available.

Bioconductor have already provided OrgDb for about 20 species.

We can query OrgDb online via the AnnotationHub package. For example:

library(AnnotationHub)
hub <- AnnotationHub()
q <- query(hub, "Cricetulus")
id <- q$ah_id[length(q)]
Cgriseus <- hub[[id]]

If an organism is not supported by AnnotationHub, user can use the AnnotationForge package to build OrgDb manually.

Once we have OrgDb, we can build annotation data needed by GOSemSim via godata() function described previously.

1.3 GO semantic similarity measurement

The goSim() function calculates semantic similarity between two GO terms, while the mgoSim() function calculates semantic similarity between two sets of GO terms.

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")

[1] 0.25

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")

[1] 0.116

go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)

           GO:0009055 GO:0005515
GO:0004022      0.368      0.116
GO:0004024      0.335      0.107
GO:0004174      0.663      0.119

mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")

[1] 0.43

1.4 Gene semantic similarity measurement

On the basis of semantic similarity between GO terms, GOSemSim can also compute semantic similarity among sets of GO terms, gene products, and gene clusters.

Suppose we have gene \(g_1\) annotated by GO terms sets \(GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}\) and \(g_2\) annotated by \(GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}\), GOSemSim implemented four combining methods, including max, avg, rcmax, and BMA, to aggregate semantic similarity scores of multiple GO terms (see also Section 3). The similarities among gene products and gene clusters which are annotated by multiple GO terms are also calculated by these combining methods.

GOSemSim provides geneSim() to calculate semantic similarity between two gene products, and mgeneSim() to calculate semantic similarity among multiple gene products.

GOSemSim::geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")

$geneSim
[1] 0.149

$GO1
[1] "GO:0004364" "GO:0004464" "GO:0004602" "GO:0005515" "GO:0050544"

$GO2
[1] "GO:0004035"

mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Wang",verbose=FALSE)

       835  5261   241   994
835  1.000 0.576 0.515 0.671
5261 0.576 1.000 0.393 0.499
241  0.515 0.393 1.000 0.414
994  0.671 0.499 0.414 1.000

mgeneSim(genes=c("835", "5261","241", "994"),
       semData=hsGO, measure="Rel",verbose=FALSE)

       835  5261   241   994
835  0.937 0.448 0.322 0.555
5261 0.448 0.933 0.338 0.482
241  0.322 0.338 0.949 0.294
994  0.555 0.482 0.294 0.932

By default, godata function use ENTREZID as keytype, and the input ID type is ENTREZID. User can use other ID types such as ENSEMBL, UNIPROT, REFSEQ, ACCNUM, SYMBOL et al.

Here as an example, we use SYMBOL as keytype and calculate semantic similarities among several genes by using their gene symbol as input.

hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE) 
genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1")
mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)

      CDC45 MCM10 CDC20   NMU  MMP1
CDC45 1.000 0.916 0.475 0.592 0.169
MCM10 0.916 1.000 0.505 0.620 0.160
CDC20 0.475 0.505 1.000 0.555 0.162
NMU   0.592 0.620 0.555 1.000 0.150
MMP1  0.169 0.160 0.162 0.150 1.000

Users can also use clusterProfiler::bitr to translate biological IDs.

1.5 Gene cluster semantic similarity measurement

GOSemSim also supports calculating semantic similarity between two gene clusters using clusterSim() function and measuring semantic similarity among multiple gene clusters using mclusterSim() function.

gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")

[1] 0.588

library(org.Hs.eg.db)
x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed <- 123
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA")

      a     b     c
a 1.000 0.779 0.601
b 0.779 1.000 0.619
c 0.601 0.619 1.000

1.6 Cross-species gene similarity calculation

While GOSemSim primarily focuses on semantic similarity within a single species, researchers often need to calculate GO semantic similarity for orthologous gene pairs between different species. This is particularly relevant for comparative genomics studies between well-annotated species like human and mouse.

1.6.1 Current limitations and workarounds

By default, godata() function only accepts a single OrgDb object, meaning it supports semantic similarity calculation within one species. However, there are several approaches to address cross-species similarity calculation:

Graph-based methods: Use Wang’s method (graph-based) with mgoSim() function by manually providing GO term sets for genes from different species
Manual GO term mapping: Extract GO annotations for orthologous genes and use mgoSim() directly
Future development: A potential merge() function for combining GOSemSimDATA objects from multiple species

# Example workflow for cross-species similarity using mgoSim
# Extract GO terms for human gene using clusterProfiler
library(clusterProfiler)
human_go_terms <- bitr("human_gene_id", fromType = "ENTREZID", 
                       toType = "GO", OrgDb = org.Hs.eg.db)$GO

# Extract GO terms for mouse ortholog  
mouse_go_terms <- bitr("mouse_gene_id", fromType = "ENTREZID",
                       toType = "GO", OrgDb = org.Mm.eg.db)$GO

# Calculate similarity using Wang's method (graph-based)
similarity <- mgoSim(human_go_terms, mouse_go_terms, 
                    semData=hsGO, measure="Wang", combine="BMA")

For species not supported by Bioconductor OrgDb packages, users can: - Use AnnotationHub to query and retrieve organism-specific annotations - Build custom OrgDb objects using AnnotationForge::makeOrgPackage() or makeOrgPackageFromNCBI() - Perform manual GO annotation followed by semantic similarity calculation

1.7 Applications in functional genomics

GOSemSim has been widely applied in various research domains, with over 200 citations. One particularly valuable application is in prioritizing genes for experimental validation from large sets of differentially expressed genes.

1.7.1 Functional similarity for target prioritization

When faced with numerous differentially expressed genes, researchers can use GO semantic similarity to identify functionally related “hub” genes that may serve as better candidates for experimental validation. The approach involves:

Calculating pairwise semantic similarity among all differentially expressed genes
Constructing a similarity network where edges represent functional relationships
Identifying genes with high betweenness centrality as potential key regulators

1.7.2 Case study: FMNL1 interactome analysis

In a collaboration with Technische Universität München, we used GO semantic similarity to analyze proteins identified through Co-IP and LC-MS/MS experiments targeting FMNL1 in hematopoietic cells. The workflow was:

Similarity metric definition: Functional similarity was defined as the geometric mean of molecular function and cellular component GO semantic similarities
Network construction: Created a similarity network of all co-purified proteins
Hub identification: Identified AHNAK1, SIPA1, and FLII as central nodes in the network
Experimental validation: Validated the novel interaction between FMNL1 and AHNAK1, leading to discovery of FMNL1’s role in calcium-dependent membrane plasticity

This approach successfully identified previously unreported protein interactions that were subsequently validated experimentally, demonstrating the power of GO semantic similarity in guiding biological discovery.

1.7.3 Implementation considerations

When using semantic similarity for target prioritization: - Combine multiple GO aspects (MF, BP, CC) for comprehensive functional assessment - Consider both direct interactions and functional relationships - Use appropriate similarity measures based on biological context - Validate computational predictions with experimental evidence