1 Overview of semantic similarity analysis

Functional similarity of gene products can be estimated using controlled biological vocabularies, such as Gene Ontology (GO), Disease Ontology (DO) and Medical Subject Headings (MeSH).

Four methods, proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006), determine the semantic similarity of two GO terms based on the annotation statistics of their common ancestor terms. Wang (Wang et al. 2007) proposed a method that measures similarity based on the graph structure of GO. Each of these methods has its own advantages and weaknesses, and each can be applied to other ontologies that have a similar structure (i.e. a directed acyclic graph).

1.1 Information content-based methods

The four methods proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006) are information content (IC) based: they depend on the frequencies of the two GO terms involved and that of their closest common ancestor term in a specific corpus of GO annotations. The information content of a GO term is computed as the negative log probability of the term occurring in the GO corpus, so a rarely used term contains a greater amount of information.

The frequency of a term t is defined as:

\[p(t) = \frac{n_{t'}}{N} | t' \in \left\{t, \; children\: of\: t \right\}\]

where \(n_{t'}\) is the number of occurrences of term \(t'\), and \(N\) is the total number of terms in the GO corpus.

Thus the information content is defined as:

\[IC(t) = -\log(p(t))\]
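The two definitions above can be illustrated with a minimal sketch. The toy ontology `PARENTS`, the counts in `DIRECT_COUNTS` and both function names are hypothetical and exist only for this example. The sketch also makes two assumptions that the text does not settle: "children of \(t\)" is read as all descendants of \(t\), and the logarithm is the natural logarithm.

```python
import math

# Hypothetical toy ontology: each term maps to its list of parents (a small DAG).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
}

# Hypothetical number of direct annotations per term in the corpus.
DIRECT_COUNTS = {"root": 0, "a": 2, "b": 3, "c": 5}


def descendants(term, parents):
    """Return the term itself plus every term that has it as an ancestor."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, child_parents in parents.items():
            if child not in result and any(p in result for p in child_parents):
                result.add(child)
                changed = True
    return result


def information_content(term, parents, counts):
    """IC(t) = -log(p(t)), with p(t) counted over t and its descendants.

    Assumption: every term has at least one annotation among itself and
    its descendants; otherwise p(t) would be 0 and the log undefined.
    """
    total = sum(counts.values())
    n = sum(counts[t] for t in descendants(term, parents))
    return -math.log(n / total)
```

In this toy corpus the root covers every annotation, so its probability is 1 and its IC is 0, while the more specific term `c` has a higher IC than either of its parents.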

As GO allows multiple parents for each concept, two terms can share parents through multiple paths. IC-based methods calculate the similarity of two GO terms based on the information content of their closest common ancestor term, also called the most informative common ancestor (MICA).
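As a rough illustration of how a MICA might be located, the sketch below collects the ancestors of both terms and picks the shared one with the highest IC. The toy graph `PARENTS`, the IC values in `IC` and the function names are hypothetical, and each term is assumed to count as its own ancestor.

```python
# Hypothetical toy ontology: each term maps to its list of parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a"],
}

# Hypothetical, precomputed information content for each term.
IC = {"root": 0.0, "a": 0.4, "b": 0.2, "c": 0.9, "d": 1.2}


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(t1, t2, parents, ic):
    """Most informative common ancestor: the shared ancestor with maximal IC.

    Assumes the two terms share at least one ancestor (e.g. a common root).
    """
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(common, key=lambda t: ic[t])
```

Here `c` and `d` share the ancestors `a` and `root`, and `a` is selected because it is the more informative of the two.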

1.1.1 Resnik method

The Resnik method is defined as:

\[sim_{Resnik}(t_1,t_2) = IC(MICA)\]

1.1.2 Lin method

The Lin method is defined as:

\[sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}\]

1.1.3 Rel method

The Relevance method, proposed by Schlicker, combines Resnik’s and Lin’s methods and is defined as:

\[sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}\]

1.1.4 Jiang method

Jiang and Conrath’s method is defined as:

\[sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))\]
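The four IC-based formulas in this section translate almost directly into code. The sketch below is illustrative only: the function names are hypothetical, the IC values are assumed to be precomputed with the natural logarithm (so that \(p(MICA) = e^{-IC(MICA)}\)), and the case \(IC(t_1) + IC(t_2) = 0\) is not handled.

```python
import math


def sim_resnik(ic_mica):
    """Resnik: the similarity is the IC of the MICA itself."""
    return ic_mica


def sim_lin(ic1, ic2, ic_mica):
    """Lin: IC of the MICA normalised by the IC of both terms."""
    return 2 * ic_mica / (ic1 + ic2)


def sim_rel(ic1, ic2, ic_mica):
    """Rel (Schlicker): Lin's score weighted by 1 - p(MICA)."""
    p_mica = math.exp(-ic_mica)  # inverts IC = -log(p), natural log assumed
    return 2 * ic_mica * (1 - p_mica) / (ic1 + ic2)


def sim_jiang(ic1, ic2, ic_mica):
    """Jiang and Conrath: one minus the IC distance, capped at 1."""
    return 1 - min(1, ic1 + ic2 - 2 * ic_mica)
```

Comparing a term with itself shows how the measures differ: if the term has \(p = 0.5\), Lin and Jiang both give 1, Rel gives 0.5 because of the \(1 - p(MICA)\) weight, and Resnik returns the IC of the term, which is not bounded by 1.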

1.2 Graph-based method

Graph-based methods use the topology of the GO graph structure to compute semantic similarity. Formally, a GO term A can be represented as \(DAG_{A}=(A,T_{A},E_{A})\) where \(T_{A}\) is the set of GO terms in \(DAG_{A}\), including term A and all of its ancestor terms in the GO graph, and \(E_{A}\) is the set of edges connecting the GO terms in \(DAG_{A}\).

1.2.1 Wang method

To encode the semantics of a GO term in a measurable format that enables quantitative comparison, Wang (Wang et al. 2007) first defined the semantic value of term A as the aggregate contribution of all terms in \(DAG_{A}\) to the semantics of term A, where terms closer to term A in \(DAG_{A}\) contribute more to its semantics. The contribution of a GO term \(t\) to the semantics of GO term \(A\) is thus defined as the S-value of GO term \(t\) related to term \(A\).

For any term \(t\) in \(DAG_{A}\), its S-value related to term \(A\), \(S_{A}(\textit{t})\), is defined as:

\[\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A \end{array} \right.\]

where \(w_{e}\) is the semantic contribution factor for edge \(e \in E_{A}\) linking term \(t\) with its child term \(t'\). The contribution of term \(A\) to its own semantics is defined as 1. After obtaining the S-values for all terms in \(DAG_{A}\), the semantic value of GO term A, \(SV(A)\), is calculated as:

\[SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)\]

Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:

\[sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{\left(S_{A}(t) + S_{B}(t)\right)}}{SV(A) + SV(B)}\]

where \(S_{A}(\textit{t})\) is the S-value of GO term \(t\) related to term \(A\) and \(S_{B}(\textit{t})\) is the S-value of GO term \(t\) related to term \(B\).

This method proposed by Wang (Wang et al. 2007) determines the semantic similarity of two GO terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms.
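The S-value recursion and the resulting similarity can be sketched on a small graph. Everything concrete here is an assumption made for illustration: the toy graph `EDGES`, the function names, and in particular the edge weights \(w_e\), whose values are not given in the text above.

```python
# Hypothetical toy DAG: each term maps to {parent: weight w_e of that edge}.
# The weights are illustrative values, not taken from the text.
EDGES = {
    "root": {},
    "a": {"root": 0.8},
    "b": {"root": 0.6},
    "c": {"a": 0.8, "b": 0.8},
    "d": {"a": 0.8},
}


def s_values(term, edges):
    """S-values of every term in DAG_term relative to `term`.

    Starts from S(term) = 1 and pushes w_e * S(child) up to each parent,
    keeping the maximum over all children as in the recursive definition.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, weight in edges[child].items():
            candidate = weight * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # re-propagate the improved value
    return s


def sim_wang(a, b, edges):
    """Sum of S-values over shared terms, divided by SV(a) + SV(b)."""
    sa, sb = s_values(a, edges), s_values(b, edges)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))
```

In this toy graph `root` can be reached from `c` through either `a` or `b`. The path through `a` yields the larger product, so that value is kept as \(S_{c}(root)\). A term compared with itself scores 1, since every term in its DAG is shared.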

1.3 Combine methods

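Since a gene product can be annotated by multiple GO terms, the semantic similarity between gene products needs to be aggregated from the semantic similarity scores of the GO terms associated with them. Given gene \(g_1\) annotated by the GO terms \(go_{11}, \ldots, go_{1m}\) and gene \(g_2\) annotated by \(go_{21}, \ldots, go_{2n}\), four methods are available for combining the pairwise scores: max, avg, rcmax and BMA.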

1.3.1 max

The max method calculates the maximum semantic similarity score over all pairs of GO terms between the two GO term sets.

\[sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})\]

1.3.2 avg

The avg method calculates the average semantic similarity score over all pairs of GO terms.

\[sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}\]

1.3.3 rcmax

The similarities between two sets of GO terms form a matrix. The rcmax method uses the maximum of RowScore and ColumnScore, where RowScore (or ColumnScore) is the average of the maximum similarities on each row (or column).

\[sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})\]

1.3.4 BMA

The BMA method, which uses the best-match average strategy, calculates the average of all maximum similarities on each row and column, and is defined as:

\[sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{j=1}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}\]
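The four combine methods of this section differ only in how they reduce the \(m \times n\) matrix of term-to-term similarities to a single score. The following sketch makes that explicit. The function name `combine` and the example matrix are hypothetical, and the matrix is assumed to be a non-empty list of equal-length rows.

```python
def combine(scores, method):
    """Aggregate an m x n matrix of term-term similarities into one score.

    Rows correspond to the GO terms of the first gene, columns to those
    of the second gene.
    """
    m, n = len(scores), len(scores[0])
    row_max = [max(row) for row in scores]        # best match for each row
    col_max = [max(col) for col in zip(*scores)]  # best match for each column
    if method == "max":
        return max(row_max)
    if method == "avg":
        return sum(map(sum, scores)) / (m * n)
    if method == "rcmax":
        return max(sum(row_max) / m, sum(col_max) / n)
    if method == "BMA":
        return (sum(row_max) + sum(col_max)) / (m + n)
    raise ValueError(f"unknown method: {method}")


# Hypothetical similarity matrix for a gene with 2 terms and one with 3.
SCORES = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
]
```

For `SCORES`, both rows have a perfect match, so max and rcmax both return 1, while avg is pulled down by the weak pairs and BMA sits between the two because the third column has no strong match.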

1.4 Summary

The idea behind semantic similarity measurement is that genes with similar functions should have similar annotation vocabulary and a close relationship in the ontology structure. Measuring similarity is critical for expanding knowledge, since similar objects tend to behave similarly. This supports many bioinformatics applications that infer gene/protein function, miRNA function, genetic interaction, protein-protein interaction, miRNA-mRNA interaction and cellular localization.

We developed several Bioconductor packages, including GOSemSim (Yu et al. 2010; Yu 2020) for computing semantic similarity among GO terms, sets of GO terms, gene products and gene clusters (see also Chapter 2), DOSE (Yu et al. 2015) for Disease Ontology (DO) (see also Chapter 3) and meshes (Yu 2018), which is based on Medical Subject Headings (MeSH) (see also Chapter 4).