1 Overview of semantic similarity analysis
Functional similarity of gene products can be estimated using controlled biological vocabularies, such as Gene Ontology (GO), Disease Ontology (DO) and Medical Subject Headings (MeSH).
Four methods, proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006), determine the semantic similarity of two GO terms based on the annotation statistics of their common ancestor terms. Wang (Wang et al. 2007) proposed a method that measures similarity based on the graph structure of GO. Each of these methods has its own advantages and weaknesses, and each can be applied to other ontologies that have a similar structure (i.e. a directed acyclic graph).
1.1 Information content-based methods
The four methods proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006) are information content (IC) based: they depend on the frequencies of the two GO terms involved and on that of their closest common ancestor term in a specific corpus of GO annotations. The information content of a GO term is computed as the negative log probability of the term occurring in the GO corpus, so a rarely used term carries a greater amount of information.
The frequency of a term t is defined as:
\[p(t) = \frac{n_{t'}}{N} | t' \in \left\{t, \; children\: of\: t \right\}\]
where \(n_{t'}\) is the number of occurrences of term \(t'\) in the GO corpus, and \(N\) is the total number of term occurrences in the corpus.
Thus the information content is defined as:
\[IC(t) = -\log(p(t))\]
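The two definitions above can be illustrated with a minimal Python sketch. The ontology, the term names and the annotation counts are all hypothetical, and the sketch assumes that "children of t" means every term below t in the graph, not only its direct children.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],  # a term may have several parents
    "d": ["a"],
}

# Hypothetical number of annotations made directly to each term.
DIRECT_COUNTS = {"root": 0, "a": 2, "b": 4, "c": 1, "d": 1}


def descendants(term):
    """Return the set containing `term` and every term below it."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parents in PARENTS.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def information_content(term):
    """IC(t) = -log(p(t)), where p(t) counts t and its descendants."""
    total = sum(DIRECT_COUNTS.values())
    n_t = sum(DIRECT_COUNTS[t] for t in descendants(term))
    return -math.log(n_t / total)
```

In this toy corpus the root covers every annotation, so its information content is zero, while the rarely used leaf terms "c" and "d" receive the highest values.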
Because GO allows multiple parents for each concept, two terms can share ancestors through multiple paths. IC-based methods calculate the similarity of two GO terms based on the information content of their closest common ancestor term, also called the most informative common ancestor (MICA).
1.1.2 Lin method
The Lin method is defined as:
\[sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}\]
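As a rough illustration of the formula, the sketch below computes the Lin similarity on a hypothetical five-term ontology. The IC values and ancestor sets are made up for the example. Returning 0 when both terms have zero information content is an assumption of this sketch, not something the formula specifies.

```python
import math

# Hypothetical precomputed information content for each term.
IC = {
    "root": 0.0,
    "a": math.log(2),
    "b": math.log(8 / 5),
    "c": math.log(8),
    "d": math.log(8),
}

# Hypothetical ancestor sets; each set includes the term itself.
ANCESTORS = {
    "root": {"root"},
    "a": {"a", "root"},
    "b": {"b", "root"},
    "c": {"c", "a", "b", "root"},
    "d": {"d", "a", "root"},
}


def mica_ic(t1, t2):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ANCESTORS[t1] & ANCESTORS[t2]
    return max(IC[t] for t in common)


def lin_similarity(t1, t2):
    """sim_Lin = 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denominator = IC[t1] + IC[t2]
    if denominator == 0:
        # Assumption: treat the undefined 0/0 case as no similarity.
        return 0.0
    return 2 * mica_ic(t1, t2) / denominator
```

Here "c" and "d" share the ancestor "a" as their MICA, whereas "b" and "d" only share the root and therefore score 0.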
1.2 Graph-based method
Graph-based methods use the topology of the GO graph structure to compute semantic similarity. Formally, a GO term A can be represented as \(DAG_{A}=(A,T_{A},E_{A})\) where \(T_{A}\) is the set of GO terms in \(DAG_{A}\), including term A and all of its ancestor terms in the GO graph, and \(E_{A}\) is the set of edges connecting the GO terms in \(DAG_{A}\).
1.2.1 Wang method
To encode the semantics of a GO term in a measurable format that enables quantitative comparison, Wang (Wang et al. 2007) first defined the semantic value of term A as the aggregate contribution of all terms in \(DAG_{A}\) to the semantics of term A; terms closer to term A in \(DAG_{A}\) contribute more to its semantics. The contribution of a GO term \(t\) to the semantics of GO term \(A\) is defined as the S-value of GO term \(t\) related to term \(A\).
For any term \(t\) in \(DAG_{A}\), its S-value related to term \(A\), \(S_{A}(\textit{t})\), is defined as:
\[\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A \end{array} \right.\]
where \(w_{e}\) is the semantic contribution factor for edge \(e \in E_{A}\) linking term \(t\) with its child term \(t'\). The contribution of term \(A\) to its own semantics is defined as 1. After obtaining the S-values for all terms in \(DAG_{A}\), the semantic value of GO term A, \(SV(A)\), is calculated as:
\[SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)\]
Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:
\[sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{S_{A}(t) + S_{B}(t)}}{SV(A) + SV(B)}\]
where \(S_{A}(\textit{t})\) is the S-value of GO term \(t\) related to term \(A\) and \(S_{B}(\textit{t})\) is the S-value of GO term \(t\) related to term \(B\).
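The S-value recursion and the resulting similarity can be sketched in Python as follows. The graph is a hypothetical toy example. The edge weights (0.8 for is_a and 0.6 for part_of) are assumptions made for illustration, since the text above does not fix the value of \(w_{e}\).

```python
# Hypothetical toy graph: child -> list of (parent, relation) edges.
EDGES = {
    "root": [],
    "a": [("root", "is_a")],
    "b": [("root", "is_a")],
    "c": [("a", "is_a"), ("b", "part_of")],
    "d": [("a", "is_a")],
}

# Assumed semantic contribution factors w_e, one per relation type.
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term):
    """Return {t: S_term(t)} for `term` and all of its ancestors."""
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, relation in EDGES[child]:
            candidate = WEIGHTS[relation] * values[child]
            # Keep the largest contribution over all paths and revisit
            # the parent whenever its S-value improves.
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)
    return values


def wang_similarity(term_a, term_b):
    """Shared S-values divided by SV(A) + SV(B)."""
    s_a, s_b = s_values(term_a), s_values(term_b)
    shared = s_a.keys() & s_b.keys()
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))
```

For term "c" the root is reachable through both "a" and "b"; the sketch keeps the larger contribution (through "a"), which matches the max in the S-value definition.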
This method proposed by Wang (Wang et al. 2007) determines the semantic similarity of two GO terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms.
1.3 Combine methods
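Since a gene product can be annotated by multiple GO terms, the semantic similarity between two gene products needs to be aggregated from the semantic similarity scores of the GO terms associated with them. Four combine methods are available: max, avg, rcmax and BMA. In the formulas below, gene product \(g_1\) is annotated by \(m\) GO terms \(go_{11}, \ldots, go_{1m}\) and gene product \(g_2\) by \(n\) GO terms \(go_{21}, \ldots, go_{2n}\).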
1.3.1 max
The max method calculates the maximum semantic similarity score over all pairs of GO terms between the two GO term sets.
\[sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})\]
1.3.2 avg
The avg method calculates the average semantic similarity score over all pairs of GO terms.
\[sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}\]
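The max and avg strategies can be sketched in a few lines of Python. The similarity matrix below is hypothetical: rows stand for the GO terms of the first gene product, columns for those of the second, and each entry is a term-level similarity score.

```python
# Hypothetical pairwise term similarities (2 terms x 3 terms).
SCORES = [
    [0.2, 0.9, 0.4],
    [0.5, 0.1, 0.6],
]


def combine_max(scores):
    """Largest similarity over all term pairs."""
    return max(max(row) for row in scores)


def combine_avg(scores):
    """Mean similarity over all m x n term pairs."""
    m, n = len(scores), len(scores[0])
    return sum(sum(row) for row in scores) / (m * n)
```

On this matrix max picks the single best pair (0.9), while avg is pulled down to about 0.45 by the many weakly related pairs.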
1.3.3 rcmax
The similarities between two sets of GO terms form a matrix. The rcmax method uses the maximum of RowScore and ColumnScore, where RowScore (or ColumnScore) is the average of the maximum similarities on each row (or column).
\[sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})\]
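A minimal Python sketch of rcmax, using a hypothetical similarity matrix whose rows and columns correspond to the GO terms of the two gene products:

```python
# Hypothetical pairwise term similarities (2 terms x 3 terms).
SCORES = [
    [0.2, 0.9, 0.4],
    [0.5, 0.1, 0.6],
]


def combine_rcmax(scores):
    """Maximum of the averaged row maxima and averaged column maxima."""
    row_score = sum(max(row) for row in scores) / len(scores)
    columns = list(zip(*scores))  # transpose to iterate over columns
    column_score = sum(max(col) for col in columns) / len(columns)
    return max(row_score, column_score)
```

Here the row maxima average to about 0.75 and the column maxima to about 0.67, so the row score is the one returned.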
1.3.4 BMA
The BMA method uses the Best-Match Average strategy: it calculates the average of all maximum similarities on each row and column, and is defined as:
\[sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{j=1}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}\]
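The Best-Match Average can be sketched in Python in the same way, again on a hypothetical similarity matrix:

```python
# Hypothetical pairwise term similarities (2 terms x 3 terms).
SCORES = [
    [0.2, 0.9, 0.4],
    [0.5, 0.1, 0.6],
]


def combine_bma(scores):
    """Sum of all row maxima and column maxima, divided by m + n."""
    m, n = len(scores), len(scores[0])
    row_best = sum(max(row) for row in scores)
    column_best = sum(max(col) for col in zip(*scores))
    return (row_best + column_best) / (m + n)
```

For this matrix the row maxima sum to 1.5 and the column maxima to 2.0, giving a score of about 0.7. Unlike rcmax, both directions of best matches contribute to the result.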
1.4 Summary
The idea behind semantic similarity measurement is that genes with similar functions should have similar annotation vocabulary and be closely related in the ontology structure. Measuring similarity is critical for expanding knowledge, since similar objects tend to behave similarly. This supports many bioinformatics applications that infer gene/protein function, miRNA function, genetic interaction, protein-protein interaction, miRNA-mRNA interaction and cellular localization.
We developed several Bioconductor packages: GOSemSim (Yu et al. 2010; Yu 2020) for computing semantic similarity among GO terms, sets of GO terms, gene products and gene clusters (see also Chapter 2), DOSE (Yu et al. 2015) for Disease Ontology (DO) (see also Chapter 3), and meshes (Yu 2018), which is based on Medical Subject Headings (MeSH) (see also Chapter 4).