Semantic Similarity Analysis
Functional similarity of gene products can be estimated using controlled biological vocabularies, such as Gene Ontology (GO), Disease Ontology (DO) and Medical Subject Headings (MeSH).
Four methods, proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006), determine the semantic similarity of two GO terms based on the annotation statistics of their common ancestor terms. Wang (Wang et al. 2007) proposed a method that measures similarity based on the graph structure of GO. Each of these methods has its own advantages and weaknesses, and each can be applied to other ontologies with a similar structure (i.e. a directed acyclic graph).
Information content-based methods
The four methods proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006) are information content (IC) based: they depend on the frequencies of the two GO terms involved and on that of their closest common ancestor term in a specific corpus of GO annotations. The information content of a GO term is computed as the negative log probability of the term occurring in the GO corpus, so a rarely used term carries more information.
The frequency of a term t is defined as:
\[p(t) = \frac{\sum_{t'} n_{t'}}{N}, \quad t' \in \left\{t, \; descendants\: of\: t \right\}\]
where \(n_{t'}\) is the number of occurrences of term \(t'\) in the corpus, and \(N\) is the total number of term occurrences in the GO corpus.
Thus the information content is defined as:
\[IC(t) = -\log(p(t))\]
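The two definitions above can be illustrated with a minimal Python sketch. The ontology, the term names and the annotation counts are all hypothetical toy data, not real GO. The sketch also assumes that the frequency pools the counts of a term with those of every term below it in the graph, and that the logarithm is natural.

```python
import math

# Hypothetical toy ontology (not real GO): child -> list of parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # a term may have several parents
    "D": ["A"],
}

# Hypothetical number of direct annotations n_t for each term.
direct_counts = {"root": 0, "A": 2, "B": 3, "C": 4, "D": 1}

# N: total number of term occurrences in the toy corpus.
N = sum(direct_counts.values())


def descendants(term):
    """Return every term below `term` in the DAG (excluding `term`)."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, child_parents in parents.items():
            if current in child_parents and child not in found:
                found.add(child)
                stack.append(child)
    return found


def p(term):
    """Frequency of `term`: pooled counts of the term and its descendants over N."""
    covered = {term} | descendants(term)  # a set, so shared descendants count once
    return sum(direct_counts[t] for t in covered) / N


def ic(term):
    """Information content: negative natural log of the frequency."""
    return -math.log(p(term))


for term in parents:
    print(term, round(p(term), 3), round(ic(term), 3))
```

In this toy corpus the root has frequency 1 and therefore zero information content, while the rarely used leaf `D` is the most informative term.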
As GO allows multiple parents for each concept, two terms can share parents through multiple paths. IC-based methods calculate the similarity of two GO terms based on the information content of their closest common ancestor term, which is also called the most informative common ancestor (MICA).
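As a rough illustration of how the MICA might be located, the sketch below collects the ancestors of each term, intersects the two sets and keeps the common ancestor with the highest information content. The graph and the IC values are hypothetical, and treating a term as a member of its own ancestor set is an assumption of this sketch.

```python
# Hypothetical toy ontology (not real GO): child -> list of parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["A"],
}

# Hypothetical, precomputed information content for each term.
ic_values = {"root": 0.0, "A": 0.36, "B": 0.51, "C": 0.92, "D": 2.30}


def ancestors(term):
    """Return `term` together with all of its ancestors (assumed convention)."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(t1, t2):
    """Most informative common ancestor: the shared ancestor with the largest IC."""
    common = ancestors(t1) & ancestors(t2)
    return max(common, key=lambda t: ic_values[t])


print(mica("C", "D"))  # the common ancestors are A and root
print(mica("B", "D"))  # only the root is shared
```

Because terms can be reached through several paths, the intersection is taken over complete ancestor sets rather than along a single path.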
Resnik method
The Resnik method is defined as:
\[sim_{Resnik}(t_1,t_2) = IC(MICA)\]
Lin method
The Lin method is defined as:
\[sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}\]
Rel method
The Relevance method, which was proposed by Schlicker, combines Resnik’s and Lin’s methods and is defined as:
\[sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}\]
Jiang method
Jiang and Conrath’s method is defined as:
\[sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))\]
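The four IC-based measures differ only in how they combine three numbers: the information content of each term and that of their MICA. A minimal sketch follows; the function names are hypothetical, the natural logarithm is assumed, and returning 0 when both terms have zero information content is an assumption made here to avoid dividing by zero.

```python
import math


def sim_resnik(ic_mica):
    """Resnik: the information content of the MICA itself."""
    return ic_mica


def sim_lin(ic1, ic2, ic_mica):
    """Lin: MICA information content normalised by that of the two terms."""
    denominator = ic1 + ic2
    if denominator == 0:  # assumption of this sketch: both terms carry no information
        return 0.0
    return 2 * ic_mica / denominator


def sim_rel(ic1, ic2, ic_mica):
    """Rel (Schlicker): Lin's score weighted by 1 - p(MICA)."""
    p_mica = math.exp(-ic_mica)  # inverts IC = -log(p), natural log assumed
    return sim_lin(ic1, ic2, ic_mica) * (1 - p_mica)


def sim_jiang(ic1, ic2, ic_mica):
    """Jiang and Conrath: one minus the IC distance, capped at 1."""
    return 1 - min(1.0, ic1 + ic2 - 2 * ic_mica)


# Hypothetical frequencies: p(t1) = 0.4, p(t2) = 0.1, p(MICA) = 0.7.
ic1, ic2, ic_mica = -math.log(0.4), -math.log(0.1), -math.log(0.7)
print(sim_resnik(ic_mica))
print(sim_lin(ic1, ic2, ic_mica))
print(sim_rel(ic1, ic2, ic_mica))
print(sim_jiang(ic1, ic2, ic_mica))
```

Note that the Resnik score is unbounded above, whereas the Lin, Rel and Jiang scores fall between 0 and 1.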
Graph-based methods
Graph-based methods use the topology of the GO graph structure to compute semantic similarity. Formally, a GO term A can be represented as \(DAG_{A}=(A,T_{A},E_{A})\) where \(T_{A}\) is the set of GO terms in \(DAG_{A}\), including term A and all of its ancestor terms in the GO graph, and \(E_{A}\) is the set of edges connecting the GO terms in \(DAG_{A}\).
Wang method
To encode the semantics of a GO term in a measurable format that enables quantitative comparison, Wang (Wang et al. 2007) first defined the semantic value of term A as the aggregate contribution of all terms in \(DAG_{A}\) to the semantics of term A, where terms closer to term A in \(DAG_{A}\) contribute more to its semantics. Thus, the contribution of a GO term \(t\) to the semantics of GO term \(A\) is defined as the S-value of GO term \(t\) related to term \(A\).
For any term \(t\) in \(DAG_{A}\), its S-value related to term \(A\), \(S_{A}(\textit{t})\), is defined as:
\[\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A \end{array} \right.\]
where \(w_{e}\) is the semantic contribution factor for edge \(e \in E_{A}\) linking term \(t\) with its child term \(t'\). Term \(A\)’s contribution to itself is defined as 1. After obtaining the S-values for all terms in \(DAG_{A}\), the semantic value of GO term A, \(SV(A)\), is calculated as:
\[SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)\]
Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:
\[sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{\left(S_{A}(t) + S_{B}(t)\right)}}{SV(A) + SV(B)}\]
where \(S_{A}(\textit{t})\) is the S-value of GO term \(t\) related to term \(A\) and \(S_{B}(\textit{t})\) is the S-value of GO term \(t\) related to term \(B\).
This method proposed by Wang (Wang et al. 2007) determines the semantic similarity of two GO terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms.
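The S-value recursion, the semantic value and the resulting similarity can be sketched in a few lines of Python. The graph below is hypothetical toy data, and the edge weights (0.8 for `is_a`, 0.6 for `part_of`) are assumptions of this sketch, chosen because they are the values usually quoted for this method; the text above does not fix them.

```python
# Hypothetical toy ontology (not real GO): child -> {parent: relation type}.
edges = {
    "A": {"P1": "is_a", "P2": "part_of"},
    "B": {"P1": "is_a"},
    "P1": {"root": "is_a"},
    "P2": {"root": "is_a"},
    "root": {},
}

# Assumed semantic contribution factors w_e per relation type.
weights = {"is_a": 0.8, "part_of": 0.6}


def s_values(term):
    """S-value of every term in DAG_term, relative to `term`."""
    s = {term: 1.0}  # a term contributes 1 to itself
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, relation in edges[child].items():
            candidate = weights[relation] * s[child]
            # Keep the maximum contribution over all children of `parent`.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def semantic_value(term):
    """SV(term): the sum of the S-values over DAG_term."""
    return sum(s_values(term).values())


def sim_wang(a, b):
    """Shared S-values of both terms over the sum of their semantic values."""
    s_a, s_b = s_values(a), s_values(b)
    common = s_a.keys() & s_b.keys()
    numerator = sum(s_a[t] + s_b[t] for t in common)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))


print(s_values("A"))
print(round(sim_wang("A", "B"), 4))
```

Here `root` is reachable from `A` through both `P1` and `P2`; the maximum in the recursion keeps the larger contribution, which comes from the path through `P1`.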
Combine methods
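Since a gene product can be annotated by multiple GO terms, the semantic similarity between two gene products must be aggregated from the similarity scores of the GO terms annotated to them. Given two genes \(g_1\) and \(g_2\) annotated by the GO term sets \(\{go_{11}, \ldots, go_{1m}\}\) and \(\{go_{21}, \ldots, go_{2n}\}\), four aggregation methods are available: max, avg, rcmax and BMA.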
max
The max method calculates the maximum semantic similarity score over all pairs of GO terms between the two GO term sets.
\[sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})\]
avg
The avg method calculates the average semantic similarity score over all pairs of GO terms.
\[sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}\]
rcmax
The similarities between two sets of GO terms form a matrix. The rcmax method uses the maximum of RowScore and ColumnScore, where RowScore (or ColumnScore) is the average of the maximum similarities in each row (or column).
\[sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})\]
BMA
The BMA method uses the best-match average strategy: it calculates the average of all maximum similarities in each row and column, and is defined as:
\[sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{j=1}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}\]
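To make the four strategies concrete, the sketch below applies them to one small similarity matrix, with rows indexed by the GO terms of the first gene and columns by those of the second. The function name `combine` and the matrix values are hypothetical and chosen only for illustration.

```python
def combine(matrix, method):
    """Aggregate an m x n matrix of term similarities into a single score."""
    m, n = len(matrix), len(matrix[0])
    row_max = [max(row) for row in matrix]
    col_max = [max(matrix[i][j] for i in range(m)) for j in range(n)]
    if method == "max":
        return max(row_max)
    if method == "avg":
        return sum(sum(row) for row in matrix) / (m * n)
    if method == "rcmax":
        return max(sum(row_max) / m, sum(col_max) / n)
    if method == "BMA":
        return (sum(row_max) + sum(col_max)) / (m + n)
    raise ValueError(f"unknown method: {method}")


# Hypothetical similarities: gene 1 has m = 2 terms, gene 2 has n = 3 terms.
sim_matrix = [
    [1.0, 0.5, 0.0],
    [0.5, 0.25, 0.75],
]

for method in ("max", "avg", "rcmax", "BMA"):
    print(method, combine(sim_matrix, method))
```

For this matrix the row maxima are 1.0 and 0.75 and the column maxima are 1.0, 0.5 and 0.75. The max score is driven by a single best pair and the avg score is pulled down by the poorly matching pairs, while rcmax and BMA sit between the two.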
Summary
The idea behind semantic similarity measurement is the notion that genes with similar functions should have similar annotation vocabularies and be closely related in the ontology structure. Measuring similarity is critical for expanding knowledge, since similar objects tend to exhibit similar behaviors, which supports many bioinformatics applications to infer gene/protein functions, miRNA functions, genetic interactions, protein-protein interactions, miRNA-mRNA interactions, and cellular localization.
We developed several Bioconductor packages, including GOSemSim (Yu et al. 2010; Yu 2020) for computing semantic similarity among GO terms, sets of GO terms, gene products and gene clusters (see also Chapter 2), DOSE (Yu et al. 2015) for Disease Ontology (DO) (see also Chapter 3) and meshes (Yu 2018), which is based on Medical Subject Headings (MeSH) (see also Chapter 4).