Semantic Similarity Analysis

Functional similarity of gene products can be estimated using controlled biological vocabularies such as Gene Ontology (GO), Disease Ontology (DO) and Medical Subject Headings (MeSH).

Four methods, proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006), determine the semantic similarity of two GO terms from the annotation statistics of their common ancestor terms. Wang (Wang et al. 2007) proposed a method that measures similarity based on the graph structure of GO. Each of these methods has its own strengths and weaknesses, and each can be applied to other ontologies with a similar structure (i.e. a directed acyclic graph).

Information content-based methods

The four methods proposed by Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006) are information content (IC) based: they depend on the frequencies of the two GO terms involved and on that of their closest common ancestor term in a specific corpus of GO annotations. The information content of a GO term is computed as the negative log probability of the term occurring in the GO corpus. A rarely used term therefore carries more information.

The frequency of a term t is defined as:

\[p(t) = \frac{\displaystyle\sum_{t'} n_{t'}}{N}, \quad t' \in \left\{t, \; children\: of\: t \right\}\]

where \(n_{t'}\) is the number of occurrences of term \(t'\) in the corpus, and \(N\) is the total number of term occurrences in the GO corpus.

Thus the information content is defined as:

\[IC(t) = -\log(p(t))\]
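As a rough illustration of the two definitions above, the following sketch computes \(p(t)\) and \(IC(t)\) on a small made-up ontology. The term names, the `children` map and the `direct_counts` values are all hypothetical, and "children of t" is read here as every descendant of t, which is an assumption about the definition rather than something the text states.

```python
import math

# Hypothetical toy ontology: each term maps to its direct children.
children = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": [],
}

# Hypothetical number of annotations made directly to each term.
direct_counts = {"root": 0, "a": 2, "b": 1, "c": 4, "d": 1}


def descendants(term):
    """Return the term itself plus every term reachable through child links."""
    seen = {term}
    stack = [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def frequency(term):
    """p(t): occurrences of t and its descendants divided by all occurrences."""
    total = sum(direct_counts.values())
    return sum(direct_counts[t] for t in descendants(term)) / total


def information_content(term):
    """IC(t) = -log(p(t)); terms with no occurrences would need special handling."""
    return -math.log(frequency(term))


for term in children:
    print(term, frequency(term), information_content(term))
```

In this toy corpus the root term has frequency 1 and therefore carries no information, while the rarely used leaf `d` has the highest information content.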

As GO allows multiple parents for each concept, two terms can share parents through multiple paths. IC-based methods calculate the similarity of two GO terms based on the information content of their closest common ancestor term, which is also called the most informative common ancestor (MICA).
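The search for the MICA can be sketched as follows. This is a minimal illustration on a hypothetical five-term ontology with made-up IC values; the `parents` map, the `ic` table and the choice to count a term as its own ancestor are assumptions made for the example.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],  # a term may have several parents
    "d": ["b"],
}

# Hypothetical, precomputed information content for each term.
ic = {"root": 0.0, "a": 0.29, "b": 0.29, "c": 0.69, "d": 2.08}


def ancestors(term):
    """Return the term itself plus all terms reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(t1, t2):
    """Pick the common ancestor with the highest information content."""
    common = ancestors(t1) & ancestors(t2)
    # sorted() only makes the choice deterministic when IC values tie.
    return max(sorted(common), key=lambda t: ic[t])


print(mica("c", "d"))
print(mica("a", "d"))
```

Because every path to the root is followed, the multiple-parent case needs no special treatment: all shared ancestors are collected first, and only then is the most informative one selected.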

Resnik method

The Resnik method is defined as:

\[sim_{Resnik}(t_1,t_2) = IC(MICA)\]

Lin method

The Lin method is defined as:

\[sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}\]

Rel method

The Relevance method, which was proposed by Schlicker, combines Resnik’s and Lin’s methods and is defined as:

\[sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}\]

Jiang method

The Jiang and Conrath method is defined as:

\[sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))\]
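The four IC-based formulas above can be compared side by side with a short sketch. The function names and the example probabilities are hypothetical; each function takes the corpus frequencies of the two terms and of their MICA, and the natural logarithm is assumed for IC.

```python
import math


def ic(p):
    """Information content of a term with corpus frequency p."""
    return -math.log(p)


def sim_resnik(p1, p2, p_mica):
    # Depends only on the MICA; p1 and p2 are kept for a uniform signature.
    return ic(p_mica)


def sim_lin(p1, p2, p_mica):
    # Undefined (division by zero) when both terms have frequency 1.
    return 2 * ic(p_mica) / (ic(p1) + ic(p2))


def sim_rel(p1, p2, p_mica):
    # Lin's score weighted by how specific the MICA is.
    return sim_lin(p1, p2, p_mica) * (1 - p_mica)


def sim_jiang(p1, p2, p_mica):
    # The distance is capped at 1 so the score stays within [0, 1].
    return 1 - min(1, ic(p1) + ic(p2) - 2 * ic(p_mica))


# Hypothetical frequencies for two terms and their MICA.
p1, p2, p_mica = 0.5, 0.125, 0.75
for f in (sim_resnik, sim_lin, sim_rel, sim_jiang):
    print(f.__name__, f(p1, p2, p_mica))
```

Comparing a term with itself (so that the term is its own MICA) gives 1 for the Lin and Jiang scores, whereas the Resnik score equals the term's IC and the Rel score equals \(1 - p(t)\), which shows that only some of these measures are normalised to a fixed upper bound.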

Graph-based methods

Graph-based methods use the topology of the GO graph structure to compute semantic similarity. Formally, a GO term A can be represented as \(DAG_{A}=(A,T_{A},E_{A})\) where \(T_{A}\) is the set of GO terms in \(DAG_{A}\), including term A and all of its ancestor terms in the GO graph, and \(E_{A}\) is the set of edges connecting the GO terms in \(DAG_{A}\).

Wang method

To encode the semantics of a GO term in a measurable format that enables quantitative comparison, Wang (Wang et al. 2007) first defined the semantic value of term A as the aggregate contribution of all terms in \(DAG_{A}\) to the semantics of term A, where terms closer to term A in \(DAG_{A}\) contribute more to its semantics. Thus, the contribution of a GO term \(t\) to the semantics of GO term \(A\) is defined as the S-value of GO term \(t\) related to term \(A\).

For any term \(t\) in \(DAG_{A}\), its S-value related to term \(A\), \(S_{A}(\textit{t})\), is defined as:

\[\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A \end{array} \right.\]

where \(w_{e}\) is the semantic contribution factor for edge \(e \in E_{A}\) linking term \(t\) with its child term \(t'\). Term \(A\)’s contribution to itself is defined as 1. After obtaining the S-values for all terms in \(DAG_{A}\), the semantic value of GO term A, \(SV(A)\), is calculated as:

\[SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)\]

Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:

\[sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{\left(S_{A}(t) + S_{B}(t)\right)}}{SV(A) + SV(B)}\]

where \(S_{A}(\textit{t})\) is the S-value of GO term \(t\) related to term \(A\) and \(S_{B}(\textit{t})\) is the S-value of GO term \(t\) related to term \(B\).

This method proposed by Wang (Wang et al. 2007) determines the semantic similarity of two GO terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms.
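The S-value recursion and the resulting similarity can be sketched as below. This is an illustration only: the ontology, the term names and the edge weights in `parents` are hypothetical, and the S-values are obtained by propagating scores upward from the term of interest, which is one possible way to evaluate the recursive definition.

```python
# Hypothetical toy ontology: each term maps to (parent, edge weight) pairs.
# The weights play the role of the semantic contribution factor w_e.
parents = {
    "root": [],
    "a": [("root", 0.8)],
    "b": [("root", 0.6)],
    "c": [("a", 0.8), ("b", 0.8)],
    "d": [("b", 0.8)],
}


def s_values(term):
    """Return S_term(t) for every t in DAG_term (the term and its ancestors)."""
    s = {term: 1.0}  # a term's contribution to itself is 1
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, weight in parents[child]:
            candidate = weight * s[child]
            # Keep the maximum over all children, re-propagating on improvement.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def semantic_value(term):
    """SV(term): the sum of all S-values in DAG_term."""
    return sum(s_values(term).values())


def sim_wang(a, b):
    sa, sb = s_values(a), s_values(b)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


print(s_values("c"))
print(sim_wang("c", "d"))
```

In this example the root receives its S-value relative to `c` through the path with the larger product of weights (via `a`), which is the effect of the max in the definition.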

Combine methods

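Since a gene product can be annotated by multiple GO terms, the semantic similarity between two gene products has to be aggregated from the pairwise similarity scores of the GO terms annotating them. In the formulas below, gene \(g_1\) is annotated by the \(m\) GO terms \(go_{11}, \ldots, go_{1m}\) and gene \(g_2\) by the \(n\) GO terms \(go_{21}, \ldots, go_{2n}\). Four aggregation methods are available: max, avg, rcmax and BMA.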

max

The max method calculates the maximum semantic similarity score over all pairs of GO terms drawn from the two GO term sets.

\[sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})\]

avg

The avg method calculates the average semantic similarity score over all pairs of GO terms.

\[sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}\]

rcmax

The similarities between two sets of GO terms form a matrix. The rcmax method uses the maximum of RowScore and ColumnScore, where RowScore (or ColumnScore) is the average of the maximum similarities in each row (or column).

\[sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})\]

BMA

The BMA method uses the Best-Match Average strategy: it calculates the average of all maximum similarities on each row and column, and is defined as:

\[sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{j=1}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}\]
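The four aggregation strategies can be sketched on a single similarity matrix. The `combine` helper and the example scores are hypothetical; rows correspond to the GO terms of the first gene and columns to those of the second.

```python
def combine(matrix, method):
    """Aggregate a term-by-term similarity matrix into one gene-level score."""
    row_max = [max(row) for row in matrix]        # best match for each row term
    col_max = [max(col) for col in zip(*matrix)]  # best match for each column term
    m, n = len(row_max), len(col_max)

    if method == "max":
        return max(row_max)
    if method == "avg":
        return sum(sum(row) for row in matrix) / (m * n)
    if method == "rcmax":
        return max(sum(row_max) / m, sum(col_max) / n)
    if method == "BMA":
        return (sum(row_max) + sum(col_max)) / (m + n)
    raise ValueError(f"unknown method: {method}")


# Hypothetical scores: gene 1 has two GO terms, gene 2 has three.
scores = [
    [0.9, 0.2, 0.4],
    [0.1, 0.6, 0.3],
]

for method in ("max", "avg", "rcmax", "BMA"):
    print(method, combine(scores, method))
```

On this matrix the avg score is pulled down by the many weak pairs, while max reflects only the single best pair; rcmax and BMA fall in between because they average the best match found for each term.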

Summary

The idea behind semantic similarity measurement is that genes with similar functions should have similar annotation vocabularies and be closely related in the ontology structure. Measuring similarity is critical for expanding knowledge, since similar objects tend to exhibit similar behaviors. This supports many bioinformatics applications that infer gene/protein functions, miRNA functions, genetic interactions, protein-protein interactions, miRNA-mRNA interactions, and cellular localization.

We developed several Bioconductor packages: GOSemSim (Yu et al. 2010; Yu 2020) for computing semantic similarity among GO terms, sets of GO terms, gene products and gene clusters (see also Chapter 2), DOSE (Yu et al. 2015) for Disease Ontology (DO) (see also Chapter 3), and meshes (Yu 2018), which is based on Medical Subject Headings (MeSH) (see also Chapter 4).

Jiang, Jay J., and David W. Conrath. 1997. “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.” Proceedings of 10th International Conference on Research In Computational Linguistics. http://www.citebase.org/abstract?id=oai:arXiv.org:cmp-lg/9709008.

Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity.” In Proceedings of the 15th International Conference on Machine Learning, 296–304. https://doi.org/10.1.1.55.1832.

Philip, Resnik. 1999. “Semantic Similarity in a Taxonomy: An Information-Based Measure and Its Application to Problems of Ambiguity in Natural Language.” Journal of Artificial Intelligence Research 11: 95–130.

Schlicker, Andreas, Francisco S Domingues, Jörg Rahnenführer, and Thomas Lengauer. 2006. “A New Measure for Functional Similarity of Gene Products Based on Gene Ontology.” BMC Bioinformatics 7: 302. https://doi.org/1471-2105-7-302.

Wang, James Z, Zhidian Du, Rapeeporn Payattakool, Philip S Yu, and Chin-Fu Chen. 2007. “A New Method to Measure the Semantic Similarity of GO Terms.” Bioinformatics (Oxford, England) 23 (May): 1274–81. https://doi.org/btm087.

Yu, Guangchuang. 2018. “Using Meshes for MeSH Term Enrichment and Semantic Analyses.” Bioinformatics 34 (21): 3766–67. https://doi.org/10.1093/bioinformatics/bty410.

———. 2020. “Gene Ontology Semantic Similarity Analysis Using GOSemSim.” Methods in Molecular Biology (Clifton, N.J.) 2117: 207–15. https://doi.org/10.1007/978-1-0716-0301-7_11.

Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among GO Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.

Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.