3  MeSH semantic similarity analysis

MeSH (Medical Subject Headings) is the controlled vocabulary of the U.S. National Library of Medicine (NLM), used to manually index articles for MEDLINE/PubMed. It is a comprehensive life science vocabulary with 19 categories, 16 of which are available in MeSH.db:

Abbreviation  Category
A             Anatomy
B             Organisms
C             Diseases
D             Chemicals and Drugs
E             Analytical, Diagnostic and Therapeutic Techniques and Equipment
F             Psychiatry and Psychology
G             Phenomena and Processes
H             Disciplines and Occupations
I             Anthropology, Education, Sociology and Social Phenomena
J             Technology and Food and Beverages
K             Humanities
L             Information Science
M             Persons
N             Health Care
V             Publication Type
Z             Geographical Locations

MeSH terms are associated with Entrez Gene IDs by three methods: gendoo, gene2pubmed and RBBH (Reciprocal BLAST Best Hit).

Method       How Entrez Gene IDs are mapped to MeSH IDs
Gendoo       Text mining
gene2pubmed  Manual curation by NCBI teams
RBBH         Sequence homology with BLASTP search (E-value < 10^-50)

3.1 Supported organisms

The meshes package (Yu 2018) relies on MeSHDb to prepare semantic data for measuring similarity. MeSHDb databases can be downloaded from AnnotationHub (see also AHMeSHDbs); about 200 species are available, all of which are supported by the meshes package.

First, we need to load or fetch a species-specific MeSH annotation database:

#############################
## BioC 2.14 to BioC 3.13  ##
#############################
##
## library(MeSH.Hsa.eg.db)
## db <- MeSH.Hsa.eg.db
##
##---------------------------

# From BioC 3.14 (Oct. 2021, with R 4.1)
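library(AnnotationHub)
library(MeSHDbi)
# localHub=TRUE uses only resources already in the local cache;
# set localHub=FALSE on first use so the database can be downloaded
ah <- AnnotationHub(localHub=TRUE)
# find the human MeSHDb record(s) in the hub
hsa <- query(ah, c("MeSHDb", "Homo sapiens"))
# retrieve the first matching resource
file_hsa <- hsa[[1]]
db <- MeSHDbi::MeSHDb(file_hsa)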

The semantic data can be prepared by the meshdata() function:

library(meshes)
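# category='A' restricts the data to Anatomy terms; computeIC=TRUE also
# computes information content, which the IC-based measures use
hsamd <- meshdata(db, category='A', computeIC=TRUE, database="gendoo")

## you may want to save the result for future use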
# 
# save(hsamd, file = "hsamd.rda")
#
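
To give a feel for what computeIC=TRUE adds, the following is a minimal sketch of the general idea behind information content (IC): a term that annotates fewer genes is more informative. The toy hierarchy, the term names and the annotation counts are all made up for illustration and are not taken from MeSHDb; the computation inside meshdata() may differ in its details.

```r
# Hypothetical toy hierarchy: each term maps to all of its ancestors.
ancestors <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  A1   = c("A", "root"),
  A2   = c("A", "root")
)

# Made-up numbers of genes directly annotated to each term.
direct <- c(root = 0, A = 2, B = 5, A1 = 10, A2 = 3)

# An annotation to a term also counts for every ancestor of that term,
# so the direct counts are propagated up the hierarchy.
cumulative <- setNames(numeric(length(direct)), names(direct))
for (term in names(direct)) {
  hit <- c(term, ancestors[[term]])
  cumulative[hit] <- cumulative[hit] + direct[[term]]
}

# IC is the negative log of the probability of observing the term.
p  <- cumulative / cumulative[["root"]]
ic <- -log(p)
round(ic, 3)
```

In this toy data the root has an IC of zero because every annotation reaches it, while the rarely used leaf A2 has the highest IC.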

3.2 MeSH semantic similarity measurement

The meshes package (Yu 2018) implements four IC-based methods, Resnik (Resnik 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006; selected with measure="Rel"), and one graph-structure-based method, Wang (Wang et al. 2007), to measure MeSH term semantic similarity. For algorithm details, please refer to Chapter 1.
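
As a rough illustration of how the four IC-based measures relate to each other, the sketch below writes out their textbook forms for two terms, given the IC of each term and the IC of their most informative common ancestor (MICA). The function ic_sim() and the input values are hypothetical and only for illustration; the Jiang distance is converted to a similarity in one common way, and the package may scale or normalise scores differently, so the numbers need not match meshSim() output.

```r
# Textbook forms of the IC-based measures (illustrative helper, not part of meshes).
#   ic1, ic2 : information content of the two terms
#   ic_mica  : information content of their most informative common ancestor
ic_sim <- function(ic1, ic2, ic_mica,
                   measure = c("Resnik", "Lin", "Jiang", "Rel")) {
  measure <- match.arg(measure)
  switch(measure,
    # Resnik: IC of the shared ancestor only
    Resnik = ic_mica,
    # Lin: shared IC relative to the IC of both terms
    Lin    = 2 * ic_mica / (ic1 + ic2),
    # Jiang and Conrath: a distance, turned into a similarity capped at 0
    Jiang  = 1 - min(1, ic1 + ic2 - 2 * ic_mica),
    # Schlicker (Rel): Lin weighted by how specific the shared ancestor is
    Rel    = 2 * ic_mica / (ic1 + ic2) * (1 - exp(-ic_mica))
  )
}

# Made-up IC values for two terms and their MICA.
sapply(c("Resnik", "Lin", "Jiang", "Rel"),
       function(m) ic_sim(1.2, 1.5, 1.0, measure = m))
```

Resnik depends only on the shared ancestor, whereas the other three also take the specificity of the two terms themselves into account.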

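The meshSim() function is designed to measure semantic similarity between two MeSH term vectors. It returns a single value for a pair of terms and a matrix of pairwise scores when vectors are supplied.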

library(meshes)
meshSim("D000009", "D009130", semData=hsamd, measure="Resnik")
## [1] 0.3847944
meshSim("D000009", "D009130", semData=hsamd, measure="Rel")
## [1] 0.633538
meshSim("D000009", "D009130", semData=hsamd, measure="Jiang")
## [1] 0.5587351
meshSim("D000009", "D009130", semData=hsamd, measure="Wang")
## [1] 0.5557103
meshSim(c("D001369", "D002462"), c("D017629", "D002890", "D008928"), semData=hsamd, measure="Wang")
##           D017629   D002890   D008928
## D001369 0.2886598 0.1923711 0.2193326
## D002462 0.6521739 0.2381925 0.2809552
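
The graph-based Wang measure does not need IC values; it compares the ancestor graphs of two terms. The following is a minimal sketch of that idea on a hypothetical toy hierarchy. The term names, the helper functions s_values() and wang_sim(), and the single edge weight of 0.8 are assumptions for illustration only and do not reproduce the package's implementation.

```r
# Hypothetical toy hierarchy: each term lists its direct parents.
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  A1   = "A",
  A2   = "A",
  AB   = c("A", "B")   # a term with two parents
)

# S-values of `term`: the term itself contributes 1, and every ancestor
# receives the largest weighted contribution that reaches it from below.
s_values <- function(term, parents, w = 0.8) {
  s <- c(1)
  names(s) <- term
  frontier <- term
  while (length(frontier) > 0) {
    nxt <- character(0)
    for (child in frontier) {
      for (p in parents[[child]]) {
        cand <- w * s[[child]]
        # keep the best contribution seen so far for this ancestor
        if (is.na(s[p]) || cand > s[[p]]) {
          s[p] <- cand
          nxt <- union(nxt, p)
        }
      }
    }
    frontier <- nxt
  }
  s
}

# Similarity: S-values of the shared ancestors relative to the total
# S-values of both ancestor graphs.
wang_sim <- function(t1, t2, parents, w = 0.8) {
  s1 <- s_values(t1, parents, w)
  s2 <- s_values(t2, parents, w)
  common <- intersect(names(s1), names(s2))
  sum(s1[common] + s2[common]) / (sum(s1) + sum(s2))
}

wang_sim("A1", "A2", parents)  # siblings under A
wang_sim("A1", "B", parents)   # only the root is shared
```

Terms that share a close parent score higher than terms that only meet at the root, and a term compared with itself scores 1.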

3.3 Gene semantic similarity measurement

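The geneSim() function is designed to measure semantic similarity between two gene vectors. The combine argument specifies how the similarities between the MeSH terms annotated to each gene are aggregated into a single gene-level score (here BMA, the best-match average).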

geneSim("241", "251", semData=hsamd, measure="Wang", combine="BMA")
## [1] 0.216
geneSim(c("241", "251"), c("835", "5261","241", "994"), semData=hsamd, measure="Wang", combine="BMA")
##       835  5261   241   994
## 241 0.353 0.143 1.000 0.262
## 251 0.301 0.347 0.216 0.361
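
To illustrate what a best-match average does, the sketch below combines a term-by-term similarity matrix into one score. It reuses the 2 x 3 matrix printed in Section 3.2 purely as sample numbers, pretending that the rows are the terms of one gene and the columns the terms of another. The bma() helper is hypothetical, and the formula shown is one common formulation that may differ from what geneSim() does internally.

```r
# Sample term-by-term similarities (rows: terms of gene 1, columns: terms of gene 2).
term_sim <- matrix(
  c(0.2886598, 0.1923711, 0.2193326,
    0.6521739, 0.2381925, 0.2809552),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("D001369", "D002462"),
                  c("D017629", "D002890", "D008928"))
)

# Best-match average: each term is paired with its best match in the
# other set, and the best-match scores are averaged over both sets.
bma <- function(m) {
  row_best <- apply(m, 1, max)  # best match for each row term
  col_best <- apply(m, 2, max)  # best match for each column term
  (sum(row_best) + sum(col_best)) / (nrow(m) + ncol(m))
}

bma(term_sim)
```

Compared with taking the overall maximum or the plain mean of the matrix, this rewards gene pairs in which most terms have a good counterpart rather than just one.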

References

Jiang, Jay J., and David W. Conrath. 1997. “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.” Proceedings of 10th International Conference on Research In Computational Linguistics. http://www.citebase.org/abstract?id=oai:arXiv.org:cmp-lg/9709008.
Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity.” In Proceedings of the 15th International Conference on Machine Learning, 296–304. https://doi.org/10.1.1.55.1832.
Resnik, Philip. 1999. “Semantic Similarity in a Taxonomy: An Information-Based Measure and Its Application to Problems of Ambiguity in Natural Language.” Journal of Artificial Intelligence Research 11: 95–130.
Schlicker, Andreas, Francisco S Domingues, Jörg Rahnenführer, and Thomas Lengauer. 2006. “A New Measure for Functional Similarity of Gene Products Based on Gene Ontology.” BMC Bioinformatics 7: 302. https://doi.org/10.1186/1471-2105-7-302.
Wang, James Z, Zhidian Du, Rapeeporn Payattakool, Philip S Yu, and Chin-Fu Chen. 2007. “A New Method to Measure the Semantic Similarity of GO Terms.” Bioinformatics (Oxford, England) 23 (May): 1274–81. https://doi.org/10.1093/bioinformatics/btm087.
Yu, Guangchuang. 2018. “Using Meshes for MeSH Term Enrichment and Semantic Analyses.” Bioinformatics 34 (21): 3766–67. https://doi.org/10.1093/bioinformatics/bty410.