4 MeSH semantic similarity analysis

MeSH (Medical Subject Headings) is the controlled vocabulary that the U.S. National Library of Medicine (NLM) uses to manually index articles for MEDLINE/PubMed. It is a comprehensive life science vocabulary with 19 categories, 16 of which are contained in MeSH.db:

Abbreviation  Category
A             Anatomy
B             Organisms
C             Diseases
D             Chemicals and Drugs
E             Analytical, Diagnostic and Therapeutic Techniques and Equipment
F             Psychiatry and Psychology
G             Phenomena and Processes
H             Disciplines and Occupations
I             Anthropology, Education, Sociology and Social Phenomena
J             Technology and Food and Beverages
K             Humanities
L             Information Science
M             Persons
N             Health Care
V             Publication Type
Z             Geographical Locations

MeSH terms are associated with Entrez Gene IDs by three methods: Gendoo, gene2pubmed and RBBH (Reciprocal BLAST Best Hit).

Method       How Entrez Gene IDs are mapped to MeSH IDs
Gendoo       Text mining
gene2pubmed  Manual curation by NCBI teams
RBBH         Sequence homology from BLASTP searches (E-value < 10^-50)
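
To make the RBBH idea concrete, the sketch below picks reciprocal best hits from two toy BLASTP result tables. The data frames, their column names and the helper best_hit() are hypothetical and are not part of MeSH.db or meshes; only the E-value cutoff of 10^-50 is taken from the table above.

```r
# Toy BLASTP results (hypothetical data): species A searched against B, and B against A.
a_vs_b <- data.frame(
  query   = c("a1", "a1", "a2", "a3"),
  subject = c("b1", "b2", "b2", "b3"),
  evalue  = c(1e-80, 1e-60, 1e-70, 1e-20),
  stringsAsFactors = FALSE
)
b_vs_a <- data.frame(
  query   = c("b1", "b2", "b2", "b3"),
  subject = c("a1", "a2", "a1", "a3"),
  evalue  = c(1e-75, 1e-72, 1e-55, 1e-15),
  stringsAsFactors = FALSE
)

# For each query, keep the single hit with the smallest E-value below the cutoff.
best_hit <- function(hits, cutoff = 1e-50) {
  hits <- hits[hits$evalue < cutoff, ]
  hits <- hits[order(hits$query, hits$evalue), ]
  hits[!duplicated(hits$query), c("query", "subject")]
}

ab <- best_hit(a_vs_b)
ba <- best_hit(b_vs_a)

# A pair is reciprocal when the best hit of A in B has A as its own best hit.
rbbh <- merge(ab, ba, by.x = c("query", "subject"), by.y = c("subject", "query"))
rbbh
```

In this toy data, a3 and b3 are each other's best hit but fail the E-value cutoff, so they are not reported as a reciprocal pair.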

4.1 Supported organisms

The meshes package (Yu 2018) relies on MeSHDb to prepare the semantic data used for measuring similarity. MeSHDb resources can be downloaded from AnnotationHub (see also AHMeSHDbs); about 200 species are available, and all of them are supported by the meshes package.

First, we need to load (or fetch) the species-specific MeSH annotation database:

#############################
## BioC 2.14 to BioC 3.13  ##
#############################
##
## library(MeSH.Hsa.eg.db)
## db <- MeSH.Hsa.eg.db
##
##---------------------------

# From BioC 3.14 (released Oct. 2021, for R 4.1)
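library(AnnotationHub)
library(MeSHDbi)
# localHub=TRUE only searches the local cache; use localHub=FALSE on first use to download
ah <- AnnotationHub(localHub=TRUE)
# find the human MeSHDb records and retrieve the first matching resource
hsa <- query(ah, c("MeSHDb", "Homo sapiens"))
file_hsa <- hsa[[1]]
# open the retrieved resource as a MeSHDb object
db <- MeSHDbi::MeSHDb(file_hsa)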

The semantic data can be prepared by the meshdata() function:

library(meshes)
hsamd <- meshdata(db, category='A', computeIC=TRUE, database="gendoo")

## you may want to save the result for future usage
# 
# save(hsamd, file = "hsamd.rda")
#
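
As an alternative to save(), a small caching helper can compute a result once and read it back from disk on later runs. This is only a sketch: cache_rds() and slow_step() are hypothetical names, and slow_step() is a stand-in for the real meshdata() call so that the example runs without any annotation database.

```r
# Hypothetical helper: compute a value once, then reuse the copy stored on disk.
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Stand-in for an expensive step; in practice this could be something like
# function() meshdata(db, category = "A", computeIC = TRUE, database = "gendoo").
slow_step <- function() list(created = "toy semantic data")

path <- tempfile(fileext = ".rds")    # fresh file name, so nothing is cached yet
first  <- cache_rds(path, slow_step)  # computes the value and writes the file
second <- cache_rds(path, slow_step)  # reads the cached copy instead
identical(first, second)
```

saveRDS() stores a single object without its name, so the cached value can be assigned to any variable when it is read back.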

4.2 MeSH semantic similarity measurement

The meshes package (Yu 2018) implements four IC-based methods, namely Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006), and one graph-structure-based method, Wang (Wang et al. 2007), to measure MeSH term semantic similarity. In the measure argument, Schlicker's method is selected with "Rel". For algorithm details, please refer to Chapter 1.
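
As a rough illustration of how the IC-based measures relate to each other, the sketch below applies their textbook definitions to a tiny made-up hierarchy. The term names, the probabilities and the function ic_similarity() are assumptions for illustration only. The package may scale or normalise IC values differently, so these numbers are not expected to match meshSim() output.

```r
# Hypothetical annotation probabilities p(t) for a toy hierarchy:
# root -> body_region -> {head, neck}
p <- c(root = 1.0, body_region = 0.4, head = 0.3, neck = 0.3)
ic <- -log(p)  # information content: rarer terms are more informative

# Ancestors of each term, including the term itself (toy hierarchy).
ancestors <- list(
  head = c("head", "body_region", "root"),
  neck = c("neck", "body_region", "root")
)

ic_similarity <- function(t1, t2) {
  common <- intersect(ancestors[[t1]], ancestors[[t2]])
  mica <- common[which.max(ic[common])]  # most informative common ancestor
  ic_mica <- unname(ic[mica])
  ic1 <- unname(ic[t1])
  ic2 <- unname(ic[t2])
  lin <- 2 * ic_mica / (ic1 + ic2)
  c(
    Resnik = ic_mica,                                # IC of the MICA
    Lin    = lin,                                    # MICA IC relative to both terms
    Rel    = lin * (1 - unname(p[mica])),            # Lin weighted by MICA specificity
    Jiang  = 1 - min(1, ic1 + ic2 - 2 * ic_mica)     # 1 minus the capped distance
  )
}

scores <- ic_similarity("head", "neck")
round(scores, 3)
```

All four measures depend on the most informative common ancestor (MICA) of the two terms; they differ only in how its IC is related to the IC of the terms themselves.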

The meshSim() function is designed to measure semantic similarity between two MeSH term vectors.

library(meshes)
meshSim("D000009", "D009130", semData=hsamd, measure="Resnik")
## [1] 0.3847944
meshSim("D000009", "D009130", semData=hsamd, measure="Rel")
## [1] 0.633538
meshSim("D000009", "D009130", semData=hsamd, measure="Jiang")
## [1] 0.5587351
meshSim("D000009", "D009130", semData=hsamd, measure="Wang")
## [1] 0.5557103
meshSim(c("D001369", "D002462"), c("D017629", "D002890", "D008928"), semData=hsamd, measure="Wang")
##           D017629   D002890   D008928
## D001369 0.2886598 0.1923711 0.2193326
## D002462 0.6521739 0.2381925 0.2809552
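
The "Wang" measure used in the calls above is based on graph structure rather than IC. The sketch below follows the general idea of the method on a toy hierarchy: each ancestor of a term receives an S-value that decays with distance, and two terms are compared through their shared ancestors. The hierarchy, the function names and the edge weight of 0.7 are assumptions for illustration; the weight that meshes actually uses is not stated in this chapter.

```r
# Toy hierarchy (hypothetical): each term lists its direct parents.
parents <- list(
  root        = character(0),
  body_region = "root",
  head        = "body_region",
  neck        = "body_region"
)

# S-values of a term and all its ancestors: 1 for the term itself,
# and for each ancestor the best (maximum) weighted contribution from below.
s_values <- function(term, w = 0.7) {
  s <- 1
  names(s) <- term
  frontier <- term
  while (length(frontier) > 0) {
    nxt <- character(0)
    for (child in frontier) {
      for (pa in parents[[child]]) {
        contrib <- w * s[[child]]
        if (is.na(s[pa]) || contrib > s[[pa]]) {
          s[pa] <- contrib
          nxt <- c(nxt, pa)
        }
      }
    }
    frontier <- unique(nxt)
  }
  s
}

# Similarity: S-values of shared ancestors divided by all S-values of both terms.
wang_sim <- function(t1, t2, w = 0.7) {
  sa <- s_values(t1, w)
  sb <- s_values(t2, w)
  common <- intersect(names(sa), names(sb))
  sum(sa[common] + sb[common]) / (sum(sa) + sum(sb))
}

sim_head_neck <- wang_sim("head", "neck")
sim_head_head <- wang_sim("head", "head")
round(c(sim_head_neck, sim_head_head), 3)
```

Because the score only needs the hierarchy, this kind of measure does not depend on how often terms are used for annotation.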

4.3 Gene semantic similarity measurement

The geneSim() function is designed to measure semantic similarity between two gene vectors.

geneSim("241", "251", semData=hsamd, measure="Wang", combine="BMA")
## [1] 0.216
geneSim(c("241", "251"), c("835", "5261","241", "994"), semData=hsamd, measure="Wang", combine="BMA")
##       835  5261   241   994
## 241 0.353 0.143 1.000 0.262
## 251 0.301 0.347 0.216 0.361
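
The combine="BMA" argument selects a best-match average strategy for turning term-level scores into a single gene-level score. A minimal sketch of BMA as it is commonly defined is given below; combine_bma() is a hypothetical helper, and reusing the term matrix from the meshSim() example in Section 4.2 as if its rows and columns were the MeSH terms of two genes is an assumption made purely for illustration.

```r
# Term-by-term similarity matrix (values copied from the meshSim() example in 4.2).
# Assumption: rows are the MeSH terms of gene 1, columns those of gene 2.
term_sim <- matrix(
  c(0.2886598, 0.1923711, 0.2193326,
    0.6521739, 0.2381925, 0.2809552),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("D001369", "D002462"),
                  c("D017629", "D002890", "D008928"))
)

# Best-match average: every term is paired with its best match on the other
# side, and the best scores from both directions are averaged together.
combine_bma <- function(m) {
  row_best <- apply(m, 1, max)  # best match for each term of gene 1
  col_best <- apply(m, 2, max)  # best match for each term of gene 2
  sum(row_best, col_best) / (nrow(m) + ncol(m))
}

bma <- combine_bma(term_sim)
round(bma, 3)
```

Averaging only the best matches keeps unrelated term pairs from pulling the score down, which a plain mean over all entries of the matrix would do.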