4 MeSH semantic similarity analysis
MeSH (Medical Subject Headings) is the NLM (U.S. National Library of
Medicine) controlled vocabulary used to manually index articles for
MEDLINE/PubMed. It is a comprehensive life science vocabulary with
19 categories, 16 of which are included in MeSH.db:
Abbreviation | Category |
---|---|
A | Anatomy |
B | Organisms |
C | Diseases |
D | Chemicals and Drugs |
E | Analytical, Diagnostic and Therapeutic Techniques and Equipment |
F | Psychiatry and Psychology |
G | Phenomena and Processes |
H | Disciplines and Occupations |
I | Anthropology, Education, Sociology and Social Phenomena |
J | Technology and Food and Beverages |
K | Humanities |
L | Information Science |
M | Persons |
N | Health Care |
V | Publication Type |
Z | Geographical Locations |
MeSH terms were associated with Entrez Gene IDs by three methods: gendoo, gene2pubmed and RBBH (Reciprocal BLAST Best Hit).
Method | How Entrez Gene IDs are mapped to MeSH IDs |
---|---|
Gendoo | Text-mining |
gene2pubmed | Manual curation by NCBI teams |
RBBH | Sequence homology with BLASTP search (E-value < 10^-50) |
4.1 Supported organisms
The meshes package (Yu 2018) relies on MeSHDb to prepare semantic data for measuring similarity. MeSHDb can be downloaded from AnnotationHub (see also AHMeSHDbs); about 200 species are available and supported by the meshes package.
First, load or fetch the species-specific MeSH annotation database:
#############################
## BioC 2.14 to BioC 3.13 ##
#############################
##
## library(MeSH.Hsa.eg.db)
## db <- MeSH.Hsa.eg.db
##
##---------------------------
# From BioC 3.14 (Nov. 2021, with R-4.2.0)
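library(AnnotationHub)
library(MeSHDbi)
## localHub=TRUE uses only previously cached resources;
## call AnnotationHub() without it on the first run to download them
ah <- AnnotationHub(localHub=TRUE)
## search for human MeSHDb records (several versions may match)
hsa <- query(ah, c("MeSHDb", "Homo sapiens"))
## fetch the first matching resource and open it as a MeSHDb object
file_hsa <- hsa[[1]]
db <- MeSHDbi::MeSHDb(file_hsa)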
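The semantic data can be prepared with the meshdata() function; the resulting object (hsamd in the examples below) is passed to the similarity functions as semData.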
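A minimal sketch of this step, assuming `db` is the MeSHDb object created above. The wrapper name `prepare_semdata` is hypothetical, and the default `category` and `database` values are illustrative choices (Anatomy terms, gendoo mapping) rather than settings confirmed by this text. Only `meshdata()` and its `MeSHDb`, `database`, `category` and `computeIC` arguments come from the meshes package.

```r
# Hypothetical convenience wrapper around meshes::meshdata().
# The defaults are assumptions for illustration: category "A" (Anatomy)
# and the gendoo gene-to-MeSH mapping.
prepare_semdata <- function(db, category = "A", database = "gendoo") {
  if (!requireNamespace("meshes", quietly = TRUE)) {
    stop("Package 'meshes' is required; install it from Bioconductor.")
  }
  # computeIC = TRUE precomputes information content, which the
  # IC-based measures (Resnik, Lin, Jiang, Rel) rely on.
  meshes::meshdata(MeSHDb = db, database = database,
                   category = category, computeIC = TRUE)
}
```

With `db` available, `hsamd <- prepare_semdata(db)` would produce the object used as `semData` in the following sections. Building it can take a while, so saving the result with `save()` for reuse may be worthwhile.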
4.2 MeSH semantic similarity measurement
The meshes package (Yu 2018) implements four IC-based methods (Resnik (Philip 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006), the last selected with measure="Rel") and one graph-structure-based method (Wang (Wang et al. 2007)) to measure MeSH term semantic similarity. For algorithm details, please refer to Chapter 1.
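To make the IC-based family concrete, the sketch below computes the four measures from term probabilities in base R. It assumes the textbook definitions, with IC(t) = -log p(t) and the most informative common ancestor (MICA) supplied by the caller. The function name `ic_similarity` and its arguments are hypothetical, and the package itself may normalize scores differently, so the numbers are not expected to match `meshSim()` output.

```r
# Toy IC-based similarity from annotation probabilities.
# p1, p2: probabilities of the two terms; p_mica: probability of their
# most informative common ancestor (all assumed to be in (0, 1]).
ic_similarity <- function(p1, p2, p_mica,
                          measure = c("Resnik", "Lin", "Rel", "Jiang")) {
  measure <- match.arg(measure)
  ic <- function(p) -log(p)          # information content
  ic1 <- ic(p1)
  ic2 <- ic(p2)
  ic_mica <- ic(p_mica)
  switch(measure,
    Resnik = ic_mica,
    Lin    = 2 * ic_mica / (ic1 + ic2),
    # Schlicker's relevance measure: Lin weighted by 1 - p(MICA)
    Rel    = 2 * ic_mica / (ic1 + ic2) * (1 - p_mica),
    # Jiang distance converted to a similarity bounded in [0, 1]
    Jiang  = 1 - min(1, ic1 + ic2 - 2 * ic_mica)
  )
}

# Two fairly specific terms sharing a moderately specific ancestor
sapply(c("Resnik", "Lin", "Rel", "Jiang"),
       function(m) ic_similarity(0.01, 0.02, 0.1, measure = m))
```

When the only common ancestor is the root (p_mica = 1), Resnik, Lin and Rel all drop to zero, which is why these measures depend on how informative the shared ancestor is.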
The meshSim() function is designed to measure semantic similarity between two MeSH term vectors.
## [1] 0.3847944
meshSim("D000009", "D009130", semData=hsamd, measure="Rel")
## [1] 0.633538
meshSim("D000009", "D009130", semData=hsamd, measure="Jiang")
## [1] 0.5587351
meshSim("D000009", "D009130", semData=hsamd, measure="Wang")
## [1] 0.5557103
## D017629 D002890 D008928
## D001369 0.2886598 0.1923711 0.2193326
## D002462 0.6521739 0.2381925 0.2809552
4.3 Gene semantic similarity measurement
The geneSim() function is designed to measure semantic similarity between two gene vectors.
geneSim("241", "251", semData=hsamd, measure="Wang", combine="BMA")
## [1] 0.216
geneSim(c("241", "251"), c("835", "5261","241", "994"), semData=hsamd, measure="Wang", combine="BMA")
## 835 5261 241 994
## 241 0.353 0.143 1.000 0.262
## 251 0.301 0.347 0.216 0.361
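The combine="BMA" argument collapses the term-by-term similarities between the MeSH terms of two genes into a single score. The sketch below assumes BMA (best-match average) is the mean of the row-wise and column-wise maxima of that matrix, which is the definition used in the related GOSemSim package. The function name `bma` is hypothetical and is not the meshes implementation. The toy matrix reuses the MeSH term similarity values printed in Section 4.2.

```r
# Best-match average over a term-by-term similarity matrix:
# each row term and each column term contributes its best match.
bma <- function(sim) {
  sim <- as.matrix(sim)
  row_best <- apply(sim, 1, max, na.rm = TRUE)
  col_best <- apply(sim, 2, max, na.rm = TRUE)
  sum(row_best, col_best) / (nrow(sim) + ncol(sim))
}

# Term similarities taken from the matrix shown in Section 4.2
term_sim <- matrix(
  c(0.2886598, 0.1923711, 0.2193326,
    0.6521739, 0.2381925, 0.2809552),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("D001369", "D002462"),
                  c("D017629", "D002890", "D008928"))
)

bma(term_sim)
```

Averaging best matches, rather than all pairs, keeps a few unrelated terms from dragging down the score of two genes that otherwise share closely related annotations.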