Biological Knowledge Mining and Discovery

Knowledge discovery from precision medicine big data is important for promoting clinical translational applications, because mining existing biomedical knowledge can drive the discovery of new biomedical knowledge. We have implemented a series of methods and tools, including: (1) the first implementation of biological theme comparison for complex experimental designs, (2) universal enrichment analysis methods for interpreting omics data, (3) semantic similarity measurement to assist in biological knowledge discovery, (4) cistromic data mining to aid in the discovery of (unknown) co-regulators, (5) integration of biological knowledge to enhance the biological interpretability of single-cell clustering, and (6) characterization of single-cell functional states and identification of spatially specific biological functions. This series of methods and software enables a more diverse range of biomedical knowledge to be applied across a greater variety of species, allowing biomedical knowledge to support biological big data mining and to yield novel potential discoveries.


1. First-time implementation of comparing biological themes

At a time when experimental designs were mostly simple, such as case-control designs, we recognized the importance of complex experimental designs involving multiple time points and conditions for studying biomedical questions. We developed clusterProfiler to compare biological themes, and this work was published in OMICS: A Journal of Integrative Biology 2012. It has been cited over 20,000 times, making it the most highly cited paper in the journal and one of the top 10 papers among China’s highly cited papers from 2011 to 2021. Featuring an extremely user-friendly interface, it allows complex experimental designs to be specified using formula syntax, as demonstrated in The Innovation 2021 with data covering multiple time points and drug treatments simultaneously. Twelve years after its initial publication, we authored a protocol-type article showing how biological theme comparison can be used to compare disease subtypes, characterize the functions of perturbed transcription factors, and annotate cell types using various omics data, including metagenomics, metabolomics, transcriptomics, and single-cell omics. This work was published in Nature Protocols 2024.
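
To illustrate the idea of describing a multi-factor design with a formula, the following is a minimal Python sketch, not clusterProfiler's actual implementation. The function name `split_by_formula`, the column names, and the toy rows are all hypothetical. It only shows how a formula such as `gene ~ time + drug` can split one gene table into one gene list per combination of conditions, each of which could then be analyzed and compared.

```python
from collections import defaultdict


def split_by_formula(formula, rows):
    """Group the left-hand-side column by every combination of the
    right-hand-side factors, e.g. "gene ~ time + drug"."""
    lhs, rhs = (side.strip() for side in formula.split("~"))
    factors = [f.strip() for f in rhs.split("+")]
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in factors)  # one key per condition combination
        groups[key].append(row[lhs])
    return dict(groups)


# Hypothetical toy data: perturbed genes observed under two factors.
rows = [
    {"gene": "g1", "time": "0h", "drug": "A"},
    {"gene": "g2", "time": "0h", "drug": "B"},
    {"gene": "g1", "time": "24h", "drug": "A"},
    {"gene": "g3", "time": "24h", "drug": "A"},
]

gene_clusters = split_by_formula("gene ~ time + drug", rows)
for condition, genes in gene_clusters.items():
    print(condition, genes)
```

Each resulting gene list corresponds to one cell of the experimental design, so the same enrichment test can be run on every list and the results laid side by side.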

2. Discovering potential molecular mechanisms driven by high-throughput data and biomedical knowledge

One of the key aspects of functional genomics research is to identify the biological pathways in which perturbed genes are involved and to propose hypotheses about molecular mechanisms. Gene enrichment analysis is a widely used and effective method for this purpose. However, it faces challenges such as outdated annotations and a lack of support for non-model organisms. We have developed a series of R packages, represented by clusterProfiler, to address these challenges. clusterProfiler retrieves the latest genome annotations online, supporting GO and KEGG enrichment analysis for thousands of species; provides a universal interface for user-provided custom annotations, facilitating the analysis of new species and the use of new functional annotations; accepts genomic regions as input, enabling enrichment analysis of epigenomic data; implements comparative analysis of multiple datasets to support complex experimental designs; and provides a tidy interface, making it easier for users to operate on, explore, and interpret data (OMICS: A Journal of Integrative Biology 2012; The Innovation 2021; Nature Protocols 2024).

This work has had a significant impact on biomedical research, with over 20,000 citations and integration into more than 40 bioinformatics software tools. It has become one of the indispensable bioinformatics tools for various omics data analyses. We have also extended its application to disease ontology (Bioinformatics 2015b), medical subject headings (Bioinformatics 2018), and Reactome pathway analysis (Molecular BioSystems 2016).
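
The statistical core of over-representation analysis, one common form of gene enrichment analysis, can be sketched with a hypergeometric test. The Python code below is a minimal illustration under stated assumptions, not the clusterProfiler implementation: the function names, the toy gene identifiers, and the two toy pathways are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn without replacement
    from n_background genes, of which n_in_term are annotated to the term."""
    upper = min(n_in_term, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrich(query_genes, term_to_genes, background):
    """Test every term in a user-provided annotation; return rows sorted by p-value."""
    background = set(background)
    query = set(query_genes) & background
    results = []
    for term, genes in term_to_genes.items():
        members = set(genes) & background
        overlap = len(query & members)
        p = hypergeom_enrichment_p(len(background), len(members), len(query), overlap)
        results.append((term, overlap, p))
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy annotation: 20 background genes and two pathways.
background = [f"g{i}" for i in range(1, 21)]
term_to_genes = {
    "pathway_A": ["g1", "g2", "g3", "g4", "g5"],
    "pathway_B": [f"g{i}" for i in range(11, 19)],
}
results = enrich(["g1", "g2", "g3", "g11"], term_to_genes, background)
for term, overlap, p in results:
    print(f"{term}\toverlap={overlap}\tp={p:.4f}")
```

Because the annotation is passed in as a plain mapping from term to genes, the same test works with any custom annotation, which is the idea behind a universal enrichment interface. In practice the p-values would also be adjusted for multiple testing.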

3. Measuring gene semantic similarity promotes the discovery of biomedical knowledge

By mining biomedical knowledge, new biomedical knowledge can be discovered. Calculating gene functional similarity from biomedical knowledge plays a crucial role in this process: it quantifies biomedical knowledge mathematically to measure the similarity between genes. Similar entities share similar functions and behaviors, which forms the basis for solving a wide range of real-world problems, including predicting and inferring gene function, localization, and interactions, as well as analyzing diseases and drugs (including drug repurposing).

We developed the software tool GOSemSim based on the Gene Ontology; it implements multiple information content-based and graph-based algorithms. This work was published in Bioinformatics 2010 and as a chapter in Stem Cell Transcriptional Networks (2nd edition) 2020. We extended this work to a broader range of biomedical knowledge, supporting the measurement of gene similarity from perspectives such as diseases (Bioinformatics 2015b), phenotypes, and medical subject headings (Bioinformatics 2018).
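
As an illustration of an information content-based measure, the sketch below computes Resnik similarity on a toy ontology and combines term scores into a gene-level score with a best-match average. It is a minimal Python sketch under stated assumptions, not GOSemSim's code: the ontology, the annotation counts, the gene annotations, and all function names are hypothetical.

```python
import math


def ancestors(term, parents):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(parents, direct_counts):
    """IC(t) = -log p(t), where p(t) is the fraction of annotations
    attached to t or to any of its descendants."""
    cumulative = {term: 0 for term in parents}
    for term, n in direct_counts.items():
        for anc in ancestors(term, parents):
            cumulative[anc] += n
    total = sum(direct_counts.values())
    return {t: -math.log(c / total) for t, c in cumulative.items() if c > 0}


def resnik(t1, t2, parents, ic):
    """IC of the most informative common ancestor of two terms."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(ic[t] for t in common if t in ic)


def gene_similarity(terms1, terms2, parents, ic):
    """Best-match average over the term-by-term similarity matrix."""
    row_best = [max(resnik(a, b, parents, ic) for b in terms2) for a in terms1]
    col_best = [max(resnik(a, b, parents, ic) for a in terms1) for b in terms2]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Hypothetical toy ontology (child -> parents) and direct annotation counts.
parents = {"root": [], "A": ["root"], "B": ["root"],
           "A1": ["A"], "A2": ["A"], "B1": ["B"]}
direct_counts = {"A1": 2, "A2": 2, "B1": 4}
ic = information_content(parents, direct_counts)

sim_siblings = resnik("A1", "A2", parents, ic)   # share the ancestor A
sim_unrelated = resnik("A1", "B1", parents, ic)  # share only the root
sim_genes = gene_similarity(["A1"], ["A2", "B1"], parents, ic)
print(sim_siblings, sim_unrelated, sim_genes)
```

Terms that share a specific, rarely used ancestor score higher than terms that meet only at the root, which is what lets annotation knowledge be turned into a numerical similarity between genes.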

4. Annotation, visualization, and data mining of cistrome data

Cistrome refers to the set of cis-acting targets of trans-acting factors at the genome-wide scale, including transcription factor binding sites and the genomic locations of histone modifications. Cis-regulatory information can be obtained at the genome-wide level through techniques such as ChIP-seq, DNase-seq, and ATAC-seq. The key to bridging upstream analysis (determining locations) and downstream analysis (functional studies) lies in annotating this genomic positional information. In addition, most tools are designed for individual datasets. With the proliferation of cistromic technologies and the accumulation of public data, comparing multiple datasets, mining public data, and identifying (unknown) co-regulators of transcription factors are critical for a comprehensive understanding of cistrome interactions. To address these issues, we developed the ChIPseeker package for annotation and comparison, integrated with the GEO database to allow users to infer potential co-transcription factors or protein complexes through data mining. This work was published in Bioinformatics 2015a. ChIPseeker has been widely applied to various cistromic datasets and is listed as a key step in the analysis of DNase-seq and ATAC-seq data. We also published a protocol article (Current Protocols 2022) introducing the application of ChIPseeker to a variety of epigenomic datasets.
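
The annotation step that links peak positions to genes can be sketched as a nearest-TSS assignment. The Python code below is a simplified illustration, not ChIPseeker's algorithm: the `Gene` class, the `annotate_peak` function, the 1 kb promoter window, the three feature labels, and the toy coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int
    strand: str

    @property
    def tss(self):
        # Transcription start site depends on the strand.
        return self.start if self.strand == "+" else self.end


def annotate_peak(chrom, start, end, genes, promoter_window=1000):
    """Assign a peak to the gene with the nearest TSS and label its feature."""
    mid = (start + end) // 2
    candidates = [g for g in genes if g.chrom == chrom]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda g: abs(mid - g.tss))
    offset = mid - nearest.tss
    # Positive distance means downstream of the TSS in the direction of transcription.
    signed = offset if nearest.strand == "+" else -offset
    if abs(signed) <= promoter_window:
        feature = "Promoter"
    elif nearest.start <= mid <= nearest.end:
        feature = "Gene body"
    else:
        feature = "Distal intergenic"
    return {"gene": nearest.name, "distance_to_tss": signed, "feature": feature}


# Hypothetical toy gene models and peaks.
genes = [
    Gene("geneA", "chr1", 1000, 5000, "+"),
    Gene("geneB", "chr1", 20000, 26000, "-"),
]
peaks = [("chr1", 900, 1100), ("chr1", 24000, 24400), ("chr1", 50000, 50200)]
annotations = [annotate_peak(*peak, genes) for peak in peaks]
for peak, ann in zip(peaks, annotations):
    print(peak, ann)
```

Once every peak carries a gene and a feature label, the gene lists from several datasets can be fed into downstream functional analysis or compared with one another.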

5. Integrating biological knowledge to enhance the biological interpretability of single-cell clustering

Clustering is a key step in bridging upstream and downstream analyses in single-cell analysis, and accurate clustering is crucial for downstream analysis. Currently, the most common clustering methods rely on graph-based community detection approaches. We propose integrating biological knowledge as attributes of graph nodes to obtain clustering results that are more aligned with biological interpretations. We have implemented the MSGNN tool based on graph neural networks to enhance the biological interpretation of complex data.

6. Identification of spatially variable biological functions

Identification of biological functions with spatially specific distributions involves identifying genes with high spatial variability, followed by enrichment analysis to characterize these functions. We propose a new method for characterizing cell functional states as well as a general approach for identifying spatially highly variable features. The SVP package has been developed to implement these two methods, enabling the characterization of various biological functions, including biological pathways, and determining whether these functions exhibit spatial distribution specificity.
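
To make the notion of a spatially variable feature concrete, the sketch below computes Moran's I, a classical spatial autocorrelation statistic, for a per-spot score. Moran's I is used here only as a familiar stand-in and is not claimed to be the method implemented in SVP. The function name, the neighbor radius, the spot coordinates, and the score values are hypothetical.

```python
import math


def morans_i(coords, values, radius=1.0):
    """Moran's I with binary weights: two spots are neighbors when their
    Euclidean distance is at most `radius`. Assumes the values are not all
    equal and that at least one pair of spots are neighbors."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    variance_sum = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / variance_sum)


# Hypothetical toy data: four spots on a line and two per-spot activity scores.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
clustered = [1.0, 1.0, 0.0, 0.0]    # high values sit next to each other
alternating = [1.0, 0.0, 1.0, 0.0]  # neighbors always differ

i_clustered = morans_i(coords, clustered)
i_alternating = morans_i(coords, alternating)
print(i_clustered, i_alternating)
```

A positive value indicates that neighboring spots have similar scores, while a negative value indicates that neighbors tend to differ. The same statistic can be applied to a gene's expression or to a per-spot pathway activity score.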

Comments from the academic community

Users familiar with R/Bioconductor have several R packages to perform epigenomic enrichment analysis. ChIPseeker (Yu et al., 2015) provides a variety of functions for annotating user-provided ROIs in BED format by proximity to the nearby genes, finding overlaps with epigenomic regions, performing gene-centric and epigenomic enrichment analysis. ChIPseeker employs permutation strategy to calculate overlap enrichment p-values. ChIPseeker provides access to multi-organism epigenomic datasets from GEO (Gene Expression Omnibus), expanding enrichment analysis to species other than human. Coupled with excellent visualization capabilities, ChIPseeker is a well-designed tool with complete functionality for the interpretation of genome-wide ROIs.

– Excerpt from the article: Epigenomic annotation-based interpretation of genomic data: from enrichment analysis to machine learning, Bioinformatics, 2017, 33(20):3323-3330.

Publications