Miscellaneous topics

Leading Edge Analysis

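Leading edge analysis identifies the core subset of genes in a gene set that drives its enrichment signal in GSEA. It reports three statistics:

  • Tags: Percentage of genes in the gene set that fall before (for a positive enrichment score) or after (for a negative one) the peak of the running enrichment score; these are the genes contributing to the enrichment score
  • List: Percentage of the ranked gene list that lies before (or after) that peak, indicating where in the list the enrichment score is attained
  • Signal: Strength of the enrichment signal, obtained by combining the two statistics above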

DOSE, clusterProfiler, and ReactomePA all support leading edge analysis and can report the core enriched genes that contribute to the enrichment.
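
Signal is derived from tags and list, so the three statistics can be checked against each other. As a rough illustration, the base-R sketch below recomputes them for the first gene set in the gseDO() output shown in the next section (33 core genes, setSize 61, rank 2039). The function name leading_edge_stats is hypothetical, the ranked-list length of 12495 is an assumption about the example geneList, and the signal formula follows the definition in the GSEA documentation; it is not copied from the package source.

```r
# Recompute leading-edge statistics for one gene set (illustrative sketch).
# n_core   : number of core enriched genes (entries in core_enrichment)
# set_size : number of genes in the gene set (setSize column)
# rank     : position of the enrichment score peak in the ranked list
# n_ranked : length of the ranked gene list (ASSUMED to be 12495 here)
leading_edge_stats <- function(n_core, set_size, rank, n_ranked) {
  tags      <- n_core / set_size
  list_frac <- rank / n_ranked
  # Signal combines tags and list, corrected for the gene set size
  signal    <- tags * (1 - list_frac) * n_ranked / (n_ranked - set_size)
  c(tags = tags, list = list_frac, signal = signal)
}

stats <- leading_edge_stats(n_core = 33, set_size = 61,
                            rank = 2039, n_ranked = 12495)
round(100 * stats)
```

Rounded to whole percentages, these values agree with the reported tags=54%, list=16%, signal=45% for DOID:0111962.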

Core Enriched Genes Extraction

After performing GSEA, the results object contains a core_enrichment column that lists the core genes responsible for each enriched term:

library(DOSE)
DOSE v4.5.1 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an
R/Bioconductor package for Disease Ontology Semantic and Enrichment
analysis. Bioinformatics. 2015, 31(4):608-609
data(geneList)
x <- gseDO(geneList)
head(x)
                       ID                      Description setSize
DOID:0111962 DOID:0111962        combined immunodeficiency      61
DOID:0060306 DOID:0060306            Meier-Gorlin syndrome      10
DOID:2799       DOID:2799         bronchiolitis obliterans      25
DOID:0070297 DOID:0070297             primary microcephaly      29
DOID:820         DOID:820                      myocarditis      31
DOID:612         DOID:612 primary immunodeficiency disease     234
             enrichmentScore      NES       pvalue     p.adjust       qvalue
DOID:0111962       0.6365097 2.369943 3.087832e-08 1.651990e-05 7.836705e-06
DOID:0060306       0.9453691 2.289140 5.867369e-08 2.092695e-05 9.927319e-06
DOID:2799          0.7055856 2.204729 3.320345e-05 4.720814e-03 2.239458e-03
DOID:0070297       0.6677151 2.078426 4.594987e-05 5.409562e-03 2.566186e-03
DOID:820           0.6455437 2.056984 9.639181e-05 7.367088e-03 3.494797e-03
DOID:612           0.4497659 2.038623 4.266921e-10 4.565606e-07 2.165831e-07
             rank                   leading_edge
DOID:0111962 2039 tags=54%, list=16%, signal=45%
DOID:0060306  452  tags=80%, list=4%, signal=77%
DOID:2799    1061  tags=44%, list=8%, signal=40%
DOID:0070297  949  tags=38%, list=8%, signal=35%
DOID:820     2099 tags=61%, list=17%, signal=51%
DOID:612     2521 tags=43%, list=20%, signal=35%
             core_enrichment
DOID:0111962 9837/1503/7037/3932/3559/51311/3561/3574/3575/4860/915/959/11151/50615/1794/3689/5788/5424/5695/3394/10525/100/5880/5699/204/10095/5971/10125/8456/8625/3071/7293/4478
DOID:0060306 8318/81620/4174/990/23594/4998/64785/51053
DOID:2799    3627/6373/4283/3002/4318/6352/6347/6354/942/6361/6367
DOID:0070297 1062/23397/259266/4001/9928/9918/699/6491/84823/23310/284403
DOID:820     6280/6279/3627/29851/8792/1493/3934/6347/3689/3383/7295/6696/57817/4282/3119/6376/2833/5464/958
DOID:612     55388/7153/9837/29851/9636/1503/1493/7037/4173/3932/3559/6772/51311/3507/3561/917/3574/3575/919/4860/915/22806/5693/4938/1535/3458/959/5336/11151/3702/925/4688/64135/28755/50615/974/1794/3689/5788/5424/916/7096/4068/3937/30009/5695/3394/10525/100/7374/3659/940/939/4689/5880/7128/6891/4210/6789/5699/930/6573/11322/204/6850/10095/7124/3569/7097/7852/8772/5692/64170/3119/1956/28985/1053/5971/1536/10125/8456/8625/3071/7293/4478/1380/958/5054/5591/9437/10379/54440/3570/3978/3593/10625/29927/3558/51371/735

For each significant term, the core_enrichment column lists the core enriched genes as Entrez gene IDs separated by slashes.
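
Because core_enrichment is a slash-separated string, downstream work usually starts by splitting it into one vector of IDs per term. The following is a minimal base-R sketch. To stay self-contained it uses a hand-made data frame, res, that copies two rows from the output above; with a real result object, as.data.frame(x) should provide a comparable table.

```r
# Stand-in for the result table: two rows copied from the gseDO() output
res <- data.frame(
  ID = c("DOID:0060306", "DOID:2799"),
  setSize = c(10, 25),
  core_enrichment = c(
    "8318/81620/4174/990/23594/4998/64785/51053",
    "3627/6373/4283/3002/4318/6352/6347/6354/942/6361/6367"
  ),
  stringsAsFactors = FALSE
)

# Split each string on "/" and name the resulting list by term ID
core_genes <- setNames(strsplit(res$core_enrichment, "/", fixed = TRUE),
                       res$ID)

# Number of core genes per term, and the fraction of each gene set they cover
n_core <- lengths(core_genes)
n_core / res$setSize
```

The fractions, 0.80 and 0.44, correspond to the tags values reported in the leading_edge column for these two terms.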

Enhancing Readability with setReadable

To make the results more interpretable, use setReadable() to convert Entrez IDs to gene symbols:

library(clusterProfiler)
clusterProfiler v4.19.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan,
X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal
enrichment tool for interpreting omics data. The Innovation. 2021,
2(3):100141

Attaching package: 'clusterProfiler'
The following object is masked from 'package:stats':

    filter
y <- setReadable(x, 'org.Hs.eg.db')
head(y, 2)
                       ID               Description setSize enrichmentScore
DOID:0111962 DOID:0111962 combined immunodeficiency      61       0.6365097
DOID:0060306 DOID:0060306     Meier-Gorlin syndrome      10       0.9453691
                  NES       pvalue     p.adjust       qvalue rank
DOID:0111962 2.369943 3.087832e-08 1.651990e-05 7.836705e-06 2039
DOID:0060306 2.289140 5.867369e-08 2.092695e-05 9.927319e-06  452
                               leading_edge
DOID:0111962 tags=54%, list=16%, signal=45%
DOID:0060306  tags=80%, list=4%, signal=77%
             core_enrichment
DOID:0111962 GINS1/CTPS1/TFRC/LCK/IL2RA/TLR8/IL2RG/IL7/IL7R/PNP/CD3D/CD40LG/CORO1A/IL21R/DOCK2/ITGB2/PTPRC/POLD1/PSMB7/IRF8/HYOU1/ADA/RAC2/PSMB10/AK2/ARPC1B/RELB/RASGRP1/FOXN1/RFXANK/NCKAP1L/TNFRSF4/MSN
DOID:0060306 CDC45/CDT1/MCM5/CDC6/ORC6/ORC1/GINS3/GMNN

With gene symbols in place of Entrez IDs, the core enriched genes can be read directly, without looking up each identifier.
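
Conceptually, the conversion replaces each ID in the slash-separated string with its symbol. The sketch below illustrates that idea in base R with a hand-written lookup table; it is not how setReadable() is implemented. The helper to_symbols and the vector id2symbol are hypothetical names, and the eight ID-to-symbol pairs are read off the DOID:0060306 rows of the two outputs above.

```r
# Hand-written lookup table (Entrez ID -> gene symbol), taken from the
# DOID:0060306 rows before and after setReadable()
id2symbol <- c("8318" = "CDC45", "81620" = "CDT1", "4174" = "MCM5",
               "990" = "CDC6", "23594" = "ORC6", "4998" = "ORC1",
               "64785" = "GINS3", "51053" = "GMNN")

# Replace every ID in a slash-separated string by its symbol;
# IDs missing from the lookup table are kept unchanged
to_symbols <- function(core, map) {
  vapply(strsplit(core, "/", fixed = TRUE), function(ids) {
    sym <- unname(map[ids])
    missing <- is.na(sym)
    sym[missing] <- ids[missing]
    paste(sym, collapse = "/")
  }, character(1))
}

to_symbols("8318/81620/4174/990/23594/4998/64785/51053", id2symbol)
# "CDC45/CDT1/MCM5/CDC6/ORC6/ORC1/GINS3/GMNN"
```

In practice setReadable() is preferable, because it draws the mapping from the annotation package instead of a manually maintained table.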

For visualization of leading edge analysis results using cnetplot, please refer to the enrichplot chapter.

Non-Model Plant Annotation with clusterProfiler

For non-model plants and other organisms lacking standard annotation packages, clusterProfiler can be used with custom annotation data obtained from tools like eggNOG.

Workflow Overview

  1. Annotation with eggNOG: Use the eggNOG-mapper web server to annotate protein sequences
  2. Parse eggNOG Results: Extract GO and KEGG annotations using custom scripts
  3. Enrichment Analysis: Use clusterProfiler’s enricher() function with custom annotation data

Key Steps

1. eggNOG Annotation

Upload the protein sequences to eggNOG-mapper and choose parameters appropriate for your organism.

2. Parsing eggNOG Results

Use Python scripts to process eggNOG output files:

# Parse GO ontology file
python parse_go_obofile.py -i go-basic.obo -o go.tb

# Parse eggNOG annotations with reference species filtering
python parse_eggNOG.py -i panax_ginseng.annotations \
                       -g go.tb \
                       -O ath,osa \
                       -o output_directory

This generates two key files:

  • GOannotation.tsv: GO term annotations
  • KOannotation.tsv: KEGG pathway annotations
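
To give a sense of what the first parsing step involves, here is a small base-R sketch that turns the [Term] stanzas of an OBO file into a three-column table. It is an illustration only, not a port of parse_go_obofile.py: the function name parse_obo_terms and the column names GO, Description and level are assumptions, and the input is a tiny inline excerpt rather than the full go-basic.obo.

```r
# Tiny inline excerpt in OBO format (two GO root terms plus one Typedef)
obo_lines <- c(
  "format-version: 1.2",
  "",
  "[Term]",
  "id: GO:0003674",
  "name: molecular_function",
  "namespace: molecular_function",
  "",
  "[Term]",
  "id: GO:0008150",
  "name: biological_process",
  "namespace: biological_process",
  "",
  "[Typedef]",
  "id: part_of",
  "name: part of"
)

parse_obo_terms <- function(lines) {
  header <- grep("^\\[.*\\]$", lines)            # stanza header lines
  ends   <- c(header[-1] - 1, length(lines))     # last line of each stanza
  term_idx <- which(lines[header] == "[Term]")   # keep [Term] stanzas only
  # Return the value of the first "tag: value" line in a stanza, or NA
  get_tag <- function(block, tag) {
    pattern <- paste0("^", tag, ": ")
    hit <- grep(pattern, block, value = TRUE)
    if (length(hit) == 0) NA_character_ else sub(pattern, "", hit[1])
  }
  rows <- lapply(term_idx, function(i) {
    block <- lines[header[i]:ends[i]]
    data.frame(GO = get_tag(block, "id"),
               Description = get_tag(block, "name"),
               level = get_tag(block, "namespace"),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

go_tb <- parse_obo_terms(obo_lines)
go_tb
```

For a real file, readLines("go-basic.obo") would supply the lines argument.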

3. Enrichment Analysis with clusterProfiler

library(clusterProfiler)

# Read annotation files
KOannotation <- read.delim("KOannotation.tsv", stringsAsFactors=FALSE)
GOannotation <- read.delim("GOannotation.tsv", stringsAsFactors=FALSE)
GOinfo <- read.delim("go.tb", stringsAsFactors=FALSE)

# Your gene list
gene_list <- c("gene1", "gene2", "gene3")  # Replace with your actual gene list

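# GO enrichment (Molecular Function as example)
# TERM2GENE needs the term in column 1 and the gene in column 2, hence the
# column reordering; TERM2NAME maps term IDs to term names.
GOannotation_split <- split(GOannotation, GOannotation$level)
enricher(gene_list,
         TERM2GENE = GOannotation_split[['molecular_function']][c(2,1)],
         TERM2NAME = GOinfo[1:2])

# KEGG enrichment
# Columns are selected by position, so check that they match your files.
enricher(gene_list,
         TERM2GENE = KOannotation[c(3,1)],
         TERM2NAME = KOannotation[c(3,4)])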
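
enricher() performs an over-representation analysis based on the hypergeometric test. The sketch below reproduces the core of that test in base R on toy data, mainly to show the two-column TERM2GENE layout (term first, gene second) and how the background is formed. All term and gene names are hypothetical, the background is assumed to be every gene present in the annotation table, and the sketch omits the multiple-testing correction and gene set size filters that enricher() applies.

```r
# Toy TERM2GENE table: term in column 1, gene in column 2
term2gene <- data.frame(
  term = c(rep("termA", 4), rep("termB", 4), rep("termC", 2)),
  gene = paste0("g", 1:10),
  stringsAsFactors = FALSE
)
gene_list <- c("g1", "g2", "g3", "g9")   # genes of interest

universe <- unique(term2gene$gene)        # background: all annotated genes
query    <- intersect(gene_list, universe)
N <- length(universe)                     # background size
n <- length(query)                        # number of query genes

sets <- split(term2gene$gene, term2gene$term)
M <- lengths(sets)                                        # genes per term
k <- vapply(sets, function(s) length(intersect(s, query)),
            integer(1))                                   # query hits per term

# P(X >= k) under the hypergeometric distribution
p <- setNames(phyper(k - 1, M, N - M, n, lower.tail = FALSE), names(sets))

ora <- data.frame(term = names(sets), k = k, M = M, pvalue = p,
                  row.names = NULL)
ora[order(ora$pvalue), ]
```

For termA, three of the four query genes hit a term of size four in a background of ten genes, which gives a p-value of 25/210 (about 0.119).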

Advantages

  • Works for any organism with protein sequences
  • Uses reliable eggNOG annotation pipeline
  • Flexible reference species filtering for KEGG

Considerations

  • Requires intermediate Python scripting
  • Performance may vary with dataset size
  • Manual integration of annotation and analysis steps

This approach makes functional enrichment analysis possible for non-model organisms by combining clusterProfiler’s enricher() function with custom annotation data.