<environment: namespace:Rcpp>
4 GO enrichment analysis
GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological process (BP), and cellular component (CC).
4.1 Supported organisms
GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available (see also session 2.2).
If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis. Users can also read GO annotation from GMT files using the read.gmt() function, which parses the file into a data.frame suitable for these functions.
If the genes are annotated by direct annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the buildGOmap function, which will infer indirect annotation and generate a data.frame that is suitable for both enricher() and GSEA().
4.2 Direct vs indirect GO annotation
One common issue users encounter is the discrepancy between direct gene-to-GO mapping and the results from enrichment analysis. For instance, a user might find that a specific GO term is enriched, but when they check the input gene list against the direct annotations, the term doesn’t appear to be associated with many genes.
This happens because GO structure is a Directed Acyclic Graph (DAG). If a gene is annotated to a specific term (e.g., “glucose metabolic process”), it is implicitly annotated to all its ancestor terms (e.g., “metabolic process”). enrichGO and gseGO account for this indirect annotation by propagating annotations up the graph.
If you are providing your own annotation file (e.g., using enricher or GSEA), you must ensure that it includes these indirect annotations. If your annotation file only contains direct annotations, you can use the buildGOmap() function to generate the full annotation set (direct + indirect) before performing enrichment analysis.
# Assuming 'gomap' is a data.frame with direct annotations (GeneID, GOID)
# Build indirect annotations
gomap_with_ancestors <- buildGOmap(gomap)
# Use the result for enrichment
enricher(gene, TERM2GENE = gomap_with_ancestors, ...)This step is crucial for accurate GO enrichment analysis when not using OrgDb objects.
4.3 GO classification
In clusterProfiler, the groupGO() function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList provided by DOSE.
library(clusterProfiler)
library(org.Hs.eg.db)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
# Entrez gene ID
head(gene)[1] "4312" "8318" "10874" "55143" "55388" "991"
ggo <- groupGO(gene = gene,
OrgDb = org.Hs.eg.db,
ont = "CC",
level = 3,
readable = TRUE)
head(ggo) ID Description Count GeneRatio geneID
GO:0000133 GO:0000133 GO:0000133 0 0/207
GO:0000417 GO:0000417 HIR complex 0 0/207
GO:0000796 GO:0000796 condensin complex 2 2/207 NCAPH/NCAPG
GO:0000808 GO:0000808 origin recognition complex 0 0/207
GO:0000930 GO:0000930 gamma-tubulin complex 0 0/207
GO:0000939 GO:0000939 inner kinetochore 0 0/207
The gene parameter is a vector of gene IDs (can be any ID type that is supported by the corresponding OrgDb, see also Section 17.1). If readable is set to TRUE, the input gene IDs will be converted to gene symbols.
4.4 GO over-representation analysis
The clusterProfiler package implements enrichGO() for gene ontology over-representation test.
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego) ID Description GeneRatio
GO:0000775 GO:0000775 chromosome, centromeric region 21/201
GO:0000779 GO:0000779 condensed chromosome, centromeric region 18/201
GO:0072686 GO:0072686 mitotic spindle 17/201
GO:0000793 GO:0000793 condensed chromosome 21/201
GO:0098687 GO:0098687 chromosomal region 25/201
GO:0000776 GO:0000776 kinetochore 17/201
BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0000775 199/11914 0.10552764 6.255006 9.792716 1.864198e-11
GO:0000779 150/11914 0.12000000 7.112836 9.869278 6.484361e-11
GO:0072686 137/11914 0.12408759 7.355122 9.800345 1.316552e-10
GO:0000793 222/11914 0.09459459 5.606965 9.076565 1.486538e-10
GO:0098687 322/11914 0.07763975 4.601990 8.583527 1.732119e-10
GO:0000776 141/11914 0.12056738 7.146466 9.617585 2.087156e-10
p.adjust qvalue
GO:0000775 1.262062e-08 1.262062e-08
GO:0000779 2.194956e-08 2.194956e-08
GO:0072686 2.345289e-08 2.345289e-08
GO:0000793 2.345289e-08 2.345289e-08
GO:0098687 2.345289e-08 2.345289e-08
GO:0000776 2.355008e-08 2.355008e-08
geneID
GO:0000775 SKA1/CDCA8/ERCC6L/CCNB1/TTK/CENPN/AURKA/TOP2A/AURKB/CENPM/HJURP/NCAPG/NDC80/CDT1/EZH2/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0000779 SKA1/ERCC6L/CCNB1/TTK/CENPN/AURKA/AURKB/CENPM/HJURP/NCAPG/NDC80/CDT1/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0072686 SKA1/KIFC1/DLGAP5/KIF11/AURKA/KIF20A/AURKB/KIF23/CDK1/KIF18B/NUSAP1/ASPM/PRC1/MAD2L1/KIF18A/TPX2/CENPE
GO:0000793 SKA1/ERCC6L/CCNB1/TTK/CENPN/AURKA/TOP2A/AURKB/CENPM/HJURP/NCAPG/NDC80/CHEK1/CDT1/NCAPH/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0098687 SKA1/CDCA8/ERCC6L/CCNB1/TTK/CENPN/AURKA/TOP2A/MCM5/AURKB/CENPM/CDK1/RAD51AP1/HJURP/NCAPG/NDC80/CHEK1/CDT1/EZH2/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0000776 SKA1/ERCC6L/CCNB1/TTK/CENPN/AURKA/AURKB/CENPM/HJURP/NDC80/CDT1/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
Count
GO:0000775 21
GO:0000779 18
GO:0072686 17
GO:0000793 21
GO:0098687 25
GO:0000776 17
Any gene ID type that is supported in OrgDb can be directly used in GO analyses. Users need to specify the keyType parameter to specify the input gene ID type.
However, the GO annotation in OrgDb is indexed by ENTREZID. If the input ID type is not ENTREZID, clusterProfiler will first map the input IDs to ENTREZID using the bitr() function. This mapping step may lead to a loss of information, as some IDs might not have a corresponding ENTREZID or multiple IDs might map to the same ENTREZID (many-to-one mapping).
To address this issue, enrichGO and gseGO perform the mapping and use the mapped ENTREZIDs for enrichment analysis. The results are then mapped back to the original input ID type. This ensures that the gene counts and gene ratios in the final result accurately reflect the input IDs, avoiding discrepancies caused by the intermediate mapping step.
For example, if we use ENSEMBL IDs as input:
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(ego2, 3) ID Description GeneRatio
GO:0000775 GO:0000775 chromosome, centromeric region 21/202
GO:0000779 GO:0000779 condensed chromosome, centromeric region 18/202
GO:0098687 GO:0098687 chromosomal region 25/202
BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0000775 257/19979 0.08171206 8.081808 11.54800 1.899777e-13
GO:0000779 190/19979 0.09473684 9.370036 11.71558 8.795611e-13
GO:0098687 417/19979 0.05995204 5.929613 10.28124 9.528885e-13
p.adjust qvalue
GO:0000775 1.453329e-10 1.453329e-10
GO:0000779 2.429866e-10 2.429866e-10
GO:0098687 2.429866e-10 2.429866e-10
geneID
GO:0000775 ENSG00000154839/ENSG00000134690/ENSG00000186871/ENSG00000134057/ENSG00000112742/ENSG00000166451/ENSG00000087586/ENSG00000131747/ENSG00000178999/ENSG00000100162/ENSG00000123485/ENSG00000109805/ENSG00000080986/ENSG00000167513/ENSG00000106462/ENSG00000089685/ENSG00000164109/ENSG00000117650/ENSG00000121621/ENSG00000117399/ENSG00000138778
GO:0000779 ENSG00000154839/ENSG00000186871/ENSG00000134057/ENSG00000112742/ENSG00000166451/ENSG00000087586/ENSG00000178999/ENSG00000100162/ENSG00000123485/ENSG00000109805/ENSG00000080986/ENSG00000167513/ENSG00000089685/ENSG00000164109/ENSG00000117650/ENSG00000121621/ENSG00000117399/ENSG00000138778
GO:0098687 ENSG00000154839/ENSG00000134690/ENSG00000186871/ENSG00000134057/ENSG00000112742/ENSG00000166451/ENSG00000087586/ENSG00000131747/ENSG00000100297/ENSG00000178999/ENSG00000100162/ENSG00000170312/ENSG00000111247/ENSG00000123485/ENSG00000109805/ENSG00000080986/ENSG00000149554/ENSG00000167513/ENSG00000106462/ENSG00000089685/ENSG00000164109/ENSG00000117650/ENSG00000121621/ENSG00000117399/ENSG00000138778
Count
GO:0000775 21
GO:0000779 18
GO:0098687 25
Gene IDs can be mapped to gene Symbols by using the parameter readable=TRUE or setReadable() function.
4.5 GO Gene Set Enrichment Analysis
The clusterProfiler package provides the gseGO() function for gene set enrichment analysis using gene ontology.
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)The format of input data, geneList, is documented in the Section 17.1. Beware that only gene sets with size in [minGSSize, maxGSSize] will be tested.
4.6 GO analysis for non-model organisms
Both the enrichGO() and gseGO() functions require an OrgDb object as the background annotation. For organisms that don’t have OrgDb provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb available, users can obtain GO annotation from other sources, e.g. from biomaRt, or annotate the genes using Blast2GO or the Trinotate pipeline. Then the enricher() or GSEA() functions can be used to perform GO analysis for these organisms, similar to the examples using WikiPathways and MSigDB. Another solution is to create an OrgDb on your own using the AnnotationForge package.
Here is an example of querying GO annotation from Ensembl using biomaRt.
library(biomaRt)
ensembl <- useEnsemblGenomes(biomart = "plants_mart", dataset = "nattenuata_eg_gene")
gene2go <- getBM(attributes =c("ensembl_gene_id", "go_id"), mart=ensembl)Alternatively, you can use AnnotationHub to query and retrieve OrgDb objects for a wide range of organisms, including non-model species.
Example: GO enrichment analysis for Maize (Zea mays)
library(AnnotationHub)
hub <- AnnotationHub()
# Query for Zea mays OrgDb
query(hub, "zea")
# Retrieve the OrgDb (e.g., AH55736, assume it's the correct record)
maize <- hub[['AH55736']]
# Check keys
length(keys(maize))
columns(maize)
# Perform enrichment analysis
# Assuming 'sample_genes' is a vector of Entrez IDs
res <- enrichGO(sample_genes, OrgDb=maize, pvalueCutoff=0.05, qvalueCutoff=0.05)This approach allows you to use the standard enrichGO workflow with organisms not included in the default Bioconductor OrgDb packages.
4.7 Visualization of GO enrichment results
The clusterProfiler package provides several visualization methods to help interpret enrichment results, including barplot, dotplot, cnetplot, emapplot, goplot and plotGOgraph.
Please refer to the Chapter 13 chapter for details.
4.8 Filtering and simplifying GO terms
GO enrichment analysis often results in many redundant terms due to the hierarchical structure of GO. Parent terms and child terms often share a large proportion of genes, leading to multiple significant results that represent similar biological processes.
4.8.1 Removing specific terms or levels
The dropGO() function can be used to remove specific GO terms or GO levels from the results. The enrichGO() function tests the whole GO corpus and enriched result may contains very general terms. If users want to restrict the result to a specific GO level (e.g., level 3 or 4), they can use the gofilter() function. Both functions work with results obtained from enrichGO(), gseGO() and compareCluster().
4.8.2 Simplifying enriched GO terms
To reduce redundancy, clusterProfiler provides a simplify method. It uses semantic similarity (via GOSemSim) to calculate the similarity between enriched terms and removes highly similar terms, keeping the most significant one.
The simplify method applies select_fun (which can be a user-defined function) to the feature specified by by (e.g., p.adjust) to select one representative term from redundant terms (which have similarity higher than cutoff). This results in a cleaner and more interpretable visualization.
In Figure 4.1(A), we can found that there are many redundant terms form a highly condense network. After removing redundant terms using the simplify() method, the result is more clear to view the whole story.
#|
library(enrichplot)
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 2]
bp <- enrichGO(de, ont="BP", OrgDb = 'org.Hs.eg.db')
bp <- pairwise_termsim(bp)
bp2 <- simplify(bp, cutoff=0.7, by="p.adjust", select_fun=min)
p1 <- emapplot(bp)
p2 <- emapplot(bp2)
plot_list(p1, p2, ncol=2, tag_levels = 'A')emapplot() function. (A) original result. (B) simplify result.
Alternatively, users can use slim version of GO and use the enricher() or gseGO() functions to analyze.
Note on enricher results: The simplify() method is designed for enrichGO() results where the ontology (BP, CC, or MF) is known. If you use enricher() to perform GO analysis (e.g., with custom annotation), the result object does not strictly define the ontology. To use simplify(), you must explicitly set the ontology slot in the result object:
# Assuming 'res' is the result from enricher() using GO annotation
# Set the ontology (must be one of "BP", "MF", "CC")
res@ontology <- "BP"
res_simplified <- simplify(res)4.9 Troubleshooting
4.9.1 No results found
It is common to receive a message saying “No gene set have size > 10 … return NULL” or finding no significant terms.
- Gene Set Size: The default
minGSSizeis 10 andmaxGSSizeis 500. If your background annotations for a specific term have fewer than 10 genes or more than 500, they are excluded. You can adjust these parameters. - Universe: If you provide a custom
universe(background gene list), ensure it is large enough. A small universe can make it statistically difficult to find enrichment. - P-value Adjustment:
clusterProfileruses multiple testing correction (e.g., BH) by default. Some web tools might use raw p-values or different thresholds, leading to more (but potentially false positive) results.clusterProfilertends to be more conservative and reliable.
4.9.2 Annotation quality
Sometimes, the OrgDb provided by Bioconductor might be outdated compared to the latest online databases (e.g., TAIR for Arabidopsis).
- Check Dates: Always check the metadata of the
OrgDbobject to see the source and date of the annotation. - Custom Annotation: If the
OrgDbis too old, consider downloading the latest annotation (e.g., GAF or GOSLIM files) from the organism’s primary database and usingenricher()/GSEA()withbuildGOmap()as described in the Direct vs indirect GO annotation section.
4.10 Summary
GO semantic similarity can be calculated by GOSemSim (Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.