4  GO enrichment analysis

<environment: namespace:Rcpp>

GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological process (BP), and cellular component (CC).

4.1 Supported organisms

GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available (see also session 2.2).

If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis. Users can also read GO annotation from GMT files using the read.gmt() function, which parses the file into a data.frame suitable for these functions.

If the genes are annotated by direct annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the buildGOmap function, which will infer indirect annotation and generate a data.frame that is suitable for both enricher() and GSEA().

4.2 Direct vs indirect GO annotation

One common issue users encounter is the discrepancy between direct gene-to-GO mapping and the results from enrichment analysis. For instance, a user might find that a specific GO term is enriched, but when they check the input gene list against the direct annotations, the term doesn’t appear to be associated with many genes.

This happens because GO structure is a Directed Acyclic Graph (DAG). If a gene is annotated to a specific term (e.g., “glucose metabolic process”), it is implicitly annotated to all its ancestor terms (e.g., “metabolic process”). enrichGO and gseGO account for this indirect annotation by propagating annotations up the graph.

If you are providing your own annotation file (e.g., using enricher or GSEA), you must ensure that it includes these indirect annotations. If your annotation file only contains direct annotations, you can use the buildGOmap() function to generate the full annotation set (direct + indirect) before performing enrichment analysis.

# Assuming 'gomap' is a data.frame with direct annotations (GeneID, GOID)
# Build indirect annotations
gomap_with_ancestors <- buildGOmap(gomap)

# Use the result for enrichment
enricher(gene, TERM2GENE = gomap_with_ancestors, ...)

This step is crucial for accurate GO enrichment analysis when not using OrgDb objects.

4.3 GO classification

In clusterProfiler, the groupGO() function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList provided by DOSE.

library(clusterProfiler)
library(org.Hs.eg.db)

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

# Entrez gene ID
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  
ggo <- groupGO(gene     = gene,
               OrgDb    = org.Hs.eg.db,
               ont      = "CC",
               level    = 3,
               readable = TRUE)

head(ggo)
                   ID                Description Count GeneRatio      geneID
GO:0000133 GO:0000133                 GO:0000133     0     0/207            
GO:0000417 GO:0000417                HIR complex     0     0/207            
GO:0000796 GO:0000796          condensin complex     2     2/207 NCAPH/NCAPG
GO:0000808 GO:0000808 origin recognition complex     0     0/207            
GO:0000930 GO:0000930      gamma-tubulin complex     0     0/207            
GO:0000939 GO:0000939          inner kinetochore     0     0/207            

The gene parameter is a vector of gene IDs (can be any ID type that is supported by the corresponding OrgDb, see also Section 17.1). If readable is set to TRUE, the input gene IDs will be converted to gene symbols.

4.4 GO over-representation analysis

The clusterProfiler package implements enrichGO() for gene ontology over-representation test.

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
        readable      = TRUE)
head(ego)
                   ID                              Description GeneRatio
GO:0000775 GO:0000775           chromosome, centromeric region    21/201
GO:0000779 GO:0000779 condensed chromosome, centromeric region    18/201
GO:0072686 GO:0072686                          mitotic spindle    17/201
GO:0000793 GO:0000793                     condensed chromosome    21/201
GO:0098687 GO:0098687                       chromosomal region    25/201
GO:0000776 GO:0000776                              kinetochore    17/201
             BgRatio RichFactor FoldEnrichment   zScore       pvalue
GO:0000775 199/11914 0.10552764       6.255006 9.792716 1.864198e-11
GO:0000779 150/11914 0.12000000       7.112836 9.869278 6.484361e-11
GO:0072686 137/11914 0.12408759       7.355122 9.800345 1.316552e-10
GO:0000793 222/11914 0.09459459       5.606965 9.076565 1.486538e-10
GO:0098687 322/11914 0.07763975       4.601990 8.583527 1.732119e-10
GO:0000776 141/11914 0.12056738       7.146466 9.617585 2.087156e-10
               p.adjust       qvalue
GO:0000775 1.262062e-08 1.262062e-08
GO:0000779 2.194956e-08 2.194956e-08
GO:0072686 2.345289e-08 2.345289e-08
GO:0000793 2.345289e-08 2.345289e-08
GO:0098687 2.345289e-08 2.345289e-08
GO:0000776 2.355008e-08 2.355008e-08
                                                                                                                                                        geneID
GO:0000775                          SKA1/CDCA8/ERCC6L/CCNB1/TTK/CENPN/AURKA/TOP2A/AURKB/CENPM/HJURP/NCAPG/NDC80/CDT1/EZH2/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0000779                                           SKA1/ERCC6L/CCNB1/TTK/CENPN/AURKA/AURKB/CENPM/HJURP/NCAPG/NDC80/CDT1/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0072686                                              SKA1/KIFC1/DLGAP5/KIF11/AURKA/KIF20A/AURKB/KIF23/CDK1/KIF18B/NUSAP1/ASPM/PRC1/MAD2L1/KIF18A/TPX2/CENPE
GO:0000793                         SKA1/ERCC6L/CCNB1/TTK/CENPN/AURKA/TOP2A/AURKB/CENPM/HJURP/NCAPG/NDC80/CHEK1/CDT1/NCAPH/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0098687 SKA1/CDCA8/ERCC6L/CCNB1/TTK/CENPN/AURKA/TOP2A/MCM5/AURKB/CENPM/CDK1/RAD51AP1/HJURP/NCAPG/NDC80/CHEK1/CDT1/EZH2/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
GO:0000776                                                 SKA1/ERCC6L/CCNB1/TTK/CENPN/AURKA/AURKB/CENPM/HJURP/NDC80/CDT1/BIRC5/MAD2L1/NEK2/KIF18A/CDC20/CENPE
           Count
GO:0000775    21
GO:0000779    18
GO:0072686    17
GO:0000793    21
GO:0098687    25
GO:0000776    17

Any gene ID type that is supported in OrgDb can be directly used in GO analyses. Users need to specify the keyType parameter to specify the input gene ID type.

However, the GO annotation in OrgDb is indexed by ENTREZID. If the input ID type is not ENTREZID, clusterProfiler will first map the input IDs to ENTREZID using the bitr() function. This mapping step may lead to a loss of information, as some IDs might not have a corresponding ENTREZID or multiple IDs might map to the same ENTREZID (many-to-one mapping).

To address this issue, enrichGO and gseGO perform the mapping and use the mapped ENTREZIDs for enrichment analysis. The results are then mapped back to the original input ID type. This ensures that the gene counts and gene ratios in the final result accurately reflect the input IDs, avoiding discrepancies caused by the intermediate mapping step.

For example, if we use ENSEMBL IDs as input:

gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
                keyType       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
head(ego2, 3)                
                   ID                              Description GeneRatio
GO:0000775 GO:0000775           chromosome, centromeric region    21/202
GO:0000779 GO:0000779 condensed chromosome, centromeric region    18/202
GO:0098687 GO:0098687                       chromosomal region    25/202
             BgRatio RichFactor FoldEnrichment   zScore       pvalue
GO:0000775 257/19979 0.08171206       8.081808 11.54800 1.899777e-13
GO:0000779 190/19979 0.09473684       9.370036 11.71558 8.795611e-13
GO:0098687 417/19979 0.05995204       5.929613 10.28124 9.528885e-13
               p.adjust       qvalue
GO:0000775 1.453329e-10 1.453329e-10
GO:0000779 2.429866e-10 2.429866e-10
GO:0098687 2.429866e-10 2.429866e-10
                                                                                                                                                                                                                                                                                                                                                                                                                    geneID
GO:0000775                                                                 ENSG00000154839/ENSG00000134690/ENSG00000186871/ENSG00000134057/ENSG00000112742/ENSG00000166451/ENSG00000087586/ENSG00000131747/ENSG00000178999/ENSG00000100162/ENSG00000123485/ENSG00000109805/ENSG00000080986/ENSG00000167513/ENSG00000106462/ENSG00000089685/ENSG00000164109/ENSG00000117650/ENSG00000121621/ENSG00000117399/ENSG00000138778
GO:0000779                                                                                                                 ENSG00000154839/ENSG00000186871/ENSG00000134057/ENSG00000112742/ENSG00000166451/ENSG00000087586/ENSG00000178999/ENSG00000100162/ENSG00000123485/ENSG00000109805/ENSG00000080986/ENSG00000167513/ENSG00000089685/ENSG00000164109/ENSG00000117650/ENSG00000121621/ENSG00000117399/ENSG00000138778
GO:0098687 ENSG00000154839/ENSG00000134690/ENSG00000186871/ENSG00000134057/ENSG00000112742/ENSG00000166451/ENSG00000087586/ENSG00000131747/ENSG00000100297/ENSG00000178999/ENSG00000100162/ENSG00000170312/ENSG00000111247/ENSG00000123485/ENSG00000109805/ENSG00000080986/ENSG00000149554/ENSG00000167513/ENSG00000106462/ENSG00000089685/ENSG00000164109/ENSG00000117650/ENSG00000121621/ENSG00000117399/ENSG00000138778
           Count
GO:0000775    21
GO:0000779    18
GO:0098687    25

Gene IDs can be mapped to gene Symbols by using the parameter readable=TRUE or setReadable() function.

4.5 GO Gene Set Enrichment Analysis

The clusterProfiler package provides the gseGO() function for gene set enrichment analysis using gene ontology.

ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

The format of input data, geneList, is documented in the Section 17.1. Beware that only gene sets with size in [minGSSize, maxGSSize] will be tested.

4.6 GO analysis for non-model organisms

Both the enrichGO() and gseGO() functions require an OrgDb object as the background annotation. For organisms that don’t have OrgDb provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb available, users can obtain GO annotation from other sources, e.g. from biomaRt, or annotate the genes using Blast2GO or the Trinotate pipeline. Then the enricher() or GSEA() functions can be used to perform GO analysis for these organisms, similar to the examples using WikiPathways and MSigDB. Another solution is to create an OrgDb on your own using the AnnotationForge package.

Here is an example of querying GO annotation from Ensembl using biomaRt.

library(biomaRt)
ensembl <- useEnsemblGenomes(biomart = "plants_mart", dataset = "nattenuata_eg_gene")
gene2go <- getBM(attributes =c("ensembl_gene_id", "go_id"), mart=ensembl)

Alternatively, you can use AnnotationHub to query and retrieve OrgDb objects for a wide range of organisms, including non-model species.

Example: GO enrichment analysis for Maize (Zea mays)

library(AnnotationHub)
hub <- AnnotationHub()

# Query for Zea mays OrgDb
query(hub, "zea")
# Retrieve the OrgDb (e.g., AH55736, assume it's the correct record)
maize <- hub[['AH55736']]

# Check keys
length(keys(maize))
columns(maize)

# Perform enrichment analysis
# Assuming 'sample_genes' is a vector of Entrez IDs
res <- enrichGO(sample_genes, OrgDb=maize, pvalueCutoff=0.05, qvalueCutoff=0.05)

This approach allows you to use the standard enrichGO workflow with organisms not included in the default Bioconductor OrgDb packages.

4.7 Visualization of GO enrichment results

The clusterProfiler package provides several visualization methods to help interpret enrichment results, including barplot, dotplot, cnetplot, emapplot, goplot and plotGOgraph.

Please refer to the Chapter 13 chapter for details.

4.8 Filtering and simplifying GO terms

GO enrichment analysis often results in many redundant terms due to the hierarchical structure of GO. Parent terms and child terms often share a large proportion of genes, leading to multiple significant results that represent similar biological processes.

4.8.1 Removing specific terms or levels

The dropGO() function can be used to remove specific GO terms or GO levels from the results. The enrichGO() function tests the whole GO corpus and enriched result may contains very general terms. If users want to restrict the result to a specific GO level (e.g., level 3 or 4), they can use the gofilter() function. Both functions work with results obtained from enrichGO(), gseGO() and compareCluster().

4.8.2 Simplifying enriched GO terms

To reduce redundancy, clusterProfiler provides a simplify method. It uses semantic similarity (via GOSemSim) to calculate the similarity between enriched terms and removes highly similar terms, keeping the most significant one.

The simplify method applies select_fun (which can be a user-defined function) to the feature specified by by (e.g., p.adjust) to select one representative term from redundant terms (which have similarity higher than cutoff). This results in a cleaner and more interpretable visualization.

In Figure 4.1(A), we can found that there are many redundant terms form a highly condense network. After removing redundant terms using the simplify() method, the result is more clear to view the whole story.

#|
library(enrichplot)
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 2]
bp <- enrichGO(de, ont="BP", OrgDb = 'org.Hs.eg.db')
bp <- pairwise_termsim(bp)
bp2 <- simplify(bp, cutoff=0.7, by="p.adjust", select_fun=min)
p1 <- emapplot(bp)
p2 <- emapplot(bp2)
plot_list(p1, p2, ncol=2, tag_levels = 'A')
Figure 4.1: Visualize enriched terms by EnrichmentMap using the emapplot() function. (A) original result. (B) simplify result.

Alternatively, users can use slim version of GO and use the enricher() or gseGO() functions to analyze.

Note on enricher results: The simplify() method is designed for enrichGO() results where the ontology (BP, CC, or MF) is known. If you use enricher() to perform GO analysis (e.g., with custom annotation), the result object does not strictly define the ontology. To use simplify(), you must explicitly set the ontology slot in the result object:

# Assuming 'res' is the result from enricher() using GO annotation
# Set the ontology (must be one of "BP", "MF", "CC")
res@ontology <- "BP" 
res_simplified <- simplify(res)

4.9 Troubleshooting

4.9.1 No results found

It is common to receive a message saying “No gene set have size > 10 … return NULL” or finding no significant terms.

  1. Gene Set Size: The default minGSSize is 10 and maxGSSize is 500. If your background annotations for a specific term have fewer than 10 genes or more than 500, they are excluded. You can adjust these parameters.
  2. Universe: If you provide a custom universe (background gene list), ensure it is large enough. A small universe can make it statistically difficult to find enrichment.
  3. P-value Adjustment: clusterProfiler uses multiple testing correction (e.g., BH) by default. Some web tools might use raw p-values or different thresholds, leading to more (but potentially false positive) results. clusterProfiler tends to be more conservative and reliable.

4.9.2 Annotation quality

Sometimes, the OrgDb provided by Bioconductor might be outdated compared to the latest online databases (e.g., TAIR for Arabidopsis).

  • Check Dates: Always check the metadata of the OrgDb object to see the source and date of the annotation.
  • Custom Annotation: If the OrgDb is too old, consider downloading the latest annotation (e.g., GAF or GOSLIM files) from the organism’s primary database and using enricher()/GSEA() with buildGOmap() as described in the Direct vs indirect GO annotation section.

4.10 Summary

GO semantic similarity can be calculated by GOSemSim (Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.

References

Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An r Package for Measuring Semantic Similarity Among GO Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.