4  GO enrichment analysis

<environment: namespace:Rcpp>

GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological process (BP), and cellular component (CC).

4.1 Supported organisms

GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available (see also session 2.2).

If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.

If the genes are annotated by direction annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the buildGOmap function, which will infer indirect annotation and generate a data.frame that is suitable for both enricher() and GSEA().

4.2 GO classification

In clusterProfiler, the groupGO() function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList provided by DOSE.

library(clusterProfiler)
library(org.Hs.eg.db)

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

# Entrez gene ID
head(gene)
[1] "4312"  "8318"  "10874" "55143" "55388" "991"  
ggo <- groupGO(gene     = gene,
               OrgDb    = org.Hs.eg.db,
               ont      = "CC",
               level    = 3,
               readable = TRUE)

head(ggo)
                   ID                Description Count GeneRatio      geneID
GO:0000133 GO:0000133                 polarisome     0     0/207            
GO:0000417 GO:0000417                HIR complex     0     0/207            
GO:0000796 GO:0000796          condensin complex     2     2/207 NCAPH/NCAPG
GO:0000808 GO:0000808 origin recognition complex     0     0/207            
GO:0000930 GO:0000930      gamma-tubulin complex     0     0/207            
GO:0000939 GO:0000939          inner kinetochore     2     2/207 CENPM/CENPN

The gene parameter is a vector of gene IDs (can be any ID type that is supported by the corresponding OrgDb, see also session 16.1). If readable is set to TRUE, the input gene IDs will be converted to gene symbols.

4.3 GO over-representation analysis

The clusterProfiler package implements enrichGO() for gene ontology over-representation test.

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
        readable      = TRUE)
head(ego)
                   ID                              Description GeneRatio
GO:0072686 GO:0072686                          mitotic spindle    20/201
GO:0000775 GO:0000775           chromosome, centromeric region    21/201
GO:0005819 GO:0005819                                  spindle    27/201
GO:0000779 GO:0000779 condensed chromosome, centromeric region    18/201
GO:0000793 GO:0000793                     condensed chromosome    21/201
GO:0000776 GO:0000776                              kinetochore    17/201
             BgRatio RichFactor FoldEnrichment    zScore       pvalue
GO:0072686 160/11898 0.12500000       7.399254 10.682473 2.532128e-12
GO:0000775 202/11898 0.10396040       6.153835  9.684273 2.545942e-11
GO:0005819 352/11898 0.07670455       4.540451  8.838808 4.104222e-11
GO:0000779 149/11898 0.12080537       7.150957  9.904096 5.917376e-11
GO:0000793 222/11898 0.09459459       5.599435  9.068041 1.523011e-10
GO:0000776 139/11898 0.12230216       7.239558  9.699638 1.695228e-10
               p.adjust       qvalue
GO:0072686 7.292528e-10 6.690148e-10
GO:0000775 3.666156e-09 3.363323e-09
GO:0005819 3.940054e-09 3.614596e-09
GO:0000779 4.260511e-09 3.908583e-09
GO:0000793 8.137092e-09 7.464950e-09
GO:0000776 8.137092e-09 7.464950e-09
                                                                                                                                                                     geneID
GO:0072686                                        KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/TRAT1/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/DNALI1
GO:0000775                                       CDCA8/CDC20/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/AURKA/CCNB1
GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TRAT1/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A/DNALI1
GO:0000779                                                        CDC20/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
GO:0000793                                      CDC20/CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CHEK1/AURKA/CCNB1
GO:0000776                                                              CDC20/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/AURKA/CCNB1
           Count
GO:0072686    20
GO:0000775    21
GO:0005819    27
GO:0000779    18
GO:0000793    21
GO:0000776    17

Any gene ID type that is supported in OrgDb can be directly used in GO analyses. Users need to specify the keyType parameter to specify the input gene ID type.

gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
                keyType       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
head(ego2, 3)                
                   ID                              Description GeneRatio
GO:0072686 GO:0072686                          mitotic spindle    24/237
GO:0005819 GO:0005819                                  spindle    31/237
GO:0000779 GO:0000779 condensed chromosome, centromeric region    20/237
             BgRatio RichFactor FoldEnrichment   zScore       pvalue
GO:0072686 221/22665 0.10859729      10.385475 14.41284 1.199753e-17
GO:0005819 496/22665 0.06250000       5.977057 11.52093 1.398262e-15
GO:0000779 201/22665 0.09950249       9.515713 12.46587 3.670045e-14
               p.adjust       qvalue
GO:0072686 3.203340e-15 2.437393e-15
GO:0005819 1.866679e-13 1.420339e-13
GO:0000779 2.986106e-12 2.272101e-12
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    geneID
GO:0072686                                                                                                                 ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000013810/ENSG00000170312/ENSG00000164109/ENSG00000121621/ENSG00000138160/ENSG00000163519/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000163879
GO:0005819 ENSG00000134690/ENSG00000117399/ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000013810/ENSG00000117650/ENSG00000170312/ENSG00000164109/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000163519/ENSG00000112742/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000134057/ENSG00000090889/ENSG00000163879
GO:0000779                                                                                                                                                                                 ENSG00000117399/ENSG00000138778/ENSG00000080986/ENSG00000123485/ENSG00000154839/ENSG00000262634/ENSG00000117650/ENSG00000100162/ENSG00000166451/ENSG00000291964/ENSG00000186871/ENSG00000164109/ENSG00000121621/ENSG00000167513/ENSG00000089685/ENSG00000112742/ENSG00000109805/ENSG00000178999/ENSG00000087586/ENSG00000134057
           Count
GO:0072686    24
GO:0005819    31
GO:0000779    20

Gene IDs can be mapped to gene Symbols by using the parameter readable=TRUE or setReadable() function.

4.4 GO Gene Set Enrichment Analysis

The clusterProfiler package provides the gseGO() function for gene set enrichment analysis using gene ontology.

ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

The format of input data, geneList, was documented in the FAQ. Beware that only gene Set size in [minGSSize, maxGSSize] will be tested.

4.5 GO analysis for non-model organisms

Both the enrichGO() and gseGO() functions require an OrgDb object as the background annotation. For organisms that don’t have OrgDb provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb available, users can obtain GO annotation from other sources, e.g. from biomaRt, or annotate the genes using Blast2GO or the Trinotate pipeline. Then the enricher() or GSEA() functions can be used to perform GO analysis for these organisms, similar to the examples using wikiPathways and MSigDB. Another solution is to create an OrgDb on your own using AnnotationForge package.

Here is an example of querying GO annotation from Ensembl using biomaRt.

library(biomaRt)
ensembl <- useEnsemblGenomes(biomart = "plants_mart", dataset = "nattenuata_eg_gene")
gene2go <- getBM(attributes =c("ensembl_gene_id", "go_id"), mart=ensembl)

4.6 Visualize enriched GO terms as a directed acyclic graph

The goplot() function can accept the output of enrichGO and visualize the enriched GO induced graph.

(ref:goplotscap) Goplot of enrichment analysis.

(ref:goplotcap) Goplot of enrichment analysis.

goplot(ego)

(ref:goplotcap)

4.7 Summary

GO semantic similarity can be calculated by GOSemSim (Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.

References

Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An r Package for Measuring Semantic Similarity Among GO Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.