<environment: namespace:Rcpp>
4 GO enrichment analysis
GO comprises three orthogonal ontologies, i.e. molecular function (MF), biological process (BP), and cellular component (CC).
4.1 Supported organisms
GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available (see also session 2.2).
If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.
If the genes are annotated by direction annotation, they should also be annotated by their ancestor GO nodes (indirect annotation). If a user only has direct annotation, they can pass their annotation to the buildGOmap function, which will infer indirect annotation and generate a data.frame that is suitable for both enricher() and GSEA().
4.2 GO classification
In clusterProfiler, the groupGO() function is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList provided by DOSE.
[1] "4312" "8318" "10874" "55143" "55388" "991"
ID Description Count GeneRatio geneID
GO:0000133 GO:0000133 polarisome 0 0/207
GO:0000417 GO:0000417 HIR complex 0 0/207
GO:0000796 GO:0000796 condensin complex 2 2/207 NCAPH/NCAPG
GO:0000808 GO:0000808 origin recognition complex 0 0/207
GO:0000930 GO:0000930 gamma-tubulin complex 0 0/207
GO:0000939 GO:0000939 inner kinetochore 2 2/207 CENPM/CENPN
The gene parameter is a vector of gene IDs (can be any ID type that is supported by the corresponding OrgDb, see also session 16.1). If readable is set to TRUE, the input gene IDs will be converted to gene symbols.
4.3 GO over-representation analysis
The clusterProfiler package implements enrichGO() for gene ontology over-representation test.
ID Description GeneRatio
GO:0072686 GO:0072686 mitotic spindle 20/201
GO:0000775 GO:0000775 chromosome, centromeric region 21/201
GO:0005819 GO:0005819 spindle 27/201
GO:0000779 GO:0000779 condensed chromosome, centromeric region 18/201
GO:0000793 GO:0000793 condensed chromosome 21/201
GO:0000776 GO:0000776 kinetochore 17/201
BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0072686 160/11898 0.12500000 7.399254 10.682473 2.532128e-12
GO:0000775 202/11898 0.10396040 6.153835 9.684273 2.545942e-11
GO:0005819 352/11898 0.07670455 4.540451 8.838808 4.104222e-11
GO:0000779 149/11898 0.12080537 7.150957 9.904096 5.917376e-11
GO:0000793 222/11898 0.09459459 5.599435 9.068041 1.523011e-10
GO:0000776 139/11898 0.12230216 7.239558 9.699638 1.695228e-10
p.adjust qvalue
GO:0072686 7.292528e-10 6.690148e-10
GO:0000775 3.666156e-09 3.363323e-09
GO:0005819 3.940054e-09 3.614596e-09
GO:0000779 4.260511e-09 3.908583e-09
GO:0000793 8.137092e-09 7.464950e-09
GO:0000776 8.137092e-09 7.464950e-09
geneID
GO:0072686 KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/TRAT1/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/DNALI1
GO:0000775 CDCA8/CDC20/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/AURKA/CCNB1
GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TRAT1/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A/DNALI1
GO:0000779 CDC20/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
GO:0000793 CDC20/CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CHEK1/AURKA/CCNB1
GO:0000776 CDC20/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/AURKA/CCNB1
Count
GO:0072686 20
GO:0000775 21
GO:0005819 27
GO:0000779 18
GO:0000793 21
GO:0000776 17
Any gene ID type that is supported in OrgDb can be directly used in GO analyses. Users need to specify the keyType parameter to specify the input gene ID type.
ID Description GeneRatio
GO:0072686 GO:0072686 mitotic spindle 24/237
GO:0005819 GO:0005819 spindle 31/237
GO:0000779 GO:0000779 condensed chromosome, centromeric region 20/237
BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0072686 221/22665 0.10859729 10.385475 14.41284 1.199753e-17
GO:0005819 496/22665 0.06250000 5.977057 11.52093 1.398262e-15
GO:0000779 201/22665 0.09950249 9.515713 12.46587 3.670045e-14
p.adjust qvalue
GO:0072686 3.203340e-15 2.437393e-15
GO:0005819 1.866679e-13 1.420339e-13
GO:0000779 2.986106e-12 2.272101e-12
geneID
GO:0072686 ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000013810/ENSG00000170312/ENSG00000164109/ENSG00000121621/ENSG00000138160/ENSG00000163519/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000163879
GO:0005819 ENSG00000134690/ENSG00000117399/ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000013810/ENSG00000117650/ENSG00000170312/ENSG00000164109/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000163519/ENSG00000112742/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000134057/ENSG00000090889/ENSG00000163879
GO:0000779 ENSG00000117399/ENSG00000138778/ENSG00000080986/ENSG00000123485/ENSG00000154839/ENSG00000262634/ENSG00000117650/ENSG00000100162/ENSG00000166451/ENSG00000291964/ENSG00000186871/ENSG00000164109/ENSG00000121621/ENSG00000167513/ENSG00000089685/ENSG00000112742/ENSG00000109805/ENSG00000178999/ENSG00000087586/ENSG00000134057
Count
GO:0072686 24
GO:0005819 31
GO:0000779 20
Gene IDs can be mapped to gene Symbols by using the parameter readable=TRUE or setReadable() function.
4.4 GO Gene Set Enrichment Analysis
The clusterProfiler package provides the gseGO() function for gene set enrichment analysis using gene ontology.
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)The format of input data, geneList, was documented in the FAQ. Beware that only gene Set size in [minGSSize, maxGSSize] will be tested.
4.5 GO analysis for non-model organisms
Both the enrichGO() and gseGO() functions require an OrgDb object as the background annotation. For organisms that don’t have OrgDb provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb available, users can obtain GO annotation from other sources, e.g. from biomaRt, or annotate the genes using Blast2GO or the Trinotate pipeline. Then the enricher() or GSEA() functions can be used to perform GO analysis for these organisms, similar to the examples using wikiPathways and MSigDB. Another solution is to create an OrgDb on your own using AnnotationForge package.
Here is an example of querying GO annotation from Ensembl using biomaRt.
library(biomaRt)
ensembl <- useEnsemblGenomes(biomart = "plants_mart", dataset = "nattenuata_eg_gene")
gene2go <- getBM(attributes =c("ensembl_gene_id", "go_id"), mart=ensembl)4.6 Visualize enriched GO terms as a directed acyclic graph
The goplot() function can accept the output of enrichGO and visualize the enriched GO induced graph.
(ref:goplotscap) Goplot of enrichment analysis.
(ref:goplotcap) Goplot of enrichment analysis.
goplot(ego)4.7 Summary
GO semantic similarity can be calculated by GOSemSim (Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.