Chapter 5 Gene Ontology Analysis
5.1 Supported organisms
GO analyses (groupGO()
, enrichGO()
and gseGO()
) support organisms that have an OrgDb
object available.
Bioconductor have already provide OrgDb
for about 20 species. User can query OrgDb
online by AnnotationHub or build their own by AnnotationForge. An example can be found in the vignette of GOSemSim.
If user have GO annotation data (in data.frame format with first column of gene ID and second column of GO ID), they can use enricher()
and gseGO()
functions to perform over-representation test and gene set enrichment analysis.
If genes are annotated by direction annotation, it should also annotated by its ancestor GO nodes (indirect annation). If user only has direct annotation, they can pass their annotation to buildGOmap
function, which will infer indirection annotation and generate a data.frame
that suitable for both enricher()
and gseGO()
.
5.2 GO classification
In clusterProfiler, groupGO
is designed for gene classification based on GO distribution at a specific level. Here we use dataset geneList
provided by DOSE. Please refer to vignette of DOSE for more details.
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
head(gene.df)
## ENTREZID ENSEMBL SYMBOL
## 1 4312 ENSG00000196611 MMP1
## 2 8318 ENSG00000093009 CDC45
## 3 10874 ENSG00000109255 NMU
## 4 55143 ENSG00000134690 CDCA8
## 5 55388 ENSG00000065328 MCM10
## 6 991 ENSG00000117399 CDC20
## ID Description Count
## GO:0005886 GO:0005886 plasma membrane 55
## GO:0005628 GO:0005628 prospore membrane 0
## GO:0005789 GO:0005789 endoplasmic reticulum membrane 8
## GO:0019867 GO:0019867 outer membrane 3
## GO:0031090 GO:0031090 organelle membrane 16
## GO:0034357 GO:0034357 photosynthetic membrane 0
## GeneRatio
## GO:0005886 55/207
## GO:0005628 0/207
## GO:0005789 8/207
## GO:0019867 3/207
## GO:0031090 16/207
## GO:0034357 0/207
## geneID
## GO:0005886 S100A9/MELK/S100A8/MARCO/ASPM/CXCL10/LAMP3/CEP55/UGT8/UBE2C/SLC7A5/CXCL9/FADS2/MSLN/IL1R2/KIF18A/S100P/GZMB/TRAT1/GABRP/AQP9/GPR19/SLC2A6/KIF20A/LAG3/NUDT1/CACNA1D/VSTM4/ITPR1/SYT17/SLC16A4/CORIN/KCNK15/CA12/KCNE4/HLA-DQA1/ADH1B/PDZK1/C7/ACKR1/COL17A1/PSD3/EMCN/SLC44A4/LRP2/NLGN4X/MAPT/ERBB4/CX3CR1/LAMP5/ABCA8/STEAP4/PTPRT/TMC5/CYBRD1
## GO:0005628
## GO:0005789 FADS2/CDK1/CHODL/ITPR1/HLA-DQA1/CYP4F8/CYP4B1/FMO5
## GO:0019867 BCL2A1/MAOB/PGR
## GO:0031090 MARCO/BCL2A1/LAMP3/DUSP2/SLC2A6/DTL/NUDT1/MAOB/ITPR1/GASK1B/HLA-DQA1/LRP2/LAMP5/STEAP4/PGR/CYBRD1
## GO:0034357
The input parameters of gene is a vector of gene IDs (can be any ID type that supported by corresponding OrgDb
).
If readable is setting to TRUE, the input gene IDs will be converted to gene symbols.
5.3 GO over-representation test
Over-representation test(Boyle et al. 2004) were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
## ID
## GO:0005819 GO:0005819
## GO:0005876 GO:0005876
## GO:0000779 GO:0000779
## GO:0072686 GO:0072686
## GO:0000775 GO:0000775
## GO:0000776 GO:0000776
## Description
## GO:0005819 spindle
## GO:0005876 spindle microtubule
## GO:0000779 condensed chromosome, centromeric region
## GO:0072686 mitotic spindle
## GO:0000775 chromosome, centromeric region
## GO:0000776 kinetochore
## GeneRatio BgRatio pvalue p.adjust
## GO:0005819 25/200 272/11816 4.695505e-12 1.493171e-09
## GO:0005876 12/200 48/11816 1.623758e-11 2.581776e-09
## GO:0000779 15/200 91/11816 2.808998e-11 2.628919e-09
## GO:0072686 15/200 92/11816 3.306816e-11 2.628919e-09
## GO:0000775 18/200 154/11816 1.061302e-10 6.749884e-09
## GO:0000776 15/200 107/11816 3.047081e-10 1.614953e-08
## qvalue
## GO:0005819 1.378996e-09
## GO:0005876 2.384361e-09
## GO:0000779 2.427899e-09
## GO:0072686 2.427899e-09
## GO:0000775 6.233756e-09
## GO:0000776 1.491466e-08
## geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876 CENPE/SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000779 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/AURKA/CCNB1
## GO:0072686 KIF23/CENPE/ASPM/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/AURKB/KIFC1/KIF18B/AURKA
## GO:0000775 CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
## GO:0000776 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
## Count
## GO:0005819 25
## GO:0005876 12
## GO:0000779 15
## GO:0072686 15
## GO:0000775 18
## GO:0000776 15
As I mentioned before, any gene ID type that supported in OrgDb
can be directly used in GO analyses. User need to specify the keyType
parameter to specify the input gene ID type.
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
Gene ID can be mapped to gene Symbol by using paramter readable=TRUE
or setReadable
function.
5.3.1 drop specific GO terms or level
enrichGO
test the whole GO corpus and enriched result may contains very general terms. With dropGO
function, user can remove specific GO terms or GO level from results obtained from both enrichGO
and compareCluster
.
5.3.2 test GO at sepcific level
enrichGO
doesn’t contain parameter to restrict the test at specific GO level. Instead, we provide a function gofilter
to restrict the result at specific GO level. It works with results obtained from both enrichGO
and compareCluster
.
5.3.3 reduce redundancy of enriched GO terms
GO is organized in parent-child structure, thus a parent term can be overlap with a large proportion with all its child terms. This can result in redundant findings. To solve this issue, clusterProfiler implement simplify
method to reduce redundant GO terms from the outputs of enrichGO
and gseGO
. The function internally called GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and remove those highly similar terms by keeping one representative term. An example can be found in the blog post.
5.4 GO Gene Set Enrichment Analysis
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
For algorithm details, please refer to the vignette of DOSE.
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
nPerm = 1000,
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
GSEA use permutation test, user can set nPerm for number of permutations. Only gene Set size in [minGSSize, maxGSSize]
will be tested.
If you have issues in preparing your own geneList
, please refer to the wiki
page.
5.5 GO Semantic Similarity Analysis
GO semantic similarity can be calculated by GOSemSim(Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.
5.5.1 GO analysis for non-model organisms
Both enrichGO
and gseGO
functions require an OrgDb
object as background annotation. For organisms that don’t have OrgDb
provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb
available, users can obtain GO annotation from other sources, e.g. from biomaRt or Blast2GO. Then using enricher
or GSEA
function to analyze, similar to the examples using wikiPathways and MSigDB. Another solution is to create OrgDb
by your own using AnnotationForge package.
References
Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. “GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18): 3710–5. https://doi.org/10.1093/bioinformatics/bth456.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among Go Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.