% \VignetteEngine{knitr} % \VignetteIndexEntry{An introduction to clusterProfiler} % \VignettePackage{clusterProfiler}} % To compile this document, run the commands within R % require(cacheSweave); Sweave('clusterProfiler.Rnw', driver=cacheSweaveDriver()); tools::texi2dvi("clusterProfiler.tex", pdf=TRUE) \documentclass[12pt]{article} \usepackage[a4paper,left=3cm,top=2cm,bottom=3cm,right=3cm,ignoreheadfoot]{geometry} \pagestyle{empty} \usepackage[usenames,dvipsnames]{xcolor} \usepackage{helvet} \usepackage[titletoc]{appendix} \usepackage{tocloft} \setlength{\parindent}{0em} \setlength{\parskip}{.5em} \renewcommand{\familydefault}{\sfdefault} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\small\texttt{#1}}} \newcommand{\Rfunction}[1]{\Robject{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rparameter}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\R}{\textsf{R}} \usepackage[ bookmarks,% colorlinks,% linktoc=section,% linkcolor=RedViolet,% citecolor=RedViolet,% urlcolor=RedViolet,% linkbordercolor={1 1 1},% citebordercolor={1 1 1},% urlbordercolor={1 1 1},% raiselinks,% plainpages,% pdftex ]{hyperref} \usepackage{url} \usepackage{cite} \renewcommand{\floatpagefraction}{0.9} \usepackage{sectsty} \sectionfont{\sffamily\bfseries\color{RoyalBlue}\sectionrule{0pt}{0pt}{-1ex}{1pt}} \subsectionfont{\sffamily\bfseries\color{RoyalBlue}} \subsubsectionfont{\sffamily\bfseries\color{RoyalBlue}} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhead{} \fancyfoot{} \lfoot{}\cfoot{}\rfoot{} \renewcommand{\headrule}{} \usepackage{graphicx} %\usepackage{xstring} <>= library(DOSE) library(GO.db) library(org.Hs.eg.db) library(pathview) library(clusterProfiler) library(knitr) opts_chunk$set(tidy=TRUE,tidy.opts=list(keep.blank.line=FALSE, width.cutoff=50),out.truncate=80,out.lines=6,cache=TRUE,dev='pdf',include=TRUE,fig.width=6,fig.height=6,resolution=150) @ \newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}} \title{\textsf{\textbf{Using \Rpackage{clusterProfiler} to identify and compare functional profiles of gene lists}}} \author{Guangchuang Yu \\ School of Biological Sciences \\ The University of Hong Kong, Hong Kong SAR, China \\ email: \texttt{gcyu@connect.hku.hk}} \begin{document} \maketitle <>= options(digits=3, width=80, prompt=" ", continue=" ") @ \tableofcontents \section{Introduction} In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular moleculars at systems-level. These kinds of analyses generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities \cite{yu2010}. To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes. After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters. The manual step to choose interesting clusters followed by enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed \Rpackage{clusterProfiler} \cite{yu2012}, for comparing and visualizing functional profiles among gene clusters. \section{Citation} Please cite the following articles when using \Rpackage{clusterProfiler}. \\ G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for comparing biological themes among gene clusters. \textit{OMICS: A Journal of Integrative Biology}. 2012, 16(5), 284-287. \\ \section{Gene Ontology Classification} In \Rpackage{clusterProfiler}, \Rfunction{groupGO} is designed for gene classification based on GO distribution at a specific level. <>= require(DOSE) data(geneList) gene <- names(geneList)[abs(geneList) > 2] head(gene) ggo <- groupGO(gene=gene, organism="human", ont="BP", level=3, readable=TRUE) head(summary(ggo)) @ \section{Enrichment Analysis} \subsection{Hypergeometric model} Enrichment analysis \cite{boyle2004} is a widely used approach to identify biological themes. Here we implement hypergeometric model to assess whether the number of selected genes associated with disease is larger than expected. To determine whether any terms annotate a specified list of genes at frequency greater than that would be expected by chance, \Rpackage{clusterProfiler} calculates a p-value using the hypergeometric distribution: $ p = 1 - \displaystyle\sum_{i = 0}^{k-1} \frac{ {M \choose i} {{N-M} \choose {n-i}} } { {N \choose n} } $ In this equation, \textit{N} is the total number of genes in the background distribution, \textit{M} is the number of genes within that distribution that are annotated (either directly or indirectly) to the node of interest, \textit{n} is the size of the list of genes of interest and \textit{k} is the number of genes within that list which are annotated to the node. The background distribution by default is all the genes that have annotation. P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control. \subsection{GO enrichment analysis} <>= ego <- enrichGO(gene=gene, universe = names(geneList), organism="human", ont="CC", pvalueCutoff=0.01, readable=TRUE) head(summary(ego)) @ \subsection{KEGG pathway enrichment analysis} <>= kk <- enrichKEGG(gene=gene, organism="human", pvalueCutoff=0.01, readable=TRUE) head(summary(kk)) @ \subsection{DO enrichment analysis} Disease Ontology (DO) enrichment analysis is implemented in \Rpackage{DOSE}, please refer to the package vignettes. The \Rfunction{enrichDO} function is very useful for identifying disease association of interesting genes. \subsection{Reactome pathway enrichment analysis} With the demise of KEGG (at least without subscription), the KEGG pathway data in Bioconductor will not update and we encourage user to analyze pathway using \Rpackage{ReactomePA} which use Reactome as a source of pathway data. The function call of \Rfunction{enrichPathway} in \Rpackage{ReactomePA} is consistent with \Rfunction{enrichKEGG}. \subsection{Function call} The function calls of \Rfunction{groupGO}, \Rfunction{enrichGO}, \Rfunction{enrichKEGG}, \Rfunction{enrichDO} and \Rfunction{enrichPathway} are consistent. The input parameters of \textit{gene} is a vector of entrezgene (for human and mouse) or ORF (for yeast) IDs, and \textit{organism} should be supported species (please refer to the manual of the specific function). For GO analysis, \textit{ont} must be assigned to one of "BP", "MF", and "CC" for biological process, molecular function and cellular component, respectively. In \Rfunction{groupGO}, the \textit{level} specify the GO level for gene projection. In enrichment analysis, the \textit{pvalueCutoff} is to restrict the result based on their pvalues and the adjusted p values. \textit{Q-values} were also calculated for controlling false discovery rate (FDR). The \textit{readable} is a logical parameter to indicate the input gene IDs will map to gene symbols or not. \subsection{Visualization} The output of \Rfunction{groupGO}, \Rfunction{enrichGO} and \Rfunction{enrichKEGG} can be visualized by bar plot and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart. \subsubsection{barplot} <>= barplot(ggo, drop=TRUE, showCategory=12) @ <>= barplot(ego, showCategory=8) @ \subsubsection{cnetplot} In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed \Rfunction{cnetplot} function to extract the complex association. <>= cnetplot(ego, categorySize="pvalue", foldChange=geneList) @ <>= cnetplot(kk, categorySize="geneNum", foldChange=geneList) @ \subsubsection{pathview from pathview package} \Rpackage{clusterProfiler} users can also use \Rfunction{pathview} from the \Rpackage{pathview} \cite{luo_pathview} to visualize KEGG pathway. The following example illustrate how to visualize "hsa04110" pathway, which was enriched in our previous analysis. <>= require(pathview) hsa04110 <- pathview(gene.data=geneList, pathway.id="hsa04110", species="hsa", limit=list(gene=max(abs(geneList)), cpd=1)) @ \begin{figure}[ht] \centering \includegraphics[width=.9\textwidth,type=png,ext=.png,read=.png]{hsa04110.pathview} \caption{visualize KEGG pathway using pathview} \label{viewKEGG} \end{figure} For further information, please refer to the vignette of \Rpackage{pathview} \cite{luo_pathview}. \section{Biological theme comparison} \Rpackage{clusterProfiler} was developed for biological theme comparison, and it provides a function, \Rfunction{compareCluster}, to automatically calculate enriched functional categories of each gene clusters. <>= data(gcSample) ck <- compareCluster(geneCluster=gcSample, fun="enrichKEGG") plot(ck) @ By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter \textit{showCategory} to specify how many categories of each cluster to be plotted, and if \textit{showCategory} was set to \textit{NULL}, the whole result will be plotted. The dot sizes were based on their corresponding row percentage by default, and user can set the parameter \textit{by} to "count" to make the comparison based on gene counts. We choose "percentage" as default parameter to represent the size of dots, since some categories may contain a large number of genes, and make the dot sizes of those small categories too small to compare. To provide the full information, we also provide number of identified genes in each category (numbers in parentheses), as shown in Figure 3. If the dot sizes were based on "count", the row numbers will not shown. The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by parameter \textit{pvalueCutoff}, and FDR can be estimated by \textit{qvalue}. User can refer to the example in \cite{yu2012}; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) \cite{schmidt2008}. We identified 8 gene clusters from differentially expressed genes, and using \Rfunction{compareCluster} to compare these gene clusters by their enriched biological process. Another example was shown in \cite{yu2011}, we calculated functional similarities among viral miRNAs using method described in \cite{yu_new_2011}, and compared significant KEGG pathways regulated by different viruses using \Rfunction{compareCluster}. The comparison function was designed as a general-package for comparing gene clusters of any kind of ontology associations, not only \Rfunction{groupGO}, \Rfunction{enrichGO}, and \Rfunction{enrichKEGG} this package provided, but also other biological and biomedical ontologies, for instance, \Rfunction{enrichDO} from \Rpackage{DOSE} and \Rfunction{enrichPathway} from \Rpackage{ReactomePA} work fine with \Rfunction{compareCluster} for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of \Rpackage{DOSE} and \Rpackage{ReactomePA}. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliographystyle{unsrt} \bibliography{clusterProfiler} \end{document}