% \VignetteIndexEntry{biosvd} % \VignetteDepends{} % \VignetteKeywords{biosvd} % \VignettePackage{biosvd} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{Sweave} \usepackage{amssymb} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{authblk} \textwidth=6.5in \textheight=8.5in \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \title{The biosvd package for high-throughput data processing, outlier detection, noise removal and dynamic modeling} \author[1]{Anneleen Daemen\thanks{daemena@gene.com}} \author[1]{Matthew Brauer\thanks{matthejb@gene.com}} \affil[1]{Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, CA} \date{\today} \renewcommand\Authands{ and } \begin{document} \maketitle \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Singular Value Decomposition (SVD) is a popular method, suited for longitudinal data processing and modeling. Simply stated, SVD decomposes data to a linear combination of main major modes of intensity. SVD for the analysis of genome-wide expression data was introduced by Alter and colleagues in 2000, with the major modes referred to as eigengenes and eigenarrays [Alter et al, 2000]. Sorting the expression data to these modes revealed clusters of genes or arrays with similar function or biologic phenotype. Here, we extend the application of SVD to any data type. This BioConductor R package allows for high-throughput data processing, outlier detection, noise removal and dynamic modeling, based on the framework of Singular Value Decomposition. It provides the user with summary graphs and an interactive html report. This gives a global picture of the dynamics of expression/intensity levels, in which individual features and assays are classified in groups of similar regulation and function or similar cellular state and biological phenotype. \subsection{Singular Value Decomposition (SVD)} Singular Value Decomposition is a linear transformation of a data set $E$ from the $M$ features x $N$ assays space to a reduced $L$ eigenfeatures x $L$ eigenassays space, with \(L=\min \{M,N\}\). In mathematical terms, this corresponds to \(E = U \Sigma V^T\). $U$ and $V^T$ define the $M$ features x $L$ eigenassays and the $L$ eigenfeatures x $N$ assays orthonormal basis sets. Each column in $U$ corresponds to a left singular vector, representing genome-wide expression, proteome-wide abundance or metabolome-wide intensity in the $k$-th eigenassay. Accordingly, each row in $V^T$ is called a right singular vector and represents the expression, abundance or intensity of the $k$-th eigenfeature across all assays. \vspace{3 mm} $\Sigma$ is a diagonal matrix with expression of each eigenfeature restricted to the corresponding eigenassay, reflecting the decoupling and decorrelation of the data. The eigenexpression levels along the diagonal indicate the relative significance of each $\{$eigenfeature, eigenassay$\}$-pair. The relative fraction of overall expression that the $k$-th eigenfeature and eigenassay capture is called {\it eigenexpression fraction} and is defined as \(f_l = \epsilon^2_l / \sum_{k=1}^L \epsilon_k^2\). Finally, data complexity is expressed as the Shannon entropy, defined as \(0 \leqslant \frac{-1}{log(L)} \sum_{k=1}^L f_k log(f_k) \leqslant 1\). An entropy of 0 corresponds to an ordered and redundant data set, with all expression captured by a single $\{$eigenfeature, eigenassay$\}$-pair. On the other hand, the entropy is 1 in case of a disordered and random data set with all $\{$eigenfeature, eigenassay$\}$-pairs equally expressed. We refer to \cite{Alter00} for a more detailed description of SVD. \subsection{biosvd Package Overview} \includegraphics{biosvd-package-overview.png} \vspace{3 mm} The biosvd package consists of 4 main functions. First, \tt compute \rm reduces the input data set from the feature x assay space to the reduced diagonalized eigenfeature x eigenassay space, with the eigenfeatures and eigenassays unique orthonormal superpositions of the features and assays, respectively. Results of SVD applied to the data can subsequently be inspected based on the graphs generated with \tt plot\rm. These graphs include a heatmap of the eigenfeature x assay matrix ($V^T$), a bar plot with the eigenexpression fractions of all eigenfeatures, and the levels of the eigenfeatures across the assays. These graphs aid in deciding which eigenfeatures and eigenassays to filter out (i.e., eigenfeatures representing steady state, noise, or experimental artifacts) (\tt exclude\rm). Filtering out steady-state expression/intensity corresponds to centering the expression/intensity patterns at steady-state level (arithmetic mean of intensity $\sim$ 0). \vspace{3 mm} Secondly, the three functions \tt compute\rm, \tt plot \rm and \tt exclude \rm can be applied to the variance in the data, in order to filter out steady-scale variance. This corresponds to a normalization by the steady scale of expression/intensity variance (geometric mean of variance $\sim$ 1). \vspace{3 mm} Thirdly, after possible removal of steady state expression, steady-scale variance, noise and experimental artifacts, SVD is re-applied to the normalized data, followed by the generation of a summary report with \tt report\rm. The generated html report of the eigensystem contains polar plots of the assays and features, displaying the features/assays according to their correlation with two selected eigenfeatures/eigenassays, and a table with the list of features, sortable according to their coordinates, radius and phase in the polar plot. This function also generates a visualization of the data sorted according to the two selected eigenfeatures and eigenassays, with a heatmap of the feature x assay matrix with colored feature/assay annotation information when provided, a heatmap of the feature x eigenassay matrix, and the expression/intensity levels of the two sorted eigenassays across all features. \subsection{Case Studies Overview} In this vignette, three case studies are provided. In the first 2 case studies, the expression pattern of genes throughout the cell cycle are studied in yeast and human, respectively. As use of this package is not restricted to expression data, the third case study focuses on cellular metabolites in bacteria and yeast after carbon and nitrogen starvation. \vspace{3 mm} The essential data must be provided as a feature x assay matrix, a data frame, ExpressionSet or an object from class eigensystem obtained from a former run. For the examples provided in this vignette, ExpressionSets were created containing the gene x sample expression data with gene annotation and sample information, or the metabolite x assay intensity data with metabolite annotation and assay information. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case Study 1: Yeast Cell Cycle Expression} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% As a first example, Spellman and colleagues created a comprehensive catalog of genes in {\it Saccharomyces cerevisiae} whose transcript levels vary periodically within the cell cycle \cite{Spellman98}. To this end, mRNA levels in samples from yeast cultures were synchronized in G1 phase with $\alpha$ factor arrest. After release of the $\alpha$ factor, cells were sampled every 7 minutes over a timespan of 140 minutes, during which the cells synchronously completed two cell cycles. The gene x sample expression data comprise the (un-logtransformed) ratio of gene expression to reference mRNA from an asynchronous yeast culture. For each sample, the cell cycle phase is known as determined by Spellman {\it et al}. For 800 cell cycle-regulated genes, the phase in which these genes reach their peak expression was determined by Spellman {\it et al} based on published timing of the expression of known cell cycle-regulated genes. \vspace{3 mm} We start the example by loading the data and all required libraries: <>= library(biosvd) data(YeastData_alpha) YeastData @ The input data set is first reduced from the gene x sample space to the reduced diagonalized eigenfeature x eigenassay space. <>= eigensystem <- compute(YeastData) @ The eigenfeatures and eigenassays can subsequently be inspected based on the graphs generated with \tt plot\rm. Up to 5 figures are displayed (\tt fig=TRUE\rm) or saved as a pdf file (default \tt fig=FALSE\rm), using the \tt plots \rm argument as follows: \begin{itemize} \item {\bf fraction}: bar plot with the eigenexpression fractions of all eigenfeatures \item {\bf zoomedFraction}: bar plot with the eigenexpression fractions of the eigenfeatures after removal of the dominant eigenfeature(s) \item {\bf heatmap}: heatmap of the eigenfeature x assay matrix with use of a given contrast factor \item {\bf lines}: expression/intensity levels of eigenfeatures 1 to 4 across the assays \item {\bf allLines}: expression/intensity levels of all eigenfeatures across the assays \end{itemize} For this example, the bar plot with all eigenfeatures (\tt fraction\rm), the expression levels of all eigenfeatures across the samples (\tt allLines\rm), and the heatmap of the eigenfeature x assay matrix (\tt heatmap\rm) are generated, with {\it YeastData} as prefix for visualization purpose. The bar plot shows that the first eigenfeature captures 89\% of the overall relative expression in the experiment. The entropy of the data set is therefore low (0.18 $\ll$ 1). The expression level of the first eigenfeature across the samples as displayed in the \tt allLines \rm graph shows a time-invariant relative expression during the cell cycle. The low entropy in combination with the steady-state expression captured in the first eigenfeature suggests that the underlying processes are manifested by weak perturbations of a steady state of expression. The second eigenfeature describes an initial transient increase in relative expression superimposed over time-invariant relative expression, and is therefore inferred to represent response to synchronization in the cell cycle. Inspecting the remaining eigenfeature patterns across the samples reveals that eigenfeature 8 and 10 to 18 all show rapidly varying relative expression during the cell cycle. They can therefore be considered as noise. The heatmap of the eigenfeatures by samples reveals the same information, with a clear constant expression for eigenfeature 1. <>= plot(eigensystem, plots="fraction", figure=TRUE) fractions(eigensystem)[[1]] @ <>= plot(eigensystem, plots="allLines", figure=TRUE) @ <>= plot(eigensystem, plots="heatmap", figure=TRUE, prefix="YeastData") @ We now use \tt exclude \rm to filter out steady-state expression captured by eigenfeature 1, an experimental artifact captured by eigenfeature 2 (i.e. initial response to synchronization in the cell cycle), and noise captured by eigenfeatures 8 and 10 to 18. <>= eigensystem <- exclude(eigensystem,excludeEigenfeatures=c(1,2,8,10:18)) @ Subsequently, we apply the same strategy to the variance in the data. The first eigenfeature now captures 88\% of overall information, representing a time-invariant scale of expression variance, with an entropy of 0.23. This eigenfeature is therefore removed from the data set. Besides exclusion of the specified eigenfeatures, \tt exclude \rm regenerates the eigensystem for the normalized expression data after removal of steady-scale variance. <>= eigensystem <- compute(eigensystem, apply='variance') entropy(eigensystem) fractions(eigensystem)[[1]] plot(eigensystem, plots="lines", figure=TRUE) eigensystem <- exclude(eigensystem, excludeEigenfeatures=1) @ Now that steady-state expression, steady-scale variance, experimental artifacts and noise have been removed, a summary html report and key graphs are generated. For the coloring of the genes and samples in the polar plots, the cell cycle phase information from the ExpressionSet \tt YeastData \rm is used. The polar plots show the genes and samples in the subspace spanned by two selected eigenfeatures and eigenassays, respectively. Because the first and second eigenfeature capture together more than 45\% of the overall normalized expression, these eigenfeatures were used as subspace (default). These two \{eigenfeature, eigenassay\}-pairs are sufficient to approximate the expression data when genes and samples have most of their normalized expression in this subspace (i.e., $0.5 <$ radius $< 1$). <>= fractions(eigensystem)[c(1,2)] report(eigensystem, colorIdAssays="Cell.cycle.stage", colorIdFeatures="Cell.cycle.stage", prefix="YeastData") @ <>= library(grid) eigenfeature.xaxis <- 2 eigenfeature.yaxis <- 1 colorIdAssays <- assayMatrix(eigensystem)[,"Cell.cycle.stage"] unique.col.ids <- sort(unique(colorIdAssays), na.last=NA) col.assays <- rep(0,ncol(matrix(eigensystem))) col.map <- rainbow(length(unique.col.ids)) for (z in c(1:length(unique.col.ids))) {col.assays[which(colorIdAssays %in% unique.col.ids[z])] <- col.map[z]} coordinates.assays <- base::matrix(0,nrow=ncol(matrix(eigensystem)),ncol=2) for (z in c(1:ncol(eigenassays(eigensystem)))) {coordinates.assays[z,] <- c(assaycorrelations(eigensystem)[eigenfeature.xaxis,z]/sqrt(matrix(eigensystem)[,z] %*% matrix(eigensystem)[,z]), assaycorrelations(eigensystem)[eigenfeature.yaxis,z]/sqrt(matrix(eigensystem)[,z] %*% matrix(eigensystem)[,z]))} radii.assays <- signif(sqrt(coordinates.assays[,1]^2+coordinates.assays[,2]^2),3) names(radii.assays) <- colnames(matrix(eigensystem)) phase.assays <- atan(assaycorrelations(eigensystem)[eigenfeature.yaxis,]/assaycorrelations(eigensystem)[eigenfeature.xaxis,])/pi names(phase.assays) <- colnames(matrix(eigensystem)) coordinates.assays <- signif(coordinates.assays,3) rownames(coordinates.assays) <- colnames(matrix(eigensystem)) vp0 <- viewport(x=0,width=0.05,just="left",name="vp0") vp1 <- viewport(x=0.1,y=0.1,width=0.75,height=0.75,just=c("left","bottom"),name="vp1") vp2 <- viewport(x=0.1,y=0,width=0.75,height=0.1,just=c("left","bottom"),name="vp2") vp3 <- viewport(x=1,width=0.2,just="right",name="vp3") pushViewport(vp0) grid.text(paste("Assay correlation with eigenassay ",eigenfeature.yaxis,sep=""), y=0.5, rot=90) upViewport() pushViewport(vp1) grid.circle(x=0.5,y=0.5,r=0.5,gp=gpar(lty="dashed")) grid.circle(x=0.5,y=0.5,r=0.25,gp=gpar(lty="dashed",fill="grey")) grid.lines(x=unit(c(0,1),"npc"),y=unit(c(0.5,0.5),"npc"),arrow=NULL) grid.lines(x=unit(c(0.5,0.5),"npc"),y=unit(c(0,1),"npc"),arrow=NULL) for (z in c(1:length(unique.col.ids))) { indices <- which(colorIdAssays %in% unique.col.ids[z]) grid.points(x=unit((coordinates.assays[indices,1]+1)/2,"npc"),y=unit((coordinates.assays[indices,2]+1)/2,"npc"), pch=z,gp=gpar(col=col.map[z],cex=0.7)) } grid.text(c(1:length(colorIdAssays)),just="left",x=(coordinates.assays[,1]+1)/2+0.02,y=(coordinates.assays[,2]+1)/2) grid.lines(x=unit(c(0.5,(coordinates.assays[1,1]+1)/2),"npc"),y=unit(c(0.5,(coordinates.assays[1,2]+1)/2),"npc"),arrow=arrow(angle=30, length=unit(0.02,"npc"),ends="last",type="open")) upViewport() pushViewport(vp2) grid.text(paste("Assay correlation with eigenassay ",eigenfeature.xaxis,sep=""), x=0.5, y=0.5) upViewport() pushViewport(vp3) grid.points(pch=1:length(unique.col.ids),x=unit(rep(0.5,length(unique.col.ids)),"lines"),y=unit(1,"npc")-unit(c(1:length(unique.col.ids)),"lines"),gp=gpar(col=col.map)) grid.text(unique.col.ids,just="left",x=unit(rep(1.5,length(unique.col.ids)),"lines"), y=unit(1,"npc")-unit(c(1:length(unique.col.ids)),"lines"),gp=gpar(col=col.map)) upViewport() @ <>= colorIdFeatures <- featureMatrix(eigensystem)[,"Cell.cycle.stage"] unique.row.ids <- sort(unique(colorIdFeatures), na.last=NA) col.features <- rep(0,nrow(matrix(eigensystem))) row.map <- rainbow(length(unique.row.ids)) for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures %in% unique.row.ids[z])] <- row.map[z]} coordinates.features <- base::matrix(0,nrow=nrow(matrix(eigensystem)),ncol=2) for (z in c(1:nrow(matrix(eigensystem)))) {coordinates.features[z,] <- c(featurecorrelations(eigensystem)[eigenfeature.xaxis,z]/sqrt(matrix(eigensystem)[z,] %*% matrix(eigensystem)[z,]), featurecorrelations(eigensystem)[eigenfeature.yaxis,z]/sqrt(matrix(eigensystem)[z,] %*% matrix(eigensystem)[z,]))} radii.features <- signif(sqrt(coordinates.features[,1]^2+coordinates.features[,2]^2),3) names(radii.features) <- rownames(matrix(eigensystem)) phase.features <- atan(featurecorrelations(eigensystem)[eigenfeature.yaxis,]/featurecorrelations(eigensystem)[eigenfeature.xaxis,])/pi names(phase.features) <- rownames(matrix(eigensystem)) coordinates.features <- signif(coordinates.features,3) rownames(coordinates.features) <- rownames(matrix(eigensystem)) phase.features.converted <- phase.features*pi phase.features.converted[which(coordinates.features[,1]<0)] <- phase.features.converted[which(coordinates.features[,1]<0)]+pi phase.features.converted[which(coordinates.features[,1]>0 & coordinates.features[,2]<0)] <- phase.features.converted[which(coordinates.features[,1]>0 & coordinates.features[,2]<0)]+(2*pi) phase.features.converted <- signif(phase.features.converted,3) vp0 <- viewport(x=0,width=0.05,just="left",name="vp0") vp1 <- viewport(x=0.1,y=0.1,width=0.75,height=0.75,just=c("left","bottom"),name="vp1") vp2 <- viewport(x=0.1,y=0,width=0.75,height=0.1,just=c("left","bottom"),name="vp2") vp3 <- viewport(x=1,width=0.2,just="right",name="vp3") pushViewport(vp0) grid.text(paste("Feature correlation with eigenfeature ",eigenfeature.yaxis,sep=""), y=0.5, rot=90) upViewport() pushViewport(vp1) grid.circle(x=0.5,y=0.5,r=0.5,gp=gpar(lty="dashed")) grid.circle(x=0.5,y=0.5,r=0.25,gp=gpar(lty="dashed",fill="grey")) grid.lines(x=unit(c(0,1),"npc"),y=unit(c(0.5,0.5),"npc"),arrow=NULL) grid.lines(x=unit(c(0.5,0.5),"npc"),y=unit(c(0,1),"npc"),arrow=NULL) for (z in c(1:length(unique.row.ids))) { indices <- which(colorIdFeatures %in% unique.row.ids[z]) grid.points(x=unit((coordinates.features[indices,1]+1)/2,"npc"),y=unit((coordinates.features[indices,2]+1)/2,"npc"), pch=z,gp=gpar(col=row.map[z],cex=0.7)) } grid.lines(x=unit(c(0.5,(coordinates.features[1,1]+1)/2),"npc"),y=unit(c(0.5,(coordinates.features[1,2]+1)/2),"npc"),arrow=arrow(angle=30, length=unit(0.02,"npc"),ends="last",type="open")) upViewport() pushViewport(vp2) grid.text(paste("Feature correlation with eigenfeature ",eigenfeature.xaxis,sep=""), x=0.5, y=0.5) upViewport() pushViewport(vp3) grid.points(pch=1:length(unique.row.ids),x=unit(rep(0.5,length(unique.row.ids)),"lines"),y=unit(1,"npc")-unit(c(1:length(unique.row.ids)),"lines"),gp=gpar(col=row.map)) grid.text(unique.row.ids,just="left",x=unit(rep(1.5,length(unique.row.ids)),"lines"), y=unit(1,"npc")-unit(c(1:length(unique.row.ids)),"lines"),gp=gpar(col=row.map)) upViewport() @ As can be seen on the assay polar plot, eigenassay 1 is positively associated with mainly samples from G1 phase, whilst this eigenassay is negatively associated with samples from the S, G2 and M phases. Eigenassay 2 is positively associated with G1 and S, whilst negatively associated with M and M/G1. The right upper quadrant is therefore dominated by samples from G1, the right lower quadrant by S and G2, and the left lower quadrant by M. Genes in each of these quadrants in the feature polar plot can subsequently be linked to and interpreted in function of each of these cell cycle phases. The genes are colored according to their expression correlation with known cell cycle-regulated genes (as determined by Spellman {\it et al}), with mainly G1 genes in the right upper quadrant, S and G2 genes in the right lower quadrant, and M genes in the left lower quadrant, confirming the findings obtained with the biosvd package. <>= library(gplots) eigensystem.sorted <- sort(eigensystem, decreasing=FALSE, eigenfeature.xaxis=eigenfeature.xaxis, eigenfeature.yaxis=eigenfeature.yaxis, colorIdFeatures=factor(colorIdFeatures)) col.features <- rep(0,nrow(matrix(eigensystem))) for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures(eigensystem.sorted) %in% unique.row.ids[z])] <- row.map[z]} contrast <- 3 pal <- colorRampPalette(c(rgb(1,0,0), rgb(0,1,0)), space="rgb") contrastMatrix <- contrast* matrix(eigensystem.sorted) contrastMatrix[which(contrastMatrix>1)] <- 1 contrastMatrix[which(contrastMatrix<(-1))] <- -1 contrastMatrix <- (contrastMatrix - min(contrastMatrix))/(max(contrastMatrix)-min(contrastMatrix)) heatmap.2(contrastMatrix, Rowv=NA, Colv=NA, RowSideColors=col.features, ColSideColors=col.assays, scale="none", dendrogram="none", col=pal, trace="none", xlab="Assays", ylab="Features", labRow=NA, margins=c(9,3), main="YeastData", key=TRUE) legend("left",legend=c("Assay annotation",as.character(unique.col.ids),"","Feature annotation",as.character(unique.row.ids)),fill=c("white",col.map,"white","white",row.map), bty="n", border=FALSE, cex=0.7, y.intersp=0.7) @ While generating the html report, data were also sorted according to the first and second eigenfeature. The gene x sample heatmap with genes sorted according to the selected eigenfeatures shows a traveling wave of expression throughout the cell cycle. This visualization further aids in interpreting how genes are regulated by or function in the cell cycle. In the generated html report, the genes can be sorted according to their coordinates, radius and phase in the polar plot, or selected based on annotation information that was provided with the input ExpressionSet. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case Study 2: Human HeLa Cell Cycle Expression} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Secondly, a similar experiment was performed in the human HeLa cervical carcinoma cell line \cite{Whitfield02}. Cells were arrested at the beginning of S phase by using a double thymidine block. Upon release from the thymidine block, cells were sampled every 1-2 hours for 44 hours during which the cells completed three cell cycles. Similar as for the Yeast experiment, expression data comprise the un-logtransformed ratio of gene expression to reference mRNA from an asynchronous HeLa culture. Moreover, cell cycle phase is known for each sample. For >850 genes that were identified by Whitfield {\it et al} to be periodically expressed during the cell cycle, the phase was determined based on correlation with genes known to be expressed in each cell cycle phase (e.g. {\it cyclin E1} at the G1/S boundary, {\it RAD51} in S phase, and {\it TOP2A} in G2). \vspace{3 mm} Similar as for the first case study, we load the HeLa data and compute the eigensystem. In this case, the first eigenfeature captures more than 90\% of the relative expression, with an entropy of 0.20. As can be seen on the plot of the expression level of eigenfeatures across samples, this eigenfeature represents steady-state expression. Besides eigenfeature 1, we also decided to remove eigenfeatures 7, 10, 11 and 12, all showing rapidly varying expression during the cell cycle. <>= data(HeLaData_exp_DoubleThym_2) HeLaData eigensystem <- compute(HeLaData) fractions(eigensystem)[[1]] entropy(eigensystem) plot(eigensystem, plots="allLines", figure=TRUE) eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,7,10:12)) @ As a second step, we apply the same strategy to the variance in the data. A time-invariant scale of expression variance was captured by the first eigenfeature and therefore removed. Regarding the plots generated by \tt plot\rm, these are not shown but saved as pdf files because the \tt figure \rm argument is by default \tt FALSE\rm. <>= eigensystem <- compute(eigensystem, apply='variance') entropy(eigensystem) fractions(eigensystem)[[1]] plot(eigensystem, plots=c("heatmap","fraction","lines"), prefix="HeLaData") eigensystem <- exclude(eigensystem, excludeEigenfeatures=1) @ As final step, we now generate the html report with polar plots and a visualization of the expression data after sorting according to the first and second eigenfeature. Similar as before, the cell cycle phase information for both the genes and samples from the ExpressionSet \tt HeLaData \rm is used for coloring of the polar plots and the sorted heatmaps. The gene x sample heatmap with genes sorted according to eigenfeatures 1 and 2 is shown below and displays a traveling wave of expression throughout the cell cycle. <>= report(eigensystem, colorIdAssays="Cell.cycle.stage", colorIdFeatures="Cell.cycle.stage", prefix="HeLaData") @ <>= eigenfeature.xaxis <- 2 eigenfeature.yaxis <- 1 colorIdAssays <- assayMatrix(eigensystem)[,"Cell.cycle.stage"] unique.col.ids <- sort(unique(colorIdAssays), na.last=NA) col.assays <- rep(0,ncol(matrix(eigensystem))) col.map <- rainbow(length(unique.col.ids)) for (z in c(1:length(unique.col.ids))) {col.assays[which(colorIdAssays %in% unique.col.ids[z])] <- col.map[z]} eigensystem.sorted <- sort(eigensystem, decreasing=FALSE, eigenfeature.xaxis, eigenfeature.yaxis, "Cell.cycle.stage") unique.row.ids <- sort(unique(colorIdFeatures), na.last=NA) col.features <- rep(0,nrow(matrix(eigensystem))) row.map <- rainbow(length(unique.row.ids)) for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures(eigensystem.sorted) %in% unique.row.ids[z])] <- row.map[z]} contrast <- 3 pal <- colorRampPalette(c(rgb(1,0,0), rgb(0,1,0)), space="rgb") contrastMatrix <- contrast* matrix(eigensystem.sorted) contrastMatrix[which(contrastMatrix>1)] <- 1 contrastMatrix[which(contrastMatrix<(-1))] <- -1 contrastMatrix <- (contrastMatrix - min(contrastMatrix))/(max(contrastMatrix)-min(contrastMatrix)) heatmap.2(contrastMatrix, Rowv=NA, Colv=NA, RowSideColors=col.features, ColSideColors=col.assays, scale="none", dendrogram="none", col=pal, trace="none", xlab="Assays", ylab="Features", labRow=NA, margins=c(9,3), main="HeLaData", key=TRUE) legend("left",legend=c("Assay annotation",as.character(unique.col.ids),"","Feature annotation",as.character(unique.row.ids)),fill=c("white",col.map,"white","white",row.map), bty="n", border=FALSE, cex=0.7, y.intersp=0.7) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case Study 3: Starvation Metabolomics} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To show that use of this R package is not restricted to expression data, this third example features metabolomics data. Brauer and colleagues studied metabolic response to starvation in two microbes, {\it Escherichia coli} and {\it Saccharomyces cerevisae}, to determine whether metabolome response to nutrient deprivation is similar across both organisms \cite{Brauer06}. Sixty-eight cellular metabolites were analyzed by LC-MS/MS in both bacteria and yeast, after nutrient starvation with carbon and nitrogen. Cells were sampled for 8 hours. The metabolomics data comprise the log-transformed relative metabolite concentration changes with respect to experiment initiation at time point 0 hours. \vspace{3 mm} This case study not only differs from the two previous case studies in data type, but also in heterogeneity with four experiments combined into this data set (bacteria - carbon, bacteria - nitrogen, yeast - carbon, and yeast - nitrogen). This increased heterogeneity in the data explains the much higher entropy compared to the two previous studies (0.51), with the first and second eigenfeature capturing 42\% and 29\% of the relative intensity, respectively. Contrary to the two previous case studies, eigenfeature 1 no longer captures steady-state expression but shows a decreasing trend over time for each of the experiments, regardless of organism (B for bacteria vs. Y for yeast) and nutrient (C for carbon vs. N for nitrogen). Because we are not interested in a generic starvation response, we decided to remove the first eigenfeature at the metabolite intensity level. Furthermore, eigenfeatures 11, 12, and 14 to 24 rapidly vary along the assays and can therefore be considered as noise. <>= data(StarvationData) StarvationData eigensystem <- compute(StarvationData) fractions(eigensystem)[c(1,2)] plot(eigensystem, plots=c("fraction","lines","allLines"), figure=TRUE, prefix="StarvationData") eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,11,12,14:24)) @ <>= data(StarvationData) StarvationData eigensystem <- compute(StarvationData) fractions(eigensystem)[c(1,2)] plot(eigensystem, plots="fraction", figure=TRUE, prefix="StarvationData") @ <>= plot(eigensystem, plots="lines", figure=TRUE, prefix="StarvationData") @ <>= plot(eigensystem, plots="allLines", figure=TRUE, prefix="StarvationData") eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,11,12,14:24)) @ After removal of noise and the organism- and nutrient-inspecific starvation response, we investigated the variance in the data. Contrary to the two previous case studies, no steady-scale variance was present in the data and therefore no eigenfeatures were removed at the variance level. <>= eigensystem <- compute(eigensystem, apply='variance') plot(eigensystem, plots="lines", figure=TRUE) eigensystem <- exclude(eigensystem, excludeEigenfeatures=0) @ Finally, the summary report and graphs are generated, with use of species information for the coloring of the assays. The metabolite x assay heatmap with metabolites sorted according to eigenfeatures 1 and 2 is shown below, and allows determining the organism- and nutrient-specific metabolites. <>= report(eigensystem, colorIdAssays="Species", prefix="StarvationData") @ <>= eigenfeature.xaxis <- 2 eigenfeature.yaxis <- 1 colorIdAssays <- assayMatrix(eigensystem)[,"Species"] unique.col.ids <- sort(unique(colorIdAssays), na.last=NA) col.assays <- rep(0,ncol(matrix(eigensystem))) col.map <- rainbow(length(unique.col.ids)) for (z in c(1:length(unique.col.ids))) {col.assays[which(colorIdAssays %in% unique.col.ids[z])] <- col.map[z]} colorIdFeatures <- rep(1,nrow(matrix(eigensystem))) eigensystem.sorted <- sort(eigensystem, decreasing=FALSE, eigenfeature.xaxis, eigenfeature.yaxis, colorIdFeatures) unique.row.ids <- sort(unique(colorIdFeatures), na.last=NA) col.features <- rep(0,nrow(matrix(eigensystem))) row.map <- rainbow(length(unique.row.ids)) for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures(eigensystem.sorted) %in% unique.row.ids[z])] <- row.map[z]} contrast <- 3 pal <- colorRampPalette(c(rgb(1,0,0), rgb(0,1,0)), space="rgb") contrastMatrix <- contrast*matrix(eigensystem.sorted) contrastMatrix[which(contrastMatrix>1)] <- 1 contrastMatrix[which(contrastMatrix<(-1))] <- -1 contrastMatrix <- (contrastMatrix - min(contrastMatrix))/(max(contrastMatrix)-min(contrastMatrix)) heatmap.2(contrastMatrix, Rowv=NA, Colv=NA, RowSideColors=col.features, ColSideColors=col.assays, scale="none", dendrogram="none", col=pal, trace="none", xlab="Assays", ylab="Features", labRow=NA, margins=c(9,3), main="StarvationData", key=TRUE) legend("left",legend=c("Assay annotation",as.character(unique.col.ids),"","Feature annotation",as.character(unique.row.ids)),fill=c("white",col.map,"white","white",row.map), bty="n", border=FALSE, cex=0.7, y.intersp=0.7) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Info} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% These analyses were done using the following versions of R, the operating system, and add-on packages: <>= toLatex(sessionInfo(), locale=FALSE) @ \begin{thebibliography}{9} \bibitem{Alter00} Alter O, Brown PO, and Botstein D. \emph{Singular value decomposition for genome-wide expression data processing and modeling}. Proc Nat Acad Sci U.S.A. 97(18), 10101-10106 (2000). \bibitem{Spellman98} Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, and Futcher B. \emph{Comprehensive identification of cell cycle-regulated genes of the Yeast {\it Saccharomyces cerevisiae} by microarray hybridization}. Mol Biol Cell 9, 3273-3297 (1998). \bibitem{Whitfield02} Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, and Botstein D. \emph{Identification of genes periodically expressed in the human cell cycle and their expression in tumors}. Mol Biol Cell 13, 1977-2000 (2002). \bibitem{Brauer06} Brauer MJ, Yuan J, Bennett BD, Lu W, Kimball E, Botstein D, and Rabinowitz JD. \emph{Conservation of the metabolomic response to starvation across two divergent microbes}. Proc Nat Acad Sci U.S.A. 103(51), 19302-19307 (2006). \end{thebibliography} \end{document}