% \VignetteIndexEntry{biosvd}
% \VignetteDepends{}
% \VignetteKeywords{biosvd}
% \VignettePackage{biosvd}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{Sweave}
\usepackage{amssymb}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{authblk}

\textwidth=6.5in
\textheight=8.5in
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\title{The biosvd package for high-throughput data processing, outlier
  detection, noise removal and dynamic modeling}
\author[1]{Anneleen Daemen\thanks{daemena@gene.com}}
\author[1]{Matthew Brauer\thanks{matthejb@gene.com}}
\affil[1]{Department of Bioinformatics and Computational Biology,
  Genentech Inc., South San Francisco, CA}
\date{\today}
\renewcommand\Authands{ and }

\begin{document}

\maketitle
\tableofcontents
\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Singular Value Decomposition (SVD) is a popular method, suited for longitudinal data processing and modeling. Simply stated, SVD decomposes data to a linear combination of main major modes of intensity. SVD for the analysis of genome-wide expression data was introduced by Alter and colleagues in 2000, with the major modes referred to as eigengenes and eigenarrays [Alter et al, 2000]. Sorting the expression data to these modes revealed clusters of genes or arrays with similar function or biologic phenotype. Here, we extend the application of SVD to any data type. This BioConductor R package allows for high-throughput data processing, outlier detection, noise removal and dynamic modeling, based on the framework of Singular Value Decomposition. It provides the user with summary graphs and an interactive html report. This gives a global picture of the dynamics of expression/intensity levels, in which individual features and assays are classified in groups of similar regulation and function or similar cellular state and biological phenotype.

\subsection{Singular Value Decomposition (SVD)}
Singular Value Decomposition is a linear transformation of a data set
$E$ from the $M$ features x $N$ assays space to a reduced $L$
eigenfeatures x $L$ eigenassays space, with \(L=\min \{M,N\}\). In
mathematical terms, this corresponds to \(E = U \Sigma V^T\). $U$ and
$V^T$ define the $M$ features x $L$ eigenassays and the $L$
eigenfeatures x $N$ assays orthonormal basis sets. Each column in $U$
corresponds to a left singular vector, representing genome-wide
expression, proteome-wide abundance or metabolome-wide intensity in the $k$-th
eigenassay. Accordingly, each row in $V^T$ is called a right singular
vector and represents the expression, abundance or intensity of the
$k$-th eigenfeature across all assays.

\vspace{3 mm}
$\Sigma$ is a diagonal matrix
with expression of each eigenfeature restricted to the corresponding
eigenassay, reflecting the decoupling and decorrelation of the
data. The eigenexpression levels along the diagonal
indicate the relative significance of each $\{$eigenfeature,
  eigenassay$\}$-pair. The relative fraction of overall expression that
the $k$-th eigenfeature and eigenassay capture is called {\it
  eigenexpression fraction} and is defined as \(f_l = \epsilon^2_l /
\sum_{k=1}^L \epsilon_k^2\). Finally, data complexity is expressed as
the Shannon entropy, defined as \(0 \leqslant \frac{-1}{log(L)} \sum_{k=1}^L f_k log(f_k) \leqslant 1\). An entropy of 0 corresponds to
an ordered and redundant data set, with all expression captured by a
single $\{$eigenfeature, eigenassay$\}$-pair. On the other hand, the entropy is 1 in case of a disordered and random data set with all
$\{$eigenfeature, eigenassay$\}$-pairs equally expressed. We refer to
\cite{Alter00} for a more detailed description of SVD.

\subsection{biosvd Package Overview}
\includegraphics{biosvd-package-overview.png}

\vspace{3 mm}
The biosvd package consists of 4 main functions. First, \tt
compute \rm reduces the input data set from the feature x
assay space to the reduced diagonalized eigenfeature x eigenassay
space,  with the eigenfeatures and eigenassays unique orthonormal superpositions of the
features and assays, respectively. Results of SVD applied to the data
can subsequently be inspected based on the graphs generated with \tt
plot\rm. These graphs include a heatmap of the
eigenfeature x assay matrix ($V^T$), a bar plot with the eigenexpression
fractions of all eigenfeatures, and the levels of the eigenfeatures
across the assays. These graphs aid in deciding which eigenfeatures
and eigenassays to filter out (i.e., eigenfeatures representing steady
state, noise, or experimental artifacts) (\tt exclude\rm). Filtering out steady-state
expression/intensity corresponds to centering the
expression/intensity patterns at steady-state level (arithmetic mean
of intensity $\sim$ 0).

\vspace{3 mm}
Secondly, the three functions \tt compute\rm, \tt
plot \rm and \tt exclude \rm can be applied to the
variance in the data, in order to filter out steady-scale
variance. This corresponds to a normalization by the steady scale of
expression/intensity variance (geometric mean of variance $\sim$ 1).

\vspace{3 mm}
Thirdly, after possible removal of steady state expression,
steady-scale variance, noise and experimental artifacts, SVD is
re-applied to the normalized data, followed by the generation of a
summary report with \tt report\rm. The generated html report of the
eigensystem contains polar plots of the assays and features, displaying the features/assays according to their
correlation with two selected eigenfeatures/eigenassays, and a table with the
list of features, sortable according to their coordinates, radius and
phase in the polar plot. This function also generates a visualization
of the data sorted according to the two selected eigenfeatures and
eigenassays, with a heatmap of the feature x assay matrix with colored
feature/assay annotation information when provided, a heatmap of the
feature x eigenassay matrix, and the expression/intensity levels of the two sorted
eigenassays across all features.

\subsection{Case Studies Overview}
In this vignette, three case studies are provided. In the first 2 case
studies, the expression pattern of genes throughout the cell cycle
are studied in yeast and human, respectively. As use of this package
is not restricted to expression data, the third case study focuses on
cellular metabolites in bacteria and yeast after carbon and nitrogen starvation.

\vspace{3 mm}
The essential data must be provided as a feature x assay matrix, a
data frame, ExpressionSet or an object from class eigensystem obtained from a former run. For
the examples provided in this vignette, ExpressionSets were created
containing the gene x sample expression data with gene annotation and
sample information, or the metabolite x assay intensity data with
metabolite annotation and assay information.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Case Study 1: Yeast Cell Cycle Expression}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

As a first example, Spellman and colleagues created a comprehensive
catalog of genes in {\it Saccharomyces cerevisiae} whose
transcript levels vary periodically within the cell cycle \cite{Spellman98}. To this end, mRNA levels in samples from yeast cultures were synchronized in
G1 phase with $\alpha$ factor arrest. After release of the $\alpha$
factor, cells were sampled every 7 minutes over a timespan of 140
minutes, during which the cells synchronously completed two cell
cycles. The gene x sample expression data comprise the
(un-logtransformed) ratio of gene
expression to reference mRNA from an asynchronous yeast culture. For
each sample, the cell cycle phase is known as determined by Spellman
{\it et al}. For 800 cell cycle-regulated genes, the phase
in which these genes reach their peak expression was determined by
Spellman {\it et al} based on published timing of the expression of
known cell cycle-regulated genes.

\vspace{3 mm}
We start the example by loading the data and all required libraries:

<<yeast_data_import>>=
library(biosvd)
data(YeastData_alpha)
YeastData
@ 

The input data set is first reduced from the gene x sample space to
the reduced diagonalized eigenfeature x eigenassay space.

<<yeast_compute_eigensystem_data>>=
eigensystem <- compute(YeastData)
@ 

The eigenfeatures and eigenassays can subsequently be inspected based on the graphs generated with
\tt plot\rm. Up to 5 figures are displayed (\tt fig=TRUE\rm) or saved
as a pdf file (default \tt fig=FALSE\rm), using the \tt plots \rm argument as follows:
\begin{itemize}
\item {\bf fraction}: bar plot with the eigenexpression fractions of all eigenfeatures
\item {\bf zoomedFraction}: bar plot with the eigenexpression fractions of the eigenfeatures
after removal of the dominant eigenfeature(s)
\item {\bf heatmap}: heatmap of the eigenfeature x assay matrix with use of
  a given contrast factor
\item {\bf lines}: expression/intensity levels of eigenfeatures 1 to 4
  across the assays
\item {\bf allLines}: expression/intensity levels of all eigenfeatures
  across the assays
\end{itemize}

For this example, the bar plot with all eigenfeatures (\tt fraction\rm), the expression
levels of all eigenfeatures across the samples (\tt allLines\rm), and the heatmap of the
eigenfeature x assay matrix (\tt heatmap\rm) are generated, with {\it
YeastData} as prefix for visualization purpose. The bar
plot shows that the first eigenfeature captures 89\% of the overall
relative expression in the experiment. The entropy of the data set is
therefore low (0.18 $\ll$ 1). The expression level of the first
eigenfeature across the samples as displayed in the \tt allLines \rm
graph shows a time-invariant relative expression during the cell
cycle. The low entropy in combination with the steady-state expression
captured in the first eigenfeature suggests that the underlying
processes are manifested by weak perturbations of a steady state of
expression. The second eigenfeature
describes an initial transient increase in relative expression
superimposed over time-invariant relative expression, and is therefore
inferred to represent response to synchronization in the cell
cycle. Inspecting the remaining eigenfeature patterns across the samples
reveals that eigenfeature 8 and 10 to 18 all show rapidly varying
relative expression during the cell cycle. They can therefore be
considered as noise. The heatmap of the eigenfeatures by samples
reveals the same information, with a clear constant expression for
eigenfeature 1.

<<yeast_plot_eigensystem_data_fraction, fig=TRUE>>=
plot(eigensystem, plots="fraction", figure=TRUE)
fractions(eigensystem)[[1]]
@ 

<<yeast_plot_eigensystem_data_lines, fig=TRUE>>=
plot(eigensystem, plots="allLines", figure=TRUE)
@ 

<<yeast_plot_eigensystem_data_heatmap, fig=TRUE>>=
plot(eigensystem, plots="heatmap", figure=TRUE, prefix="YeastData")
@ 

We now use \tt exclude \rm to filter out steady-state
expression captured by eigenfeature 1, an experimental artifact
captured by eigenfeature 2 (i.e. initial response to synchronization
in the cell cycle), and noise captured by eigenfeatures 8 and 10 to 18.

<<yeast_remove_eigenfeature_data>>=
eigensystem <- exclude(eigensystem,excludeEigenfeatures=c(1,2,8,10:18))
@ 

Subsequently, we apply the same strategy to the variance in the
data. The first eigenfeature now captures 88\% of overall information,
representing a time-invariant scale of expression variance, with an
entropy of 0.23. This eigenfeature is therefore removed from the data
set. Besides exclusion of the specified eigenfeatures, \tt exclude \rm regenerates the
eigensystem for the normalized expression data after removal of
steady-scale variance.

<<yeast_compute_eigensystem_variance, fig=TRUE>>=
eigensystem <- compute(eigensystem, apply='variance')
entropy(eigensystem)
fractions(eigensystem)[[1]]
plot(eigensystem, plots="lines", figure=TRUE)
eigensystem <- exclude(eigensystem, excludeEigenfeatures=1)
@ 

Now that steady-state expression, steady-scale variance, experimental artifacts and noise have
been removed, a summary html report and key graphs are generated. For
the coloring of the genes and samples in the polar plots, the cell cycle phase information from the ExpressionSet \tt
YeastData \rm is used. The polar plots show the genes and samples in the subspace
spanned by two selected eigenfeatures and eigenassays,
respectively. Because the first and second eigenfeature capture
together more than 45\% of the overall normalized expression, these
eigenfeatures were used as subspace (default). These two \{eigenfeature,
eigenassay\}-pairs are sufficient to approximate the expression data
when genes and samples have most of their normalized expression in
this subspace (i.e., $0.5 <$ radius $< 1$).
 
<<yeast_generate_report, keep.source=FALSE>>=
fractions(eigensystem)[c(1,2)]
report(eigensystem, colorIdAssays="Cell.cycle.stage", colorIdFeatures="Cell.cycle.stage", prefix="YeastData")
@ 

<<yeast_polarplot_assays, fig=TRUE, echo=FALSE>>=
library(grid)
eigenfeature.xaxis <- 2
eigenfeature.yaxis <- 1
colorIdAssays <- assayMatrix(eigensystem)[,"Cell.cycle.stage"]
unique.col.ids <- sort(unique(colorIdAssays), na.last=NA)
col.assays <- rep(0,ncol(matrix(eigensystem)))
col.map <- rainbow(length(unique.col.ids))
for (z in c(1:length(unique.col.ids))) {col.assays[which(colorIdAssays %in% unique.col.ids[z])] <- col.map[z]}

coordinates.assays <- base::matrix(0,nrow=ncol(matrix(eigensystem)),ncol=2)
for (z in c(1:ncol(eigenassays(eigensystem)))) {coordinates.assays[z,] <- c(assaycorrelations(eigensystem)[eigenfeature.xaxis,z]/sqrt(matrix(eigensystem)[,z] %*% matrix(eigensystem)[,z]), assaycorrelations(eigensystem)[eigenfeature.yaxis,z]/sqrt(matrix(eigensystem)[,z] %*% matrix(eigensystem)[,z]))}
radii.assays <- signif(sqrt(coordinates.assays[,1]^2+coordinates.assays[,2]^2),3)
names(radii.assays) <- colnames(matrix(eigensystem))
phase.assays <- atan(assaycorrelations(eigensystem)[eigenfeature.yaxis,]/assaycorrelations(eigensystem)[eigenfeature.xaxis,])/pi
names(phase.assays) <- colnames(matrix(eigensystem))
coordinates.assays <- signif(coordinates.assays,3)
rownames(coordinates.assays) <- colnames(matrix(eigensystem))
  
vp0 <- viewport(x=0,width=0.05,just="left",name="vp0")
vp1 <- viewport(x=0.1,y=0.1,width=0.75,height=0.75,just=c("left","bottom"),name="vp1")
vp2 <- viewport(x=0.1,y=0,width=0.75,height=0.1,just=c("left","bottom"),name="vp2")
vp3 <- viewport(x=1,width=0.2,just="right",name="vp3")
pushViewport(vp0)
grid.text(paste("Assay correlation with eigenassay ",eigenfeature.yaxis,sep=""), y=0.5, rot=90)
upViewport()
pushViewport(vp1)
grid.circle(x=0.5,y=0.5,r=0.5,gp=gpar(lty="dashed"))
grid.circle(x=0.5,y=0.5,r=0.25,gp=gpar(lty="dashed",fill="grey"))
grid.lines(x=unit(c(0,1),"npc"),y=unit(c(0.5,0.5),"npc"),arrow=NULL)
grid.lines(x=unit(c(0.5,0.5),"npc"),y=unit(c(0,1),"npc"),arrow=NULL)
for (z in c(1:length(unique.col.ids))) {
  indices <- which(colorIdAssays %in% unique.col.ids[z])
  grid.points(x=unit((coordinates.assays[indices,1]+1)/2,"npc"),y=unit((coordinates.assays[indices,2]+1)/2,"npc"), pch=z,gp=gpar(col=col.map[z],cex=0.7))
}
grid.text(c(1:length(colorIdAssays)),just="left",x=(coordinates.assays[,1]+1)/2+0.02,y=(coordinates.assays[,2]+1)/2)
grid.lines(x=unit(c(0.5,(coordinates.assays[1,1]+1)/2),"npc"),y=unit(c(0.5,(coordinates.assays[1,2]+1)/2),"npc"),arrow=arrow(angle=30, length=unit(0.02,"npc"),ends="last",type="open"))
upViewport()
pushViewport(vp2)
grid.text(paste("Assay correlation with eigenassay ",eigenfeature.xaxis,sep=""), x=0.5, y=0.5)
upViewport()
pushViewport(vp3)
grid.points(pch=1:length(unique.col.ids),x=unit(rep(0.5,length(unique.col.ids)),"lines"),y=unit(1,"npc")-unit(c(1:length(unique.col.ids)),"lines"),gp=gpar(col=col.map))
grid.text(unique.col.ids,just="left",x=unit(rep(1.5,length(unique.col.ids)),"lines"), y=unit(1,"npc")-unit(c(1:length(unique.col.ids)),"lines"),gp=gpar(col=col.map))
upViewport()
@ 

<<yeast_polarplot_features, fig=TRUE, echo=FALSE>>=
colorIdFeatures <- featureMatrix(eigensystem)[,"Cell.cycle.stage"]
unique.row.ids <- sort(unique(colorIdFeatures), na.last=NA)
col.features <- rep(0,nrow(matrix(eigensystem)))
row.map <- rainbow(length(unique.row.ids))
for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures %in% unique.row.ids[z])] <- row.map[z]}

coordinates.features <- base::matrix(0,nrow=nrow(matrix(eigensystem)),ncol=2)
for (z in c(1:nrow(matrix(eigensystem)))) {coordinates.features[z,] <- c(featurecorrelations(eigensystem)[eigenfeature.xaxis,z]/sqrt(matrix(eigensystem)[z,] %*% matrix(eigensystem)[z,]), featurecorrelations(eigensystem)[eigenfeature.yaxis,z]/sqrt(matrix(eigensystem)[z,] %*% matrix(eigensystem)[z,]))}
radii.features <- signif(sqrt(coordinates.features[,1]^2+coordinates.features[,2]^2),3)
names(radii.features) <- rownames(matrix(eigensystem))
phase.features <- atan(featurecorrelations(eigensystem)[eigenfeature.yaxis,]/featurecorrelations(eigensystem)[eigenfeature.xaxis,])/pi
names(phase.features) <- rownames(matrix(eigensystem))
coordinates.features <- signif(coordinates.features,3)
rownames(coordinates.features) <- rownames(matrix(eigensystem))
phase.features.converted <- phase.features*pi
phase.features.converted[which(coordinates.features[,1]<0)] <- phase.features.converted[which(coordinates.features[,1]<0)]+pi
phase.features.converted[which(coordinates.features[,1]>0 & coordinates.features[,2]<0)] <- phase.features.converted[which(coordinates.features[,1]>0 & coordinates.features[,2]<0)]+(2*pi)
phase.features.converted <- signif(phase.features.converted,3)

vp0 <- viewport(x=0,width=0.05,just="left",name="vp0")
vp1 <- viewport(x=0.1,y=0.1,width=0.75,height=0.75,just=c("left","bottom"),name="vp1")
vp2 <- viewport(x=0.1,y=0,width=0.75,height=0.1,just=c("left","bottom"),name="vp2")
vp3 <- viewport(x=1,width=0.2,just="right",name="vp3")
pushViewport(vp0)
grid.text(paste("Feature correlation with eigenfeature ",eigenfeature.yaxis,sep=""), y=0.5, rot=90)
upViewport()
pushViewport(vp1)
grid.circle(x=0.5,y=0.5,r=0.5,gp=gpar(lty="dashed"))
grid.circle(x=0.5,y=0.5,r=0.25,gp=gpar(lty="dashed",fill="grey"))
grid.lines(x=unit(c(0,1),"npc"),y=unit(c(0.5,0.5),"npc"),arrow=NULL)
grid.lines(x=unit(c(0.5,0.5),"npc"),y=unit(c(0,1),"npc"),arrow=NULL)
for (z in c(1:length(unique.row.ids))) {
  indices <- which(colorIdFeatures %in% unique.row.ids[z])
  grid.points(x=unit((coordinates.features[indices,1]+1)/2,"npc"),y=unit((coordinates.features[indices,2]+1)/2,"npc"), pch=z,gp=gpar(col=row.map[z],cex=0.7))
}
grid.lines(x=unit(c(0.5,(coordinates.features[1,1]+1)/2),"npc"),y=unit(c(0.5,(coordinates.features[1,2]+1)/2),"npc"),arrow=arrow(angle=30, length=unit(0.02,"npc"),ends="last",type="open"))
upViewport()
pushViewport(vp2)
grid.text(paste("Feature correlation with eigenfeature ",eigenfeature.xaxis,sep=""), x=0.5, y=0.5)
upViewport()
pushViewport(vp3)
grid.points(pch=1:length(unique.row.ids),x=unit(rep(0.5,length(unique.row.ids)),"lines"),y=unit(1,"npc")-unit(c(1:length(unique.row.ids)),"lines"),gp=gpar(col=row.map))
grid.text(unique.row.ids,just="left",x=unit(rep(1.5,length(unique.row.ids)),"lines"), y=unit(1,"npc")-unit(c(1:length(unique.row.ids)),"lines"),gp=gpar(col=row.map))
upViewport()
@ 

As can be seen on the assay polar plot, eigenassay 1 is positively
associated with mainly samples from G1 phase, whilst this eigenassay
is negatively associated with samples from the S, G2 and M
phases. Eigenassay 2 is positively associated with G1 and S, whilst
negatively associated with M and M/G1. The right upper
quadrant is therefore dominated by samples from G1, the right lower
quadrant by S and G2, and the left lower quadrant by M. Genes in each
of these quadrants in the feature polar plot can subsequently be
linked to and interpreted in function of each of these cell cycle
phases. The genes are colored according to their expression
correlation with known cell cycle-regulated genes (as determined by
Spellman {\it et al}), with mainly G1 genes in the right upper
quadrant, S and G2 genes in the right lower quadrant, and M genes in
the left lower quadrant, confirming the findings obtained with the
biosvd package.

<<yeast_sortedheatmap, fig=TRUE, echo=FALSE>>=
library(gplots)
eigensystem.sorted <- sort(eigensystem, decreasing=FALSE, eigenfeature.xaxis=eigenfeature.xaxis, eigenfeature.yaxis=eigenfeature.yaxis, colorIdFeatures=factor(colorIdFeatures))
col.features <- rep(0,nrow(matrix(eigensystem)))
for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures(eigensystem.sorted) %in% unique.row.ids[z])] <- row.map[z]}
contrast <- 3  
pal <- colorRampPalette(c(rgb(1,0,0), rgb(0,1,0)), space="rgb")
contrastMatrix <- contrast* matrix(eigensystem.sorted)
contrastMatrix[which(contrastMatrix>1)] <- 1
contrastMatrix[which(contrastMatrix<(-1))] <- -1
contrastMatrix <- (contrastMatrix - min(contrastMatrix))/(max(contrastMatrix)-min(contrastMatrix))
heatmap.2(contrastMatrix, Rowv=NA, Colv=NA, RowSideColors=col.features, ColSideColors=col.assays, scale="none", dendrogram="none", col=pal, trace="none", xlab="Assays", ylab="Features", labRow=NA, margins=c(9,3), main="YeastData", key=TRUE)
legend("left",legend=c("Assay annotation",as.character(unique.col.ids),"","Feature annotation",as.character(unique.row.ids)),fill=c("white",col.map,"white","white",row.map), bty="n", border=FALSE, cex=0.7, y.intersp=0.7)
@ 

While generating the html report, data were also sorted according to
the first and second eigenfeature. The gene x sample heatmap with
genes sorted according to the selected eigenfeatures shows a traveling
wave of expression throughout the cell cycle. This visualization
further aids in interpreting how genes are regulated by or function in
the cell cycle. In the generated html report, the genes can be sorted
according to their coordinates, radius and phase in the polar plot, or
selected based on annotation information that was provided with the
input ExpressionSet.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Case Study 2: Human HeLa Cell Cycle Expression}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Secondly, a similar experiment was performed in the human HeLa
cervical carcinoma cell line \cite{Whitfield02}. Cells were arrested at the beginning of S phase by using a
double thymidine block. Upon release from the thymidine block, cells were sampled every
1-2 hours for 44 hours during which the cells completed three cell
cycles. Similar as for the Yeast experiment, expression data comprise
the un-logtransformed ratio of gene expression to reference mRNA from
an asynchronous HeLa culture. Moreover, cell cycle phase is known for each
sample. For >850 genes that were identified by
Whitfield {\it et al} to be periodically expressed during the cell
cycle, the phase was determined based on correlation with genes known
to be expressed in each cell cycle phase (e.g. {\it cyclin E1} at the G1/S
boundary, {\it RAD51} in S phase, and {\it TOP2A} in G2).

\vspace{3 mm}
Similar as for the first case study, we load the HeLa data and compute the
eigensystem. In this case, the first eigenfeature captures more than 90\% of the
relative expression, with an entropy of 0.20. As can be seen on the
plot of the expression level of eigenfeatures across samples,
this eigenfeature represents steady-state expression. Besides
eigenfeature 1, we also decided to remove eigenfeatures 7, 10, 11 and
12, all showing rapidly varying expression during the cell cycle.  

<<hela_compute_eigensystem_data, fig=TRUE>>=
data(HeLaData_exp_DoubleThym_2)
HeLaData
eigensystem <- compute(HeLaData)
fractions(eigensystem)[[1]]
entropy(eigensystem)
plot(eigensystem, plots="allLines", figure=TRUE)
eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,7,10:12))
@ 

As a second step, we apply the same strategy to the variance in the
data. A time-invariant scale of expression variance was captured by
the first eigenfeature and therefore removed. Regarding the plots
generated by \tt plot\rm, these are not shown but saved as pdf
files because the \tt figure \rm argument is by default \tt FALSE\rm.

<<hela_compute_eigensystem_variance>>=
eigensystem <- compute(eigensystem, apply='variance')
entropy(eigensystem)
fractions(eigensystem)[[1]]
plot(eigensystem, plots=c("heatmap","fraction","lines"), prefix="HeLaData")
eigensystem <- exclude(eigensystem, excludeEigenfeatures=1)
@ 

As final step, we now generate the html report with polar plots and a
visualization of the expression data after sorting according to the
first and second eigenfeature. Similar as before, the cell cycle phase
information for both the genes and samples from the ExpressionSet \tt
HeLaData \rm is used for coloring of the polar plots and the sorted
heatmaps. The gene x sample heatmap with
genes sorted according to eigenfeatures 1 and 2 is shown below and displays a traveling
wave of expression throughout the cell cycle. 

<<hela_generate_report, keep.source=FALSE>>=
report(eigensystem, colorIdAssays="Cell.cycle.stage", colorIdFeatures="Cell.cycle.stage", prefix="HeLaData")
@ 

<<hela_sortedheatmap, fig=TRUE, echo=FALSE>>=
eigenfeature.xaxis <- 2
eigenfeature.yaxis <- 1
colorIdAssays <- assayMatrix(eigensystem)[,"Cell.cycle.stage"]
unique.col.ids <- sort(unique(colorIdAssays), na.last=NA)
col.assays <- rep(0,ncol(matrix(eigensystem)))
col.map <- rainbow(length(unique.col.ids))
for (z in c(1:length(unique.col.ids))) {col.assays[which(colorIdAssays %in% unique.col.ids[z])] <- col.map[z]}

eigensystem.sorted <- sort(eigensystem, decreasing=FALSE, eigenfeature.xaxis, eigenfeature.yaxis, "Cell.cycle.stage")
unique.row.ids <- sort(unique(colorIdFeatures), na.last=NA)
col.features <- rep(0,nrow(matrix(eigensystem)))
row.map <- rainbow(length(unique.row.ids))
for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures(eigensystem.sorted) %in% unique.row.ids[z])] <- row.map[z]}
contrast <- 3  
pal <- colorRampPalette(c(rgb(1,0,0), rgb(0,1,0)), space="rgb")
contrastMatrix <- contrast* matrix(eigensystem.sorted)
contrastMatrix[which(contrastMatrix>1)] <- 1
contrastMatrix[which(contrastMatrix<(-1))] <- -1
contrastMatrix <- (contrastMatrix - min(contrastMatrix))/(max(contrastMatrix)-min(contrastMatrix))
heatmap.2(contrastMatrix, Rowv=NA, Colv=NA, RowSideColors=col.features, ColSideColors=col.assays, scale="none", dendrogram="none", col=pal, trace="none", xlab="Assays", ylab="Features", labRow=NA, margins=c(9,3), main="HeLaData", key=TRUE)
legend("left",legend=c("Assay annotation",as.character(unique.col.ids),"","Feature annotation",as.character(unique.row.ids)),fill=c("white",col.map,"white","white",row.map), bty="n", border=FALSE, cex=0.7, y.intersp=0.7)
@ 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Case Study 3: Starvation Metabolomics}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

To show that use of this R package is not restricted to
expression data, this third example features metabolomics
data. Brauer and colleagues studied metabolic response to starvation
in two microbes, {\it Escherichia coli} and {\it Saccharomyces
  cerevisae}, to determine whether metabolome response to nutrient
deprivation is similar across both organisms \cite{Brauer06}. Sixty-eight cellular
metabolites were analyzed by LC-MS/MS in both bacteria and yeast, after nutrient starvation
with carbon and nitrogen. Cells were sampled for 8 hours. The
metabolomics data comprise the log-transformed relative metabolite
concentration changes with respect to experiment initiation at time point 0 hours.

\vspace{3 mm}
This case study not only differs from the two previous case studies in
data type, but also in heterogeneity with four experiments combined into
this data set (bacteria - carbon, bacteria - nitrogen, yeast - carbon,
and yeast - nitrogen). This increased heterogeneity in the data explains the
much higher entropy compared to the two previous studies (0.51), with
the first and second eigenfeature capturing 42\% and 29\% of the
relative intensity, respectively. Contrary to the two
previous case studies, eigenfeature 1 no longer captures steady-state expression but
shows a decreasing trend over time for each of the experiments, regardless of
organism (B for bacteria vs. Y for yeast) and nutrient (C for carbon
vs. N for nitrogen). Because we are not interested in a generic
starvation response, we decided to remove the first eigenfeature at the
metabolite intensity level. Furthermore, eigenfeatures 11, 12, and 14
to 24 rapidly vary along the assays and can therefore be considered as noise.

<<starvation_compute_eigensystem_data, eval=FALSE, keep.source=FALSE>>=
data(StarvationData)
StarvationData
eigensystem <- compute(StarvationData)
fractions(eigensystem)[c(1,2)]
plot(eigensystem, plots=c("fraction","lines","allLines"), figure=TRUE, prefix="StarvationData")
eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,11,12,14:24))
@ 

<<starvation_plot_eigensystem_1, fig=TRUE, echo=FALSE>>=
data(StarvationData)
StarvationData
eigensystem <- compute(StarvationData)
fractions(eigensystem)[c(1,2)]
plot(eigensystem, plots="fraction", figure=TRUE, prefix="StarvationData")
@ 

<<starvation_plot_eigensystem_2, fig=TRUE, echo=FALSE>>=
plot(eigensystem, plots="lines", figure=TRUE, prefix="StarvationData")
@ 

<<starvation_plot_eigensystem_3, fig=TRUE, echo=FALSE>>=
plot(eigensystem, plots="allLines", figure=TRUE, prefix="StarvationData")
eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,11,12,14:24))
@ 

After removal of noise and the organism- and nutrient-inspecific starvation
response, we investigated the variance in the data. Contrary to the
two previous case studies, no steady-scale variance was present in the
data and therefore no eigenfeatures were removed at the variance level.

<<starvation_compute_eigensystem_variance, fig=TRUE>>=
eigensystem <- compute(eigensystem, apply='variance')
plot(eigensystem, plots="lines", figure=TRUE)
eigensystem <- exclude(eigensystem, excludeEigenfeatures=0)
@ 

Finally, the summary report and graphs are generated, with use of
species information for the coloring of the assays. The metabolite x
assay heatmap with metabolites sorted according to eigenfeatures 1 and
2 is shown below, and allows determining the organism- and
nutrient-specific metabolites. 

<<starvation_generate_report>>=
report(eigensystem, colorIdAssays="Species", prefix="StarvationData")
@ 

<<starvation_sortedheatmap, fig=TRUE, echo=FALSE>>=
eigenfeature.xaxis <- 2
eigenfeature.yaxis <- 1
colorIdAssays <- assayMatrix(eigensystem)[,"Species"]
unique.col.ids <- sort(unique(colorIdAssays), na.last=NA)
col.assays <- rep(0,ncol(matrix(eigensystem)))
col.map <- rainbow(length(unique.col.ids))
for (z in c(1:length(unique.col.ids))) {col.assays[which(colorIdAssays %in% unique.col.ids[z])] <- col.map[z]}

colorIdFeatures <- rep(1,nrow(matrix(eigensystem)))
eigensystem.sorted <- sort(eigensystem, decreasing=FALSE, eigenfeature.xaxis, eigenfeature.yaxis, colorIdFeatures)
unique.row.ids <- sort(unique(colorIdFeatures), na.last=NA)
col.features <- rep(0,nrow(matrix(eigensystem)))
row.map <- rainbow(length(unique.row.ids))
for (z in c(1:length(unique.row.ids))) {col.features[which(colorIdFeatures(eigensystem.sorted) %in% unique.row.ids[z])] <- row.map[z]}
contrast <- 3  
pal <- colorRampPalette(c(rgb(1,0,0), rgb(0,1,0)), space="rgb")
contrastMatrix <- contrast*matrix(eigensystem.sorted)
contrastMatrix[which(contrastMatrix>1)] <- 1
contrastMatrix[which(contrastMatrix<(-1))] <- -1
contrastMatrix <- (contrastMatrix - min(contrastMatrix))/(max(contrastMatrix)-min(contrastMatrix))
heatmap.2(contrastMatrix, Rowv=NA, Colv=NA, RowSideColors=col.features, ColSideColors=col.assays, scale="none", dendrogram="none", col=pal, trace="none", xlab="Assays", ylab="Features", labRow=NA, margins=c(9,3), main="StarvationData", key=TRUE)
legend("left",legend=c("Assay annotation",as.character(unique.col.ids),"","Feature annotation",as.character(unique.row.ids)),fill=c("white",col.map,"white","white",row.map), bty="n", border=FALSE, cex=0.7, y.intersp=0.7)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Session Info}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

These analyses were done using the following versions of
R, the operating system, and add-on packages:
<<sessionInfo, echo=F, results=tex>>=
toLatex(sessionInfo(), locale=FALSE)
@


\begin{thebibliography}{9}

\bibitem{Alter00}
  Alter O, Brown PO, and Botstein D. \emph{Singular value decomposition for genome-wide expression
    data processing and modeling}. Proc Nat Acad Sci U.S.A. 97(18), 10101-10106 (2000).

\bibitem{Spellman98}
  Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB,
  Brown PO, Botstein D, and Futcher B. \emph{Comprehensive identification of cell cycle-regulated
    genes of the Yeast {\it Saccharomyces cerevisiae} by microarray
    hybridization}. Mol Biol Cell 9, 3273-3297 (1998).

\bibitem{Whitfield02}
  Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander
  KE, Matese JC, Perou CM, Hurt MM, Brown PO, and Botstein
  D. \emph{Identification of genes periodically expressed in the human cell cycle and their expression in tumors}. Mol Biol Cell 13, 1977-2000 (2002).

\bibitem{Brauer06}
  Brauer MJ, Yuan J, Bennett BD, Lu W, Kimball E, Botstein D, and
  Rabinowitz JD. \emph{Conservation of the metabolomic response to starvation
    across two divergent microbes}. Proc Nat Acad Sci U.S.A. 103(51), 19302-19307 (2006).

\end{thebibliography}
  
\end{document}