% \VignetteEngine{knitr::knitr}
% \VignetteIndexEntry{06. RNASeq Analysis}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\title{RNASeq Analysis}
\author{Martin T.\ Morgan\footnote{\url{mtmorgan@fhcrc.org}}}
\date{27-28 February 2014}

\newcommand{\Hsap}{\emph{H.~sapiens}}
\newcommand{\Dmel}{\emph{D.~melanogaster}}

\begin{document}

\maketitle
\tableofcontents

\section{General work flows}

\paragraph{A running example: the \emph{pasilla} data set}
As a running example, we use the \emph{pasilla} data set, derived from
\cite{pasilla}. The authors investigate conservation of RNA regulation
between \Dmel{} and mammals. Part of their study used RNAi and RNA-seq
to identify exons regulated by Pasilla (\emph{ps}), the \Dmel{} ortholog
of mammalian NOVA1 and NOVA2. Briefly, their experiment compared gene
expression as measured by RNAseq in S2-DRSC cells cultured with, or
without, a 444~bp dsRNA fragment corresponding to the \emph{ps} mRNA
sequence. Their assessment investigated differential exon use, but our
worked example will focus on gene-level differences.

For several examples we look at a subset of the \emph{ps} data,
corresponding to reads obtained from lanes of their RNA-seq experiment,
and to the same reads aligned to a \Dmel{} reference genome. Reads were
obtained from GEO and the Sequence Read Archive (SRA), and were aligned
to the \Dmel{} reference genome \emph{dm3} as described in the
\Rpackage{pasilla} experiment data package.

\paragraph{Work flow}
At a very high level, one can envision a work flow that starts with a
challenging biological question (how does \emph{ps} influence gene and
transcript regulation?). The biological question is framed in terms of
wet-lab protocols coupled with an appropriate and all-important
experimental design. There are several well-known statistical
challenges, common to any experimental data: What treatments are going
to be applied? How many replicates will there be of each? Is there
likely to be sufficient power to answer the biologically relevant
question? Reality is also important at this stage, as evidenced in the
\emph{pasilla} data where, as we will see, samples were collected using
different methods (single versus paired end reads) over a time when
there were rapid technological changes. Such reality often introduces
confounding factors that require appropriate statistical treatment in
subsequent analysis.

The work flow proceeds with data generation, involving both a wet-lab
(sample preparation) component and actual sequencing. It is essential
to acknowledge the biases and artifacts that are introduced at each of
these stages. Sample preparation involves non-trivial amounts of time
and effort. Large studies are likely to have batch effects (e.g.,
because work was done by different lab members, or with different
batches of reagent). Samples might have been prepared in ways that are
likely to influence down-stream analysis, e.g., using a protocol
involving PCR and hence introducing opportunities for sample-specific
bias. DNA isolation protocols may introduce many artifacts, e.g.,
non-uniform representation of reads across the length of expressed
genes in RNA-seq. The sequencing reaction itself is far from bias-free,
with known artifacts of called base frequency, cycle-dependent accuracy
and bias, non-uniform coverage, etc. At a minimum, the researcher needs
to be aware of the opportunities for bias that can be introduced during
sample preparation and sequencing.

The informatics component of work flows becomes increasingly important
during and after sequence generation. The sequencer is often treated as
a `black box', producing short reads consisting of 10's to 100's of
nucleotides and with associated quality scores. Usually, the chemistry
and informatics processing pipeline are sufficiently well documented
that one can arrive at an understanding of biases and quality issues
that might be involved; such an understanding is likely to be
particularly important when embarking on questions or using protocols
that are at the fringe of standard practice (where, after all, the
excitement is).

The first real data seen by users are \emph{fastq} files. These files
are often simple text files consisting of many millions of records, and
are described in greater detail earlier in the course. The center
performing the sequencing typically vets results for quality, but these
quality measures are really about the performance of their machines. It
is very important to assess quality with respect to the experiment
being undertaken: Are the numbers of reads consistent across samples?
Are the GC content and other observable aspects of the reads consistent
with expectation? Are there anomalies in the sequence results that
reflect primers or other reagents used during sample preparation? Are
well-known artifacts of the protocol used evident in the reads in hand?
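The \Biocpkg{ShortRead} package can produce a quality assessment report
addressing several of these questions. The following is a minimal
sketch, not part of the original work flow; the fastq directory and
file pattern are hypothetical.

<<fastq-qa, eval=FALSE>>=
## A sketch only; "/path/to/fastq" and the pattern are hypothetical
library(ShortRead)
qaSummary <- qa("/path/to/fastq", pattern="*.fastq", type="fastq")
## 'report' writes an HTML summary (read counts, base calls,
## per-cycle quality, ...) and returns its location
browseURL(report(qaSummary))
@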
The next step in many work flows involves alignment of reads to a
reference genome. There are many aligners available, including
\href{http://bio-bwa.sourceforge.net/}{BWA} \cite{pmid20080505},
\href{http://bowtie-bio.sourceforge.net/}{Bowtie} /
\href{http://bowtie-bio.sourceforge.net/bowtie2/}{Bowtie2}
\cite{pmid19261174}, and
\href{http://research-pub.gene.com/gmap/}{GSNAP}; merits of these are
discussed in the literature. \Bioconductor{} packages `wrapping' these
tools are increasingly common (e.g., \Biocpkg{Rbowtie},
\Biocpkg{gmapR}; \Biocpkg{cummeRbund} for parsing output of the
\software{cufflinks} transcript discovery pathway). There are also
alignment algorithms implemented in \Bioconductor{} (e.g.,
\Rfunction{matchPDict} in the \Biocpkg{Biostrings} package, and the
\Biocpkg{Rsubread} package); \Rfunction{matchPDict} is particularly
useful for flexible alignment of moderately sized subsets of data.
Most mainstream aligners produce output in `SAM' or `BAM' (binary
alignment) format. BAM files are the primary starting point for many
analyses, and their manipulation and use in \Bioconductor{} was
introduced earlier in the course.
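As a small illustration of working with BAM files, aligned reads can be
imported into \R{} as follows. This is a sketch under the assumption
that the \Biocpkg{GenomicAlignments} package (where this functionality
lives in current releases) is available; the file name is hypothetical.

<<read-bam, eval=FALSE>>=
## A sketch only; "sample.bam" is a hypothetical file
library(GenomicAlignments)
aln <- readGAlignments("sample.bam")
length(aln)         # number of aligned reads
table(strand(aln))  # reads per strand
@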
\section{RNA-seq: case study}

\subsection{Varieties of RNA-seq}

RNA-seq experiments typically ask about differences in transcription of
genes or other features across experimental groups. The analysis of
designed experiments is statistical, and hence an ideal task for \R.
The overall structure of the analysis, with tens of thousands of
features and tens of samples, is reminiscent of microarray analysis;
some insights from the microarray domain will apply, at least
conceptually, to the analysis of RNA-seq experiments.

The most straight-forward RNA-seq experiments quantify abundance for
known gene models. The known models are derived from reference
databases, reflecting the accumulated knowledge of the community
responsible for the data. The `knownGenes' track of the UCSC genome
browser represents one source of such data. A track like this
describes, for each gene, the transcripts and exons that are expected
based on current data. The \Biocpkg{GenomicFeatures} package allows
ready access to this information by creating a local database out of
the track information. This database of known genes is coupled with
high throughput sequence data by counting reads overlapping known genes
and modeling the relationship between treatment groups and counts.
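For example, \Dmel{} gene models can be loaded and organized as exons
grouped by gene. This sketch assumes the annotation package
\Biocannopkg{TxDb.Dmelanogaster.UCSC.dm3.ensGene} (one of several
possible sources of \Dmel{} gene models) is installed.

<<txdb-genes, eval=FALSE>>=
## A sketch only; assumes this dm3 annotation package is installed
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
## exons grouped by gene: a GRangesList, one element per gene
genes <- exonsBy(txdb, by="gene")
length(genes)  # number of gene models
@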
A more ambitious approach to RNA-seq attempts to identify novel
transcripts. This requires that sequenced reads be assembled into
contigs that, presumably, correspond to expressed transcripts that are
then located in the genome. Regions identified in this way may
correspond to known transcripts, to novel arrangements of known exons
(e.g., through alternative splicing), or to completely novel
constructs. We will not address the identification of completely novel
transcripts here, but will instead focus on the analysis of designed
experiments: do the transcript abundances, novel or otherwise, differ
between experimental groups?

\subsection{RNA-seq work flows}

RNA-seq work flows aim at measuring gene expression through assessment
of mRNA abundance. Work flows involve:
\begin{enumerate}
\item Experimental design.
\item Wet-lab protocols for mRNA extraction and reverse transcription
  to cDNA.
\item Sequencing; QA.
\item Alignment of sequenced reads to a reference genome; QA.
\item Summarizing the number of reads aligning to each region; QA.
\item Normalization of samples to accommodate purely technical
  differences in preparation.
\item Statistical assessment of differential representation, including
  specification of an appropriate error model.
\item Interpretation of results in the context of original biological
  questions; QA.
\end{enumerate}
The inference is that higher levels of gene expression translate to
more abundant cDNA, and greater numbers of reads aligned to the
reference genome. The enumeration above may seem simplistic, but each
step raises its own concerns and commentary.

\subsection{Wet-lab protocols, sequencing, and alignment}

The important point here is that wet-lab protocols, sequencing
reactions, and alignment introduce artifacts that need to be
acknowledged and, if possible, accommodated in down-stream analysis.
These artifacts and approaches to their remediation are discussed in
the following sections.

\section{Statistical issues}

Important statistical issues are summarized in
Table~\ref{tab:rnaseqissues}.

\begin{table}
  \centering
  \caption{Statistical issues in RNA-seq differential expression.}
  \label{tab:rnaseqissues}
  \begin{tabular}{lp{.7\textwidth}}
    Analysis stage & Issues \\
    \hline
    Experimental design & Replication, complexity, feasibility.\\
    Batch effects & Known and unknown factors. \\
    Summarize & Counts versus RPKM and other summaries. \\
    Normalize & Robust estimates of library size. \\
    Differential expression & Appropriate error model (negative
    binomial, Poisson, \ldots); dispersion (under the negative
    binomial) as a parameter requiring estimation; `shrinkage' to
    balance accuracy of per-gene estimates with precision of
    experiment-wide estimates.\\
    Testing & Filtering to reduce multiple comparisons \& false
    discovery rate.\\
    \hline
  \end{tabular}
\end{table}

\subsection{Experimental design}

\paragraph{Technical versus biological replication}
Obviously one should follow best practices for designing experiments
appropriate for the data under analysis. A typical experiment will have
one or several groups. Because there is uncertainty in each
measurement, we require replication. Previous work shows that technical
replication (repeating identical wet-lab and sequencing protocols on a
single biological sample) introduces variation that is
small~\cite{marioni2008rna} compared to biological replication (using
different samples). Most RNA-seq experiments require biological
replication, and seldom include technical replicates.

\paragraph{Sample size}
How many biological replicates? It is helpful to think in terms of
orders of magnitude: biological treatments with strong and consistent
consequences for gene expression will be detected with a handful -- 2
or 3 -- of replicates per treatment. Conversely, statistically subtle
effects will not be reliably revealed by, say, 5 or 8 samples, but will
instead require 10's or 100's of samples. The \Biocpkg{RNASeqPower}
package provides data-driven guidance on power calculations in RNA-seq
experiments; \Biocpkg{CSSP} provides ChIP-seq power calculations based
on Bayesian estimation for local counting processes.

\paragraph{Complexity}
How complicated an experimental design? The advice must be to `keep it
simple'. There are many interesting biological questions that one could
ask, but experimental designs with more than one or at most two
factors, or with many levels per factor, will undermine statistical
power and complicate analysis. There are exceptions of course, for
instance a time course design or an experiment with two or more
factors, but these require strong \emph{a priori} motivation and
confidence that the design is amenable to analysis even in the face of
wet-lab or sequencing catastrophe.

\paragraph{Feasibility of intended statistical analysis}
What kind of treatment? Two `lessons learned' from microarray analysis
and applicable to RNA-seq inform this question. (a) It is necessary to
normalize observations between samples to accommodate purely technical
variation in overall patterns of expression. For example, samples
provided to the sequencer have different amounts of DNA, resulting in
variation in total numbers of sequenced and aligned reads independent
of any difference in gene-level differential representation. This
implies that the treatment should \emph{affect only a fraction of the
  genes assayed}, otherwise treatment effects and protocol artifacts
are confounded. (b) Between-gene measures of expression differ for
reasons unrelated to levels of expression. For instance, standard
protocols mean that a long gene is sequenced more often than a short
gene, even when the numbers of mRNA molecules of the two genes are
identical. This means that the most productive approach to differential
representation will \emph{compare genes across samples}, rather than
compare levels of representation of different genes (gene set
enrichment analysis and other approaches to between-gene comparison are
statistically interesting in part because of the need to overcome
between-gene differences arising for purely technical reasons). The
combination of lessons (a) and (b) dictates that the treatment should
affect only a subset of the genes under study, and that `interesting'
results correspond to treatment groups with differences at the gene
level. \emph{A priori} motivation, e.g., about well-defined pathways as
targets of differential representation, may trump part (b) of this
guideline.
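Returning to the sample size question, the \Biocpkg{RNASeqPower}
package can translate assumptions about sequencing depth, biological
variability, and effect size into an approximate per-group sample size.
The following is a sketch only; the parameter values are illustrative,
not recommendations.

<<rnaseq-power, eval=FALSE>>=
## A sketch only; depth, cv, and effect values are illustrative
library(RNASeqPower)
## per-group n for 20x coverage, biological CV 0.4, a 2-fold change,
## alpha = 0.05, and power = 0.8
rnapower(depth=20, cv=0.4, effect=2, alpha=0.05, power=0.8)
@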
\subsection{Batch effects}

The reality of executing designed experiments may mean that there are
known but unavoidable factors that confound the analysis, but that are
not of fundamental biological interest. Perhaps samples are being
processed by different groups, or processing is spread over several
months to accommodate personnel or sequencer availability. It is
essential to avoid confounding such factors with biologically relevant
parts of the experiment. Such batch effects are pervasive in
high-throughput analysis of diverse data types \cite{leek:2010aa};
addressing batch effects helps to reduce dependence, stabilize error
rate estimates, and improve reproducibility.

Having acknowledged a potentially confounding factor, what is to be
done? A first reaction might be randomization -- arrange for samples to
be processed in a random order, for instance, rather than by treatment
group -- but a better strategy is usually to include a blocking factor,
e.g., processed by lab `A' versus lab `B', and to ensure that
treatments are represented by replicates in each blocking factor. The
down-stream analysis can then use replication to statistically
accommodate such effects.

An alternative to explicitly modeling batch effects is to identify
`surrogate variables'. Surrogate variables are covariates constructed
directly from the data, and can be used in subsequent analysis to
adjust for unknown, un-modeled, or latent sources of noise
\cite{leek:storey:2007, leek:storey:2008, leek:2011aa}. The
\Biocpkg{sva} package implements surrogate variable analysis, and can
be used with RNA-seq and many other high dimensional data types.
\Biocpkg{sva} estimates surrogate variables for inclusion in subsequent
analysis, or removes known batch effects using ComBat
\cite{johnson:2007aa}. An interesting approach to addressing batch
effects in studies where new samples are accumulated incrementally
(e.g., patient assays from physician offices) is to create a `frozen'
correction on a training data set, and perform per-sample correction on
new samples as they become available. This is similar to the `frozen'
RMA approach to normalization developed by McCall
et~al.~\cite{mccall2010}, and is implemented by the \Rfunction{fsva}
function in the \Biocpkg{sva} package.
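In the blocking strategy above, the batch enters the analysis as a term
in the design matrix. The following base \R{} sketch (the sample layout
is hypothetical) shows a design in which each of two labs processes
replicates of both treatments, so that treatment effects can be
estimated free of the lab effect.

<<batch-design, eval=FALSE>>=
## A sketch only; the sample layout is hypothetical
batch <- factor(c("labA", "labA", "labB", "labB",
                  "labA", "labA", "labB", "labB"))
treatment <- factor(rep(c("control", "treated"), 4))
## design matrix with batch as a blocking factor
design <- model.matrix(~ batch + treatment)
design
@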
\subsection{Summarizing}

The summary process tallies the number of reads aligning in each region
(e.g., gene) of interest. The simplest method is to count reads
overlapping each region, dividing by the length of the region of
interest to accommodate differences in gene length. This is the `RPKM'
(reads per kilobase per million reads) of Mortazavi
et~al.~\cite{mortazavi2008mapping}. One problem with this approach is
that reads are not sampled uniformly across genes
(Figure~\ref{fig:HansenEtAl2010}; \cite{hansen2010biases}), so gene
length (the `PK' part of RPKM) is not a good proxy for expression
level.

\begin{figure}
  \centering
  \includegraphics[width=.5\textwidth,height=!]{figures/HansenEtAl2010}
  \caption{Nucleotide frequency versus position relative to start of
    alignment, various experiments and protocols;
    see~\cite{hansen2010biases}.}
  \label{fig:HansenEtAl2010}
\end{figure}

More fundamentally, each read represents an observation, and
contributes to the certainty with which a gene is measured as
`expressed'. A summary measure like RPKM fails to incorporate
uncertainty -- a particular value of RPKM might result from alignment
of one or 100 reads. This contrasts with a simple count of the number
of reads in the region of interest. Furthermore, count data have known
statistical properties that can be exploited in down-stream statistical
analysis. Thus the result of summarization most useful for assessing
differential expression is read count.

How to count? For instance, should a read that partly overlaps a 5'
UTR or an intron be included in a tally? What about reads that overlap
multiple genes? This is a non-trivial question because alignment is
only approximate (reflecting sequencing and other biases) and because
sample preparation protocols and organism biology (e.g., whether the
UTR or fully mature mRNA is sequenced) may dictate particular counting
strategies; more elaborate counting strategies might be entertained for
paired end reads. Anders enumerates some counting
strategies\footnote{\url{http://www-huber.embl.de/users/anders/HTSeq/doc/count.html}};
these are implemented in his \software{HTSeq} python scripts, in
\Rfunction{summarizeOverlaps} in the \Biocpkg{GenomicRanges} package,
or in functions in the (linux-only) \Biocpkg{Rsubread} and
\Biocpkg{gmapR} packages.
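A count matrix might be produced with \Rfunction{summarizeOverlaps}
roughly as follows. This sketch assumes gene models as in the earlier
\Rfunction{exonsBy} example; the BAM file names are hypothetical, and
in current releases \Rfunction{summarizeOverlaps} is in the
\Biocpkg{GenomicAlignments} package.

<<count-reads, eval=FALSE>>=
## A sketch only; BAM file names are hypothetical
library(GenomicAlignments)
library(Rsamtools)
bams <- BamFileList(c("ctl1.bam", "ctl2.bam", "trt1.bam", "trt2.bam"))
## 'genes': a GRangesList of exons grouped by gene, as from exonsBy()
se <- summarizeOverlaps(genes, bams, mode="Union", ignore.strand=TRUE)
counts <- assay(se)  # genes x samples matrix of read counts
@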
\subsection{Normalization}

Normalization arises from the need to correct for purely technical
differences between samples. The most common symptom of the need for
normalization is differences in the total number of aligned reads. The
`M' part of the RPKM measure mentioned in the context of summarization
is one way of normalizing for total count. This normalization is not
appropriate, because the distribution of aligned reads across genes
within a sample is not uniform -- some regions receive many more
alignments than do others -- and this distribution may differ between
samples.

The overall strategy with normalization is to choose an appropriate
baseline, and express sample counts relative to that baseline. There
are several approaches to choice of appropriate baseline. One might
choose total count for normalization, but this is a poor choice when
one or a few regions of interest are very well represented -- we are
normalizing to the well-represented genes rather than to sequencing
depth in each sample. Other straight-forward approaches include use of
house-keeping genes, or the expression level from a particular quantile
of the distribution of gene expression values of each
sample~\cite{bullard2010evaluation}. One might attempt a robust
estimate of sample abundance that is less sensitive to extreme
outliers, e.g., the trimmed geometric mean of
counts~\cite{anders2010differential}. Another approach is
TMM~\cite{pmid19910308}, which measures the trimmed mean of M and A
values (M values are the log fold change in the number of reads
aligning to a region of interest measured relative to an average or
arbitrary sample, A is the average count of a gene; the trimmed mean
discards regions of interest that have extreme M or A values and
calculates the mean M value of the remainder); the inverse of this mean
is used to weight samples. More data-driven approaches exploiting
gene-specific properties include conditional quantile normalization
(implemented in the \Biocpkg{cqn} package;~\cite{cqn2012}).

Another approach to normalization, increasingly popular as experiment
size and data consistency increase, is to perform a data transformation
and apply normalization methods developed for analysis of microarrays.
Examples of this approach include
\Rcode{varianceStabilizingTransformation} from the \Biocpkg{DESeq2}
package, and \Rfunction{voom} from the \Biocpkg{limma} package; see the
corresponding help pages of these functions for details.

\subsection{Error model}

A negative binomial error model is often appropriate for `smaller'
experiments. These models combine Poisson (`shot' noise, i.e.,
within-sample technical and sampling variation in read counts) with
variation between biological samples. The
\Biocpkg{edgeR}~\cite{McCarthy2012differential} and
\Biocpkg{DESeq}~\cite{anders2010differential} (now \Biocpkg{DESeq2})
packages implement these models. Negative binomial error models involve
estimation of dispersion parameters, which are estimated poorly in
small samples. \Biocpkg{edgeR} and \Biocpkg{DESeq2} adopt different
data-driven approaches to arrive at more robust dispersion estimates,
relying on different strategies to moderate per-gene estimates with
more robust local estimates derived from genes with similar expression
values. Other approaches are possible;
\Biocpkg{DSS}~\cite{DSS2012} estimates are based on $\gamma$-Poisson or
$\beta$-Binomial distributions. As the number of replicates becomes
large, the importance of explicitly modeling biological sampling
variance decreases. This encourages use of the Poisson-Tweedie family
of distributions to model count data~\cite{tweeDEseq}.
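Pulling these pieces together, a differential expression analysis with
\Biocpkg{edgeR} might look roughly like the following sketch. It
assumes the hypothetical count matrix and sample layout from the
earlier examples; the filtering step anticipates the discussion of
multiple comparison below.

<<edger-sketch, eval=FALSE>>=
## A sketch only; 'counts' and the group labels are hypothetical
library(edgeR)
group <- factor(c("control", "control", "treated", "treated"))
y <- DGEList(counts=counts, group=group)
## 'K over A' filter: at least 1 count-per-million in 2 or more samples
keep <- rowSums(cpm(y) > 1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)            # TMM normalization
design <- model.matrix(~ group)
y <- estimateDisp(y, design)       # moderated dispersion estimates
fit <- glmFit(y, design)           # negative binomial GLM
topTags(glmLRT(fit))               # top differentially expressed genes
@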
\subsection{Multiple comparison}

\begin{enumerate}
\item Increase statistical power and reduce the false discovery rate by
  filtering regions of interest prior to analysis.
\item Motivation: (a) just because genes are assayed does not mean,
  \emph{a priori}, that they represent something requiring a
  statistical test; (b) some observations, e.g., zero counts across all
  samples, cannot possibly be statistically significant, independent of
  the hypothesis under investigation.
\item Approach: a detection or `$K$ over $A$'-style filter requires a
  minimum of $A$ (normalized) read counts in at least $K$ samples, with
  $A$ usually measured as counts per million; guidelines for the choice
  of values are a little \emph{ad hoc}; see, e.g., the \Biocpkg{edgeR}
  user manual. A variance filter, e.g., the IQR (inter-quartile range),
  provides a robust estimate of variability, and can be used to rank
  and discard the least-varying regions.
\end{enumerate}

\section{Selected \Bioconductor{} software for RNA-seq analysis}

\Bioconductor{} packages play a role in several stages of an RNA-seq
analysis (Table~\ref{tab:rnaseq-pkgs}; a more comprehensive list is
under the
\href{http://bioconductor.org/packages/2.10/BiocViews.html#___RNAseq}{RNAseq}
and
\href{http://bioconductor.org/packages/2.10/BiocViews.html#___HighThroughputSequencing}{HighThroughputSequencing}
BiocViews terms). The \Biocpkg{GenomicRanges} infrastructure can be
effectively employed to quantify known exon or transcript abundances.
Quantified abundances are in essence a matrix of counts, with rows
representing features and columns samples. The
\Biocpkg{edgeR}~\cite{pmid19910308} and
\Biocpkg{DESeq2}~\cite{anders2010differential} packages facilitate
analysis of this data in the context of designed experiments, and are
appropriate when the questions of interest involve between-sample
comparisons of relative abundance. The \Biocpkg{DEXSeq} package extends
the approach in \Biocpkg{edgeR} and \Biocpkg{DESeq2} to ask about
within-gene, between-group differences in exon use, i.e., for a given
gene, do groups differ in their exon use?

\begin{table}
  \centering
  \caption{Selected \Bioconductor{} packages for RNA-seq analysis.}
  \label{tab:rnaseq-pkgs}
  \begin{tabular}{lp{.7\textwidth}}
    Package & Description\\
    \hline
    \Biocpkg{EDASeq} & Exploratory analysis and QA; also
    \Biocpkg{qrqc}, \Biocpkg{ShortRead}, \Biocpkg{DESeq2}.\\
    \Biocpkg{edgeR}, \Biocpkg{DESeq2} & Generalized linear models using
    negative binomial error.\\
    \Biocpkg{BitSeq} & Bayesian inference of individual transcript
    abundances followed by differential expression.\\
    \Biocpkg{DEXSeq} & Exon-level differential representation.\\
    \Biocpkg{DSS}, \Biocpkg{vsn}, \Biocpkg{cqn} & RNA-seq normalization
    methodologies. Also \Rfunction{voom} in \Biocpkg{limma}.\\
    \Biocpkg{goseq} & Gene set enrichment tailored to RNAseq count
    data; also \Biocpkg{limma}'s \Rfunction{roast} or
    \Rfunction{camera} after transformation with \Rfunction{voom}.\\
    \Biocpkg{QuasR} & Work flow.\\
    \Biocpkg{Rsubread} & Alignment (Linux only); also \Biocpkg{gmapR};
    \Biocpkg{Biostrings} \Rfunction{matchPDict} for special-purpose
    alignments.\\
    \Biocpkg{cummeRbund} & Exploration and analysis of Cufflinks
    results.\\
    \hline
  \end{tabular}
\end{table}

\section{DESeq2 Work Flow Exercises}

For this chapter, follow in-course instructions to work through the
parathyroid \Biocpkg{DESeq2} analysis.

\bibliography{RNASeq}

\end{document}