% \VignetteEngine{knitr::knitr}
% \VignetteIndexEntry{06. RNASeq Analysis}

\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\title{RNASeq Analysis}
\author{Martin T.\ Morgan\footnote{\url{mtmorgan@fhcrc.org}}}
\date{27-28 February 2014}

\newcommand{\Hsap}{\emph{H.~sapiens}}
\newcommand{\Dmel}{\emph{D.~melanogaster}}

\begin{document}

\maketitle
\tableofcontents

\section{General work flows}

\paragraph{A running example: the \emph{pasilla} data set}
As a running example, we use the \emph{pasilla} data set, derived from
\cite{pasilla}. The authors investigate conservation of RNA regulation
between \Dmel{} and mammals. Part of their study used RNAi and RNA-seq
to identify exons regulated by Pasilla (\emph{ps}), the \Dmel{} ortholog
of mammalian NOVA1 and NOVA2. Briefly, their experiment compared gene
expression as measured by RNAseq in S2-DRSC cells cultured with, or
without, a 444~bp dsRNA fragment corresponding to the \emph{ps} mRNA
sequence. Their assessment investigated differential exon use, but our
worked example will focus on gene-level differences.

For several examples we look at a subset of the \emph{ps} data,
corresponding to reads obtained from lanes of their RNA-seq experiment,
and to the same reads aligned to a \Dmel{} reference genome. Reads were
obtained from GEO and the Sequence Read Archive (SRA), and were aligned
to the \Dmel{} reference genome \emph{dm3} as described in the
\Rpackage{pasilla} experiment data package.

\paragraph{Work flow}
At a very high level, one can envision a work flow that starts with a
challenging biological question (how does \emph{ps} influence gene and
transcript regulation?). The biological question is framed in terms of
wet-lab protocols coupled with an appropriate and all-important
experimental design. There are several well-known statistical
challenges, common to any experimental data: What treatments are going
to be applied? How many replicates will there be of each? Is there
likely to be sufficient power to answer the biologically relevant
question? Reality is also important at this stage, as evidenced in the
\emph{pasilla} data where, as we will see, samples were collected using
different methods (single versus paired end reads) over a time when
there were rapid technological changes. Such reality often introduces
confounding factors that require appropriate statistical treatment in
subsequent analysis.

The work flow proceeds with data generation, involving both a wet-lab
(sample preparation) component and actual sequencing. It is essential
to acknowledge the biases and artifacts that are introduced at each of
these stages. Sample preparation involves non-trivial amounts of time
and effort. Large studies are likely to have batch effects (e.g.,
because work was done by different lab members, or with different
batches of reagent). Samples might have been prepared in ways that are
likely to influence down-stream analysis, e.g., using a protocol
involving PCR and hence introducing opportunities for sample-specific
bias. DNA isolation protocols may introduce many artifacts, e.g.,
non-uniform representation of reads across the length of expressed
genes in RNA-seq. The sequencing reaction itself is far from bias-free,
with known artifacts of called base frequency, cycle-dependent accuracy
and bias, non-uniform coverage, etc. At a minimum, the researcher needs
to be aware of the opportunities for bias that can be introduced during
sample preparation and sequencing.

The informatics component of work flows becomes increasingly important
during and after sequence generation. The sequencer is often treated as
a `black box', producing short reads consisting of 10's to 100's of
nucleotides and with associated quality scores. Usually, the chemistry
and informatics processing pipeline are sufficiently well documented
that one can arrive at an understanding of biases and quality issues
that might be involved; such an understanding is likely to be
particularly important when embarking on questions or using protocols
that are at the fringe of standard practice (where, after all, the
excitement is).

The first real data seen by users are \emph{fastq} files. These files
are often simple text files consisting of many millions of records, and
are described in greater detail earlier in the course. The center
performing the sequencing typically vets results for quality, but these
quality measures are really about the performance of their machines. It
is very important to assess quality with respect to the experiment
being undertaken: Are the numbers of reads consistent across samples?
Are the GC content and other observable aspects of the reads consistent
with expectation? Are there anomalies in the sequence results that
reflect primers or other reagents used during sample preparation? Are
well-known artifacts of the protocol used evident in the reads in hand?
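The \Biocpkg{ShortRead} package can produce a quality assessment report
addressing several of these questions. The following is a minimal
sketch, not part of the original work flow; the fastq directory and
file pattern are hypothetical.

<<fastq-qa, eval=FALSE>>=
## A sketch only; "/path/to/fastq" and the pattern are hypothetical
library(ShortRead)
qaSummary <- qa("/path/to/fastq", pattern="*.fastq", type="fastq")
## 'report' writes an HTML summary (read counts, base calls,
## per-cycle quality, ...) and returns its location
browseURL(report(qaSummary))
@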
The next step in many work flows involves alignment of reads to a
reference genome. There are many aligners available, including
\href{http://bio-bwa.sourceforge.net/}{BWA} \cite{pmid20080505},
\href{http://bowtie-bio.sourceforge.net/}{Bowtie} /
\href{http://bowtie-bio.sourceforge.net/bowtie2/}{Bowtie2}
\cite{pmid19261174}, and
\href{http://research-pub.gene.com/gmap/}{GSNAP}; merits of these are
discussed in the literature. \Bioconductor{} packages `wrapping' these
tools are increasingly common (e.g., \Biocpkg{Rbowtie},
\Biocpkg{gmapR}; \Biocpkg{cummeRbund} for parsing output of the
\software{cufflinks} transcript discovery pathway). There are also
alignment algorithms implemented in \Bioconductor{} (e.g.,
\Rfunction{matchPDict} in the \Biocpkg{Biostrings} package, and the
\Biocpkg{Rsubread} package); \Rfunction{matchPDict} is particularly
useful for flexible alignment of moderately sized subsets of data.
Most mainstream aligners produce output in `SAM' or `BAM' (binary
alignment) format. BAM files are the primary starting point for many
analyses, and their manipulation and use in \Bioconductor{} was
introduced earlier in the course.
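As a small illustration of working with BAM files, aligned reads can be
imported into \R{} as follows. This is a sketch under the assumption
that the \Biocpkg{GenomicAlignments} package (where this functionality
lives in current releases) is available; the file name is hypothetical.

<<read-bam, eval=FALSE>>=
## A sketch only; "sample.bam" is a hypothetical file
library(GenomicAlignments)
aln <- readGAlignments("sample.bam")
length(aln)         # number of aligned reads
table(strand(aln))  # reads per strand
@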
\section{RNA-seq: case study}

\subsection{Varieties of RNA-seq}

RNA-seq experiments typically ask about differences in transcription of
genes or other features across experimental groups. The analysis of
designed experiments is statistical, and hence an ideal task for \R.
The overall structure of the analysis, with tens of thousands of
features and tens of samples, is reminiscent of microarray analysis;
some insights from the microarray domain will apply, at least
conceptually, to the analysis of RNA-seq experiments.

The most straight-forward RNA-seq experiments quantify abundance for
known gene models. The known models are derived from reference
databases, reflecting the accumulated knowledge of the community
responsible for the data. The `knownGenes' track of the UCSC genome
browser represents one source of such data. A track like this
describes, for each gene, the transcripts and exons that are expected
based on current data. The \Biocpkg{GenomicFeatures} package allows
ready access to this information by creating a local database out of
the track information. This database of known genes is coupled with
high throughput sequence data by counting reads overlapping known genes
and modeling the relationship between treatment groups and counts.
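For example, \Dmel{} gene models can be loaded and organized as exons
grouped by gene. This sketch assumes the annotation package
\Biocannopkg{TxDb.Dmelanogaster.UCSC.dm3.ensGene} (one of several
possible sources of \Dmel{} gene models) is installed.

<<txdb-genes, eval=FALSE>>=
## A sketch only; assumes this dm3 annotation package is installed
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
## exons grouped by gene: a GRangesList, one element per gene
genes <- exonsBy(txdb, by="gene")
length(genes)  # number of gene models
@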
A more ambitious approach to RNA-seq attempts to identify novel
transcripts. This requires that sequenced reads be assembled into
contigs that, presumably, correspond to expressed transcripts that are
then located in the genome. Regions identified in this way may
correspond to known transcripts, to novel arrangements of known exons
(e.g., through alternative splicing), or to completely novel
constructs. We will not address the identification of completely novel
transcripts here, but will instead focus on the analysis of designed
experiments: do the transcript abundances, novel or otherwise, differ
between experimental groups?

\subsection{RNA-seq work flows}

RNA-seq work flows aim at measuring gene expression through assessment
of mRNA abundance. Work flows involve:
\begin{enumerate}
\item Experimental design.
\item Wet-lab protocols for mRNA extraction and reverse transcription
  to cDNA.
\item Sequencing; QA.
\item Alignment of sequenced reads to a reference genome; QA.
\item Summarizing the number of reads aligning to each region; QA.
\item Normalization of samples to accommodate purely technical
  differences in preparation.
\item Statistical assessment of differential representation, including
  specification of an appropriate error model.
\item Interpretation of results in the context of original biological
  questions; QA.
\end{enumerate}
The inference is that higher levels of gene expression translate to
more abundant cDNA, and greater numbers of reads aligned to the
reference genome. The enumeration above may seem simplistic, but each
step raises its own concerns and commentary.

\subsection{Wet-lab protocols, sequencing, and alignment}

The important point here is that wet-lab protocols, sequencing
reactions, and alignment introduce artifacts that need to be
acknowledged and, if possible, accommodated in down-stream analysis.
These artifacts and approaches to their remediation are discussed in
the following sections.

\section{Statistical issues}

Important statistical issues are summarized in
Table~\ref{tab:rnaseqissues}.

\begin{table}
  \centering
  \caption{Statistical issues in RNA-seq differential expression.}
  \label{tab:rnaseqissues}
  \begin{tabular}{lp{.7\textwidth}}
    Analysis stage & Issues \\
    \hline
    Experimental design & Replication, complexity, feasibility.\\
    Batch effects & Known and unknown factors. \\
    Summarize & Counts versus RPKM and other summaries. \\
    Normalize & Robust estimates of library size. \\
    Differential expression & Appropriate error model (negative
    binomial, Poisson, \ldots); dispersion (under the negative
    binomial) as a parameter requiring estimation; `shrinkage' to
    balance accuracy of per-gene estimates with precision of
    experiment-wide estimates.\\
    Testing & Filtering to reduce multiple comparisons \& false
    discovery rate.\\
    \hline
  \end{tabular}
\end{table}

\subsection{Experimental design}

\paragraph{Technical versus biological replication}
Obviously one should follow best practices for designing experiments
appropriate for the data under analysis. A typical experiment will have
one or several groups. Because there is uncertainty in each
measurement, we require replication. Previous work shows that technical
replication (repeating identical wet-lab and sequencing protocols on a
single biological sample) introduces variation that is
small~\cite{marioni2008rna} compared to biological replication (using
different samples). Most RNA-seq experiments require biological
replication, and seldom include technical replicates.

\paragraph{Sample size}
How many biological replicates? It is helpful to think in terms of
orders of magnitude: biological treatments with strong and consistent
consequences for gene expression will be detected with a handful -- 2
or 3 -- of replicates per treatment. Conversely, statistically subtle
effects will not be reliably revealed by, say, 5 or 8 samples, but will
instead require 10's or 100's of samples. The \Biocpkg{RNASeqPower}
package provides data-driven guidance on power calculations in RNA-seq
experiments; \Biocpkg{CSSP} provides ChIP-seq power calculations based
on Bayesian estimation for local counting processes.

\paragraph{Complexity}
How complicated an experimental design? The advice must be to `keep it
simple'. There are many interesting biological questions that one could
ask, but experimental designs with more than one or at most two
factors, or with many levels per factor, will undermine statistical
power and complicate analysis. There are exceptions of course, for
instance a time course design or an experiment with two or more
factors, but these require strong \emph{a priori} motivation and
confidence that the design is amenable to analysis even in the face of
wet-lab or sequencing catastrophe.

\paragraph{Feasibility of intended statistical analysis}
What kind of treatment? Two `lessons learned' from microarray analysis
and applicable to RNA-seq inform this question. (a) It is necessary to
normalize observations between samples to accommodate purely technical
variation in overall patterns of expression. For example, samples
provided to the sequencer have different amounts of DNA, resulting in
variation in total numbers of sequenced and aligned reads independent
of any difference in gene-level differential representation. This
implies that the treatment should \emph{affect only a fraction of the
  genes assayed}, otherwise treatment effects and protocol artifacts
are confounded. (b) Between-gene measures of expression differ for
reasons unrelated to levels of expression. For instance, standard
protocols mean that a long gene is sequenced more often than a short
gene, even when the numbers of mRNA molecules of the two genes are
identical. This means that the most productive approach to differential
representation will \emph{compare genes across samples}, rather than
compare levels of representation of different genes (gene set
enrichment analysis and other approaches to between-gene comparison are
statistically interesting in part because of the need to overcome
between-gene differences arising for purely technical reasons). The
combination of lessons (a) and (b) dictates that the treatment should
affect only a subset of the genes under study, and that `interesting'
results correspond to treatment groups with differences at the gene
level. \emph{A priori} motivation, e.g., about well-defined pathways as
targets of differential representation, may trump part (b) of this
guideline.
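Returning to the sample size question, the \Biocpkg{RNASeqPower}
package can translate assumptions about sequencing depth, biological
variability, and effect size into an approximate per-group sample size.
The following is a sketch only; the parameter values are illustrative,
not recommendations.

<<rnaseq-power, eval=FALSE>>=
## A sketch only; depth, cv, and effect values are illustrative
library(RNASeqPower)
## per-group n for 20x coverage, biological CV 0.4, a 2-fold change,
## alpha = 0.05, and power = 0.8
rnapower(depth=20, cv=0.4, effect=2, alpha=0.05, power=0.8)
@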
\subsection{Batch effects}

The reality of executing designed experiments may mean that there are
known but unavoidable factors that confound the analysis, but that are
not of fundamental biological interest. Perhaps samples are being
processed by different groups, or processing is spread over several
months to accommodate personnel or sequencer availability. It is
essential to avoid confounding such factors with biologically relevant
parts of the experiment. Such batch effects are pervasive in
high-throughput analysis of diverse data types \cite{leek:2010aa};
addressing batch effects helps to reduce dependence, stabilize error
rate estimates, and improve reproducibility.

Having acknowledged a potentially confounding factor, what is to be
done? A first reaction might be randomization -- arrange for samples to
be processed in a random order, for instance, rather than by treatment
group -- but a better strategy is usually to include a blocking factor,
e.g., processed by lab `A' versus lab `B', and to ensure that
treatments are represented by replicates in each blocking factor. The
down-stream analysis can then use replication to statistically
accommodate such effects.

An alternative to explicitly modeling batch effects is to identify
`surrogate variables'. Surrogate variables are covariates constructed
directly from the data, and can be used in subsequent analysis to
adjust for unknown, un-modeled, or latent sources of noise
\cite{leek:storey:2007, leek:storey:2008, leek:2011aa}. The
\Biocpkg{sva} package implements surrogate variable analysis, and can
be used with RNA-seq and many other high dimensional data types.
\Biocpkg{sva} estimates surrogate variables for inclusion in subsequent
analysis, or removes known batch effects using ComBat
\cite{johnson:2007aa}. An interesting approach to addressing batch
effects in studies where new samples are accumulated incrementally
(e.g., patient assays from physician offices) is to create a `frozen'
correction on a training data set, and perform per-sample correction on
new samples as they become available. This is similar to the `frozen'
RMA approach to normalization developed by McCall
et~al.~\cite{mccall2010}, and is implemented by the \Rfunction{fsva}
function in the \Biocpkg{sva} package.
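In the blocking strategy above, the batch enters the analysis as a term
in the design matrix. The following base \R{} sketch (the sample layout
is hypothetical) shows a design in which each of two labs processes
replicates of both treatments, so that treatment effects can be
estimated free of the lab effect.

<<batch-design, eval=FALSE>>=
## A sketch only; the sample layout is hypothetical
batch <- factor(c("labA", "labA", "labB", "labB",
                  "labA", "labA", "labB", "labB"))
treatment <- factor(rep(c("control", "treated"), 4))
## design matrix with batch as a blocking factor
design <- model.matrix(~ batch + treatment)
design
@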
\subsection{Summarizing}

The summary process tallies the number of reads aligning in each region
(e.g., gene) of interest. The simplest method is to count reads
overlapping each region, dividing by the length of the region of
interest to accommodate differences in gene length. This is the `RPKM'
(reads per kilobase per million reads) of Mortazavi
et~al.~\cite{mortazavi2008mapping}. One problem with this approach is
that reads are not sampled uniformly across genes
(Figure~\ref{fig:HansenEtAl2010}; \cite{hansen2010biases}), so gene
length (the `PK' part of RPKM) is not a good proxy for expression
level.

\begin{figure}
  \centering
  \includegraphics[width=.5\textwidth,height=!]{figures/HansenEtAl2010}
  \caption{Nucleotide frequency versus position relative to start of
    alignment, various experiments and protocols;
    see~\cite{hansen2010biases}.}
  \label{fig:HansenEtAl2010}
\end{figure}

More fundamentally, each read represents an observation, and
contributes to the certainty with which a gene is measured as
`expressed'. A summary measure like RPKM fails to incorporate
uncertainty -- a particular value of RPKM might result from alignment
of one or 100 reads. This contrasts with a simple count of the number
of reads in the region of interest. Furthermore, count data have known
statistical properties that can be exploited in down-stream statistical
analysis. Thus the result of summarization most useful for assessing
differential expression is read count.

How to count? For instance, should a read that partly overlaps a 5'
UTR or an intron be included in a tally? What about reads that overlap
multiple genes? This is a non-trivial question because alignment is
only approximate (reflecting sequencing and other biases) and because
sample preparation protocols and organism biology (e.g., whether the
UTR or fully mature mRNA is sequenced) may dictate particular counting
strategies; more elaborate counting strategies might be entertained for
paired end reads. Anders enumerates some counting
strategies\footnote{\url{http://www-huber.embl.de/users/anders/HTSeq/doc/count.html}};
these are implemented in his \software{HTSeq} python scripts, in
\Rfunction{summarizeOverlaps} in the \Biocpkg{GenomicRanges} package,
or in functions in the (linux-only) \Biocpkg{Rsubread} and
\Biocpkg{gmapR} packages.
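A count matrix might be produced with \Rfunction{summarizeOverlaps}
roughly as follows. This sketch assumes gene models as in the earlier
\Rfunction{exonsBy} example; the BAM file names are hypothetical, and
in current releases \Rfunction{summarizeOverlaps} is in the
\Biocpkg{GenomicAlignments} package.

<<count-reads, eval=FALSE>>=
## A sketch only; BAM file names are hypothetical
library(GenomicAlignments)
library(Rsamtools)
bams <- BamFileList(c("ctl1.bam", "ctl2.bam", "trt1.bam", "trt2.bam"))
## 'genes': a GRangesList of exons grouped by gene, as from exonsBy()
se <- summarizeOverlaps(genes, bams, mode="Union", ignore.strand=TRUE)
counts <- assay(se)  # genes x samples matrix of read counts
@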
\subsection{Normalization}

Normalization arises from the need to correct for purely technical
differences between samples. The most common symptom of the need for
normalization is differences in the total number of aligned reads. The
`M' part of the RPKM measure mentioned in the context of summarization
is one way of normalizing for total count. This normalization is not
appropriate, because the distribution of aligned reads across genes
within a sample is not uniform -- some regions receive many more
alignments than do others -- and this distribution may differ between
samples.

The overall strategy with normalization is to choose an appropriate
baseline, and express sample counts relative to that baseline. There
are several approaches to choice of appropriate baseline. One might
choose total count for normalization, but this is a poor choice when
one or a few regions of interest are very well represented -- we are
normalizing to the well-represented genes rather than to sequencing
depth in each sample. Other straight-forward approaches include use of
house-keeping genes, or the expression level from a particular quantile
of the distribution of gene expression values of each
sample~\cite{bullard2010evaluation}. One might attempt a robust
estimate of sample abundance that is less sensitive to extreme
outliers, e.g., the trimmed geometric mean of
counts~\cite{anders2010differential}. Another approach is
TMM~\cite{pmid19910308}, which measures the trimmed mean of M and A
values (M values are the log fold change in the number of reads
aligning to a region of interest measured relative to an average or
arbitrary sample, A is the average count of a gene; the trimmed mean
discards regions of interest that have extreme M or A values and
calculates the mean M value of the remainder); the inverse of this mean
is used to weight samples. More data-driven approaches exploiting
gene-specific properties include conditional quantile normalization
(implemented in the \Biocpkg{cqn} package;~\cite{cqn2012}).

Another approach to normalization, increasingly popular as experiment
size and data consistency increase, is to perform a data transformation
and apply normalization methods developed for analysis of microarrays.
Examples of this approach include
\Rcode{varianceStabilizingTransformation} from the \Biocpkg{DESeq2}
package, and \Rfunction{voom} from the \Biocpkg{limma} package; see the
corresponding help pages of these functions for details.

\subsection{Error model}

A negative binomial error model is often appropriate for `smaller'
experiments. These models combine Poisson (`shot' noise, i.e.,
within-sample technical and sampling variation in read counts) with
variation between biological samples. The
\Biocpkg{edgeR}~\cite{McCarthy2012differential} and
\Biocpkg{DESeq}~\cite{anders2010differential} (now \Biocpkg{DESeq2})
packages implement these models. Negative binomial error models involve
estimation of dispersion parameters, which are estimated poorly in
small samples. \Biocpkg{edgeR} and \Biocpkg{DESeq2} adopt different
data-driven approaches to arrive at more robust dispersion estimates,
relying on different strategies to moderate per-gene estimates with
more robust local estimates derived from genes with similar expression
values. Other approaches are possible;
\Biocpkg{DSS}~\cite{DSS2012} estimates are based on $\gamma$-Poisson or
$\beta$-Binomial distributions. As the number of replicates becomes
large, the importance of explicitly modeling biological sampling
variance decreases. This encourages use of the Poisson-Tweedie family
of distributions to model count data~\cite{tweeDEseq}.
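Pulling these pieces together, a differential expression analysis with
\Biocpkg{edgeR} might look roughly like the following sketch. It
assumes the hypothetical count matrix and sample layout from the
earlier examples; the filtering step anticipates the discussion of
multiple comparison below.

<<edger-sketch, eval=FALSE>>=
## A sketch only; 'counts' and the group labels are hypothetical
library(edgeR)
group <- factor(c("control", "control", "treated", "treated"))
y <- DGEList(counts=counts, group=group)
## 'K over A' filter: at least 1 count-per-million in 2 or more samples
keep <- rowSums(cpm(y) > 1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)            # TMM normalization
design <- model.matrix(~ group)
y <- estimateDisp(y, design)       # moderated dispersion estimates
fit <- glmFit(y, design)           # negative binomial GLM
topTags(glmLRT(fit))               # top differentially expressed genes
@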
\subsection{Multiple comparison}

\begin{enumerate}
\item Increase statistical power and reduce the false discovery rate by
  filtering regions of interest prior to analysis.
\item Motivation: (a) just because genes are assayed does not mean,
  \emph{a priori}, that they represent something requiring a
  statistical test; (b) some observations, e.g., zero counts across all
  samples, cannot possibly be statistically significant, independent of
  the hypothesis under investigation.
\item Approach: a detection or `$K$ over $A$'-style filter requires a
  minimum of $A$ (normalized) read counts in at least $K$ samples, with
  $A$ usually measured as counts per million; guidelines for the choice
  of values are a little \emph{ad hoc}; see, e.g., the \Biocpkg{edgeR}
  user manual. A variance filter, e.g., the IQR (inter-quartile range),
  provides a robust estimate of variability, and can be used to rank
  and discard the least-varying regions.
\end{enumerate}

\section{Selected \Bioconductor{} software for RNA-seq analysis}

\Bioconductor{} packages play a role in several stages of an RNA-seq
analysis (Table~\ref{tab:rnaseq-pkgs}; a more comprehensive list is
under the
\href{http://bioconductor.org/packages/2.10/BiocViews.html#___RNAseq}{RNAseq}
and
\href{http://bioconductor.org/packages/2.10/BiocViews.html#___HighThroughputSequencing}{HighThroughputSequencing}
BiocViews terms). The \Biocpkg{GenomicRanges} infrastructure can be
effectively employed to quantify known exon or transcript abundances.
Quantified abundances are in essence a matrix of counts, with rows
representing features and columns samples. The
\Biocpkg{edgeR}~\cite{pmid19910308} and
\Biocpkg{DESeq2}~\cite{anders2010differential} packages facilitate
analysis of this data in the context of designed experiments, and are
appropriate when the questions of interest involve between-sample
comparisons of relative abundance. The \Biocpkg{DEXSeq} package extends
the approach in \Biocpkg{edgeR} and \Biocpkg{DESeq2} to ask about
within-gene, between-group differences in exon use, i.e., for a given
gene, do groups differ in their exon use?

\begin{table}
  \centering
  \caption{Selected \Bioconductor{} packages for RNA-seq analysis.}
  \label{tab:rnaseq-pkgs}
  \begin{tabular}{lp{.7\textwidth}}
    Package & Description\\
    \hline
    \Biocpkg{EDASeq} & Exploratory analysis and QA; also
    \Biocpkg{qrqc}, \Biocpkg{ShortRead}, \Biocpkg{DESeq2}.\\
    \Biocpkg{edgeR}, \Biocpkg{DESeq2} & Generalized linear models using
    negative binomial error.\\
    \Biocpkg{BitSeq} & Bayesian inference of individual transcript
    abundances followed by differential expression.\\
    \Biocpkg{DEXSeq} & Exon-level differential representation.\\
    \Biocpkg{DSS}, \Biocpkg{vsn}, \Biocpkg{cqn} & RNA-seq normalization
    methodologies. Also \Rfunction{voom} in \Biocpkg{limma}.\\
    \Biocpkg{goseq} & Gene set enrichment tailored to RNAseq count
    data; also \Biocpkg{limma}'s \Rfunction{roast} or
    \Rfunction{camera} after transformation with \Rfunction{voom}.\\
    \Biocpkg{QuasR} & Work flow.\\
    \Biocpkg{Rsubread} & Alignment (Linux only); also \Biocpkg{gmapR};
    \Biocpkg{Biostrings} \Rfunction{matchPDict} for special-purpose
    alignments.\\
    \Biocpkg{cummeRbund} & Exploration and analysis of Cufflinks
    results.\\
    \hline
  \end{tabular}
\end{table}

\section{DESeq2 Work Flow Exercises}

For this chapter, follow in-course instructions to work through the
parathyroid \Biocpkg{DESeq2} analysis.

\bibliography{RNASeq}

\end{document}