%\VignetteIndexEntry{Minfi Guide} %\VignetteDepends{minfi} %\VignetteDepends{minfiData} %\VignettePackage{minfi} \documentclass[12pt]{article} <>= options(width=70) @ \SweaveOpts{eps=FALSE,echo=TRUE} \usepackage{times} \usepackage{color,hyperref} \usepackage{fullpage} \usepackage[numbers]{natbib} \definecolor{darkblue}{rgb}{0.0,0.0,0.75} \hypersetup{colorlinks,breaklinks, linkcolor=darkblue,urlcolor=darkblue, anchorcolor=darkblue,citecolor=darkblue} \usepackage{parskip} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{\textsf{#1}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \title{The minfi User's Guide\\ Analyzing Illumina 450k Methylation Arrays} \author{Kasper D.\ Hansen \and Martin J. Aryee} \date{Modified: October 9, 2011. Compiled: \today} \begin{document} \maketitle \setcounter{secnumdepth}{1} \section{Introduction} The \Rpackage{minfi} package provides tools for analyzing Illumina's Methylation arrays, with a special focus on the new 450k array for humans. At the moment Illumina's 27k methylation arrays are not supported. The tasks addressed in this package include preprocessing, QC assessments, identification of interesting methylation loci and plotting functionality. Analyzing these types of arrays is ongoing research in ours and others groups. In general, the analysis of 450k data is not straightforward and we anticipate many advances in this area in the near future. The input data to this package are IDAT files, representing two different color channels prior to normalization. It is possible to use Genome Studio files together with the data structures contained in this package, but in general Genome Studio files are already normalized and we do not recommend this. \subsubsection{Chip design and terminology} The 450k array has a complicated design. What follows is a quick overview. Each sample is measured on a single array, in two different color channels (red and green). Each array measures roughly 450,000 CpG positions. Each CpG is associated with two measurements: a methylated measurement and an ``un''-methylated measurement. These two values can be measured in one of two ways: using a ``Type I'' design or a ``Type II design''. CpGs measured using a Type I design are measured using a single color, with two different probes in the same color channel providing the methylated and the unmethylated measurements. CpGs measured using a Type II design are measured using a single probe, and two different colors provide the methylated and the unmethylated measurements. Practically, this implies that on this array there is \emph{not} a one-to-one correspondence between probes and CpG positions. We have therefore tried to be precise about this and we refer to a ``methylation position'' (or ``CpG'') when we refer to a single-base genomic locus. The previous generation 27k methlation array uses only the Type I design. In this package we refer to differentially methylated positions (DMPs) by which we mean a single genomic position that has a different methylation level in two different groups of samples (or conditions). This is different from differentially methylated regions (DMRs) which imply more that more than one methylation positions are different between conditions. Physically, each sample is measured on a single ``array''. There are 12 arrays on a single physical ``slide'' (organized in a 6 by 2 grid). Slides are organized into ``plates'' containing at most 8 slides (96 arrays). \subsubsection{Workflow and R data classes} A set of 450k data files will initially be read into an \Rcode{RGChannelSet}, representing the raw intensities as two matrices: one being the green channel and one being the red channel. This is a class which is very similar to an \Rcode{ExpressionSet} or an \Rcode{NChannelSet}. The \Rcode{RGChannelSet} is, together with a \Rcode{IlluminaMethylationManifest} object, preprocessed into a \Rcode{MethylSet}. The \Rcode{IlluminaMethylationManifest} object contains the array design, and describes how probes and color channels are paired together to measure the methylation level at a specific CpG. The object also contains information about control probes (also known as QC probes). The \Rcode{MethylSet} contains normalized data and essentially consists of two matrices containing the methylated and the unmethylated evidence for each CpG. Only the \Rcode{RGChannelSet} contains information about the control probes. The process described in the previous paragraph is very similar to the paradigm for analyzing Affymetrix expression arrays using the \Rpackage{affy} package (an \Rcode{AffyBatch} is preprocessed into an \Rcode{ExpressionSet} using array design information stored in a CDF environment (package)). A \Rcode{MethylSet} is the starting point for any post-normalization analysis, such as searching for DMPs or DMRs. \subsubsection{Getting Started} <>= require(minfi) require(minfiData) @ \section{Reading Data} This package supports analysis of IDAT files, containing the summarized bead information. In our experience, most labs use a ``Sample Sheet'' CSV file to describe the layout of the experiment. This is based on a sample sheet file provided by Illumina. Our pipeline assumes the existence of such a file(s), but it is relatively easy to create such a file using for example Excel, if it is not available. We use an example dataset with 6 samples, spread across two slides. First we obtain the system path to the IDAT files; this requires a bit since the data comes from an installed package <>= baseDir <- system.file("extdata", package = "minfiData") list.files(baseDir) @ This shows the typical layout of 450k data: each ``slide'' (containing 12 arrays) is stored in a separate directory, with a numeric name. The top level directory contains the sample sheet file. Inside the slide directories we find the IDAT files (and possible a number of JPG images or other files): <>= list.files(file.path(baseDir, "5723646052")) @ The files for each array has another numeric number and consists of a Red and a Grn (Green) IDAT file. Note that for this example data, each slide contains only 3 arrays and not 12. This was done because of file size limitations and because we only need 6 arrays to illustrate the package's functionality. First we read the sample sheet. We provide a convenience function for reading in this file \Rcode{read.450k.sheet}. This function has a couple of attractive bells and whistles. Let us look at the output <>= targets <- read.450k.sheet(baseDir) targets @ First the output: this is just a \Rcode{data.frame}. It contains a column \Rcode{Basename} that describes the location of the IDAT file corresponding to the sample, as well as two columns \Rcode{Array} and \Rcode{Slide}. In the sample sheet provided by Illumina, these two columns are named \Rcode{Sentrix\_Position} and \Rcode{Sentrix\_ID}, but we rename them. We provide more detail on the use of this function below. The \Rcode{Basename} column tend to be too large for display, here it is simplified relative to \Rcode{baseDir}: <>= sub(baseDir, "", targets$Basename) @ (This is just for display purposes). With this \Rcode{data.frame}, it is easy to read in the data <>= RGset <- read.450k.exp(base = baseDir, targets = targets) @ Let us look at the associated pheno data, which is really just the information contained in the targets object above. <>= RGset pd <- pData(RGset) pd[,1:4] @ The \Rcode{read.450k.exp} also makes it possible to read in an entire directory or directory tree (with \Rcode{recursive} set to \Rcode{TRUE}) by using the function just with the argument \Rcode{base} and \Rcode{targets=NULL}, like <>= RGset2 = read.450k.exp(file.path(baseDir, "5723646052")) RGset3 = read.450k.exp(baseDir, recursive = TRUE) @ \subsubsection{Advanced notes on Reading Data} The only important column in sheet \Rcode{data.frame} used in the \Rcode{targets} argument for the \Rcode{read.450k.exp} function is a column names \Rcode{Basename}. Typically, such an object would also have columns named \Rcode{Array}, \Rcode{Slide}, and (optionally) \Rcode{Plate}. We used sheet data files build on top of the Sample Sheet data file provided by Illumina. This is a CSV file, with a header. In this case we assume that the phenotype data starts after a line beginning with \Rcode{[Data]} (or that there is no header present). It is also easy to read a sample sheet ``manually'', using the function \Rcode{read.csv}. Here, we know that we want to skip the first 7 lines of the file. <>= targets2 <- read.csv(file.path(baseDir, "SampleSheet.csv"), stringsAsFactors = FALSE, skip = 7) targets2 @ We now need to populate a \Rcode{Basename} column. On possible approach is the following <>= targets2$Basename <- file.path(baseDir, targets2$Sentrix_ID, paste0(targets2$Sentrix_ID, targets2$Sentrix_Position)) @ Finally, \Rpackage{minfi} contains a file-based parser: \Rcode{read.450k}. The return object represents the red and the green channel measurements of the samples. A useful function that we get from the package \Rpackage{Biobase} is \Rcode{combine} that combines (``adds'') two sets of samples. This allows the user to manually build up an \Rcode{RGChannelSet}. \section{Quality Control} \Rcode{minfi} provides several plots that can be useful for identifying samples with data quality problems. These functions can display summaries of signal from the array (e.g. density plots) as well as the values of several types of control probes included on the array. Our understanding of the expected sample behavior in the QC plots is still evolving and will improve as the number of available samples from the array increases. A good rule of thumb is to be wary of samples whose behavior deviates from that of others in the same or similar experiments. The wrapper function \Rcode{qcReport} function can be used to produce a PDF QC report of the most common plots. If provided, the optional sample name and group options will be used to label and color plots. Samples within a group are assigned the same color. The sample group option can also be used as a very cursory way to check for batch effects (e.g. by setting it to a processing day variable.) <>= qcReport(RGset, sampNames = pd$Sample_Name, sampGroups = pd$Sample_Group, pdf = "qcReport.pdf") @ The components of the QC report can also be customized and produced individually as detailed below. \subsubsection{Density plots} The \Rcode{densityPlot} function produces density plots of the methylation Beta values for all samples, typically colored by sample group. While the density plots in Figure \ref{fig:densityPlot} are useful for identifying deviant samples, it is not easy to identify the specific problem sample. If there is a concern about outlier samples, a useful follow-up is the ``bean'' plot (Figure \ref{fig:densityBeanPlot}) that shows each sample in its own section. While the shape of the distribution for ``good'' samples will differ from experiment to experiment, many conditions have methylation profiles characterized by two modes - one with close to 0\% methylation, and a second at close to 100\% methylation. \begin{figure} \begin{center} <>= densityPlot(RGset, sampGroups = pd$Sample_Group, main = "Beta", xlab = "Beta") @ \end{center} \caption{Beta density plots} \label{fig:densityPlot} \end{figure} \begin{figure} \begin{center} <>= par(oma=c(2,10,1,1)) densityBeanPlot(RGset, sampGroups = pd$Sample_Group, sampNames = pd$Sample_Name) @ \end{center} \caption{Beta beanplots} \label{fig:densityBeanPlot} \end{figure} \subsubsection{Control probe plots} The \Rcode{controlStripPlot} function allows plotting of individual control probe types (Figure \ref{fig:controlStripPlot}). The following control probes are available on the array: \begin{verbatim} BISULFITE CONVERSION I 12 BISULFITE CONVERSION II 4 EXTENSION 4 HYBRIDIZATION 3 NEGATIVE 614 NON-POLYMORPHIC 4 NORM_A 32 NORM_C 61 NORM_G 32 NORM_T 61 SPECIFICITY I 12 SPECIFICITY II 3 STAINING 6 TARGET REMOVAL 2 \end{verbatim} \begin{figure} \begin{center} <>= controlStripPlot(RGset, controls="BISULFITE CONVERSION II", sampNames = pd$Sample_Name) @ \end{center} \caption{Beta stripplot} \label{fig:controlStripPlot} \end{figure} \section{Preprocessing (normalization)} Preprocessing (normalization) takes as input a \Rcode{RGChannelSet} and returns a \Rcode{MethylSet}. A number of preprocessing options are available (and we are working on more methods). Each set of methods are implemented as a function \Rcode{preprocessXXX} with \Rcode{XXX} being the name of the method. Each method may have a number of tuning parameters. ``Raw'' preprocessing means simply converting the Red and the Green channel into a Methylated and Unmethylated signal <>= MSet.raw <- preprocessRaw(RGset) @ We have also implemented preprocessing choices as available in Genome Studio. These choices follow the description provided in the Illumina documentation and has been validated by comparing the output of Genome Studio to the output of these algorithms, and this shows the two approaches to be roughly equivalent (for a precise statement, see the manual pages). Genome studio allows for background subtraction (also called background normalization) as well as something they term control normalization. Both of these are optional and turning both of them off is equivalent to raw preprocessing (\Rcode{preprocessRaw}). <>= MSet.norm <- preprocessIllumina(RGset, bg.correct = TRUE, normalize = "controls", reference = 2) @ The \Rcode{reference = 2} selects which array to use as ``reference'' which is an arbitrary array (we are not sure how Genome Studio makes its choice of reference). \subsubsection{Operating on a MethylSet} Once a \Rcode{MethylSet} has been generated, we have a various ways of getting access to the methylation data. The most basic functions are \Rcode{getMeth} and \Rcode{getUnmeth}, which returns unlogged methylation channels. The function \Rcode{getBeta} gets ``beta''-values which are values between 0 and 1 with 1 interpreted as very high methylation. If \Rcode{type = "Illumina"} (not the default) these are computed using Illumina's formula \begin{displaymath} \beta = \frac{M}{M + U + 100} \end{displaymath} Finally, we have the ``M-values'' (not to be confused with the methylation channel obtained by \Rcode{getMeth}). M-values are perhaps an unfortunate terminology, but it seems to be standard in the methylation array world. These are computed as $\textrm{logit}(\beta)$ and are obtained by \Rcode{getM}. <>= getMeth(MSet.raw)[1:4,1:3] getUnmeth(MSet.raw)[1:4,1:3] getBeta(MSet.raw, type = "Illumina")[1:4,1:3] getM(MSet.raw)[1:4,1:3] @ \subsubsection{MDS plots} After preprocessing the raw data to obtain methylation estimates, Multi-dimensional scaling (MDS) plots provide a quick way to get a first sense of the relationship between samples. They are similar to the more familiar PCA plots and display a two-dimensional approximation of sample-to-sample Euclidean distance. Note that while the plot visualizes the distance in epigenomic profiles between samples, the absolute positions of the points is not meaningful. One often expects to see greater between-group than within-group distances (although this clearly depends on the particular experiment). The most variable locations are used when calculating sample distances, with the number specified by the \Rcode{numPositions} option. Adding sample labels to the MDS plot is a useful way of identifying outliers (figure \ref{fig:MDS}) that behave differently from their peers. \begin{figure} \begin{center} <>= mdsPlot(MSet.norm, numPositions = 1000, sampGroups = pd$Sample_Group, sampNames = pd$Sample_Name) @ \end{center} \caption{Multi-dimensional scaling plot} \label{fig:MDS} \end{figure} \subsubsection{The validation of \Rcode{preprocessIllumina}} By validation we mean ``yielding output that is equivalent to Genome Studio''. Illumina offers two steps: control normalization and background subtraction (normalization). Using output from Genome Studio we are certain that the control normalization step is validated, with the following caveat: control normalization requires the selection of one array among the 12 arrays on a chip as a reference array. It is currently unclear how Genome Studio selects the reference; if you know the reference array we can recreate Genome Studio exactly. Background subtraction (normalization) is almost correct: for 18 out of 24 arrays we see exact equivalence and for the remaining 6 out of 24 arrays we only see small discrepancies (a per-array max difference of 1-4 for unlogged intensities). A script for doing this is in \texttt{scripts/GenomeStudio.R}. \subsubsection{Subset-quantile within array normalisation (SWAN)} SWAN (subset-quantile within array normalisation) is a new normalization method for Illumina 450k arrays. What follows is a brief description of the methodology (written by the authors of SWAN): Technical differences have been demonstrated to exist between the Type I and Type II assay designs within a single 450K array\cite{Bibikova:2011,Dedeurwaerder:2011}. Using the SWAN method substantially reduces the technical variability between the assay designs whilst maintaining the important biological differences. The SWAN method makes the assumption that the number of CpGs within the 50bp probe sequence reflects the underlying biology of the region being interrogated. Hence, the overall distribution of intensities of probes with the same number of CpGs in the probe body should be the same regardless of design type. The method then uses a subset quantile normalization approach to adjust the intensities of each array \cite{Maksimovic:2012}. SWAN takes a \Rcode{MethylSet} as input. This can be generated by either \Rcode{preprocessRaw} or \Rcode{preprocessIllumina}. Calling the function without specifying a \Rcode{MethylSet} uses \Rcode{preprocessRaw}. It should be noted that, in order to create the normalization subset, SWAN randomly selects Infinium I and II probes that have one, two and three underlying CpGs; as such, we recommend setting a seed (using \Rcode{set.seed})before using \Rcode{preprocessSWAN} to ensure that the normalized intensities will be identical, if the normalization is repeated. <>= Mset.swan <- preprocessSWAN(RGsetEx, MsetEx) @ The technical differences between Infinium I and II assay designs can result in aberrant beta value distributions (Figure~\ref{fig:plotBetaTypes}, panel ``Raw''). Using SWAN corrects for the technical differences between the Infinium I and II assay designs and produces a smoother overall beta value distribution (Figure~\ref{fig:plotBetaTypes}, panel ``SWAN''). \begin{figure} \centering <>= par(mfrow=c(1,2)) plotBetasByType(MsetEx[,1], main = "Raw") plotBetasByType(Mset.swan[,1], main = "SWAN") @ \caption{The effect of normalizing using SWAN.}\label{fig:plotBetaTypes} \end{figure} \section{Finding differentially methylated positions (DMPs)} We are now ready to use the normalized data to identify DMPs, defined as CpG positions where the methylation level correlates with a phenotype of interest. The phenotype may be categorical (e.g. cancer vs. normal) or continuous (e.g. blood pressure). We will create a 20,000 CpG subset of our dataset to speed up the demo: <>= mset <- MSet.norm[1:20000,] @ \subsubsection{Categorical phenotypes} The \Rcode{dmpFinder} function uses an F-test to identify positions that are differentially methylated between (two or more) groups. Tests are performed on logit transformed Beta values as recommended in Pan et al. Care should be taken if you have zeroes in either the Meth or the Unmeth matrix. One possibility is to threshold the beta values, so they are always in the interval $[\epsilon, 1-\epsilon]$. We call $\epsilon$ the betaThreshold Here we find the differences between GroupA and GroupB. <>= table(pd$Sample_Group) M <- getM(mset, type = "beta", betaThreshold = 0.001) dmp <- dmpFinder(M, pheno=pd$Sample_Group, type="categorical") head(dmp) @ \Rcode{dmpFinder} returns a table of CpG positions sorted by differential methylation p-value. We can use the \Rcode{plotCpG} function to plot methylation levels at individual positions: <>= cpgs <- rownames(dmp)[1:4] par(mfrow=c(2,2)) plotCpg(mset, cpg=cpgs, pheno=pd$Sample_Group) @ \subsubsection{Continuous phenotypes} We can also identify DMPs where the mean methylation level varies with a continuous covariate using linear regression. Since the sample dataset does not contain any continuous phenotypes we will simulate one for demonstration purposes: <>= set.seed(123) @ <>= continuousPheno <- rnorm(nrow(pd)) @ We now search for DMPs associated with this phenotype. <>= dmp <- dmpFinder(mset, pheno=continuousPheno, type="continuous") dmp[1:3,] @ The \Rcode{beta} column gives the change in mean phenotype for each unit increase of methylation. We can filter the DMP list to exclude positions with a small effect size: <>= dmp <- subset(dmp, abs(beta)>1) @ The \Rcode{plotCpg} function can be used to visualise these continuous DMPs: <>= cpgs <- rownames(dmp)[1:4] par(mfrow=c(2,2)) plotCpg(mset, cpg=cpgs, type="continuous", pheno=continuousPheno, xlab="Phenotype 1") @ \section{Advanced: The manifest object} In order to preprocess the data we need a ``manifest'' object. This object is similar to the union of a CDF environment and a probe package (and may be restructured). Essentially it describes what probes are on the array and how they are matched together. \emph{The manifest object only depends on the array design. It is not related to annotating the CpGs measured by the array.} The internal structure of the manifest object should not be of concern to users. However, it may be useful to know something about the array design. First we have a look at the object: <>= IlluminaHumanMethylation450kmanifest head(getProbeInfo(IlluminaHumanMethylation450kmanifest, type = "I"), n = 3) head(getProbeInfo(IlluminaHumanMethylation450kmanifest, type = "II"), n = 3) head(getProbeInfo(IlluminaHumanMethylation450kmanifest, type = "Control"), n = 3) @ The 450k array has a rather special design. It is a two-color array, so each array will have an associated Green signal and a Red signal. On the 450k array, a CpG may be measured by a ``type I'' or ``type II'' design. The literature often uses the term``type I/II probes'' which we believe is unfortunate (see next paragraph). Each CpG has an associated methylated and un-methylated signal. If the CpG is of ``type I'', the methylation and un-methylation signal are originating from two different probes (physical location on the array). There is one set of ``type I'' CpGs where the signal comes from the Green channel for both probes (and the Red channel measures nothing) and another set where the signal comes from the Red channel. If the CpG is of ``type II'', a single probe (physical location) is being used to measure the methylated/un-methylated signal and the methylated signal is always measured in the Green channel. This is reflected in the manifest object seen above: ``type I'' CpGs have ``AddressA'', ``AddressB'' (this is a link to the physical location on the array) as well as ``ProbeSeqA'' and ``ProbeSeqB''. They also have a ``Col'' indicator (which channel is the methylated signal coming from). In contrast ``type II'' CpGs have a single ``Address'', one ``ProbeSeq'' and no color information. Because CpGs of ``type I'' are measured using two different physical probes, we dislike calling the probes ``type I/II'' and instead attaches the type to the CpG itself. Note that Illumina uses a special ``cgXXX'' name for the CpGs. There is actually a meaning to this, not unlike the meaning associated with ``rsXXX'' numbers for SNPs. Essentially the XXX is a hash of the bases surrounding the CpG, making the cgXXX numbers independent of genome version. Illumina has a technical note describing this. \section{SessionInfo} <>= toLatex(sessionInfo()) @ \nocite{*} \bibliographystyle{unsrturl} \bibliography{minfi} \end{document} % Local Variables: % eval: (add-hook 'LaTeX-mode-hook '(lambda () (if (string= (buffer-name) "minfi.Rnw") (setq fill-column 100)))) % LocalWords: LocalWords methylation CpG CDF ExpressionSet Affymetrix preprocessed preprocessing unlogged % LocalWords: methylated Methylated unmethylated Unmethylated Grn IDAT dataset Illumina CSV % End: