%\VignetteIndexEntry{SomatiCA Vignette}
%\VignetteKeyword{sequencing}
%\VignetteKeyword{copy number variation}
%\VignettePackage{SomatiCA}

\documentclass[12pt]{article}

\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{times}
\usepackage{Sweave}              % Comment out if not using Sweave

\def\E{\mathord{I\kern-.35em E}}
\def\R{\mathord{I\kern-.35em R}}
\def\P{\mathord{I\kern-.35em P}}

\topmargin=-0.5in
\textheight=9in
\textwidth=6.5in
\oddsidemargin=0in

\newcommand{\proglang}[1]{\textbf{#1}}
\newcommand{\pkg}[1]{\texttt{\textsl{#1}}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\mg}[1]{{\textcolor {magenta} {#1}}}
\newcommand{\gr}[1]{{\textcolor {green} {#1}}}
\newcommand{\bl}[1]{{\textcolor {blue} {#1}}}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=0.2in
\evensidemargin=0.2in
\headheight=0in
\headsep=0in

\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em,
                                              frame=single}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em,
                                             frame=single}

\begin{document}
\SweaveOpts{concordance=TRUE}
\SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=FALSE, keep.comment=TRUE}
\SweaveOpts{prefix.string=Fig}

\title{SomatiCA: identifying, characterizing and quantifying \\ somatic copy number aberrations from cancer genome sequencing}
\author{Mengjie Chen, Hongyu Zhao}
\date{17 Jan 2013}
\maketitle


\section{Introduction}
This guide provides a tour of the Bioconductor package \code{SomatiCA}, a R package that is capable of identifying, characterizing and quantifying somatic CNAs from cancer whole genome sequencing. It is especially designed for somatic copy number analyses taking into account: (i) an unknown fraction of normal cells (admixture rate) that are nearly always intermixed with cancer cells; and (ii) the heterogeneity of cancer cell population owing to ongoing subclonal evolution. 
The package implements a pipeline for characterizing somatic copy number aberrations based on different statistical methods: segmentation on Lesser Allele Frequency (LAF), Maximum Likelihood Estimation of somatic ratio (read depth ratio of tumor/ normal), admixture rate estimation by a Bayesian finite mixture model and subclonality characterization based on hypothesis testing.  It is especially suitable for studies designed to understand tumor evolution.  It currently works for cancer sample with control.

This manual is composed of basic test run on each module with a  simulated dataset. The main purpose is to help users familiar with basic functions and the data format. 

\subsection{User Guide}
A user guide with detailed analysis on real data, is available as part of the online documentation. To reach the User's Guide, install the SomaticCNA package and load it into an R session by \texttt{library(SomatiCA)}. In R for Windows, the User's Guide will then be available from the drop-down menu called ``Vignettes''.
In other operating systems, type
<<>>=
library(SomatiCA)
SomatiCAUsersGuide()
@
at the R prompt to open the User's Guide in a pdf viewer.

\section{Overview of capabilities}
\subsection{Input}
<<echo=FALSE>>=
options(width=60)
@
Let us simulate a simple dataset with 1450 snps on 2 chromosome.
<<>>=
set.seed(1)
rawLAF <- c(rnorm(300, 0.2, 0.05), rnorm(300, 0.4, 0.05), 
            rnorm(200, 0.3, 0.05), rnorm(200, 0.2, 0.05), 
            rnorm(200, 0.3, 0.05), rnorm(250, 0.4, 0.05)) 
rawLAF <- ifelse(rawLAF>0.5, 1-rawLAF, rawLAF) 
germLAF <- c(rnorm(800+650, 0.4, 0.05)) 
germLAF <- ifelse(germLAF>0.5, 1-germLAF, germLAF) 
reads1 <- c(rpois(300, 25), rpois(300, 50), rpois(200, 60),  
            rpois(200, 25), rpois(200, 40), rpois(250, 50))
reads2 <- rpois(800+650,50)
chr <- c(rep("chr1", 800), rep("chr2", 650))
position <- c(seq(1, 16000000, by=20000), seq(1, 13000000, by=20000))
zygo <- rep("het", 800+650)
data <- data.frame(chr, as.integer(position), as.character(zygo), as.integer(reads1), rawLAF, as.integer(reads2), germLAF) 

@
The input of \pkg{SomatiCA} is formatted as a GRanges instance as defined in \pkg{GenomicRanges}. 
<<>>=
colnames(data) <- c("seqnames", "start", "zygosity", "tCount",
                 "LAF", "tCountN", "germLAF")            
input <- SomatiCAFormat(data, missing = F, verbose = T)
input
@

 \subsection{Segmentation}
Given input in the format of a GRanges object, \code{larsCBSsegment()} segments each chromosome with LAF of heterozygous sites on that chromosomes. 
\code{larsCBSsegment()} extracts hetergozygous sites by `grep' any site with `het' in the column of zygosity. For Complete Genomic data, sites with zygosity of `het-ref' and `het-alt' will be extracted. If genotype calling results are obtained from other platforms, a transformation is needed. For example, the `0/1/2' coding from VCF file will be needed to transform to `het'/`hom' coding. Users can name `zygosity' in their own way but keep in mind that only names containing `het' will be used as heterozygous sites for segmentations in SomatiCA, such as `het', `het-ref', `heter', `heter1' etc.

\code{larsCBSsegment()} firstly calls a function \code{denoise()} to smooth the outliers. Then it segment each chromosome with CBS followed by a model selection procedure. The default model selection criteria is Bayesian Information Criterion (BIC). Users can set \code{rss=T} to apply BIC plus a minimum cut-off for change in residue sum of squares (RSS) between neighboring change points. \code{collapse.k} is an option to average LAF on each k SNPs. 
<<>>=
seg <- larsCBSsegment(input, collapse.k = 0, ncores = 1, verbose = T, rss = FALSE)
@
This is a toy example without much noise. Consider to use rss=T to select change points from sequencing data. 
Output of \code{larsCBSsegment()} includes two part: segmentation results and heterozygous sites used for segmentation (denoised). 
<<>>=
seg
@
The segmentation result can be examined by \code{plotSegment()}. If multiple chromosomes are segmented, use \code{k} to plot segmentation on each chromosome respectively by their orders in the input.

<<eval=FALSE>>=
plotSegment(seg$segment, input, k = 1, smooth = F)
@

\begin{figure}
\begin{center}
<<label=fig1, fig=TRUE, echo=FALSE>>=
plotSegment(seg$segment, input, k = 1, smooth = F, dev.new=FALSE)
@
\end{center}
\caption{Segmentation on chr1 based on LAF.}
\label{fig:1}
\end{figure}

\subsection{Estimation of somatic ratio}
Somatic ratio is defined as the ratio of read depths between a tumor and its paired normal sample for a given segment. SomatiCA implements different methods to estimate somatic ratio. For the "mle" method somatic ratio is estimated by a maximum likelihood approach. For the "mean" method, somatic ratio is estimated by the ratio between mean of tumor sample and normal sample. For the "geometric", somatic ratio is estimated by geometric mean of somatic ratios of all sites in a given segment. 
To estimate somatic ratio, both segmentation and input with read depths are required.  
<<>>=
segmentwithRatio <- somaticRatio(seg$segment, input, method = "mle") 
segmentwithRatio
@

\subsection{Refine segments}
Two adjacent segments are merged if the difference in the somatic ratios is less than "threshold2", which is tunable in the implementation with its default value being 0.05, equivalent to 5\% change in somatic copy-number without normal contamination. The MLEs of the somatic ratio for the refined segments are recalculated. This refinement procedure is applied repeatedly until no adjacent segments have somatic ratio difference less than T. "threshold1" is the threshold used to merge the segments based on median LAF. 
<<>>=
refined <- refineSegment(segmentwithRatio, input, method = "mle", 
                        adjust = TRUE, threshold1 = 0 , threshold2 = 0.05)   	
refined
@


\subsection{Estimation of admixture rate}
The estimation of the admixture rate is accomplished by fitting the somatic copy number (somatic ratio*2) of all segments with a Bayesian finite mixture model, with components centered at the discrete levels. Each segment was assigned with a discrete level based on corresponding posterior probability. Segments with ambiguous assignments will be classified as candidate subclonal events and excluded from admixture rate inference. The admixture rate will be estimated by an optimal solution contributed by explanation of tumor copy number with all remaining segments as integer level. 
\pkg{SomatiCA} implements a Markov Chain Monte Carlo with Metropolis Hasting algorithm to estimate posterior probabilities of the Bayesian finite mixture model. Option \code{mcmc} set the number of MCMC iteration, \code{burnin} set the number of MCMC iteration for burnin and \code{p} set the cutoff of posterior probability for ambiguous integer copy number assignments.
<<results=hide>>=
ll <- admixtureRate(refined, mcmc = 5000, burnin = 1000, p = 0.01) 
admix <- ll$admix
@

\code{copynumberCorrected} could take a segmentation profile and an admixture rate to calculate the integer somatic copy number in tumor sample then characterize somatic events based on corrected somatic ratio. Each segment will be annotated with "=", "Loss", "Gain", "LOH", "neutral LOH" or "double deletion". Note that somatic copy number here is calculated from the ratio with the assumption that the control sample is diploid. This result will likely be modified in next step with the calculation of ploidy of each segment in the control sample.
<<>>=
y <- copynumberCorrected(segmentwithRatio, admix)
y
@
\subsection{Subclonality characterization}
\pkg{SomatiCA} estimates subclonality for each somatic copy number abberation. To do this, it first calculates allelic copy number nB and nA (segment level allelic copy number is estimated by median in that segment) in a control sample based on GC corrected read counts. SomatiCA tests whether copy number change in corresponding tumor sample can result in a change of exactly one copy of one allele. If the somatic ratio (corrected by admixture rate) in the corresponding tumor sample is greater than 1, SomatiCA tests for one copy gain, otherwise it tests for one copy loss. With null hypothesis that clonal copy number ratio follows a normal distribution, p-value is calculated for each segment as the probability of obtaining a copy number ratio at least as extreme as the one that was actually observed. Segments with p-value less than 0.05 are classified as subclonal.
<<>>=
data(GCcontent)
segmentGCcorrected <- segmentGCbiasRemoval(y, input, GCcontent)
segmentClonality <- subclonality(segmentGCcorrected, admix)
@
Users can further merge neighboring segments with same somatic copy number and events together using \code{MergeSegment} if necessary. 

<<>>=
merged <- MergeSegment(segmentClonality)
@
The subclonality and somatic copy number for a sample can be visualized by \code{plotSubclonality}.

<<eval=FALSE>>=
plotSubclonality(segmentClonality)
@

\begin{figure}
\begin{center}
<<label=fig2, fig=TRUE,echo=FALSE>>=
plotSubclonality(segmentClonality, dev.new=FALSE)
@
\end{center}
\caption{An example of subclonality.}
\label{fig:2}
\end{figure}

Then you can output all the clonal events or the subclonal events with high proportion of aberration. 
<<>>=
merged[elementMetadata(merged)[, "clonality"]=="clonal", ]
@ 

\section{SomatiCA pipeline}
Get familiar with how \code{SomatiCA} works? If so, call \code{SomatiCApipe} directly to run all the steps described above automatically. Multithread computing is still supported through \code{ncores}. We recommend to use \code{verbose=T} to print working messages and track the working progress.     
<<eval=FALSE>>=
data(GCcontent)
res <- SomatiCApipe(input, ncores = 1, collapse.k = 0, method = "mle", 
                  mcmc = 50000, burnin = 10000, p = 0.001, GC = GCcontent)
merged <- MergeSegment(res$segment)
@

\section{Others}
\subsection{GC content}
In case the precalculated GC content will be out of date or users may want to use smaller window size, we provide a function first downloading the .fa.gz of a given chromosome from UCSC genome browser and then calculating the GC content for a given window size (10,000 bp in the following example).    
<<eval=FALSE>>=
chr <- paste("chr", c(1:22, "X"), sep="")
url <- "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/"
GC <- foreach(i=chr, .combine=rbind)%dopar%{
      return(GCcount(i, 10000, url))
}
@

\end{document}