\chapter{High Quality Graphics in R}
\label{Chap:3}


There are two important types of data visualization. The first enables a scientist to
effectively explore the data and make discoveries about the complex processes at work. The
other type of visualization should be a clear and informative illustration of the study's
results that she can show to others and eventually include in the final publication.

Both of these types of graphics can be made with R. In fact, R offers multiple systems 
for plotting data. This is because R is an extensible system, and because progress in 
R graphics has proceeded largely not be replacing the old functions, but by adding packages. 
Each of the different approaches has its own advantages and limitations. In this chapter we'll 
briefly get to know some of the base R plotting functions\footnote{They live in 
the \CRANpkg{graphics} package, which ships with every basic R installation.}. 
Subsequently we will switch to the  \CRANpkg{ggplot2} graphics system.

Base R graphics were historically first: 
simple, procedural, canvas-oriented. There are many specialized functions for different types of plots.
Recurring plot modifications, like legends, grouping of the data by using different plot symbols, colors or subpanels, have to be reinvented over and over again. Complex plots can quickly get messy to program.
A more high-level approach -- \emph{grammar of graphics}, plots are built in modular pieces, so that we
can easily try different visualization types for our data in an intuitive and easily
deciphered way, like we can switch in and out parts of a sentence in human language.

We'll explore faceting, for showing more than 2 variables at a time. Sometimes this is also called
lattice\footnote{The first major R package to implement this was \CRANpkg{lattice}; nowadays much of that
functionality is also provided through \CRANpkg{ggplot2}.} graphics, and it allows us to
visualise data to up to 4 or 5 dimensions.

In the end of the chapter, we cover some specialized forms of
plotting such as maps and ideograms, still building on the base concept of the grammar of
graphics. \fixme{Update}

%--------------------------------------------------
\section{Goals for this Chapter}
%--------------------------------------------------
\begin{itemize}
\item Review the basics of \Rpackage{base} R plotting
\item Understand the logic behind the \emph{grammar of graphics} concept
\item Introduce \CRANpkg{ggplot2}'s \Rfunction{ggplot} function
\item See how to plot 1D, 2D, 3-5D data, and understand faceting
\item Become good at rapidly exploring data sets by visualization
\item Create beautiful and intuitive plots for scientific presentations and publications
\end{itemize}

%--------------------------------------------------
\section{Base R Plotting}
%--------------------------------------------------

\begin{marginfigure}[-60mm]
\includegraphics[width=\linewidth]{chap3-r_basicplotting1-1}
\caption{Plot of concentration vs.\ density for an ELISA assay of DNase.}
\label{rgraphics:fig:basicplotting1}
\end{marginfigure}
%
The most basic function is the \Rfunction{plot} function. In the code below, the output of which is shown in
Figure~\ref{rgraphics:fig:basicplotting1}, it is used to plot data from an enzyme-linked
immunosorbent assay (ELISA) assay. The assay was used to quantify the activity of the
enzyme deoxyribonuclease (DNase), which degrades DNA. The data are assembled in the R
object \Robject{DNase}, which conveniently comes with base R.
\Robject{DNase} is a \Rclass{`r class(DNase)`} whose columns are \Robject{Run}, the assay run;
\Robject{conc}, the protein concentration that was used; and \Robject{density}, the
measured optical density.
%
```{r basicplotting1}
head(DNase)
plot(DNase$conc, DNase$density)
```
%
\begin{marginfigure}[-60mm]
\includegraphics[width=\linewidth]{chap3-r_basicplotting2-1}
\caption{Same data as in Figure~\ref{rgraphics:fig:basicplotting1} but with better axis labels and a different plot symbol.}
\label{rgraphics:fig:basicplotting2}
\end{marginfigure}
%
This basic plot can be customized, for example by changing the plotting symbol and axis labels
as shown in Figure~\ref{rgraphics:fig:basicplotting2} by using the parameters \Robject{xlab}, \Robject{ylab}
and \Robject{pch} (plot character). The information about the labels is stored in the
object \Robject{DNase}, and we can access it with the \Rfunction{attr} function.
%
```{r basicplotting2}
plot(DNase$conc, DNase$density,
  ylab = attr(DNase, "labels")$y,
  xlab = paste(attr(DNase, "labels")$x, attr(DNase, "units")$x),
  pch = 3, 
  col = "blue")
```
%$
Besides scatterplots, we can also use built-in functions to create
histograms and boxplots (Figure~\ref{rgraphics:fig:basicplotting3}).
%
```{r basicplotting3, fig.width=3, fig.height=3}
hist(DNase$density, breaks=25, main = "")
```
%$
```{r basicplotting4, fig.width=5.6, fig.height=3.5}
boxplot(density ~ Run, data = DNase)
```
%
Boxplots are convenient to show multiple distributions next to each other in a compact space,
and they are universally preferable to the barplots with error bars sometimes still seen in 
biological papers.
We will see more about plotting univariate distributions in Section~\ref{rgraphics:sec:univar}.

\begin{marginfigure}[-70mm]
\includegraphics[width=\linewidth]{chap3-r_basicplotting3-1}\\
\includegraphics[width=\linewidth]{chap3-r_basicplotting4-1}
\caption{Histogram of the density from the ELISA assay, and boxplots of these values
stratified by the assay run. The boxes are ordered along the axis in lexicographical order because
the runs were stored as text strings. We could use R's type conversion functions to achieve numerical ordering.}
\label{rgraphics:fig:basicplotting3}
\end{marginfigure}

These plotting functions are great for quick interactive exploration of data; but we
run quickly into their limitations if we want to create more sophisticated displays.
We are going to use a visualization framework
called the grammar of graphics, implemented in the package \CRANpkg{ggplot2}, that enables step by step
construction of high quality graphics in a logical and elegant manner. But first let us load up
an example dataset.


%--------------------------------------------------
\section{An Example Dataset}
\label{rgraphics:sec:exampledata}
%--------------------------------------------------
To properly testdrive the \CRANpkg{ggplot2} functionality, we are going to need a dataset
that is big enough and has some complexity so that it can be sliced and viewed from many
different angles.

\begin{marginfigure}
\includegraphics[width=\linewidth]{Yusukecells-2}
\caption{Single-section immunofluorescence image of the E3.5 mouse blastocyst stained for Serpinh1,
a marker of primitive endoderm (blue), Gata6 (red) and Nanog (green).
Scale bar: 10 $\mu$m.\label{rgraphics:fig:cells}}
\end{marginfigure}

We'll use a gene expression microarray data set that reports the transcriptomes
of around 100 individual cells from mouse embryos at different time points in early development.
The mammalian embryo starts out as a single cell, the fertilized egg.
Through synchronized waves of cell divisions, the egg multiplies into a clump of cells that at first
show no discernible differences between them. At some point, though, cells choose different lineages.
Eventually, by further and further specification, the different cell types and tissues arise that
are needed for a full organism. The aim of the experiment\cite{Ohnishi2014} was to investigate the 
gene expression changes that associated with the first symmetry breaking event in the embryo. 
We'll further explain the data as we go. More details can be found in the paper and in the 
documentation of the Bioconductor data package \Biocexptpkg{Hiiragi2013}. We first load the package 
and the data:
%
```{r loadHiiragi}
library("Hiiragi2013")
data("x")
dim(exprs(x))
```
```{r checkHiiragi, echo=FALSE}
stopifnot(packageDescription("Hiiragi2013")$Version >=  package_version("1.3.2"), # currently in Bioc-Devel
          "sampleGroup" %in% colnames(pData(x)))

```
%$
You can print out a more detailed summary of the \Rclass{`r class(x)`} object \Robject{x}
by just typing \Robject{x} at the R prompt.  The `r ncol(x)` columns of the data matrix
(accessed above through the \Rfunction{exprs} function)
correspond to the samples (and each of these to a single cell), the `r nrow(x)` rows correspond to the genes
probed by the array, an Affymetrix `r annotation(x)` array.  The data were
normalized using the RMA method\cite{Irizarry:Biostat:2003}. The raw data are also available in the
package (in the data object \Robject{a}) and at EMBL-EBI's ArrayExpress database under the
accession code E-MTAB-1681.

Let's have a look what information is available about the samples\footnote{The
notation \`r {z <- pData(x)[1, "sampleColour"]; stopifnot(substring(z, 1, 1)=="#"); z}` 
is a hexadecimal representation of the RGB coordinates
of a colour; more on this in Section~\ref{rgraphics:sec:colorspaces}.}.
% Note, the \'r ...' expression assumes that the R code evaluates to a string
% that begins with `#`, which unescaped would be a problem for LaTeX
%
```{r xpData}
head(pData(x), n = 2)
```
%
The information provided is a mix of information about the cells (i.e., age, size and
genotype of the embryo from which they were obtained) and technical information (scan
date, raw data file name). By convention, time in the development of the mouse embryo 
is measured in days, and reported as, for instance, \texttt{E3.5}. 
Moreover, in the paper the authors divided the cells into
`r length(unique(x$sampleGroup))` %$
biological groups (\Robject{sampleGroup}), based on age,
genotype and lineage, and they defined a colour scheme to represent these groups
(\Robject{sampleColour})\footnote{In this chapter we'll use the spelling
  \textit{colour} (rather than \textit{color}). This is to stay consistent with the spelling
  adopted by the R package \CRANpkg{ggplot2}. Other packages, like
  \CRANpkg{RColorBrewer}, use the other spelling. In some cases, function or argument
  names are duplicated to allow users to use either choice, although you cannot always rely on that.}.
Using the \Rfunction{group\_by} and \Rfunction{summarise} functions from the package \CRANpkg{dplyr}, 
we'll define a little \Rclass{data.frame} \Robject{groups} that contains
summary information for each group: the number of cells and the preferred colour.
%
```{r groupSize}
library("dplyr")
groups = group_by(pData(x), sampleGroup) %>% 
  summarise(n = n() , colour = unique(sampleColour)) 
groups
```
%
The cells in the groups whose name contains
\texttt{FGF4-KO} are from embryos in which the FGF4 gene, an important
regulator of cell differentiation, was knocked out.  Starting from E3.5, the wildtype cells (without
the FGF4 knock-out) undergo the first symmetry breaking event and differentiate into different 
cell lineages, called pluripotent epiblast (EPI) and primitive endoderm (PE).

%------------------------------------------------------------
\section{ggplot2}
\label{rgraphics:sec:ggplot}
%------------------------------------------------------------

The \CRANpkg{ggplot2} package is a package created by Hadley Wickham that implements the
idea of \emph{grammar of graphics} -- a concept created by Leland Wilkinson in his
eponymous book\cite{GrofGrbook}. Comprehensive documentation for the
package\cite{ggplot2} can be found \href{http://ggplot2.org}{on its website}.  The
online documentation includes example use cases for each of the graphic types that
are introduced in this chapter (and many more) and is an invaluable resource when creating
figures.

Let's start by loading the package and redoing the simple plot of Figure~\ref{rgraphics:fig:basicplotting1}.
%
```{r figredobasicplottingwithggplot, fig.width = 3.5, fig.height = 3}
library("ggplot2")
ggplot(DNase, aes(x = conc, y = density)) + geom_point()
```
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_figredobasicplottingwithggplot-1}
\caption{Our first \CRANpkg{ggplot2} figure, similar to the base graphics 
Figure~\ref{rgraphics:fig:basicplotting1}.}
\label{rgraphics:fig:figredobasicplottingwithggplot}
\end{marginfigure}
%
We just wrote our first "sentence" using the grammar of graphics. 
Let us deconstruct this sentence. 
First, we specified the \Rclass{`r class(DNase)`} that contains the data, \Robject{DNase}.
Then we told \Rfunction{ggplot} via the \Robject{aes}\footnote{This stands for \emph{aesthetic}, a terminology
that will become clearer below.} argument which variables we want on the $x$- and $y$-axes, respectively.
Finally, we stated that we want the plot to use points, by adding the result of calling the function 
\Rfunction{geom\_point}.

Now let's turn to the mouse single cell data and plot the number of samples for each of the
`r nrow(groups)` groups using the \Rfunction{ggplot} function. The result is shown
in Figure~\ref{rgraphics:fig:qplot1}. 
%
```{r figqplot1, fig.width=5, fig.height=4}
ggplot(data = groups, aes(x = sampleGroup, y = n)) + 
  geom_bar(stat = "identity")
```
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_figqplot1-1}
\caption{A barplot, produced with the \Rfunction{ggplot} function from the table of 
group sizes in the mouse single cell data.}
\label{rgraphics:fig:qplot1}
\end{marginfigure}
%
\Rfunction{ggplot} generally expects the data to be plotted to be assembled in a 
\Rclass{data.frame} -- as opposed to, say, two individual vectors. 
By adding the call to \Rfunction{geom\_bar} we told \Rfunction{ggplot} that we want each data item 
(each row of \Robject{groups}) to be represented by a bar.
Bars are one geometric object (\eindex{geom}) that \Rfunction{ggplot} knows 
about. We've already seen another geom in Figure~\ref{rgraphics:fig:figredobasicplottingwithggplot}: points.
We'll encounter many other possible geometric objects later. 
We used the \Rfunction{aes} to indicate that we want the groups shown along the $x$-axis 
and the sizes along the $y$-axis.
Finally, we provided the argument \Robject{stat = "identity"} (in other words, do nothing) to the 
\Rfunction{geom\_bar} function, since otherwise it would try to compute a histogram of the  
data (the default value of \Robject{stat} is \Robject{"count"}).  \Robject{stat} is short for 
\emph{statistic}, which is what we call any
function of data. The identity statistic just returns the data
themselves, but there are other more interesting statistics, such as binning, smoothing,
averaging, taking a histogram, or other operations that summarize the data in some way. 

\begin{ques}
\item Flip the $x$- and $y$-aesthetics to produce a horizontal barplot.
\end{ques}

These concepts --data,
geometrical objects, statistics-- are some of the ingredients of the grammar of graphics,
just as nouns, verbs and adverbs are ingredients of an English sentence.

The plot in Figure~\ref{rgraphics:fig:qplot1} is not bad, but there are several
potential improvements. We can use colour for the bars to help us quickly
see which bar corresponds to which group. This is particularly useful if we use the
same colour scheme in several plots. To this end, let's define a named vector
\Robject{groupColour} that contains our desired colours for each possible value of
\Robject{sampleGroup}\footnote{The information is completely equivalent to that in the 
\Robject{sampleGroup} and \Robject{colour} columns of the \Rclass{data.frame} \Robject{groups}, 
we're just adapting to the fact that \CRANpkg{ggplot2} expects this information in the 
form of a named vector.}.
%
```{r groupColour}
groupColour = setNames(groups$colour, groups$sampleGroup)
```
%
Another thing that we need to fix is the
readability of the bar labels. Right now they are running into each other --- a common
problem when you have descriptive names. 
%
```{r figqplot2, fig.width=5, fig.height=4}
ggplot(groups, aes(x = sampleGroup, y = n, fill = sampleGroup)) + 
  geom_bar(stat = "identity") + 
  scale_fill_manual(values = groupColour, name = "Groups") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_figqplot2-1}
\caption{Similar to Figure~\ref{rgraphics:fig:qplot1}, but with coloured bars and
better bar labels.\label{rgraphics:fig:qplot2}}
\end{marginfigure}

Let's dissect the above "sentence". We added an argument, \Robject{fill} to the \Rfunction{aes} function that
states that we want the bars to be coloured (filled) based on \Robject{sampleGroup} 
(which in this case co-incidentally is also the value of the \Robject{x} argument, but that need not be so). 
Furthermore we added a call to the \Rfunction{scale\_fill\_manual} function, which takes as its 
input a colour map -- i.\,e., the mapping from the
possible values of a variable to the associated colours -- as a named vector. We also gave this colour map
a title (note that in more complex plots, there can be several different colour maps involved).
Had we omitted the call to \Rfunction{scale\_fill\_manual}, \CRANpkg{ggplot2} would have used its choice of 
default colours. We also added a call to \Rfunction{theme} stating that we want the $x$-axis labels
rotated by 90 degrees, and right-aligned (\Robject{hjust}; the default would be to center it).

%--------------------------------------------------
\section{The Grammar of Graphics}
%--------------------------------------------------
The components of \CRANpkg{ggplot2}'s grammar of graphics are

\begin{enumerate}
\item a dataset
\item one or more geometric objects that serve as the visual representations of the data
  -- for instance, points, lines, rectangles, contours
\item a description of how the variables in the data are mapped to visual properties
  (aesthetics) of the geometric objects, and an associated scale (e. g., linear,
  logarithmic, rank)
\item a statistical summarisation rule
\item a coordinate system
\item a facet specification, i.\,e. the use of several plots to look at the same data
\end{enumerate}
%
In the examples above, Figures \ref{rgraphics:fig:qplot1} and \ref{rgraphics:fig:qplot2}, the
dataset was \Robject{groupsize}, the variables were the numeric values as well as the
names of \Robject{groupsize}, which we mapped to the aesthetics $y$-axis and $x$-axis
respectively, the scale was linear on the $y$ and rank-based on the $x$-axis (the bars are
ordered alphanumerically and each has the same width), the geometric object was the
rectangular bar, and the statistical summary was the trivial one (i.\,e., none). We did not make
use of a facet specification in the plots above, but we'll see examples later on
(Section~\ref{rgraphics:sec:facets}).

\begin{marginfigure}[-20mm]
\includegraphics[width=\linewidth]{chap3-r_figscp2layers1-1}
\caption{A scatterplot with three layers that show different statistics of the same data:
points, a smooth regression line, and a confidence band.}
\label{rgraphics:fig:scp2layers1}
\end{marginfigure}

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_figscp2layers2-1}
\caption{As Figure~\ref{rgraphics:fig:scp2layers1}, but in addition with points coloured by
  the sample group (as in Figure~\ref{rgraphics:fig:qplot2}).
  We can now see that the expression values of the gene
  `r AnnotationDbi::select(mouse4302.db, "1418765_at", keytype = "PROBEID", columns = "SYMBOL")$SYMBOL` %$
  (whose mRNA is targeted by the probe 1418765\_at) are consistently high in the early time points,
  whereas its expression goes down in the EPI samples at days 3.5 and 4.5.
  In the FGF4-KO, this decrease is delayed - at E3.5, its expression is still high.
  Conversely, the gene
  `r AnnotationDbi::select(mouse4302.db, "1426642_at", keytype = "PROBEID", columns = "SYMBOL")$SYMBOL` %$
  (1426642\_at) is off in the early timepoints and then goes up at days 3.5 and 4.5.
  The PE samples (green) show a high degree of cell-to-cell variability.}
\label{rgraphics:fig:scp2layers2}
\end{marginfigure}
%
In fact, \CRANpkg{ggplot2}'s implementation of the grammar of graphics allows you to use
the same type of component multiple times, in what are called
layers\cite{Wickham:LayeredGrammar}.  For example, the code below uses three types of
geometric objects in the same plot, for the same data: points, a line and a confidence
band.
%
```{r loadlib, echo = FALSE, message = FALSE}
library("mouse4302.db")
```
```{r findprobepairs, echo = FALSE, eval = FALSE}
# I used this code to find the below two probes
idx = order(rowVars(exprs(x)), decreasing=TRUE)[seq_len(2000)]
cc  = cor(t(exprs(x)[idx,]))
cco = order(cc)[seq(1, 1001, by=2) ]
jj2 = rownames(exprs(x))[ idx[ (cco-1) %/% length(idx) + 1 ] ]
jj1 = rownames(exprs(x))[ idx[ (cco-1) %%  length(idx) + 1 ] ]
dftx = as.data.frame(t(exprs(x)))
par(ask=TRUE)
for(i in seq(along = cco)) {
  df = AnnotationDbi::select(mouse4302.db,
   keys = c(jj1[i], jj2[i]), keytype = "PROBEID",
   columns = c("SYMBOL", "GENENAME"))
  print(ggplot(dftx, aes( x = get(jj1[i]), y = get(jj2[i]))) +
  geom_point(shape = 1) +
  xlab(paste(jj1[i], df$SYMBOL[1])) +
  ylab(paste(jj2[i], df$SYMBOL[2])) +
  ggtitle(round(cc[jj1[i], jj2[i]], 3)) +
  geom_smooth(method = "loess"))
}
```
```{r figscp2layers1, fig.width = 3.5, fig.height = 3.5}
dftx = data.frame(t(exprs(x)), pData(x))
ggplot( dftx, aes( x = X1426642_at, y = X1418765_at)) +
  geom_point( shape = 1 ) +
  geom_smooth( method = "loess" )
```
%
Here we had to assemble a copy of the expression data (\Robject{exprs(x)}) and the sample annotation
data (\Robject{pData(x)}) all together into the \Rclass{data.frame} \Robject{dftx} --
since this is the data format that \Rpackage{ggplot2} functions most easily take as input
(more on this in Sections~\ref{rgraphics:sec:tidydata1} and \ref{rgraphics:sec:tidydata2}).
```{r checkclassdftx, echo=FALSE}
stopifnot(is(dftx, "data.frame"))
```

We can further enhance the plot by using colours -- since each of the points in
Figure~\ref{rgraphics:fig:scp2layers1} corresponds to one sample, it makes sense to use
the \Robject{sampleColour} information in the object \Robject{x}.
%
```{r figscp2layers2, fig.width = 3.5, fig.height = 3.5}
ggplot( dftx, aes( x = X1426642_at, y = X1418765_at ))  +
  geom_point( aes( colour = sampleColour), shape = 19 ) +
  geom_smooth( method = "loess" ) +
  scale_colour_discrete( guide = FALSE )
```
%
\begin{ques}
In the code above we defined the \Robject{colour} aesthetics
(\Robject{aes}) only for the \Robject{geom\_point} layer, while we defined the \Robject{x}
and \Robject{y} aesthetics for all layers. What happens if we set the \Robject{colour}
aesthetics for all layers, i.\,e., move it into the argument list of \Rfunction{ggplot}?
What happens if we omit the call to \Rfunction{scale\_colour\_discrete}?
\end{ques}

\begin{ques}
Is it always meaningful to summarize scatterplot data with a regression
line as in Figures \ref{rgraphics:fig:scp2layers1} and \ref{rgraphics:fig:scp2layers2}?
\end{ques}

% Note the downward-sloping quotation marks
% around the identifiers, 1426642\_at and
% 1418765\_at. R understands these quotation marks to indicate variable names (or, here,
% column names in the \Rclass{data.frame} \Robject{dftx}), and they are needed since
% these identifiers would, without explicit quotation, not be valid variable names.

As a small side remark, if we want to find out which genes are targeted by these probe
identifiers, and what they might do, we can call\footnote{Note that here were need to use
  the original feature identifiers (e.\,g., ``1426642\_at'', without the leading
  ``X''). This is the notation used by the microarray manufacturer, by the Bioconductor
  annotation packages, and also inside the object \Robject{x}. The leading ``X'' that we
  used above when working with \Robject{dftx} was inserted during the creation of
  \Robject{dftx} by the constructor fuction \Rfunction{data.frame}, since its 
  argument \Robject{check.names} is set to \Robject{TRUE} by default.
  Alternatively, we could have kept the original
  identifer notation by setting \Robject{check.names=FALSE}, but then we would need to
  work with the backticks, such as \Robject{aes( x = \`{}1426642\_at\`{}, ...)}, to make
  sure R understands them correctly.}.
% Remark, WH 16.5.2015: I tried this initially but then got difficult to trace errors in
% below code using facet_wrap ...
%
```{r mouse4302.db, results="hide", message=FALSE}
library("mouse4302.db")
```
```{r select}
AnnotationDbi::select(mouse4302.db,
   keys = c("1426642_at", "1418765_at"), keytype = "PROBEID",
   columns = c("SYMBOL", "GENENAME"))
```
%
Often when using \Rfunction{ggplot} you will only need to specify the data, aesthetics and a
geometric object. Most geometric objects implicitly call a suitable default statistical 
summary of the data. 
For example, if you are using \Robject{geom\_smooth}, \CRANpkg{ggplot2} 
by default uses \Robject{stat = "smooth"} and then displays a line; if you use  
\Robject{geom\_histogram}, the data are binned, and the result is displayed in barplot format. 
Here's an example:
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_fighists-1}
\caption{Histogram of probe intensities for one particular sample, cell number 20, which
was from day E3.25.}
\label{rgraphics:fig:hists}
\end{marginfigure}
%
```{r fighists, fig.width = 3.5, fig.height = 2.5}
dfx = as.data.frame(exprs(x))
ggplot(dfx, aes(x = `20 E3.25`)) +
               geom_histogram(binwidth = 0.2)
```
%
\begin{ques}
What is the difference between the objects \Robject{dfx} and \Robject{dftx}? 
Why did we need to create both of the?
\end{ques}
\begin{ques}
Check the \CRANpkg{ggplot2} documentation for examples of the usage of \Robject{stat}s.
\end{ques}

Let's come back to the barplot example from above.
%
```{r figbpgg1}
pb = ggplot(groups, aes(x = sampleGroup, y = n))
```
%
This creates a plot object \Robject{pb}. If we try to display it, 
it creates an empty plot, because we haven't specified what geometric object we want to use. 
All that we have in our \Robject{pb} object so far are the data and the aesthetics
(Fig.~\ref{rgraphics:fig:figbpempty})
%
```{r figbpempty, fig.width = 3.2, fig.height = 2.5}
pb
```
%
\begin{marginfigure}[-5mm]
\includegraphics[width=\linewidth]{chap3-r_figbpempty-1}
\caption{\Robject{pb}: without a geometric object, the plot remains empty.}
\label{rgraphics:fig:figbpempty}
\end{marginfigure}
%
Now we can literally add on the other components of our plot through using the 
\Robject{+} operator (Fig.~\ref{rgraphics:fig:bpgg3}):
%
```{r bpgg3, fig.width = 5, fig.height = 4}
pb = pb + geom_bar(stat = "identity") 
pb = pb + aes(fill = sampleGroup) 
pb = pb + theme(axis.text.x = element_text(angle = 90, hjust = 1))  
pb
class(pb)
```
%
\begin{marginfigure}[0mm]
\includegraphics[width=\linewidth]{chap3-r_bpgg3-1}
\caption{The graphics object \Robject{bp} in its full glory.}
\label{rgraphics:fig:bpgg3}
\end{marginfigure}
%
This step-wise buildup --taking a graphics object already produced in
some way and then further refining it-- can be more convenient and easy to manage than,
say, providing all the instructions upfront to the single function call that creates the
graphic.  We can quickly try out different visualisation ideas without having to rebuild
our plots each time from scratch, but rather store the partially finished object and then modify it in
different ways.  For example we can switch our plot to polar coordinates to create an
alternative visualization of the barplot.
%
```{r bpgg4, fig.width = 5, fig.height = 4}
pb.polar = pb + coord_polar() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
  xlab("") + ylab("")
pb.polar
```
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_bpgg4-1}
\caption{A barplot in a polar coordinate system.}
\label{rgraphics:fig:bpgg7}
\end{marginfigure}
%
Note above that we can override previously set \Rfunction{theme} parameters by simply 
setting them to a new value --
no need to go back to recreating \Robject{pb}, where we originally set them.

%------------------------------------------------------------
\section{1D Visualisations}
\label{rgraphics:sec:univar}
%------------------------------------------------------------

A common task in biological data analysis is the comparison between several samples
of univariate measurements. In this section we'll explore some possibilities for
visualizing and comparing such samples. As an example, we'll use the intensities of a set
of four genes Fgf4, Sox2, Gata6, Gata4 and Gapdh\footnote{You can read more about these
genes in the paper associated with the data.}. On the array, they are represented by
```{r genes2ps1}
probes = c(Fgf4  = "1420085_at", Gata4 = "1418863_at",
           Gata6 = "1425463_at",  Sox2 = "1416967_at")
```
```{r genes2ps2, echo = FALSE, eval = FALSE}
# How I found the probes:
AnnotationDbi::select(mouse4302.db,
   keys = c("Fgf4", "Sox2", "Gata6", "Gata4"), keytype = "SYMBOL",
   columns = c("PROBEID"))
```
```{r genes2ps3, echo = FALSE}
probes2 = AnnotationDbi::select(mouse4302.db,
   keys = probes, keytype = "PROBEID", columns = c("SYMBOL"))
stopifnot(identical(sort(probes2$SYMBOL), sort(names(probes))),
          all(probes[probes2$SYMBOL] == probes2$PROBEID))
```
%$

%------------------------------------------------------------
\subsection{Data tidying I -- matrices versus data.frame}
\label{rgraphics:sec:tidydata1}
%------------------------------------------------------------
\fixme{Rewrite this section -- move the more philosophical parts to the Wrap-Up chapter.}

Getting data ready for visualisation often involves a lot of shuffling around
of the data, until they are in the right shape and format for the algorithm or plotting
routine.  \CRANpkg{ggplot2} likes its data in \Rclass{data.frame} objects, and more
specifically, in the long format\footnote{More about this below.}. The reasons behind this
choice are well explained in Hadley Wickham's paper on \emph{tidy
data}\cite{Wickham:TidyData}.  Essentially the long format table is a general way
of representing data that can accommodate pretty much any data type and is easy to programmatically
interface to. 

% \subsubsection{Intermezzo: tidy vs rich}
On the other hand, for a specific data type, it may not always be the most
efficient way of storing data, and it cannot easily transport rich metadata (i.\,e., data
about the data)\footnote{In other words, simple tables or \Rclass{data.frame}s cannot
offer all the nice features provided by object oriented approaches, such as encapsulation,
abstraction of interface from implementation, polymorphism, inheritance and reflection.}.
For instance, our example dataset \Robject{x} is stored as an object in Bioconductor's
\Rclass{ExpressionSet}
class, which has multiple components, most importantly, the matrix \Robject{exprs(x)} with
`r nrow(x)` rows and
`r ncol(x)` columns. The matrix elements are the gene expression measurements, and the
feature and sample associated with each measurement are implied by its position (row, column) in the matrix;
in contrast, in the long table format, the feature and sample identifiers need to be stored explicitly
with each measurement. Besides,
\Robject{x} has additional components, including the \Rclass{data.frame}s
\Robject{fData(x)} and \Robject{pData}, which provide various sets metadata about the microarray
\textbf{f}eatures and the \textbf{p}henotypic information about the samples.


To extract data from this representation and convert them into a \Rclass{data.frame},
we use the function \Rfunction{melt}, which we'll explain in more detail below.

```{r melt}
library("reshape2")
genes = melt(exprs(x)[probes, ], varnames = c("probe", "sample"))
head(genes)
```

For good measure, we also add a column that provides the gene symbol along with the probe identifiers.
%
```{r symbol}
genes$gene = names(probes)[ match(genes$probe, probes) ]
```


%------------------------------------------------------------
\subsection{Barplots}
%------------------------------------------------------------
A popular way to display data such as in our \Rclass{data.frame} \Robject{genes} is through barplots.
See Fig.~\ref{rgraphics:fig:onedbp1}.
```{r onedbp1, fig.width = 3, fig.height = 3.75}
ggplot(genes, aes( x = gene, y = value)) +
  stat_summary(fun.y = mean, geom = "bar")
```
\begin{marginfigure}
\includegraphics[width=0.8\linewidth]{chap3-r_onedbp1-1}
\caption{Barplots showing the means of the distributions of expression measurements from 4 probes.}
\label{rgraphics:fig:onedbp1}
\end{marginfigure}
%
In Figure~\ref{rgraphics:fig:onedbp1}, each bar represents the mean of the values for that gene. Such plots
are seen a lot in the biological sciences, as well as in the popular media. The data summarisation
into only the mean looses a lot of information, and given the amount of space it takes, a barplot
can be a poor way to visualise data\footnote{In fact, if the mean is an appropriate summary, such
as for highly skewed distributions, or data sets with outliers, the barplot can be outright misleading.}.

Sometimes we want to add error bars, and one way to achieve this in \CRANpkg{ggplot2} is as follows.
%
```{r onedbp2, fig.width = 3.75, fig.height = 3.75}
library("Hmisc")
ggplot(genes, aes( x = gene, y = value, fill = gene)) +
  stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               width = 0.25)
```
%
Here, we see again the principle of layered graphics: we use two summary functions,
\Rfunction{mean} and \Rfunction{mean\_cl\_normal}, and two associated geometric objects,
\Robject{bar} and \Robject{errorbar}. The function \Rfunction{mean\_cl\_normal} is from
the \CRANpkg{Hmisc} package and computes the standard error (or \textbf{c}onfidence
\textbf{l}imits) of the mean; it's a simple function, and we could also compute it
ourselves using base R expressions if we wished to do so.  We also coloured the bars in
lighter colours for better contrast.
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_onedbp2-1}
\caption{Barplots with error bars indicating standard error of the mean.}
\label{rgraphics:fig:onedbp2}
\end{marginfigure}

%------------------------------------------------------------
\subsection{Boxplots}
%------------------------------------------------------------
It's easy to show the same data with boxplots.
%
```{r onedboxpl, fig.width = 3.75, fig.height = 3.75}
p = ggplot(genes, aes( x = gene, y = value, fill = gene))
p + geom_boxplot()
```

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_onedboxpl-1}
\caption{Boxplots.}
\label{rgraphics:fig:onedboxpl}
\end{marginfigure}

Compared to the barplots, this takes a similar amount of space, but is much more
informative. In Figure~\ref{rgraphics:fig:onedboxpl} we see that two of the genes (Gata4, Gata6)
have relatively concentrated distributions, with only a few data points venturing out to
the direction of higher values. For Fgf4, we see that the distribution is right-skewed:
the median, indicated by the horizontal black bar within the box is closer to the lower (or
left) side of the box. Analogously, for Sox2 the distribution is left-skewed.

%------------------------------------------------------------
\subsection{Violin plots}
%------------------------------------------------------------

A variation of the boxplot idea, but with an even more direct representation of the shape
of the data distribution, is the violin plot. Here, the shape of the violin gives a rough
impression of the distribution density.

```{r onedviolin, fig.width = 3.75, fig.height = 3.75}
p + geom_violin()
```

%------------------------------------------------------------
\subsection{Dot plots and beeswarm plots}
%------------------------------------------------------------

If the number of data points is not too large, it is possible to show the data points
directly, and it is good practice to do so, compared to using more abstract
summaries.

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_onedviolin-1}
\caption{Violin plots.}
\label{rgraphics:fig:onedviolin}
\end{marginfigure}


However, plotting the data directly will often lead to overlapping points, which can be
visually unpleasant, or even obscure the data. We can try to layout the points so that
they are as near possible to their proper locations without
overlap\cite{Wilkinson:DotPlots:1999}.

```{r oneddot, fig.width = 5, fig.height = 5}
p + geom_dotplot(binaxis = "y", binwidth = 1/6,
       stackdir = "center", stackratio = 0.75,
       aes(color = gene))
```

The plot is shown in the left panel of Figure~\ref{rgraphics:fig:oneddot}. The $y$-coordinates of
the points are discretized into bins (above we chose a bin size of 1/6), and then they
are stacked next to each other.

A fun alternative is provided by the package \CRANpkg{beeswarm}. It works with base R
graphics and is not directly integrated into \CRANpkg{ggplot2}'s data flows, so we can
either use the base R graphics output, or pass on the point coordinates to
\Rfunction{ggplot} as follows.

```{r onedbee, fig.width = 5, fig.height = 5}
library("beeswarm")
bee = beeswarm(value ~ gene, data = genes, spacing = 0.7)
ggplot(bee, aes( x = x, y = y, colour = x.orig)) +
  geom_point(shape = 19) + xlab("gene") + ylab("value") +
  scale_fill_manual(values = probes)
```

\begin{figure}
\includegraphics[width=.49\linewidth]{chap3-r_oneddot-1}
\includegraphics[width=.49\linewidth]{chap3-r_onedbee-2}
\caption{Left: dot plots, made using \Rfunction{geom\_dotplot} from \CRANpkg{ggplot2}.
Right: beeswarm plots, with layout obtained via the \CRANpkg{beeswarm} package and
plotted as a scatterplot with \Rfunction{ggplot}.}
\label{rgraphics:fig:oneddot}
\end{figure}

The plot is shown in the right panel of Figure~\ref{rgraphics:fig:oneddot}.
The default layout method used by \Rfunction{beeswarm} is called \emph{swarm}. It places
points in increasing order. If a point would overlap an existing point, it is shifted
sideways (along the $x$-axis) by a minimal amount sufficient to avoid overlap.

As you have seen in the above code examples, some twiddling with layout parameters is
usually needed to make a dot plot or a beeswarm plot look good for a particular dataset.

%------------------------------------------------------------
\subsection{Density plots}
%------------------------------------------------------------
Yet another way to represent the same data is by lines plots of the density plots
%
```{r oneddens, fig.width = 3.75, fig.height = 3.75}
ggplot(genes, aes( x = value, colour = gene)) + geom_density()
```
%
Density estimation has a number of complications, and you can see these in
Figure~\ref{rgraphics:fig:oneddens}.  In particular, the need for choosing a smoothing window. A
window size that is small enough to capture peaks in the dense regions of the data may
lead to instable (``wiggly'') estimates elsewhere; if the window is made bigger,
pronounced features of the density may be smoothed out.  Moreover, the density lines do
not convey the information on how much data was used to estimate them, and plots like
Figure~\ref{rgraphics:fig:oneddens} can become especially problematic if the sample sizes for the
curves differ.
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_oneddens-1}
\caption{Density plots.}
\label{rgraphics:fig:oneddens}
\end{marginfigure}


%------------------------------------------------------------
\subsection{ECDF plots}
%------------------------------------------------------------
The mathematically most robust way to describe the distribution of a one-dimensional
random variable $X$ is its cumulative distribution function (CDF), i.\,e., the function
\begin{equation}
F(x) = P(X\le x),
\end{equation}
where $x$ takes all values along the real axis. The density of $X$ is then the derivative
of $F$, if it exists\footnote{By its definition, $F$ tends to 0 for small $x$ ($x\to-\infty$) and
to 1 for large $x$ ($x\to+\infty$).}.  The definition of the CDF can also be
applied to finite samples of $X$, i.\,e., samples $x_1,\ldots,x_n$. The empirical
cumulative distribution function (ECDF) is simply
\begin{equation}
F_{n}(x) = \frac{1}{n}\sum_{i=1}^n \mathbbm{1}_{x\le x_i}.
\end{equation}
An important property is that even for limited sample sizes $n$, the ECDF
$F_{n}$ is not very far from the CDF, $F$.  This is not the case for the
empirical density! Without smoothing, the empirical density of a finite sample is a sum of
Dirac delta functions, which is difficult to visualize and quite different from any
underlying smooth, true density.  With smoothing, the difference can be less pronounced, but
is difficult to control, as discussed above.
%
```{r onedecdf, fig.width = 3.75, fig.height = 3.75}
ggplot(genes, aes( x = value, colour = gene)) + stat_ecdf()
```
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_onedecdf-1}
\caption{Empirical cumulative distribution functions (ECDF).}
\label{rgraphics:fig:onedecdf}
\end{marginfigure}


%------------------------------------------------------------
\subsection{Transformations}
%------------------------------------------------------------
It is tempting to look at histograms or density plots and inspect them for evidence of
bimodality (or multimodality) as an indication of some underlying biological phenomenon.
Before doing so, it is important to remember that the number of modes of a density depends
on scale transformations of the data, via the chain rule. A simple example, with a
mixture of two normal distributions, is shown in Figure~\ref{rgraphics:fig:onedtrsf}.
%
```{r gridExtra, echo = FALSE}
library("gridExtra")
```
```{r onedtrsf, fig.width = 3.75, fig.height = 7.5}
sim <- data_frame(
   x = exp(rnorm(
     n    = 1e5, 
     mean = sample(c(2, 5), size = 1e5, replace = TRUE))))
     
gridExtra::grid.arrange(
  ggplot(sim, aes(x)) + 
    geom_histogram(binwidth = 10, boundary = 0) + xlim(0, 400),
  ggplot(sim, aes(log(x))) + geom_histogram(bins = 30) )
```
%
\begin{ques} Consider a log-normal mixture model as in the code above.  What is the
  density function of $X$? What is the density function of $log(X)$?  How many modes do
  these densities have, as a function of the parameters of the mixture model (mean and
  standard deviation of the component normals, and mixture fraction)?
\end{ques}
%
\begin{marginfigure}
\includegraphics[width=0.8\linewidth]{chap3-r_onedtrsf-1}
\caption{Histograms of the same data, with and without logarithmic transformation. 
  The number of modes is different.}
\label{rgraphics:fig:onedtrsf}
\end{marginfigure}


%------------------------------------------------------------
\subsection{Data tidying II - wide vs long format}
\label{rgraphics:sec:tidydata2}
%------------------------------------------------------------

Let us revisit the \Rfunction{melt} command from above. In the resulting
\Rclass{data.frame} \Robject{genes}, each row corresponds to exactly one measured value,
stored in the column \Robject{value}. Then there are
additional columns `r colnames(genes)[1]` and  `r colnames(genes)[2]`,
which store the associated covariates.
Compare this to the following \Rclass{data.frame} (for space reasons we print only the first five columns):

```{r dt2}
as.data.frame(exprs(x)[probes, ])[, 1:5]
```

This \Rclass{data.frame} has several columns of data, one for each sample (annotated by
the column names). Its rows correspond to the four probes, annotated by the row
names. This is an example for a data table in \emph{wide format}.

Now suppose we want to store somewhere not only the probe identifiers but also the
associated gene symbols. We could stick them as an additional column into the wide format
\Rclass{data.frame}, and perhaps also throw in the genes' ENSEMBL identifier for good
measure.  But now we immediately see the problem: the \Rclass{data.frame} now has some
columns that represent different samples, and others that refer to information for all
samples (the gene symbol and identifier) and we somehow have to "know" this when
interpreting the \Rclass{data.frame}. This is what Hadley Wickham calls \emph{untidy
  data}\footnote{\href{http://en.wikipedia.org/wiki/Anna_Karenina_principle}{There are many different ways for data to be untidy.}}.
In contrast, in
the tidy \Rclass{data.frame} \Robject{genes}, we can add these columns, yet still know
that each row forms exactly one observation, and all information associated with that
observation is in the same row.

In tidy data\cite{Wickham:TidyData},
\begin{enumerate}
\item each variable forms a column,
\item each observation forms a row,
\item each type of observational unit forms a table.
\end{enumerate}
A potential drawback is efficiency: even though there are only `r length(probes)` probe --
gene symbol relationships, we are now storing them `r nrow(genes)` times in the rows of
the \Rclass{data.frame} \Robject{genes}. Moreover, there is no standardisation: we chose
to call this column \Robject{symbol}, but the next person might call it \Robject{Symbol}
or even something completely different, and when we find a \Rclass{data.frame} that was
made by someone else and that contains a column \Robject{symbol}, we can hope, but have no
guarantee, that these are valid gene symbols. Addressing such issues is behind the
object-oriented design of the specialized data structures in Bioconductor, such as the
\Rclass{ExpressionSet} class.

%------------------------------------------------------------
\section{2D Visualisation: Scatter Plots}
\label{rgraphics:sec:2d}
%------------------------------------------------------------
Scatter plots are useful for visualizing treatment--response comparisons (as in
Figure~\ref{rgraphics:fig:basicplotting2}), associations between variables (as in
Figure~\ref{rgraphics:fig:scp2layers2}), or paired data (e.\,g., a disease biomarker in
several patients before and after treatment). We use the two dimensions of our plotting
paper, or screen, to represent the two variables.

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_twodsp1-1}
\caption{Scatterplot of `r nrow(dfx)` expression measurements for two of the samples.}
\label{rgraphics:fig:twodsp1}
\end{marginfigure}
%
Let us take a look at differential expression between a wildtype and an FGF4-KO sample.
%
```{r fourfive, eval=FALSE, echo=FALSE}
scp = ggplot(dfx, aes( x = `37 E3.5 (PE)` ,
                        y = `84 E3.5 (FGF4-KO)`))
```
```{r twodsp1, fig.width = 3.75, fig.height = 3.75, dev = "png"}
scp = ggplot(dfx, aes( x = `59 E4.5 (PE)` ,
                        y = `92 E4.5 (FGF4-KO)`))
scp + geom_point()
```
%
The labels \Robject{`r scp$labels$x`} and \Robject{`r scp$labels$y`} refer to column names (sample names)
in the \Rclass{data.frame} \Robject{dfx}, which we created above. Since they contain
special characters (spaces, parentheses, hyphen) and start with numerals, we need to
enclose them with the downward sloping quotes to make them syntactically digestible for R.
The plot is shown in Figure~\ref{rgraphics:fig:onedbp1}.
We get a dense point cloud that we can try and interpret on the outskirts of the cloud,
but we really have no idea visually how the data are distributed within the denser regions of
the plot.

One easy way to ameliorate the overplotting is to adjust the transparency (alpha value) of
the points by modifying the \Robject{alpha} parameter of \Rfunction{geom\_point}
(Figure~\ref{rgraphics:fig:twodsp2}).
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_twodsp2-1}
\caption{As Figure~\ref{rgraphics:fig:twodsp1}, but with semi-transparent points to resolve
some of the overplotting.}
\label{rgraphics:fig:twodsp2}
\end{marginfigure}
%
```{r twodsp2, fig.width = 3.75, fig.height = 3.75, dev = "png"}
scp  + geom_point(alpha = 0.1)
```
%
This is already better than Figure~\ref{rgraphics:fig:twodsp1}, but in the very density regions
even the semi-transparent points quickly overplot to a featureless black mass, while the
more isolated, outlying points are getting faint.
An alternative is a contour plot of the 2D density,
which has the added benefit of not rendering all of the points on the plot, as in
Figure~\ref{rgraphics:fig:twodsp3}.
%
```{r twodsp3, fig.width = 3.75, fig.height = 3.75, dev = "png"}
scp + geom_density2d()
```
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_twodsp3-1}
\caption{As Figure~\ref{rgraphics:fig:twodsp1}, but rendered as a contour plot of the 2D density estimate.}
\label{rgraphics:fig:twodsp3}
\end{marginfigure}
%
However, we see in Figure~\ref{rgraphics:fig:twodsp3} that the point cloud at the bottom right (which
contains a relatively small number of points) is no longer represented. We can somewhat overcome
this by tweaking the bandwidth and binning parameters of \Rfunction{geom\_density2d}
(Figure~\ref{rgraphics:fig:twodsp4}, left panel).
%
```{r twodsp4, fig.width = 3.75, fig.height = 3.75, dev = "png"}
scp + geom_density2d(h = 0.5, bins = 60)
```
%
We can fill in each space between the contour lines with the relative density of points by
explicitly calling the function \Rfunction{stat\_density2d} (for which
\Rfunction{geom\_density2d} is a wrapper) and using the geometric object \emph{polygon},
as in the right panel of Figure~\ref{rgraphics:fig:twodsp4}.
%
```{r twodsp5, fig.width = 5.25, fig.height = 3.75, dev = "png", message = FALSE}
library("RColorBrewer")
colourscale = scale_fill_gradientn(
    colours = rev(brewer.pal(9, "YlGnBu")),
    values = c(0, exp(seq(-5, 0, length.out = 100))))

scp + stat_density2d(h = 0.5, bins = 60,
          aes( fill = ..level..), geom = "polygon") +
  colourscale + coord_fixed()
```
%
\begin{figure}
\includegraphics[width=0.416\linewidth]{chap3-r_twodsp4-1}
\includegraphics[width=0.582\linewidth]{chap3-r_twodsp5-1}
\caption{Left: as Figure~\ref{rgraphics:fig:twodsp3}, but with smaller smoothing bandwidth and tighter
binning for the contour lines. Right: with colour filling.}
\label{rgraphics:fig:twodsp4}
\end{figure}
%
We used the function \Rfunction{brewer.pal} from the package \CRANpkg{RColorBrewer} to
define the colour scale, and we added a call to \Rfunction{coord\_fixed} to fix the
aspect ratio of the plot, to make sure that the mapping of data range to $x$- and
$y$-coordinates is the same for the two variables.
Both of these issues merit a deeper look, and we'll talk more about plot shapes in
Section~\ref{rgraphics:sec:banking} and about colours in Section~\ref{rgraphics:sec:colour}.

The density based plotting methods in Figure~\ref{rgraphics:fig:twodsp4} are more visually
appealing and interpretable than the overplotted point clouds of
Figures~\ref{rgraphics:fig:twodsp1} and \ref{rgraphics:fig:twodsp2}, though we have to be careful in using
them as we loose a lot of the information on the outlier points in the sparser regions of
the plot. One possibility is using \Rfunction{geom\_point} to add such points back in.

But arguably the best alternative, which avoids the limitations of smoothing, is hexagonal
binning\cite{Carr1987hexbin}.

%
```{r twodsp6, fig.width = 5.25, fig.height = 3.75, dev = "png"}
library("hexbin")
scp + stat_binhex() + coord_fixed()
scp + stat_binhex(binwidth = c(0.2, 0.2)) + colourscale +
  coord_fixed()
```
%
\begin{figure}
\includegraphics[width=0.499\linewidth]{chap3-r_twodsp6-1}
\includegraphics[width=0.499\linewidth]{chap3-r_twodsp6-2}
\caption{Hexagonal binning. Left: default parameters. Right: finer bin sizes and
  customized colour scale.}
\label{rgraphics:fig:twodsp6}
\end{figure}


%------------------------------------------------------------
\subsection{Plot shapes}
\label{rgraphics:sec:banking}
%------------------------------------------------------------
Choosing the proper shape for your plot is important to make sure the information is
conveyed well. By default, the shape parameter, that is, the ratio, between the height of
the graph and its width, is chosen by \CRANpkg{ggplot2} based on the available space in
the current plotting device. The width and height of the device are specified when it is
opened in R, either explicitly by you or through default parameters\footnote{E.\,g., see
  the manual pages of the \Rfunction{pdf} and \Rfunction{png} functions.}. Moreover, the
graph dimensions also depend on the presence or absence of additional decorations, like
the colour scale bars in Figure~\ref{rgraphics:fig:twodsp6}.

There are two simple rules that you can apply for scatterplots:
\begin{itemize}
\item If the variables on the two axes are measured in the same units, then make sure that
  the same mapping of data space to physical space is used -- i.\,e., use
  \Rfunction{coord\_fixed}.  In the scatterplots above, both axes are the logarithm
  to base 2 of expression level measurements, that is a change by one unit has the same
  meaning on both axes (a doubling of the expression level). Another case is principal
  component analysis (PCA), where the $x$-axis typically represents component 1, and the
  $y$-axis component 2. Since the axes arise from an orthonormal rotation of input data
  space, we want to make sure their scales match. Since the variance of the data is (by
  definition) smaller along the second component than along the first component (or at
  most, equal), well-done PCA plots usually have a width that's larger than the height.
\item If the variables on the two axes are measured in different units, then we can still
  relate them to each other by comparing their dimensions. The default in many plotting
  routines in R, including \CRANpkg{ggplot2}, is to look at the range of the data and map
  it to the available plotting region. However, in particular when the data more or less
  follow a line, looking at the typical slope of the line can be useful.
  This is called banking\cite{Cleveland:1988:Banking}.
\end{itemize}
To illustrate banking, let's use the classic sunspot data from Cleveland's paper.
%
\begin{figure}
\includegraphics[width=0.8\linewidth]{chap3-r_banking1-1}\\
\includegraphics[width=0.8\linewidth]{chap3-r_banking2-1}
\caption{The sunspot data. In the upper panel, the plot shape is roughly quadratic, a
  frequent default choice. In the lower panel, a technique called \emph{banking} was used
  to choose the plot shape.}
\label{rgraphics:fig:banking}
\end{figure}
%
```{r banking1, fig.width = 3.75, fig.height = 3.75}
library("ggthemes")
sunsp = data.frame(year   = time( sunspot.year ),
                    number = as.numeric( sunspot.year ))
sp = ggplot(sunsp, aes(x = year, y = number)) + geom_line()
sp
```
%
The resulting plot is shown in the upper panel of Figure~\ref{rgraphics:fig:banking}. We can
clearly see long-term fluctuations in the amplitude of sunspot activity cycles, with
particularly low maximum activities in the early 1700s, early 1800s, and around the turn
of the 20$^\text{th}$ century. But now lets try out banking.
%
```{r banking2, fig.width = 3.75, fig.height = 2}
ratio = with(sunsp, bank_slopes(year, number))
sp + coord_fixed(ratio = ratio)
```
%
What the algorithm does is to look at the slopes in the curve, and in particular, the
above call to \Rfunction{bank\_slopes} computes the median absolute slope, and then with
the call to \Robject{coord\_fixed} we shape the plot such that this quantity becomes
1. The result is shown in the lower panel of Figure~\ref{rgraphics:fig:banking}. Quite
counter-intuitively, even though the plot takes much smaller space, we see more on it!
Namely, we can see the saw-tooth shape of the sunspot cycles, with sharp rises and more
slow declines.


%------------------------------------------------------------
\section{3--5D Data}
\label{rgraphics:sec:facets}
%------------------------------------------------------------
Sometimes we want to show the relations between more than two variables. Obvious choices
for including additional dimensions are the plot symbol shapes and colours. The
\Rfunction{geom\_point} geometric object offers the following aesthetics (beyond
\Robject{x} and \Robject{y}):
\begin{itemize}
\item \Robject{fill}
\item \Robject{colour}
\item \Robject{shape}
\item \Robject{size}
\item \Robject{alpha}
\end{itemize}
They are explored in the manual page of the \Rfunction{geom\_point}
function. \Robject{fill} and \Robject{colour} refer to the fill and outline colour of an
object; \Robject{alpha} to its transparency level. Above, in Figures~\ref{rgraphics:fig:twodsp2}
and following, we have used colour or transparency to reflect point density and avoid the
obscuring effects of overplotting. Instead, we can use them show other dimensions of the
data (but of course we can only do one or the other). In principle, we could use all the 5
aesthetics listed above simultaneously to show up to 7-dimensional data; however, such a
plot would be hard to decipher, and most often we are better off with one or two
additional dimensions and mapping them to a choice of the available aesthetics.

%------------------------------------------------------------
\subsection{Faceting}
%------------------------------------------------------------

Another way to show additional dimensions of the data is to show multiple plots that
result from repeatedly subsetting (or ``slicing'') our data based on one (or more) of the
variables, so that we can visualize each part separately. So we can, for instance,
investigate whether the observed patterns among the other variables are the same or
different across the range of the faceting variable. Let's look at an example\footnote{The
  first two lines this code chunk are not strictly necessary -- they're just reformatting 
  the \Robject{lineage} column of the \Robject{dftx} \Rclass{data.frame}, to make the 
  plots look better.}

```{r facet1alternative, eval = FALSE, echo = FALSE}
dftx = mutate(dftx, 
  lineage = factor(sub("^$", "no", lineage),
                   levels = c("no", "EPI", "PE", "FGF4-KO")))
```
```{r facet1, fig.width = 8, fig.height = 2}
library("magrittr")
dftx$lineage %<>% sub("^$", "no", .)
dftx$lineage %<>% factor(levels = c("no", "EPI", "PE", "FGF4-KO"))

ggplot(dftx, aes( x = X1426642_at, y = X1418765_at)) +
  geom_point() + facet_grid( . ~ lineage )
```
%$

\begin{figure}
\includegraphics[width=\linewidth]{chap3-r_facet1-1}
\caption{An example for \emph{faceting}: the same data as in
  Figure~\ref{rgraphics:fig:scp2layers1}, but now split by the categorical variable
  \Robject{lineage}.}
\label{rgraphics:fig:facet1}
\end{figure}

The result is shown in Figure~\ref{rgraphics:fig:facet1}. We used the formula
language to specify by which variable we want to do the splitting, and that the separate
panels should be in different columns: \Robject{facet\_grid( . $\sim$ lineage )}. In fact, we
can specify two faceting variables, as follows; the result is shown in
Figure~\ref{rgraphics:fig:facet2}.

```{r facet2, fig.width = 8, fig.height = 6}
ggplot( dftx,
  aes( x = X1426642_at, y = X1418765_at)) + geom_point() +
   facet_grid( Embryonic.day ~ lineage )
```

\begin{figure}
\includegraphics[width=\linewidth]{chap3-r_facet2-1}
\caption{\emph{Faceting}: the same data as in Figure~\ref{rgraphics:fig:scp2layers1},
  split by the categorical variables \Robject{Embryonic.day} (rows) and \Robject{lineage}
  (columns).}
\label{rgraphics:fig:facet2}
\end{figure}

Another useful function is \Rfunction{facet\_wrap}: if the faceting variables has too many
levels for all the plots to fit in one row or one column, then this function can be used
to wrap them into a specified number of columns or rows.

We can use a continuous variable by discretizing it into levels. The function
\Rfunction{cut} is useful for this purpose.

```{r facet3, fig.width = 4, fig.height = 4}
ggplot(mutate(dftx, Tdgf1 = cut(X1450989_at, breaks = 4)),
   aes( x = X1426642_at, y = X1418765_at)) + geom_point() +
   facet_wrap( ~ Tdgf1, ncol = 2 )
```

We see in Figure~\ref{rgraphics:fig:facet3} that the number of points in the four
panel is different, this is because \Rfunction{cut} splits into bins of equal length, not
equal number of points. If we want the latter, then we can use \Rfunction{quantile} in
conjunction with \Rfunction{cut}.

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_facet3-1}
\caption{\emph{Faceting}: the same data as in Figure~\ref{rgraphics:fig:scp2layers1},
  split by the continuous variable \Robject{X1450989\_at} and arranged by
  \Rfunction{facet\_wrap}.}
\label{rgraphics:fig:facet3}
\end{marginfigure}

\paragraph{Axes scales}
In Figures~\ref{rgraphics:fig:facet1}--\ref{rgraphics:fig:facet3}, the axes
scales are the same for all plots.  Alternatively, we could let them vary by setting the
\Robject{scales} argument of the \Rfunction{facet\_grid} and \Rfunction{facet\_wrap}; this
parameters allows you to control whether to leave the $x$-axis, the $y$-axis, or both to
be freely variable. Such alternatives scalings might allows us to see the full detail of
each plot and thus make more minute observations about what is going on in each. The
downside is that the plot dimensions are not comparable across the groupings.

\paragraph{Implicit faceting} You can also facet your plots (without explicit
calls to \Rfunction{facet\_grid} and \Rfunction{facet\_wrap}) by specifying the
aesthetics. A very simple version of implicit faceting is using a factor as your $x$-axis,
such as in Figures~\ref{rgraphics:fig:onedbp1}--\ref{rgraphics:fig:oneddot}

%------------------------------------------------------------
\subsection{Interactive graphics}
%------------------------------------------------------------
\fixme{Vlad wrote:} The plots generated in R are static images. For complex data it may be
useful to create interactive visualizations, which could be explored by navigating a
computer mouse through different parts of the graphic to view pop-up annotations, zooming
in and out, pulling the graphic to rotate in the image space, etc.

\subsubsection{plotly}
A great web-based tool for interactive graphic generation is \textbf{plotly} You can view
some examples of interactive graphics online \url{https://plot.ly}. To create your own
interactive plots in R use package \CRANpkg{plotly} 
%and pass your user name and API key to
%the function \Rfunction{plotly}, e.g. \Rfunction{plotly}(JohnDoe, 123456). After you
%signed in, you can find your API key in \emph{User Name} -> Settings.

You can learn how to use \CRANpkg{plotly} by viewing examples and clicking on "Code" and
selecting R in "Language" menu.

\subsubsection{ggvis}
\fixme{Describe}


\subsubsection{rgl, webgl}

To generate 3D interactive graphics in R ... there is package \CRANpkg{rgl}. 
\fixme{More interesting example.}
We can visualize the classic iris flower data set in a 3D scatter plot (Fig.~\ref{rgl-iris}).
\begin{marginfigure}
%\includegraphics{snapshot}
\caption{Scatter plot of iris data.}
\label{rgl-iris}
\end{marginfigure}

The iris data set is based on measurements of petal and sepal lengths and widths for 3
species of iris flowers. The axes in Fig.~\ref{rgl-iris} represent petal and sepal
dimensions. The color indicates which species the flower belongs to. Run the following
code chunk sequentially to view the 3D scatter plot in an interactive mode.

\fixme{Make this code live again}
%
```{r webGL, eval = FALSE}
library("rgl")
bbibrary("rglwidget")
with(iris, plot3d(Sepal.Length, Sepal.Width, Petal.Length,
                   type="s", col=as.numeric(Species)))
writeWebGL(dir=file.path(getwd(), "figure"), width=700)
```

Function \Rfunction{writeWebGL} exports the 3D scene as an HTML file that can be viewed
interactively in a browser.


%------------------------------------------------------------
\section{Colour}
\label{rgraphics:sec:colour}
%------------------------------------------------------------

An important consideration when making plots is the colouring that we use in them.  Most R
users are likely familiar with the built-in R colour scheme, used by base R graphics, as
shown in Figure~\ref{rgraphics:fig:simplecolourpie}.

```{r simplecolourpie, fig.width = 3, fig.height = 3}
pie(rep(1, 8), col=1:8)
```

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_simplecolourpie-1}
\caption{Basic R colours. }
\label{rgraphics:fig:simplecolourpie}
\end{marginfigure}

These colour choices date back from 1980s hardware, where graphics cards handled colours
by letting each pixel either fully use or not use each of the three basic colour channels
of the display: red, green and blue (RGB): this leads to $2^3=8$ combinations, which lie
at the 8 the extreme corners of the RGB color cube\footnote{Thus the $8^\text{th}$ colour
  should be white; in R, whose basic infastructure was put together when more
  sophisticated graphics display were already available, this was replaced by grey, as you
  can see in Figure~\ref{rgraphics:fig:simplecolourpie}.}  The colours in
Figure~\ref{rgraphics:fig:simplecolourpie} are harsh on the eyes, and there is no good
excuse any more for creating graphics that are based on this palette. Fortunately, the
default colours used by some of the more modern visualisation oriented packages (including
\CRANpkg{ggplot2}) are much better already, but sometimes we want to make our own choices.

In Section~\ref{rgraphics:sec:2d} we saw the function \Rfunction{scale\_fill\_gradientn},
which allowed us to create the colour gradient used in Figures \ref{rgraphics:fig:twodsp4}
and \ref{rgraphics:fig:twodsp6} by interpolating the basic colour palette defined by the
function \Rfunction{brewer.pal} in the \CRANpkg{RColorBrewer} package. This package
defines a great set of colour palettes, we can see all of them at a glance by using the
function \Rfunction{display.brewer.all} (Figure~\ref{rgraphics:fig:RColorBrewer}).
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_RColorBrewer-1}
\caption{RColorBrewer palettes.}
\label{rgraphics:fig:RColorBrewer}
\end{marginfigure}
%
```{r RColorBrewer, fig.width=4, fig.height=7}
display.brewer.all()
```
%
We can get information about the available colour palettes from the
\Rclass{`r class(brewer.pal.info)`} \Robject{brewer.pal.info}.
%
```{r colour3}
head(brewer.pal.info)
table(brewer.pal.info$category)
```
%$
The palettes are divided into three categories:
\begin{itemize}
\item qualitative: for categorical properties that have no intrinsic ordering. The
  \Robject{Paired} palette supports up to 6 categories that each fall into two
  subcategories - like \emph{before} and \emph{after}, \emph{with} and \emph{without}
  treatment, etc.
\item sequential: for quantitative properties that go from \emph{low} to \emph{high}
\item diverging: for quantitative properties for which there is a natural midpoint or
  neutral value, and whose value can deviate both up- and down; we'll see an example in
  Figure~\ref{rgraphics:fig:heatmap}.
\end{itemize}
To obtain the colours from a particular palette we use the function
\Rfunction{brewer.pal}. Its first argument is the number of colours we want (which can be
less than the available maximum number in \Robject{brewer.pal.info}).
%
```{r colour4}
brewer.pal(4, "RdYlGn")
```
%
If we want more than the available number of preset colours (for example so we can plot a
heatmap with continuous colours) we can use the \Rfunction{colorRampPalette} function
command to interpolate any of the \CRANpkg{RColorBrewer} presets -- or any set of colours:
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_colorRampPalette-1}
\caption{A quasi-continuous colour palette derived by interpolating between the colours
  \Robject{darkorange3}, \Robject{white} and \Robject{darkblue}. }
\label{rgraphics:fig:colorRampPalette}
\end{marginfigure}
%
```{r colorRampPalette, fig.width = 3, fig.height = .7}
mypalette  = colorRampPalette(c("darkorange3", "white",
                                          "darkblue"))(100)
head(mypalette)
par(mai = rep(0.1, 4))
image(matrix(1:100, nrow = 100, ncol = 10), col = mypalette,
        xaxt = "n", yaxt = "n", useRaster = TRUE)
```

%--------------------------------------------------
\subsection{Heatmaps}
%--------------------------------------------------

Heatmaps are a powerful of visualising large, matrix-like datasets and giving a quick
overview over the patterns that might be in there. There are a number of heatmap drawing
functions in R; one that is particularly versatile and produces good-looking output is the
function \Rfunction{pheatmap} from the eponymous package. In the code below, we first
select the top 500 most variable genes in the dataset \Robject{x}, and define a function
\Rfunction{rowCenter} that centers each gene (row) by subtracting the mean across
columns. By default, \Rfunction{pheatmap} uses the \emph{RdYlBu} colour palette
from \CRANpkg{RcolorBrewer} in
conjuction with the \Rfunction{colorRampPalette} function to interpolate the
`r brewer.pal.info["RdYlBu", "maxcolors"]` colour into a smooth-looking palette
(Figure~\ref{rgraphics:fig:heatmap}).
%
```{r heatmap, fig.width = 7, fig.height = 7}

library("pheatmap")
topGenes = order(rowVars(exprs(x)), decreasing = TRUE)[ seq_len(500) ]
rowCenter = function(x) { x - rowMeans(x) }
pheatmap( rowCenter(exprs(x)[ topGenes, ] ),
  show_rownames = FALSE, show_colnames = FALSE,
  breaks = seq(-5, +5, length = 101),
  annotation_col = pData(x)[, c("sampleGroup", "Embryonic.day", "ScanDate") ],
  annotation_colors = list(
    sampleGroup = groupColour,
    genotype = c(`FGF4-KO` = "chocolate1", `WT` = "azure2"),
    Embryonic.day = setNames(brewer.pal(9, "Blues")[c(3, 6, 9)], c("E3.25", "E3.5", "E4.5")),
    ScanDate = setNames(brewer.pal(nlevels(x$ScanDate), "YlGn"), levels(x$ScanDate))
  ),
  cutree_rows = 4
)
```
%
\begin{figure}
\includegraphics[width=.9\linewidth]{chap3-r_heatmap-1}
\caption{A heatmap of relative expression values, i.\,e., $\log_2$ fold change compared to
  the average expression of that gene (row) across all samples (columns). The colour scale
  uses a diverging palette, whose neutral midpoint is at 0.}
\label{rgraphics:fig:heatmap}
\end{figure}
%
Let us take a minute to deconstruct the rather massive-looking call to
\Rfunction{pheatmap}.  The options \Robject{show\_rownames} and \Robject{show\_colnames}
control whether the row and column names are printed at the sides of the matrix. Because
our matrix is large in relation to the available plotting space, the labels would anyway
not be readable, and we suppress them. The \Robject{annotation\_col} argument takes a data
frame that carries
% whose rows match the samples (matrix columns), and whose columns can carry
additional information about the samples. The information is shown in the coloured bars on
top of the heatmap. There is also a similar \Robject{annotation\_row} argument, which we
haven't used here, for coloured bars at the side. \Robject{annotation\_colors} is a list
of named vectors by which we can override the default choice of colours for the annotation
bars. Finally, with the \Robject{cutree\_rows} argument we cut the row dendrogram into
four (an arbitrarily chosen number) clusters, and the heatmap shows them by leaving a bit
of white space in between.  The \Rfunction{pheatmap} function has many further options,
and if you want to use it for your own data visualisations, it's worth studying them.
%


%--------------------------------------------------
\subsection{Colour spaces}
\label{rgraphics:sec:colorspaces}
%--------------------------------------------------

Colour perception in humans\cite{Helmholtz:1867} is three-dimensional\footnote{Physically,
  there is an infinite number of wave-lengths of light, and an infinite number of ways of
  mixing them.}. There are different ways of parameterizing this space.  Above we already
encountered the RGB color model, which uses three values in [0,1], for instance at the
beginning of Section~\ref{rgraphics:sec:ggplot}, where we printed out the contents of
\Robject{groupColour}:
%
```{r groupColour2}
groupColour[1]
```
```{r hexvals, echo = FALSE}
hexvals = sapply(1:3, function(i) substr(groupColour[1], i*2, i*2+1))
decvals = strtoi(paste0("0x", hexvals))
```
%
Here, `r hexvals[1]` is the hexadecimal representation for the strength of the red colour
channel, `r hexvals[2]` of the green and `r hexvals[3]` of the green colour channel. In
decimal, these numbers are `r decvals[1]`, `r decvals[2]` and `r decvals[3]`,
respectively. The range of these values goes from to 0 to 255, so by dividing by this
maximum value, an RGB triplet can also be thought of as a point in the three-dimensional
unit cube.
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_hcl-1}\\
\includegraphics[width=\linewidth]{chap3-r_hcl-2}
\caption{\label{rgraphics:fig:hcl}Circles in HCL colorspace.
  Upper panel: The luminosity $L$ is fixed to $75$, while the angular
  coordinate $H$ (hue) varies from 0 to 360 and the radial coordinate $C=0, 10,
  \ldots, 60$. Lower panel: constant chroma $C=50$, $H$ as above, and varying luminosity
  $L=10, 20, \ldots, 90$.}
\end{marginfigure}
%
```{r somecolors, echo = FALSE, results = "hide"}
library("colorspace")
library("grid")

plothcl = function(h, c, l, what, x0 = 0.5, y0 = 0.5, default.units = "npc", ...) {
  switch(what,
         "c" = {
           stopifnot(length(l)==1)
           n = length(c)
         },
         "l" = {
           stopifnot(length(c)==1)
           n = length(l)
         },
         stop("Sapperlot"))

  cr = seq(0.1, 0.5, length = n+1)
  dr = 0.05 / n

  for (j in seq_len(n)) {
    r = c(cr[j]+dr, cr[j+1]-dr)
    for(i in 1:(length(h)-1)){
      phi = seq(h[i], h[i+1], by=1)/180*pi
      px = x0 + c(r[1]*cos(phi), r[2]*rev(cos(phi)))
      py = y0 + c(r[1]*sin(phi), r[2]*rev(sin(phi)))
      mycol = switch(what,
        "c" = hcl(h=mean(h[i+(0:1)]), c=c[j], l=l),
        "l" = hcl(h=mean(h[i+(0:1)]), c=c, l=l[j]))
      grid.polygon(px, py, gp=gpar(col=mycol, fill=mycol),
                   default.units=default.units,...)
    }
  }
}
```
```{r hcl, fig.width = 3.6, fig.height = 3.6, echo = FALSE}
plothcl( h = seq(0, 360, by=3), c = seq(5, 75, by=10), l = 75,   what="c")
grid.newpage()
plothcl( h = seq(0, 360, by=3), c = 55, l = seq(20, 100, by=10), what="l")
```
%

The R function \Rfunction{hcl} uses a different coordinate system, which consists of the
three coordinates hue $H$, an angle in $[0, 360]$, chroma $C$, and lightness $L$ as a
value in $[0, 100]$. The possible values for $C$ depend on some constraints, but are
generally between 0 and 255. The \Rfunction{hcl} function corresponds to polar coordinates
in the CIE-LUV\footnote{CIE: Commission Internationale de l'\'Eclairage -- see
  e.\,g.\ Wikipedia for more on this.} and is designed for area fills. By keeping chroma
and luminescence coordinates constant and only varying hue, it is easy to produce color
palettes that are harmonious and avoid irradiation illusions that make light coloured
areas look bigger than dark ones. Our attention also tends to get drawn to loud colours,
and fixing the value of chroma makes the colors equally attractive to our eyes.

There are many ways of choosing colours from a colour
wheel. \emph{Triads} are three colours chosen equally spaced around
the colour wheel; for example, $H=0, 120, 240$ gives red, green, and
blue.  \emph{Tetrads} are four equally spaced colours around the colour
wheel, and some graphic artists describe the effect as "dynamic".
\emph{Warm colours} are a set of equally spaced colours close to yellow,
\emph{cool colours} a set of equally spaced colours close to blue.
\emph{Analogous colour} sets contain colours from a small segment of the
colour wheel, for example, yellow, orange and red, or green, cyan and
blue.  \emph{Complementary colours} are colours diametrically opposite
each other on the colour wheel. A tetrad is two pairs of
complementaries.  \emph{Split complementaries} are three colours
consisting of a pair of complementaries, with one partner split
equally to each side, for example, $H=60,\,240-30,\,240+30$.  This is
useful to emphasize the difference between a pair of similar
categories and a third different one. A more thorough discussion is
provided in the references~\cite{Mollon:1995,IhakaColorPres}.

\paragraph{Lines vs areas}
For lines and points, we want that they show a strong contrast to the background, so on a
white background, we want them to be relatively dark (low lightness $L$). For area fills,
lighter, more pastell-type colours with low to moderate chromatic content are usually more
pleasant.

%------------------------------------------------------------
\section{Data Transformations}
%------------------------------------------------------------
Plots in which most points are huddled up in one area,
with a lot of sparsely populated space, are difficult to read. If the histogram of the
marginal distribution of a variable has a sharp peak and then long tails to one or both
sides, transforming the data can be helpful. These considerations apply both to
\Robject{x} and \Robject{y} aesthetics, and to colour scales. In the plots of this chapter
that involved the microarray data, we used the logarithmic transformation\footnote{We used
  it implicitly since the data in the \Rclass{ExpressionSet} object \Robject{x} already
  come log-transformed.} -- not only in scatterplots like
Figure~\ref{rgraphics:fig:twodsp1} for the $x$ and $y$-coordinates, but also in
Figure~\ref{rgraphics:fig:heatmap} for the colour scale that represents the expression
fold changes. The logarithm transformation is attractive because it has a definitive
meaning - a move up or down by the same amount on a log-transformed scale corresponds to
the same multiplicative change on the original scale: $\log(ax)=\log a+\log x$.

Sometimes the logarithm however is not good enough, for instance when the data include
zero or negative values, or when even on the logarithmic scale the data distribution is
highly uneven. From the upper panel of Figure~\ref{rgraphics:fig:MA}, it is easy to take away the
impression that the distribution of \Robject{M} depends on \Robject{A}, with higher
variances for low \Robject{A}. However, this is entirely a visual artefact, as the lower
panel confirms: the distribution of \Robject{M} is independent of \Robject{A}, and the
apparent trend we saw in the upper panel was caused by the higher point density at
smaller \Robject{A}.
%
```{r MA1, fig.width = 3.75, fig.height = 3.75, dev = "png"}
A = exprs(x)[,1]
M = rnorm(length(A))
qplot(A, M)
qplot(rank(A), M)
```
%
\begin{ques} Can the visual artefact be avoided by using a density- or binning-based
plotting method, as in Figure~\ref{rgraphics:fig:twodsp6}?
\end{ques}
\begin{ques}
Can the rank transformation also be applied when choosing colour scales e.\,g.\
  for heatmaps? What does \emph{histogram equalization} in image processing do?
\end{ques}

\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_MA1-1}\\
\includegraphics[width=\linewidth]{chap3-r_MA1-2}
\caption{The effect of rank transformation on the visual perception of dependency.}
\label{rgraphics:fig:MA}
\end{marginfigure}


%------------------------------------------------------------
\section{Saving Figures}
%------------------------------------------------------------

Just as important as plotting figures is saving them for later use.

\CRANpkg{ggplot2} has a built-in plot saving function called
\Rfunction{ggsave}, which if run by itself defaults to saving the last
plot you made with the size of the graphics device that it was/is open
in.

```{r plotsave1}
ggsave("myplot1.png")
```

Or you can specify a particular plot that you want to save, say, the sunspot plot from
earlier.

```{r plotsave2}
ggsave("myplot2.pdf", plot = sp)
```
```{r plotsave3, echo = FALSE, results = "hide"}
file.remove("myplot1.png", "myplot2.pdf")
```

There are two major ways of storing plots: vector graphics and raster (pixel) graphics. In
vector graphics, the plot is stored as a series of geometrical primitives such as points,
lines, curves, shapes and typographic characters. The prefered format in R for saving
plots into a vector graphics format is PDF. In raster graphics, the plot is stored in a
dot matrix data structure. The main limitation of raster formats is their limited
resolution, which depends on the number of pixels available; in R, the most commonly used
device for raster graphics output is \Rfunction{png}.  Generally, it's preferable to save
your graphics in a vector graphics format, since it is always possible later to convert a
vector graphics file into a raster format of any desired resolution, while the reverse is
in principle limited by the resolution of the original file. And you don't want the
figures in your talks or papers look poor because of pixelisation artefacts!

%-----------------------------------------------------------
\section{Biological Data with ggbio}
%-----------------------------------------------------------

\fixme{Expand this section... do something more interesting}

A package that can be really useful for biological data is an offshoot of
\CRANpkg{ggplot2} made for biology-specific plots called \Biocpkg{ggbio}.  One example is
plotting and highlighting \textbf{ideograms}\index{ideopram}\footnote{The term ideogram
generally means a graphic that represents a concept, specifically in biology we usually
are referring to plots of the chromosome, like in Figure \ref{rgraphics:fig:ideogram-1}.}

Load the ideogram cytoband information for the hg19 build of the human genome
%
```{r ideo, fig.width=4, fig.height=1.5}
library("ggbio")
data( hg19IdeogramCyto, package = "biovizBase" )
plotIdeogram( hg19IdeogramCyto, subchr = "chr1" )
```
%
\begin{marginfigure}
\includegraphics[width=\linewidth]{chap3-r_ideo-1}
\caption{Chromosome 1 of the human genome: ideogram plot.}
\label{rgraphics:fig:ideogram-1}
\end{marginfigure}

\fixme{Vlad wrote} In addition to cytoband information, one can use \Biocpkg{ggbio} to visualize gene model tracks on which coding regions (CDS), untranslated regions (UTR), introns, exons and non-genetic regions are indicated.


To create a gene model track for a subset of 500 hg19 RNA editing sites use the \texttt{darned\_hg19\_subset500} data set.
%
```{r gene-track1}
data(darned_hg19_subset500, package = "biovizBase")
dn = darned_hg19_subset500
library(GenomicRanges)
library(ggbio)
```
\begin{marginfigure}
\includegraphics[width=4cm]{chap3-r_gene-track3-1}
\caption{Karyogram of 22 chromosomes}
\label{rgraphics:fig:ideogram1-22}
\end{marginfigure}

\fixme{Need to fix the par of the pdf being saved. Right now x-axis labels get squeezed together}

Note that the information on sequence lengths is stored in \texttt{ideoCyto} data set, which provides hg19 genome and cytoband information.
%
```{r gene-track2}
data(ideoCyto, package = "biovizBase")
seqlengths(dn) = seqlengths(ideoCyto$hg19)[names(seqlengths(dn))]
```

Use function \Rfunction{keepSeqlevels} to subset the first 22 chromosomes and the X chromosome and plot the karyogram.
%
```{r gene-track3}
dn = keepSeqlevels(dn, paste0("chr", c(1:22, "X")))
autoplot(dn, axis.text.x=FALSE, layout = "karyogram", aes(color = exReg, fill = exReg))
```
The categorical variable 'exReg' is included with the data and marks CDS (coding regions), 3'-UTR and 5'-UTR, which correspond to 'C', '3' and '5' in the legend of the figure, respectively.


%------------------------------------------------------------
\section{Recap of this Chapter}
%------------------------------------------------------------

\begin{itemize}
\item
  You have had an introduction to the base plotting functions in R.
  They are widely used and can be convenient for quick data exploration.
\item You should now be comfortable making beautiful, versatile and easily extendable
  plots using \CRANpkg{ggplot2}'s \Rfunction{qplot} or \Rfunction{ggplot} functions.
\item Don't be afraid of setting up your data for faceting -- this is a great
  quick way to look at many different ways to slice the data in different wats
\item Now you are prepared to explore \CRANpkg{ggplot2} and plotting in general on your
  own.
\end{itemize}

%------------------------------------------------------------
\section{Exercises}
%------------------------------------------------------------
\begin{ex}[themes]
Explore how to change the visual appearance of plots with themes. For example:
```{r theme_bw, eval=FALSE}
qplot(1:10,1:10)
qplot(1:10,1:10) + theme_bw()
```
\end{ex}

\begin{ex}[colour names in R] Have a look at
\url{http://research.stowers-institute.org/efg/R/Color/Chart}
\end{ex}

\begin{ex}[ggxkcd]
On a lighter note, you can even modify \CRANpkg{ggplot2} to make plots in
the style of the popular webcomic \href{http://www.xkcd.com}{XKCD}.
You do this through manipulating the font and themes of ggplot2 objects.
See \url{http://stackoverflow.com/questions/12675147/how-can-we-make-xkcd-style-graphs-in-r}.
\begin{marginfigure}
\centering
\includegraphics[width=\linewidth]{xkcdgraph}\\
\includegraphics[width=\linewidth]{xkcdexample}
\end{marginfigure}
\end{ex}