%\VignetteEngine{knitr}
%\VignetteIndexEntry{Machine learning techniques available in pRoloc}
%\VignetteKeywords{Bioinformatics, Machine learning, Organelle, Proteomics}
%\VignettePackage{pRoloc}

\documentclass[12pt,a4paper,english]{scrartcl}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{tikz}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage[auth-sc]{authblk}
\usepackage{setspace}
\onehalfspacing

% caption formatting
\setcapindent{0em}
\setkomafont{captionlabel}{\sffamily\bfseries}
\setkomafont{caption}{\sffamily}

\renewcommand\Authands{ and }

\newcommand{\R}{\texttt{R} }
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\mbox{\normalfont\textsf{#1}}}}
\newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}}

%% colors
\definecolor{Red}{rgb}{0.7,0,0}
\definecolor{Blue}{rgb}{0,0,0.8}

\usepackage{geometry}
\geometry{verbose,
  tmargin = 2.5cm, bmargin = 2.5cm,
  lmargin = 3.0cm, rmargin = 3.0cm}

\usepackage{breakurl}

\hypersetup{%
  pdfusetitle,
  bookmarks = {true},
  bookmarksnumbered = {true},
  bookmarksopen = {true},
  bookmarksopenlevel = 2,
  unicode = {true},
  breaklinks = {false},
  hyperindex = {true},
  colorlinks = {true},
  linktocpage = {true},
  plainpages = {false},
  linkcolor = {Blue},
  citecolor = {Blue},
  urlcolor = {Red},
  pdfstartview = {Fit},
  pdfpagemode = {UseOutlines},
  pdfview = {XYZ null null null}
}

\author{
  Laurent Gatto\thanks{\email{lg390@cam.ac.uk}}
}

\affil{
  Cambridge Center for Proteomics\\
  University of Cambridge
}

\begin{document}

\title{Machine learning techniques available in \Rpackage{pRoloc}}

\maketitle

% %% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \vskip 0.3in minus 0.1in
% \hrule
% \begin{abstract}
% This document describes the machine learning techniques that are
% made available in the \Rpackage{pRoloc} package.
% \end{abstract}
% \textit{Keywords}: Bioinformatics, spatial/organelle proteomics, machine learning, visualisation
% \vskip 0.1in minus 0.05in
% \hrule
% \vskip 0.2in minus 0.1in
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \newpage

\tableofcontents

<<echo=FALSE>>=
library("knitr")
opts_chunk$set(fig.align = 'center',
               fig.show = 'hold',
               par = TRUE,
               prompt = TRUE,
               eval = TRUE,
               comment = NA)
options(replace.assign = TRUE,
        width = 55)
suppressPackageStartupMessages(library("MSnbase"))
suppressWarnings(suppressPackageStartupMessages(library("pRoloc")))
suppressPackageStartupMessages(library("pRolocdata"))
## suppressPackageStartupMessages(library("class"))
suppressPackageStartupMessages(library("xtable"))
@
%%$

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage

\paragraph{NB} This document is currently under construction; please
see the \Rpackage{pRoloc} tutorial for details.

\paragraph{TODO} Define nomenclature.

\section{Introduction}\label{sec:intro}

For a general practical introduction to \Rpackage{pRoloc}, readers
are referred to the tutorial, available using
\Rfunction{vignette("pRoloc-tutorial", package = "pRoloc")}. The
following document provides an overview of the algorithms available
in the package. The respective sections describe unsupervised machine
learning (USML), supervised machine learning (SML) and semi-supervised
machine learning (SSML), the latter as implemented in the novelty
detection algorithm.
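The code chunks below use functions and data from the
\Rpackage{pRoloc} and \Rpackage{pRolocdata} packages, which are
assumed to be installed and can be attached as follows:

<<>>=
library("pRoloc")
library("pRolocdata")
@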
\section{Data sets}

We provide \Sexpr{nrow(pRolocdata()$results)} %$
test data sets in the \Rpackage{pRolocdata} package that can be
readily used with \Rpackage{pRoloc}. The data sets can be listed with
\Rfunction{pRolocdata} and loaded with the \Rfunction{data}
function. Each data set, including its origin, is individually
documented. The data sets are distributed as \Robject{MSnSet}
instances. Briefly, these are dedicated containers for quantitation
data as well as feature and sample meta-data. More details about
\Robject{MSnSet}s are available in the \Rpackage{pRoloc} tutorial and
in the \Rpackage{MSnbase} package, which defines the class.

<<>>=
library("pRolocdata")
data(tan2009r1)
tan2009r1
@

\section{Unsupervised machine learning}\label{sec:usml}

Unsupervised machine learning refers to clustering, i.e. finding
structure in a quantitative, generally multi-dimensional, set of
unlabelled data. Currently, unsupervised clustering facilities are
available through the \Rfunction{plot2D} function and the
\Rpackage{MLInterfaces} package \cite{MLInterfaces}. The former takes
an \Robject{MSnSet} instance and represents the data on a scatter
plot along the first two principal components. Arbitrary feature
meta-data can be represented using different colours and point
characters. The reader is referred to the manual page available
through \Rfunction{?plot2D} for more details and examples.

\Rpackage{pRoloc} also implements a \Rfunction{MLearn} method for
\Robject{MSnSet} instances, allowing the \Rpackage{MLInterfaces}
infrastructure to be used within the organelle proteomics
framework. Although \Rfunction{MLearn} provides a common interface to
unsupervised and numerous supervised algorithms, we refer to the
\Rpackage{pRoloc} tutorial for examples of its usage with several
clustering algorithms.

%
\paragraph{Note} Current developments in terms of clustering are
described on the \textit{Clustering infrastructure} wiki
page\footnote{\url{https://github.com/lgatto/pRoloc/wiki/Clustering-infrastructure}}
and will be incorporated in future versions of the package.
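As a brief illustration of the visualisation described above, the
following minimal sketch plots the \Robject{tan2009r1} data loaded
earlier along its first two principal components, colouring proteins
according to the \Robject{"markers"} feature meta-data column and
adding the corresponding legend.

<<>>=
## PCA projection of the data, coloured by marker class
plot2D(tan2009r1, fcol = "markers")
addLegend(tan2009r1, fcol = "markers", where = "bottomleft")
@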
\section{Supervised machine learning}\label{sec:sml}

Supervised machine learning refers to a broad family of
classification algorithms. The algorithms learn from a modest set of
labelled data points called the training data. Each training data
example consists of a pair of inputs: the actual data, generally
represented as a vector of numbers, and a class label representing
the membership to exactly one of multiple possible classes. When
there are only two possible classes, one refers to binary
classification. The training data is used to construct a model that
can be used to classify new, unlabelled examples. The model takes the
numeric vectors of the unlabelled data points and returns, for each
of these inputs, the corresponding mapped class.

\subsection{Algorithms used}\label{sec:algo}

\paragraph{k-nearest neighbour} Function \Rfunction{knn} from package
\Rpackage{class}. For each row of the test set, the $k$ nearest (in
Euclidean distance) training set vectors are found, and the
classification is decided by majority vote over the $k$ classes, with
ties broken at random. This is a simple algorithm that is often used
as a baseline classifier.
%% If there are ties for the $k^{th}$ nearest vector, all candidates
%% are included in the vote.

\paragraph{Partial least square DA} Function \Rfunction{plsda} from
package \Rpackage{caret}. Partial least square discriminant analysis
is used to fit a standard PLS model for classification.

\paragraph{Support vector machine} A support vector machine
constructs a hyperplane (or a set of hyperplanes for multiple-class
problems), which is then used for classification. The best separation
is defined as the hyperplane that has the largest distance (the
margin) to the nearest data points in any class, which also reduces
the generalisation error of the classifier. When the classes are not
linearly separable, the data are transformed using a \textit{kernel
  function} into a high-dimensional space that permits linear
separation of the classes. \Rpackage{pRoloc} makes use of the
functions \Rfunction{svm} from package \Rpackage{e1071} and
\Rfunction{ksvm} from \Rpackage{kernlab}.

\paragraph{Artificial neural network} Function \Rfunction{nnet} from
package \Rpackage{nnet}. Fits a single-hidden-layer neural network,
possibly with skip-layer connections.

\paragraph{Naive Bayes} Function \Rfunction{naiveBayes} from package
\Rpackage{e1071}. Naive Bayes classifier that computes the
conditional a-posteriori probabilities of a categorical class
variable given independent predictor variables using Bayes'
rule. It assumes independence of the predictor variables and a
Gaussian distribution (given the target class) of metric predictors.

\paragraph{Random Forest} Function \Rfunction{randomForest} from
package \Rpackage{randomForest}. A random forest is an ensemble of
classification trees, each grown on a bootstrap sample of the
training data; new examples are classified by aggregating the votes
of the individual trees.

\paragraph{Chi-square} Assignment based on squared differences
between a labelled marker and a new feature to be classified, as in
the canonical protein correlation profiling (PCP) method. In
\citep{Andersen2003}, $\chi^2$ is defined as \emph{the [summed]
  squared deviation of the normalized profile [from the marker] for
  all peptides divided by the number of data points}, i.e.
$\chi^{2} = \frac{\sum_{i=1}^{n} (x_i - m_i)^{2}}{n}$, whereas
\citep{Wiese2007} divide each squared difference by the value of the
reference feature in the corresponding fraction, i.e.
$\chi^{2} = \sum_{i=1}^{n}\frac{(x_i - m_i)^{2}}{m_i}$, where $x_i$
is the normalised intensity of feature $x$ in fraction $i$, $m_i$ is
the normalised intensity of marker $m$ in fraction $i$ and $n$ is the
number of fractions available. We will use the former definition.
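To make the two definitions above concrete, below is a minimal sketch
implementing both in plain \R code, where \Robject{x} and \Robject{m}
are hypothetical normalised profiles of a feature and a marker over
$n = 4$ fractions.

<<>>=
## Andersen et al. 2003: summed squared deviations divided
## by the number of data points
chi2a <- function(x, m) sum((x - m)^2) / length(x)
## Wiese et al. 2007: each squared deviation is divided by the
## marker (reference) intensity in that fraction
chi2w <- function(x, m) sum((x - m)^2 / m)
x <- c(0.1, 0.3, 0.4, 0.2) ## hypothetical feature profile
m <- c(0.1, 0.2, 0.5, 0.2) ## hypothetical marker profile
chi2a(x, m)
chi2w(x, m)
@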
\paragraph{PerTurbo} From \cite{perturbo}: PerTurbo, an original,
non-parametric and efficient classification method is presented
here. In our framework, the manifold of each class is characterized
by its Laplace-Beltrami operator, which is evaluated with classical
methods involving the graph Laplacian. The classification criterion
is established thanks to a measure of the magnitude of the spectrum
perturbation of this operator. The first experiments show good
performances against classical algorithms of the
state-of-the-art. Moreover, from this measure is derived an efficient
policy to design sampling queries in a context of active
learning. Performances collected over toy examples and real world
datasets assess the qualities of this strategy. The PerTurbo
implementation used here is the one distributed with the
\Rpackage{pRoloc} package.

\subsection{Default analysis scheme}

We present below a typical classification analysis using
\Rpackage{pRoloc}. The analysis typically consists of two steps. The
first one is to optimise the classifier parameters to be used for
training and testing (see beginning of this section). A range of
parameters are tested using the labelled data, for which the labels
are known. For each set of parameters, we hide the labels of a subset
of the labelled data and use the other part to train a model and
apply it on the data with hidden labels. The comparison of the
estimated and expected labels enables us to assess the validity of
the model and hence the adequacy of the parameters. Once adequate
parameters have been identified, they are used to train a model on
the complete labelled data set, which is then used to infer the
labels of the unlabelled examples.

\minisec{Parameter optimisation}

Algorithmic performance is estimated using a stratified 20/80
partitioning. The 80\% partitions are subjected to 5-fold
cross-validation in order to optimise free parameters via a grid
search, and these parameters are then applied to the remaining
20\%. The procedure is repeated $n = 100$ times (the \Robject{times}
argument) to sample $n$ accuracy metric values (see below), using
$n$, possibly different, sets of optimised parameters for evaluation.

Model accuracy is evaluated using the F1 score,
$F1 = 2 ~ \frac{precision \times recall}{precision + recall}$,
calculated as the harmonic mean of the precision
($precision = \frac{tp}{tp+fp}$, a measure of \textit{exactness} --
returned output is a relevant result) and recall
($recall = \frac{tp}{tp+fn}$, a measure of \textit{completeness} --
indicating how much was missed from the output). What we are aiming
for are high generalisation accuracies, i.e. high $F1$ scores,
indicating that the marker proteins in the test data set are
consistently and correctly assigned by the algorithms.

The results of the optimisation procedure are stored in a
\Robject{GenRegRes} object that can be inspected and plotted, and
from which the best parameter pairs can be extracted. For a given
algorithm \texttt{alg}, the corresponding parameter optimisation
function is named \Rfunction{algOptimisation} or, equivalently,
\Rfunction{algOptimization}. See table \ref{tab:algo} for details. A
description of each of the respective model parameters is provided in
the optimisation function manuals, available through
\Rfunction{?algOptimisation}.

<<>>=
params <- svmOptimisation(tan2009r1, times = 10,
                          xval = 5, verbose = FALSE)
params
@

\minisec{Classification}

<<>>=
tan2009r1 <- svmClassification(tan2009r1, params)
tan2009r1
@

\subsection{Customising model parameters}

Below we illustrate how to weight different classes according to the
number of labelled instances, where large sets are down-weighted.
This strategy can help with imbalanced designs.

<<eval=FALSE>>=
data(dunkley2006)
## class weights inversely proportional to class sizes
w <- table(fData(dunkley2006)$markers)
w <- 1/w[-5]
wpar <- svmOptimisation(dunkley2006, class.weights = w)
wres <- svmClassification(dunkley2006, wpar, class.weights = w)
@
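Whether obtained with default or customised parameters, the
optimisation results can be explored before proceeding to
classification. Below is a minimal sketch using the \Robject{params}
object computed above: \Rfunction{plot} displays the observed F1
score distributions, \Rfunction{levelPlot} shows the scores over the
parameter grid and \Rfunction{getParams} extracts the best parameter
pair.

<<eval=FALSE>>=
plot(params)      ## F1 score distributions
levelPlot(params) ## F1 scores over the parameter grid
getParams(params) ## best (sigma, cost) parameter pair
@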
<<echo=FALSE>>=
## Add chi^2.
tab <- data.frame('parameter optimisation' =
                      grep("Optimisation", ls("package:pRoloc"), value = TRUE),
                  'classification' =
                      grep("Classification", ls("package:pRoloc"), value = TRUE))
tab$algorithm <- c("nearest neighbour",
                   "support vector machine",
                   "naive bayes",
                   "neural networks",
                   "PerTurbo",
                   "partial least square",
                   "random forest",
                   "support vector machine")
tab$package <- c("class", "kernlab", "e1071", "nnet",
                 "pRoloc", "caret", "randomForest", "e1071")
colnames(tab)[1] <- c("parameter optimisation")
@

<<echo=FALSE, results='asis'>>=
xt <- xtable(tab, label = "tab:algo",
             caption = "Supervised ML algorithms available in \\Rpackage{pRoloc}.")
print(xt, include.rownames = FALSE, size = "small")
@

\section{Comparison of different classifiers}

Several supervised machine learning algorithms have already been
applied to organelle proteomics data classification: partial least
square discriminant analysis in \cite{Dunkley2006, Tan2009}, support
vector machines (SVMs) in \cite{Trotter2010}, random forest in
\cite{Ohta2010}, neural networks in \cite{Tardif2012} and naive Bayes
in \cite{Nikolovski2012}. In our HUPO 2011 poster (see
\Rfunction{vignette("HUPO\_2011\_poster", package = "pRoloc")}), we
show that different classification algorithms provide very similar
performance. We have extended this comparison to various data sets
distributed in the \Rpackage{pRolocdata} package. In figure
\ref{fig:f1box}, we illustrate how different algorithms reach very
similar performances on most of our test data sets.

\begin{figure}[!hbt]
  \centering
  \includegraphics[width=0.7\textwidth]{./Figures/F1boxplots.pdf}
  \caption{Comparison of the classification performances of several
    contemporary classification algorithms on data from the
    \Rpackage{pRolocdata} package.}
  \label{fig:f1box}
\end{figure}

\section{Semi-supervised machine learning and novelty detection}\label{sec:ssml}

The PhenoDisco algorithm is a semi-supervised novelty detection
method by \cite{Breckels2013} (figure \ref{fig:pd}). It uses the
labelled (i.e. markers, noted $D_L$) and unlabelled (i.e. proteins of
unknown localisation, noted $D_U$) sets of the input data. The
algorithm is repeated $N$ times (the \Robject{times} argument in the
\Rfunction{phenoDisco} function). At each iteration, each organelle
class $D_{L}^{i}$ and the unlabelled complement are clustered using
Gaussian mixture modelling. While unlabelled members that
systematically cluster with $D_{L}^{i}$ and pass outlier detection
are labelled as new putative members of class $i$, any examples of
$D_U$ that are not merged with any of the $D_{L}^{i}$ and are
consistently clustered together throughout the $N$ iterations are
considered members of a new phenotype.

\begin{figure}[!hbt]
  \centering
  \includegraphics[width=0.7\textwidth]{./Figures/phenodisco.pdf}
  \caption{The PhenoDisco iterative algorithm.}
  \label{fig:pd}
\end{figure}

\singlespacing

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\clearpage

\section*{Session information}\label{sec:sessionInfo}

All software and respective versions used to produce this document
are listed below.

<<echo=FALSE, results='asis'>>=
toLatex(sessionInfo())
@

\bibliographystyle{plainnat}
\bibliography{pRoloc}

\end{document}