%\VignetteIndexEntry{Introduction to BiocParallel}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\newcommand{\BiocParallel}{\Biocpkg{BiocParallel}}

\title{Introduction to \BiocParallel}
\author{Vincent Carey, Michael Lawrence, Martin
  Morgan\footnote{\url{mtmorgan@fhcrc.org}}}
\date{Edited: February 16, 2014; Compiled: \today}

\begin{document}

\maketitle

\section{Introduction}

The \BiocParallel{} package provides a consistent way of specifying
parallel evaluation within \Bioconductor. Enable its use by attaching
the package
%%
<<>>=
library(BiocParallel)
@
%%
To use, invoke a \BiocParallel{} function like \Rcode{bplapply}, or
use a \BiocParallel-enabled function provided by another package.

\subsection{Why use \BiocParallel?}

The Task View document at
\url{http://cran.r-project.org/web/views/HighPerformanceComputing.html}
shows that numerous approaches to parallel computing are available in
\R. Most applications cited in the task view identify one or more of
\CRANpkg{snow}, \CRANpkg{Rmpi} or \CRANpkg{foreach} as relevant
parallelization infrastructure. A basic objective of \BiocParallel{}
is reduction of complexity faced by developers and users in creating
and using software that benefits from performing computations in
parallel. This is accomplished by defining abstractions of the key
components of parallel computing environments. Information on the
parallel environment can be stored in formally structured
``parameter'' arguments that are used at run time to define the
approach to parallel execution. This allows developers to focus on
\textit{what} is to be computed, leaving the \textit{how} to the
infrastructure.

Advantages for developers and power users of \BiocParallel{} over
\textit{ad hoc} \R{} programming for parallel computation include the
following.
\begin{itemize}
\item A uniform idiom (using \Rcode{BiocParallelParam} instances) is
  available for defining parallel computing resources; sensible
  default definitions are generated when \BiocParallel{} is loaded.
\item \Rcode{bplapply} and \Rcode{bpvec} address iteration in parallel
  and parallel evaluation of vectorized functions, respectively.
\item When the parallel environment is managed by a cluster scheduler
  through \CRANpkg{BatchJobs}, job management and result retrieval are
  considerably simplified.
\item \Rcode{foreach} and programming with the \CRANpkg{iterators}
  package are fully supported, but registration of the parallel
  back-end uses \Rcode{BiocParallelParam} instances.
\end{itemize}

\section{The \BiocParallel{} Interface}

The \BiocParallel{} workflow is simple:
\begin{enumerate}
\item Invoke \BiocParallel-enabled functions. The functions use the
  registered back-ends for evaluation.
\end{enumerate}
An optional step is to register appropriate back-ends for your
particular configuration by
\begin{enumerate}
\item Creating a \Rcode{BiocParallelParam} instance to describe how
  parallel evaluation is to be implemented.
\item Registering the \Rcode{BiocParallelParam} instance for use in
  your \R{} session.
\end{enumerate}
%%
The registry is a `stack', with the last entry added to the stack used
first, so your own back-ends generally take precedence over the
back-ends established when the \BiocParallel{} package is loaded.

\subsection{\Rclass{*Param} objects to describe parallel evaluation environments}

Different types of parallel computation are supported by creating and
\Rcode{register()}ing a `\Rcode{Param}'. Supported \Rcode{Param}
objects are:
\begin{description}
\item[\Rcode{SerialParam}] Evaluate \BiocParallel-enabled code with
  parallel evaluation disabled. This is very useful when writing new
  scripts and trying to debug code.
\item[\Rcode{MulticoreParam}] Evaluate \BiocParallel-enabled code
  using multiple cores on a single computer.
  When available, this is the most efficient and least troublesome way
  to parallelize code. Unfortunately, Windows does not support
  multi-core evaluation (the \Rcode{MulticoreParam} object can be
  used, but evaluation is serial). On other operating systems, the
  default number of workers equals the value of the global option
  \Rcode{mc.cores} (retrieved with \Rcode{getOption("mc.cores")}) or,
  if that is not set, the number of cores returned by
  \Rcode{parallel::detectCores()}.
\item[\Rcode{SnowParam}] Evaluate \BiocParallel-enabled code across
  several distinct \R{} instances, on one or several computers. This
  can be an easy way to parallelize code when working with one or
  several computers, and is based on facilities originally implemented
  in the \CRANpkg{snow} package. Different types of \CRANpkg{snow}
  `back-ends' are supported, including socket and MPI clusters.
\item[\Rcode{BatchJobsParam}] Evaluate \BiocParallel-enabled code by
  submitting to a cluster scheduler like SGE.
\item[\Rcode{DoparParam}] Register a parallel back-end supported by
  the \CRANpkg{foreach} package for use with \BiocParallel.
\end{description}

The simplest illustration of creating a \Rcode{BiocParallelParam}
instance is
<<>>=
serialParam <- SerialParam()
serialParam
@
%%
Most parameters have additional arguments influencing behavior, e.g.,
specifying the number of `cores' to use when creating a
\Rcode{MulticoreParam} instance
<<>>=
multicoreParam <- MulticoreParam(workers=8)
multicoreParam
@
%%
Arguments are detailed on the corresponding help page, e.g.,
\Rcode{?MulticoreParam}.

\subsection{\Rcode{register()}ing \Rcode{BiocParallelParam} instances}

The \Rcode{register()} function registers a \Rcode{BiocParallelParam}
instance for use in parallel evaluation.
<<>>=
register(multicoreParam)
@
%%
View registered parameters with \Rcode{registered()}
%%
<<>>=
registered()
@
%%
The list of registered \Rcode{BiocParallelParam} instances represents
the user's preferences for different types of back-ends.
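
The stack behavior of the registry, with the most recently registered
back-end used first, can be sketched as follows. This is a minimal,
unevaluated illustration rather than recorded output; it assumes
\BiocParallel{} is attached and relies only on \Rcode{register()},
\Rcode{registered()} and \Rcode{bpparam()}. The choice of
\Rcode{SerialParam()} is arbitrary and made only for the example.
%%
<<register-stack-sketch, eval=FALSE>>=
library(BiocParallel)

## Registering a back-end places it at the top of the stack ...
register(SerialParam())

## ... so it is listed first by registered() ...
names(registered())[1]

## ... and it is the back-end returned by bpparam() when called
## without arguments.
class(bpparam())
@
%%
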
Individual algorithms may specify a preferred back-end, and different
back-ends may be chosen when parallel evaluation is nested.

\subsection{Functions for parallel computation}

There are facilities for querying and controlling parallel evaluation
environments.
\begin{description}
\item[\Rcode{bpisup(x)}] Query a \Rcode{BiocParallelParam} back-end
  \Rcode{x} for its status.
\item[\Rcode{bpworkers(x)}] Query a \Rcode{BiocParallelParam} back-end
  \Rcode{x} for the number of workers available for parallel
  evaluation.
\item[\Rcode{bpstart(x)}] Start a parallel back-end specified by
  \Rcode{BiocParallelParam} \Rcode{x}, if possible.
\item[\Rcode{bpstop(x)}] Stop a parallel back-end specified by
  \Rcode{BiocParallelParam} \Rcode{x}.
\end{description}
%%
These are used in common functions, implemented as much as possible
for all back-ends. The functions (see the help pages, e.g.,
\Rcode{?bplapply}, for a full definition) include
\begin{description}
\item[\Rcode{bplapply(X, FUN, ...)}] Apply in parallel a function
  \Rcode{FUN} to each element of \Rcode{X}. \Rcode{bplapply} invokes
  \Rcode{FUN} \Rcode{length(X)} times, each time with a single element
  of \Rcode{X}.
\item[\Rcode{bpmapply(FUN, ...)}] Apply in parallel a function
  \Rcode{FUN} to the first, second, etc., elements of each argument in
  \ldots.
\item[\Rcode{bpvec(X, FUN, ...)}] Apply in parallel a function
  \Rcode{FUN} to subsets of \Rcode{X}. \Rcode{bpvec} invokes function
  \Rcode{FUN} as many times as there are cores or cluster nodes, with
  \Rcode{FUN} receiving a subset (typically more than 1 element, in
  contrast to \Rcode{bplapply}) of \Rcode{X}.
\item[\Rcode{bpaggregate(x, data, FUN, ...)}] Use the formula in
  \Rcode{x} to aggregate \Rcode{data} using \Rcode{FUN}.
\end{description}
%%
There are facilities for recovering from errors:
\begin{description}
\item[\Rcode{bplasterror}] Return the last error reported by a
  \BiocParallel{} evaluation.
\item[\Rcode{bpresume}] Attempt to resume computation after an error.
\end{description}

\section{Use cases}

\subsection{Single computer}

\subsection{\emph{Ad hoc} clusters}

\subsection{Clusters with schedulers}

\section{For developers}

Developers wishing to use \BiocParallel{} in their own packages should
include \BiocParallel{} in the \texttt{DESCRIPTION} file
\begin{verbatim}
Imports: BiocParallel
\end{verbatim}
and import the functions they wish to use in the \texttt{NAMESPACE}
file, e.g.,
\begin{verbatim}
importFrom(BiocParallel, bplapply)
\end{verbatim}
Then invoke the desired function in the code, e.g.,
<<>>=
## Task i sleeps for i seconds and then returns i; time the whole call.
system.time(x <- bplapply(1:3, function(i) { Sys.sleep(i); i }))
## bplapply() returns a list; flatten it to a vector.
unlist(x)
@
%%
This will use the back-end returned by \Rcode{bpparam()}, by default a
\Rcode{MulticoreParam()} instance or the user's preferred back-end if
they have used \Rcode{register()}. The \Rcode{MulticoreParam} back-end
does not require any special configuration or set-up and is therefore
the safest option for developers. Unfortunately,
\Rcode{MulticoreParam} provides only serial evaluation on Windows.

Developers should document on the man page that their function uses
\BiocParallel{} functions, and should perhaps include in their
function signature an argument \Rcode{BPPARAM=bpparam()}.

Developers wishing to invoke back-ends other than
\Rcode{MulticoreParam} need to take special care to ensure that
required packages, data, and functions are available and loaded on the
remote nodes.

\end{document}