%\VignetteIndexEntry{05. Genomic Ranges} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= options(max.print=1000) BiocStyle::latex() library(knitr) opts_chunk$set(cache=TRUE, tidy=FALSE) @ <>= suppressPackageStartupMessages({ library(ShortRead) library(VariantAnnotation) library(ggplot2) library(RNAseqData.HNRNPC.bam.chr14) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(BSgenome.Hsapiens.UCSC.hg19) library(AnnotationHub) library(rtracklayer) }) @ \title{Practical: Ranges} \author{Martin Morgan (mtmorgan@fhcrc.org)} \date{27-28 February, 2014} \newcommand{\Hsap}{\emph{H.~sapiens}} \newcommand{\Dmel}{\emph{D.~melanogaster}} \usepackage{Exercise} \begin{document} \maketitle \tableofcontents \section{Working with ranges} Start by loading the \Biocpkg{GenomicRanges} package and defining the \Rfunction{plotRanges} helper function <>= library(GenomicRanges) plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", sep = 0.5, ...) { height <- 1 if (is(xlim, "Ranges")) xlim <- c(min(start(xlim)), max(end(xlim))) bins <- disjointBins(IRanges(start(x), end(x) + 1)) plot.new() par(mai=c(0.6, 0.2, 0.6, 0.2)) plot.window(xlim, c(0, max(bins)*(height + sep))) ybottom <- bins * (sep + height) - height rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col = col, ...) title(main, cex.main=2.8, font.main=1) axis(1) } @ Ranges describe both features of interest (e.g., genes, exons, promoters) and reads aligned to the genome. \Bioconductor{} has very powerful facilities for working with ranges, some of which are summarized in Table~\ref{tab:ranges-pkgs}. These are implemented in the \Biocpkg{GenomicRanges} package; see \cite{10.1371/journal.pcbi.1003118} for a more comprehensive conceptual orientation. \begin{table} \centering \caption{Selected \Bioconductor{} packages for representing and manipulating ranges, strings, and other data structures.} \label{tab:ranges-pkgs} \begin{tabular}{lp{.7\textwidth}} Package & Description\\ \hline \Biocpkg{IRanges} & Defines important classes (e.g., \Rclass{IRanges}, \Rclass{Rle}) and methods (e.g., \Rfunction{findOverlaps}, \Rfunction{countOverlaps}) for representing and manipulating ranges of consecutive values. Also introduces \Rclass{DataFrame}, \Rclass{SimpleList} and other classes tailored to representing very large data.\\ \Biocpkg{GenomicRanges} & Range-based classes tailored to sequence representation (e.g., \Rclass{GRanges}, \Rclass{GRangesList}), with information about strand and sequence name.\\ \Biocpkg{GenomicFeatures} & Foundation for manipulating data bases of genomic ranges, e.g., representing coordinates and organization of exons and transcripts of known genes.\\ \hline \end{tabular} \end{table} \paragraph{The \Rclass{GRanges} class} Instances of \Rclass{GRanges} are used to specify genomic coordinates. Suppose we wish to represent two \Dmel{} genes. The first is located on the positive strand of chromosome 3R, from position 19967117 to 19973212. The second is on the minus strand of the X chromosome, with `left-most' base at 18962306, and right-most base at 18962925. The coordinates are \emph{1-based} (i.e., the first nucleotide on a chromosome is numbered 1, rather than 0), \emph{left-most} (i.e., reads on the minus strand are defined to `start' at the left-most coordinate, rather than the 5' coordinate), and \emph{closed} (the start and end coordinates are included in the range; a range with identical start and end coordinates has width 1, a 0-width range is represented by the special construct where the end coordinate is one less than the start coordinate). A complete definition of these genes as \Rclass{GRanges} is: <>= genes <- GRanges(seqnames=c("chr3R", "chrX"), ranges=IRanges( start=c(19967117, 18962306), end = c(19973212, 18962925)), strand=c("+", "-"), seqlengths=c(chr3R=27905053, chrX=22422827)) @ \noindent The components of a \Rclass{GRanges} object are defined as vectors, e.g., of \Rcode{seqnames}, much as one would define a \Rclass{data.frame}. The start and end coordinates are grouped into an \Rclass{IRanges} instance. The optional \Rcode{seqlengths} argument specifies the maximum size of each sequence, in this case the lengths of chromosomes 3R and X in the `dm2' build of the \Dmel{} genome. This data is displayed as <>= genes @ The \Rclass{GRanges} class has many useful methods defined on it. Consult the help page \begin{knitrout} \definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor} \begin{kframe} \begin{alltt} \hlkwd{?GRanges} \end{alltt} \end{kframe} \end{knitrout} \noindent and package vignettes <>= vignette(package="GenomicRanges") @ \noindent for a comprehensive introduction. A \Rclass{GRanges} instance can be subset, with accessors for getting and updating information. <>= genes[2] strand(genes) width(genes) length(genes) names(genes) <- c("FBgn0039155", "FBgn0085359") genes # now with names @ \noindent \Rfunction{strand} returns the strand information in a compact representation called a \emph{run-length encoding}. The `names' could have been specified when the instance was constructed; once named, the \Rclass{GRanges} instance can be subset by name like a regular vector. As the \Rfunction{GRanges} function suggests, the \Rclass{GRanges} class extends the \Rclass{IRanges} class by adding information about \Rcode{seqnames}, \Rcode{strand}, and other information particularly relevant to representing ranges that are on genomes. The \Rclass{IRanges} class and related data structures (e.g., \Rclass{RangedData}) are meant as a more general description of ranges defined in an arbitrary space. Many methods implemented on the \Rclass{GRanges} class are `aware' of the consequences of genomic location, for instance treating ranges on the minus strand differently (reflecting the 5' orientation imposed by DNA) from ranges on the plus strand. \paragraph{Operations on ranges} The \Rclass{GRanges} class has many useful methods. We use \Rclass{IRanges} to illustrate these operations to avoid complexities associated with strand and seqnames, but the operations are comparable on \Rclass{GRanges}. We begin with a simple set of ranges: <>= ir <- IRanges(start=c(7, 9, 12, 14, 22:24), end=c(15, 11, 12, 18, 26, 27, 28)) @ %% <>= png("ranges-ir-plot.png", width=800, height=160) plotRanges(ir, xlim=c(5, 35), main="Original") dev.off() png("ranges-shift-ir-plot.png", width=800, height=160) plotRanges(shift(ir, 5), xlim=c(5, 35), main="Shift") dev.off() png("ranges-reduce-ir-plot.png", width=800, height=160) plotRanges(reduce(ir), xlim=c(5, 35), main="Reduce") dev.off() @ \noindent These and some common operations are illustrated in the upper panel of Figure~\ref{fig:ranges} and summarized in Table~\ref{tab:range-ops}. %% \begin{figure} \centering \includegraphics[width=.5\textwidth,height=!]{ranges-ir-plot}\\ \includegraphics[width=.5\textwidth,height=!]{ranges-shift-ir-plot}\\ \includegraphics[width=.5\textwidth,height=!]{ranges-reduce-ir-plot} \caption{Ranges} \label{fig:ranges} \end{figure} %% Methods on ranges can be grouped as follows: \begin{description} \item[Intra-range] methods act on each range independently. These include \Rfunction{flank}, \Rfunction{narrow}, \Rfunction{reflect}, \Rfunction{resize}, \Rfunction{restrict}, and \Rfunction{shift}, among others. An illustration is \Rfunction{shift}, which translates each range by the amount specified by the \Rcode{shift} argument. Positive values shift to the right, negative to the left; \Rcode{shift} can be a vector, with each element of the vector shifting the corresponding element of the \Rclass{IRanges} instance. Here we shift all ranges to the right by 5, with the result illustrated in the middle panel of Figure~\ref{fig:ranges}. <>= shift(ir, 5) @ <>= png("ranges-shift-ir-plot.png", width=800, height=160) plotRanges(shift(ir, 5), xlim=c(5, 35), main="Shift") dev.off() @ \item[Inter-range] methods act on the collection of ranges as a whole. These include \Rfunction{disjoin}, \Rfunction{reduce}, \Rfunction{gaps}, and \Rfunction{range}. An illustration is \Rfunction{reduce}, which reduces overlapping ranges into a single range, as illustrated in the lower panel of Figure~\ref{fig:ranges}. <>= reduce(ir) @ <>= png("ranges-reduce-ir-plot.png", width=800, height=160) plotRanges(reduce(ir), xlim=c(5, 35), main="Reduce") dev.off() @ \Rfunction{coverage} is an inter-range operation that calculates how many ranges overlap individual positions. Rather than returning ranges, \Rfunction{coverage} returns a compressed representation (run-length encoding) <>= cvg <- coverage(ir) cvg ## plot(as.integer(cvg), type="s", xlab="Coordinate", ylab="Depth of coverage") @ The run-length encoding can be interpreted as `a run of length 6 of nucleotides covered by 0 ranges, followed by a run of length 2 of nucleotides covered by 1 range\ldots'. \item[Between] methods act on two (or sometimes more) \Rclass{IRanges} instances. These include \Rfunction{intersect}, \Rfunction{setdiff}, \Rfunction{union}, \Rfunction{pintersect}, \Rfunction{psetdiff}, and \Rfunction{punion}. The \Rfunction{countOverlaps} and \Rfunction{findOverlaps} functions operate on two sets of ranges. \Rfunction{countOverlaps} takes its first argument (the \Rcode{query}) and determines how many of the ranges in the second argument (the \Rcode{subject}) each overlaps. The result is an integer vector with one element for each member of \Rcode{query}. \Rfunction{findOverlaps} performs a similar operation but returns a more general matrix-like structure that identifies each pair of query / subject overlaps. Both arguments allow some flexibility in the definition of `overlap'. \end{description} \begin{table} \centering \caption{Common operations on \Rclass{IRanges}, \Rclass{GRanges} and \Rclass{GRangesList}.} \label{tab:range-ops} \begin{tabular}{lll} Category & Function & Description \\ \hline Accessors & \Rfunction{start}, \Rfunction{end}, \Rfunction{width} & Get or set the starts, ends and widths \\ & \Rfunction{names} & Get or set the names \\ & \Rfunction{mcols}, \Rfunction{metadata} & Get or set metadata on elements or object \\ & \Rfunction{length} & Number of ranges in the vector \\ & \Rfunction{range} & Range formed from min start and max end \\ Ordering & \Rfunction{<}, \Rfunction{<=}, \Rfunction{>}, \Rfunction{>=}, \Rfunction{==}, \Rfunction{!=} & Compare ranges, ordering by start then width \\ & \Rfunction{sort}, \Rfunction{order}, \Rfunction{rank} & Sort by the ordering \\ & \Rfunction{duplicated} & Find ranges with multiple instances \\ & \Rfunction{unique} & Find unique instances, removing duplicates \\ Arithmetic & \Rcode{r + x}, \Rcode{r - x}, \Rcode{r * x} & Shrink or expand ranges \Robject{r} by number \Robject{x} \\ & \Rfunction{shift} & Move the ranges by specified amount \\ & \Rfunction{resize} & Change width, anchoring on start, end or mid \\ & \Rfunction{distance} & Separation between ranges (closest endpoints) \\ & \Rfunction{restrict} & Clamp ranges to within some start and end \\ & \Rfunction{flank} & Generate adjacent regions on start or end \\ Set operations & \Rfunction{reduce} & Merge overlapping and adjacent ranges \\ & \Rfunction{intersect}, \Rfunction{union}, \Rfunction{setdiff} & Set operations on reduced ranges \\ & \Rfunction{pintersect}, \Rfunction{punion}, \Rfunction{psetdiff} & Parallel set operations, on each \Rcode{x[i]}, \Rcode{y[i]} \\ & \Rfunction{gaps}, \Rfunction{pgap} & Find regions not covered by reduced ranges \\ & \Rfunction{disjoin} & Ranges formed from union of endpoints \\ Overlaps & \Rfunction{findOverlaps} & Find all overlaps for each \Robject{x} in \Robject{y} \\ & \Rfunction{countOverlaps} & Count overlaps of each \Robject{x} range in \Robject{y} \\ & \Rfunction{nearest} & Find nearest neighbors (closest endpoints) \\ & \Rfunction{precede}, \Rfunction{follow} & Find nearest \Robject{y} that \Robject{x} precedes or follows \\ & \Rcode{x \%in\% y} & Find ranges in \Robject{x} that overlap range in \Robject{y} \\ Coverage & \Rfunction{coverage} & Count ranges covering each position \\ Extraction & \Rcode{r[i]} & Get or set by logical or numeric index \\ & \Rcode{r[[i]]} & Get integer sequence from \Rcode{start[i]} to \Rcode{end[i]} \\ & \Rfunction{subsetByOverlaps} & Subset \Robject{x} for those that overlap in \Robject{y} \\ & \Rfunction{head}, \Rfunction{tail}, \Rfunction{rev}, \Rfunction{rep} & Conventional R semantics \\ Split, combine & \Rfunction{split} & Split ranges by a factor into a \Rclass{RangesList} \\ & \Rfunction{c} & Concatenate two or more range objects \\ \hline \end{tabular} \end{table} \paragraph{Adding \Rfunction{mcols} and \Rfunction{metadata}} The \Rclass{GRanges} class (actually, most of the data structures defined or extending those in the \Biocpkg{IRanges} package) has two additional very useful data components. The \Rfunction{mcols} function allows information on each range to be stored and manipulated (e.g., subset) along with the \Rclass{GRanges} instance. The element metadata is represented as a \Rclass{DataFrame}, defined in \Biocpkg{IRanges} and acting like a standard \R{} \Rclass{data.frame} but with the ability to hold more complicated data structures as columns (and with element metadata of its own, providing an enhanced alternative to the \Biocpkg{Biobase} class \Rclass{AnnotatedDataFrame}). <>= mcols(genes) <- DataFrame(EntrezId=c("42865", "2768869"), Symbol=c("kal-1", "CG34330")) @ \noindent \Rfunction{metadata} allows addition of information to the entire object. The information is in the form of a list; any data can be provided. <>= metadata(genes) <- list(CreatedBy="A. User", Date=date()) @ \paragraph{The \Rclass{GRangesList} class} The \Rfunction{GRanges} class is extremely useful for representing simple ranges. Some next-generation sequence data and genomic features are more hierarchically structured. A gene may be represented by several exons within it. An aligned read may be represented by discontinuous ranges of alignment to a reference. The \Rclass{GRangesList} class represents this type of information. It is a list-like data structure, which each element of the list itself a \Rclass{GRanges} instance. The ENSEMBL genes identified earlier can be represented as a \Rclass{GRangesList}. <>= library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene # alias ## have ENSEMBL ids ensids <- c("ENSG00000130720", "ENSG00000103257", "ENSG00000156414", "ENSG00000144644", "ENSG00000159307", "ENSG00000144485") ## need ENTREZID's egids <- select(org.Hs.eg.db, ensids, "ENTREZID", "ENSEMBL")$ENTREZID exByGn <- exonsBy(txdb, "gene")[egids] @ <>= exByGn @ The \Rclass{GRangesList} object has methods one would expect for lists (e.g., \Rfunction{length}, sub-setting). Many of the methods introduced for working with \Rclass{IRanges} are also available, with the method applied element-wise. \subsection{Selecting gene sequences} \begin{Exercise} This exercise uses annotation packages to go from gene identifiers to coding sequences. \begin{enumerate} \item Map from an informal gene SYMBOL, e.g., BRCA1, to ENTREZID gene identifiers using the \Biocannopkg{org.Hs.eg.db} package and the \Rfunction{select} function, use the \Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene} package and a second map to go from ENTREZID to TXNAME. \item Extract the coding sequence grouped by transcript using the \Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene} package and \Rfunction{cdsBy} function; select just those transcripts we are interested in. \item Retrieve the nucleotide sequence from the \Biocannopkg{BSgenome.Hsapiens.UCSC.hg19} package using the function \Rfunction{extractTranscriptsFromGenome}. \item Verify that the coding sequences are all multiples of 3, and translate from nucleotide to amino acid sequence. \end{enumerate} \end{Exercise} \begin{Solution} Map from gene SYMBOL to ENTREZID, and from ENTREZID to TXNAME and extract the releveant coding sequence, grouped by transcript <>= library(org.Hs.eg.db) egid <- select(org.Hs.eg.db, "BRCA1", "ENTREZID", "SYMBOL")$ENTREZID library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene egToTx <- select(txdb, egid, "TXNAME", "GENEID") cdsByTx <- cdsBy(txdb, "tx", use.names=TRUE)[egToTx$TXNAME] @ \noindent Retrieve the sequence <>= library(BSgenome.Hsapiens.UCSC.hg19) txx <- extractTranscriptsFromGenome(Hsapiens, cdsByTx) @ \noindent Translate to amino acid sequence <>= all(width(txx) %% 3 == 0) # sanity check translate(txx) # amino acid sequence @ \end{Solution} \subsection{Summarizing overlaps} \begin{Exercise} A basic operation in RNA-seq and other work flows is to count the number of times aligned reads overlap features of interest. \begin{enumerate} \item Load the \Biocexptpkg{RNAseqData.HNRNPC.bam.chr14} experiment data package and get the paths to the BAM files it contains. \item Load the `transcript db' package that contains the coordinates of each exon of the UCSC 'known genes' track of hg19. \item Extract the exon coordinates grouped by gene; the result is an \Rclass{GRangesList} object that we will discuss more latter. \item Use the \Rfunction{summarizeOverlaps} function with the exon coordinates and BAM files to generate a count of the number of reads overlapping each gene. Visit the help page \Rcode{?summarizeOverlaps} to read about the counting strategy used. \item The counts can be extracted from the return value of \Rfunction{summarizeOverlaps} using the function \Rfunction{assay}. This is standard \R{} matrix. How many reads overlapped regions of interest in each sample? How many genes had non-zero counts? \end{enumerate} \end{Exercise} \begin{Solution} Point to BAM files <>= library(RNAseqData.HNRNPC.bam.chr14) fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES @ \noindent Get the gene model; this could also come from, e.g., a GFF or GTF file. <>= library(BiocParallel) library(TxDb.Hsapiens.UCSC.hg19.knownGene) ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") @ \noindent Summarize the number of reads overlapping each region of interest <>= counts <- summarizeOverlaps(ex, fls) colSums(assay(counts)) sum(rowSums(assay(counts)) != 0) @ \end{Solution} \appendix \nocite{10.1371/journal.pcbi.1003118} \bibliography{EMBOBGI} \end{document}