% \VignetteEngine{knitr::knitr} % \VignetteIndexEntry{04. Bioconductor for Sequence Analysis -- DNA Sequences} \documentclass{article} <>= BiocStyle::latex() library(knitr) opts_chunk$set(cache=TRUE, tidy=FALSE) @ \title{Sequences} \author{Sonali Arora \footnote{\url{sarora@fhcrc.org}}} \date{27-28 February 2014} \begin{document} \maketitle \tableofcontents \section{Introduction to Bioconductor Classes and objects for String manipulation} Aim of this section \begin{itemize} \item get familiar with various containers for sequences \item read and display sequences from a FASTA file \item simple manipulations on sequences stores in a FASTA file such as reverse(), reverseComplement(), translate() \item calculate gc content \end{itemize} <>= library(Biostrings) library(ShortRead) @ \subsection{Containers for sequences} Bioconductor has various classes for storing sequences. You can find out the possible containers using: <>= showMethods(complement) @ A quick example for each of them is: <>= b <- BString("I store any set of characters!" ) d <- DNAString("GCATAT-TAC") # Creates DNAString object. r <- RNAString("GCAUAU-UAC") # Creates RNAString object. r <- RNAString(d) # Converts d into RNAString object. p <- AAString("HCWYHH") @ Lets look at how you can use these on a daily basis: You're studying Breast Cancer genes -BRCA1 and BRCA2 - Inherited mutations in BRCA1 and BRCA2, confer increased lifetime risk of developing breast or ovarian cancer We want to do the following: For BRCA1: We can learn more about the gene at - \url{http://www.ncbi.nlm.nih.gov/gene/672} We can download the FASTA sequence as a file at: \url{http://www.ncbi.nlm.nih.gov/nuccore/NC_000017.11?report=fasta&from=43032116&to=43137660&strand=true} Similarly, for BRCA2 , We can learn more about the gene at \url{http://www.ncbi.nlm.nih.gov/gene/675} we can get the FASTA sequence from : \url{http://www.ncbi.nlm.nih.gov/nuccore/NC_000013.11?report=fasta&from=32302850&to=32412300} These files are already saved for you and you can access them using <>= fls <- list.files(system.file("extdata", package="BiocIntro"), pattern =".txt",full=TRUE) fls @ So lets begin by reading them into an R session using DNAStringSet, a container for storing DNAString objects. For this we use the readFASTA from the ShortReads package in Biocondcutor. readFasta reads all FASTA-formated files stored in fls. It returns a DNAStringSet containing sequences and qualities contained in the given file. We will then use sread from the ShortRead package to create a DNAStringSet and diaply the sequence for one of the genes in a nice, user fiendly format. <>= #Approach- 1 # read in the FATSA file seq <- readFasta(fls) ##Lets create a DNAStringSet which is a container for storing a set of DNAString dna <- sread(seq) #Approach-2 dna <- readDNAStringSet(fls) # let us look at the first DNAString, brca1 stored at [1] # This [[]] operation converts a DNAStringSet to DNAString brca1 <- dna[[1]] brca2 <- dna[[2]] # making the output more understandable. # A sequence in FASTA format is represented as a series of lines, each of which # usually do not exceed 80 characters. successiveViews(brca1, width=rep(50,length(dna[[1]])/50+1)) @ By default we show only the top 5 and last 5 in a given View. But we can set \Rcode{options(showHeadLines=Inf) } to display everything <>= options(showHeadLines=Inf) successiveViews(brca1, width=rep(50,length(dna[[1]])/50+1)) @ \subsection{Basic Manipulations} \textbf{Exercise:1} a) Create the complement, the reverse and the reverse and complement sequences for brca1 <>= reverse(brca1) complement(brca1) reverseComplement(brca1) @ b) Translate your random DNA sequences into proteins. <<>>= translate(brca1) @ \subsection{Pre-defined constants} Bioconductor also has some predefined constants which you can use. <>= DNA_BASES DNA_ALPHABET IUPAC_CODE_MAP @ \subsection{Frequencies} We can also find out various frequencies for our given GOI, lets look at brca2: <>= # what are the unique letter ? uniqueLetters(brca2) alphabetFrequency(brca2) alphabetFrequency(brca2, baseOnly=TRUE) dinucleotideFrequency(brca2) trinucleotideFrequency(brca2) @ you can also have oligonucleotideFrequency() \textbf{Exercise:2} Can you find the GC content for BRCA1 and BRCA2? Hint: use alphabetFrequency Solution: <>= gcContent <- function(x) { alf <- alphabetFrequency(x, as.prob=TRUE) sum(alf[c("G","C")]) } gcContent(brca1) gcContent(brca2) @ \subsection{Bioconductor has data pacakges for your favourite organism} BSgenome Data Packages \begin{itemize}[leftmargin=*] \item Full genomes stored in Biostrings containers \item Currently 16 organisms supported (Human, Mouse, Worm, Yeast, etc...) \item acilities for supporting new genomes (BSgenomeForge) \end{itemize} \textbf{Exercise:3} a. Can you find out the gc content for chromosome 17 ( home of BRCA1) and the gc content for chromosome 13 ( home of BRCA2) <>= library(BSgenome.Hsapiens.UCSC.hg19) @ <>= gcContent(Hsapiens[["chr17"]]) gcContent(Hsapiens[["chr13"]]) @ b. Please create a plot of the GC frequencies for all the primary chromosomes Please superimpose the frequencies of BRCA1 and BRCA2 on this plot Add title , legend and appropriate labels for the axes. <>= chrs <- paste0("chr", c(1:22,"X","Y")) data <- sapply(chrs, function(x) gcContent(Hsapiens[[x]])) names(data) <- chrs plot(data, xlab="chromosomes",ylab="gc Frequencies", xlim=c(1,24), col="blue") abline(h=gcContent(Hsapiens[["chr17"]]),col="red") abline(h=gcContent(Hsapiens[["chr13"]]),col="orange") title(main="gc Frequecies across Human Chromosomes", col.main="blue", font.main=4) legend("topleft",c("chromsomes","brca1","brca2"), cex=0.8, col=c("blue","red","orange"), pch=21:22, lty=1:2) @ \section{Session Info} Here is the output of sessionInfo on the system on which this document was compiled: <>= sessionInfo() @ %% \end{document}