---
title: "Integrated analysis and visualization of ChIP-seq data using _ChIPpeakAnno_, _GeneNetworkBuilder_ and _TrackViewer_"
author: "Jianhong Ou, Jun Yu, Lihua Julie Zhu"
date: "`r Sys.Date()`"
vignette: >
%\VignetteIndexEntry{Slides}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
output:
revealjs::revealjs_presentation:
theme: simple
transition: slide
self_contained: true
backgroundTransition: zoom
css: ["styles.css", "custom.css"]
includes:
after_body: doc_suffix.html
---
## ChIPseq technology
> Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a
> technique for genome-wide profiling of DNA-binding proteins, histone
> modifications or nucleosomes.
> -- [*Peter J. Park*](http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html)
```{r echo=FALSE, results="hide", warning=FALSE, message=FALSE}
suppressPackageStartupMessages({
library(SRAdb)
library(reshape2)
library(Rsubread)
library(Rsamtools)
library(ChIPpeakAnno)
library(GeneNetworkBuilder)
library(rtracklayer)
library(GenomicAlignments)
library(idr)
library(ChIPQC)
library(csaw)
library(ggplot2)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(reactome.db)
library(BSgenome.Hsapiens.UCSC.hg19)
library(motifStack)
library(knitr)
library(trackViewer)
library(png)
library(grid)
library(limma)
library(simpIntLists)
library(plotly)
library(Rgraphviz)
})
opts_chunk$set(eval=FALSE, fig.width=8, fig.height=8, warning = FALSE, message = FALSE)
extdata <- system.file("extdata", package = "ChIPseqStepByStep")
load(file.path(extdata, "minimal.rds"))
```
![ChIP-seq workflow](figure/ChIPseq.workflow.png)
(Modified from [Nature Reviews Genetics doi:10.1038/nrg2641, P.Park](http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html))
## Call peaks
![call peaks](figure/call.peaks.png)
(Modified from [Nature Methods doi:10.1038/nmeth.1371, S.Pepke](http://www.nature.com/nmeth/journal/v6/n11s/abs/nmeth.1371.html))
## Workflow of ChIPseq Analysis
* Quality Assessment of sequencing: [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* Mapping: [bowtie1/2](http://bowtie-bio.sourceforge.net/index.shtml), [Rsubread](http://bioconductor.org/packages/release/bioc/html/Rsubread.html)
* Peak calling: [CCAT](http://cmb.gis.a-star.edu.sg/ChIPSeq/paperCCAT.htm), [SICER](http://home.gwu.edu/~wpeng/Software.htm), [MACS](http://liulab.dfci.harvard.edu/MACS/), [ZINBA](https://code.google.com/p/zinba/), [BayesPeak](http://bioconductor.org/packages/release/bioc/html/BayesPeak.html), [chipseq](http://bioconductor.org/packages/release/bioc/html/chipseq.html), [ChIPseqR](http://bioconductor.org/packages/release/bioc/html/ChIPseqR.html), [CSAR](http://bioconductor.org/packages/release/bioc/html/CSAR.html), [csaw](http://bioconductor.org/packages/release/bioc/html/csaw.html), [GenoGAM](http://bioconductor.org/packages/release/bioc/html/GenoGAM.html), [iSeq](http://bioconductor.org/packages/release/bioc/html/iSeq.html), [PICS](http://bioconductor.org/packages/release/bioc/html/PICS.html)
* Quality assessment: [ChIPQC](http://www.bioconductor.org/packages/release/bioc/html/ChIPQC.html)
* Peak annotation and pathway analysis: **[ChIPpeakAnno](http://www.bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html)**, [Homer](http://homer.ucsd.edu/homer/ngs/index.html), [PAVIS](https://manticore.niehs.nih.gov/pavis2/), [GREAT](http://bejerano.stanford.edu/great/public/html/), [ChIPseeker](http://www.bioconductor.org/packages/release/bioc/html/ChIPseeker.html), [clusterProfiler](http://www.bioconductor.org/packages/release/bioc/html/clusterProfiler.html), [GOstats](http://www.bioconductor.org/packages/release/bioc/html/GOstats.html)
* Gene Network Building: **[GeneNetworkBuilder](http://www.bioconductor.org/packages/release/bioc/html/GeneNetworkBuilder.html)**, [ChIPXpress](http://www.bioconductor.org/packages/release/bioc/html/ChIPXpress.html)
* Visualization: **[trackViewer](http://www.bioconductor.org/packages/release/bioc/html/trackViewer.html)**, [IGV](https://www.broadinstitute.org/igv/home), [UCSC genome browser](http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38), [rtracklayer](http://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html)
* Motif enrichment analysis: [The MEME Suite](http://meme.nbcr.net/meme/index.html), [homer](http://homer.salk.edu/homer/motif/)
## Summary of the workflow
The workflow is based primarily on software packages from the open-source [Bioconductor project](http://bioconductor.org/). It contains all steps that are necessary for detecting peaks, starting from the raw read sequences.
Reads are first aligned to the genome using the **[Rsubread](http://bioconductor.org/packages/release/bioc/html/Rsubread.html)** package.
Peaks are called by **[MACS2](https://github.com/taoliu/MACS/)**.
The peaks are annotated and virualized by **[ChIPpeakAnno](http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html)** , **[trackViewer](http://bioconductor.org/packages/release/bioc/html/trackViewer.html)**, and **[GeneNetworkBuilder](http://bioconductor.org/packages/release/bioc/html/GeneNetworkBuilder.html)** packages.
We will demonstrate the workflow on a publicly available ChIP-seq dataset to profile YAP/TAZ/TEAD binding sites ([Zanconato et al., 2015](http://www.nature.com/ncb/journal/v17/n9/abs/ncb3216.html)).
## ChIPpeakAnno
Main functions:
1. Compare and combine biological replicates (`IDRfilter`, `findOverlapsOfPeaks`, `peakPermTest`)
2. Annotate to nearest gene (`annotatePeakInBatch`)
3. Pathway Analysis (`getEnrichedGO`, `getEnrichedPath`)
4. Separate binding sites into active enhancer regions and promoter regions using histone modification data
5. Visualize the signal pattern for genomic loci bound by multiple transcription factors
## [GeneNetworkBuilder](https://bioconductor.org/packages/release/bioc/vignettes/GeneNetworkBuilder/inst/doc/GeneNetworkBuilder_vignettes.html)
1. Build regulatory network (`buildNetwork`)
2. Filter regulatory network (`filterNetwork`)
3. Polish regulatory network (`polishNetwork`)
## [trackViewer](https://bioconductor.org/packages/release/bioc/vignettes/trackViewer/inst/doc/trackViewer.html)
1. Import data (`importScore`)
2. Build gene model (`geneModelFromTxdb`)
3. View the tracks (`viewTracks`, `browseTracks`)
## Download data from Gene Expression Omnibus
NCBI GEO ID | SRA project | Description | package
------------|-------------|-------------------------------------------------------------------------------------------|----------
[GSE66081](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66081) | SRP055170 | the data of ChIP-seq raw reads for the study of association between YAP/TAZ/TEAD and AP-1 at enhancers drivers oncogenic trowth | [SRAdb](http://bioconductor.org/packages/release/bioc/html/SRAdb.html)
[GSE49651](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49651) | | the H3K27Ac, H3K4me1, H3K4me3 ChIP-seq data in breast cancer cell line | [GEOquery](http://bioconductor.org/packages/release/bioc/html/GEOquery.html)
[GSE59232](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59232) | | transcriptomic data of Affymetrix GeneChips Human Genome U133 Plus 1.0 | [GEOquery](http://bioconductor.org/packages/release/bioc/html/GEOquery.html)
***
Here we use **[SRAdb](http://bioconductor.org/packages/release/bioc/html/SRAdb.html)** package to download the sequence data.
```{r eval=FALSE}
## load library SRAdb to retrieve SRA files for GSE66081 dataset
library(SRAdb)
## SRAdb need to download a light sqlite file
## to extract sra file information
## and then download by getSRAfile function.
sqlfile <- "SRAmetadb.sqlite"
if(!file.exists(sqlfile))
sqlfile <- getSRAdbFile()
sra_con <- dbConnect(SQLite(), sqlfile)
conversion <- sraConvert(c("SRP055170"),
sra_con=sra_con)
out <- getSRAfile(conversion$sample,
sra_con=sra_con,
fileType="sra")
```
***
```{r eval=FALSE}
## extract file annotations
rs <- listSRAfile(conversion$sample,
sra_con=sra_con,
fileType="sra")
experiment <-
dbGetQuery(sra_con,
paste0("select * from experiment where",
" experiment_accession in ('",
paste(conversion$experiment,
collapse="','"), "')"))
info <- merge(experiment[, c("experiment_accession",
"title", "library_layout")],
rs[, c("experiment", "run")], by=1)
```
```{r eval=TRUE}
info[1:5, 2:3]
```
***
These files are downloaded in the SRA format, and need to be unpacked to the FASTQ format prior to alignment. This can be done using the _fastq-dump_ utility from **[SRA Toolkit](http://www.ncbi.nlm.nih.gov/books/NBK158900/)**.
```{r eval=FALSE}
runs <- info[, "run"]
## extract fastq files from sra by sratoolkit
sapply(paste0(runs, ".sra"),
function(.ele) system(paste("fastq-dump",
.ele)))
```
***
Technical replicates are merged together prior to further processing. This reflects the fact that they originate from a single library of DNA fragments.
```{r eval=FALSE}
grp <- do.call(rbind,
strsplit(info$title, "(;|:) "))
info <- cbind(info, grp)
runs <- split(runs, grp[, 2])
group <- gsub("_ChIPSeq", "", names(runs))
```
```{r eval=TRUE}
runs[1:2]
```
```{r eval=FALSE}
mapply(function(.ele, n)
system(paste("cat", paste0(.ele, ".fastq", collapse=" "),
">>", paste0(n, ".fastq"), collapse=" ")),
runs, group)
## remove unused files to save storage.
unlink(paste0(info$run, ".fastq"))
unlink(paste0(info$run, ".sra"))
```
## Aligning reads to hg19 build of human genome
Reads in each library are aligned to the hg19 build of human genome, using the **[Rsubread](http://bioconductor.org/packages/release/bioc/html/Rsubread.html)** package.
```{r eval=FALSE}
library(Rsubread)
fastq.files <- paste0(group, ".fastq")
bam.files <- paste0(group, ".bam")
##getPhredOffset, if the offset is not correct, it will report error.
x <- qualityScores(filename=fastq.files[1],
input_format="FASTQ", offset=33,
nreads=1000)
```
```{r eval=TRUE}
x[1:3, 1:10]
```
***
Exercise: Check the offset of quality scores
```{r}
library(Rsubread)
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
head(qualityScores(filename = reads,
input_format = "gzFASTQ",
offset = 33,
nreads = 100)) ## please check the errors
head(qualityScores(filename = reads,
input_format = "gzFASTQ",
offset = 64,
nreads = 100))
```
***
An index is constructed with the prefix index/hg19 and the index can be re-used. Here, the `type` of sequencing data is set to 1 for genomic DNA-seq data. The uniquely mapped reads should be reported only by setting the `uniqe` to TRUE.
```{r eval=FALSE}
## build index and this can be re-used.
hg19.fasta <-
"Genomes/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
dir.create("index")
buildindex(basename="index/hg19",
reference=hg19.fasta)
## alignment
align(index="./index/hg19", readfile1=fastq.files,
type=1,
output_file=bam.files, maxMismatches=2,
nthreads=2,
phredOffset=33, unique=TRUE)
```
***
Subsample the data to save time for this tutorial.
```{r eval=FALSE}
library(Rsamtools)
filter <- FilterRules(list(seqn = function(x) x$rname == "chr1"))
## sort and index
null <- sapply(group, function(gp) {
sortBam(paste0(gp, ".bam"), paste0(gp, ".srt"))
file.copy(paste0(gp, ".srt.bam"),
paste0(gp, ".bam"), overwrite = TRUE)
indexBam(paste0(gp, ".bam"))
})
## subset (optional)
null <- sapply(group, function(gp) {
dest <- paste0(gp, ".chr1.bam")
filterBam(paste0(gp, ".bam"), paste0(gp, ".chr1.bam"),
filter=filter)
file.rename(paste0(gp, ".chr1.bam"), paste0(gp, ".bam"))
indexBam(paste0(gp, ".bam"))
})
```
***
Potential PCR duplicates are removed by _removeDupReads_ function in **[Rsubread](http://bioconductor.org/packages/release/bioc/html/Rsubread.html)** package. This step can also be done by _MarkDuplicates_ tool from the [Picard software suite](http://broadinstitute.github.io/picard/).
By _asBam_ function from **[Rsamtools](http://bioconductor.org/packages/release/bioc/html/Rsamtools.html)** package, the alignment is sorted by their mapping location, and an index created on the destination (with extension '.bai') when `indexDestination=TRUE`.
```{r eval=FALSE}
## remove reads which are mapped to identical locations,
## using mapping location of the first base of each read.
## and resort by position
library(Rsamtools)
library(Rsubread)
null <- sapply(group[c(1, 11:13)], function(gp){
out <- asSam(paste0(gp, ".bam"), gp)
removeDupReads(out, threshold=2,
outputFile=paste0(gp, ".rmdup.sam"))
asBam(paste0(gp, ".rmdup.sam"), paste0(gp, ".rmdup"),
indexDestination=TRUE)
unlink(out)
unlink(paste0(gp, ".rmdup.sam"))
})
```
## Calling peaks
The control datasets are pooled before calling peaks according the description of the paper.
```{r eval=FALSE}
## pool IgG depend on experiment
IgG.bg <- list(rep1.2=c("IgG_rep1.rmdup.bam",
"IgG_rep2.rmdup.bam",
"IgG_rep1.2.rmdup.bam"),
rep3.4=c("IgG_rep3.rmdup.bam",
"IgG_rep4.rmdup.bam",
"IgG_rep3.4.rmdup.bam"),
rep3.5=c("IgG_rep3.rmdup.bam",
"IgG_rep5.rmdup.bam",
"IgG_rep3.5.rmdup.bam"))
null <- sapply(IgG.bg[3], function(.ele){
out <- mergeBam(files=.ele[-3],
destination = .ele[3],
indexDestination=TRUE)
})
```
***
Before we call peaks, we may want to check the difference between ChIP signal and background signal.
By `ChIPpeakAnno::cumulativePercentage` function, the difference between the cumulative percentage
tag allocation in Input and IP could be clearly checked ([Diaz et al., 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342857/), [RamÃrez et al., 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987876/)).
```{r eval=FALSE}
library(ChIPpeakAnno)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
pair <- data.frame(treat=c("YAP", "TAZ", "TEAD4", "JUN"),
control=c("1.2", "1.2", "3.4", "3.5"),
stringsAsFactors = FALSE)
## use chromosome 1 to save the time.
chr1 <- as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)["chr1"], "GRanges")
## check the difference between TEAD4 ChIP and background signal
i <- 3
cumulativePercentage(
bamfiles = c(paste0(pair[i, "treat"],
"_rep1.rmdup.bam"),
paste0("IgG_rep", pair[i, "control"],
".rmdup.bam")),
gr=chr1)
```
***
See documentation from [deeptools](http://deeptools.readthedocs.io/en/latest/content/tools/plotFingerprint.html)
![deeptools](figure/QC_fingerprint.png)
***
We set loose filter condition for [MACS2](https://github.com/taoliu/MACS/) to get more peaks for irreproducibility discovery rate (IDR) analysis.
```{r eval=FALSE}
for(i in seq.int(nrow(pair))){
system(paste("macs2 callpeak --to-large -t",
paste0(pair[i, "treat"], "_rep1.rmdup.bam"),
"-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"),
" -f BAM -g hs -q 0.1 -n",
paste0(pair[i, "treat"], "_rep1.q1")))
system(paste("macs2 callpeak --to-large -t",
paste0(pair[i, "treat"], "_rep2.rmdup.bam"),
"-c ", paste0("IgG_rep", pair[i, "control"], ".rmdup.bam"),
" -f BAM -g hs -q 0.1 -n",
paste0(pair[i, "treat"], "_rep2.q1")))
## we will use idr to filter the results later
}
```
***
After peak calling, the output of [MACS2](https://github.com/taoliu/MACS/) will be imported into R by `ChIPpeakAnno::toGRanges` function. The top 10000 peaks sorted by pValue (FDR) will be used for IDR analysis.
```{r eval=FALSE, warning=FALSE}
library(ChIPpeakAnno)
macs2.files <- dir(".", pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele)
toGRanges(.ele, format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele)
.ele[order(.ele$pValue, decreasing=TRUE)])
```
Exercise: Import data by `toGRanges`
```{r}
library(ChIPpeakAnno)
packagePath <- system.file("extdata", package = "ChIPpeakAnno")
dir(packagePath, "gff|bed|narrowPeak|broadPeak|gz")
toGRanges(file.path(packagePath, "GFF_peaks.gff"), format = "GFF")
toGRanges(file.path(packagePath, "WS220.bed"), format = "BED")
toGRanges(file.path(packagePath, "peaks.narrowPeak"), format = "narrowPeak")
toGRanges(file.path(packagePath, "TAF.broadPeak"), format = "broadPeak")
toGRanges(file.path(packagePath, "wgEncodeUmassDekker5CGm12878PkV2.bed.gz"), format = "BED")
packagePath <- system.file("extdata", "macs", package = "ChIPseqStepByStep")
macs2.files <- dir(packagePath, pattern="*.q1_peaks.narrowPeak$")
peaks <- sapply(macs2.files, function(.ele)
toGRanges(file.path(packagePath, .ele), format="narrowPeak"))
peaks.bk <- peaks
peaks <- lapply(peaks, function(.ele) .ele[order(.ele$pValue, decreasing=TRUE)])
```
***
There are lots of noise peaks when we set loose filter condition for [MACS2](https://github.com/taoliu/MACS/). We will use IDR framework to filter the low reproducible peaks ([Landt et al., 2012](http://genome.cshlp.org/content/22/9/1813.short)). Only overlapping peaks will be used for [IDR](https://sites.google.com/site/anshulkundaje/projects/idr) calculation.
```{r eval=TRUE}
peaks <- lapply(peaks, function(.ele) .ele[1:min(1e5, length(.ele))])
lengths(peaks)
```
***
```{r eval=FALSE}
new.group <- gsub("^(.*?)_.*$", "\\1", names(peaks))
peaks.grp <- split(peaks, new.group)
names(peaks.grp)
## find overlapping peaks
GSE66081 <- sapply(peaks.grp, findOverlapsOfPeaks, simplify = FALSE)
GSE66081.bk <- GSE66081
GSE66081 <- sapply(GSE66081, function(.ele)
.ele$peaklist[[names(.ele$peaklist)[grepl("\\/\\/\\/",
names(.ele$peaklist))]]],
simplify=FALSE)
```
```{r eval=TRUE}
lengths(GSE66081)
```
***
The IDR are calculated by the average coverages of each overlapping peak. The reads counts for peaks are done by _summarizeOverlaps_ function in **[GenomicAlignments](https://www.bioconductor.org/packages/release/bioc/html/GenomicAlignments.html)** package.
```{r eval=FALSE}
library(ChIPpeakAnno)
new.group.names <- lapply(GSE66081.bk, function(.ele)
sub(".q1_peaks.narrowPeak", ".bam",
names(.ele$peaklist)[!grepl("\\/\\/\\/", names(.ele$peaklist))]))
GSE66081 <- mapply(function(.peaks, .bamfiles)
IDRfilter(peaksA=.peaks[[1]], peaksB=.peaks[[2]],
bamfileA=.bamfiles[1], bamfileB=.bamfiles[2],
IDRcutoff=0.01),
peaks.grp, new.group.names)
## The original peaks are filtered to decrease the memory usage.
peaks.keep <- sapply(GSE66081, function(.ele)
as.character(unlist(.ele$peakNames)))
peaks.keep <- lapply(peaks.keep, function(.ele) gsub("^.*__", "", .ele))
peaks.keep <- unlist(peaks.keep)
peaks.s <- lapply(peaks, function(.ele) .ele[names(.ele) %in% peaks.keep])
peaks.unlist <- unlist(GRangesList(peaks.s), use.names = FALSE)
```
## Export filtered peaks by `rtracklayer`
Export the filtered peak by _export_ function in **[rtracklayer](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html)** package.
```{r eval=FALSE}
library(rtracklayer)
null <- mapply(function(.dat, .name)
export(.dat,
sub("^(.*?).q1.*$", "\\1.fil.bed", .name),
format="BED"),
peaks.s, names(peaks.s))
## filter the peaks by qValue < 10^-10 to get similar number peaks
## as filtered by idr to confirm the effect of IDR filter.
peaks.bk <- lapply(peaks.bk, function(.ele)
.ele[.ele$qValue>10])
null <- mapply(function(.dat, .name)
export(.dat,
sub("^(.*?).q1.*$", "\\1.fil.q01.bed", .name),
format="BED"),
peaks.bk, names(peaks.bk))
```
## Quality control
The ChIP-seq experiment could be checked by [ChIPQC](http://bioconductor.org/packages/release/bioc/html/ChIPQC.html) package.
Usally, before mapping, [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) could be used for sequence quality accessment. And after mapping, [SAMStat](http://samstat.sourceforge.net/) could be used for mapping quality analysis, and strand cross-correlation could be used for checking the experiment quality via [csaw](http://bioconductor.org/packages/release/bioc/html/csaw.html) package.
```{r eval=FALSE}
## strand cross-correlation
library(csaw)
scc <- lapply(sub("^(.*?).q1.*$", "\\1.rmdup.bam", names(peaks)),
correlateReads, cross=TRUE,
param=readParam(minq=30,
restrict="chr1",
dedup=TRUE))
names(scc) <- gsub(".q1.*$", "", names(peaks))
scc.param <- lapply(scc, function(.ele){
readsLength <- 50
cutline <- 2 * readsLength
read_length <- which.max(.ele[1:cutline])
fragment_length <- which.max(.ele[cutline:length(.ele)]) + cutline - 1
baseline <- which.min(.ele[cutline:length(.ele)]) + cutline - 1
#normalized.strand.coefficient
NSC <- .ele[fragment_length] / .ele[baseline]
# relative strand correlation
RSC <- (.ele[fragment_length] - .ele[baseline])/(.ele[read_length] - .ele[baseline])
c(read_length=read_length-1,
fragment_length=fragment_length-1,
baseline=baseline-1,
NSC=NSC,
RSC=RSC)
})
```
***
The [normalized strand coefficient (NSC) and relative strand correlation (RSC)](https://genome.ucsc.edu/ENCODE/qualityMetrics.html) are strong metrics for assessing signal-to-noise ratios in a ChIP-seq experiment. High-quality ChIP-seq data sets should have NSC values >= 1.05 and RSC values >= 0.8 ([Landt et al., 2012](http://genome.cshlp.org/content/22/9/1813.short)).
```{r eval=TRUE}
do.call(rbind, scc.param)[, c("NSC", "RSC")]
```
## Strand cross correlation against delay distance
The absolute and relative height of 'phantom' peak and ChIP peak are useful determinants of the success of a ChIP-seq experiment.
```{r eval=FALSE}
op <- par(mfrow=c(4, 2))
for(i in 1:length(scc)){
plot(1:length(scc[[i]])-1, scc[[i]],
xlab="Delay (bp)", ylab="cross-correlation",
main=names(scc)[i], type="l",
ylim=c(scc[[i]][scc.param[[i]]["baseline"]]*.8, max(scc[[i]])*1.1))
abline(v=scc.param[[i]]["fragment_length"], col="red", lty=3)
abline(h=scc[[i]][scc.param[[i]][c("fragment_length", "read_length", "baseline")]+1],
col=c("blue", "green", "gray30"), lty=2)
}
```
***
```{r eval=TRUE, fig.width=9, fig.height=6, echo=FALSE}
op <- par(mfrow=c(2, 4))
for(i in seq_along(scc)){
plot(seq_along(scc[[i]])-1, scc[[i]],
xlab="Delay (bp)", ylab="cross-correlation",
main=names(scc)[i], type="l",
ylim=c(scc[[i]][scc.param[[i]]["baseline"]]*.8, max(scc[[i]])*1.1))
abline(v=scc.param[[i]]["fragment_length"], col="red", lty=3)
abline(h=scc[[i]][scc.param[[i]][c("fragment_length", "read_length", "baseline")]+1],
col=c("blue", "green", "gray30"), lty=2)
}
```
## Generate bigWig files for UCSC genome browser
Now we have estimated fragment length and can generate track files for UCSC genome browser.
```{r eval=FALSE}
library(GenomicAlignments)
library(rtracklayer)
bwPath <- "bw"
dir.create(bwPath)
cvgs <- mapply(function(.ele, .name){
gal <- readGAlignments(paste0(.name, ".rmdup.bam"))
cvg <- coverage(gal, width = as.numeric(.ele["fragment_length"]))
seqinfo(cvg) <- seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)
export.bw(cvg,
con=file.path(bwPath,
paste0(.name, ".rmdup.bigWig")))
cvg
}, scc.param, names(scc.param))
```
***
[ChIPQC](http://bioconductor.org/packages/release/bioc/html/ChIPQC.html) is used for quality control.
```{r eval=TRUE}
library(ChIPQC)
samples <-
data.frame(SampleID=gsub("^(.*?).q1.*$", "\\1", names(peaks)),
Factor=gsub("^(.*?)_.*$", "\\1", names(peaks)),
Replicate=gsub("^.*?_rep(.*?).q1.*$", "\\1",
names(peaks)),
bamReads=gsub("^(.*?).q1.*$", "\\1.rmdup.bam",
names(peaks)),
Peaks=names(peaks),
PeakCaller="narrow")
samples[1:3, ]
```
***
We run ChIPQC for peaks before filter, filtered by IDR and qValue.
```{r eval=FALSE}
## before filter
exampleExp <- ChIPQC(experiment=samples, annotaiton="hg19")
ChIPQCreport(exampleExp, reportFolder="ChIPQC", facetBy="Factor")
## after IDR filter
samples.fil <- samples
samples.fil$Peaks <- gsub("^(.*?).q1.*$", "\\1.fil.bed", names(peaks))
samples.fil$PeakCaller="bed"
exampleExp.fil <- ChIPQC(experiment=samples.fil, annotaiton="hg19")
ChIPQCreport(exampleExp.fil,
reportFolder="ChIPQC.fil",
facetBy="Factor")
## filtered by qValue from the MACS2 result
samples.fil.q01 <- samples.fil
samples.fil.q01$Peaks <-
gsub("^(.*?).q1.*$", "\\1.fil.q01.bed", names(peaks))
exampleExp.fil.q01 <- ChIPQC(experiment=samples.fil.q01,
annotaiton="hg19")
ChIPQCreport(exampleExp.fil.q01,
reportFolder="ChIPQC.fil.q01",
facetBy="Factor")
```
***
![ChIPQC](figure/ChIP_QC.png)
The effect of IDR with full dataset. The average profile across within peak regions of duplicates, and the correlation between samples as heatmap and by principal component analysis show higher quality of IDR filtered peaks for replicate samples because they are clustered together in the heatmap and spatially grouped within the PCA plot.
## Association among different sets of peaks
Test the overlaps of all the peaks. From the testing result, we can confirm the widespread association between YAP, TAZ, TEAD and JUN. The vennDiagram shows the numbers of overlapping peak for these TFs.
```{r eval=FALSE}
ol <- findOverlapsOfPeaks(GSE66081, connectedPeaks="keepAll")
```
Exercise: find overlaps among peak lists
```{r}
library(ChIPpeakAnno)
packagePath <- system.file("extdata", "filtered", package = "ChIPseqStepByStep")
filtedFiles <- dir(packagePath, ".bed")
GSE66081 <- sapply(filtedFiles, function(.ele){
toGRanges(file.path(packagePath, .ele), format="BED")
})
names(GSE66081) <- sub(".bed", "", names(GSE66081))
lengths(GSE66081)
ol <- findOverlapsOfPeaks(GSE66081)
names(ol)
names(ol$peaklist)
makeVennDiagram(ol)
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=4, fig.height=4}
makeVennDiagram(ol)
```
***
We want to confirm that not only the peaks are overlapped but also their summits are close to each other.
```{r eval=FALSE}
peaks.summit <- lapply(peaks, function(.ele) {
.ele <- shift(.ele, .ele$peak)
width(.ele) <- 1
.ele
})
TEAD.summit <- unlist(GRangesList(peaks.summit[new.group=="TEAD4"]))
TAZ.summit <- unlist(GRangesList(peaks.summit[new.group=="TAZ"]))
bof <- binOverFeature(TEAD.summit,
annotationData=TAZ.summit,
radius=300, nbins=300, FUN=length)
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=5, fig.height=5}
plot(as.numeric(rownames(bof)), bof,
ylab="Peak density",
xlab="Distance to the summit of TAZ peaks",
main="Position of TEAD4 peak summit",
type="l", xlim=c(-300, 300))
```
## Steps of annotation by `ChIPpeakAnno`
The functions, `toRanges`, `annotatePeakInBatch`, and `addGeneIDs` in the ChIPpeakAnno,
the annotation of ChIP-Seq peaks becomes streamlined into four major steps:
1. Read peak data with `toGRanges`
2. Generate annotation data with `toGRanges`
3. Annotate peaks with `annotatePeakInBatch`
4. Add additional informations with `addGeneIDs`
## Prepare annotation data
The method `toGRanges` can be used to construct **GRanges** object from various
annotation format such as BED, GFF, user defined readable text files,
**EnsDb** or **TxDb** object.
Please type `?toGRanges` for more information.
```{r eval=FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")
```
```{r eval=TRUE}
annoData[1]
```
***
Exercise: Prepare annotation data from Ensembl/UCSC based annotation package
- Ensembl Archives: http://www.ensembl.org/info/website/archives/index.html
- UCSC release numbers: https://genome.ucsc.edu/FAQ/FAQreleases.html
```{r}
library(EnsDb.Hsapiens.v75)
toGRanges(EnsDb.Hsapiens.v75)
toGRanges(EnsDb.Hsapiens.v75, feature = "transcript")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene)
toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature = "geneModel")
toGRanges(TxDb.Hsapiens.UCSC.hg19.knownGene, feature = "threeUTR")
```
## Annotate by `annotatePeakInBatch`
Then we use _annotatePeakInBatch_ or _annoPeaks_ function in **[ChIPpeakAnno](http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html)** package to annotate the peaks. Depend on the experiment, we can annotate the peaks in multiple ways.
![annotatePeakInBatch](figure/annotatePeakInBatch.png)
***
```{r eval=FALSE}
overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
## annotate the peaks by nearest annotations.
YAP.TAZ.TEAD4.nearest <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData)
## annotate the peaks by both side in a given range.
YAP.TAZ.TEAD4.2side <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData,
output="nearestBiDirectionalPromoters",
bindingRegion=c(-100000, 100000))
```
Exercise: annotatePeakInBatch
```{r}
overlaps <- ol$peaklist[["TAZ///TEAD4///YAP"]]
YAP.TAZ.TEAD4.nearest <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData)
YAP.TAZ.TEAD4.promoter <-
annotatePeakInBatch(overlaps,
AnnotationData=annoData,
output="nearestBiDirectionalPromoters",
bindingRegion=c(-10000, 2000))
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=4.5, fig.height=4.5}
pie1(table(YAP.TAZ.TEAD4.2side$insideFeature))
```
## Add gene symbols by `addGeneIDs`
```{r eval=TRUE}
library(org.Hs.eg.db)
YAP.TAZ.TEAD4.nearest.anno <-
addGeneIDs(annotatedPeak = YAP.TAZ.TEAD4.nearest,
orgAnn = org.Hs.eg.db,
feature_id_type = "entrez_id")
YAP.TAZ.TEAD4.nearest.anno[1:3]
```
***
Exercise: addGeneIDs
```{r}
library(org.Hs.eg.db)
YAP.TAZ.TEAD4.promoter.anno <-
addGeneIDs(annotatedPeak = YAP.TAZ.TEAD4.promoter,
orgAnn = org.Hs.eg.db,
feature_id_type = "entrez_id")
```
## Save annotation results
The annotation results can be saved in XLS file format using **[WriteXLS](https://cran.r-project.org/web/packages/WriteXLS/index.html)** package to avoid the gene name errors that can be inadvertently introduced when opened by Excel ([Zeeberg et al., 2004](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80)).
```{r eval=FALSE}
## save annotations
YAP.TAZ.TEAD4.2side.m <- as.data.frame(unname(YAP.TAZ.TEAD4.2side))
YAP.TAZ.TEAD4.2side.m$peakNames <- NULL
library(WriteXLS)
WriteXLS("YAP.TAZ.TEAD4.2side.m",
"YAP.TAZ.TEAD4.overlapping.peaks.anno.xls")
```
![ErrInExcel](figure/GeneSymbolErr.jpg)
## Enrichment analysis
With the annotated peak data, you can call the function `getEnrichedGO` to obtain a list of enriched GO terms. Please note that the default setting of `feature_id_type` is "ensembl\_gene\_id". If you are using _TxDb_ as annotation data, you need to change it to "entrez\_id".
```{r eval=FALSE}
## enrichment analysis
library(org.Hs.eg.db)
over <- getEnrichedGO(YAP.TAZ.TEAD4.2side,
orgAnn="org.Hs.eg.db",
feature_id_type="entrez_id",
maxP=.01, condense=TRUE)
```
```{r eval=TRUE, message=FALSE, warning=FALSE}
over[["bp"]][1:3, -c(3, 10)]
dim(over[["bp"]])
```
***
For pathway analysis, you can call function `getEnrichedPATH` using reactome or KEGG database.
```{r eval=FALSE}
library(reactome.db)
path <- getEnrichedPATH(YAP.TAZ.TEAD4.2side,
orgAnn="org.Hs.eg.db",
pathAnn="reactome.db",
feature_id_type="entrez_id",
maxP=.05)
```
```{r eval=TRUE, message=FALSE, warning=FALSE}
path[1:2, ]
```
## Bidirectional promoters
Bidirectional promoters are the DNA regions located between the 5' ends of two adjacent genes coded on opposite strands. The two adjacent genes are transcribed to the opposite directions, and often co-regulated by this shared promoter region(Robertson et al., 2007). Here is an example to find peaks with bi-directional promoters and output the percentage of the peaks near bi-directional promoters.
```{r eval=FALSE}
bdp <- peaksNearBDP(overlaps, annoData, MaxDistance=100000)
```
```{r eval=TRUE, message=FALSE, warning=FALSE}
bdp
```
## Check distribution of peaks
The distribution of peaks over exon, intron, enhancer, proximal promoter, 5'UTR and 3'UTR could give you some clues of how to annotate the peaks. The distribution can be summarized in the peak centric or nucleotide centric view using the function _assignChromosomeRegion_ in [ChIPpeakAnno](http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html) package.
```{r eval=FALSE}
seqlevelsStyle(overlaps) ## == UCSC
aCR<-assignChromosomeRegion(overlaps, nucleotideLevel=FALSE,
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=9, fig.height=4.5}
par(mar=c(10.1, 4.1, 4.1, 2.1))
barplot(aCR$percentage, las=3)
```
## Discover the binding pattern
Owing to most of the peaks are in the intergenic regions, we questioned whether most YAP/TAZ/TEAD common peaks are located in enhancers. Enhancers can be distinguished from promoters by their epigenetic features, that is, relative enrichment of histone H3 monomethylation (H3K4me1) on Lys4 at enhancers, and trimethylation (H3K4me3) at promoters. We first determine the distribution of the distance between the peaks and known genes. And then check the correlation of peaks and enhancers.
***
```{r eval=FALSE}
dist2TSS <-
lapply(list(YAP=GSE66081[["YAP"]],
TAZ=GSE66081[["TAZ"]],
TEAD4=GSE66081[["TEAD4"]],
YAP.TAZ.TEAD4=ol$peaklist[["TAZ///TEAD4///YAP"]]),
function(.ele) {
.ele <- annotatePeakInBatch(.ele,
AnnotationData=annoData,
output="nearestLocation")
.ele$shortestDistance})
dist2TSS.cut <- lapply(dist2TSS, cut, breaks=c(0, 1e3, 1e4, 1e5, 1e10))
dist2TSS.table <- sapply(dist2TSS.cut, table)
dist2TSS.percentage <- apply(dist2TSS.table, 2,
function(.ele) .ele/sum(.ele))
library(ggplot2)
library(reshape2)
dist2TSS.percentage <- melt(dist2TSS.percentage)
dist2TSS.percentage$value <- dist2TSS.percentage$value * 100
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=12, fig.height=8}
ggplot(dist2TSS.percentage, aes(x=Var2, y=value, fill=Var1)) +
geom_bar(stat="identity") +
xlab("") + ylab("%") + labs(title="Distance to TSS") + theme_bw() +
scale_fill_manual(name="",
values=c("#FF0000FF", "#FF0000BB",
"#FF000077", "#FF000033"),
labels=c("<1kb", "1-10kb", "10-100kb", ">100kb"))
```
***
Because over 70% of the peaks are far from known TSS (more than 10Kb), we want to confirm the peaks could be linked to active enhancer region. The ChIP-seq data of epigenetic marks of same cell line are availble in GSE49651.
We use **[GEOquery](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html)** package to download data from GEO. We can also download the data by **[SRAdb](http://bioconductor.org/packages/release/bioc/html/SRAdb.html)** package and call peaks from raw reads. Here, we use the peaks saved in the GSE49651.
```{r eval=FALSE}
library(GEOquery)
getGEOSuppFiles("GSE49651")
epipeaks.files <- dir("GSE49651", full.names=TRUE)
library(ChIPpeakAnno)
epipeaks <- lapply(epipeaks.files[3:5],
function(.ele) toGRanges(gzfile(.ele), format="BED"))
names(epipeaks) <-
gsub("GSE49651_MDAMB231.(.*?).hg19.*._peaks.txt.gz", "\\1",
basename(epipeaks.files)[3:5])
```
***
The presence of H3K4me1 and H3K4me3 peaks, their genomic locations and their overlap can be used as the criteria to define promoters and enhancers. H3K4me3 peaks not overlapping with H3K4me1 peaks and close to a TSS (5kb) are defined as promoters, as NA otherwise; H3K4me1 peaks not overlapping with H3K4me3 peaks are defined as enhancers. Promoters or enhancers are defined as active if overlapping with H3K27ac peaks.
![histoneModification](figure/histModify.png)
(Modified from [Nature Cell Biology, doi:10.1038/ncb3216, F. Zanconato](https://www.nature.com/ncb/journal/v17/n9/full/ncb3216.html))
***
```{r eval=FALSE}
promoter.UCSC <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene, 5e3, 5e3)
promoter.UCSC <- unique(promoter.UCSC)
attach(epipeaks)
ol.promoter <- findOverlapsOfPeaks(H3K4me1, H3K4me3, H3K27Ac,
promoter.UCSC,
ignore.strand = FALSE)
promoter <- c(ol.promoter$peaklist[["H3K4me3///H3K27Ac///promoter.UCSC"]],
ol.promoter$peaklist[["H3K4me1///H3K4me3///H3K27Ac///promoter.UCSC"]])
enhancer.active <- ol.promoter$peaklist[["H3K4me1///H3K27Ac"]]
enhancer.inactive <- ol.promoter$peaklist[["H3K4me1"]]
mcols(enhancer.active) <- DataFrame(type="enhancer.active")
mcols(enhancer.inactive) <- DataFrame(type="enhancer.inactive")
mcols(promoter) <- DataFrame(type="promoter")
types <- c(promoter, enhancer.active, enhancer.inactive)
YAP.TAZ.TEAD <- ol$peaklist[["TAZ///TEAD4///YAP"]]
seqlevelsStyle(YAP.TAZ.TEAD) <- "UCSC"
YAP.TAZ.TEAD <- YAP.TAZ.TEAD[seqnames(YAP.TAZ.TEAD) %in%
seqlevels(types)]
YAP.TAZ.TEAD4.type <- findOverlaps(YAP.TAZ.TEAD, types)
tbl <- table(types[subjectHits(YAP.TAZ.TEAD4.type)]$type)
tbl["not.classified"] <- length(YAP.TAZ.TEAD) -
length(unique(queryHits(YAP.TAZ.TEAD4.type)))
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=4.5, fig.height=4.5}
pie1(tbl)
```
***
The binding pattern could be visualized and compared by heatmap and distribution curve from the binding ranges of overlapping peaks of target TFs. For big bedgraph files, [bedtools](http://bedtools.readthedocs.org/en/latest/) are used to decrease the file size before importing into R.
```{r eval=FALSE}
##heatmap
YAP.TAZ.TEAD.assigned <- YAP.TAZ.TEAD[queryHits(YAP.TAZ.TEAD4.type)]
YAP.TAZ.TEAD.assigned$type <- types[subjectHits(YAP.TAZ.TEAD4.type)]$type
YAP.TAZ.TEAD.assigned <- c(YAP.TAZ.TEAD.assigned[YAP.TAZ.TEAD.assigned$type=="enhancer.active"], YAP.TAZ.TEAD.assigned[YAP.TAZ.TEAD.assigned$type=="promoter"])
YAP.TAZ.TEAD.assigned <- unique(YAP.TAZ.TEAD.assigned)
library(rtracklayer)
YAP.TAZ.TEAD.assigned.center <-
reCenterPeaks(YAP.TAZ.TEAD.assigned, width=1)
YAP.TAZ.TEAD.assigned.reCenter <-
reCenterPeaks(YAP.TAZ.TEAD.assigned, width=2000)
untar("GSE49651/GSE49651_RAW.tar")
sapply(c("GSM1204470_MDAMB231.H3K4me1_1.hg19.tags.bedGraph.gz",
"GSM1204472_MDAMB231.H3K4me3_1.hg19.tags.bedGraph.gz",
"GSM1204474_MDAMB231.H3K27Ac_1.hg19.tags.bedGraph.gz"), gunzip)
bams <- c("YAP", "TAZ", "TEAD4", "JUN")
library(BSgenome.Hsapiens.UCSC.hg19)
len <- seqlengths(Hsapiens)
write.table(len, file="hg19.genome.txt",
quote=FALSE, col.names=FALSE,
sep="\t")
sapply(bams, function(.ele){
system(paste0("genomeCoverageBed -bga -split -ibam ",
.ele, "_rep1.rmdup.bam -g hg19.genome.txt > ",
.ele, "_rep1.bedGraph"))
})
files <- c(YAP="YAP_rep1.bedGraph",
TAZ="TAZ_rep1.bedGraph",
TEAD4="TEAD4_rep1.bedGraph",
JUN="JUN_rep1.bedGraph",
H3K4me1="GSM1204470_MDAMB231.H3K4me1_1.hg19.tags.bedGraph",
H3K4me3="GSM1204472_MDAMB231.H3K4me3_1.hg19.tags.bedGraph",
H3K27Ac="GSM1204474_MDAMB231.H3K27Ac_1.hg19.tags.bedGraph")
export(YAP.TAZ.TEAD.assigned.reCenter, "tmp.bed", format="BED")
sapply(files, function(.ele)
system(paste("intersectBed -a", .ele, "-b tmp.bed >",
gsub("bedGraph", "sub.bedGraph", .ele))))
unlink("tmp.bed")
library(trackViewer)
cvglists <- sapply(gsub("bedGraph", "sub.bedGraph", files),
importData, format="bedGraph")
names(cvglists) <- names(files)
sig <- featureAlignedSignal(cvglists, YAP.TAZ.TEAD.assigned.center,
upstream=1000, downstream=1000)
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=6, fig.height=6}
names(sig)
heatmap <- featureAlignedHeatmap(sig, YAP.TAZ.TEAD.assigned.center,
upstream=1000, downstream=1000,
annoMcols=c("type"),
margin=c(.1, .01, .15, .25))
```
## Summarize the consensus
We first get the sequences of the peaks and then summarize the enriched oligos.
```{r eval=FALSE}
library(BSgenome.Hsapiens.UCSC.hg19)
YAP.TAZ.TEAD4.uniq <- unique(YAP.TAZ.TEAD4.nearest)
YAP.TAZ.TEAD4.uniq <- YAP.TAZ.TEAD4.uniq[!is.na(YAP.TAZ.TEAD4.uniq$feature)]
strand(YAP.TAZ.TEAD4.uniq) <- as.character(YAP.TAZ.TEAD4.uniq$feature_strand)
seq <- getAllPeakSequence(YAP.TAZ.TEAD4.uniq, upstream=0, downstream=0,
genome=Hsapiens)
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3,
quickMotif=TRUE, freqs=freqs, revcomp=TRUE)
zscore <- sort(os$zscore)
```
***
```{r eval=TRUE, message=FALSE, warning=FALSE}
h <- hist(zscore, breaks=100)
text(zscore[length(zscore)], max(h$counts)/10,
labels=names(zscore[length(zscore)]), adj=1)
```
***
```{r eval=FALSE}
library(motifStack)
pfms <- mapply(function(.ele, id)
new("pfm", mat=.ele, name=paste("SAMPLE motif", id)),
os$motifs, 1:length(os$motifs))
```
```{r eval=TRUE, message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
motifStack(pfms)
```
***
Exercise: Try to run oligoSummary by full dataset
```{r}
library(BSgenome.Hsapiens.UCSC.hg19)
YAP.TAZ.TEAD4.uniq <- unique(YAP.TAZ.TEAD4.nearest)
YAP.TAZ.TEAD4.uniq <- YAP.TAZ.TEAD4.uniq[!is.na(YAP.TAZ.TEAD4.uniq$feature)]
strand(YAP.TAZ.TEAD4.uniq) <- as.character(YAP.TAZ.TEAD4.uniq$feature_strand)
seq <- getAllPeakSequence(YAP.TAZ.TEAD4.uniq, upstream=0, downstream=0, genome=Hsapiens)
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3,
quickMotif=TRUE, freqs=freqs, revcomp=TRUE)
zscore <- sort(os$zscore)
h <- hist(zscore, breaks=100)
text(zscore[length(zscore)], max(h$counts)/10,
labels=names(zscore[length(zscore)]), adj=1)
library(motifStack)
pfms <- mapply(function(.ele, id)
new("pfm", mat=.ele, name=paste("SAMPLE motif", id)),
os$motifs, 1:length(os$motifs))
motifStack(pfms)
```
The _oligoSummary_ is a quick tool to calculate the enriched k-mers in the peaks. To get accurate motifs, multiple softwares could be used, such as [the MEME Suite](http://meme-suite.org/doc/meme.html?man_type=web)(Bailey et al., 1994), [Consensus](http://stormo.wustl.edu/consensus/html/Html/main.html)(Hertz et al., 1990), [rGADEM](https://www.bioconductor.org/packages/release/bioc/html/rGADEM.html), [Homer](http://homer.salk.edu/homer/), and et. al.
***
```{r eval=FALSE}
ap1 <- GSE66081[["JUN"]]
seq.ap1 <- getAllPeakSequence(ap1, upstream=0, downstream=0, genome=Hsapiens)
os.ap1 <- oligoSummary(seq.ap1, oligoLength=7, MarkovOrder=3,
quickMotif=FALSE, freqs=freqs, revcomp=TRUE)
zscore.ap1 <- sort(os.ap1$zscore)
```
```{r eval=TRUE, fig.width=6, fig.height=4}
h.ap1 <- hist(zscore.ap1, breaks=100)
text(zscore.ap1[length(zscore.ap1)], max(h.ap1$counts)/10,
labels=names(zscore.ap1[length(zscore.ap1)]), adj=1)
```
## Build Regulatory Network from ChIP-seq and Expression Data
Intersecting genes with promoter regions bound by YAP/TAZ from the ChIP experiment and genes differential expressed in knockdown of YAP/TAZ allow us to identify genes directly/indirectly regulated by YAP/TAZ by `GeneNetworkBuilder` package.
Gene expression data are downloaded from GSE59232 by `GEOquery` package.
```{r eval=FALSE}
library(GEOquery)
library(limma)
gset <- getGEO("GSE59232", GSEMatrix =TRUE, AnnotGPL=TRUE)[[1]]
pD <- pData(gset)[, c("title", "geo_accession")]
gset <- gset[, grepl("MDA-MB-231 cells", pD$title, fixed = TRUE)]
pD <- pData(gset)[, c("title", "geo_accession")]
ex <- exprs(gset) ## log2 transformed expression profile
```
***
Affymetrix microarray will be analyzed by `limma` package.
```{r eval=FALSE}
fl <- do.call(rbind,
strsplit(sub("MDA-MB-231 cells ",
"", pD$title), ", "))
sml <- sub("#.$", "", fl[, 1])
design.table <- data.frame(group=sml,
batch=fl[, 1],
replicate=fl[, 2])
design <- model.matrix(~ -1 + group + batch + replicate,
data = design.table)
colnames(design) <- sub("group", "",
make.names(colnames(design)))
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(contrasts = "siYAP.TAZ-siControl",
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
tT <- topTable(fit2, adjust="fdr", number = nrow(fit2))
```
***
Using interaction data from BioGRID saved in `simpIntLists` package, combined with
the direct targets of _TAZ_, `buildNetwork` function will build a regulation network.
And the network will be filtered by `filterNetwork` function by the expression profile.
```{r eval=FALSE}
library(simpIntLists)
interactionmap <- findInteractionList("human", "Official")
interactionmap <- lapply(interactionmap, function(.ele)
cbind(from=as.character(.ele$name),
to=as.character(.ele$interactors)))
interactionmap <- do.call(rbind, interactionmap)
library(GeneNetworkBuilder)
rootgene <- "TAZ"
TFbindingTable <-
cbind(from=rootgene,
to=unique(YAP.TAZ.TEAD4.nearest.anno$symbol))
sifNetwork <-
buildNetwork(TFbindingTable=TFbindingTable,
interactionmap=interactionmap, level=2)
tT$symbols <- tT$Gene.symbol
tT <- tT[, c("ID", "logFC", "P.Value", "symbols")]
expressionData <- tT[!duplicated(tT[, "symbols"]), ]
cifNetwork <-
filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork,
exprsData=expressionData, mergeBy="symbols",
miRNAlist=character(0),
tolerance=1, cutoffPVal=0.001,
cutoffLFC=1.5)
## polish network
gR<-polishNetwork(cifNetwork)
```
***
```{r eval=TRUE, fig.width=5, fig.height=3.5}
## plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
stop("layouttype must be dot, neato, twopi, circo or fdp")
}
g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
renderGraph(g1)
}
plotNetwork(gR)
```
***
```{r eval=TRUE}
browseNetwork(gR)
```
## View tracks by `trackViewer`
The peaks could be visualized by **[trackViewer](http://bioconductor.org/packages/release/bioc/html/trackViewer.html)** package.
```{r eval=FALSE}
library(trackViewer)
## prepare ranges to view
gene <- "SYDE2"
eID <- mget(gene, org.Hs.egSYMBOL2EG)
gr <- as(annoData[eID[[1]]], "GRanges")
gr.promoter <- promoters(gr, upstream=5000, downstream=2000)
seqlevels(gr.promoter) <- "chr1"
seqinfo(gr.promoter) <- seqinfo(gr.promoter)["chr1"]
## import data
bams.rep1 <- sub(".bam", ".rmdup.bam",
bam.files[grepl("rep1", bam.files)])
syde <- lapply(bams.rep1, importBam, ranges=gr.promoter)
names(syde) <- gsub("_rep1.rmdup.bam", "", bams.rep1)
## get gene model
trs <- geneModelFromTxdb(TxDb.Hsapiens.UCSC.hg19.knownGene,
orgDb = org.Hs.eg.db, gr = gr.promoter)
trs.names <- sapply(trs, function(.ele) .ele@name)
trs <- trs[match(gene, trs.names)]
names(trs) <- gene
## modify the plot styles
optSty <- optimizeStyle(trackList(syde, trs, heightDist=c(.9, .1)),
theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
for(i in 1:length(trackList))
setTrackStyleParam(trackList[[i]], "ylabgp", list(cex=.8))
```
***
```{r eval=TRUE, fig.width=12, fig.height=8}
## plot
vp <- viewTracks(trackList, gr=gr.promoter, viewerStyle=viewerStyle)
addGuideLine(guideLine = c(85666950, 85667700), vp = vp)
```
***
Too complicated? Try `browseTracks`
Exercise: browseTracks
```{r}
library(trackViewer)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ChIPseqStepByStep)
(ex.gr <- parse2GRanges("chr1:85664729-85671728:-"))
## get gene model
trs <- geneModelFromTxdb(TxDb.Hsapiens.UCSC.hg19.knownGene,
orgDb = org.Hs.eg.db, gr = ex.gr)
## import signals
packagePath <- system.file("extdata", "bedGraph", package = "ChIPseqStepByStep")
(ex.bg <- dir(packagePath, "bedGraph"))
ex.scores <- sapply(ex.bg, function(.ele){
importScore(file.path(packagePath, .ele),
format = "bedGraph",
ranges = ex.gr)
})
## browse the tracks
browseTracks(trackList(ex.scores, trs, heightDist = c(3/4, 1/4)), gr=ex.gr)
```
## Software availability
This workflow depends on various packages from version 3.5 of the Bioconductor project, running on R version 3.4.1 or higher. Version numbers for all packages used are shown below.
```{r eval=TRUE}
sessionInfo()
```
The command-line tools used in the workflow including SRA Toolkit (version 2.3.4), MACS2 (version 2.1.0) and Bedtools (version 2.25.0) must be installed on the system.
## Acknowledgements
We would like to thank the support from the [Department of Molecular, Cell and Cancer Biology (MCCB) at UMass Medical School (UMMS)](http://umassmed.edu/mccb). We thank the early adopters who provided great ideas and feedbacks to enhance the features of the software.