scPloidy is a package to compute ploidy of single cells
(or nuclei) based on single-cell (or single-nucleus) ATAC-seq data. In
ATAC-seq, open chromatin regions are excised and sequenced. For any site
on the genome, ATAC-seq could read 0, 1 or 2 DNA fragments, if the cell
was diploid. If the cell was tetraploid, ATAC-seq could read 0, 1, 2, 3
or 4 fragments from the same site. This is the basic idea used in
scPloidy. We model the depth of DNA sequencing at one site
by binomial distribution.
library(scPloidy)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:dplyr':
#> 
#>     combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:gplots':
#> 
#>     space
#> The following object is masked from 'package:tidyr':
#> 
#>     expand
#> The following objects are masked from 'package:dplyr':
#> 
#>     first, rename
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#> 
#>     collapse, desc, slice
#> Loading required package: GenomeInfoDb
library(IRanges)
library(readr)See description.
In this section, we demonstrate the package by using a dataset included in the package. See next section on how to analyze your own data.
We use small dataset for single-nucleus ATAC-seq of rat liver. To limit the file size, it only includes 10 cells and fragments from chromosomes 19 and 20. The fragments were mapped to the rat rn7 genome.
We first set the regions where overlaps are counted. This is usually all of the autosomes.
targetregions =
  GenomicRanges::GRanges(
    c("chr19", "chr20"),
    IRanges::IRanges(c(1, 1), width = 500000000))Simple repeats in the genome can generate false overlaps. We exclude such regions. The regions were downloaded from USCS genome browser.
simpleRepeat = readr::read_tsv(
  system.file("extdata", "simpleRepeat.chr19_20.txt.gz", package = "scPloidy"),
  col_names = c("chrom", "chromStart", "chromEnd"))
#> Rows: 107037 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): chrom
#> dbl (2): chromStart, chromEnd
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
simpleRepeat[, 2] = simpleRepeat[, 2] + 1 # convert from 0-based position to 1-based
simpleRepeat = GenomicRanges::makeGRangesFromDataFrame(
  as.data.frame(simpleRepeat),
  seqnames.field = "chrom",
  start.field    = "chromStart",
  end.field      = "chromEnd")Now compute the overlaps. The input file
SHR_m154211.10cells.chr19_20.fragments.txt.gz records the
fragments sequenced in ATAC-seq.
fragmentoverlap =
  fragmentoverlapcount(
    system.file("extdata", "SHR_m154211.10cells.chr19_20.fragments.txt.gz", package = "scPloidy"),
    targetregions,
    excluderegions = simpleRepeat,
    Tn5offset = c(4, -5))
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
fragmentoverlap
#> # A tibble: 10 × 8
#>    barcode     nfrags depth1 depth2 depth3 depth4 depth5 depth6
#>    <chr>        <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1 BC00019_N01   6501   6288    201      9      2      1      0
#>  2 BC00022_N01   8006   7689    310      6      1      0      0
#>  3 BC00025_N01   5904   5767    133      3      1      0      0
#>  4 BC00026_N01   5257   5102    148      6      1      0      0
#>  5 BC00035_N01   6360   6198    158      4      0      0      0
#>  6 BC00055_N01   6271   6084    181      6      0      0      0
#>  7 BC00057_N01   5570   5413    152      4      1      0      0
#>  8 BC00077_N01   5997   5795    191     10      1      0      0
#>  9 BC00086_N01   5416   5261    151      4      0      0      0
#> 10 BC00087_N01   3470   3409     60      1      0      0      0
rm(fragmentoverlap)Instead of the small dataset computed above, we here use a complete dataset for single-nucleus ATAC-seq of a rat liver.
data(SHR_m154211)
?SHR_m154211
fragmentoverlap = SHR_m154211$fragmentoverlap
fragmentoverlap
#> # A tibble: 3,572 × 8
#>    barcode     nfrags depth1 depth2 depth3 depth4 depth5 depth6
#>    <chr>        <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1 BC00025_N01 129201 126908   2237     53      3      0      0
#>  2 BC00026_N01 130564 127401   3065     91      6      1      0
#>  3 BC00035_N01 123100 120093   2940     66      1      0      0
#>  4 BC00057_N01 110602 108061   2472     67      2      0      0
#>  5 BC00086_N01 110108 107374   2667     66      1      0      0
#>  6 BC00087_N01 108661 106717   1908     33      2      1      0
#>  7 BC00099_N01 107028 104685   2264     76      3      0      0
#>  8 BC00106_N01 100659  97576   2985     92      6      0      0
#>  9 BC00115_N01 105057 102889   2104     64      0      0      0
#> 10 BC00116_N01 105085 103519   1535     31      0      0      0
#> # ℹ 3,562 more rowsCompute ploidy, assuming a mixture of 2x (diploid), 4x (tetraploid)
and 8x cells. We recommend using ploidy.moment which is
based on moment method.
p = ploidy(fragmentoverlap,
           c(2, 4, 8))
#> number of iterations= 178
head(p)
#>       barcode ploidy.moment ploidy.momentfractional ploidy.kmeans ploidy.em
#> 1 BC00025_N01             8                5.706704             4         4
#> 2 BC00026_N01             8                7.485560             8         8
#> 3 BC00035_N01             8                7.391306             4         4
#> 4 BC00057_N01             8                7.083094             8         8
#> 5 BC00086_N01             8                7.522005             8         4
#> 6 BC00087_N01             8                5.703602             4         4The cell type of the cells are stored in dataframe
cells. There are many hepatocytes of 4x and 8x, but other
cell types are mostly 2x.
In the Cell Ranger output directory, you should have files
fragments.tsv.gz and fragments.tsv.gz.tbi. The
file fragments.tsv.gz can be processed with the
fragmentoverlapcount function, specifying the option
Tn5offset = c(1, 0)
Although this requires several steps, you can start from a BAM file
and generate fragments file for scPloidy. You need to
install samtools, bgzip and
tabix, and run the following in your shell.
First generate a BAM file in which the cell barcode is prepended to
read name, separated by ‘:’. For example, suppose in your input file
bap.bam your barcode BCxxxxxx was stored in
the field with DB tag as
DB:Z:atac_possorted_bam_BCxxxxxx. We generate a BAM file
snap.bam.
# extract the header file
samtools view bap.bam -H > header.sam
# create a bam file with the barcode embedded into the read name
cat <( cat header.sam ) \
<( samtools view bap.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^DB:Z:atac_possorted_bam_BC"){ td[substr($i,1,2)] = substr($i,25,length($i)-24); } }; printf "%s:%s\n", td["DB"], $0 }' ) \
| samtools view -bS - > snap.bamWe next sort the reads by barcode, and obtain the file
snap.nsrt.bam.
Finally, we extract fragments from the name-sorted BAM file, and
obtain the file fragments.txt.gz.
samtools view -f 0x2 -F 0x900 -q 30 snap.nsrt.bam  \
  | samtofragmentbed.pl  \
  | perl -ne 'chomp; @a=split; $a[3]=~s/:.*//; print join("\t",@a), "\n"'  \
  | LC_ALL=C sort -T ./ -S 50% --parallel=12 -k1,1 -k2,3n -k4,4 -t$'\t' | uniq  \
  | bgzip > fragments.txt.gz
tabix -b 2 -e 3 fragments.txt.gzThe Perl script samtofragmentbed.pl is included in this
package as this file:
The file fragments.txt.gz can be processed with the
fragmentoverlapcount function, specifying the option
Tn5offset = c(4, -5)