cigarillo 0.99.1
CIGAR stands for Concise Idiosyncratic Gapped Alignment Report. CIGAR strings are typically found in the SAM/BAM files produced by most aligners and in the AIRR-formatted output produced by IgBLAST.
cigarillo is an R/Bioconductor infrastructure package that provides functions to parse and inspect CIGAR strings, trim them, turn them into ranges of positions relative to the “query space” or “reference space”, and project positions or sequences from one space to the other.
Note that these operations are low-level operations that the user rarely needs to perform directly. More typically, they are performed behind the scene by higher-level functionality implemented in other packages like Bioconductor packages GenomicAlignments and igblastr.
Like any Bioconductor package, cigarillo should be installed
with BiocManager::install():
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("cigarillo")
BiocManager::install() will take care of installing the package
dependencies that are missing.
Load the package:
library(cigarillo)
Functions explode_cigar_ops() and explode_cigar_oplens() can
be used to extract the letters (or lengths) of the CIGAR operations
contained in a vector of CIGAR strings:
my_cigars <- c(
"40M2I9M",
"60M",
"3H15M55N4M2I6M2D5M6S",
"50=2X3=1X10=",
"2S10M2000N15M",
"3H33M5H"
)
explode_cigar_ops(my_cigars)
## [[1]]
## [1] "M" "I" "M"
##
## [[2]]
## [1] "M"
##
## [[3]]
## [1] "H" "M" "N" "M" "I" "M" "D" "M" "S"
##
## [[4]]
## [1] "=" "X" "=" "X" "="
##
## [[5]]
## [1] "S" "M" "N" "M"
##
## [[6]]
## [1] "H" "M" "H"
explode_cigar_oplens(my_cigars)
## [[1]]
## [1] 40 2 9
##
## [[2]]
## [1] 60
##
## [[3]]
## [1] 3 15 55 4 2 6 2 5 6
##
## [[4]]
## [1] 50 2 3 1 10
##
## [[5]]
## [1] 2 10 2000 15
##
## [[6]]
## [1] 3 33 5
See ?explode_cigars for more information and additional examples.
Function tabulate_cigar_ops() can be used to count the occurences
of CIGAR operations in a vector of CIGAR strings:
tabulate_cigar_ops(my_cigars)
## M I D N S H P = X
## [1,] 2 1 0 0 0 0 0 0 0
## [2,] 1 0 0 0 0 0 0 0 0
## [3,] 4 1 1 1 1 1 0 0 0
## [4,] 0 0 0 0 0 0 0 3 2
## [5,] 2 0 0 1 1 0 0 0 0
## [6,] 1 0 0 0 0 2 0 0 0
See ?tabulate_cigar_ops for more information and additional examples.
Three functions are provided to calculate the extent of a CIGAR string, that is, the number of positions spanned by the alignment that it describes:
cigar_extent_along_ref() calculates the extent along
the “reference space”.
cigar_extent_along_query() calculates the extent along
the “query space”.
cigar_extent_along_pwa() calculates the extent along
the “pairwise alignment space”.
The three functions are vectorized.
cigar_extent_along_ref(my_cigars)
## [1] 49 60 87 66 2025 33
cigar_extent_along_query(my_cigars)
## [1] 51 60 38 66 27 33
See ?cigar_extent for more information and additional examples.
Functions trim_cigars_along_ref() and trim_cigars_along_query() can
be used to compute the CIGAR string that describes a trimmed alignment,
that is, the portion of the alignment that is left after trimming
it by a given number of positions on its left and/or right ends.
The two functions are vectorized.
cigar1 <- "3H15M55N4M2I6M2D5M6S"
trim_cigars_along_ref(cigar1) # only drops the soft/hard clipping
## [1] "15M55N4M2I6M2D5M"
## attr(,"rshift")
## [1] 0
trim_cigars_along_ref(cigar1, Lnpos=14, Rnpos=16)
## [1] "1M55N1M"
## attr(,"rshift")
## [1] 14
trim_cigars_along_query(cigar1, Lnpos=19, Rnpos=2)
## [1] "3M2I6M2D5M4S"
## attr(,"rshift")
## [1] 71
See ?trim_cigars for more information and additional examples.
Functions cigars_as_ranges_along_ref(), cigars_as_ranges_along_query(),
and cigars_as_ranges_along_pwa(), can be used to turn CIGAR strings into
ranges of positions relative to the “query space”, “reference space”,
and “pairwise alignment space”, respectively:
cigars_as_ranges_along_ref(my_cigars, with.ops=TRUE)
## IRangesList object of length 6:
## [[1]]
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## M 1 40 40
## I 41 40 0
## M 41 49 9
##
## [[2]]
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## M 1 60 60
##
## [[3]]
## IRanges object with 9 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## H 1 0 0
## M 1 15 15
## N 16 70 55
## M 71 74 4
## I 75 74 0
## M 75 80 6
## D 81 82 2
## M 83 87 5
## S 88 87 0
##
## ...
## <3 more elements>
lmmpos <- c(1001, 2001, 3001, 4001, 5001, 6001)
cigars_as_ranges_along_ref(my_cigars, lmmpos=lmmpos, with.ops=TRUE)
## IRangesList object of length 6:
## [[1]]
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## M 1001 1040 40
## I 1041 1040 0
## M 1041 1049 9
##
## [[2]]
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## M 2001 2060 60
##
## [[3]]
## IRanges object with 9 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## H 3001 3000 0
## M 3001 3015 15
## N 3016 3070 55
## M 3071 3074 4
## I 3075 3074 0
## M 3075 3080 6
## D 3081 3082 2
## M 3083 3087 5
## S 3088 3087 0
##
## ...
## <3 more elements>
cigars_as_ranges_along_query(my_cigars, with.ops=TRUE)
## IRangesList object of length 6:
## [[1]]
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## M 1 40 40
## I 41 42 2
## M 43 51 9
##
## [[2]]
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## M 1 60 60
##
## [[3]]
## IRanges object with 9 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## H 1 0 0
## M 1 15 15
## N 16 15 0
## M 16 19 4
## I 20 21 2
## M 22 27 6
## D 28 27 0
## M 28 32 5
## S 33 38 6
##
## ...
## <3 more elements>
See ?cigars_as_ranges for more information and additional examples.
Function query_pos_as_ref_pos() projects positions defined along
the “query space” onto the “reference space”, that is, it turns
them into positions defined along the “reference space”:
my_query_pos <- 1:9
my_cigars <- rep("3M2I2M60N2M", 9)
query_pos_as_ref_pos(my_query_pos, my_cigars, lmmpos=51, narrow.left=TRUE)
## [1] 51 52 53 53 53 54 55 116 117
Function ref_pos_as_query_pos() does the opposite i.e. it projects
positions defined along the “reference space” onto the “query space”.
See ?project_positions for more information and additional examples.
Function project_sequences() can be used to project sequences that
belong to a given projection space (e.g. the “query space”) onto
another projection space (e.g. the “reference space”) by removing/injecting
substrings from/into them, based on their corresponding CIGAR string.
See ?cigar_ops_visibility for the 8 supported projection spaces.
For example, consider the following query and reference sequences:
qseqs <- BStringSet(c(id1="XZ", id2="ABCdeFGKL", id3="MNoPQTUV"))
rseqs <- BStringSet(c(id1="XyZ", id2="ABCFGhijKL", id3="MNPQrsTUV"))
A descent pairwise alignment tool would very likely produce the following output:
aligned_qseqs <- BStringSet(c(id1="X-Z", id2="ABCdeFG...KL", id3="MNoPQ--TUV"))
aligned_rseqs <- BStringSet(c(id1="XyZ", id2="ABC--FGhijKL", id3="MN-PQrsTUV"))
If our pairwise alignment has also the capability to produce CIGAR strings, they would be:
cigars <- c(id1="1M1D1M", id2="3M2I2M3N2M", id3="2M1I2M2D3M")
Having access to the CIGAR strings allows us to infer the aligned sequences from the original ones:
project_sequences(qseqs, cigars, from="query", to="pairwise")
## BStringSet object of length 3:
## width seq names
## [1] 3 X-Z id1
## [2] 12 ABCdeFG...KL id2
## [3] 10 MNoPQ--TUV id3
project_sequences(rseqs, cigars, from="reference", to="pairwise")
## BStringSet object of length 3:
## width seq names
## [1] 3 XyZ id1
## [2] 12 ABC--FGhijKL id2
## [3] 10 MN-PQrsTUV id3
and vice versa:
project_sequences(aligned_qseqs, cigars, from="pairwise", to="query")
## BStringSet object of length 3:
## width seq names
## [1] 2 XZ id1
## [2] 9 ABCdeFGKL id2
## [3] 8 MNoPQTUV id3
project_sequences(aligned_rseqs, cigars, from="pairwise", to="reference")
## BStringSet object of length 3:
## width seq names
## [1] 3 XyZ id1
## [2] 10 ABCFGhijKL id2
## [3] 9 MNPQrsTUV id3
We can also project the original query sequences onto the “reference space”:
project_sequences(qseqs, cigars, from="query", to="reference")
## BStringSet object of length 3:
## width seq names
## [1] 3 X-Z id1
## [2] 10 ABCFG...KL id2
## [3] 9 MNPQ--TUV id3
or project the original reference sequences onto the “query space”:
project_sequences(rseqs, cigars, from="reference", to="query")
## BStringSet object of length 3:
## width seq names
## [1] 2 XZ id1
## [2] 9 ABC--FGKL id2
## [3] 8 MN-PQTUV id3
See ?project_ranges for more information and additional examples.
Here is the output of sessionInfo() on the system on which this document
was compiled:
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] cigarillo_0.99.1 Biostrings_2.77.2 Seqinfo_0.99.3
## [4] XVector_0.49.1 IRanges_2.43.6 S4Vectors_0.47.5
## [7] BiocGenerics_0.55.4 generics_0.1.4 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 cli_3.6.5 knitr_1.50
## [4] rlang_1.1.6 xfun_0.53 jsonlite_2.0.0
## [7] htmltools_0.5.8.1 sass_0.4.10 rmarkdown_2.30
## [10] evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
## [13] yaml_2.3.10 lifecycle_1.0.4 bookdown_0.45
## [16] BiocManager_1.30.26 compiler_4.5.1 digest_0.6.37
## [19] R6_2.6.1 bslib_0.9.0 tools_4.5.1
## [22] cachem_1.1.0