Contents

1 Introduction

CIGAR stands for Concise Idiosyncratic Gapped Alignment Report. CIGAR strings are typically found in the SAM/BAM files produced by most aligners and in the AIRR-formatted output produced by IgBLAST.

cigarillo is an R/Bioconductor infrastructure package that provides functions to parse and inspect CIGAR strings, trim them, turn them into ranges of positions relative to the “query space” or “reference space”, and project positions or sequences from one space to the other.

Note that these operations are low-level operations that the user rarely needs to perform directly. More typically, they are performed behind the scene by higher-level functionality implemented in other packages like Bioconductor packages GenomicAlignments and igblastr.

2 Install and load the package

Like any Bioconductor package, cigarillo should be installed with BiocManager::install():

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("cigarillo")

BiocManager::install() will take care of installing the package dependencies that are missing.

Load the package:

library(cigarillo)

3 A quick overview of the functionality available in the package

3.1 Explode CIGAR strings

Functions explode_cigar_ops() and explode_cigar_oplens() can be used to extract the letters (or lengths) of the CIGAR operations contained in a vector of CIGAR strings:

my_cigars <- c(
    "40M2I9M",
    "60M",
    "3H15M55N4M2I6M2D5M6S",
    "50=2X3=1X10=",
    "2S10M2000N15M",
    "3H33M5H"
)
explode_cigar_ops(my_cigars)
## [[1]]
## [1] "M" "I" "M"
## 
## [[2]]
## [1] "M"
## 
## [[3]]
## [1] "H" "M" "N" "M" "I" "M" "D" "M" "S"
## 
## [[4]]
## [1] "=" "X" "=" "X" "="
## 
## [[5]]
## [1] "S" "M" "N" "M"
## 
## [[6]]
## [1] "H" "M" "H"
explode_cigar_oplens(my_cigars)
## [[1]]
## [1] 40  2  9
## 
## [[2]]
## [1] 60
## 
## [[3]]
## [1]  3 15 55  4  2  6  2  5  6
## 
## [[4]]
## [1] 50  2  3  1 10
## 
## [[5]]
## [1]    2   10 2000   15
## 
## [[6]]
## [1]  3 33  5

See ?explode_cigars for more information and additional examples.

3.2 Tabulate CIGAR operations

Function tabulate_cigar_ops() can be used to count the occurences of CIGAR operations in a vector of CIGAR strings:

tabulate_cigar_ops(my_cigars)
##      M I D N S H P = X
## [1,] 2 1 0 0 0 0 0 0 0
## [2,] 1 0 0 0 0 0 0 0 0
## [3,] 4 1 1 1 1 1 0 0 0
## [4,] 0 0 0 0 0 0 0 3 2
## [5,] 2 0 0 1 1 0 0 0 0
## [6,] 1 0 0 0 0 2 0 0 0

See ?tabulate_cigar_ops for more information and additional examples.

3.3 Calculate the number of positions spanned by a CIGAR

Three functions are provided to calculate the extent of a CIGAR string, that is, the number of positions spanned by the alignment that it describes:

  1. cigar_extent_along_ref() calculates the extent along the “reference space”.

  2. cigar_extent_along_query() calculates the extent along the “query space”.

  3. cigar_extent_along_pwa() calculates the extent along the “pairwise alignment space”.

The three functions are vectorized.

cigar_extent_along_ref(my_cigars)
## [1]   49   60   87   66 2025   33
cigar_extent_along_query(my_cigars)
## [1] 51 60 38 66 27 33

See ?cigar_extent for more information and additional examples.

3.4 Trim CIGAR strings along the reference or query space

Functions trim_cigars_along_ref() and trim_cigars_along_query() can be used to compute the CIGAR string that describes a trimmed alignment, that is, the portion of the alignment that is left after trimming it by a given number of positions on its left and/or right ends. The two functions are vectorized.

cigar1 <- "3H15M55N4M2I6M2D5M6S"

trim_cigars_along_ref(cigar1)  # only drops the soft/hard clipping
## [1] "15M55N4M2I6M2D5M"
## attr(,"rshift")
## [1] 0
trim_cigars_along_ref(cigar1, Lnpos=14, Rnpos=16)
## [1] "1M55N1M"
## attr(,"rshift")
## [1] 14
trim_cigars_along_query(cigar1, Lnpos=19, Rnpos=2)
## [1] "3M2I6M2D5M4S"
## attr(,"rshift")
## [1] 71

See ?trim_cigars for more information and additional examples.

3.5 Turn CIGAR strings into ranges of positions

Functions cigars_as_ranges_along_ref(), cigars_as_ranges_along_query(), and cigars_as_ranges_along_pwa(), can be used to turn CIGAR strings into ranges of positions relative to the “query space”, “reference space”, and “pairwise alignment space”, respectively:

cigars_as_ranges_along_ref(my_cigars, with.ops=TRUE)
## IRangesList object of length 6:
## [[1]]
## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   M         1        40        40
##   I        41        40         0
##   M        41        49         9
## 
## [[2]]
## IRanges object with 1 range and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   M         1        60        60
## 
## [[3]]
## IRanges object with 9 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   H         1         0         0
##   M         1        15        15
##   N        16        70        55
##   M        71        74         4
##   I        75        74         0
##   M        75        80         6
##   D        81        82         2
##   M        83        87         5
##   S        88        87         0
## 
## ...
## <3 more elements>
lmmpos <- c(1001, 2001, 3001, 4001, 5001, 6001)
cigars_as_ranges_along_ref(my_cigars, lmmpos=lmmpos, with.ops=TRUE)
## IRangesList object of length 6:
## [[1]]
## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   M      1001      1040        40
##   I      1041      1040         0
##   M      1041      1049         9
## 
## [[2]]
## IRanges object with 1 range and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   M      2001      2060        60
## 
## [[3]]
## IRanges object with 9 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   H      3001      3000         0
##   M      3001      3015        15
##   N      3016      3070        55
##   M      3071      3074         4
##   I      3075      3074         0
##   M      3075      3080         6
##   D      3081      3082         2
##   M      3083      3087         5
##   S      3088      3087         0
## 
## ...
## <3 more elements>
cigars_as_ranges_along_query(my_cigars, with.ops=TRUE)
## IRangesList object of length 6:
## [[1]]
## IRanges object with 3 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   M         1        40        40
##   I        41        42         2
##   M        43        51         9
## 
## [[2]]
## IRanges object with 1 range and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   M         1        60        60
## 
## [[3]]
## IRanges object with 9 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   H         1         0         0
##   M         1        15        15
##   N        16        15         0
##   M        16        19         4
##   I        20        21         2
##   M        22        27         6
##   D        28        27         0
##   M        28        32         5
##   S        33        38         6
## 
## ...
## <3 more elements>

See ?cigars_as_ranges for more information and additional examples.

3.6 Project positions from query to reference space and vice versa

Function query_pos_as_ref_pos() projects positions defined along the “query space” onto the “reference space”, that is, it turns them into positions defined along the “reference space”:

my_query_pos <- 1:9
my_cigars <- rep("3M2I2M60N2M", 9)
query_pos_as_ref_pos(my_query_pos, my_cigars, lmmpos=51, narrow.left=TRUE)
## [1]  51  52  53  53  53  54  55 116 117

Function ref_pos_as_query_pos() does the opposite i.e. it projects positions defined along the “reference space” onto the “query space”.

See ?project_positions for more information and additional examples.

3.7 Project sequences from one space to the other

Function project_sequences() can be used to project sequences that belong to a given projection space (e.g. the “query space”) onto another projection space (e.g. the “reference space”) by removing/injecting substrings from/into them, based on their corresponding CIGAR string. See ?cigar_ops_visibility for the 8 supported projection spaces.

For example, consider the following query and reference sequences:

qseqs <- BStringSet(c(id1="XZ",  id2="ABCdeFGKL",  id3="MNoPQTUV"))
rseqs <- BStringSet(c(id1="XyZ", id2="ABCFGhijKL", id3="MNPQrsTUV"))

A descent pairwise alignment tool would very likely produce the following output:

aligned_qseqs <- BStringSet(c(id1="X-Z", id2="ABCdeFG...KL", id3="MNoPQ--TUV"))
aligned_rseqs <- BStringSet(c(id1="XyZ", id2="ABC--FGhijKL", id3="MN-PQrsTUV"))

If our pairwise alignment has also the capability to produce CIGAR strings, they would be:

cigars <- c(id1="1M1D1M", id2="3M2I2M3N2M", id3="2M1I2M2D3M")

Having access to the CIGAR strings allows us to infer the aligned sequences from the original ones:

project_sequences(qseqs, cigars, from="query", to="pairwise")
## BStringSet object of length 3:
##     width seq                                               names               
## [1]     3 X-Z                                               id1
## [2]    12 ABCdeFG...KL                                      id2
## [3]    10 MNoPQ--TUV                                        id3
project_sequences(rseqs, cigars, from="reference", to="pairwise")
## BStringSet object of length 3:
##     width seq                                               names               
## [1]     3 XyZ                                               id1
## [2]    12 ABC--FGhijKL                                      id2
## [3]    10 MN-PQrsTUV                                        id3

and vice versa:

project_sequences(aligned_qseqs, cigars, from="pairwise", to="query")
## BStringSet object of length 3:
##     width seq                                               names               
## [1]     2 XZ                                                id1
## [2]     9 ABCdeFGKL                                         id2
## [3]     8 MNoPQTUV                                          id3
project_sequences(aligned_rseqs, cigars, from="pairwise", to="reference")
## BStringSet object of length 3:
##     width seq                                               names               
## [1]     3 XyZ                                               id1
## [2]    10 ABCFGhijKL                                        id2
## [3]     9 MNPQrsTUV                                         id3

We can also project the original query sequences onto the “reference space”:

project_sequences(qseqs, cigars, from="query", to="reference")
## BStringSet object of length 3:
##     width seq                                               names               
## [1]     3 X-Z                                               id1
## [2]    10 ABCFG...KL                                        id2
## [3]     9 MNPQ--TUV                                         id3

or project the original reference sequences onto the “query space”:

project_sequences(rseqs, cigars, from="reference", to="query")
## BStringSet object of length 3:
##     width seq                                               names               
## [1]     2 XZ                                                id1
## [2]     9 ABC--FGKL                                         id2
## [3]     8 MN-PQTUV                                          id3

See ?project_ranges for more information and additional examples.

4 Session information

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] cigarillo_0.99.1    Biostrings_2.77.2   Seqinfo_0.99.3     
## [4] XVector_0.49.1      IRanges_2.43.6      S4Vectors_0.47.5   
## [7] BiocGenerics_0.55.4 generics_0.1.4      BiocStyle_2.37.1   
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.3        cli_3.6.5           knitr_1.50         
##  [4] rlang_1.1.6         xfun_0.53           jsonlite_2.0.0     
##  [7] htmltools_0.5.8.1   sass_0.4.10         rmarkdown_2.30     
## [10] evaluate_1.0.5      jquerylib_0.1.4     fastmap_1.2.0      
## [13] yaml_2.3.10         lifecycle_1.0.4     bookdown_0.45      
## [16] BiocManager_1.30.26 compiler_4.5.1      digest_0.6.37      
## [19] R6_2.6.1            bslib_0.9.0         tools_4.5.1        
## [22] cachem_1.1.0