Structstrings 1.0.3
The Structstrings package implements the widely used dot bracket annotation
to store base pairing information in structured RNA. For example, it is used
in the ViennaRNA package (Lorenz et al. 2011), the tRNAscan-SE software (Lowe and Eddy 1997)
and the tRNAdb (Jühling et al. 2009).
Structstrings uses the infrastructure provided by the
Biostrings package (H. Pagès, P. Aboyoun, R. Gentleman, and S. DebRoy, n.d.) and derives the class
DotBracketString and related classes from the BString class. From these, base
pair tables can be produced for in-depth analysis, for which the
DotBracketDataFrame class is derived from the DataFrame class. In addition,
the loop indices of the base pairs can be retrieved as a LoopIndexList, a
derivative of the IntegerList class. Generally, all classes check automatically
for the validity of the base pairing information.
The conversion of the DotBracketString to the base pair table and the loop
indices is implemented in C for efficiency. The C implementation is to a large
extent inspired by the ViennaRNA package.
This package was developed as an improvement for the tRNA package. However,
other projects might benefit as well, so it was split off and improved upon.
The package is installed from Bioconductor and loaded.
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Structstrings")
library(Structstrings)
DotBracketString objects can be created from character as any other XString.
The validity of the structure information is checked upon creation or
modification of the object.
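The kind of check involved can be illustrated in base R. The following is only a minimal sketch, not the package's own validation code: the helper `is_balanced()` is hypothetical and assumes that each of the four bracket types is checked independently of the others.

```r
# Hypothetical helper, not part of Structstrings: returns TRUE if, for every
# bracket type, each opening bracket has a matching closing bracket.
is_balanced <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  open  <- c("(", "<", "[", "{")
  close <- c(")", ">", "]", "}")
  for (k in seq_along(open)) {
    # +1 for an opening bracket, -1 for a closing one, 0 otherwise
    depth <- cumsum((chars == open[k]) - (chars == close[k]))
    # the depth must never drop below zero and must end at zero
    if (any(depth < 0) ||
        (length(depth) > 0 && depth[length(depth)] != 0)) {
      return(FALSE)
    }
  }
  TRUE
}

is_balanced("((((....))))")  # TRUE
is_balanced("((((....)))")   # FALSE, one opening bracket is unmatched
```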
# Hairpin with 4 base pairs
dbs <- DotBracketString("((((....))))")
dbs
##   12-letter "DotBracketString" instance
## seq: ((((....))))
# a StringSet with four hairpin structures, which are all equivalent
dbs <- DotBracketStringSet(c("((((....))))",
                             "<<<<....>>>>",
                             "[[[[....]]]]",
                             "{{{{....}}}}"))
dbs
##   A DotBracketStringSet instance of length 4
##     width seq
## [1]    12 ((((....))))
## [2]    12 <<<<....>>>>
## [3]    12 [[[[....]]]]
## [4]    12 {{{{....}}}}
# StringSetList for storing even more structure annotations
dbsl <- DotBracketStringSetList(dbs,rev(dbs))
dbsl
## DotBracketStringSetList of length 2
## [[1]] ((((....)))) <<<<....>>>> [[[[....]]]] {{{{....}}}}
## [[2]] {{{{....}}}} [[[[....]]]] <<<<....>>>> ((((....))))
# invalid structure
DotBracketString("((((....)))")
## Error in validObject(from): invalid class "DotBracketString" object: 
## Following structures are invalid:
## '1'.
## They contain unmatched positions.
Annotations can be converted using the convertAnnotation function.
dbs[[2]] <- convertAnnotation(dbs[[2]],from = 2L, to = 1L)
dbs[[3]] <- convertAnnotation(dbs[[3]],from = 3L, to = 1L)
dbs[[4]] <- convertAnnotation(dbs[[4]],from = 4L, to = 1L)
# Note: convertAnnotation checks for presence of annotation and stops
# if there is any conflict.
dbs
##   A DotBracketStringSet instance of length 4
##     width seq
## [1]    12 ((((....))))
## [2]    12 ((((....))))
## [3]    12 ((((....))))
## [4]    12 ((((....))))
The dot bracket annotation can be turned into a base pairing table, which allows
the base pairing information to be queried more easily. For example, the tRNA
package uses this to identify the structural elements of tRNAs.
For this purpose the class DotBracketDataFrame is derived from DataFrame.
This special DataFrame can only contain 5 columns: pos, forward, reverse,
character and base. The first three are obligatory, whereas the last two are
optional.
# base pairing table
dbdfl <- getBasePairing(dbs)
dbdfl[[1]]
## DotBracketDataFrame with 12 rows and 4 columns
##           pos   forward   reverse   character
##     <integer> <integer> <integer> <character>
## 1           1         1        12           (
## 2           2         2        11           (
## 3           3         3        10           (
## 4           4         4         9           (
## 5           5         0         0           .
## ...       ...       ...       ...         ...
## 8           8         0         0           .
## 9           9         9         4           )
## 10         10        10         3           )
## 11         11        11         2           )
## 12         12        12         1           )
The types of each column are also fixed as shown in the example above. The fifth
column, not shown above, must be an XStringSet object.
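How such a table relates to the dot bracket string can be sketched in base R with a simple stack. This is an illustration only and not the package's C implementation: `pair_table()` is a hypothetical helper that handles a single bracket type and returns a plain data.frame rather than a DotBracketDataFrame.

```r
# Hypothetical helper, not the package's C code: builds the columns
# pos, forward, reverse and character for one bracket type.
pair_table <- function(x, open = "(", close = ")") {
  chars   <- strsplit(x, "", fixed = TRUE)[[1]]
  n       <- length(chars)
  forward <- integer(n)
  reverse <- integer(n)
  stack   <- integer(0)
  for (i in seq_len(n)) {
    if (chars[i] == open) {
      stack <- c(stack, i)             # remember the opening position
    } else if (chars[i] == close) {
      if (length(stack) == 0L) stop("unmatched closing bracket at ", i)
      j     <- stack[length(stack)]    # most recent unmatched opener
      stack <- stack[-length(stack)]
      forward[c(j, i)] <- c(j, i)      # paired positions store their own index
      reverse[c(j, i)] <- c(i, j)      # and the index of their partner
    }
  }
  if (length(stack) > 0L) stop("unmatched opening bracket at ", stack[1])
  data.frame(pos = seq_len(n), forward = forward, reverse = reverse,
             character = chars, stringsAsFactors = FALSE)
}

pt <- pair_table("((((....))))")
pt$forward  # 1 2 3 4 0 0 0 0 9 10 11 12
pt$reverse  # 12 11 10 9 0 0 0 0 4 3 2 1
```

Unpaired positions keep the value 0 in both forward and reverse, as in the table above.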
Additionally, loop indices can be generated for the individual annotation types. This information can also be used to distinguish structure elements.
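One way to think about loop indices is the numbering convention known from ViennaRNA, where every opening bracket starts a new loop and a closing bracket returns to the enclosing one. The base R sketch below follows that convention under the assumption of a valid, single-type input. `loop_indices()` is a hypothetical helper, and the numbering produced by the package may differ in detail for more complex structures.

```r
# Hypothetical helper, assuming a valid structure with one bracket type.
loop_indices <- function(x, open = "(", close = ")") {
  chars   <- strsplit(x, "", fixed = TRUE)[[1]]
  loop    <- integer(length(chars))
  stack   <- integer(0)  # loop ids of the brackets that are currently open
  n_loops <- 0L
  current <- 0L          # 0 marks positions outside of any base pair
  for (i in seq_along(chars)) {
    if (chars[i] == open) {
      n_loops <- n_loops + 1L          # every opening bracket starts a loop
      current <- n_loops
      stack   <- c(stack, current)
    }
    loop[i] <- current
    if (chars[i] == close) {
      stack   <- stack[-length(stack)] # leave the loop that was just closed
      current <- if (length(stack) > 0L) stack[length(stack)] else 0L
    }
  }
  loop
}

loop_indices("((((....))))")  # 1 2 3 4 4 4 4 4 4 3 2 1
```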
loopids <- getLoopIndices(dbs, bracket.type = 1L)
loopids[[1]]
##  [1] 1 2 3 4 4 4 4 4 4 3 2 1
# can also be constructed from DotBracketDataFrame and contains the same 
# information
loopids2 <- getLoopIndices(dbdfl, bracket.type = 1L)
all(loopids == loopids2)
##                     
## TRUE TRUE TRUE TRUE
The dot bracket annotation can be recreated from a DotBracketDataFrame object
with the function getDotBracket(). If the character column is present, this
information is just concatenated and used to create a DotBracketString. If
it is not present or force.bracket is set to TRUE, the dot bracket string
is created from the base pairing information.
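For the second case, the reconstruction from pairing information alone can be sketched in base R. The helper `dot_bracket_from_pairs()` is hypothetical and deliberately simplified: it writes every pair with the () annotation and therefore only covers structures without crossing base pairs.

```r
# Hypothetical helper: rebuilds a single-type dot bracket string from the
# pos and reverse columns of a base pair table.
dot_bracket_from_pairs <- function(pos, reverse) {
  out    <- rep(".", length(pos))
  paired <- reverse > 0
  out[paired & pos < reverse] <- "("  # partner lies downstream: opening
  out[paired & pos > reverse] <- ")"  # partner lies upstream: closing
  paste(out, collapse = "")
}

pos     <- 1:12
reverse <- c(12, 11, 10, 9, 0, 0, 0, 0, 4, 3, 2, 1)
dot_bracket_from_pairs(pos, reverse)  # "((((....))))"
```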
rec_dbs <- getDotBracket(dbdfl)
dbdf <- unlist(dbdfl)
dbdf$character <- NULL
dbdfl2 <- relist(dbdf,dbdfl)
# even if the character column is not set, the dot bracket string can be created
rec_dbs2 <- getDotBracket(dbdfl2)
rec_dbs3 <- getDotBracket(dbdfl, force = TRUE)
rec_dbs[[1]]
##   12-letter "DotBracketString" instance
## seq: ((((....))))
rec_dbs2[[1]]
##   12-letter "DotBracketString" instance
## seq: ((((....))))
rec_dbs3[[1]]
##   12-letter "DotBracketString" instance
## seq: ((((....))))
Please be aware that getDotBracket() might return output that differs from the
original input if the information is converted from a DotBracketString
to a DotBracketDataFrame and back to a DotBracketString. The ()
annotation is used first, followed by <>, [] and {}, in this order.
For a DotBracketString containing only one type of annotation this might not
mean much, except if the character string itself is evaluated. However,
if pseudoloops are present, this will potentially lead to a reformatted and
simplified annotation.
db <- DotBracketString("((((....[[[))))....((((....<<<<...))))]]]....>>>>...")
db
##   52-letter "DotBracketString" instance
## seq: ((((....[[[))))....((((....<<<<...))))]]]....>>>>...
getDotBracket(getBasePairing(db), force = TRUE)
##   52-letter "DotBracketString" instance
## seq: ((((....<<<))))....<<<<....[[[[...>>>>>>>....]]]]...
To store a nucleotide sequence and a structure in one object, the class
StructuredRNAStringSet is implemented.
data("dbs", package = "Structstrings", envir = environment())
data("seq", package = "Structstrings", envir = environment())
sdbs <- StructuredRNAStringSet(seq,dbs)
sdbs[1]
##   A StructuredRNAStringSet instance containing:
## 
##   A RNAStringSet instance of length 1
##     width seq                                            names               
## [1]    72 GGGCGUGUGGUCUAGUGGUAUG...GUUCAAUUCCCAGCUCGCCCC Sequence 1
## 
##   A DotBracketStringSet instance of length 1
##     width seq
## [1]    72 (((((.(..(((.........))).(((((.....)).....(((((.......)))))).))))).
# subsetting to element returns the sequence
sdbs[[1]]
##   72-letter "RNAString" instance
## seq: GGGCGUGUGGUCUAGUGGUAUGAUUCUCGCUUUGGGUGCGAGAGGCCCUGGGUUCAAUUCCCAGCUCGCCCC
# dotbracket() gives access to the DotBracketStringSet
dotbracket(sdbs)[[1]]
##   72-letter "DotBracketString" instance
## seq: (((((.(..(((.........))).(((((.......))))).....(((((.......)))))).))))).
The base pair table can be directly accessed using getBasePairing(). The
base column is automatically populated from the nucleotide sequence. This is a
bit slower than just creating the base pair table. Therefore this step can be
omitted by setting return.sequence to FALSE.
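What the base column adds can be illustrated with a small base R sketch: once the nucleotides are available next to the pairing positions, the identity of each base pair can be read off directly. The sequence below is a hypothetical example chosen for illustration and is not taken from the package data. The reverse vector holds the partner positions of the 12-nucleotide hairpin used earlier.

```r
structure <- "((((....))))"
sequence  <- "GGGCAUAUGCCC"  # hypothetical sequence, for illustration only
reverse   <- c(12, 11, 10, 9, 0, 0, 0, 0, 4, 3, 2, 1)  # partner positions

# one nucleotide per position, analogous to the base column
base <- strsplit(sequence, "", fixed = TRUE)[[1]]
stopifnot(length(base) == nchar(structure), length(base) == length(reverse))

# report each base pair once, from the position of its opening bracket
paired <- which(reverse > 0 & seq_along(reverse) < reverse)
pairs  <- paste0(base[paired], "-", base[reverse[paired]])
pairs  # "G-C" "G-C" "G-C" "C-G"
```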
dbdfl <- getBasePairing(sdbs)
dbdfl[[1]]
## DotBracketDataFrame with 72 rows and 5 columns
##           pos   forward   reverse   character           base
##     <integer> <integer> <integer> <character> <RNAStringSet>
## 1           1         1        71           (              G
## 2           2         2        70           (              G
## 3           3         3        69           (              G
## 4           4         4        68           (              C
## 5           5         5        67           (              G
## ...       ...       ...       ...         ...            ...
## 68         68        68         4           )              G
## 69         69        69         3           )              C
## 70         70        70         2           )              C
## 71         71        71         1           )              C
## 72         72         0         0           .              C
# returns the result without sequence information
dbdfl <- getBasePairing(sdbs, return.sequence = FALSE)
dbdfl[[1]]
## DotBracketDataFrame with 72 rows and 4 columns
##           pos   forward   reverse   character
##     <integer> <integer> <integer> <character>
## 1           1         1        71           (
## 2           2         2        70           (
## 3           3         3        69           (
## 4           4         4        68           (
## 5           5         5        67           (
## ...       ...       ...       ...         ...
## 68         68        68         4           )
## 69         69        69         3           )
## 70         70        70         2           )
## 71         71        71         1           )
## 72         72         0         0           .
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] Structstrings_1.0.3 Biostrings_2.52.0   XVector_0.24.0     
## [4] IRanges_2.18.1      S4Vectors_0.22.0    BiocGenerics_0.30.0
## [7] BiocStyle_2.12.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1                 knitr_1.23                
##  [3] magrittr_1.5               assertive.data.us_0.0-2   
##  [5] zlibbioc_1.30.0            assertive.code_0.0-3      
##  [7] assertive.reflection_0.0-4 stringr_1.4.0             
##  [9] assertive.matrices_0.0-2   tools_3.6.0               
## [11] assertive.properties_0.0-4 xfun_0.7                  
## [13] assertive.models_0.0-2     assertive.files_0.0-2     
## [15] htmltools_0.3.6            yaml_2.2.0                
## [17] digest_0.6.19              assertive.base_0.0-7      
## [19] bookdown_0.11              assertive.data_0.0-3      
## [21] BiocManager_1.30.4         assertive.sets_0.0-3      
## [23] assertive.datetimes_0.0-2  codetools_0.2-16          
## [25] assertive.types_0.0-3      evaluate_0.14             
## [27] rmarkdown_1.13             stringi_1.4.3             
## [29] compiler_3.6.0             assertive.numbers_0.0-2   
## [31] assertive.data.uk_0.0-2    assertive.strings_0.0-3   
## [33] assertive_0.3-5
H. Pagès, P. Aboyoun, R. Gentleman, and S. DebRoy. n.d. “Biostrings.” Bioconductor. https://doi.org/10.18129/B9.bioc.Biostrings.
Jühling, Frank, Mario Mörl, Roland K. Hartmann, Mathias Sprinzl, Peter F. Stadler, and Joern Pütz. 2009. “tRNAdb 2009: Compilation of tRNA Sequences and tRNA Genes.” Nucleic Acids Research 37 (Database issue):D159–62. https://doi.org/10.1093/nar/gkn772.
Lorenz, Ronny, Stephan H. Bernhart, Christian Höner Zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F. Stadler, and Ivo L. Hofacker. 2011. “ViennaRNA Package 2.0.” Algorithms for Molecular Biology : AMB 6:26. https://doi.org/10.1186/1748-7188-6-26.
Lowe, T. M., and S. R. Eddy. 1997. “tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence.” Nucleic Acids Research 25 (5):955–64.