Contents

# Standard setup chunk
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
# Load libraries required for the vignette to build
library(PMScanR)
library(ggseqlogo)
library(seqinr)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

1 1 Introduction

The PMScanR package provides large-scale identification, analysis and visualization of protein motifs. The package integrates various methods to facilitate motif identification, characterization, and visualization. It includes functions for running PS-Scan, a PROSITE database tool. Additionally, PMScanR supports format conversion to GFF, enhancing downstream analyses such as graphical representation and database integration. The library offers multiple visualization tools, including heatmaps, sequence logos, and pie charts, enabling a deeper understanding of motif distribution and conservation. Through its integration with PROSITE, PMScanR provides access to up-to-date motif data.

Proteins play a crucial role in biological processes, with their functions closely related to structure. Protein functions are often associated with the presence of specific motifs, which are short, sometimes repetitive amino acid sequences essential for distinctive molecular interactions or modifications. Most of the existing bioinformatics tools focus mainly on the identification of known motifs and often do not provide during motif extraction, interactive analysis and visualization tools. Moreover, they do not take into account the effect of single variations on an entire domain or protein motif. These limitations highlight the need for a tool that can automate and scale the analysis. To address this, we have developed PMScanR, an R-based package. Designed to facilitate and automate the prediction and the evaluation of the effect of single amino acid substitutions on the occurrence of protein motifs on a large scale of both motifs and sequences. However, existing tools lacked the capability to perform comparative analysis of multiple motifs across multiple sequences, a gap that PMScanR was particularly developed to fill.

1.1 1.1 Instalation and loading

To install this package, start R (version “4.4” or higher) and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("PMScanR")

library(PMScanR)

2 2 Data manipulation and overall usage

2.1 2.1 GUI

If the user prefers to perform the analysis using a graphical user interface (GUI), they can simply run the function runPMScanRShiny(). This will launch a Shiny app that opens an interactive window. The window can be used both within R and in a web browser, providing a clickable, user-friendly interface that allows the entire analysis, including visualizations, to be carried out without needing to write code. If you want you can use shiny to use all the features of the library with user friendly GUI helping to follow all the steps which are contained in this library.

# This command launches the interactive Shiny app
runPMScanRShiny()

2.2 2.2 Command Line

Alternatively, if the user wishes to work directly with the code, the library provides a set of functions to perform the full analysis, including protein motif identification and visualization. This can be done through an R script, where users can execute and customize the analysis programmatically. Each function included in the package is described below, along with an explanation of its purpose and functionality.

2.2.1 2.2.1 List of functions and their description

The following list of commands provides a step-by-step description of the functions that ensure the complete analysis provided by the PMScanR package.

The first step of the analyses is to establish the working environment through the use of functions:

# Setting working directory is user-specific,
# setwd("disc:/your/path/to/working/directory")

The next step is to load the data for the analysis (more example input files are given in the repository on Github, and below is one example file just from this repository):

# Load example FASTA file included with the package
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")

Once the previous steps have been completed, you can move on to the relevant part of the analysis. Below is a description of the function, together with an example of its use.

2.2.1.1 runPsScan()

This function runs the ps_scan tool from the user’s operating system to find protein motifs. It requires in_file, out_format, and out_file arguments. The function automatically detects the operating system and uses a cached version of the required PROSITE files.

# This command is not evaluated in the vignette as it requires an external
# dependency (Perl) and can be time-consuming.
runPsScan(in_file = fasta_file, out_format = 'gff', out_file = "results_pfscan.gff")

This function is an automatic form of analysis loading, which searches the sequence for protein motifs present on it, but there is also an alternative option where the user can manually point the path to each of the files and the version of the operation system.

After running the scan, the output files can be converted into a GFF-like format for downstream analysis.

2.2.1.2 readPsa() and readProsite()

These functions are used to parse the output files generated by runPsScan. They convert the raw text output into a standardized GFF-like data frame, which is a structured format required by the other analysis and visualization functions in the package.

The following example shows how to parse a sample PSA output file that is included with the PMScanR package.

# Load the path to an example PSA output file
motifs_psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")

# Use readPsa to convert the file to a GFF-like data frame
gff_data <- readPsa(motifs_psa_file)

# Display the first few rows of the resulting data frame
head(gff_data)
##      seqnames source    type start end score strand phase             Name
## 1 NP_000508.1    PSA PS00005    39  41    NA   <NA>    NA PKC_PHOSPHO_SITE
## 2 NP_000508.1    PSA PS00005   138 140    NA   <NA>    NA PKC_PHOSPHO_SITE
## 3 NP_000508.1    PSA PS00006     4   7    NA   <NA>    NA CK2_PHOSPHO_SITE
## 4 NP_000508.1    PSA PS00008    19  24    NA   <NA>    NA         MYRISTYL
## 5 NP_000508.1    PSA PS00009    59  62    NA   <NA>    NA        AMIDATION
## 6 NP_000508.1    PSA PS01033     2 142    NA   <NA>    NA           GLOBIN
##                                                                                                                                              Sequence
## 1                                                                                                                                                 TtK
## 2                                                                                                                                                 TsK
## 3                                                                                                                                                SpaD
## 4                                                                                                                                              GAhaGE
## 5                                                                                                                                                hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
##   Level SequenceDescription width SkipFlag KnownFalsePos RawScore FeatureFrom
## 1  <NA>                  NA     3       NA            NA       NA          NA
## 2  <NA>                  NA     3       NA            NA       NA          NA
## 3  <NA>                  NA     4       NA            NA       NA          NA
## 4  <NA>                  NA     6       NA            NA       NA          NA
## 5  <NA>                  NA     4       NA            NA       NA          NA
## 6     0                  NA   141       NA            NA       NA          NA
##   FeatureTo
## 1        NA
## 2        NA
## 3        NA
## 4        NA
## 5        NA
## 6        NA

Once the analysis output files have been converted to GFF format, the next step is to create the matrix from this file needed for further visualisation in the form of heatmap generation. That can me make by using:

2.2.1.3 gff2matrix()

This function converts the GFF-like data frame into a binary motif-occurrence matrix. This matrix format is compact and is the required input for the heatmap visualization functions. Each row represents a unique motif, each column represents a sequence, and a 1 indicates the presence of that motif in that sequence.

# The 'gff_data' is used from the previous chunk
motif_matrix <- gff2matrix(gff_data)

# Display the first few rows of the resulting matrix
head(motif_matrix)
##                 NP_000175.1 NP_000508.1 NP_000549.1 NP_000550.2 NP_001138679.1
## PS00001:125-128           0           0           0           0              0
## PS00001:148-151           0           0           0           0              0
## PS00001:152-155           0           0           0           0              0
## PS00001:177-180           0           0           0           0              0
## PS00001:182-185           0           0           0           0              0
## PS00001:189-192           0           0           0           0              1
##                 NP_001305067.1 NP_001305150.1 NP_001350615.1 NP_071441.1
## PS00001:125-128              1              0              0           0
## PS00001:148-151              1              0              0           0
## PS00001:152-155              1              0              0           0
## PS00001:177-180              0              0              0           0
## PS00001:182-185              1              0              0           0
## PS00001:189-192              0              0              0           0
##                 NP_599030.1
## PS00001:125-128           0
## PS00001:148-151           0
## PS00001:152-155           0
## PS00001:177-180           1
## PS00001:182-185           0
## PS00001:189-192           0

After using this function a heatmap can be generate by using:

2.2.1.4 matrix2hm() and matrixToSquareHeatmap()

These functions generate interactive heatmaps from the motif-occurrence matrix, which is an effective way to visualize the presence or absence of motifs across many sequences simultaneously. matrix2hm creates a standard rectangular heatmap, while matrixToSquareHeatmap creates one with a square aspect ratio, which can be useful for comparing motifs.

# Generate a standard heatmap from the motif_matrix
heatmap1 <- matrix2hm(input = motif_matrix)
heatmap1
# Generate a square heatmap from the motif_matrix
heatmap2 <- matrixToSquareHeatmap(input = motif_matrix)
heatmap2

Heatmap is the first option to visualise the data (shown above), the next option is to generate a seqlogo, after preparing the protein motifs for their generation using the functions described below:

2.2.1.5 extractSegments() and extractProteinMotifs()

These functions are used to prepare data for sequence logo visualizations. A sequence logo provides a graphical representation of the conservation of nucleotides or amino acids in a set of aligned sequences.

extractSegments is used to pull out a specific region (e.g., from position 10 to 20) from a list of raw sequences. extractProteinMotifs is used to extract all occurrences of a specific, defined motif from a PSA scan result file.

The example below demonstrates how to extract a segment from our sample hemoglobin sequences and generate a logo.

# Read the FASTA file into a list of sequences
sequences <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")

# Extract segments from position 10 to 20 from all sequences
segments <- extractSegments(sequences, from = 10, to = 20)

# Generate the sequence logo from the extracted segments
ggseqlogo::ggseqlogo(unlist(segments), seq_type = "aa")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

2.2.2 extractProteinMotifs()

This function is used to extract all protein motifs belonging to specific PROSITE IDs from a PSA-formatted file. It returns a list where each element is named by a PROSITE ID (e.g., “PS00223”) and contains all the sequence instances found for that motif. This is used for sequence logo visualizations of specific motifs.

# Extract all protein motifs into a list
protein_motifs <- extractProteinMotifs(motifs_psa_file)

# Generate the sequence logo from the extracted protein motifs
ggseqlogo::ggseqlogo(protein_motifs$PS60007, seq_type='aa') # Sequence logo for motif 'PS60007' (example motif ID)
ggseqlogo::ggseqlogo(protein_motifs[1], seq_type='aa')       # Sequence logo for the first motif in the list
ggseqlogo::ggseqlogo(protein_motifs[5], seq_type='aa')       # Sequence logo for the fifth motif in the list

2.2.2.1 freqPie()

This function generates a pie chart to visualize the frequency distribution of each motif type found in the analysis. It takes the GFF-like data frame as input and provides a quick, high-level overview of which motifs are most common in the dataset.

# The 'gff_data'  from a previous chunk is used here
pie_chart <- freqPie(gff_data)
print(pie_chart)

3 References

Appendix

Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. Nucleic Acids Res.

A Session Information

sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plotly_4.11.0    ggplot2_4.0.0    seqinr_4.2-36    ggseqlogo_0.2   
## [5] PMScanR_0.99.6   BiocStyle_2.37.1
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3                   bitops_1.0-9               
##   [3] httr2_1.2.1                 rlang_1.1.6                
##   [5] magrittr_2.0.4              ade4_1.7-23                
##   [7] matrixStats_1.5.0           compiler_4.5.1             
##   [9] RSQLite_2.4.3               vctrs_0.6.5                
##  [11] reshape2_1.4.4              stringr_1.5.2              
##  [13] pkgconfig_2.0.3             crayon_1.5.3               
##  [15] fastmap_1.2.0               dbplyr_2.5.1               
##  [17] magick_2.9.0                XVector_0.49.1             
##  [19] labeling_0.4.3              Rsamtools_2.25.3           
##  [21] promises_1.3.3              rmarkdown_2.30             
##  [23] tinytex_0.57                purrr_1.1.0                
##  [25] bit_4.6.0                   xfun_0.53                  
##  [27] cachem_1.1.0                jsonlite_2.0.0             
##  [29] shinyFiles_0.9.3            blob_1.2.4                 
##  [31] later_1.4.4                 DelayedArray_0.35.3        
##  [33] BiocParallel_1.43.4         parallel_4.5.1             
##  [35] R6_2.6.1                    bslib_0.9.0                
##  [37] stringi_1.8.7               RColorBrewer_1.1-3         
##  [39] rtracklayer_1.69.1          GenomicRanges_1.61.5       
##  [41] jquerylib_0.1.4             Rcpp_1.1.0                 
##  [43] Seqinfo_0.99.2              bookdown_0.45              
##  [45] SummarizedExperiment_1.39.2 knitr_1.50                 
##  [47] IRanges_2.43.5              httpuv_1.6.16              
##  [49] Matrix_1.7-4                tidyselect_1.2.1           
##  [51] dichromat_2.0-0.1           abind_1.4-8                
##  [53] yaml_2.3.10                 codetools_0.2-20           
##  [55] curl_7.0.0                  lattice_0.22-7             
##  [57] tibble_3.3.0                plyr_1.8.9                 
##  [59] Biobase_2.69.1              shiny_1.11.1               
##  [61] withr_3.0.2                 S7_0.2.0                   
##  [63] evaluate_1.0.5              BiocFileCache_2.99.6       
##  [65] Biostrings_2.77.2           pillar_1.11.1              
##  [67] BiocManager_1.30.26         filelock_1.0.3             
##  [69] MatrixGenerics_1.21.0       stats4_4.5.1               
##  [71] generics_0.1.4              RCurl_1.98-1.17            
##  [73] S4Vectors_0.47.4            scales_1.4.0               
##  [75] xtable_1.8-4                glue_1.8.0                 
##  [77] lazyeval_0.2.2              tools_4.5.1                
##  [79] BiocIO_1.19.0               data.table_1.17.8          
##  [81] GenomicAlignments_1.45.4    fs_1.6.6                   
##  [83] XML_3.99-0.19               grid_4.5.1                 
##  [85] tidyr_1.3.1                 crosstalk_1.2.2            
##  [87] restfulr_0.0.16             cli_3.6.5                  
##  [89] rappdirs_0.3.3              S4Arrays_1.9.1             
##  [91] viridisLite_0.4.2           dplyr_1.1.4                
##  [93] gtable_0.3.6                sass_0.4.10                
##  [95] digest_0.6.37               BiocGenerics_0.55.1        
##  [97] SparseArray_1.9.1           rjson_0.2.23               
##  [99] htmlwidgets_1.6.4           farver_2.1.2               
## [101] memoise_2.0.1               htmltools_0.5.8.1          
## [103] lifecycle_1.0.4             httr_1.4.7                 
## [105] mime_0.13                   bit64_4.6.0-1              
## [107] MASS_7.3-65