---
title: "Getting Started with glyparse"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with glyparse}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE
)
```

## Your Universal Glycan Text Translator 🔄

Welcome to the world of glycan text parsing! If you've ever worked with glycan data from different sources, you know the frustration: every database, software tool, and research group seems to have its own way of representing glycan structures in text format.

That's where `glyparse` comes to the rescue! 🚀

Think of `glyparse` as your **universal glycan translator**: it can read glycan structures written in many different "languages" and convert them all into a unified format that your computer can understand and work with.

**Note:** All functions in `glyparse` return `glyrepr::glycan_structure` objects. If you are unfamiliar with `glyrepr`, you can read the documentation [here](https://glycoverse.github.io/glyrepr/articles/glyrepr.html).

```{r setup}
library(glyparse)
```

## The Babel Tower of Glycan Text Formats 🗼

Before we dive in, let's see what we're dealing with.
Here's the same N-glycan core structure written in different formats:

| Format | Example | Where You'll See It |
|--------|---------|---------------------|
| **IUPAC-condensed** | `Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc` | Literature, UniCarbKB |
| **IUPAC-short** | `Mana3(Mana6)Manb4GlcNAcb4GlcNAc` | Literature, UniCarbKB |
| **IUPAC-extended** | `alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-D-GlcNAc` | Literature, UniCarbKB |
| **GlycoCT** | Complex multi-line format | Literature, GlycomeDB |
| **WURCS** | `WURCS=2.0/3,5,4/[...]/1-1-2-3-3/a4-b1_b4-c1...` | Literature, GlyTouCan |
| **Linear Code** | `Ma3(Ma6)Mb4GNb4GNb` | Literature |
| **pGlyco** | `(N(N(H(H(H)))))` | pGlyco software results |
| **StrucGP** | `A2B2C1D1E2fedcba` | StrucGP software results |

Confusing, right? 😵‍💫 `glyparse` understands them all!

## Your Parsing Toolkit 🛠️

`glyparse` provides eight specialized parsers, each optimized for a specific format:

- **`parse_iupac_condensed()`**: The most common format
- **`parse_iupac_short()`**: Compact literature format
- **`parse_iupac_extended()`**: Verbose formal format
- **`parse_glycoct()`**: Database standard format
- **`parse_wurcs()`**: Modern standardized format
- **`parse_linear_code()`**: Linear Code format
- **`parse_pglyco_struc()`**: pGlyco software format
- **`parse_strucgp_struc()`**: StrucGP software format

All parsers follow the same pattern:

- **Input**: Character vector of structure strings
- **Output**: A `glyrepr::glycan_structure` object that you can analyze

## Part 0: `auto_parse()`

Don't know what you're dealing with? Give it to `auto_parse()`! This function tries to identify the format of each string automatically and use the appropriate parser. Even input with mixed formats is supported.

```{r}
x <- c(
  "Gal(b1-3)GalNAc(b1-",
  "(N(F)(N(H(H(N))(H(N(H))))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)
auto_parse(x)
```

## Part 1: IUPAC Family - The Popular Kids 🌟

Let's start with the IUPAC formats.

### IUPAC-Condensed: The Literature Standard

This format is widely used in scientific literature and databases like UniCarbKB. Want to know more about IUPAC-condensed format? Check [this](https://glycoverse.github.io/glyrepr/articles/iupac.html) out!

```{r}
# Single structure
iupac_condensed <- "Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-4)Gal(b1-4)Glc(a1-"
parse_iupac_condensed(iupac_condensed)
```

```{r}
# Multiple structures at once
glycans <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
  "Gal(b1-3)GalNAc(b1-", # O-glycan core 1
  "Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-" # O-glycan core 2
)
parse_iupac_condensed(glycans)
```

### IUPAC-Short: Literature's Favorite

This compact format is popular in research papers because it saves space:

```{r}
# The same structures in short format
iupac_short <- c(
  "Mana3(Mana6)Manb4GlcNAcb4GlcNAcb-",
  "Galb3GalNAcb-",
  "Neu5Aca3Galb3(GlcNAcb6)GalNAcb-"
)
parse_iupac_short(iupac_short)
```

Notice how much more compact this is! The parser is smart enough to infer common linkage positions (like Neu5Ac always being a2-linked).

### IUPAC-Extended: The Formal One

This verbose format includes full chemical names and stereochemistry:

```{r}
iupac_extended <- paste0(
  "α-D-Manp-(1→3)[α-D-Manp-(1→6)]-β-D-Manp-(1→4)",
  "-β-D-GlcpNAc-(1→4)-β-D-GlcpNAc-(1→"
)
parse_iupac_extended(iupac_extended)
```

## Part 2: Database Formats - The Heavy Hitters 💪

### GlycoCT: The Precision Format

GlycoCT is used in literature for precise representation and in databases like GlycomeDB.
It's more complex than the IUPAC formats but extremely precise:

```{r}
glycoct <- paste0(
  "RES\n",
  "1b:b-dglc-HEX-1:5\n",
  "2b:b-dgal-HEX-1:5\n",
  "3b:a-dgal-HEX-1:5\n",
  "LIN\n",
  "1:1o(4+1)2d\n",
  "2:2o(3+1)3d"
)
parse_glycoct(glycoct)
```

### WURCS: The Complex Structure Format

WURCS (Web3 Unique Representation of Carbohydrate Structures) is used in literature for complex structures and in databases like GlyTouCan:

```{r}
wurcs <- paste0(
  "WURCS=2.0/3,3,2/",
  "[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-2-3/a4-b1_b3-c1"
)
parse_wurcs(wurcs)
```

### Linear Code: The Simplified Format

Linear Code is a simplified format used in literature for complex structures:

```{r}
linear_code <- "Ma3(Ma6)Mb4GNb4GNb"
parse_linear_code(linear_code)
```

## Part 3: Software-Specific Formats - The Specialists 🔬

### pGlyco Format: Proteomics Tool

If you work with glycoproteomics, you might encounter pGlyco's parenthetical notation:

```{r}
pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"
parse_pglyco_struc(pglyco)
```

This cryptic notation actually represents a complex N-glycan, where each letter stands for a monosaccharide:

- N = HexNAc
- F = Fuc
- H = Hex (Man or Gal)

### StrucGP Format: Alphabetical System

StrucGP uses a letter-based encoding system:

```{r}
strucgp <- "A2B2C1D1E2F1fedD1E2edcbB5ba"
parse_strucgp_struc(strucgp)
```

## The Bottom Line 🎯

`glyparse` transforms the chaos of glycan text formats into order. No matter where your glycan data comes from (databases, literature, or software tools), you can now parse it into a `glyrepr::glycan_structure` object for further analysis. In fact, the `glyread` package uses these parsing functions internally when reading output from common glycopeptide identification software.

**Next steps:**

- Explore the `glyrepr` package for structure manipulation
- Try `glymotif` for motif analysis of your parsed structures
- Use `glyexp` for experimental data analysis
- Check out the rest of the `glycoverse` ecosystem!

Happy parsing! 🧬✨