---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE # Set to FALSE since API calls require credentials
)
```

`rsynthbio` is an R package that provides a convenient interface to the [Synthesize Bio](https://www.synthesize.bio/) API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities, including bulk RNA-seq and single-cell RNA-seq.

Alternatively, you can generate datasets from our [web platform](https://app.synthesize.bio/datasets/).

## How to install

You can install `rsynthbio` from CRAN:

```{r installation, eval=FALSE}
install.packages("rsynthbio")
```

If you want the development version, you can install it from GitHub using the `remotes` package:

```{r github-installation, eval=FALSE}
# Install 'remotes' only if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")
```

Once installed, load the package:

```{r}
library(rsynthbio)
```

## Authentication

Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:

```{r auth-secure, eval=FALSE}
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()

# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)
```

To load your stored token in a later session:

```{r, eval=FALSE}
# In future sessions, load the stored token
load_synthesize_token_from_keyring()

# Check if a token is already set
has_synthesize_token()
```

You can obtain an API token by registering at [Synthesize Bio](https://app.synthesize.bio).

### Security Best Practices

For security reasons, remember to clear your token when you're done:

```{r clear-token, eval = FALSE}
# Clear token from current session
clear_synthesize_token()

# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)
```

Never hard-code your token in scripts that will be shared or committed to version control.

## Designing Queries for Models

### Choosing a Modality

The modality (data type to generate) is specified in the query using `get_valid_query()`:

- **`bulk`**: Bulk RNA-seq (asynchronous under the hood, returned as data frames)
- **`single-cell`**: Single-cell RNA-seq (asynchronous under the hood, returned as data frames)

You can check which modalities are available programmatically:

```{r modalities}
# Check available modalities
get_valid_modalities()
```

You do not need to specify any internal API slugs. The library maps modalities to the appropriate model endpoints automatically.

```{r query, eval=FALSE}
# Create a bulk query
bulk_query <- get_valid_query(modality = "bulk")
bulk <- predict_query(bulk_query, as_counts = TRUE)

# Create a single-cell query
sc_query <- get_valid_query(modality = "single-cell")
sc <- predict_query(sc_query, as_counts = TRUE)
```

### Creating a Query

The structure of the query required by the API is fixed for the currently supported model. You can use `get_valid_query()` to get a correctly structured example list.

```{r query-example}
# Get the example query structure
example_query <- get_valid_query()

# Inspect the query structure
str(example_query)
```

The query consists of:

1. **`modality`**: The type of gene expression data to generate ("bulk" or "single-cell")
2. **`mode`**: The prediction mode that controls how expression data is generated:
    - **"sample generation"**: Generates realistic-looking synthetic data with measurement error (bulk only)
    - **"mean estimation"**: Provides stable mean estimates of expression levels (bulk and single-cell)
3. **`inputs`**: A list of biological conditions to generate data for

Each input contains `metadata` (describing the biological sample) and `num_samples` (how many samples to generate).

> See the [Query Parameters](#query-parameters) section below for detailed documentation on `mode` and other optional query fields.

### Making a Prediction

Once your query is ready, you can send it to the API to generate gene expression data:

```{r predict, eval=FALSE}
# `query` is any valid query list, e.g. `example_query` from above
result <- predict_query(query, as_counts = TRUE)
```

The result is a list of two data frames: `metadata` and `expression`.

### Understanding the Async API

Behind the scenes, the API uses an **asynchronous model** to handle queries efficiently:

1. Your query is submitted to the API, which returns a query ID
2. The function automatically polls the status endpoint (default: every 2 seconds)
3. When the query completes, results are downloaded from a signed URL
4. Data is parsed and returned as R data frames

All of this happens automatically when you call `predict_query()`.

### Controlling Async Behavior

You can customize the polling behavior if needed:

```{r async-options, eval=FALSE}
# Increase timeout for large queries (default: 900 seconds = 15 minutes)
result <- predict_query(
  query,
  poll_timeout_seconds = 1800, # 30 minutes
  poll_interval_seconds = 5 # Check every 5 seconds instead of 2
)
```

### Valid Metadata Keys

The input metadata is a list of lists.
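
As a minimal sketch of that nesting, the base R snippet below builds one input by hand. It assumes the shape matches the `metadata` / `num_samples` structure described under "Creating a Query"; the specific values are illustrative only and are taken from keys and values that appear elsewhere in this vignette. It does not call the API or any `rsynthbio` function.

```r
# Illustrative input: `metadata` is a named list of key/value pairs and
# `num_samples` is how many samples to request for that condition.
# The values here are examples, not recommendations.
input <- list(
  metadata = list(
    sex = "female",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 3
)

# A query's `inputs` field holds a list of such inputs (one per condition).
inputs <- list(input)

# Inspect the nesting
str(inputs)
```

In practice it is usually easier to start from `get_valid_query()` and edit the returned list than to build the structure from scratch.
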
This is the full list of valid metadata keys:

__Biological:__

- `age_years`
- `cell_line_ontology_id`
- `cell_type_ontology_id`
- `developmental_stage`
- `disease_ontology_id`
- `ethnicity`
- `genotype`
- `race`
- `sample_type` ("cell line", "organoid", "other", "primary cells", "primary tissue", "xenograft")
- `sex` ("male", "female")
- `tissue_ontology_id`

__Perturbational:__

- `perturbation_dose`
- `perturbation_ontology_id`
- `perturbation_time`
- `perturbation_type` ("coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna")

__Technical:__

- `study` (BioProject ID)
- `library_selection` (e.g., "cDNA", "polyA", "Oligo-dT"; see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
- `library_layout` ("PAIRED", "SINGLE")
- `platform` ("illumina")

### Valid Metadata Values

The following are the valid values or expected formats for selected metadata keys:

| Metadata Field             | Requirement / Example |
|----------------------------|-----------------------|
| `cell_line_ontology_id`    | Requires a [Cellosaurus ID](https://www.cellosaurus.org/). |
| `cell_type_ontology_id`    | Requires a [CL ID](https://www.ebi.ac.uk/ols4/ontologies/cl). |
| `disease_ontology_id`      | Requires a [MONDO ID](https://www.ebi.ac.uk/ols4/ontologies/mondo). |
| `perturbation_ontology_id` | Must be a valid Ensembl gene ID (e.g., `ENSG00000156127`), [ChEBI ID](https://www.ebi.ac.uk/chebi/) (e.g., `CHEBI:16681`), [ChEMBL ID](https://www.ebi.ac.uk/chembl/) (e.g., `CHEMBL1234567`), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) (e.g., `9606`). |
| `tissue_ontology_id`       | Requires a [UBERON ID](https://www.ebi.ac.uk/ols4/ontologies/uberon). |

We highly recommend using the [EMBL-EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols4/) to find valid IDs for your metadata.

Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.

### Query Parameters

In addition to metadata, queries support several parameters that control the generation process. `mode` is required; the others are optional.

#### mode (character, required)

Controls the type of prediction the model generates. This parameter is required in all queries.

Available modes:

- **"sample generation"**: The model works identically to the mean estimation approach, except that the final gene expression distribution is also sampled to generate realistic-looking synthetic data that captures the error associated with measurements. This mode is useful when you want data that mimics real experimental measurements.

- **"mean estimation"**: The model creates a distribution capturing the biological heterogeneity consistent with the supplied metadata. This distribution is then sampled to predict a gene expression distribution that captures measurement error. The mean of that distribution serves as the prediction. This mode is useful when you want a stable estimate of expected expression levels.

> **Note:** **Single-cell queries only support "mean estimation" mode.** Bulk queries support both modes.

```{r mode-examples, eval=FALSE}
# Bulk query with sample generation (default for bulk)
bulk_query <- get_valid_query(modality = "bulk")
bulk_query$mode <- "sample generation"

# Bulk query with mean estimation
bulk_query_mean <- get_valid_query(modality = "bulk")
bulk_query_mean$mode <- "mean estimation"

# Single-cell query (must use mean estimation)
sc_query <- get_valid_query(modality = "single-cell")
sc_query$mode <- "mean estimation" # Required for single-cell
```

#### total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts.
Higher values scale counts up proportionally.

```{r total-count-example, eval=FALSE}
# Create a query and add custom total_count
query <- get_valid_query(modality = "bulk")
query$total_count <- 5000000
```

#### deterministic_latents (logical, optional)

If `TRUE`, the model uses the mean of each latent distribution (`p(z|metadata)` or `q(z|x)`) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.

- Default: `FALSE` (sampling is enabled)

```{r deterministic-example, eval=FALSE}
# Create a query and enable deterministic latents
query <- get_valid_query(modality = "bulk")
query$deterministic_latents <- TRUE
```

#### seed (integer, optional)

Random seed for reproducibility when using stochastic sampling.

```{r seed-example, eval=FALSE}
# Create a query with a specific seed
query <- get_valid_query(modality = "bulk")
query$seed <- 42
```

You can combine multiple parameters in a single query:

```{r combined-params, eval=FALSE}
# Create a query and add multiple parameters
query <- get_valid_query(modality = "bulk")
query$total_count <- 8000000
query$deterministic_latents <- TRUE
query$mode <- "mean estimation"

results <- predict_query(query)
```

### Modifying Query Inputs

You can customize the query inputs to fit your specific research needs:

```{r modify-query, eval=FALSE}
# Get a base query
query <- get_valid_query()

# Adjust number of samples for the first input
query$inputs[[1]]$num_samples <- 10

# Append a new condition after the existing inputs
# (indexing by length avoids assuming how many inputs the base query has)
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
```

### Working with Results

```{r analyze, eval=FALSE}
# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)
```

You may want to save the results for later use or export them as CSV files:

```{r large-data, eval=FALSE}
# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
```

### Custom Validation

You can validate your queries before sending them to the API:

```{r validation, eval=FALSE}
# Validate structure
validate_query(query)

# Validate modality
validate_modality(query)
```

## Session info

```{r session-info}
sessionInfo()
```

## Additional Resources

- [Package Source Code](https://github.com/synthesizebio/rsynthbio)
- [File Bug Reports](https://github.com/synthesizebio/rsynthbio/issues)