---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
    
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE # Set to FALSE since API calls require credentials
)
```


`rsynthbio` is an R package that provides a convenient interface to the [Synthesize Bio](https://www.synthesize.bio/) API, allowing users to generate realistic gene expression data based on specified biological conditions. 
This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq and single-cell RNA-seq.

Alternatively, you can generate datasets with AI through our [web platform](https://app.synthesize.bio/datasets/).

## How to install

You can install `rsynthbio` from CRAN:

```{r installation, eval=FALSE}
install.packages("rsynthbio")
```

To get the development version, install it from GitHub with the `remotes` package:

```{r github-installation, eval=FALSE}
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")
```

Once installed, load the package:

```{r}
library(rsynthbio)
```

## Authentication

Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:

```{r auth-secure, eval=FALSE}
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()

# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)
```

In later sessions, you can load the stored token from the keyring and check whether a token is set:

```{r, eval=FALSE}
# In future sessions, load the stored token
load_synthesize_token_from_keyring()

# Check if a token is already set
has_synthesize_token()
```

You can obtain an API token by registering at [Synthesize Bio](https://app.synthesize.bio).

### Security Best Practices

For security reasons, remember to clear your token when you're done:

```{r clear-token, eval = FALSE}
# Clear token from current session
clear_synthesize_token()

# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)
```

Never hard-code your token in scripts that will be shared or committed to version control.

## Designing Queries for Models

### Choosing a Modality

The modality (the type of data to generate) is set when you create the query with `get_valid_query()`:

- **`bulk`**: Bulk RNA-seq (asynchronous under the hood, returned as data frames)
- **`single-cell`**: Single-cell RNA-seq (asynchronous under the hood, returned as data frames)

You can check which modalities are available programmatically:

```{r modalities}
# Check available modalities
get_valid_modalities()
```

You do not need to specify any internal API slugs. The library maps modalities to the appropriate model endpoints automatically.

```{r query, eval=FALSE}
# Create a bulk query
bulk_query <- get_valid_query(modality = "bulk")
bulk <- predict_query(bulk_query, as_counts = TRUE)

# Create a single-cell query
sc_query <- get_valid_query(modality = "single-cell")
sc <- predict_query(sc_query, as_counts = TRUE)
```

### Creating a Query

The structure of the query required by the API is fixed for the currently supported model. You can use `get_valid_query()` to get a correctly structured example list.

```{r query-example}
# Get the example query structure
example_query <- get_valid_query()

# Inspect the query structure
str(example_query)
```

The query consists of:

1. **`modality`**: The type of gene expression data to generate ("bulk" or "single-cell")
2. **`mode`**: The prediction mode that controls how expression data is generated:
   - **"sample generation"**: Generates realistic-looking synthetic data with measurement error (bulk only)
   - **"mean estimation"**: Provides stable mean estimates of expression levels (bulk and single-cell)
3. **`inputs`**: A list of biological conditions to generate data for

Each input contains `metadata` (describing the biological sample) and `num_samples` (how many samples to generate).
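
To make this structure concrete, the sketch below builds a query-shaped list by hand in base R. It is only an illustration: the metadata values are examples taken from elsewhere in this vignette, and `get_valid_query()` remains the reliable way to obtain a correctly structured query.

```{r sketch-query-structure, eval=FALSE}
# Hand-built list mirroring the fields described above (illustrative only;
# prefer get_valid_query() for a query the API is known to accept)
manual_query <- list(
  modality = "bulk",
  mode = "sample generation",
  inputs = list(
    list(
      metadata = list(
        sex = "male",
        sample_type = "primary tissue",
        tissue_ontology_id = "UBERON:0002371"
      ),
      num_samples = 5
    )
  )
)

# Top-level fields, and the fields of the first input
names(manual_query)
names(manual_query$inputs[[1]])
```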

> See the [Query Parameters](#query-parameters) section below for detailed documentation on `mode` and other optional query fields.

### Making a Prediction

Once your query is ready, you can send it to the API to generate gene expression data:

```{r predict, eval=FALSE}
# Start from the example query, or use your own customized query
query <- get_valid_query()
result <- predict_query(query, as_counts = TRUE)
```

The result is a list of two data frames: `metadata` and `expression`.

### Understanding the Async API

Behind the scenes, the API uses an **asynchronous model** to handle queries efficiently:

1. Your query is submitted to the API, which returns a query ID
2. The function automatically polls the status endpoint (default: every 2 seconds)
3. When the query completes, results are downloaded from a signed URL
4. Data is parsed and returned as R data frames

All of this happens automatically when you call `predict_query()`.

### Controlling Async Behavior

You can customize the polling behavior if needed:

```{r async-options, eval=FALSE}
# Increase timeout for large queries (default: 900 seconds = 15 minutes)
result <- predict_query(
  query,
  poll_timeout_seconds = 1800, # 30 minutes
  poll_interval_seconds = 5 # Check every 5 seconds instead of 2
)
```
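
The roles of the two polling arguments can be pictured with a small, self-contained loop. This is a sketch of the general polling pattern, not the package's actual implementation: `poll_until_ready()`, `make_mock_status()`, and the status strings `"ready"`, `"running"`, and `"failed"` are hypothetical names chosen for illustration.

```{r sketch-polling, eval=FALSE}
# Generic polling loop (hypothetical helper, not part of rsynthbio)
poll_until_ready <- function(check_status,
                             poll_interval_seconds = 2,
                             poll_timeout_seconds = 900) {
  started <- Sys.time()
  attempts <- 0
  repeat {
    attempts <- attempts + 1
    status <- check_status()
    if (identical(status, "ready")) {
      return(attempts)
    }
    if (identical(status, "failed")) {
      stop("The query failed.")
    }
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed >= poll_timeout_seconds) {
      stop("Timed out while waiting for the query to finish.")
    }
    Sys.sleep(poll_interval_seconds)
  }
}

# Mock status function that reports "ready" on the third call
make_mock_status <- function(ready_after = 3) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= ready_after) "ready" else "running"
  }
}

# Interval of 0 so the demonstration finishes immediately
attempts <- poll_until_ready(make_mock_status(3), poll_interval_seconds = 0)
attempts
```

A shorter interval notices completion sooner at the cost of more status requests, while a longer timeout gives large queries more time before the call gives up.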

### Valid Metadata Keys

The input metadata is a list of lists. This is the full list of valid metadata keys:

__Biological:__  

- ``age_years``
- ``cell_line_ontology_id``
- ``cell_type_ontology_id``
- ``developmental_stage``
- ``disease_ontology_id``
- ``ethnicity``
- ``genotype``
- ``race``
- ``sample_type`` ("cell line", "organoid", "other", "primary cells", "primary tissue", "xenograft")
- ``sex`` ("male", "female")
- ``tissue_ontology_id``

__Perturbational:__  

- ``perturbation_dose``
- ``perturbation_ontology_id``
- ``perturbation_time``
- ``perturbation_type`` ("coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna")

__Technical:__  

- ``study`` (BioProject ID)
- ``library_selection`` (e.g., "cDNA", "polyA", "Oligo-dT"; see the [ENA permitted values for library selection](https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection))
- ``library_layout`` ("PAIRED", "SINGLE")
- ``platform`` ("illumina")

### Valid Metadata Values

The following are the valid values or expected formats for selected metadata keys:

| Metadata Field             | Requirement / Example                                                                                                                                                                                                 |
|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `cell_line_ontology_id`    | Requires a [Cellosaurus ID](https://www.cellosaurus.org/).                                                                                                                                                           |
| `cell_type_ontology_id`    | Requires a [CL ID](https://www.ebi.ac.uk/ols4/ontologies/cl).                                                                                                                                                         |
| `disease_ontology_id`      | Requires a [MONDO ID](https://www.ebi.ac.uk/ols4/ontologies/mondo).                                                                                                                                                   |
| `perturbation_ontology_id` | Must be a valid Ensembl gene ID (e.g., `ENSG00000156127`), [ChEBI ID](https://www.ebi.ac.uk/chebi/) (e.g., `CHEBI:16681`), [ChEMBL ID](https://www.ebi.ac.uk/chembl/) (e.g., `CHEMBL1234567`), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) (e.g., `9606`). |
| `tissue_ontology_id`       | Requires a [UBERON ID](https://www.ebi.ac.uk/ols4/ontologies/uberon).                                                                                                                                                 |


We highly recommend using the [EMBL-EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols4/) to find valid IDs for your metadata.

Models have a limited acceptable range of metadata input values. 
If you provide a value that is not in the acceptable range, the API will return an error.
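
A quick local check can catch obviously malformed IDs before a query is sent. The sketch below assumes that ontology IDs are written as `PREFIX:digits` (as in `UBERON:0002371`); the helper `check_id_prefixes()` and the prefix table are hypothetical, and passing this check does not mean the model accepts the value.

```{r sketch-id-prefixes, eval=FALSE}
# Assumed ID prefixes for three metadata keys (illustrative, not exhaustive)
expected_prefix <- c(
  tissue_ontology_id = "UBERON:",
  cell_type_ontology_id = "CL:",
  disease_ontology_id = "MONDO:"
)

# Hypothetical helper: TRUE/FALSE per checked key present in the metadata
check_id_prefixes <- function(metadata) {
  keys <- intersect(names(metadata), names(expected_prefix))
  vapply(
    keys,
    function(key) {
      startsWith(as.character(metadata[[key]]), expected_prefix[[key]])
    },
    logical(1)
  )
}

good <- list(sex = "male", tissue_ontology_id = "UBERON:0002371")
bad <- list(sex = "male", tissue_ontology_id = "0002371")

check_id_prefixes(good)
check_id_prefixes(bad)
```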

### Query Parameters

In addition to metadata, queries support several optional parameters that control the generation process:

#### mode (character, required)

Controls the type of prediction the model generates. This parameter is required in all queries.

Available modes:

- **"mean estimation"**: The model builds a distribution that captures the biological heterogeneity consistent with the supplied metadata. That distribution is sampled to predict a gene expression distribution that captures measurement error, and the mean of that distribution is returned as the prediction. Use this mode when you want a stable estimate of expected expression levels.

- **"sample generation"**: The model follows the same steps as mean estimation, but the final gene expression distribution is also sampled, which yields realistic-looking synthetic data that includes measurement error. Use this mode when you want data that mimics real experimental measurements.

> **Note:** **Single-cell queries only support "mean estimation" mode.** Bulk queries support both modes.

```{r mode-examples, eval=FALSE}
# Bulk query with sample generation (default for bulk)
bulk_query <- get_valid_query(modality = "bulk")
bulk_query$mode <- "sample generation"

# Bulk query with mean estimation
bulk_query_mean <- get_valid_query(modality = "bulk")
bulk_query_mean$mode <- "mean estimation"

# Single-cell query (must use mean estimation)
sc_query <- get_valid_query(modality = "single-cell")
sc_query$mode <- "mean estimation" # Required for single-cell
```

#### total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.


```{r total-count-example, eval=FALSE}
# Create a query and add custom total_count
query <- get_valid_query(modality = "bulk")
query$total_count <- 5000000
```
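
The proportional effect of `total_count` can be checked with plain arithmetic. The sketch below starts from made-up counts-per-million (CPM) values rather than log CPM, because the exact log transform used by the model is not described here; it only illustrates how a library size rescales CPM into counts, and `cpm_to_counts()` is a hypothetical helper.

```{r sketch-total-count, eval=FALSE}
# Made-up CPM values for three genes (they sum to one million)
cpm <- c(gene_a = 100, gene_b = 2500, gene_c = 997400)

# CPM is counts per million reads, so counts = CPM * library size / 1e6
cpm_to_counts <- function(cpm, total_count) {
  round(cpm * total_count / 1e6)
}

counts_5m <- cpm_to_counts(cpm, total_count = 5000000)
counts_10m <- cpm_to_counts(cpm, total_count = 10000000)

# Doubling the library size doubles every count
counts_5m
counts_10m / counts_5m
```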

#### deterministic_latents (logical, optional)

If `TRUE`, the model uses the mean of each latent distribution (`p(z|metadata)` or `q(z|x)`) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.

- Default: `FALSE` (sampling is enabled)

```{r deterministic-example, eval=FALSE}
# Create a query and enable deterministic latents
query <- get_valid_query(modality = "bulk")
query$deterministic_latents <- TRUE
```

#### seed (integer, optional)

Random seed for reproducibility when using stochastic sampling.

```{r seed-example, eval=FALSE}
# Create a query with a specific seed
query <- get_valid_query(modality = "bulk")
query$seed <- 42
```
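
As a rough analogy in base R (not the model's code), the sketch below contrasts `seed` with `deterministic_latents`: a fixed seed makes random draws repeatable, while replacing draws with the distribution mean removes the randomness altogether. The normal distribution and its parameters are arbitrary stand-ins for a latent distribution.

```{r sketch-seed-latents, eval=FALSE}
# Arbitrary stand-in for a latent distribution
latent_mean <- 1.5
latent_sd <- 0.3

# Same seed, same draws
set.seed(42)
draw_a <- rnorm(3, mean = latent_mean, sd = latent_sd)
set.seed(42)
draw_b <- rnorm(3, mean = latent_mean, sd = latent_sd)
identical(draw_a, draw_b)

# Using the mean instead of sampling needs no seed at all
deterministic_draw <- rep(latent_mean, 3)
deterministic_draw
```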

You can combine multiple parameters in a single query:

```{r combined-params, eval=FALSE}
# Create a query and add multiple parameters
query <- get_valid_query(modality = "bulk")
query$total_count <- 8000000
query$deterministic_latents <- TRUE
query$mode <- "mean estimation"

results <- predict_query(query)
```

### Modifying Query Inputs

You can customize the query inputs to fit your specific research needs:

```{r modify-query, eval=FALSE}
# Get a base query
query <- get_valid_query()

# Adjust number of samples for the first input
query$inputs[[1]]$num_samples <- 10

# Append a new condition after the existing inputs
query$inputs[[length(query$inputs) + 1]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue",
    tissue_ontology_id = "UBERON:0002371"
  ),
  num_samples = 5
)
```
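
When a query needs many similar conditions, the inputs can be generated in a loop instead of typed out. The sketch below uses a hand-built stand-in for the query so that it runs without the API; `make_input()` is a hypothetical helper, and the metadata values reuse the ones shown above.

```{r sketch-build-inputs, eval=FALSE}
# Hypothetical helper: one input per value of `sex`
make_input <- function(sex, num_samples = 5) {
  list(
    metadata = list(
      sex = sex,
      sample_type = "primary tissue",
      tissue_ontology_id = "UBERON:0002371"
    ),
    num_samples = num_samples
  )
}

# Stand-in for a query returned by get_valid_query()
toy_query <- list(modality = "bulk", mode = "mean estimation", inputs = list())

# Append one input per condition
toy_query$inputs <- c(toy_query$inputs, lapply(c("male", "female"), make_input))

length(toy_query$inputs)
toy_query$inputs[[2]]$metadata$sex
```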

### Working with Results

```{r analyze, eval=FALSE}
# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)
```
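
A common next step is to summarize expression by a metadata field. The sketch below uses a small mock result and assumes that `expression` has one row per sample and one column per gene, with rows in the same order as `metadata`; check this layout on your own results before relying on it. The sample, gene, and column names are invented.

```{r sketch-group-means, eval=FALSE}
# Mock result with the assumed layout (rows = samples, columns = genes)
mock_result <- list(
  metadata = data.frame(
    sample_id = c("s1", "s2", "s3", "s4"),
    sex = c("male", "male", "female", "female")
  ),
  expression = data.frame(
    GENE_A = c(10, 12, 30, 34),
    GENE_B = c(5, 7, 5, 3)
  )
)

# Row-wise pairing only makes sense if both tables have the same number of rows
stopifnot(nrow(mock_result$metadata) == nrow(mock_result$expression))

# Mean expression per gene within each sex
group_means <- aggregate(
  mock_result$expression,
  by = list(sex = mock_result$metadata$sex),
  FUN = mean
)
group_means
```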

You may want to save the results for later use or export them as CSV files:

```{r large-data, eval=FALSE}
# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
```


### Custom Validation

You can validate your queries before sending them to the API:

```{r validation, eval=FALSE}
# Validate structure
validate_query(query)

# Validate modality
validate_modality(query)
```
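
In a script, it can be convenient to turn a validation failure into a `TRUE`/`FALSE` result instead of stopping. The sketch below assumes the validator signals an R error for an invalid query, which is an assumption about `validate_query()` rather than documented behavior; `try_validation()` and `toy_validator()` are hypothetical and stand in for the package functions so the example runs on its own.

```{r sketch-safe-validation, eval=FALSE}
# Hypothetical wrapper: TRUE if the validator succeeds, FALSE if it errors
try_validation <- function(validator, query) {
  tryCatch(
    {
      validator(query)
      TRUE
    },
    error = function(e) {
      message("Validation failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# Stand-in validator that only checks for an `inputs` field
toy_validator <- function(query) {
  if (is.null(query$inputs)) {
    stop("query has no `inputs` field")
  }
  invisible(query)
}

try_validation(toy_validator, list(modality = "bulk", inputs = list()))
try_validation(toy_validator, list(modality = "bulk"))
```

With the package loaded, the same wrapper could be called as `try_validation(validate_query, query)`, under the assumption stated above.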

## Session info

```{r session-info, eval=TRUE}
sessionInfo()
```

## Additional Resources

- [Package Source Code](https://github.com/synthesizebio/rsynthbio)
- [File Bug Reports](https://github.com/synthesizebio/rsynthbio/issues)