---
title: "Sample Metadata Inference"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: rmarkdown::html_vignette
fig_width: 6
fig_height: 5
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{"4. Data Inference"}
  %\VignetteEncoding{UTF-8} 
---

SeSAMe implements inference of sex, age, ethnicity. These are valuable
information for checking the integrity of the experiment and detecting sample
swaps.

```{r inf1, echo=FALSE, message=FALSE}
library(sesame)
sesameDataCache()
sdf = sesameDataGet('EPIC.1.SigDF')
```

# Sex, XCI

Sex is inferred based on our curated X-linked probes and Y chromosome probes
excluding pseudo-autosomal regions and XCI escapes.

Human:

```{r inf2, message=FALSE}
sdf = sesameDataGet('EPIC.1.SigDF')
inferSex(sdf)
inferSexKaryotypes(sdf)
```

Mouse:

```{r nh16, message=FALSE}
sdf = sesameDataGet("MM285.1.SigDF")
inferSex(sdf)
```

# Ethnicity

Ethnicity is inferred using a random forest model trained based on both the
built-in SNPs (`rs` probes) and channel-switching Type-I probes.
```{r inf3}
sdf = sesameDataGet('EPIC.1.SigDF')
inferEthnicity(sdf)
```

# Age & Epigenetic Clock

SeSAMe provides age regression through multiple previously established models,
e.g., the well-known Horvath 353 model ([Horvath
2013](https://pubmed.ncbi.nlm.nih.gov/24138928/)) which returns the
chronological age in the number of years. Here is an example:

```{r inf4, eval=FALSE}
betas <- sesameDataGet('HM450.1.TCGA.PAAD')$betas
model <- sesameAnno_get("Anno/HM450/Clock_Horvath353.rds")
predictAge(betas, model)
```

And MM285 mouse array data using a set of 347 CpGs (see [Zhou et
al. 2022](https://www.cell.com/cell-genomics/fulltext/S2666-979X(22)00077-5))
The function returns the age in the number of months. We recommend using SeSAMe
preprocessed data as input to the function. Here’s an example:

```{r inf18, message=FALSE, eval=FALSE}
library(SummarizedExperiment)
betas <- assay(sesameDataGet("MM285.10.SE.tissue"))[,1]
model <- sesameAnno_get("Anno/MM285/Clock_Zhou347.rds")
predictAge(betas, model)
```

This indicates that this mouse is approximately 1.41 months old. The function
looks for overlapping probes and estimates age using the corresponding clock
models. Other available epigenetic clocks are

```{r inf20, echo=FALSE, result="asis"}
library(knitr)
df <- data.frame(rbind(
    c("Anno/HM450/Clock_Horvath353.rds", 353,
        "HM450/EPIC", "Horvath 2013 (24138928)"),
    c("Anno/HM450/Clock_Hannum.rds", 71,
        "HM450", "Hannum 2013 (23177740)"),
    c("Anno/HM450/Clock_SkinBlood.rds", 391,
        "HM450/EPIC", "Horath 2018 (30048243)"),
    c("Anno/EPIC/Clock_PhenoAge.rds", 514,
        "HM450/EPIC", "Levine 2018 (29676998)"),
    c("Anno/MM285/Clock_Zhou347.rds", 347,
        "MM285", "Zhou 2022")
))
colnames(df) <- c("RDS Key", "Platform", "N", "Reference (PMID)")
kable(df, caption="Available Epigenetic Clocks")
```


# Copy Number

See [Supplemental
Vignette](https://zhou-lab.github.io/sesame/v1.16/supplemental.html#cnv)

# Cell Count Deconvolution

SeSAMe estimates leukocyte fraction using a two-component model.This function
works for samples whose targeted cell-of-origin is not related to white blood
cells.

```{r inf7, message=FALSE}
betas.tissue <- sesameDataGet('HM450.1.TCGA.PAAD')$betas
estimateLeukocyte(betas.tissue)
```

# Genomic Privacy

The goal of data sanitization is to modifiy IDAT files in place, so they can be
released to public domain without privacy leak. This will be achieved by
deIdentification.

```{r inf8, message=FALSE, warning=FALSE, include=FALSE}
library(sesame)
sesameDataCacheAll()
```

Let's take DNA methylation data from the HM450 platform for example.
```{r inf9, eval=FALSE}
tmp = tempdir()
res_grn = sesameAnno_download("Test/3999492009_R01C01_Grn.idat", dest_dir=tmp)
res_red = sesameAnno_download("Test/3999492009_R01C01_Red.idat", dest_dir=tmp)
```
                                                   
## De-identify by Masking

This first method of deIdentification masks SNP probe intensity mean by zero.
As a consequence, the allele frequency will be 0.5. 

```{r inf10, eval=FALSE}

deIdentify(res_grn$dest_file, sprintf("%s/deidentified_Grn.idat", tmp))
deIdentify(res_red$dest_file, sprintf("%s/deidentified_Red.idat", tmp))

betas1 = getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas2 = getBetas(readIDATpair(sprintf("%s/deidentified", tmp)))

head(betas1[grep('rs',names(betas1))]) 
head(betas2[grep('rs',names(betas2))])
```

Note that before deIdentify, the rs values will all be different. After
deIdentify, the rs values will all be masked at an intensity of 0.5. 

## De-identify by Scrambling

This second method of deIdentification will scramble the intensities using
a secret key to help formalize a random number. Therefore, randomize needs
to be set to TRUE. 

```{r inf11, eval=FALSE}

my_secret <- 13412084
set.seed(my_secret)

deIdentify(res_grn$dest_file,
    sprintf("%s/deidentified_Grn.idat", tmp), randomize=TRUE)

my_secret <- 13412084
set.seed(my_secret)
deIdentify(res_red$dest_file,
    sprintf("%s/deidentified_Red.idat", tmp), randomize=TRUE)

betas1 = getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas2 = getBetas(readIDATpair(sprintf("%s/deidentified", tmp)))

head(betas1[grep('rs',names(betas1))]) 
head(betas2[grep('rs',names(betas2))]) 

```
Note that the rs values are scrambled after deIdentify.  

## Re-identify

To restore order of the deIdentified intensities, one can re-identify IDATs.
The reIdentify function can thus restore the scrambled SNP intensities. 

```{r inf12, eval=FALSE}

my_secret <- 13412084
set.seed(my_secret)

reIdentify(sprintf("%s/deidentified_Grn.idat", tmp),
    sprintf("%s/reidentified_Grn.idat", tmp))

my_secret <- 13412084
set.seed(my_secret)
reIdentify(sprintf("%s/deidentified_Red.idat", tmp),
    sprintf("%s/reidentified_Red.idat", tmp))

betas1 = getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas2 = getBetas(readIDATpair(sprintf("%s/reidentified", tmp)))

head(betas1[grep('rs',names(betas1))]) 
head(betas2[grep('rs',names(betas2))]) 
```

Note that reIdentify restored the values. Subsequently, they are the same as
betas1. 

# Session Info

```{r}
sessionInfo()
```