---
title: "Example Workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Example Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r message=FALSE}
library(dplyr)
library(blockstrap)
```

# Introduction to Blockstrap 

Blockstrap is an R package for resampling data that naturally come in blocks. A common example is line-listed hospital admission data, where each subject may appear in multiple rows. These rows can be grouped into meaningful units, such as a sequence of visits by the same subject that occur close together. We refer to each such grouping as a block.

Blockstrap provides tools to perform a block bootstrap, resampling these blocks rather than resampling individual rows independently as in a traditional bootstrap.
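To make the contrast with a row-level bootstrap concrete, here is a minimal sketch of the idea in plain `dplyr`. It is an illustration only, not how Blockstrap is implemented; the toy data and the names `toy_rows`, `toy_drawn`, and `toy_resampled` are hypothetical.

```{r}
library(dplyr)

# Hypothetical toy data: six rows that fall into three blocks
toy_rows <- tibble(
  block = c("a", "a", "b", "c", "c", "c"),
  value = 1:6
)

set.seed(42)

# Step 1: draw block labels (not rows) with replacement
toy_drawn <- sample(unique(toy_rows$block), size = 3, replace = TRUE)

# Step 2: for each draw, keep every row of the drawn block
toy_resampled <- bind_rows(
  lapply(seq_along(toy_drawn), function(i) {
    toy_rows |>
      filter(block == toy_drawn[i]) |>
      mutate(draw = i)
  })
)

toy_resampled
```

Each draw contributes all rows of one block, so rows from the same block always travel together. A row-level bootstrap would instead sample the six rows independently and could split a block apart.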

In this document, we demonstrate how to:

- Create an example line-listed dataset in the hospital admission context
- Partition the dataset into blocks
- Perform a block bootstrap using the `slice_block()` function

## Generate an example dataset 

For illustration, we use the `create_fake_subjectDB()` function from the `HospitalNetwork` package to generate a fake subject database containing admission and discharge records. The dataset includes $100$ subjects, and each subject can have more than one record.

```{r message=FALSE}
library(HospitalNetwork)
set.seed(1)
subject_db <- create_fake_subjectDB()

head(subject_db)
```

Here, `subject_db` contains four columns: 

- `sID`: Subject ID
- `fID`: Facility ID
- `Adate`: Admission date for the visit
- `Ddate`: Discharge date for the visit

## Group rows into blocks 

We define a block as a sequence of visits by the same subject that occur close together. A new block begins when either:

- the row is the first record for the subject, or
- the gap between the previous discharge (`Ddate`) and the current admission (`Adate`) exceeds 40 days

```{r}
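grouped_subjects <- subject_db |>
  group_by(sID) |>
  # Work with calendar dates so that differences are measured in days
  mutate(Adate = as.Date(Adate),
         Ddate = as.Date(Ddate)) |>
  # Order each subject's visits chronologically
  arrange(Adate, .by_group = TRUE) |>
  mutate(
    # Days between the previous discharge and this admission (NA for the first visit)
    diff_time = Adate - lag(Ddate),
    # A block starts at the first visit or after a gap of more than 40 days
    is_start = is.na(diff_time) | diff_time > 40,
    # Running count of block starts within the subject
    idx_within_sid = cumsum(is_start),
    # Block ID that is unique across subjects
    idx_block = as.factor(paste0(sID, "_", idx_within_sid))
  )

head(grouped_subjects)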
```

The column `idx_block` is a factor that identifies the block each row belongs to. Rows with the same `idx_block` value belong to the same block.
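The rule can be checked by hand on a small example. The sketch below applies the same steps to four hypothetical visits by a single subject; the dates and the names `toy_visits`, `toy_blocks`, `gap_days`, and `block_num` are made up for illustration.

```{r}
library(dplyr)

# Hypothetical visits for a single subject (dates chosen by hand)
toy_visits <- tibble(
  sID   = "s1",
  Adate = as.Date(c("2020-01-01", "2020-01-20", "2020-04-01", "2020-04-10")),
  Ddate = as.Date(c("2020-01-05", "2020-01-25", "2020-04-03", "2020-04-12"))
)

toy_blocks <- toy_visits |>
  group_by(sID) |>
  arrange(Adate, .by_group = TRUE) |>
  mutate(
    gap_days  = as.numeric(Adate - lag(Ddate)),  # NA, 15, 67, 7
    is_start  = is.na(gap_days) | gap_days > 40,
    block_num = cumsum(is_start)                 # 1, 1, 2, 2
  ) |>
  ungroup()

toy_blocks
```

The first two visits are 15 days apart and form one block. The 67-day gap before the third visit starts a second block, which the fourth visit joins. Because the comparison is strict, a gap of exactly 40 days would not start a new block.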

```{r}
nrow(distinct(grouped_subjects, idx_block))
```

There are $`r nrow(distinct(grouped_subjects, idx_block))`$ unique blocks in the dataset.


## Block bootstrap

With the block IDs defined, we use the `slice_block()` function to perform a block bootstrap. Here, we sample 10 blocks with replacement:

```{r}
blockstrapped_db <- grouped_subjects |>
  group_by(idx_block) |>
  slice_block(n = 10, replace = TRUE)

head(blockstrapped_db)
```

`blockstrapped_db` is the dataset produced by the block bootstrap and can be used for further statistical analysis. The example above samples blocks with equal weights, but `slice_block()` also supports weighted sampling. For instance, we can weight each block by its size (its number of rows), so that larger blocks have a higher probability of being selected:

```{r}
blockstrapped_db <- grouped_subjects |>
  group_by(idx_block) |>
  slice_block(n = 10, replace = TRUE, weight_by = n())

blockstrapped_db
```
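As a rough illustration of what size-based weighting means, the sketch below computes selection probabilities proportional to block size on hypothetical toy data and draws block labels with base R's `sample()`. It assumes that weights are normalised to probabilities in the usual way; it is not a description of the internals of `slice_block()`, and the names `toy_sizes` and `toy_draws` are made up.

```{r}
library(dplyr)

# Hypothetical block labels for six rows: block sizes are a = 2, b = 1, c = 3
toy_sizes <- tibble(block = c("a", "a", "b", "c", "c", "c")) |>
  count(block, name = "size") |>
  mutate(prob = size / sum(size))  # a: 2/6, b: 1/6, c: 3/6

toy_sizes

set.seed(42)

# Draw many block labels with these probabilities and inspect the frequencies
toy_draws <- sample(toy_sizes$block, size = 10000, replace = TRUE,
                    prob = toy_sizes$prob)
prop.table(table(toy_draws))
```

With these weights, block `c` should be drawn about three times as often as block `b`, and the empirical frequencies should be close to the `prob` column.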