---
title: "Example Workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{egworkflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r message=FALSE}
library(dplyr)
library(blockstrap)
```

# Introduction to Blockstrap

Blockstrap is an R package for resampling data that naturally come in blocks. A common example is line-listed hospital admission data, where each subject may appear in multiple rows. These rows can be grouped into meaningful units, such as a sequence of visits by the same subject that occur close together. We refer to each such grouping as a block. Blockstrap provides tools to perform a block bootstrap, which resamples whole blocks rather than resampling individual rows independently as in a traditional bootstrap.

In this document, we demonstrate how to:

- Create an example line-listed dataset in the hospital admission context
- Partition the dataset into blocks
- Perform a block bootstrap using the `slice_block()` function

## Generate an example dataset

For illustration, we use the `create_fake_subjectDB()` function from the `HospitalNetwork` package to generate a fake subject database containing admission/discharge records. Note that this dataset includes $100$ subjects and each subject can have more than one record.

```{r message=FALSE}
library(HospitalNetwork)
set.seed(1)
subject_db <- create_fake_subjectDB()
head(subject_db)
```

Here, `subject_db` contains four columns:

- `sID`: Subject ID
- `fID`: Facility ID
- `Adate`: Admission date for the visit
- `Ddate`: Discharge date for the visit

## Group rows into blocks

We define a block as a sequence of visits by the same subject that occur close together. A new block begins when:

- It is the first record for the subject, or
- More than 40 days have elapsed between the previous discharge (`Ddate`) and the current admission (`Adate`)

```{r}
grouped_subjects <- subject_db |>
  group_by(sID) |>
  mutate(Adate = as.Date(Adate), Ddate = as.Date(Ddate)) |>
  # Order each subject's visits chronologically before comparing neighbours
  arrange(Adate, .by_group = TRUE) |>
  mutate(
    # Gap in days since the previous discharge (NA for a subject's first visit)
    diff_time = Adate - lag(Ddate),
    is_start = is.na(diff_time) | diff_time > 40,
    # Running count of block starts within each subject
    idx_within_sid = cumsum(is_start),
    idx_block = as.factor(paste0(sID, "_", idx_within_sid))
  )
head(grouped_subjects)
```

The column `idx_block` is a factor that identifies which block each row belongs to. Rows with the same `idx_block` value belong to the same block.

```{r}
nrow(distinct(grouped_subjects, idx_block))
```

There are $`r nrow(distinct(grouped_subjects, idx_block))`$ unique blocks in the dataset.

## Block bootstrap

With the block IDs defined, we use the `slice_block()` function to perform a block bootstrap. Here, we sample 10 blocks with replacement:

```{r}
blockstrapped_db <- grouped_subjects |>
  group_by(idx_block) |>
  slice_block(n = 10, replace = TRUE)
head(blockstrapped_db)
```

`blockstrapped_db` is the resulting dataset sampled using the block bootstrap and can be used for further statistical analysis.

The example above samples blocks with equal weights, but the `slice_block()` function also allows for weighted sampling. For instance, we can weight by block size so that larger blocks have a higher probability of being selected:

```{r}
blockstrapped_db <- grouped_subjects |>
  group_by(idx_block) |>
  slice_block(n = 10, replace = TRUE, weight_by = n())
blockstrapped_db
```
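
To make the mechanics of block resampling concrete, here is a minimal base R sketch of the idea: draw block IDs (not rows) with replacement, then keep every row of each drawn block. This is only an illustration under stated assumptions and is not how `slice_block()` is implemented. The helper `resample_blocks()`, its `weighted` argument, the added `draw` column, and the `toy` data are all hypothetical names introduced for this example.

```{r}
# Toy line list with three blocks of sizes 2, 1 and 3 (made-up data).
toy <- data.frame(
  idx_block = c("a_1", "a_1", "a_2", "b_1", "b_1", "b_1"),
  visit     = 1:6
)

# Hypothetical helper for illustration only; not part of blockstrap.
resample_blocks <- function(data, block_col, n, weighted = FALSE) {
  # Number of rows in each block, named by block ID
  sizes <- table(as.character(data[[block_col]]))
  ids   <- names(sizes)

  # Optionally make the selection probability proportional to block size
  prob <- if (weighted) as.numeric(sizes) / sum(sizes) else NULL

  # Step 1: sample block IDs with replacement
  drawn <- sample(ids, size = n, replace = TRUE, prob = prob)

  # Step 2: collect all rows of each drawn block; `draw` tells apart
  # repeated draws of the same block
  pieces <- lapply(seq_along(drawn), function(i) {
    rows <- data[data[[block_col]] == drawn[i], , drop = FALSE]
    rows$draw <- i
    rows
  })
  do.call(rbind, pieces)
}

set.seed(1)
equal_draws <- resample_blocks(toy, "idx_block", n = 4)
sized_draws <- resample_blocks(toy, "idx_block", n = 4, weighted = TRUE)
```

Each draw contributes a complete block, so the number of rows in the result depends on which blocks happen to be selected. Rows within a block always stay together, which is what distinguishes this from a row-level bootstrap.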