By default, the Poisson, binomial, and normal models in bage assume that any measurement errors in the input data are small enough to be ignored. These models can, however, be extended to accommodate various types of measurement error. This is done by adding a “data model”—also referred to as a “measurement error model”—to the base model. The data models that have been implemented so far in bage are fairly generic: they aim to perform reasonably well on a wide range of applications, rather than performing optimally on any particular application. Future versions of bage will add some more specialised data models.
This vignette begins with an overview of the current menu of data models. It then shows how the simple models presented in the overview can be extended to deal with more complicated situations, and discusses forecasting and confidentialization.
The overview will use a model of births in a single Korean province in a single year. The input data is:
births
#> # A tibble: 9 × 3
#>   age   births  popn
#>   <chr>  <int> <int>
#> 1 10-14      0 27084
#> 2 15-19      9 25322
#> 3 20-24    110 23935
#> 4 25-29    912 28936
#> 5 30-34   2547 30964
#> 6 35-39   1235 31611
#> 7 40-44    262 44567
#> 8 45-49      6 41774
#> 9 50-54      0 51312Our base model treats the input data as error free. We specify and fit the base model as follows.
library(bage)
library(dplyr)
mod_base <- mod_pois(births ~ age,
                     data = births,
                     exposure = popn) |>
  fit()
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
mod_base
#> 
#>     ------ Fitted Poisson model ------
#> 
#>    births ~ age
#> 
#>                  exposure: popn
#> 
#>         term  prior along n_par n_par_free std_dev
#>  (Intercept) NFix()     -     1          1       -
#>          age   RW()   age     9          9    1.27
#> 
#>  disp: mean = 1
#> 
#>  n_draw var_age optimizer
#>    1000     age    nlminb
#> 
#>  time_total time_max time_draw iter converged
#>        0.02     0.01      0.01   15      TRUE
#>                                            message
#>    both X-convergence and relative convergence (5)The base model yields the following estimates for birth rates:
In full-sized applications, we generally want inclusion probabilities, coverage rates, standard deviations, and coefficients of variation to vary across dimensions such as age, sex, and time. The data models implemented in bage allow this sort of variation.
We illustrate the use of varying data model parameters using a slightly extended version of our births model. We use a new dataset that includes a time variable:
births_time
#> # A tibble: 27 × 4
#>    age    time births  popn
#>    <chr> <int>  <int> <int>
#>  1 10-14  2021      0 27208
#>  2 10-14  2022      0 27185
#>  3 10-14  2023      0 27084
#>  4 15-19  2021      3 25039
#>  5 15-19  2022     16 25276
#>  6 15-19  2023      9 25322
#>  7 20-24  2021    178 29008
#>  8 20-24  2022    120 26201
#>  9 20-24  2023    110 23935
#> 10 25-29  2021   1212 30826
#> # ℹ 17 more rowsWe create a new prior for inclusion probabilities where the mean and dispersion vary over time:
prob_under_time <- data.frame(time = c(2021, 2022, 2023),
                              mean = c(0.99, 0.8,  0.99),
                                    disp = c(0.01, 0.02, 0.01))We implement one model with the old time-constant prior, and one with the new time-varying prior.
mod_timeconst <- mod_pois(births ~ age + time,
                          data = births_time,
                                exposure = popn) |>
  set_datamod_undercount(prob = prob_under)
mod_timevarying <- mod_timeconst |>
  set_datamod_undercount(prob = prob_under_time)
#> → Replacing existing "undercount" data model with new "undercount" data modelAs we would expect, these two models give different results.
Data model parameters can vary across more than one dimension. We set up a prior that varies across age and time. Rather than using different inclusion probabilities for every age group, however, we use one set for people aged 10-34, and one for people aged 35+. To implement this, we need to create a new, aggregated age group.
births_time <- births_time |>
  mutate(agegp = if_else(age %in% c("10-14", "15-19",
                                    "20-24", "25-29",
                                    "30-34"),
                         "10-34",
                         "35+"))                             
prob_under_agetime <- data.frame(
  time =  c(2021,    2022,    2023,    2021,  2022,  2023),
  agegp = c("10-34", "10-34", "10-34", "35+", "35+", "35+"),
  mean =  c(0.95,    0.95,    0.95,    0.95,  0.5,  0.95),
  disp =  c(0.01,    0.01,    0.01,    0.01,  0.1,  0.01))
mod_agetime <- mod_pois(births ~ age + time,
                        data = births_time,
                              exposure = popn) |>
  set_datamod_undercount(prob = prob_under_agetime)The model assumes that births for women aged 35+ in 2022 were subject to an unusual degree of under-reporting. The results reflect that assumption.
Data models can be used in forecasting applications. Implementation is easiest with the data model does not contain any time-varying parameters, like the mod_timeconst constructed above:
mod_timeconst |>
  fit() |>
  forecast(label = 2024)
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
#> `components()` for past values...
#> `components()` for future values...
#> `augment()` for future values...
#> # A tibble: 9 × 8
#>   age    time births .births  popn .observed                    .fitted
#>   <chr> <dbl>  <dbl>   <dbl> <int>     <dbl>               <rdbl<1000>>
#> 1 10-14  2024     NA      NA    NA        NA   5.5e-05 (2e-05, 0.00015)
#> 2 15-19  2024     NA      NA    NA        NA 0.00044 (0.00025, 0.00076)
#> 3 20-24  2024     NA      NA    NA        NA    0.0061 (0.0038, 0.0089)
#> 4 25-29  2024     NA      NA    NA        NA       0.041 (0.025, 0.059)
#> 5 30-34  2024     NA      NA    NA        NA         0.11 (0.066, 0.16)
#> 6 35-39  2024     NA      NA    NA        NA        0.048 (0.03, 0.069)
#> 7 40-44  2024     NA      NA    NA        NA     0.0074 (0.0047, 0.011)
#> 8 45-49  2024     NA      NA    NA        NA 0.00013 (6.8e-05, 0.00024)
#> 9 50-54  2024     NA      NA    NA        NA 1.2e-05 (2.3e-06, 4.7e-05)
#> # ℹ 1 more variable: .expected <rdbl<1000>>If future values for population are supplied, then the forecast will include true and reported values for the outcome variable:
newdata_births <- data.frame(
  age = c("10-14", "15-19", "20-24", "25-29", "30-34",
          "35-39", "40-44", "45-49", "50-54"),
  time = rep(2024, 9),
  popn = c(27084, 25322, 23935, 28936, 30964,
           31611, 44567, 41774, 51312))
mod_timeconst |>
  fit() |>
  forecast(newdata = newdata_births)
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
#> `components()` for past values...
#> `components()` for future values...
#> `augment()` for future values...
#> # A tibble: 9 × 8
#>   age    time            births           .births  popn .observed
#>   <chr> <dbl>      <rdbl<1000>>      <rdbl<1000>> <dbl>     <dbl>
#> 1 10-14  2024          1 (0, 5)          1 (0, 6) 27084        NA
#> 2 15-19  2024         9 (3, 19)        11 (4, 23) 25322        NA
#> 3 20-24  2024     116 (69, 183)     145 (91, 224) 23935        NA
#> 4 25-29  2024   942 (577, 1369)  1176 (721, 1726) 28936        NA
#> 5 30-34  2024 2592 (1584, 3897) 3249 (2032, 4824) 30964        NA
#> 6 35-39  2024  1196 (722, 1775)  1500 (942, 2220) 31611        NA
#> 7 40-44  2024    265 (158, 396)    331 (213, 501) 44567        NA
#> 8 45-49  2024         4 (0, 11)         5 (1, 13) 41774        NA
#> 9 50-54  2024          0 (0, 3)          0 (0, 4) 51312        NA
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>When the data model contains parameters that vary over time, future values for these parameters must be specified. This is done when the data model is first created. Here, for, instance, we create a version of the prob_under_time prior that includes values for 2024.
prob_under_time_ext <- rbind(
  prob_under_time,
  data.frame(time = 2024,
             mean = 0.95,
             disp = 0.05))
prob_under_time_ext
#>   time mean disp
#> 1 2021 0.99 0.01
#> 2 2022 0.80 0.02
#> 3 2023 0.99 0.01
#> 4 2024 0.95 0.05Our dataset only contains values up to 2023. When we fit our model, bage only uses prior values for the data model up to 2023. But when we forecast, bage uses the prior values for 2024.
bage allows for the possibility that the outcome variable has been subject to measurement errors and has been confidentialized. The following model assumes that births have been under-reported, and have been randomly rounded to multiples of 3.
The undercount, overcount, miscount, noise, and exposure data models allow analysts to account for general types of measurement error commonly encountered in applied demography. Like all models in bage, the data models are designed to be fast, even with large datasets.
As bage develops further, we would like to complement the existing suite of data models with additional, more specialized models. We would, for instance, like to add a special model for data, such as births data, where there are gaps between the date of occurrence and the date of registration. We welcome suggestions for specialised models.