---
title: "The TwoRegression Package"
author: "Paul R. Hibbing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The TwoRegression Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## INTRODUCTION

The TwoRegression package allows users to quickly and accurately develop/apply two-regression algorithms to data from research-grade wearable devices. This vignette is designed to demonstrate usage of the package's core features.

Before getting into that, it's valuable to cover some history. The package was initially established as a home for the models of [Hibbing et al. (2018)](https://pubmed.ncbi.nlm.nih.gov/29271847). Since the initial release, support has been added for developing/applying new models, as well as applying others from prior research (see [Crouter et al. (2006)](https://pubmed.ncbi.nlm.nih.gov/16322367/), [Crouter et al. (2010)](https://pubmed.ncbi.nlm.nih.gov/20400882/), and [Crouter et al. (2012)](https://pubmed.ncbi.nlm.nih.gov/22143114/)). As of version 1.0.0, a new approach has been implemented for invoking prior methods via the `TwoRegression` function, which we will look at in the following section. Afterwards, we will cover the process of creating and cross-validating new models, plus other aspects of using them effectively.

## IMPLEMENTING TWO-REGRESSION MODELS THAT ALREADY EXIST

#### Approach and Options

Prior models are implemented using the `TwoRegression` function.
Currently, support is available for the following:

* [Crouter et al. (2006)](https://pubmed.ncbi.nlm.nih.gov/16322367/): The original Crouter two-regression model (for adults)
* [Crouter et al. (2010)](https://pubmed.ncbi.nlm.nih.gov/20400882/): The refined Crouter two-regression model (for adults)
* [Crouter et al. (2012)](https://pubmed.ncbi.nlm.nih.gov/22143114/): Two youth-specific Crouter two-regression models for activity count data, one using vertical axis counts and the other using vector magnitude counts
* [Hibbing et al. (2018)](https://pubmed.ncbi.nlm.nih.gov/29271847/): Fifteen two-regression models for adults, each corresponding to one of five attachment sites (hip, both wrists, and both ankles) and one of three model configurations (accelerometer only; accelerometer and gyroscope; or accelerometer, gyroscope, and magnetometer)

#### Help documentation

It's very important that you look at the `TwoRegression` function documentation. It will help you understand what settings you need to provide in order to run a specific model correctly. To view the documentation, run the following:

```{r, eval = FALSE}
?TwoRegression::TwoRegression
```

This will pull up a documentation page where you can see the syntax for calling the `TwoRegression` function. Importantly, the page also lists the syntax for several internal applicators (i.e., `crouter_2006`, `crouter_2010`, `crouter_2012`, and `hibbing_2018`), which are the functions that actually do the work of applying your selected model. That is, the `TwoRegression` function is just a wrapper around those other internal functions, and based on the method you select, `TwoRegression` will call out to the corresponding applicator. In most cases, you will need to designate some extra settings for the applicator, which is why the syntax is listed in the documentation file alongside the `TwoRegression` syntax.
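
The wrapper pattern described above can be sketched in a few lines of base R. This is only a toy illustration, not the package's actual source: `toy_wrapper`, `toy_applicator_a`, and `toy_applicator_b` are hypothetical names, and the `scale` and `offset` arguments are invented for the example. It shows how a wrapper can pick an applicator based on a method string and forward any extra arguments through `...`.

```r
## Hypothetical applicators (NOT the package's internal functions)
toy_applicator_a <- function(x, scale = 1) x * scale
toy_applicator_b <- function(x, offset = 0) x + offset

## Hypothetical wrapper: selects an applicator by name and forwards
## any extra arguments to it via `...`
toy_wrapper <- function(x, method = c("A", "B"), ...) {
  method <- match.arg(method)
  switch(
    method,
    "A" = toy_applicator_a(x, ...),
    "B" = toy_applicator_b(x, ...)
  )
}

## `scale` and `offset` are not arguments of `toy_wrapper`, but they
## reach the selected applicator as if they were
result_a <- toy_wrapper(1:3, "A", scale = 2)
result_b <- toy_wrapper(1:3, "B", offset = 10)
```

In this toy version, the caller never touches the applicators directly, which is why the settings for each applicator need to be documented next to the wrapper.
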
Arguments for the internal functions can be passed into the `TwoRegression` function directly, as if they were arguments to that function itself. This will be easier to see and understand in the coding samples later on in this section, but it's important to be aware of all this from the get-go.

#### Assumptions and Data

The `TwoRegression` function operates under the assumption you already have data read into R. You can do this with the `AGread` package.

```{r, eval=FALSE}
if (!"remotes" %in% installed.packages()) install.packages("remotes")
if (!"AGread" %in% installed.packages()) remotes::install_github("paulhibbing/AGread")
```

For the sake of this illustration, the TwoRegression package provides some sample data we can use. If you can get your own data into a form that mirrors the sample data below, you'll be in good shape. Here's how to access it:

```{r}
data(count_data, package = "TwoRegression")
data(all_data, package = "TwoRegression")
```

The `count_data` object contains activity count data (for the Crouter two-regression models), while the `all_data` object contains raw sensor data (for the Hibbing models). We can view the first few rows of count data as follows:

```{r}
utils::head(count_data)
```

We can do the same using a similar approach for the raw data. However, let's remove some extraneous variables first.

```{r}
all_data <- all_data[ ,setdiff(
  names(all_data),
  ## These are the variables to remove:
  c(
    "file_source_PrimaryAccel", "date_processed_PrimaryAccel",
    "file_source_IMU", "date_processed_IMU",
    "day_of_year", "minute_of_day",
    ## Remove the following because they'll be recalculated later
    "ENMO_CV10s", "GVM_CV10s", "Direction"
  )
)]
utils::head(all_data)
```

#### Applying Crouter models (for activity count data)

Once you have your dataset ready, it's easy to apply a two-regression model.
Just invoke the `TwoRegression` function like this:

```{r}
crouter2006_results <- TwoRegression::TwoRegression(
  count_data, "Crouter 2006",
  movement_var = "Axis1", time_var = "time"
)

crouter2010_results <- TwoRegression::TwoRegression(
  count_data, "Crouter 2010",
  movement_var = "Axis1", time_var = "time"
)

crouter2012_va_results <- TwoRegression::TwoRegression(
  count_data, "Crouter 2012",
  movement_var = "Axis1", time_var = "time",
  model = "VA", check = FALSE
)

crouter2012_vm_results <- TwoRegression::TwoRegression(
  count_data, "Crouter 2012",
  movement_var = "Vector.Magnitude", time_var = "time",
  model = "VM", check = FALSE
)
```

For the Crouter 2012 models, you have to choose between the vertical axis model and the vector magnitude model. If you don't set `check = FALSE`, you will get a warning about which movement variable and model you've selected. This is meant as a prompt for you to ensure your selected movement variable matches your selected model. Once you're confident in your selection, you can set `check = FALSE` and the warning won't show up.

For the time being, you can only implement Crouter models one at a time. Of course, you can combine the output from multiple models yourself. Ideally, with ongoing development, a point will come where this can be done automatically and efficiently (see the [GitHub issue on this topic](https://github.com/paulhibbing/TwoRegression/issues/1)), but for now it isn't built in. As we'll see in the following subsection, though, it *is* doable for the Hibbing models.

Here's a look at the output from the prior commands:

```{r}
utils::head(crouter2006_results)
utils::head(crouter2010_results)
utils::head(crouter2012_va_results)
utils::head(crouter2012_vm_results)
```

#### Applying Hibbing models (for raw sensor data)

The Hibbing models are implemented similarly to the Crouter models. A key difference, though, is that you can ask the function to run multiple models simultaneously.
That's what we'll see in the following example:

```{r}
hibbing2018_results <- TwoRegression::TwoRegression(
  all_data, "Hibbing 2018",
  accel_var = "ENMO",
  gyro_var = "Gyroscope_VM_DegPerS",
  direction_var = "mean_magnetometer_direction",
  ## Here is where we can select an algorithm from multiple sites:
  site = c("Left Ankle", "Right Ankle"),
  ## And here is where we can select multiple algorithms
  ## (1 = accelerometer only; 2 = accelerometer and gyroscope;
  ## 3 = accelerometer, gyroscope, and magnetometer)
  algorithm = 1:2,
  ## We can also ask the function to collapse data every minute by making an
  ## extra call to `smooth_2rm`
  smooth = TRUE
)
utils::head(hibbing2018_results)
```

So, each algorithm is run, and its output is stored in a variable with a unique and descriptive name.

## FITTING AND EXAMINING NEW MODELS

#### Background and Setup

The TwoRegression package is also useful if you want to create your own model. To get this going, though, your dataset needs to have some more complex information in it. We'll use our previous `all_data` object in this illustration. First, we need to label it with some pretend activity labels and energy expenditure values (METs). In a real-life setting, the MET values would likely come from indirect calorimetry. To create some of this imaginary data, we can run the following:

```{r}
set.seed(307)

fake_sed <- c("Lying", "Sitting")
fake_lpa <- c("Sweeping", "Dusting")
fake_cwr <- c("Walking", "Running")
fake_ila <- c("Tennis", "Basketball")
fake_activities <- c(fake_sed, fake_lpa, fake_cwr, fake_ila)

all_data$Activity <- sample(fake_activities, nrow(all_data), TRUE)

all_data$fake_METs <- ifelse(
  all_data$Activity %in% c(fake_sed, fake_lpa),
  runif(nrow(all_data), 1, 2),
  runif(nrow(all_data), 2.5, 8)
)
```

For this demonstration, a couple of extra hacks are needed, which would be much more natural to handle with real data. Still, they're helpful to see.
First, we need to make sure our dataset has a column indicating which participant each data point came from. In this case, we'll just label our data to pretend it came from two sample files instead of one (where 'sample file' is analogous to 'participant'). The other step is calculating the coefficient of variation (CV). We technically could have avoided this by choosing not to delete the CV variables earlier. But that decision now gives us an excuse to show how convenient it is to calculate CV in the TwoRegression package. There were also some technical reasons for deleting the variables earlier, but never mind that (see [another GitHub issue](https://github.com/paulhibbing/TwoRegression/issues/2) if you're curious).

```{r}
all_data$PID <- rep(
  c("Test1", "Test2"),
  each = ceiling(nrow(all_data) / 2)
)[seq(nrow(all_data))]

all_data$ENMO_CV10s <- TwoRegression::cv_2rm(all_data$ENMO)
```

#### Fitting the Model

When we go to fit the model, we'll use the `fit_2rm` function. There are a lot of arguments to provide here:

* **data:** The dataset
* **activity_var:** The name of the variable that indicates the activity
* **sed_cp_activities:** The subset of values from `activity_var` that should be included when calibrating the 2RM sedentary cut point
* **sed_activities:** The subset of values from `activity_var` that should be labeled as positive for sedentary behavior when calibrating the 2RM sedentary cut point
* **sed_cp_var:** The name of the variable for which the 2RM sedentary cut point should be calibrated
* **sed_METs:** The MET value to assign for sedentary behaviors (i.e., when the value for `sed_cp_var` falls below the 2RM sedentary cut point)
* **walkrun_activities:** The subset of values from `activity_var` that should be labeled as positive for "continuous walking/running" (CWR) when calibrating the 2RM CWR cut point
* **walkrun_cp_var:** The name of the variable for which the 2RM CWR cut point should be calibrated
* **met_var:** The name of the MET variable that the 2RM should be fitted to predict
* **walkrun_formula:** A character representation of the formula that should be used when fitting the CWR model (formulated as `outcome ~ predictors`)
* **intermittent_formula:** A character representation of the formula that should be used when fitting the intermittent lifestyle activities model (formulated like `walkrun_formula` -- note that data transformations like squaring or cubing should be wrapped in `I()`)

From there, we can fit our model like this:

```{r}
my_model <- TwoRegression::fit_2rm(
  data = all_data,
  activity_var = "Activity",
  sed_cp_activities = c(fake_sed, fake_lpa),
  sed_activities = fake_sed,
  sed_cp_var = "ENMO",
  sed_METs = 1.25,
  walkrun_activities = fake_cwr,
  walkrun_cp_var = "ENMO_CV10s",
  met_var = "fake_METs",
  walkrun_formula = "fake_METs ~ ENMO",
  intermittent_formula = "fake_METs ~ ENMO + I(ENMO^2) + I(ENMO^3)"
)
```

#### Examining Model Performance

The package provides summary and plot methods to understand, cross-validate, and visualize the model. Notably, this demonstration model is not meant to perform well or look pretty (the data are just numbers that have no real meaning), but we'll still take a look at how to run the code.

As far as the summary method goes, this is where we need the participant identification column we set up earlier. Specifically, it will be used for leave-one-out cross-validation, where the data are split up into different chunks while the model is repeatedly re-fitted.
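
The leave-one-out idea can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's summary method: the data frame `toy`, its columns, and the simple linear model are all invented for the example. Each participant is held out in turn, the model is re-fitted on the remaining participants, and predictions are made for the held-out participant.

```r
set.seed(1)

## Hypothetical data: three participants, 20 observations each
toy <- data.frame(
  PID = rep(c("P1", "P2", "P3"), each = 20),
  ENMO = runif(60, 0, 500)
)
toy$METs <- 1 + 0.01 * toy$ENMO + rnorm(60, sd = 0.5)

## Hold out one participant at a time, re-fit, and predict
toy$predicted <- NA_real_
for (pid in unique(toy$PID)) {
  held_out <- toy$PID == pid
  fit <- stats::lm(METs ~ ENMO, data = toy[!held_out, ])
  toy$predicted[held_out] <- stats::predict(fit, newdata = toy[held_out, ])
}

## Root mean squared error of the held-out predictions
loso_rmse <- sqrt(mean((toy$METs - toy$predicted)^2))
loso_rmse
```

Splitting by participant rather than by row keeps all of a participant's data out of the training set for that fold, which is why a participant identification column is required.
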
Other information in the output includes a textual representation of the overall algorithm and summaries of the fit/performance of individual components (i.e., ROC and regression analyses). To pull all of this up, you just have to run code that matches the following pattern:

```{r}
summary(
  my_model,
  subject_var = "PID",
  MET_var = "fake_METs",
  activity_var = "Activity"
)
```

For the plot function, you'll need to fill in some of the same values from the original call to `fit_2rm`. Use code that matches the following pattern:

```{r, fig.align='center', fig.width=12, fig.height=6, out.width="85%"}
## You have to explicitly type `object = ` for this to work
plot(
  object = my_model,
  sed_cp_activities = c(fake_sed, fake_lpa),
  sed_activities = fake_sed,
  sed_cpVar = "ENMO",
  activity_var = "Activity",
  met_var = "fake_METs",
  walkrun_activities = fake_cwr,
  walkrun_cpVar = "ENMO_CV10s",
  print = TRUE
)
```

## USING NEW MODELS

Once you've created your model, you want to use it on new data. That's easy to do using the `predict` method included in the package. If we pretend our `all_data` object is a new dataset, we could get predictions by running code like this:

```{r}
new_results <- predict(my_model, all_data)
utils::head(new_results)
```

When making predictions, you can specify `verbose = TRUE` if you want to print a message to the console about making predictions from your model. By default, it will say it's making predictions using the 'user_unspecified' model. To give your model a name, you can assign a value to its `method` element. Consider the following:

```{r}
results_default <- predict(my_model, all_data, verbose = TRUE)

my_model$method <- "My Customized 2RM"
results_updated <- predict(my_model, all_data, verbose = TRUE)
```

And, of course, you can collapse the estimates to a particular time granularity using `smooth_2rm` like this:

```{r}
## This code illustrates collapsing every 60 seconds. (This is the default
## period and also the typical recommendation, but you could do anything,
## e.g., "10 sec", "30 sec", or "0.25 hour")
TwoRegression::smooth_2rm(results_updated, "Timestamp", "60 sec")
```

## CONCLUSION

That's it. This has been a quick crash course in the core features and functions of the TwoRegression package. If you have questions or feedback, feel free to connect by [posting an issue on the TwoRegression GitHub page](https://github.com/paulhibbing/TwoRegression/issues/new). Happy coding!