Introduction to the pervasive package

The pervasive package is designed to help assess the pervasiveness of effects in correlational data by providing the Observed Percentage of Concordant Pairs (OPCP) and association rule mining for binned variables.

If researchers want OPCP values without conducting association rule mining, they may use OPCP(), OPCP_glm() for binary outcomes, or OPCP_mat() for a whole matrix. These OPCP functions do not apply any binning.

The pervasive_tric(), pervasive_tric_glm(), pervasive_dic(), and pervasive_dic_glm() functions all provide the OPCP along with the results of targeted association rule mining. The function pervasive_tric() is likely the go-to choice for most research contexts. When the sample size is too small for trichotomization to make sense, or when too many variables are binary, researchers may use pervasive_dic() instead. The functions pervasive_tric_glm() and pervasive_dic_glm() should be preferred when the outcome is binary.

This vignette will walk you through examples for the pervasive functions using data from the psychTools package.

#> Install from CRAN (not yet available)
# install.packages("pervasive")

#> Install the development version from GitHub
# devtools::install_github("dlajoiemoncton/pervasive")

# Load the package
library(pervasive)

We can start with a quick overview of the OPCP_mat(), OPCP(), and OPCP_glm() functions. The OPCP is easy to calculate for any model (correlate predicted and observed values using the “kendall” method, divide the result by 2, and add .5), but for convenience, the pervasive package provides functions for linear and logistic regression. In this vignette, we will use data from the psychTools package to illustrate the functions.

#> Example using the spi dataset from the psychTools package.

#> Big 5 factor scores are calculated as suggested in the psych package documentation. Variables should be scored prior to the analysis.
sc <- psych::scoreVeryFast(psychTools::spi.keys, psychTools::spi)

#> Trait scores are combined with the original dataset
spi_sc <- cbind(psychTools::spi, sc)

#> Let's use age as the outcome of interest and the Big 5 as predictors for the regression models, and sex as the outcome for the binomial logistic regression models.

spi_sc_age_sex_B5 <- spi_sc |>
  dplyr::select(age, sex, Agree, Consc, Neuro, Extra, Open) |>
  na.omit()

spi_sc_age <- spi_sc_age_sex_B5 |>
  dplyr::select(age, Agree, Consc, Neuro, Extra, Open)

spi_sc_sex <- spi_sc_age_sex_B5 |>
  dplyr::select(sex, Agree, Consc, Neuro, Extra, Open)
 
#> The glm() function expects the outcome to be coded 0 and 1, but sex is coded 1 and 2 (NAs for sex were already removed by na.omit() above).
spi_sc_sex$sex <- spi_sc_sex$sex - 1

The OPCP_mat() function works like cor(). Instead of returning a correlation matrix, it provides a matrix containing Pearson’s r in the lower triangle and OPCP values in the upper triangle. It takes somewhat longer to run than a standard correlation table, which may be worth considering if you have a very large number of variables. In the resulting matrix we can see, for example, that in approximately 56% of pairs of participants (excluding ties), the person with the higher agreeableness score is also the older participant.

OPCP_mat(spi_sc_age_sex_B5)
#>        Mean    SD   age   sex Agree Consc Neuro Extra  Open
#> age   26.93 11.47  1.00 46.51 55.91 56.43 44.00 48.66 54.96
#> sex    1.60  0.49 -0.05  1.00 57.16 53.60 59.91 52.38 43.65
#> Agree  4.25  0.79  0.18  0.17  1.00 58.55 44.89 57.89 50.27
#> Consc  3.98  0.81  0.18  0.09  0.24  1.00 43.42 52.41 49.97
#> Neuro  3.73  1.02 -0.17  0.24 -0.12 -0.19  1.00 43.32 46.17
#> Extra  3.47  0.95 -0.02  0.06  0.23  0.07 -0.20  1.00 53.93
#> Open   4.64  0.66  0.13 -0.15  0.00  0.01 -0.12  0.13  1.00

The other pervasive functions expect a formula argument of the same form as in a typical lm() or glm() call.

formula <- age ~ Agree + Consc + Neuro + Extra + Open 

formula_glm <- sex ~ Agree + Consc + Neuro + Extra + Open

#> This would also be acceptable: formula <- formula(age ~ Agree + Consc + Neuro + Extra + Open)
#> It is possible to include a single predictor when working in a bivariate case.

OPCP() and OPCP_glm() provide OPCPs for regression models. In the context of binomial logistic regression, researchers already have many performance measures based on the confusion matrix (e.g., accuracy, specificity, the Jaccard index), so the OPCP might feel like simply one more. One advantage of the OPCP is that it allows the performance of glm() and lm() models to be compared directly on the same metric. In the examples below, personality appears to carry more information about sex than about age.

OPCP(formula = formula, data = spi_sc_age)
#> [1] 0.6078206
OPCP_glm(formula = formula_glm, data = spi_sc_sex)
#> [1] 0.6491876
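
To make the calculation concrete, here is a sketch of the hand computation described earlier, using the objects defined above (the names fit_age and fit_sex are just illustrative, and the handling of ties may differ slightly from the package’s internal calculation):

# OPCP by hand for the linear model: Kendall correlation of predicted and observed values, / 2 + .5
fit_age <- lm(formula, data = spi_sc_age)
cor(predict(fit_age), spi_sc_age$age, method = "kendall") / 2 + 0.5

# The same logic for the logistic model, using predicted probabilities
fit_sex <- glm(formula_glm, data = spi_sc_sex, family = binomial)
cor(predict(fit_sex, type = "response"), spi_sc_sex$sex, method = "kendall") / 2 + 0.5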

The remaining functions (pervasive_tric(), pervasive_dic(), and their _glm() counterparts) include results from association rule mining. This allows some data exploration and provides context for the models. These functions require an additional parameter, min_support.

This parameter sets the minimum frequency with which a configuration must appear in the data. The idea is to set it low enough that we do not miss interesting patterns, but high enough that the retained configurations are not spurious.

If a dataset is smaller (everything is contextual, but, say, below 300 observations), it may be worthwhile to use pervasive_dic() instead of pervasive_tric() and to choose a higher min_support. The old rule of thumb of n > 30 for the smallest class is an easy “minimum” to aim for, but context always matters. Here, .03 * 3946 = 118.38, so configurations with fewer than 119 individuals will not be considered to have enough support to be of note.
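
When picking a value for min_support, it can help to translate the proportion into an absolute count for your own data (and back). A small sketch of that arithmetic for the dataset used here:

# Translate min_support into an absolute count of individuals
n <- nrow(spi_sc_age)   # 3946 complete cases in this example
n * 0.03                # 118.38: a configuration needs at least 119 individuals
# And the reverse: the min_support that corresponds to at least 30 individuals
30 / n                  # roughly 0.0076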

example <- pervasive_tric(formula = formula, data = spi_sc_age, min_support = .03)
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime      support minlen
#>           0    0.1    1 none FALSE            TRUE       5 0.0002534212      1
#>  maxlen target  ext
#>      10  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 1 
#> 
#> set item appearances ...[18 item(s)] done [0.00s].
#> set transactions ...[18 item(s), 3946 transaction(s)] done [0.00s].
#> sorting and recoding items ... [18 item(s)] done [0.00s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4 5 6 done [0.00s].
#> writing ... [3048 rule(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].

example
#>                                                       Description
#>                             The rule suggested by your regression
#>              The rule suggested by the low end of your regression
#>    The rule with the highest lift for high values of your outcome
#>  The rule with the highest lift for medium values of your outcome
#>     The rule with the highest lift for low values of your outcome
#>                                                                        LHS
#>  {tri_Agree=high,tri_Consc=high,tri_Neuro=low,tri_Extra=low,tri_Open=high}
#>   {tri_Agree=low,tri_Consc=low,tri_Neuro=high,tri_Extra=high,tri_Open=low}
#>                              {tri_Agree=high,tri_Consc=high,tri_Neuro=low}
#>                                           {tri_Consc=medium,tri_Open=high}
#>                                               {tri_Consc=low,tri_Open=low}
#>               RHS     Support Confidence    Coverage     Lift
#>    {tri_age=high} 0.001520527  0.6666667 0.002280791 1.963184
#>     {tri_age=low} 0.003294475  0.6500000 0.005068424 2.078525
#>    {tri_age=high} 0.044095286  0.5938567 0.074252408 1.748775
#>  {tri_age=medium} 0.049163710  0.4190065 0.117334009 1.205102
#>     {tri_age=low} 0.046122656  0.4802111 0.096046629 1.535586


#> or print(pervasive_tric(formula = formula, data = spi_sc_age, min_support = .03))

From the results, it appears we would be rather unlikely to meet individuals with the patterns of personality traits that a linear regression suggests for older and younger people once the data are trichotomized. The regression model explains ~9% of the variance, which coincides with an observed percentage of concordant pairs of ~61%, as reported above. The middle category of age appears to be rather difficult to predict, with a lift of about 1.2 being the best achieved (lift can broadly be interpreted like an overall or omnibus odds ratio).

In the frequency tables below, we can see that the bins are of roughly equal size. However, the cutoffs produce unequal ranges for many variables, which is typical with skewed variables.

example$freq_tables
#> $age
#> 
#> [11,20) [20,28) [28,80] 
#>    1234    1372    1340 
#> 
#> $Agree
#> 
#>    [1,4) [4,4.64) [4.64,6] 
#>     1271     1340     1335 
#> 
#> $Consc
#> 
#> [1.29,3.64) [3.64,4.36)    [4.36,6] 
#>        1261        1345        1340 
#> 
#> $Neuro
#> 
#>    [1,3.29) [3.29,4.21)    [4.21,6] 
#>        1276        1303        1367 
#> 
#> $Extra
#> 
#>    [1,3) [3,3.86) [3.86,6] 
#>     1223     1312     1411 
#> 
#> $Open
#> 
#> [1.36,4.36) [4.36,4.93)    [4.93,6] 
#>        1251        1231        1464
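
As a quick check on the lift interpretation mentioned above: lift is the rule’s confidence divided by the baseline rate of its right-hand side. Using the medium-age rule and the age frequency table (all numbers taken from the output above):

# Lift = confidence / baseline proportion of the right-hand side
baseline_medium_age <- 1372 / 3946    # proportion of participants in the [20,28) age bin
0.4190065 / baseline_medium_age       # about 1.205, matching the reported lift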

We can compare these results to those obtained by dichotomizing instead of trichotomizing. Dichotomizing improves support for the regression-suggested rules, but lift is generally smaller.


example_dic <- pervasive_dic(formula = formula, data = spi_sc_age, min_support = .03)
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime      support minlen
#>           0    0.1    1 none FALSE            TRUE       5 0.0002534212      1
#>  maxlen target  ext
#>      10  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 1 
#> 
#> set item appearances ...[12 item(s)] done [0.00s].
#> set transactions ...[12 item(s), 3946 transaction(s)] done [0.00s].
#> sorting and recoding items ... [12 item(s)] done [0.00s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4 5 6 done [0.00s].
#> writing ... [486 rule(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].

example_dic
#>                                                     Description
#>                           The rule suggested by your regression
#>            The rule suggested by the low end of your regression
#>  The rule with the highest lift for high values of your outcome
#>   The rule with the highest lift for low values of your outcome
#>                                                                             LHS
#>  {dich_Agree=high,dich_Consc=high,dich_Neuro=low,dich_Extra=low,dich_Open=high}
#>   {dich_Agree=low,dich_Consc=low,dich_Neuro=high,dich_Extra=high,dich_Open=low}
#>                 {dich_Agree=high,dich_Consc=high,dich_Neuro=low,dich_Extra=low}
#>                  {dich_Consc=low,dich_Neuro=high,dich_Extra=high,dich_Open=low}
#>              RHS    Support Confidence   Coverage     Lift
#>  {dich_age=high} 0.02407501  0.7600000 0.03167765 1.459348
#>   {dich_age=low} 0.02179422  0.7478261 0.02914344 1.560509
#>  {dich_age=high} 0.04510897  0.7542373 0.05980740 1.448282
#>   {dich_age=low} 0.03826660  0.7224880 0.05296503 1.507635

example_dic$freq_tables
#> $age
#> 
#> [11,23) [23,80] 
#>    1891    2055 
#> 
#> $Agree
#> 
#> [1,4.36) [4.36,6] 
#>     1962     1984 
#> 
#> $Consc
#> 
#> [1.29,4)    [4,6] 
#>     1943     2003 
#> 
#> $Neuro
#> 
#> [1,3.79) [3.79,6] 
#>     1960     1986 
#> 
#> $Extra
#> 
#> [1,3.5) [3.5,6] 
#>    1958    1988 
#> 
#> $Open
#> 
#> [1.36,4.64)    [4.64,6] 
#>        1852        2094

If we want to examine pervasiveness with binary outcomes, we can use pervasive_tric_glm() or pervasive_dic_glm().

When variables are trichotomized, we can see that the “prototypical” woman suggested by the regression, with high agreeableness, conscientiousness, neuroticism, and extraversion and low openness, is again rather rare (14 individuals out of a sample of 3946). This does not imply that the regression model is somehow “wrong”; it simply provides context for interpreting the results.


example_dic_glm <- pervasive_dic_glm(formula = formula_glm, data = spi_sc_sex, min_support = .03)
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime      support minlen
#>           0    0.1    1 none FALSE            TRUE       5 0.0002534212      1
#>  maxlen target  ext
#>      10  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 1 
#> 
#> set item appearances ...[12 item(s)] done [0.00s].
#> set transactions ...[12 item(s), 3946 transaction(s)] done [0.00s].
#> sorting and recoding items ... [12 item(s)] done [0.00s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4 5 6 done [0.00s].
#> writing ... [486 rule(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].
example_dic_glm
#>                                                     Description
#>                           The rule suggested by your regression
#>            The rule suggested by the low end of your regression
#>  The rule with the highest lift for high values of your outcome
#>   The rule with the highest lift for low values of your outcome
#>                                                                              LHS
#>  {dich_Agree=high,dich_Consc=high,dich_Neuro=high,dich_Extra=high,dich_Open=low}
#>     {dich_Agree=low,dich_Consc=low,dich_Neuro=low,dich_Extra=low,dich_Open=high}
#>                  {dich_Consc=high,dich_Neuro=high,dich_Extra=high,dich_Open=low}
#>                   {dich_Agree=low,dich_Neuro=low,dich_Extra=high,dich_Open=high}
#>              RHS    Support Confidence   Coverage     Lift
#>  {dich_sex=high} 0.02458186  0.8660714 0.02838317 1.455502
#>   {dich_sex=low} 0.02255449  0.6691729 0.03370502 1.652413
#>  {dich_sex=high} 0.03649265  0.8421053 0.04333502 1.415225
#>   {dich_sex=low} 0.04181450  0.6932773 0.06031424 1.711935
 
example_tric_glm <- pervasive_tric_glm(formula = formula_glm, data = spi_sc_sex, min_support = .03)
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime      support minlen
#>           0    0.1    1 none FALSE            TRUE       5 0.0002534212      1
#>  maxlen target  ext
#>      10  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 1 
#> 
#> set item appearances ...[17 item(s)] done [0.00s].
#> set transactions ...[17 item(s), 3946 transaction(s)] done [0.00s].
#> sorting and recoding items ... [17 item(s)] done [0.00s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4 5 6 done [0.00s].
#> writing ... [2039 rule(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].
example_tric_glm
#>                                                     Description
#>                           The rule suggested by your regression
#>            The rule suggested by the low end of your regression
#>  The rule with the highest lift for high values of your outcome
#>   The rule with the highest lift for low values of your outcome
#>                                                                         LHS
#>  {tri_Agree=high,tri_Consc=high,tri_Neuro=high,tri_Extra=high,tri_Open=low}
#>     {tri_Agree=low,tri_Consc=low,tri_Neuro=low,tri_Extra=low,tri_Open=high}
#>                              {tri_Agree=high,tri_Neuro=high,tri_Extra=high}
#>                                 {tri_Agree=low,tri_Neuro=low,tri_Open=high}
#>             RHS     Support Confidence    Coverage     Lift
#>  {tri_sex=high} 0.003547897  0.9333333 0.003801318 1.568541
#>   {tri_sex=low} 0.005575266  0.7857143 0.007095793 1.940193
#>  {tri_sex=high} 0.030157121  0.8814815 0.034211860 1.481399
#>   {tri_sex=low} 0.032944754  0.7303371 0.045108971 1.803448
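
Finally, support values can be converted back into absolute counts, which is where the figure of 14 individuals mentioned above comes from:

# Support * number of rows = number of people matching both sides of a rule.
# For the rule suggested by the regression in the trichotomized case:
0.003547897 * 3946   # about 14 individuals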