In a standard sequence analysis, similar trajectories are clustered together to create a typology of trajectories, which is then often used to evaluate the association between sequence patterns and covariates inside regression models. The sampling uncertainty, which affects both the derivation of the typology and associated regressions, is typically ignored in this analysis, an oversight that may lead to wrong statistical conclusions.
RARCAT aims to overcome this limitation by assessing the robustness of regression results obtained from this standard analysis. It works as follows. Bootstrap samples are drawn from the data, and for each bootstrap, a new typology replicating the original one is constructed, followed by the estimation of the corresponding regression models. The bootstrap estimates are then combined using a multilevel modelling framework. The method and the exact interpretation of the results are fully discussed in the following reference:
You are kindly asked to cite the above reference if you use the methods presented in this document.
This document focuses on the R code needed to use
RARCAT. It is structured in two parts. First, a short tutorial with a
streamlined standard analysis of sequence data and a robustness
assessment made with the rarcat function. Then, an extended
tutorial with the same data illustration and a detailed explanation of
the functions, their arguments and their outputs.
Let us start by setting the seed for reproducible results.
Time for this code chunk to run: 0.01 seconds
For this example, we use the openly available mvad
dataset on transitions from school to work. First, we create the state
sequence object.
## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)
## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")
## Creating the state sequence object
mvad.seq <- seqdef(mvad[, 17:86], alphabet = mvad.alphabet, states = mvad.shortlab,
labels = mvad.lab, xtstep = 6)Time for this code chunk to run: 0.21 seconds
We will now construct a typology using cluster analysis. For readers
seeking more details, the WeightedCluster library manual
provides an in-depth explanation of the process and the computation of
cluster quality measures (Studer, 2013).
We start by computing dissimilarities with the seqdist
function using the longest common subsequence distance. We then use Ward
clustering to create a typology of the trajectories.
## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")Time for this code chunk to run: 3.84 seconds
We can now compute several cluster quality indices using
as.clustrange function with two to ten groups.
# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.59 0.75 0.74 0.43 0.43 216.72 0.23 431.41 0.38 0.13
## cluster3 0.43 0.51 0.50 0.25 0.25 175.61 0.33 335.58 0.49 0.25
## cluster4 0.52 0.67 0.67 0.29 0.30 164.63 0.41 352.31 0.60 0.17
## cluster5 0.52 0.69 0.68 0.31 0.31 153.46 0.46 326.41 0.65 0.16
## cluster6 0.57 0.79 0.78 0.34 0.35 151.17 0.52 372.95 0.73 0.11
## cluster7 0.58 0.83 0.83 0.37 0.37 144.38 0.55 383.10 0.77 0.09
## cluster8 0.51 0.78 0.78 0.32 0.33 140.00 0.58 358.85 0.78 0.12
## cluster9 0.52 0.85 0.85 0.34 0.35 137.85 0.61 386.02 0.81 0.09
## cluster10 0.51 0.87 0.87 0.35 0.36 131.67 0.63 379.81 0.83 0.08
Time for this code chunk to run: 0.27 seconds
Different clustering solutions may be argued based on the information
above. We are interested in a relatively detailed partition of the data,
so focus hereafter on a six-cluster solution. The
clustassoc function supports considering at least 6 groups
to sufficiently account for this association (see
WeightedCluster vignette Validating Sequence Analysis
Typologies to be used in Subsequent Regression). Therefore, we
consider a six-cluster solution for further illustration. The
corresponding typology is shown next:
Time
for this code chunk to run: 0.49 seconds
We are now interested in the relationship between this typology and
the funemp and gcse5eq covariates, which
represent the father’s unemployment status and the qualifications gained
by the end of compulsory education, respectively (both are binary
variables). Such associations can be studied with different approaches.
Here, we focus on implementing separate logistic regressions for each
cluster and estimating average marginal effects (AMEs) with these
models. The readers are referred to the article by Roth et al. (2024)
for theoretical and practical explanations for these methodological
choices.
Multiple commands would normally be required to explore all possible
combinations between the clusters and their related covariates. This can
also be done with the function rarcat below.
# Add the clustering solution to the dataset
mvad$clustering <- clustqual$clustering$cluster6
# The first argument is a formula for the association between clustering and covariates of interest
# The function estimates separate logistic regressions for each cluster and compute the corresponding AMEs
# with their confidence interval
rarcatout <- rarcat(clustering ~ funemp + gcse5eq, data=mvad, diss=diss, robust=FALSE)
rarcatout##
## NAIVE ESTIMATES
## Average Marginal Effect (AME) to be in a given cluster instead of any others
## Cluster 1 Cluster 2 Cluster 3
## funempyes -0.0701 -0.0192 0.0111
## (-0.150; 0.010) (-0.0749; 0.0364) (-0.0722; 0.0944)
## gcse5eqyes -0.1811 0.1592 0.0140
## (-0.244; -0.118) ( 0.1093; 0.2092) (-0.0496; 0.0776)
## Cluster 4 Cluster 5 Cluster 6
## funempyes 0.0563 -0.0489 0.0512
## (-0.0274; 0.140) (-0.110; 0.0124) (-0.0030; 0.1055)
## gcse5eqyes -0.2020 0.2543 -0.0506
## (-0.2597; -0.144) ( 0.197; 0.3116) (-0.0815; -0.0196)
##
## Robust estimates not estimated
Time for this code chunk to run: 1.25 seconds
The results in the table above are AMEs, which measure the expected change in the probability of belonging to a given cluster if the covariate takes a given value, together with their 95% confidence intervals. Thus, after adjustment for the GCSEs, father unemployment status is only marginally associated with membership to cluster 6, which contains the children’s unemployment trajectories (the p-value is actually 0.06).
On the other hand, there are strong associations apparent between the GCSEs obtained and the trajectory groups, with a distinctively higher probability of entering higher education (clusters 2 and 5) compared to being employed, in training or, to a lesser degree, unemployed (clusters 1, 4 and 6) if five or more GCSEs were gained. However, the standard analysis presented to this point does not properly account for the sampling uncertainty, such that the findings’ reliability remains in the balance.
The Robustness Assessment of Regressions using Cluster Analysis
Typologies (RARCAT) procedure allows for evaluating the reproducibility
of the analysis on resamples from the data. We focus here on a quick
implementation of the procedure and refer to the extended tutorial below
or the article by Roth et al. (2024) for further details. The function
rarcat is run again with the following further
arguments:
formula: A formula object with the clustering solution
on the left side and the covariates of interest.data: The dataset (data frame) with column names
corresponding to the information in formula. The number of individuals
(row number) should match the dimension of diss. The
clustering solution should be available in the data.diss: The numerical dissimilarity matrix used for
clustering. Only a precomputed matrix (i.e. where pairwise
dissimilarities do not depend on the resample) is currently
supported.R: The integer number of bootstraps. Set to 500 by
default to attain a satisfactory precision around the estimates as the
procedure involves multiple steps.TRUE, “pam” (calling the function
wcKMedRange) is used, and “hierarchical” is used
otherwise.hclust.method: A character string with the method
argument of hclust, “ward.D” by default.fixed: Logical. TRUE implies that the number of
clusters is the same in every bootstrap. FALSE (default) implies that an
optimal number of clusters is evaluated each time.ncluster: Integer. Either the number of clusters in
every bootstrap if fixed is TRUE or the maximum number of
clusters (starting from 2) to be evaluated in each bootstrap if
fixed is FALSE.cqi: A character string with the cluster quality index
to be evaluated for each new partition. Any column of
as.clustrange is supported, “HC” (the Hubert’s C index) by
default.parallel: Logical. Whether to initialize the parallel
processing of the future package using the default
multisession strategy. If FALSE (default),
then the current plan is used. If TRUE,
multisession plan is initialized using default values.progressbar: Logical. Whether to initialize a
progressbar using the future package. If FALSE (default),
then the current progress bar handlers is used . If TRUE, a
new global progress bar handlers is initialized..Parallel processing can be set up by using the
parallel=TRUE argument. However, it is generally
recommended to set up the parallel environment by yourselves. This
ensures that the parallel environment is initialized only once and that
it matches your environment. This is achieved by using the
plan() function of the future framework (see the
documentation for more detail). If you do so, rarcat
automatically uses parallel computations. You can do it by running the
following code before rarcat(). If you use
plan(), you shouldn’t use the parallel=TRUE
argument.
Time for this code chunk to run: 0.71 seconds
Now we can run it using parallel processing.
# Evaluate the validity of the original analysis and the reliability of its
# findings by applying RARCAT with 50 bootstrap replications.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# The number of clusters is fixed to 6 here.
rarcatout <- rarcat(clustering ~ funemp + gcse5eq, data = mvad, diss = diss, robust = TRUE, R = 100,
kmedoid = FALSE, hclust.method = "ward.D",
fixed = TRUE, ncluster = 6)
rarcatout##
## NAIVE ESTIMATES
## Average Marginal Effect (AME) to be in a given cluster instead of any others
## Cluster 1 Cluster 2 Cluster 3
## funempyes -0.0701 -0.0192 0.0111
## (-0.150; 0.010) (-0.0749; 0.0364) (-0.0722; 0.0944)
## gcse5eqyes -0.1811 0.1592 0.0140
## (-0.244; -0.118) ( 0.1093; 0.2092) (-0.0496; 0.0776)
## Cluster 4 Cluster 5 Cluster 6
## funempyes 0.0563 -0.0489 0.0512
## (-0.0274; 0.140) (-0.110; 0.0124) (-0.0030; 0.1055)
## gcse5eqyes -0.2020 0.2543 -0.0506
## (-0.2597; -0.144) ( 0.197; 0.3116) (-0.0815; -0.0196)
##
## ROBUST ESTIMATES (RARCAT)
## Average Marginal Effect (AME) to be in a given cluster instead of any others
##
## Number of bootstraps: 100 with fixed number of clusters
## Cluster 1 Cluster 2 Cluster 3
## funempyes -0.0442 -0.0162 -0.00131
## (-0.107; 0.0189) (-0.0755; 0.0432) (-0.0721; 0.0695)
## gcse5eqyes -0.1813 0.1721 0.02948
## (-0.274; -0.0888) ( 0.0963; 0.2478) (-0.0482; 0.1071)
## Cluster 4 Cluster 5 Cluster 6
## funempyes 0.016 -0.0486 0.0838
## (-0.0544; 0.0864) (-0.110; 0.0126) ( 0.00556; 0.16210)
## gcse5eqyes -0.159 0.2509 -0.0894
## (-0.2331; -0.0858) ( 0.172; 0.3297) (-0.17013; -0.00873)
Time for this code chunk to run: 44.52 seconds
Now, a second table presents the estimates of the RARCAT analysis. The results stay fairly close to the ones from the original analysis (first table), in particular regarding the “significant” associations. The relationship between father unemployment status and unemployment trajectories (cluster 6) is somewhat stronger than it did in the original sample, and the relationship between GCSEs and long training spells (cluster 4) is a bit weaker. The other relevant associations are virtually unchanged. Thus, we conclude that the original analysis appears to be valid and robust to sampling variation. However, fixing the number of clusters (six here) throughout the bootstrap procedure may not always be warranted.
In the following subsections, we present several extensions and complementary analysis that can be really helpful to make the most of the RARCAT results and better interpret the outcome. Generally speaking, the Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure works as follows:
In this section we further detail some of these steps. First, we
consider the estimation of the number of clusters (instead of using a
fixed number). Second, we analyse the AME estimated in each
bootstrap and their distribution by cluster. This provides relevant
information on which clusters lead to stable or unstable results. We
then discuss several diagnostics and check to ensure that a sufficient
number of bootstraps have been used. We finally discuss how to
investigate the stability of the clustering itself and its potential
impact on the regression results.
Usually, the number of groups of a typology is not known in advance,
but rather estimated from the data, for instance by maximizing (or
minimizing) a given cluster quality index. The RARCAT procedure can
account for the estimation of the number of groups chosen through CQI
maximization (or minimization). This is achieved by specifying
fixed=FALSE, the maximum number of groups to consider (here
ncluster=10) as well as the target CQI (here the Hubert’s C
index). To limit computation time, we only used R=50 here,
but this should be increased in any application.
# Bootstrap replicates of the typology and its association with the variable funemp.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# Also, an optimal clustering solution with n between 2 and 10 is evaluated each time by
# maximizing the HC index.
# As we previously initalized parallel computing, it is used in these computations
vary.rarcat <- rarcat(clustering ~ funemp, data = mvad, diss = diss, R = 50,
kmedoid=FALSE, hclust.method = "ward.D",
fixed = FALSE, ncluster = 10, cqi = "HC")
vary.rarcat##
## NAIVE ESTIMATES
## Average Marginal Effect (AME) to be in a given cluster instead of any others
## Cluster 1 Cluster 2 Cluster 3 Cluster 4
## funempyes -0.0447 -0.0412 0.00878 0.0927
## (-0.13; 0.0407) (-0.0874; 0.00509) (-0.0734; 0.091) (0.00205; 0.183)
## Cluster 5 Cluster 6
## funempyes -0.0797 0.0641
## (-0.131; -0.0285) (0.00462; 0.123)
##
## ROBUST ESTIMATES (RARCAT)
## Average Marginal Effect (AME) to be in a given cluster instead of any others
##
## Number of bootstraps: 50 with varying number of clusters
## Distribution of the number of clusters across bootstraps
##
## 4 5 6 7 8 9 10
## 3 2 1 3 2 15 24
##
## Cluster 1 Cluster 2 Cluster 3
## funempyes -0.0229 -0.033 -0.0113
## (-0.0862; 0.0403) (-0.0943; 0.0283) (-0.0413; 0.0186)
## Cluster 4 Cluster 5 Cluster 6
## funempyes 0.0205 -0.08 0.0871
## (-0.017; 0.0581) (-0.135; -0.025) (-0.0086; 0.183)
Time for this code chunk to run: 10.91 seconds
The output now print information about the distributions of the optimal number of groups per bootstrap, i.e. the one identified by maximizing the HC index. Here, most bootstrap partitions have nine or ten clusters. The RARCAT procedure can take into account the different number of groups across the bootstrap in its computation of the robust AME. Logically, the “naive” approach provides the same results, but the robust ones changes as we modified the bootstrap procedure.
RARCAT works by estimating for each observation the AME of its
cluster in each bootstrap. We can plot these AME (across individuals and
bootstraps) using the plot() function and specifying the
covar (covariable) of interest.
# Histogram of the estimated AMEs for all individuals and all bootstraps
plot(rarcatout, covar="funempyes")Time
for this code chunk to run: 0.21 seconds
The plot also shows the estimated overall AME using the naive (in blue) or RARCAT approach (in red). Corresponding confidence intervals are also represented using colored shaded area.
For some clusters, we identify an unimodal distribution of the AME, usually with a low dispersion. This denotes that the results for this specific cluster are stable. The observations initially regrouped in this cluster were clustered in clusters with similar AME values across the bootstraps. We can therefore be confident about our interpretation (whether a positive, negative, neutral association).
On the contrary, if there’s a large dispersion of the AME, the bootstraps lead to the estimation of different values, showing less stable results. We can therefore be less confident and the RARCAT confidence interval will generally be larger.
Multimodal distributions of the AME generally reveal an interesting situation. It often means that a subgroup of observations switch to a different cluster, with a different AME, in several of the bootstraps. This can happens for instance in presence of observation lying in-between types. We can identify those observations by looking at the outliers (see below).
The final RARCAT estimates are computed or pooled using a multilevel models. This model allows identifying the pooled estimate and its the expected variation across bootstraps. It also allows us to identify outliers.
We rely on a cross-classified multilevel model, using bootstrap and
indiviudal as upper level, to pool the results. The residual standard
errors provides us information on the remaining estimation related to
the bootstrapping procedure. It should therefore be low if we have a
sufficiently large number of iteration. The summary()
function provide several informations.
##
## RARCAT Diagnostics
## Distribution of standardized observation random intercept
##
## funempyes
##
## < -2 -2....-1 -1...1 1...2 >2
## 2 54 567 69 20
##
##
## gcse5eqyes
##
## < -2 -2....-1 -1...1 1...2 >2
## 12 53 583 17 47
##
##
## Standard errors
## 1 2 3 4 5
## funempyes 0.003576662 0.002978417 0.003737584 0.003983932 0.003075888
## gcse5eqyes 0.008404634 0.003879203 0.005412312 0.004271173 0.004193584
## 6
## funempyes 0.004280292
## gcse5eqyes 0.004535972
Time for this code chunk to run: 0.02 seconds
In our example, the standard errors are low. The number of bootstrap was therefore sufficient. If those numbers are too high, increasing the number of bootstrap should fix it.
The previous output also shows the number of outliers for each covariate. These observations lies far from the average estimate in at least some of the bootstrap. Interestingly, outliers are generally regrouped in some of the clusters only. We can look at the distribution of standardized observation-level random effects per clusters. These random effects can be interpreted as estimated deviation from the pooled estimates. Therefore, we can use them to identify observations (i.e. sequences) that are not well represented by their pooled estimates.
Time
for this code chunk to run: 0.12 seconds
We can also identify outliers by looking at the observation level
standardized random effect. Those are available in the
observation.stdranef component of the rarcat
object. Absolute values above 2 or even 3 are generally used to identify
outliers. To get an idea of deviating observation by cluster, we use an
absolute value of 1. We then plot identified sequences in each of the
clusters.
outliers <- abs(rarcatout$observation.stdranef[, "funempyes"])>2
seqIplot(mvad.seq[outliers, ], group=mvad$clustering[outliers])Time
for this code chunk to run: 0.28 seconds
Interestingly, there are no deviating sequences in some of the clusters.
Lastly, the individual-specific random effects in the output of
unirarcat inform on the within-cluster homogeneity. Indeed,
individuals with fitted values that diverge from the other individuals
in the same cluster were often assigned to different clusters in the
bootstrap, thus impacting their estimated AMEs. We can see this by
looking at the distribution of these random effects.
There are several reasons that might lead to non-robust results. Being able to distinguish them allows for a better understanding of RARCAT results.
One of them relies on the stability of the clustering results. The
clustering (i.e. the estimation of the typology) might be sample
dependent. As a result, inference and interpretation should be made with
extreme caution. Hennig (2007) proposed to look at the cluster stability
across difference bootstraps to estimate this sampling uncertainty. The
analysis allows to identify specific clusters that stable or not across
bootstraps. The method is available in the fpcpackage and
can be run as follows for hierarchical clustering with
B=500 the number of bootstraps.
# Loading the fpc library
library(fpc)
# Cluster-wise stability assessment by bootstrap
stab <- clusterboot(diss, B = 500, distances = TRUE, clustermethod = disthclustCBI,
method = "ward.D", k = 6, count = FALSE)
stab## * Cluster stability assessment *
## Cluster method: hclust
## Full clustering results are given as parameter result
## of the clusterboot object, which also provides further statistics
## of the resampling results.
## Number of resampling runs: 500
##
## Number of clusters found in data: 6
##
## Clusterwise Jaccard bootstrap (omitting multiple points) mean:
## [1] 0.5946359 0.8804019 0.7663908 0.3992446 0.9393705 0.6298396
## dissolved:
## [1] 70 31 37 449 0 190
## recovered:
## [1] 32 438 318 28 449 187
Time for this code chunk to run: 12.74 seconds
The results shows the cluster the “Clusterwise Jaccard bootstrap mean”. Values above 0.85 indicate high cluster-wise stability, meaning that most of the individuals belonging to any of the two clusters in the original partition tend to be clustered again in the bootstrapped partitions. Values below 0.60 show unstable clusters, which are likely not recovered in other subsamples of the data. They should, therefore, be interpreted with caution.
[1] In fact, the estimated Jaccard coefficient for this cluster is only 0.4, compared to 0.94 for the fifth cluster and 0.63 for the sixth.
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Roth, L., Studer, M., Zuercher, E., & Peytremann-Bridevaux, I. (2024). Robustness assessment of regressions using cluster analysis typologies: a bootstrap procedure with application in state sequence analysis. BMC medical research methodology, 24(1), 303. https://doi.org/10.1186/s12874-024-02435-8.
Studer M. WeightedCluster Library Manual A practical guide to creating typologies of trajectories in the social sciences with R. 2013.