Clusters wide-format sequences using pairwise string dissimilarity and either PAM (Partitioning Around Medoids) or hierarchical clustering. Supports 9 distance metrics including temporal weighting for Hamming distance. When the stringdist package is available, uses C-level distance computation for 100-1000x speedup on edit distances.
Usage
cluster_data(
data,
k,
dissimilarity = "hamming",
method = "pam",
na_syms = c("*", "%"),
weighted = FALSE,
lambda = 1,
seed = NULL,
q = 2L,
p = 0.1,
covariates = NULL,
...
)
cluster_sequences(
data,
k,
dissimilarity = "hamming",
method = "pam",
na_syms = c("*", "%"),
weighted = FALSE,
lambda = 1,
seed = NULL,
q = 2L,
p = 0.1,
covariates = NULL,
...
)Arguments
- data
Input data. Accepts multiple formats:
- data.frame / matrix
Wide-format sequences (rows = sequences, columns = time points, values = state names).
- netobject
A network object from
build_network. Extracts the stored sequence data. Only valid for sequence-based methods (relative, frequency, co_occurrence, attention).- tna
A tna model from the tna package. Decodes the integer-encoded sequence data using stored labels.
- cograph_network
A cograph network object. Extracts the stored sequence data.
- k
Integer. Number of clusters (must be between 2 and
nrow(data) - 1).- dissimilarity
Character. Distance metric. One of
"hamming","osa"(optimal string alignment),"lv"(Levenshtein),"dl"(Damerau-Levenshtein),"lcs"(longest common subsequence),"qgram","cosine","jaccard","jw"(Jaro-Winkler). Default:"hamming".- method
Character. Clustering method.
"pam"for Partitioning Around Medoids, or a hierarchical method:"ward.D2","ward.D","complete","average","single","mcquitty","median","centroid". Default:"pam".- na_syms
Character vector. Symbols treated as missing values. Default:
c("*", "%").- weighted
Logical. Apply exponential decay weighting to Hamming distance positions? Only valid when
dissimilarity = "hamming". Default:FALSE.- lambda
Numeric. Decay rate for weighted Hamming. Higher values weight earlier positions more strongly. Default: 1.
- seed
Integer or NULL. Random seed for reproducibility. Default:
NULL.- q
Integer. Size of q-grams for
"qgram","cosine", and"jaccard"distances. Default:2L.- p
Numeric. Winkler prefix penalty for Jaro-Winkler distance (clamped to 0–0.25). Default:
0.1.- covariates
Optional. Post-hoc covariate analysis of cluster membership via multinomial logistic regression. Accepts:
- formula
~ Age + Gender- character vector
c("Age", "Gender")- string
"Age + Gender"- data.frame
All columns used as covariates
- NULL
No covariate analysis (default)
Covariates are looked up in
netobject$metadataor non-sequence columns of the input data. Fortnaandcograph_networkinputs, pass covariates as a data.frame. Results stored in$covariates. Requires the nnet package.- ...
Additional arguments (currently unused).
Value
An object of class "net_clustering" containing:
- data
The original input data.
- k
Number of clusters.
- assignments
Named integer vector of cluster assignments.
- silhouette
Overall average silhouette width.
- sizes
Named integer vector of cluster sizes.
- method
Clustering method used.
- dissimilarity
Distance metric used.
- distance
The computed dissimilarity matrix (
distobject).- medoids
Integer vector of medoid row indices (PAM only; NULL for hierarchical methods).
- seed
Seed used (or NULL).
- weighted
Logical, whether weighted Hamming was used.
- lambda
Lambda value used (0 if not weighted).
Examples
# \donttest{
seqs <- data.frame(
V1 = sample(LETTERS[1:3], 20, TRUE), V2 = sample(LETTERS[1:3], 20, TRUE),
V3 = sample(LETTERS[1:3], 20, TRUE), V4 = sample(LETTERS[1:3], 20, TRUE)
)
cl <- cluster_data(seqs, k = 2)
print(cl)
#> Sequence Clustering
#> Method: pam
#> Dissimilarity: hamming
#> Clusters: 2
#> Silhouette: 0.2704
#> Cluster sizes: 8, 12
#> Medoids: 18, 16
summary(cl)
#> Sequence Clustering Summary
#> Method: pam
#> Dissimilarity: hamming
#> Silhouette: 0.2704
#>
#> Per-cluster statistics:
#> cluster size mean_within_dist
#> 1 8 2.428571
#> 2 12 2.136364
# }