The segmentation techniques jungle


There are various tools a data scientist needs to master:

  1. Generalized linear models
  2. Hypothesis testing
  3. Principal component and factor analysis
  4. Market basket analysis
  5. Choice modelling
  6. Optimisation (finding extrema and other points of interest on curves)
  7. Time series analysis

and

SEGMENTATION

Now, when it comes to segmentation, there are more ways to build one than there are ways to combine coffee and milk. A few very popular methods are (the first two are sketched right after this list):

  1. K-means with the various distances
  2. Hierarchical clustering
  3. Bi-clustering
  4. Plain crosstabs
  5. Bayesian classifiers
  6. Two-step K-means
  7. Latent class analysis

and more…
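
Just to make a couple of these concrete, here is a minimal base-R sketch of the first two methods on toy data of my own (not part of any real segmentation):

# Toy data: two obvious groups in two dimensions
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(scale(x), centers = 2)              # 1. K-means (Euclidean distance)
hc <- hclust(dist(scale(x)), method = "ward.D2") # 2. hierarchical clustering
table(km$cluster, cutree(hc, k = 2))             # do the two methods agree?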

In particular, I believe there is one way of finding segments that is underestimated: latent class regression.

This methodology can find clusters on the basis of, for example, a customer's spend over time.

You could, in principle, find a cluster of people whose spend increases steadily over, say, a hundred weeks, another group whose spend increases very steeply, and a group that behaves the same throughout the period observed, and so on.

Below is some R code to illustrate what I am talking about:

#Playing around with flexmix
library(flexmix)
## Loading required package: lattice
#Simulating the data

interval <- 1:100 # transactions over 100 weeks
# For simplicity, every customer in a group shares the same simulated series
group_a <- rep(50 + rnorm(100, 0, 2), 100)                    # flat spend
group_b <- rep(100 + 0.05 * interval + rnorm(100, 0, 5), 100) # increasing spend
group_c <- rep(150 - 0.05 * interval + rnorm(100, 0, 5), 100) # decreasing spend
id <- rep(1:300, each = 100) # 300 customers, 100 observations each
data.df <- data.frame(date = rep(interval, 300),
                      amount_spent = c(group_a, group_b, group_c),
                      id = id)
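
Before fitting anything, a quick plot (my addition, just to eyeball the simulated data) shows the three spending trajectories:

# One customer per group is enough, since trajectories are shared within a group
plot(interval, group_a[1:100], type = "l", ylim = c(40, 170),
     xlab = "week", ylab = "amount spent")    # flat group
lines(interval, group_b[1:100], col = "red")  # increasing group
lines(interval, group_c[1:100], col = "blue") # decreasing group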

#Flexmix working its magic

model_1 <- flexmix(amount_spent ~ date | id, data = data.df, k = 3)
model_1
## Call:
## flexmix(formula = amount_spent ~ date | id, data = data.df, 
##     k = 3)
## 
## Cluster sizes:
##     1     2     3 
## 10000 10000 10000 
## 
## convergence after 4 iterations
# The algorithm correctly identified the three transactional trajectories
parameters(model_1, component = 1)
##                        Comp.1
## coef.(Intercept) 100.99432992
## coef.date          0.03633786
## sigma              4.63147132
parameters(model_1, component = 2)
##                        Comp.2
## coef.(Intercept) 149.86774002
## coef.date         -0.03905822
## sigma              4.91297606
parameters(model_1, component = 3)
##                        Comp.3
## coef.(Intercept) 49.804688685
## coef.date         0.005226355
## sigma             1.916632596
# The parameters are also estimated correctly. Note the component labels are
# arbitrary: here components 1, 2 and 3 correspond to groups b, c and a

# flexmix apparently did the job, but let's check the group compositions
# Component 1 should give us IDs 101 to 200 (group_b), etc.
data.df$cluster <- clusters(model_1)
unique(data.df$id[data.df$cluster == 1])
##   [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
##  [18] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
##  [35] 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
##  [52] 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
##  [69] 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
##  [86] 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
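
Component 1 did indeed recover IDs 101 to 200. As a fuller check (a sketch of mine, not part of the original output), we can crosstab each customer's true group against the assigned component; since labels in mixture models are arbitrary, what matters is that each true group maps onto a single component:

true_group <- rep(c("a", "b", "c"), each = 100) # ids 1-100, 101-200, 201-300
assigned <- tapply(data.df$cluster, data.df$id, unique)
table(true_group, assigned) # expect 100s forming a permutation matrix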

The example above is just an illustration, but this methodology can be used in various ways, e.g. in insurance to distinguish between claimant populations and detect fraudsters.

Like most analytics methodologies, the applications depend on the user's imagination.

I believe that with segmentation techniques there is a real danger of data scientists avoiding the variety of available methods for the sake of simplicity.

On the other hand, to really extract commercial advantage from segmentation techniques there aren't many shortcuts: hard work and creativity are the only way to gain an edge over competitors.

In short, my advice is: be adventurous, but always test how robust the segmentation is, down to the detail.
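
As one concrete (and admittedly minimal) example of such a robustness check with flexmix: refit the model for several numbers of components and several random restarts, then compare the fits by BIC. A sketch, assuming the data.df from above:

set.seed(1)
models <- stepFlexmix(amount_spent ~ date | id, data = data.df,
                      k = 1:4, nrep = 3) # several k, several EM restarts each
models                                   # log-likelihood, AIC, BIC for each k
getModel(models, which = "BIC")          # keep the model with the best BIC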
