Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia’s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster built in. For example, Alteryx has K-Centroids Analysis. R, Python, SPSS, Statistica and any other proper data-sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. You can do the calculations in Excel, should you really want to (although why not cheat and use a nice addin if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better suited to clustering objects that have quantitative numeric fields rather than categorical ones. I’m not going to delve into the details of why here, but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.
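To make the contrast concrete, here’s a tiny sketch (the function and record names are just illustrative, not from any package) of the simple-matching dissimilarity that categorical methods like k-modes rely on: the proportion of fields on which two records disagree, with no notion of one mismatch being “bigger” than another.

```r
# Simple-matching dissimilarity: the share of fields on which two
# records disagree. Every mismatch counts the same - there is no
# sense in which "bird" is nearer to "fish" than to "animal".
simple_match_dist <- function(a, b) {
  mean(a != b)
}

rec1 <- c(Type = "bird", Habitat = "air",   Diet = "insects")
rec2 <- c(Type = "fish", Habitat = "water", Diet = "insects")

simple_match_dist(rec1, rec2)  # 2 of 3 fields differ, so 0.667
```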

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, and a few notes as to the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters though is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values‘ by Huang gives the gory details.

Luckily though, an R implementation is available within the klaR package. The klaR documentation is available in PDF format here and is certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

library(klaR)

setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <- kmodes(data.to.cluster[, 2:5], 3, iter.max = 10, weighted = FALSE) # don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want so you have a few variations to compare the quality or real-world utility of.
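One way to compare those variations, sketched below with made-up random binary data so it runs standalone (swap in your own data frame), is to loop over candidate cluster counts and look at the total within-cluster simple-matching distance, which kmodes returns as withindiff. Lower totals mean tighter clusters, though bear in mind that more clusters will almost always score lower.

```r
library(klaR)

set.seed(42)
fake.data <- matrix(rbinom(200, 1, 0.5), ncol = 4)  # 50 records, 4 binary fields

# fit k-modes for several values of k and compare cluster tightness
for (k in 2:5) {
  fit <- kmodes(fake.data, k, iter.max = 10, weighted = FALSE)
  cat("k =", k, "total withindiff =", sum(fit$withindiff), "\n")
}
```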

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.
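Note from that list that modes can be a set of starting modes rather than just a count. A sketch, again with made-up binary data and illustrative variable names, that seeds the algorithm with three explicit (distinct) modes so the starting point is reproducible rather than randomly chosen:

```r
library(klaR)

set.seed(1)
fake.data <- matrix(rbinom(120, 1, 0.5), ncol = 4)  # 30 records, 4 binary fields

# three distinct starting modes, one row per desired cluster
start.modes <- rbind(c(0, 0, 0, 0),
                     c(1, 1, 1, 1),
                     c(0, 1, 0, 1))

fit <- kmodes(fake.data, start.modes, iter.max = 10, weighted = FALSE)
fit$size  # records per cluster
```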

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example of what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters onto the original dataset, you can just add the cluster back onto your original dataset as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(data.to.cluster, cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)


The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters based on how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

library(cba)

setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <- rockCluster(as.matrix(data.to.cluster[, 2:5]), 3)
cluster.output <- cbind(data.to.cluster, cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold.
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is a list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.
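As a quick sketch of working with that factor (made-up binary data and illustrative names; real data will of course behave differently), table gives you cluster sizes, and as.integer turns the assignments into a plain vector you can attach as a column:

```r
library(cba)

set.seed(7)
toy <- matrix(rbinom(100, 1, 0.7), ncol = 5)  # 20 records, 5 binary fields
rc <- rockCluster(toy, n = 2, theta = 0.2,
                  fun = "dist", funArgs = list(method = "binary"))

table(rc$cl, useNA = "ifany")  # records per cluster (NA = unassigned)
as.integer(rc$cl)              # assignments as a plain integer vector
```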

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.

7 thoughts on “Clustering categorical data with R”

  1. It is good to see a post about K-modes!
    Thanks for the kind explanation.

    I tried to do weighted K-modes, typing a command like
    kmodes(data, 3, weighted = TRUE), but I got the error message below:
    “Error in n_obj[i] <- weight[which(names == obj[different[i]])] :
    replacement has length zero”

    How can I deal with this? I still could not find a solution.


    1. Hi,

      Thank you for your comment,

      Regarding the error, it is not one I have ever had (also I’m not sure that I ever used the weighted version).

      Although it is not quite the same scenario, I saw a post on Stack Overflow
      which suggested that a “replacement has length zero” error is generated when you have missing data in your table.

      If you have any missing values (NAs) in your data, perhaps try removing them first and seeing if that helps?

      Sorry that I don’t have any direct experience, but I hope the above helps.



      1. Hey Adam!
        I actually started reading up on the zero-length replacement error and ended up finding this wonderful article and a solution for my original problem.

        Thanks buddy!



    1. For example:
      result shows the modes for each cluster
      exit adds the number of the cluster that each tuple belongs to, for all tuples
      write.csv(exit, file = “file_name.csv”, row.names = TRUE) –> writes a csv with what you want to know


  2. Hello

    Thank you for the post explaining this.

    My experience testing these 2 packages,

    ####### klaR::kmodes

    It’s a good one as an exploratory technique; although if one wanted to extend it to, let’s say, use the k-modes approach on a set of binary-encoded categorical variables and determine the clusters of a new dataset, there is no predict method to use as such. The reason I might want to do such a thing is just to experiment with creating a new feature in my predictive modeling workflow.


    library(klaR)
    library(data.table)

    x <- rbind(matrix(rbinom(250, 1, 0.25), ncol = 5),
               matrix(rbinom(250, 1, 0.75), ncol = 5))
    colnames(x) <- c("a", "b", "c", "d", "e")

    ## run algorithm on x:
    (cl <- klaR::kmodes(x, modes = 2, iter.max = 5))

    # integrating the results back
    x <- data.table(x)
    x[, cl := as.integer(cl$cluster)]

    w <- rbind(matrix(rbinom(150, 1, 0.25), ncol = 5),
               matrix(rbinom(150, 1, 0.75), ncol = 5))
    colnames(w) <- c("a", "b", "c", "d", "e")

    predict(cl, w)  # this is the call that fails:

    Error in UseMethod("predict") :
    no applicable method for 'predict' applied to an object of class "kmodes"

    I spent time looking through the klaR package documentation and the GitHub repo, but there is no mention of one whatsoever.
    If you do find it, please share.

    ####### rockCluster

    I've been struggling to use the cba package's rockCluster. Although rockCluster doesn't have the same limitation as klaR (there is a predict function one can use to apply the clustering to a new data set), I seem to be failing to get results.


    x <- rbind(matrix(rbinom(250, 1, 0.25), ncol = 5),
               matrix(rbinom(250, 1, 0.75), ncol = 5))
    colnames(x) <- c("a", "b", "c", "d", "e")

    y <- as.dummy(x)
    rc <- rockCluster(y, n=3, theta=0.73, debug=F,fun='dist',funArgs = list(method="binary"))

    rf <- fitted(rc)

    [1] "1" "3" "4" "6" "7" "9" "11" "13" "14" "15" "21" "22" "24" "26"
    [15] "27" "28" NA

    What I don't seem to understand is why it produces 28 levels when I clearly asked for 3 (with the n = 3 argument).

    I tried to replicate this with the example on the cba package documentation


    library(cba)        # Mushroom data and rockCluster
    library(caret)      # createDataPartition
    library(data.table)

    data(Mushroom)
    mush <- data.table(copy(Mushroom))
    mush[, class := as.factor(class)]

    trainIndex <- createDataPartition(mush$class, p = .8,
                                      list = FALSE,
                                      times = 1)

    mdTRAIN <- mush[trainIndex,]
    mdVAL <- mush[-trainIndex,]

    x <- as.dummy(mdTRAIN[-1])
    rc <- rockCluster(x[sample(dim(x)[1], 1000),], n = 3, theta = 0.8)

    rf <- fitted(rc)
    rp <- predict(rc, x)

    [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
    [15] "15" "16" "17" "19" "20" "21" "28" "32" "34" NA

    and again, 34 levels as opposed to the 3 I asked for.

    What am I missing?
    I'd appreciate any insight on this

