Last month I attended the **1st** International Symposium on Computing in Science and Engineering, held by Gediz University in Kusadasi, Turkey, where I presented a paper.

My topic was “**An empirical comparison of fast and efficient tools for mining textual data**”. In this paper we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, **Cluto** and **Gmeans**.

The abstract of the paper is below:

*In order to effectively manage and retrieve the information comprised in vast amount of text documents, powerful text mining tools and techniques are essential. In this paper we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans. Several experiments were conducted on three benchmark datasets, and results are analysed in terms of clustering quality, memory and CPU time consumption. We empirically show that Gmeans offers high scalability by sacrificing clustering quality while Cluto presents better clustering quality at the expense of memory and CPU time.*

**Keywords**: text mining, document clustering, spherical k-means, bisecting k-means

### About Cluto

Written in ANSI C by George Karypis, CLUTO (CLUstering TOolkit) is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

Cluto contains partitional, agglomerative, and graph-partitioning-based clustering algorithms. Bisecting k-means, the default option among the partitional algorithms, is the one considered in the paper. In addition, Cluto offers multiple distance (similarity) functions, including cosine, Euclidean, correlation coefficient, and extended Jaccard, with cosine as the default. Cluto also lets you select one of several clustering criterion functions from four categories: internal, external, hybrid, and graph-based.
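To make the similarity functions listed above more concrete, here is a minimal sketch of their textbook definitions in plain Python. This is not Cluto's code; the function names and the toy document vectors are my own assumptions for illustration, and the functions assume non-zero (and, for correlation, non-constant) vectors.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y (assumes non-zero vectors)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclidean(x, y):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation: cosine of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient for real-valued vectors."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Hypothetical term-frequency vectors for three tiny documents.
doc_a = [2.0, 1.0, 0.0, 0.0]
doc_b = [4.0, 2.0, 0.0, 0.0]   # same direction as doc_a, twice as long
doc_c = [0.0, 0.0, 3.0, 1.0]   # shares no terms with doc_a

for name, other in (("doc_b", doc_b), ("doc_c", doc_c)):
    print(name,
          round(cosine(doc_a, other), 3),
          round(euclidean(doc_a, other), 3),
          round(correlation(doc_a, other), 3),
          round(extended_jaccard(doc_a, other), 3))
```

Note that cosine ignores document length (`doc_a` and `doc_b` count as identical), while Euclidean distance and extended Jaccard do not, which is one reason cosine is a common default for text.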

### About Gmeans

Gmeans is a C++ program for clustering, developed by Yuqiang Guan as part of his PhD thesis. The program implements four different k-means-type clustering algorithms with four different distance (similarity) measures: cosine, Euclidean, diametric distance, and Kullback-Leibler divergence. Cosine is the default similarity measure, used for spherical k-means, with each document vector (L2) normalised. Moreover, a local search strategy called first variation is included to overcome the local optima problem. The program generates a one-way, hard clustering of a given dataset.
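For readers unfamiliar with spherical k-means, the idea can be sketched in a few lines of plain Python. This is a generic illustration, not Gmeans' implementation: it omits the first variation local search, and the function names, the explicit `initial_centroids` argument, and the toy data are assumptions made here to keep the example small and deterministic.

```python
import math

def normalise(v):
    """Scale v to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def spherical_kmeans(docs, initial_centroids, max_iter=100):
    """Hard-cluster docs by cosine similarity; returns (labels, centroids)."""
    docs = [normalise(d) for d in docs]
    centroids = [normalise(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: on unit vectors the dot product is the cosine,
        # so each document goes to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda j: dot(d, centroids[j]))
            for d in docs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: sum the members of each cluster and re-normalise.
        # An empty cluster keeps its previous centroid.
        for j in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centroids[j] = normalise([sum(col) for col in zip(*members)])
    return labels, centroids

# Hypothetical term-frequency vectors: two documents per topic.
docs = [[3, 1, 0], [2, 1, 0], [0, 1, 4], [0, 2, 5]]
labels, centroids = spherical_kmeans(docs, [docs[0], docs[2]])
print(labels)
```

Each document ends up in exactly one cluster, which is what "hard clustering" means here. Like any k-means variant, the result depends on the initial centroids and can get stuck in a local optimum, which is the problem the first variation strategy is meant to address.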

### Download the Paper

If you are interested in the details of this comparison, such as the datasets used and the experiments performed, you can freely download the paper here (.pdf inside a .rar archive, 281K).

### View the Presentation