Category: Clustering

Using “awk” to Join Text Files on Windows

By Volkan TUNALI, November 4, 2010 11:34 pm

Last time I used awk to split the single cisi.all file into small files like cisi.1, cisi.2, etc. Now I need to join these small files back into a single one in a kind of XML format. I have read some awk tutorials, but I could not find a way to loop over many text files within awk itself. So, I used another solution based on Windows batch file scripting: I wrote a little awk program that formats the content of a given file and appends it to an output file, and then a batch file that loops over the files in a directory and runs the awk program for each one.

Here’s the batch file named JOIN.BAT:

del output.xml
for /r %%X in (dataset\*.*) do (awk -f join.awk %%X)

Here’s the awk file named JOIN.AWK:

BEGIN { print "<DOC>\n<BODY>" >> "output.xml" }   # opening tags before each file's content
      { print $0 >> "output.xml" }                # pass every input line through unchanged
END   { print "</BODY>\n</DOC>\n" >> "output.xml" } # closing tags after the content

As you can see in the awk program, the content of each file is appended to output.xml. On Unix-like systems, you can write a similar shell script instead of a batch file.
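For Unix-like systems, the same join can be sketched as a single shell script; the directory and file names (dataset/, output.xml) mirror the batch example, and the awk body is inlined, so treat this as a sketch rather than a drop-in replacement:

```shell
#!/bin/sh
# Sketch of a Unix equivalent of JOIN.BAT + JOIN.AWK: wrap each file
# under dataset/ in <DOC><BODY>...</BODY></DOC> tags, appending to output.xml.
rm -f output.xml
for f in dataset/*; do
    [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
    awk 'BEGIN { print "<DOC>\n<BODY>" }
               { print }
         END   { print "</BODY>\n</DOC>\n" }' "$f" >> output.xml
done
```

Note that, unlike `for /r` in the batch file, this glob does not recurse into subdirectories; `find dataset -type f` could replace it if recursion is needed.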

Fundamentals of Predictive Text Mining

By Volkan TUNALI, September 4, 2010 11:09 pm

Fundamentals of Predictive Text Mining is a new book about text mining that I found recently. It explains the essentials of text mining very well, with very good examples, so I strongly recommend it to newcomers to the field. Although the goal of the book is predictive text mining, its content is broad enough to cover topics such as text clustering, information retrieval, and information extraction.

The book also contains several case studies that work through solutions to real-life problems.

Authors: Sholom M. Weiss, Nitin Indurkhya, Tong Zhang
ISBN: 9781849962254
Publisher: Springer

Introduction to Information Retrieval

By Volkan TUNALI, September 1, 2010 11:36 pm

I will introduce a new book I find very useful: Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, from Cambridge University Press (ISBN: 0521865719).

The book provides a modern approach to information retrieval from a computer science perspective. It is based on a course the authors have been teaching in various forms at Stanford University and at the University of Stuttgart.

It contains the following chapters:

  1. Boolean retrieval
  2. The term vocabulary & postings lists
  3. Dictionaries and tolerant retrieval
  4. Index construction
  5. Index compression
  6. Scoring, term weighting & the vector space model
  7. Computing scores in a complete search system
  8. Evaluation in information retrieval
  9. Relevance feedback & query expansion
  10. XML retrieval
  11. Probabilistic information retrieval
  12. Language models for information retrieval
  13. Text classification & Naive Bayes
  14. Vector space classification
  15. Support vector machines & machine learning on documents
  16. Flat clustering
  17. Hierarchical clustering
  18. Matrix decompositions & latent semantic indexing
  19. Web search basics
  20. Web crawling and indexes
  21. Link analysis

The book is freely available for download in PDF format at http://nlp.stanford.edu/IR-book/information-retrieval-book.html (as a whole or as individual chapters).

Cluto vs. Gmeans – An Empirical Comparison

By Volkan TUNALI, July 16, 2010 5:55 pm

Last month I attended the 1st International Symposium on Computing in Science and Engineering, held by Gediz University in Kusadasi, Turkey, with a paper and a presentation.

My topic was “An empirical comparison of fast and efficient tools for mining textual data”. In this paper, we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data: Cluto and Gmeans.

The abstract of the paper is below:

In order to effectively manage and retrieve the information contained in vast amounts of text documents, powerful text mining tools and techniques are essential. In this paper, we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans. Several experiments were conducted on three benchmark datasets, and the results are analysed in terms of clustering quality, memory, and CPU time consumption. We empirically show that Gmeans offers high scalability by sacrificing clustering quality, while Cluto presents better clustering quality at the expense of memory and CPU time.

Keywords: text mining, document clustering, spherical k-means, bisecting k-means

About Cluto

Written in ANSI C by George Karypis, CLUTO (CLUstering TOolkit) is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

Cluto contains partitional, agglomerative, and graph-partitioning based clustering algorithms. Bisecting k-means is the default option from the partitional class of algorithms, which is the class considered in the paper. In addition, Cluto offers multiple distance (similarity) functions, such as cosine, Euclidean, correlation coefficient, and extended Jaccard, with cosine as the default. Cluto also allows selecting one of several clustering criterion functions from four categories: internal, external, hybrid, and graph-based.

About Gmeans

Gmeans is a C++ program for clustering, developed by Yuqiang Guan as part of his PhD thesis. The program employs four k-means-type clustering algorithms with four different distance (similarity) measures: cosine, Euclidean, diametric distance, and Kullback-Leibler divergence. Cosine is the default measure, used for spherical k-means with each document vector L2-normalized. Moreover, a local search strategy called first variation is included to overcome the local optima problem. The program generates a one-way, hard clustering of a given dataset.
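Since Gmeans’ own sources are not reproduced here, the core idea of spherical k-means can be sketched in C++ as follows. This is a minimal illustration only (seeding with the first k documents, fixed iteration count, no first variation), not Gmeans’ actual implementation:

```cpp
// Sketch of spherical k-means: L2-normalize all document vectors, assign
// each to the centroid with the highest cosine similarity (a plain dot
// product on the unit sphere), then recompute and renormalize centroids.
#include <vector>
#include <cmath>
#include <cstddef>

using Vec = std::vector<double>;

static void l2_normalize(Vec& v) {
    double n = 0.0;
    for (double x : v) n += x * x;
    n = std::sqrt(n);
    if (n > 0.0) for (double& x : v) x /= n;
}

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Returns a hard cluster assignment for each document.
std::vector<int> spherical_kmeans(std::vector<Vec> docs, int k, int iters) {
    for (auto& d : docs) l2_normalize(d);
    const std::size_t dim = docs[0].size();
    // Naive seeding: the first k (normalized) documents become centroids.
    std::vector<Vec> cent(docs.begin(), docs.begin() + k);
    std::vector<int> assign(docs.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: pick the most similar centroid for each document.
        for (std::size_t i = 0; i < docs.size(); ++i) {
            int best = 0;
            double bestSim = dot(docs[i], cent[0]);
            for (int c = 1; c < k; ++c) {
                double s = dot(docs[i], cent[c]);
                if (s > bestSim) { bestSim = s; best = c; }
            }
            assign[i] = best;
        }
        // Update step: mean of each cluster, projected back onto the sphere.
        std::vector<Vec> sum(k, Vec(dim, 0.0));
        for (std::size_t i = 0; i < docs.size(); ++i)
            for (std::size_t j = 0; j < dim; ++j)
                sum[assign[i]][j] += docs[i][j];
        for (int c = 0; c < k; ++c) {
            l2_normalize(sum[c]);
            if (dot(sum[c], sum[c]) > 0.0) cent[c] = sum[c];  // keep old centroid if cluster emptied
        }
    }
    return assign;
}
```

Because every vector lies on the unit sphere, maximizing the dot product is exactly maximizing cosine similarity, which is why normalization is done once up front.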

Download the Paper

If you are interested in the details of this comparison, such as the datasets used and the experiments performed, you can freely download the paper here (.pdf inside a .rar archive, 281 KB).


Gmeans Clustering Software – Compatible with GCC 4+

By Volkan TUNALI, June 27, 2010 9:10 pm

Gmeans is a C++ program for clustering, which essentially implements the spherical k-means clustering algorithm. The original source code of the program (released under the GNU General Public License (GPL)) is known to compile with gcc 3.0.3 on Solaris and Linux. However, recent Linux distributions ship with gcc 4 or newer, and Gmeans cannot be compiled with gcc 4 due to several changes made in gcc for compliance with the C++ language standard. So, to compile Gmeans with gcc 4+, modifications are needed in almost all source files.
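The changes involved are typical of the gcc 3-to-4 transition rather than anything specific to Gmeans. A representative (hypothetical) example is code that relied on standard headers being pulled in transitively:

```cpp
// Under gcc 3.x, calls like strcpy or atoi often compiled without their
// own includes, because other standard headers included <cstring> and
// <cstdlib> indirectly. gcc 4's cleaned-up headers removed many of these
// indirect includes, so the headers must now be stated explicitly.
#include <cstring>   // gcc 4+: explicitly needed for strcpy, strlen
#include <cstdlib>   // gcc 4+: explicitly needed for atoi
#include <string>

// Copies a C string into a local buffer and returns it as std::string.
std::string copy_name(const char* src) {
    char buf[64];
    std::strcpy(buf, src);
    return std::string(buf, std::strlen(buf));
}

// Parses an integer argument, e.g. a cluster count from the command line.
int parse_k(const char* arg) {
    return std::atoi(arg);
}
```

Without the two `#include` lines, code like this fails to compile under gcc 4+, which is the kind of error that shows up across almost every file of an older codebase.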

As part of my research, I need to use Gmeans on Linux, and I use Ubuntu 9.10, which currently ships with gcc 4.4.1. So, I have made the necessary modifications. You can download the gcc 4+ compatible source code of Gmeans here (.tar.gz file, 810 KB). I have not removed any of the author’s license statements or notes/comments.

If this revised version is helpful to you, please let me know in the comments.
