Gmeans Clustering Software – Compatible with GCC 4+

By Volkan TUNALI, June 27, 2010 9:10 pm

Gmeans is a C++ program for clustering that essentially implements the spherical k-means clustering algorithm. The original source code, released under the GNU General Public License (GPL), is known to compile with gcc 3.0.3 on Solaris and Linux. However, recent Linux distributions come with gcc 4 or newer, and Gmeans cannot be compiled with gcc 4 because of several changes made in gcc to comply with the C++ language standard. So, if we want to compile Gmeans with gcc 4+, we need to modify almost all of the code files.
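For readers who have not met spherical k-means before, here is a minimal sketch of the general idea: every vector is scaled to unit length, each vector is assigned to the centroid with the highest cosine similarity, and each centroid is recomputed as the normalized sum of its members. This is not code from Gmeans. The names `Vec`, `normalize`, `dot`, and `sphericalKMeans` are hypothetical, and using the first k vectors as initial centroids is an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;

// Scale v to unit Euclidean length (a zero vector is left unchanged).
void normalize(Vec& v) {
    double norm = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) norm += v[i] * v[i];
    norm = std::sqrt(norm);
    if (norm > 0.0)
        for (std::size_t i = 0; i < v.size(); ++i) v[i] /= norm;
}

// Dot product; for unit-length vectors this equals the cosine similarity.
double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Returns one cluster index per input vector.
// Assumption: the first k vectors serve as the initial centroids.
std::vector<int> sphericalKMeans(std::vector<Vec> data, std::size_t k, int maxIter) {
    assert(k > 0 && data.size() >= k);
    for (std::size_t i = 0; i < data.size(); ++i) normalize(data[i]);

    std::vector<Vec> centroids(data.begin(), data.begin() + k);
    std::vector<int> label(data.size(), -1);

    for (int iter = 0; iter < maxIter; ++iter) {
        // Assignment step: pick the centroid with the highest cosine similarity.
        bool changed = false;
        for (std::size_t i = 0; i < data.size(); ++i) {
            int best = 0;
            double bestSim = dot(data[i], centroids[0]);
            for (std::size_t c = 1; c < k; ++c) {
                double sim = dot(data[i], centroids[c]);
                if (sim > bestSim) { bestSim = sim; best = static_cast<int>(c); }
            }
            if (label[i] != best) { label[i] = best; changed = true; }
        }
        if (!changed) break;  // converged: no vector switched cluster

        // Update step: sum the members of each cluster, then renormalize.
        std::vector<Vec> sums(k, Vec(data[0].size(), 0.0));
        for (std::size_t i = 0; i < data.size(); ++i)
            for (std::size_t d = 0; d < data[i].size(); ++d)
                sums[label[i]][d] += data[i][d];
        for (std::size_t c = 0; c < k; ++c) {
            normalize(sums[c]);
            // Keep the old centroid if the cluster is empty (sum is zero).
            if (dot(sums[c], sums[c]) > 0.0) centroids[c] = sums[c];
        }
    }
    return label;
}
```

Calling `sphericalKMeans(data, 2, 10)` on a handful of 2-D vectors returns a label (0 or 1) for each of them. Because only the direction of a vector matters, this style of clustering is commonly applied to document vectors.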

As part of my research, I need to use Gmeans on Linux. I use Ubuntu 9.10, which currently ships with gcc 4.4.1, so I have made the necessary modifications. You can download the gcc 4+ compatible source code of Gmeans here (.tar.gz file, 810 KB). I have not removed any of the author's license statements, notes, or comments.
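The individual changes are not listed in this post. As a generic illustration only (it is not taken from the Gmeans sources), the sketch below shows three kinds of edits that code written for gcc 3.0-era compilers often needs before gcc 4 accepts it: standard header names, explicit includes for C library functions, and `this->` for members inherited from a template base class. The names `Base`, `Derived`, and `wordLength` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>   // std::strlen; newer gcc releases no longer pull this in indirectly
#include <iostream>  // standard header; pre-standard code used <iostream.h>

template <typename T>
class Base {
protected:
    T value;
public:
    explicit Base(T v) : value(v) {}
};

template <typename T>
class Derived : public Base<T> {
public:
    explicit Derived(T v) : Base<T>(v) {}
    // A member of a dependent base class must be reached through this->
    // (or Base<T>::). Older gcc releases accepted plain `value` here.
    T get() const { return this->value; }
};

// Standard library names need the std:: prefix (or a using declaration).
std::size_t wordLength(const char* word) {
    assert(word != 0);
    return std::strlen(word);
}

// Prints a word and its length using the standard stream classes.
void printLength(const char* word) {
    std::cout << word << ": " << wordLength(word) << std::endl;
}
```

Whether Gmeans needed exactly these edits is an assumption; the compiler's error messages are the reliable guide to what a given file requires.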

If this revised version is helpful to you, please let me know in the comments.

Data Mining Books

By Volkan TUNALI, June 23, 2010 8:25 pm

The following are the books I find very useful for beginners as well as advanced researchers in the data mining field.

Data Mining: Concepts and Techniques, Second Edition


Authors: Jiawei Han, Micheline Kamber
Publisher: Morgan Kaufmann Publishers

This may be the best-known data mining book. I recommend it especially to beginners.

Introduction to Data Mining


Authors: Pang-Ning Tan, Michael Steinbach, Vipin Kumar
Publisher: Addison-Wesley

Grouping Multidimensional Data: Recent Advances in Clustering


Authors: Jacob Kogan, Charles Nicholas, Marc Teboulle (Editors)
Publisher: Springer

There are other books I really like. I’ll introduce them later.

Twitter Datasets

By Volkan TUNALI, June 22, 2010 11:14 pm

Twitter data gathered through Twitter's streaming API was published at (about 5.5 GB) in February and April. The second release was a cleaned-up version of the first.

On June 14, a Twitter event detection corpus was also released at the same address.

To be honest, I do not yet know how these datasets could be used for clustering experiments. They are worth a look, though, especially for large-scale data mining research, as the authors suggest.

UPDATE April 12, 2011: I'm sorry to see that the Twitter dataset is no longer available, at Twitter's request. I hope an up-to-date Twitter dataset will be available soon.

I am unable to provide download links for this dataset. Please do not make requests for this dataset. I’m sorry.

Hello world!

By Volkan TUNALI, June 22, 2010 8:05 am

Hello world of Data Mining!

New posts are coming soon.
