Category: Software

PRETO: A High-performance Text Mining Tool for Preprocessing Turkish Texts

By Volkan TUNALI, June 26, 2012 1:59 am

For my text mining research, I often need to preprocess document collections of varying sizes. Because I work on Turkish texts as well as English ones, I need preprocessing options designed specifically for Turkish.

To meet these needs, I developed a text mining tool that preprocesses texts in Turkish as well as English. I call this tool PRETO. It is now available as an open source project on Google Code under the GNU GPL v3 license, so you can freely download and use it. The project address is http://code.google.com/p/preto/

If you use this tool for academic research purposes, please cite it as below:

Volkan Tunalı, Turgay Tugay Bilgin, “PRETO: A High-performance Text Mining Tool for Preprocessing Turkish Texts”, International Conference on Computer Systems and Technologies (CompSysTech), Ruse, Bulgaria, June 22-23, 2012, 134-140.

You can access the paper via ACM Digital Library.

Guide to Intelligent Data Analysis

By Volkan TUNALI, December 22, 2010 11:03 pm

I want to introduce a new Data Mining book from Springer: Guide to Intelligent Data Analysis. This book provides a hands-on instructional approach to many basic data analysis techniques, and explains how these are used to solve data analysis problems.

Authors: Michael R. Berthold (University of Konstanz, Germany), Christian Borgelt (European Centre for Soft Computing, Spain), Frank Höppner (Ostfalia University of Applied Sciences, Germany), Frank Klawonn (Ostfalia University of Applied Sciences, Germany).
Publisher: Springer
ISBN: 978-1-84882-259-7

In the book, chapters proceed with examples where KNIME and/or R are used as analysis tools. In addition, two chapters of appendices are dedicated to KNIME and R.

This is an excellent book that combines the theory and practice of data analysis very well. I strongly recommend it to data mining researchers. For more information, you can visit the Springer page for the book.

RapidMiner Community Meeting And Conference – RCOMM 2010

By Volkan TUNALI, July 18, 2010 1:54 am

Rapid-I hosts the first RapidMiner Community Meeting And Conference (RCOMM 2010) and invites users and developers of RapidMiner to take part and share their RapidMiner experiences with other members of the community.

Important Dates

Submission Deadline: August 6, 2010
Notification of Acceptance: August 13, 2010
Camera-ready Papers: August 20, 2010
Conference: September 13 – 16, 2010

Location

University of Dortmund, Germany

More Info & Registration

You can visit the conference home page.

Cluto vs. Gmeans – An Empirical Comparison

By Volkan TUNALI, July 16, 2010 5:55 pm

Last month I attended the 1st International Symposium on Computing in Science and Engineering, held by Gediz University in Kusadasi, Turkey, where I presented a paper.

My paper was titled “An empirical comparison of fast and efficient tools for mining textual data”. In it, we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans.

The abstract of the paper is below:

In order to effectively manage and retrieve the information comprised in vast amount of text documents, powerful text mining tools and techniques are essential. In this paper we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans. Several experiments were conducted on three benchmark datasets, and results are analysed in terms of clustering quality, memory and CPU time consumption. We empirically show that Gmeans offers high scalability by sacrificing clustering quality while Cluto presents better clustering quality at the expense of memory and CPU time.

Keywords: text mining, document clustering, spherical k-means, bisecting k-means

About Cluto

Written in ANSI C by George Karypis, CLUTO (CLUstering TOolkit) is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

Cluto contains partitional, agglomerative, and graph-partitioning based clustering algorithms. Within the partitional class, bisecting k-means is the default option, and it is the algorithm considered in the paper. In addition, Cluto offers multiple distance (similarity) functions such as cosine, Euclidean, correlation coefficient, and extended Jaccard, with cosine as the default. Cluto also has an option to select one of several clustering criterion functions from four categories: internal, external, hybrid, and graph-based.
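To make the similarity functions listed above more concrete, here is a minimal Python sketch based on their common textbook definitions. It is not taken from Cluto's source code, Cluto's exact formulations may differ, and the function names are made up for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between a and b (a distance, not a similarity)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_coefficient(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return cosine_similarity(centred_a, centred_b)


def extended_jaccard(a, b):
    """Extended Jaccard: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


# Two toy term-frequency vectors over a three-word vocabulary.
doc_a = [1.0, 0.0, 1.0]
doc_b = [1.0, 1.0, 0.0]
print(cosine_similarity(doc_a, doc_b))
print(extended_jaccard(doc_a, doc_b))
```

Cosine ignores vector length, which is why it is a natural default for documents of different sizes. Euclidean distance and extended Jaccard are both sensitive to length.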

About Gmeans

Gmeans is a C++ program for clustering, developed by Yuqiang Guan as part of his PhD thesis. The program employs four different k-means type clustering algorithms with four different distance (similarity) measures: cosine, Euclidean, diametric distance, and Kullback-Leibler divergence. Cosine is the default similarity measure; it is used for spherical k-means, where each document vector is L2-normalised. The program also includes a local search strategy, called first variation, to overcome the local optima problem. It produces a one-way, hard clustering of a given dataset.
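The spherical k-means idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Gmeans itself: the function names, the random initialisation, and the fixed iteration cap are choices made for this example, and the first variation local search is not implemented.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale vec to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Cluster docs into k groups by cosine similarity (hard clustering)."""
    rng = random.Random(seed)
    unit_docs = [l2_normalize(d) for d in docs]
    # Assumption: start from k randomly chosen documents as centroids.
    centroids = [list(v) for v in rng.sample(unit_docs, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: for unit vectors, the dot product is the cosine.
        new_labels = [
            max(range(k), key=lambda c: dot(doc, centroids[c]))
            for doc in unit_docs
        ]
        # Update step: sum the members, then project back to the unit sphere.
        for c in range(k):
            members = [d for d, lab in zip(unit_docs, new_labels) if lab == c]
            if members:
                summed = [sum(column) for column in zip(*members)]
                centroids[c] = l2_normalize(summed)
        converged = new_labels == labels
        labels = new_labels
        if converged:
            break
    return labels, centroids


# Four toy documents: two point mostly along each axis.
docs = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
labels, centroids = spherical_kmeans(docs, k=2)
print(labels)
```

Because every document and centroid has unit length, maximising the dot product is the same as maximising cosine similarity, so document length plays no role in the assignment.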

Download the Paper

If you are interested in the details of this comparison, such as the datasets used and the experiments performed, you can freely download the paper here (a .pdf inside a .rar archive, 281K).

KNIME – Open Source Data Mining Software

By Volkan TUNALI, July 13, 2010 1:50 pm

KNIME is my favorite visual data mining tool, with an easy-to-use, intuitive data flow user interface and powerful data mining elements. If you are looking for well-designed and effective open source data mining software, KNIME is a perfect fit.

Developed by the Chair for Bioinformatics and Information Mining at the University of Konstanz, Germany, KNIME is a modular visual data exploration platform that offers hundreds of processing elements for data I/O, preprocessing and cleansing, modeling, analysis, data mining, and viewing.

KNIME is based on the Eclipse platform and built with Java, so it works on Windows, Linux, and Mac OS X.

You can find more information about KNIME here, and you can download it here.

The following are two sample screenshots from KNIME.

[Two KNIME screenshots]