PRETO: A High-performance Text Mining Tool for Preprocessing Turkish Texts

By Volkan TUNALI, June 26, 2012 1:59 am

For my text mining research, I often need to preprocess document collections of varying size. Besides texts in English, I also work on texts in Turkish. Therefore, I need special preprocessing options for texts in Turkish.

To meet these needs, I have developed a text mining tool for preprocessing texts in Turkish as well as English. I call this tool PRETO. It is now available as an open source project at Google Code under the GNU GPL v3 license, and you can freely download and use it. Address of the project is

If you use this tool for academic research purposes, please cite it as below:

Volkan Tunalı, Turgay Tugay Bilgin, “PRETO: A High-performance Text Mining Tool for Preprocessing Turkish Texts”, International Conference on Computer Systems and Technologies (CompSysTech), Ruse, Bulgaria, June 22-23, 2012, 134-140.

You can access the paper via ACM Digital Library.

Salford Analytics and Data Mining Conference 2012

By Volkan TUNALI, March 30, 2012 12:50 am

The 2012 Salford Analytics & Data Mining Conference is aimed at bringing together researchers, practitioners, and data mining enthusiasts to learn about data mining technology from practical and theoretical experts. Conference attendees can expect to exchange ideas and experiences focused on the practice of both data mining and real-world analysis of complex data. Salford Systems promises an unparalleled opportunity to learn from experienced professionals solving real-world problems.

Conference Location: San Diego, CA
Conference Sessions: May 24-25, 2012
Pre-Conference Training: May 21-23, 2012

For more information please visit

Reactive Business Intelligence

By Volkan TUNALI, December 25, 2010 3:14 pm

I’ve recently found an interesting data analysis and visualization book: Reactive Business Intelligence: From Data to Models to Insight by Roberto Battiti and Mauro Brunato.

The book explains data analysis concepts in an easy and intuitive way, supported with visual elements. It is freely available for download at

The book also contains funny pictures that illustrate each subject. Below are two examples that I find very nice. I hope the authors don’t mind me putting them here. :)

Figure 7.1 from page 42 – Clustering.

Figure 17.1 from page 167 – Local search.

Update April 6, 2011: A little note from the authors:

Dear colleague:

Our latest book is now printed:

Reactive Business Intelligence. From Data to Models to Insight.
R. Battiti and M. Brunato,
Reactive Search Srl, Italy, February 2011.
ISBN: 978-88-905795-0-9

Full details at the book web site:

Reactive Business Intelligence is about integrating data mining,
modeling and interactive visualization, into an end-to-end
discovery and continuous innovation process powered
by human and automated learning.
This holistic and unifying goal requires collecting and integrating
topics which are usually dissected in books dedicated to
different areas.

We plan to place figures and slides in the same place very soon.

– Roberto Battiti and Mauro Brunato

Guide to Intelligent Data Analysis

By Volkan TUNALI, December 22, 2010 11:03 pm

I want to introduce a new data mining book from Springer: Guide to Intelligent Data Analysis. This book takes a hands-on instructional approach to many basic data analysis techniques and explains how they are used to solve data analysis problems.

Authors: Michael R. Berthold (University of Konstanz, Germany), Christian Borgelt (European Centre for Soft Computing, Spain), Frank Höppner (Ostfalia University of Applied Sciences, Germany), Frank Klawonn (Ostfalia University of Applied Sciences, Germany).
Publisher: Springer
ISBN: 978-1-84882-259-7

The chapters proceed through examples that use KNIME and/or R as analysis tools. In addition, two appendices are dedicated to KNIME and R.

This is an excellent book that combines the theory and practice of data analysis very well. I strongly recommend it to data mining researchers. For more information, you can visit the Springer page for the book.

Using “awk” to Join Text Files on Windows

By Volkan TUNALI, November 4, 2010 11:34 pm

Last time I used awk to split the single cisi.all file into small files like cisi.1, cisi.2, etc. Now I need to join these small files into a single one in a kind of XML format. I read some tutorials on awk, but I could not find a way to loop over many text files. So I used another solution based on Windows batch file scripting. I wrote a little awk program that formats the content of a given file and appends it to an output file. Then, in a batch file, I loop over the files in a directory and run the awk program on each one.

Here’s the batch file named JOIN.BAT:

rem Remove any previous output, then run join.awk once per dataset file
if exist output.xml del output.xml
for /r %%X in (dataset\*.*) do (awk -f join.awk "%%X")

Here’s the awk file named JOIN.AWK:

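# Open the document wrapper before the first line of the input file
BEGIN { print "<DOC>\n<BODY>" >> "output.xml" }
# Append every input line unchanged
{ print $0 >> "output.xml" }
# Close the wrapper after the last line of the input file
END { print "</BODY>\n</DOC>\n" >> "output.xml" }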

As you can see in the awk program, the content of each file is appended to the file output.xml. On Unix-like systems, you can write a similar shell script instead of a batch file.
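For readers on Unix-like systems, here is a minimal sketch of what such a shell script might look like. It is not part of the original setup: the temporary directory and the two sample files are hypothetical stand-ins for the real dataset directory, and the awk program is inlined rather than read from JOIN.AWK. It assumes a POSIX shell with awk and mktemp available.

```shell
#!/bin/sh
# Sketch of a Unix-like counterpart to JOIN.BAT (assumes awk is on PATH).
# The work directory and sample files below are hypothetical demo data.
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/dataset"
printf 'first document\n'  > "$workdir/dataset/cisi.1"
printf 'second document\n' > "$workdir/dataset/cisi.2"

out="$workdir/output.xml"
rm -f "$out"    # start from an empty output, like "del output.xml"

# Run one awk process per file, as the batch loop does, and let the
# shell append each wrapped document to the output file.
for f in "$workdir"/dataset/*; do
    awk 'BEGIN { print "<DOC>\n<BODY>" }
         { print $0 }
         END { print "</BODY>\n</DOC>\n" }' "$f" >> "$out"
done

cat "$out"
```

Here the shell does the appending with `>>`, so the awk program only prints to standard output. awk also accepts several file operands in one run, with FNR restarting at 1 for each file, so a single-process version is possible as well. The loop is kept here because it mirrors the batch file most directly.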
