KDD2011: 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

By Volkan TUNALI, October 18, 2010 12:41 am

The annual ACM SIGKDD conference is the premier international forum for data mining researchers and practitioners from academia, industry, and government to share their ideas, research results, and experiences. KDD-2011 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

KDD-2011 will run from August 21-24 in San Diego, CA, and will bring hundreds of practitioners and academic data miners together in one location.

Important Dates

  • Feb 11, 2011: Paper abstract deadline
  • Feb 18, 2011: Full paper deadline
  • May 13, 2011: Paper acceptance notification
  • Aug 21-24, 2011: KDD-2011 conference

* All deadlines are at 11:59 PM Pacific time.

For more information, you can visit the conference home page.

Using “awk” to Extract Title and Body Text from the “cisi.all” File

By Volkan TUNALI, October 10, 2010 11:55 pm

In my latest post I wrote an awk program to extract the title part of the documents that reside in the cisi.all file of the CISI document collection. In this new post, I extend that program with a few additional lines so that it also covers the body part of the documents.

Here’s the code:

BEGIN { docNo = 0 }
$0 == ".T" { docNo++  #starting a new doc when we encounter a .T line
			 textStarted = 1
			 printThisLine = 0
			 print "Processing Doc #" docNo
			 docName = sprintf("%s%d", "cisi.", docNo) # doc name is like cisi.1 cisi.2 etc.
			}
$0 == ".A" || $0 == ".X" || $0 == ".I" { textStarted = 0 }  # when a new field separator encountered stop printing until next .T or .W
$0 == ".W" { textStarted = 1  # body part starts with .W
			 printThisLine = 0
			}
	{ if (textStarted == 1)
		{
			if (printThisLine)
				print $0 > docName

			printThisLine = 1 # subsequent lines after .T or .W will be printed
		}
	}
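
As in the previous post, save the code into a text file (the name cisi2.awk below is just an example) and run it over cisi.all:

awk -f cisi2.awk cisi.all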

Similar awk code should suffice to extract the title and body parts from the CRAN, MED, and CACM collections; for example, the script could be parameterized as sketched below.
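
The hard-coded “cisi.” prefix could be passed in with awk’s -v option so that a single script handles all four collections. This is only a sketch (the script name split.awk and the per-collection file names are illustrative, and collections that use extra field markers such as .B would need those added to the rule that stops printing):

# inside split.awk, build the doc name from the prefix variable
# instead of the "cisi." literal:
#   docName = sprintf("%s%d", prefix, docNo)

awk -v prefix=cran. -f split.awk cran.all
awk -v prefix=med. -f split.awk med.all
awk -v prefix=cacm. -f split.awk cacm.all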

Absolute Beginner’s First “awk” Program

By Volkan TUNALI, October 1, 2010 12:09 am

Until recently I did not know anything about awk, an excellent text processing tool (actually a programming language). When I saw a few tweets about awk, I downloaded a version that runs under Windows and started learning by experimenting.

The best way to learn a new programming language is to do something real and genuinely useful with it. So I wanted to see whether I could process the well-known Classic3 dataset. My goal is to split the documents of the CISI collection, which reside in a single file named cisi.all, into separate files named cisi.1, cisi.2, and so on.

I have started with only the title part of the documents; later I will continue with the body part. The following is my very first awk program, and it does the job fine.

BEGIN { docNo = 0 }
$0 == ".T" { docNo++  #starting a new doc when we encounter a .T line
			 titleStarted = 1
			 printThisLine = 0
			 print "Processing Doc #" docNo
			 docName = sprintf("%s%d", "cisi.", docNo) # doc name is like cisi.1 cisi.2 etc.
			}
$0 == ".A" || $0 == ".X" || $0 == ".I" { titleStarted = 0 } # when new field separator encountered stop printing until next doc
	{ if (titleStarted == 1)
		{
			if (printThisLine)
				print $0 > docName

			if (printThisLine == 0) #next lines after .T will be printed
				printThisLine = 1
		}
	}

Put this code into a text file named, for example, cisi.awk, and then run the following command to split the CISI collection residing in the file cisi.all:

awk -f cisi.awk cisi.all
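
If everything works, awk prints a progress line for each document and writes the output files into the current directory; for CISI that means 1460 files (cisi.1 through cisi.1460) and console output like:

Processing Doc #1
Processing Doc #2
...
Processing Doc #1460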

There are probably much better and simpler ways to do the same processing with more advanced awk features. As I learn more, I will revise the code above and post it here again. So far, so good: I once needed to write a much more complicated C++ program to split the Classic3 documents into separate files, and now I can do it with a few lines of awk code.
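
For instance, the same title extraction can be written far more compactly by leaning on awk’s pattern-action structure and the next statement. Here is a sketch of one such simplification (minus the progress messages; I have not verified it against the whole file):

$0 == ".T"                             { docName = sprintf("cisi.%d", ++docNo); printing = 1; next }
$0 == ".A" || $0 == ".X" || $0 == ".I" { printing = 0 }
printing                               { print > docName }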

I will improve the code above to include the body part of the documents and post it here.

You can find more information about the Classic document collection in one of my recent posts, “Classic3 and Classic4 DataSets”.

awk is now among my favorite text processing tools.

ICDM’10: The 10th IEEE International Conference on Data Mining

By Volkan TUNALI, September 20, 2010 5:42 pm

Sponsored by the IEEE Computer Society
December 14-17, 2010, Sydney, Australia
http://datamining.it.uts.edu.au/icdm10

Important Dates
*****************
Apr 14, 2010: Deadline for workshop proposals
May 06, 2010: Deadline for ICDM contest proposals
Jul 02, 2010: Deadline for full paper submissions
Jul 13, 2010: Deadline for demo and tutorial proposals
Jul 23, 2010: Deadline for paper submissions to the 18 ICDM workshops (extended to Aug 9)
Sep 17, 2010: Notification of acceptance of full papers
Sep 20, 2010: Notification of acceptance of workshop papers
Oct 11, 2010: Camera-ready copies and copyright forms (11:59pm Hawaii time)

The IEEE International Conference on Data Mining (ICDM) has established itself as the world’s premier research conference in data mining. The 10th edition of ICDM (ICDM ’10) provides a leading forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications. In addition, ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference will feature invited talks from Christos Faloutsos (CMU), Geoff McLachlan (UQ) and Xindong Wu (UVM), workshops, tutorials, panels, and the ICDM data mining contest.

Classic3 and Classic4 DataSets

By Volkan TUNALI, September 10, 2010 4:53 am

One well-known benchmark dataset used in text mining is the Classic collection, which can be obtained from ftp://ftp.cs.cornell.edu/pub/smart/. This dataset consists of 4 different document collections: CACM, CISI, CRAN, and MED. These collections can be downloaded as one file per collection, so some processing is needed to extract the title and body of each individual document. I have split all of them into individual documents and made them available for public download. You can freely download the whole collection (1.5MB RAR file).

The composition of the collection is as follows:

  • CACM: 3204 documents
  • CISI: 1460 documents
  • CRAN: 1398 documents
  • MED: 1033 documents

This dataset is usually referred to as the Classic3 dataset (CISI, CRAN, and MED only), and sometimes as the Classic4 dataset (all four collections).

As a further step, I have preprocessed the whole dataset and obtained the document-term matrix in various forms. You can download the matrices and related files here (7.4MB RAR file). The files are explained below:

  • docbyterm.mat: Term frequencies only (in Cluto’s MAT file format)
  • docbyterm.tfidf.mat: Weighted with TFIDF scheme (in Cluto’s MAT file format)
  • docbyterm.tfidf.norm.mat: Weighted with TFIDF scheme and normalized to unit length (in Cluto’s MAT file format)
  • docbyterm.txt: Term frequencies only (in Coordinate file format)
  • docbyterm.tfidf.txt: Weighted with TFIDF scheme (in Coordinate file format)
  • docbyterm.tfidf.norm.txt: Weighted with TFIDF scheme and normalized to unit length (in Coordinate file format)
  • documents.txt: List of the document names as they appear in the data matrix
  • terms.txt: List of terms that appear in the data matrix
  • terms_detailed.txt: A detailed list of terms (i.e., term id, term, and the number of documents the term appears in)

As you can see, the preprocessing results come in two simple and well-known text file formats: the Coordinate format and Cluto’s MAT format. In addition, three weighting schemes are available: raw term frequency, TFIDF, and normalized TFIDF. Terms are single words; that is, there are no n-grams. The minimum term length is 3 characters, each term appears in at least 3 documents and in at most 95% of the documents, and Porter’s stemming was applied during preprocessing.
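
For reference, here is roughly what the two file formats look like, using made-up placeholder values rather than actual rows from the dataset (the labels are not part of the files). In Cluto’s sparse MAT format, a header line gives the number of rows, columns, and nonzero entries, and each following line holds one document’s 1-based column-index/value pairs; the Coordinate files follow the common convention of the same header followed by one row/column/value triplet per nonzero entry:

Cluto MAT format (sparse):
<#docs> <#terms> <#nonzeros>
12 2.0 87 1.0 345 3.0

Coordinate format:
<#docs> <#terms> <#nonzeros>
1 12 2.0
1 87 1.0
1 345 3.0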

If you find these documents and data matrix files useful, please let me know in the comments. Questions and criticism are also welcome.
