KDD2011: 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

By Volkan TUNALI, October 18, 2010 12:41 am

The annual ACM SIGKDD conference is the premier international forum for data mining researchers and practitioners from academia, industry, and government to share their ideas, research results, and experiences. KDD-2011 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

KDD-2011 will run from August 21 to 24 in San Diego, CA, and will bring hundreds of practitioners and academic data miners together in one location.

Important Dates

  • Feb 11, 2011 Paper abstract deadline
  • Feb 18, 2011 Full paper deadline
  • May 13, 2011 Paper acceptance
  • Aug 21-24, 2011 KDD-2011 Conference

* All deadlines are 11:59 PM Pacific Time.

For more information, you can visit the conference home page.

Using “awk” to Extract Title and Body Text from “cisi.all” File

By Volkan TUNALI, October 10, 2010 11:55 pm

In my previous post I wrote an awk program to extract the title part of the documents that reside in the cisi.all file of the CISI document collection. In this new post, I extend that program with a few additional lines to also cover the body part of the documents.

Here’s the code:

BEGIN { docNo = 0 }

$0 == ".T" {  # a .T line starts a new document
    docNo++
    textStarted = 1
    printThisLine = 0
    print "Processing Doc #" docNo
    docName = sprintf("%s%d", "cisi.", docNo)  # doc names are cisi.1, cisi.2, etc.
}

$0 == ".A" || $0 == ".X" || $0 == ".I" {  # a new field separator stops printing until the next .T or .W
    textStarted = 0
}

$0 == ".W" {  # the body part starts with .W
    textStarted = 1
    printThisLine = 0
}

{
    if (textStarted == 1) {
        if (printThisLine)
            print $0 > docName

        printThisLine = 1  # subsequent lines after .T or .W will be printed
    }
}

Similar awk code should be sufficient to extract the title and body parts from the CRAN, MED, and CACM collections.
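As a sketch of such an adaptation, only the field-separator pattern needs to grow. Note that the extra .B (bibliography) field and the sample data below are my assumptions for illustration, not taken from the actual CRAN file:

```shell
# Hypothetical CRAN-style sample; the .B field and all content are made up.
cat > cran.sample <<'EOF'
.I 1
.T
sample title
.A
someone
.B
journal 1958
.W
sample body
EOF

# Same idea as the CISI script, with .B added to the separator list.
# `next` skips the .T/.W marker lines themselves.
awk '
$0 == ".T" { docNo++; started = 1
             docName = sprintf("cran.%d", docNo); next }
$0 == ".W" { started = 1; next }
$0 == ".A" || $0 == ".B" || $0 == ".X" || /^\.I/ { started = 0 }
started    { print > docName }
' cran.sample
```

After running this, cran.1 should contain the title and body lines of the sample document, with the author and bibliography fields filtered out.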

Absolute Beginner’s First “awk” Program

By Volkan TUNALI, October 1, 2010 12:09 am

Until recently I knew nothing about awk, an excellent text processing tool (actually a programming language). When I saw a few tweets about awk, I downloaded a version that runs under Windows and started learning by trying.

The best way to learn a new programming language is to do something real and genuinely useful with it. So I wanted to see whether I could process the well-known Classic3 dataset. My goal is to split the documents of the CISI collection, which reside in a single file named cisi.all, into separate files named cisi.1, cisi.2, etc.

I started with only the title part of the documents; later I will continue with the body part. The following is my very first awk program to do the job. It works fine.

BEGIN { docNo = 0 }

$0 == ".T" {  # a .T line starts a new document
    docNo++
    titleStarted = 1
    printThisLine = 0
    print "Processing Doc #" docNo
    docName = sprintf("%s%d", "cisi.", docNo)  # doc names are cisi.1, cisi.2, etc.
}

$0 == ".A" || $0 == ".X" || $0 == ".I" {  # a new field separator stops printing until the next doc
    titleStarted = 0
}

{
    if (titleStarted == 1) {
        if (printThisLine)
            print $0 > docName

        if (printThisLine == 0)  # lines after .T will be printed
            printThisLine = 1
    }
}

Put this code into a text file named, for example, cisi.awk, and then run the following command to split the CISI collection residing in the file cisi.all:

awk -f cisi.awk cisi.all

There are probably much better and simpler ways to do the same processing with more advanced awk features. As I learn more, I will revise the code above and post it here again. So far so good: I once needed a much more complicated C++ program to split the Classic3 documents into separate files, and now I can do it with a few lines of awk code.
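As one sketch of a shorter variant, regex patterns plus awk's `next` statement can replace the printThisLine bookkeeping. This assumes the same .T/.A/.W/.X/.I field layout, and the sample data below is made up for illustration:

```shell
# Tiny made-up sample in the CISI field layout (not real CISI data):
cat > sample.all <<'EOF'
.I 1
.T
First Title
.A
Author One
.W
First body text.
.X
1 5 1
.I 2
.T
Second Title
.W
Second body text.
EOF

# Title-only split: `next` skips the .T marker line itself, and any
# other field marker (.A, .W, .X, .I ...) stops the copying.
awk '
/^\.T$/     { doc = sprintf("cisi.%d", ++n); keep = 1; next }
/^\.[AWXI]/ { keep = 0 }
keep        { print > doc }
' sample.all
```

With this sample, cisi.1 should contain only "First Title" and cisi.2 only "Second Title".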

I will extend the code above to include the body part of the documents and post it here.

You can find more information about the Classic document collections in one of my recent posts, “Classic3 and Classic4 DataSets”.

awk is now among my favorite text processing tools.
