Using “awk” to Join Text Files on Windows

By Volkan TUNALI, November 4, 2010 11:34 pm

Last time, I used awk to split the single cisi.all file into small files named cisi.1, cisi.2, and so on. Now I need to join these small files back into a single file in a kind of XML format. I have read several awk tutorials, but I could not find a way to loop over many text files from within awk itself. So I used another approach based on Windows batch scripting: I wrote a small awk program that formats the content of a given file and appends it to an output file, and then, in a batch file, I loop over the files in a directory and run the awk program on each one.

Here’s the batch file named JOIN.BAT:

rem Remove any previous output, then run join.awk on every file in dataset\
if exist output.xml del output.xml
for %%X in (dataset\*.*) do (awk -f join.awk "%%X")

Here’s the awk file named JOIN.AWK:

BEGIN { print "<DOC>\n<BODY>" >> "output.xml" }
      { print $0 >> "output.xml" }
END   { print "</BODY>\n</DOC>\n" >> "output.xml" }

As you can see in the awk program, the content of each file is appended to output.xml. On Unix-like systems, you can write a similar shell script instead of a batch file.
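For example, a minimal shell equivalent of JOIN.BAT might look like this (a sketch; it assumes the small files live in a dataset/ directory and that join.awk from above is in the current directory):

```shell
#!/bin/sh
# Remove any previous output, then append each file in dataset/ via join.awk
rm -f output.xml
for f in dataset/*; do
    awk -f join.awk "$f"
done
```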

Using “awk” to Extract Title and Body Text from “cisi.all” File

By Volkan TUNALI, October 10, 2010 11:55 pm

In my previous post, I wrote an awk program to extract the title part of the documents that reside in the cisi.all file of the CISI document collection. In this post, I extend that program with a few additional lines to also extract the body part of the documents.

Here’s the code:

BEGIN { docNo = 0 }
$0 == ".T" { docNo++  #starting a new doc when we encounter a .T line
			 textStarted = 1
			 printThisLine = 0
			 print "Processing Doc #" docNo
			 docName = sprintf("%s%d", "cisi.", docNo) # doc name is like cisi.1 cisi.2 etc.
			}
$0 == ".A" || $0 == ".X" || $0 == ".I" { textStarted = 0 }  # when a new field separator is encountered, stop printing until the next .T or .W
$0 == ".W" { textStarted = 1  # body part starts with .W
			 printThisLine = 0
			}
	{ if (textStarted == 1)
		{
			if (printThisLine)
				print $0 > docName

			printThisLine = 1 #subsequent lines after .T or .W will be printed
		}
	}

Similar awk code should suffice to extract the title and body parts from the CRAN, MED, and CACM collections.
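As a quick sanity check, the number of files produced should equal the number of .T markers in the source file, since each document contributes exactly one. A small sketch (assuming the source file is named cisi.all and the output files are cisi.1, cisi.2, etc.):

```shell
# Count the .T markers in the source; each one starts a new document
awk '$0 == ".T" { n++ } END { print n }' cisi.all
# Count the extracted files (cisi.1, cisi.2, ...)
ls | grep -c '^cisi\.[0-9][0-9]*$'
```

The two numbers should match.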

Absolute Beginner’s First “awk” Program

By Volkan TUNALI, October 1, 2010 12:09 am

Until recently, I knew nothing about awk, an excellent text-processing tool (actually a programming language). When I saw a few tweets about it, I downloaded a version that runs under Windows and started learning by experimenting.

The best way to learn a new programming language is to do something real and genuinely useful with it. So I wanted to see whether I could process the well-known Classic3 dataset. My goal is to split the documents of the CISI collection, which reside in a single file named cisi.all, into separate files named cisi.1, cisi.2, etc.

I have started with only the title part of the documents; later I will continue with the body part. The following is my very first awk program to do the job, and it works fine.

BEGIN { docNo = 0 }
$0 == ".T" { docNo++  #starting a new doc when we encounter a .T line
			 titleStarted = 1
			 printThisLine = 0
			 print "Processing Doc #" docNo
			 docName = sprintf("%s%d", "cisi.", docNo) # doc name is like cisi.1 cisi.2 etc.
			}
$0 == ".A" || $0 == ".X" || $0 == ".I" { titleStarted = 0 } # when a new field separator is encountered, stop printing until the next doc
	{ if (titleStarted == 1)
		{
			if (printThisLine)
				print $0 > docName

			if (printThisLine == 0) #subsequent lines after .T will be printed
				printThisLine = 1
		}
	}

Put this code into a text file named, for example, cisi.awk, and then run the following command to split the CISI collection residing in the file cisi.all:

awk -f cisi.awk cisi.all
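To see what the program does, here is an end-to-end check on a tiny two-document sample in SMART-style markup (a sketch, not actual collection content; the real cisi.all uses the same .I/.T/.A/.W/.X markers):

```shell
# Create a two-document sample in the same markup as cisi.all
cat > sample.all <<'EOF'
.I 1
.T
First title
.A
Author One
.I 2
.T
Second title
.A
Author Two
EOF

awk -f cisi.awk sample.all   # writes cisi.1 and cisi.2
cat cisi.1                   # prints: First title
```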

There are surely better and simpler ways to do the same processing with more advanced awk features; as I learn more, I will revise the code above and post it here. So far so good: I once had to write a much more complicated C++ program to split the Classic3 documents into separate files, and now I can do it with a few lines of awk.

I will improve the code above to include the body part of the documents and post it here.

You can find more information about the Classic document collection in one of my recent posts, “Classic3 and Classic4 DataSets”.

awk is now among my favorite text processing tools.

Classic3 and Classic4 DataSets

By Volkan TUNALI, September 10, 2010 4:53 am

One well-known benchmark dataset used in text mining is the Classic collection, which can be obtained from ftp://ftp.cs.cornell.edu/pub/smart/. This dataset consists of four document collections: CACM, CISI, CRAN, and MED, each downloadable as a single file. To get at the individual documents, some processing is needed to extract the title and document body. I have separated all of these into individual documents and made them available for public download. You can freely download the whole collection (1.5MB RAR file).

The composition of the collection is as follows:

  • CACM: 3204 documents
  • CISI: 1460 documents
  • CRAN: 1398 documents
  • MED: 1033 documents

This dataset is usually referred to as the Classic3 dataset (CISI, CRAN, and MED only), and sometimes as the Classic4 dataset when CACM is included.

As a further step, I have preprocessed the whole dataset and obtained the document-term matrix in various forms. You can download the matrices and related files here (7.4MB RAR file). The files are described below:

  • docbyterm.mat: Term frequencies only (in Cluto’s MAT file format)
  • docbyterm.tfidf.mat: Weighted with TFIDF scheme (in Cluto’s MAT file format)
  • docbyterm.tfidf.norm.mat: Weighted with TFIDF scheme and normalized to 1 (in Cluto’s MAT file format)
  • docbyterm.txt: Term frequencies only (in Coordinate file format)
  • docbyterm.tfidf.txt: Weighted with TFIDF scheme (in Coordinate file format)
  • docbyterm.tfidf.norm.txt: Weighted with TFIDF scheme and normalized to 1 (in Coordinate file format)
  • documents.txt: List of the document names as they appear in the data matrix
  • terms.txt: List of terms that appear in the data matrix
  • terms_detailed.txt: A detailed list of terms (i.e., term id, term, and the number of documents the term appears in)

As you can see, the preprocessing results come in two simple and well-known text file formats: the coordinate file format and Cluto’s MAT file format. Term frequency, TFIDF, and normalized TFIDF weighting schemes are available. Terms are single words; that is, there are no n-grams. The minimum term length is 3. Each term appears in at least 3 documents and in at most 95% of the documents. In addition, Porter’s stemming was applied during preprocessing.
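As an illustration of the coordinate file format, here is a small awk sketch that checks a file’s header against its contents (an assumption about the exact layout: a header line giving the row count, column count, and non-zero count, followed by one "row col value" triple per line):

```shell
# Verify that the non-zero count declared in the header matches
# the number of "row col value" lines that follow
awk 'NR == 1 { nnz = $3; next }
     NF == 3 { count++ }
     END { print (count == nnz ? "OK" : "MISMATCH") }' docbyterm.txt
```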

If you find these documents and data matrix files useful, please let me know in the comments. Any questions or criticism are also welcome.

Links to Several Datasets

By Volkan TUNALI, August 2, 2010 7:18 pm

Here you can find several datasets for use in data mining research. I will add more datasets as I encounter them. If you find any broken links, please let me know.

Text Datasets

Frequent Itemset Mining Datasets

UC Irvine Machine Learning Repository
