Absolute Beginner’s First “awk” Program

By Volkan TUNALI, October 1, 2010 12:09 am

Until recently I knew nothing about awk, an excellent text processing tool (actually a programming language). When I saw a few tweets about it, I downloaded a version that runs under Windows and started learning by trying.

The best way to learn a new programming language is to do something real and really useful with it. So I wanted to see whether I could process the well-known Classic3 dataset. My goal is to split the documents of the cisi collection, which reside in a single file named cisi.all, into separate files named cisi.1, cisi.2, etc.

I started with only the title part of the documents. Later I will continue with the body part. The following is my very first awk program to do the job. It works fine.

BEGIN { docNo = 0 }
$0 == ".T" {	# a .T line starts the title of a new doc
	if (docName != "") close(docName)	# done with the previous doc's file
	docNo++
	titleStarted = 1
	printThisLine = 0	# do not print the .T line itself
	print "Processing Doc #" docNo
	docName = sprintf("cisi.%d", docNo)	# doc name is like cisi.1 cisi.2 etc.
}
$0 == ".A" || $0 == ".X" || $0 == ".I" { titleStarted = 0 }	# a new field marker ends the title; stop printing until next doc
{
	if (titleStarted == 1) {
		if (printThisLine)
			print $0 > docName
		if (printThisLine == 0)	# next lines after .T will be printed
			printThisLine = 1
	}
}

Put this code into a text file named, for example, cisi.awk, and then run the following command to split the cisi collection residing in the file cisi.all:

awk -f cisi.awk cisi.all

There are probably much better and simpler ways to do the same processing with more advanced awk features. As I learn more, I will modify the code above and post it here again. So far so good. Previously I had to write a much more complicated C++ program to split the Classic3 documents into separate files. Now I can do it with a few lines of awk code.
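As one possible simplification, here is a minimal sketch that replaces the printThisLine bookkeeping with awk's next statement and a single flag. It is not the program above, only a variant under some assumptions: the sample input is made up (it is not real cisi data), the names out and inTitle are my own, and every field marker is assumed to be a dot followed by one capital letter at the start of a line.

```shell
#!/bin/sh
# Work in a scratch directory so the generated cisi.N files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny made-up sample in the same tagged layout (hypothetical data).
printf '%s\n' \
  '.I 1' '.T' 'First sample title' 'continued on a second line' \
  '.A' 'Some Author' '.W' 'Body text of the first document.' \
  '.I 2' '.T' 'Second sample title' \
  '.A' 'Another Author' '.W' 'Body text of the second document.' \
  > sample.all

awk '
  # A line that is exactly ".T" opens the title of a new document.
  $0 == ".T" {
      if (out != "") close(out)     # finished with the previous file
      out = "cisi." (++docNo)
      inTitle = 1
      next                          # skip the marker line itself
  }
  # Any other marker line (".A", ".W", ".X", ".I 12", ...) ends the title.
  /^\.[A-Z]( |$)/ { inTitle = 0; next }
  # Everything else is printed only while we are inside a title.
  inTitle { print > out }
' sample.all

cat cisi.1
cat cisi.2
```

Because next skips the remaining rules for the marker lines, the last rule only ever sees ordinary text lines, so no extra "skip this line" flag is needed.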

I will improve the code above to include the body part of the documents and post it here.
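Until then, here is a rough sketch of how the body could be included. It rests on assumptions that should be checked against the real file: that the body text follows a .W marker, and that every document starts with a line of the form ".I <number>". The sample data is made up, and the names out and keep are hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so the generated cisi.N files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny made-up sample in the same tagged layout (hypothetical data).
printf '%s\n' \
  '.I 1' '.T' 'First sample title' '.A' 'Some Author' \
  '.W' 'Body text of the first document.' '.X' '1 2 3' \
  '.I 2' '.T' 'Second sample title' '.A' 'Another Author' \
  '.W' 'Body text of the second document.' '.X' '4 5 6' \
  > sample.all

awk '
  # Assumption: a line such as ".I 7" starts a new document.
  /^\.I / {
      if (out != "") close(out)     # finished with the previous file
      out = "cisi." (++docNo)
      keep = 0
      next
  }
  # Title (.T) and, by assumption, body (.W) are the parts we keep.
  $0 == ".T" || $0 == ".W" { keep = 1; next }
  # Any other marker (".A", ".X", ...) switches printing off again.
  /^\.[A-Z]( |$)/ { keep = 0; next }
  keep && out != "" { print > out }
' sample.all

cat cisi.1
cat cisi.2
```

Starting each document at the .I line rather than at .T means a document without a title would still get its own file, at the cost of relying on the .I layout assumed above.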

You can find more information about the Classic document collection in one of my recent posts, “Classic3 and Classic4 DataSets”.

awk is now among my favorite text processing tools.
