Using “awk” to Extract Title and Body Text from “cisi.all” File

By Volkan TUNALI, October 10, 2010 11:55 pm

In my previous post I wrote an awk program to extract the title of each document in the cisi.all file from the CISI document collection. In this post, I extend that program with a few additional lines so that it also extracts the body of each document.

Here’s the code:

BEGIN { docNo = 0 }

# A .T line starts a new document.
$0 == ".T" {
	if (docName != "") close(docName)  # close the previous doc's file; some awks limit open files
	docNo++
	textStarted = 1
	printThisLine = 0
	print "Processing Doc #" docNo
	docName = sprintf("%s%d", "cisi.", docNo)  # doc name is like cisi.1, cisi.2, etc.
}

# Any other field marker stops printing until the next .T or .W.
# The .I line normally carries the doc id (e.g. ".I 1"), so test $1 rather than $0.
$0 == ".A" || $0 == ".X" || $1 == ".I" { textStarted = 0 }

# The body part starts with .W.
$0 == ".W" {
	textStarted = 1
	printThisLine = 0
}

# Runs for every line: print only while inside a title or body.
{
	if (textStarted == 1) {
		if (printThisLine)
			print $0 > docName

		printThisLine = 1  # subsequent lines after .T or .W will be printed, not the marker line itself
	}
}

Similar awk code should be sufficient to extract the title and body parts from the CRAN, MED, and CACM collections, provided the field markers are adjusted to match each collection.
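
As a rough idea of what such a variant might look like, here is a minimal sketch that starts a new document at each ".I" line instead of at ".T", so it would also cope with documents that have no title field. It is not the program from this post: the sample input is made up, the "sample." prefix and the keep variable are only illustrative names, and I am assuming every field marker is a dot followed by one capital letter on its own line. It also names each output file after the id on the ".I" line rather than a running counter.

```shell
#!/bin/sh
# Work in a scratch directory so the sketch leaves nothing behind elsewhere.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny made-up sample in the same layout as cisi.all (not real collection data).
cat > sample.all <<'EOF'
.I 1
.T
First Title
.A
Some Author
.W
First body line one.
First body line two.
.X
1 5 1
.I 2
.W
Second body only.
EOF

awk -v prefix="sample." '
$1 == ".I" {                             # ".I <id>" opens a new document
    if (docName != "") close(docName)    # close the previous output file
    docName = prefix $2                  # e.g. sample.1, sample.2
    keep = 0
    next
}
/^\.[A-Z]$/ {                            # any other field marker line (assumed format)
    keep = ($0 == ".T" || $0 == ".W")    # keep text only after .T or .W
    next
}
keep { print > docName }                 # write title/body text to the doc file
' sample.all

cat sample.1
cat sample.2
```

Changing the prefix and the set of markers in the keep test should be all that is needed to try it on another collection, but check the marker names in the actual file first.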
