Using “awk” to Extract Title and Body Text from “cisi.all” File
In my previous post, I wrote an awk program to extract the title part of the documents in the cisi.all file from the CISI document collection. In this post, I extend that program with a few additional lines so that it also extracts the body part of each document.
Here’s the code:
    BEGIN { docNo = 0 }

    # A .T line starts a new document
    $0 == ".T" {
        if (docName != "") close(docName)   # release the previous output file
        docNo++
        textStarted = 1
        printThisLine = 0
        print "Processing Doc #" docNo
        docName = sprintf("%s%d", "cisi.", docNo)   # doc name is like cisi.1, cisi.2, etc.
    }

    # Any other field marker stops printing until the next .T or .W
    # ($1 is compared for .I because that line also carries the document ID, e.g. ".I 1")
    $0 == ".A" || $0 == ".X" || $1 == ".I" { textStarted = 0 }

    # The body part starts with .W
    $0 == ".W" {
        textStarted = 1
        printThisLine = 0
    }

    {
        if (textStarted == 1) {
            if (printThisLine)
                print $0 > docName
            printThisLine = 1   # subsequent lines after .T or .W will be printed
        }
    }
Similar awk code should be sufficient to extract the title and body parts from the CRAN, MED and CACM collections.
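As a starting point for those other collections, here is a minimal sketch of a slightly more general variant. It is not the program above. It assumes each document begins with a ".I" line followed by an ID, that ".T" and ".W" are the only fields worth keeping, and that every other field marker is a dot followed by a single capital letter. I have not checked these assumptions against every collection, so treat the tag list as something to adjust. The sample.all file and the "sample." prefix are made up for illustration and are not real collection data.

```shell
# Build a tiny, made-up sample in the same dot-tag layout.
cat > sample.all <<'EOF'
.I 1
.T
First Title
.A
Some Author
.W
First body line one.
First body line two.
.X
1 5 1
.I 2
.W
Second body only.
EOF

# prefix is a hypothetical variable: the output file name prefix.
awk -v prefix="sample." '
/^\.I / {                               # ".I <id>" starts a new document
    if (docName != "") close(docName)   # release the previous output file
    docNo++
    docName = prefix docNo
    inText = 0
    next
}
$0 == ".T" || $0 == ".W" { inText = 1; next }   # text fields to keep
/^\.[A-Z]$/ { inText = 0; next }                # any other marker stops output
inText { print > docName }                      # copy title and body lines
' sample.all
```

Starting a new document at ".I" instead of ".T" means a document without a title line still gets its own output file, as the second sample document shows. Running the sketch should leave the title and body of the first document in sample.1 and the body of the second in sample.2.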
Fundamentals of Predictive Text Mining is a book on text mining that I found recently. It explains the essentials of text mining very well, with good examples, so I strongly recommend it to newcomers to the field. Although the book's focus is predictive text mining, its content is broad enough to cover topics such as text clustering, information retrieval, and information extraction.