Classic3 and Classic4 DataSets

By Volkan TUNALI, September 10, 2010 4:53 am

One well-known benchmark dataset used in text mining is the Classic collection, which can be obtained from ftp://ftp.cs.cornell.edu/pub/smart/. This dataset consists of 4 different document collections: CACM, CISI, CRAN, and MED. Each collection can be downloaded as a single file, so to get the individual documents some processing is needed to extract the title and document body. I have split all of these into individual documents and made them available for public download. You can freely download the whole collection (1.5MB RAR file).

The composition of the collection is as follows:

  • CACM: 3204 documents
  • CISI: 1460 documents
  • CRAN: 1398 documents
  • MED: 1033 documents

This dataset is usually referred to as the Classic3 dataset when only CISI, CRAN, and MED are used, and sometimes as the Classic4 dataset when CACM is included as well.

As a further step, I have preprocessed the whole dataset and obtained the document-term matrix in various forms. You can download the matrices and related files here (7.4MB RAR file). The files are explained below:

  • docbyterm.mat: Term frequencies only (in Cluto’s MAT file format)
  • docbyterm.tfidf.mat: Weighted with TFIDF scheme (in Cluto’s MAT file format)
  • docbyterm.tfidf.norm.mat: Weighted with TFIDF scheme and normalized to 1 (in Cluto’s MAT file format)
  • docbyterm.txt: Term frequencies only (in Coordinate file format)
  • docbyterm.tfidf.txt: Weighted with TFIDF scheme (in Coordinate file format)
  • docbyterm.tfidf.norm.txt: Weighted with TFIDF scheme and normalized to 1 (in Coordinate file format)
  • documents.txt: List of the document names as they appear in the data matrix
  • terms.txt: List of terms that appear in the data matrix
  • terms_detailed.txt: A detailed list of terms (i.e., term id, term, and the number of documents the term appears in)

As you can see, the preprocessing results are in two simple and well-known text file formats: the Coordinate file format and Cluto’s MAT file format. In addition, term frequency, TFIDF, and normalized TFIDF weighting schemes are available. Terms are single words; that is, there are no n-grams. The minimum term length is 3, each term appears in at least 3 documents and in at most 95% of the documents, and Porter stemming is applied during preprocessing.
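
If you want to reproduce a similar document-term matrix from the raw documents, a minimal sketch along these lines may help. It uses scikit-learn and NLTK rather than the tool that produced the downloadable files, so the result will only be approximately the same; the parameter names (min_df, max_df, norm) are scikit-learn’s, and the folder name and file encoding in the example are assumptions.

  # Sketch: build a row-normalized TFIDF matrix with settings similar to those
  # described above (single words, length >= 3, Porter stemming, minimum
  # document frequency 3, maximum document frequency 95%). Not the original tool.
  import glob
  import re
  from nltk.stem import PorterStemmer
  from sklearn.feature_extraction.text import TfidfVectorizer

  stemmer = PorterStemmer()

  def tokenize(text):
      # Letters only, lowercased, length >= 3, then Porter stemming.
      tokens = re.findall(r"[a-z]+", text.lower())
      return [stemmer.stem(t) for t in tokens if len(t) >= 3]

  vectorizer = TfidfVectorizer(
      tokenizer=tokenize,
      lowercase=False,   # tokenize() already lowercases
      min_df=3,          # a term must appear in at least 3 documents
      max_df=0.95,       # ...and in at most 95% of the documents
      norm="l2",         # unit-length rows, like docbyterm.tfidf.norm.*
  )

  paths = sorted(glob.glob("classic_docs/*"))   # hypothetical folder of the individual documents
  docs = []
  for path in paths:
      with open(path, encoding="latin-1") as fh:   # encoding is an assumption
          docs.append(fh.read())

  X = vectorizer.fit_transform(docs)   # sparse document-by-term matrix
  print(X.shape, X.nnz)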

If you find these documents and data matrix files useful, please let me know in the comments. Any questions or criticism are also welcome.

11 Responses to “Classic3 and Classic4 DataSets”

  1. Cedric says:

    I have downloaded the preprocessed Classic dataset for TFIDF, but how do I use it? Which part is the training data and which is the test data? The data in docbyterm.tfidf.norm.txt looks like this:

    7095 5896 247158
    1 2 0.3296250911410358
    1 5 0.33140756275194266
    .
    .
    What do these figures represent?

  2. Cedric,

    This is the whole dataset; there is no separate training or test split.

    The figures on the first line mean:
    * There are 7095 documents,
    * 5896 terms,
    * and 247158 non-zero values in total within the document-term matrix.

    The following lines are the actual non-zero values, in the format [Document No] [Term No] [Weight].

    I hope my explanation makes it clear for you. This is a common format, which is why I did not see a need to explain it.
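
    If it helps, here is a minimal sketch of loading such a coordinate file into a sparse matrix with SciPy (it assumes the document and term numbers in the file are 1-based, as the sample above suggests):

    # Sketch: read a coordinate-format file (header: n_docs n_terms n_nonzeros,
    # then one "doc term weight" triple per line) into a SciPy sparse matrix.
    from scipy.sparse import coo_matrix

    with open("docbyterm.tfidf.norm.txt") as f:
        n_docs, n_terms, n_nonzeros = map(int, f.readline().split())
        rows, cols, vals = [], [], []
        for line in f:
            d, t, w = line.split()
            rows.append(int(d) - 1)   # assuming 1-based document numbers
            cols.append(int(t) - 1)   # assuming 1-based term numbers
            vals.append(float(w))

    X = coo_matrix((vals, (rows, cols)), shape=(n_docs, n_terms)).tocsr()
    assert X.nnz == n_nonzeros        # 247158 for this file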

  3. Anonymouse says:

    Hello Volkan, is there any chance you could share the method used to split the files (source code)?

  4. I think you should see my post on “awk”, where I give the awk code to split the Classic collection into single documents: http://www.dataminingresearch.com/index.php/2010/10/absolute-beginners-first-awk-program/

    If it is not what you want, please let me know and please be specific.
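
    For readers who prefer Python over awk, a rough sketch of the same split might look like this. It assumes the usual SMART layout in which each document begins with a line like “.I 1”, with the title and body following under markers such as .T and .W; the file name in the example is only illustrative.

    # Sketch: split a concatenated SMART collection file into one file per
    # document, assuming each document starts with a line like ".I 1".
    import os

    def split_collection(path, outdir, prefix):
        os.makedirs(outdir, exist_ok=True)
        doc_id, buffers = None, {}
        with open(path) as f:
            for line in f:
                if line.startswith(".I "):        # a new document begins
                    doc_id = line.split()[1]
                    buffers[doc_id] = []
                elif doc_id is not None:
                    buffers[doc_id].append(line)  # keep .T/.A/.B/.W sections as-is
        for doc_id, lines in buffers.items():
            with open(os.path.join(outdir, prefix + doc_id + ".txt"), "w") as out:
                out.writelines(lines)

    split_collection("cran.all.1400", "cran_docs", "cran.")  # file name is just an example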

  5. newbi says:

    Hi Volkan.
    In some papers that use Classic4 as a dataset, the authors describe classes inside the dataset. For example, in Med and Cisi there are 4 classes, and the papers also show the document count of each class. My question is how to determine the classes and the other statistics of a dataset, because I did not see any information about them inside the Classic4 dataset.
    Thank You.

  6. For benchmark datasets, we already know the class distribution. For example, the Classic4 dataset is composed of documents from 4 classes:

    CACM: 3204 documents
    CISI: 1460 documents
    CRAN: 1398 documents
    MED: 1033 documents

    Is your question something different?

  7. newbi says:

    Sorry, I did not explain it clearly last time.
    Some papers explain that, for their experiments, they prepared a kind of sub-classes from each class. For example, take CACM; they report statistics like these (only an example):
    Num of Doc : 500
    Num of class : 4
    Min class size : 80
    Max class size : 120
    Avg Doc Length : 100
    Etc.
    Could you please explain where those statistics come from? Are they provided in Classic4, or decided by ourselves?

    Thank you.

  8. mehrdad says:

    Hi,
    I’m using the processed Classic dataset in my paper. I’d like to know how I can cite you.
    Regards,
    Mehrdad

  9. Mehrdad,

    I usually cite the Classic datasets as follows:

    Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: ftp://ftp.cs.cornell.edu/pub/smart

    If you need to cite me for the preprocessed dataset files, I think you may cite them similarly:

    Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets

    Thanks.

  10. Amara says:

    As I understand it, these datasets are IR datasets. They have their own query sets as well (so you can match or rank documents for each query, similar to document retrieval). Are the query sets included in your preprocessed collection? If not, can you please share the software/technique you used to preprocess the documents, so that the query set can be processed the same way?

  11. Amara,

    As far as I know, there is no query set for these datasets. The matrix files given here cover only the document collections, so they do not include any query-specific data.

    I use software tools that I have developed myself to preprocess the original text document collections. Currently I cannot share the software; I’m planning to share it after I complete my PhD thesis, possibly within 6 months or so. I’m sorry. I think you can find several text preprocessing tools on the web. Simply put, the software counts each word in each document; after getting the word frequencies, those frequencies are weighted according to the TFIDF scheme. That’s all. I hope this information helps you. Good luck.
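
    In case a sketch helps, the counting-then-weighting idea described above might look roughly like the Python below. The toy documents and the tf * log(N/df) formula are only illustrative; the exact weighting variant used for the downloadable matrices is not documented here.

    # Sketch: raw term counting followed by a common TFIDF weighting, tf * log(N / df).
    # The documents below are toy examples, not part of the Classic collection.
    import math
    from collections import Counter

    docs = [
        "human machine interface for lab abc computer applications",
        "a survey of user opinion of computer system response time",
        "the generation of random binary unordered trees",
    ]

    counts = [Counter(doc.split()) for doc in docs]       # term frequency per document
    df = Counter(term for c in counts for term in c)      # document frequency per term
    N = len(docs)

    tfidf = [{term: tf * math.log(N / df[term]) for term, tf in c.items()}
             for c in counts]
    print(tfidf[0])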
