ICDM’10: The 10th IEEE International Conference on Data Mining

By Volkan TUNALI, September 20, 2010 5:42 pm

Sponsored by the IEEE Computer Society
December 14-17, 2010, Sydney, Australia
http://datamining.it.uts.edu.au/icdm10

Important Dates
*****************
Apr 14, 2010: Deadline for workshop proposals
May 06, 2010: Deadline for ICDM contest proposals
Jul 02, 2010: Deadline for full paper submissions
Jul 13, 2010: Deadline for demo and tutorial proposals
Jul 23, 2010: Deadline for paper submissions to the 18 ICDM workshops (extended to Aug 9)
Sep 17, 2010: Notification of acceptance of full papers
Sep 20, 2010: Notification of acceptance of workshop papers
Oct 11, 2010: Camera-ready copies and copyright forms due (11:59pm Hawaii time)

The IEEE International Conference on Data Mining (ICDM) has established itself as the world’s premier research conference in data mining. The 10th edition of ICDM (ICDM ’10) provides a leading forum for the presentation of original research results, as well as the exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications. In addition, ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high-quality research findings and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state of the art in data mining. Besides the technical program, the conference will feature invited talks by Christos Faloutsos (CMU), Geoff McLachlan (UQ), and Xindong Wu (UVM), as well as workshops, tutorials, panels, and the ICDM data mining contest.

Classic3 and Classic4 Datasets

By Volkan TUNALI, September 10, 2010 4:53 am

One well-known benchmark dataset used in text mining is the Classic collection, which can be obtained from ftp://ftp.cs.cornell.edu/pub/smart/. This dataset consists of 4 document collections: CACM, CISI, CRAN, and MED, each downloadable as a single file. To get the individual documents, some processing is needed to extract each title and document body. I have split all of these into individual documents and made them available for public download. You can freely download the whole collection (1.5MB RAR file).
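If you would rather do this splitting yourself, the sketch below shows one way the raw collection files could be parsed. It assumes the standard SMART field markers (.I starts a new document, .T a title, .W the body); the MED.ALL file name is only an example.

# Minimal sketch: split a SMART-format collection file into individual
# documents. Assumes the usual SMART markers: ".I <id>" starts a new
# document, ".T" introduces the title, ".W" the body text; other fields
# (".A" authors, ".B" bibliography, ".X" references) are ignored.
def parse_smart_file(path):
    docs = []  # list of (doc_id, title, body) tuples
    doc_id, field, parts = None, None, {"T": [], "W": []}

    def flush():
        if doc_id is not None:
            docs.append((doc_id,
                         " ".join(parts["T"]).strip(),
                         " ".join(parts["W"]).strip()))

    with open(path) as f:
        for line in f:
            if line.startswith(".I"):
                flush()
                doc_id = line.split()[1]
                field, parts = None, {"T": [], "W": []}
            elif line.startswith("."):
                field = line[1]          # field marker letter: T, W, A, B, ...
            elif field in parts:
                parts[field].append(line.strip())
    flush()
    return docs

# Example: write every MED document to its own text file
for doc_id, title, body in parse_smart_file("MED.ALL"):
    with open("med.%s.txt" % doc_id, "w") as out:
        out.write(title + "\n" + body + "\n")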

The composition of the collection is as follows:

  • CACM: 3204 documents
  • CISI: 1460 documents
  • CRAN: 1398 documents
  • MED: 1033 documents

This dataset is usually referred to as the Classic3 dataset when only CISI, CRAN, and MED are used, and as the Classic4 dataset when CACM is included as well.

As a further step, I have preprocessed the whole dataset and obtained the document-term matrix in various forms. You can download the matrices and related files here (7.4MB RAR file). The files are described below; a sketch for loading the MAT files into a sparse matrix follows the list:

  • docbyterm.mat: Term frequencies only (in Cluto’s MAT file format)
  • docbyterm.tfidf.mat: Weighted with TFIDF scheme (in Cluto’s MAT file format)
  • docbyterm.tfidf.norm.mat: Weighted with TFIDF scheme and normalized to 1 (in Cluto’s MAT file format)
  • docbyterm.txt: Term frequencies only (in Coordinate file format)
  • docbyterm.tfidf.txt: Weighted with TFIDF scheme (in Coordinate file format)
  • docbyterm.tfidf.norm.txt: Weighted with TFIDF scheme and normalized to 1 (in Coordinate file format)
  • documents.txt: List of the document names as they appear in the data matrix
  • terms.txt: List of terms that appear in the data matrix
  • terms_detailed.txt: A detailed list of terms (i.e., term ID, term, and the number of documents the term appears in)
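As a small illustration of the MAT format, here is how one of these files could be loaded into a SciPy sparse matrix. It assumes the sparse variant of Cluto’s MAT format: a header line giving the numbers of rows, columns, and nonzeros, followed by one line per document of 1-based column/value pairs.

# Minimal sketch: read a sparse Cluto MAT file into a CSR matrix.
from scipy.sparse import csr_matrix

def load_cluto_mat(path):
    with open(path) as f:
        nrows, ncols, nnz = map(int, f.readline().split())
        rows, cols, vals = [], [], []
        for i, line in enumerate(f):
            tokens = line.split()
            for j in range(0, len(tokens), 2):
                rows.append(i)
                cols.append(int(tokens[j]) - 1)   # Cluto columns are 1-based
                vals.append(float(tokens[j + 1]))
    assert len(vals) == nnz, "header nonzero count does not match the data"
    return csr_matrix((vals, (rows, cols)), shape=(nrows, ncols))

# Example usage (7095 rows = 3204 + 1460 + 1398 + 1033 documents):
# X = load_cluto_mat("docbyterm.tfidf.norm.mat")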

As the list shows, the preprocessing results come in two simple and well-known text file formats: the Coordinate file format and Cluto’s MAT file format. Term frequency, TFIDF, and normalized TFIDF weighting schemes are available. Terms are single words (no n-grams) with a minimum length of 3 characters. Each term appears in at least 3 documents and in at most 95% of the documents. Porter’s stemmer was applied during preprocessing; a rough scikit-learn equivalent of this pipeline is sketched below.
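The following sketch approximates that pipeline for anyone who wants to reproduce it from the raw documents. It is an approximation rather than the exact tool chain used here: scikit-learn’s TFIDF (smoothed IDF by default) may differ slightly from the weighting in the files above.

# Rough scikit-learn sketch of the preprocessing pipeline described above.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
# Tokenizer for single-word terms of at least 3 characters
base_analyzer = TfidfVectorizer(token_pattern=r"(?u)\b\w{3,}\b").build_analyzer()

def stemmed_analyzer(doc):
    # Apply Porter's stemmer on top of the base tokenization
    return [stemmer.stem(t) for t in base_analyzer(doc)]

vectorizer = TfidfVectorizer(
    analyzer=stemmed_analyzer,
    min_df=3,      # a term must appear in at least 3 documents...
    max_df=0.95,   # ...and in at most 95% of the documents
    norm="l2",     # normalize each document vector to unit length
)
# X = vectorizer.fit_transform(documents)  # documents: list of raw strings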

If you find these documents and data matrix files useful, please let me know in the comments. Any questions or criticism are also welcome.

Fundamentals of Predictive Text Mining

By Volkan TUNALI, September 4, 2010 11:09 pm

Fundamentals of Predictive Text Mining is a new book I have recently found about text mining. It explains the essentials of text mining very well, with very good examples, so I strongly recommend it to newcomers to the field. Although the book’s focus is predictive text mining, its content is broad enough to cover topics such as text clustering, information retrieval, and information extraction.

The book also contains several case studies that work through solutions to real-life problems.

Authors: Sholom M. Weiss, Nitin Indurkhya, Tong Zhang
ISBN: 9781849962254
Publisher: Springer

Introduction to Information Retrieval

By Volkan TUNALI, September 1, 2010 11:36 pm

I will introduce a book I find very useful: Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, from Cambridge University Press (ISBN: 0521865719).

The book provides a modern approach to information retrieval from a computer science perspective. It is based on a course the authors have been teaching in various forms at Stanford University and at the University of Stuttgart.

It contains the following chapters:

  1. Boolean retrieval
  2. The term vocabulary & postings lists
  3. Dictionaries and tolerant retrieval
  4. Index construction
  5. Index compression
  6. Scoring, term weighting & the vector space model
  7. Computing scores in a complete search system
  8. Evaluation in information retrieval
  9. Relevance feedback & query expansion
  10. XML retrieval
  11. Probabilistic information retrieval
  12. Language models for information retrieval
  13. Text classification & Naive Bayes
  14. Vector space classification
  15. Support vector machines & machine learning on documents
  16. Flat clustering
  17. Hierarchical clustering
  18. Matrix decompositions & latent semantic indexing
  19. Web search basics
  20. Web crawling and indexes
  21. Link analysis

The book is freely available for download in PDF format at http://nlp.stanford.edu/IR-book/information-retrieval-book.html (as a whole or as individual chapters).
