<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for Data Mining Research</title>
	<atom:link href="http://www.dataminingresearch.com/index.php/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dataminingresearch.com</link>
	<description>Data Mining Research, Algorithms, Tools, News, More</description>
	<lastBuildDate>Thu, 06 Oct 2011 20:24:21 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Volkan TUNALI</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-321</link>
		<dc:creator>Volkan TUNALI</dc:creator>
		<pubDate>Thu, 06 Oct 2011 20:24:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-321</guid>
		<description>Amara, 

As far as I know, there is no query set for these datasets. The given matrix files cover only the document collections, so they do not include any query-specific data.

I use software tools that I developed myself to preprocess the original text document collections. Currently I cannot share the software. I&#039;m planning to share it after I complete my PhD thesis, possibly within 6 months or so. I&#039;m sorry. I think you can find several text preprocessing tools on the web. Put simply, the software counts each word in each document; once the word frequencies are obtained, they are weighted according to the TF-IDF scheme. That&#039;s all. I hope this information helps. Good luck.</description>
		<content:encoded><![CDATA[<p>Amara, </p>
<p>As far as I know, there is no query set for these datasets. The given matrix files cover only the document collections, so they do not include any query-specific data.</p>
<p>I use software tools that I developed myself to preprocess the original text document collections. Currently I cannot share the software. I&#8217;m planning to share it after I complete my PhD thesis, possibly within 6 months or so. I&#8217;m sorry. I think you can find several text preprocessing tools on the web. Put simply, the software counts each word in each document; once the word frequencies are obtained, they are weighted according to the TF-IDF scheme. That&#8217;s all. I hope this information helps. Good luck.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Amara</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-320</link>
		<dc:creator>Amara</dc:creator>
		<pubDate>Thu, 06 Oct 2011 15:33:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-320</guid>
		<description>As I understand it, these are IR datasets, and they have their own query sets as well (so you can match or rank documents for each query, as in document retrieval). Are the query sets included in your preprocessed collection? If not, could you please share the software or technique you used to preprocess the documents, so that the query sets can be processed the same way?</description>
		<content:encoded><![CDATA[<p>As I understand it, these are IR datasets, and they have their own query sets as well (so you can match or rank documents for each query, as in document retrieval). Are the query sets included in your preprocessed collection? If not, could you please share the software or technique you used to preprocess the documents, so that the query sets can be processed the same way?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Volkan TUNALI</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-313</link>
		<dc:creator>Volkan TUNALI</dc:creator>
		<pubDate>Thu, 04 Aug 2011 16:49:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-313</guid>
		<description>Mehrdad, 

I usually cite the Classic datasets as follows:

Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: ftp://ftp.cs.cornell.edu/pub/smart

If you need to cite me for the preprocessed dataset files, I think you can use a similar form:

Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets


Thanks.</description>
		<content:encoded><![CDATA[<p>Mehrdad, </p>
<p>I usually cite the Classic datasets as follows:</p>
<p>Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: <a href="ftp://ftp.cs.cornell.edu/pub/smart" rel="nofollow">ftp://ftp.cs.cornell.edu/pub/smart</a></p>
<p>If you need to cite me for the preprocessed dataset files, I think you can use a similar form:</p>
<p>Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: <a href="http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets" rel="nofollow">http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets</a></p>
<p>Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by mehrdad</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-312</link>
		<dc:creator>mehrdad</dc:creator>
		<pubDate>Thu, 04 Aug 2011 14:58:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-312</guid>
		<description>Hi,
I&#039;m using the processed Classic dataset in my paper. I&#039;d like to know how I can cite you.
Regards,
Mehrdad</description>
		<content:encoded><![CDATA[<p>Hi,<br />
I&#8217;m using the processed Classic dataset in my paper. I&#8217;d like to know how I can cite you.<br />
Regards,<br />
Mehrdad</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by newbi</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-259</link>
		<dc:creator>newbi</dc:creator>
		<pubDate>Fri, 13 May 2011 03:18:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-259</guid>
		<description>Sorry, I did not explain it clearly last time.
Some papers explain that, for their experiments, they prepared a kind of sub-class from each class. Take CACM as one example; they report statistics like these (illustrative values only):
Num of Doc  : 500
Num of class : 4
Min class size : 80
Max class size : 120
Avg Doc Length : 100
Etc.
Could you please explain where those statistics come from? Are they provided in Classic4, or do we decide them ourselves?

Thank you.</description>
		<content:encoded><![CDATA[<p>Sorry, I did not explain it clearly last time.<br />
Some papers explain that, for their experiments, they prepared a kind of sub-class from each class. Take CACM as one example; they report statistics like these (illustrative values only):<br />
Num of Doc  : 500<br />
Num of class : 4<br />
Min class size : 80<br />
Max class size : 120<br />
Avg Doc Length : 100<br />
Etc.<br />
Could you please explain where those statistics come from? Are they provided in Classic4, or do we decide them ourselves?</p>
<p>Thank you.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Volkan TUNALI</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-258</link>
		<dc:creator>Volkan TUNALI</dc:creator>
		<pubDate>Mon, 09 May 2011 08:54:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-258</guid>
		<description>For the benchmark datasets, we already know the class distribution of the dataset. For example, the Classic4 dataset is composed of documents from 4 classes:

CACM: 3204 documents
CISI: 1460 documents
CRAN: 1398 documents
MED: 1033 documents

Is your question something different?</description>
		<content:encoded><![CDATA[<p>For the benchmark datasets, we already know the class distribution of the dataset. For example, the Classic4 dataset is composed of documents from 4 classes:</p>
<p>CACM: 3204 documents<br />
CISI: 1460 documents<br />
CRAN: 1398 documents<br />
MED: 1033 documents</p>
<p>Is your question something different?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by newbi</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-257</link>
		<dc:creator>newbi</dc:creator>
		<pubDate>Mon, 09 May 2011 08:47:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-257</guid>
		<description>Hi Volkan.
Some papers that use Classic4 as a dataset describe the classes inside it. For example, in Med and Cisi there are 4 classes. The papers also show the document size of each class. My question is how to determine the classes and the other statistics of a dataset, because I didn&#039;t see any information about them inside the Classic4 datasets.
Thank You.</description>
		<content:encoded><![CDATA[<p>Hi Volkan.<br />
Some papers that use Classic4 as a dataset describe the classes inside it. For example, in Med and Cisi there are 4 classes. The papers also show the document size of each class. My question is how to determine the classes and the other statistics of a dataset, because I didn&#8217;t see any information about them inside the Classic4 datasets.<br />
Thank You.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Volkan TUNALI</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-248</link>
		<dc:creator>Volkan TUNALI</dc:creator>
		<pubDate>Sat, 02 Apr 2011 21:08:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-248</guid>
		<description>I think you should see my post on &quot;awk&quot;, where I give the awk code to split the Classic collection into single documents: http://www.dataminingresearch.com/index.php/2010/10/absolute-beginners-first-awk-program/

If it is not what you want, please let me know and please be specific.</description>
		<content:encoded><![CDATA[<p>I think you should see my post on &#8220;awk&#8221;, where I give the awk code to split the Classic collection into single documents: <a href="http://www.dataminingresearch.com/index.php/2010/10/absolute-beginners-first-awk-program/" rel="nofollow">http://www.dataminingresearch.com/index.php/2010/10/absolute-beginners-first-awk-program/</a></p>
<p>If it is not what you want, please let me know and please be specific.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Anonymouse</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-247</link>
		<dc:creator>Anonymouse</dc:creator>
		<pubDate>Sat, 02 Apr 2011 20:47:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-247</guid>
		<description>Hello Volkan, is there any chance you could share the method used to split the files (source code)?</description>
		<content:encoded><![CDATA[<p>Hello Volkan, is there any chance you could share the method used to split the files (source code)?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Classic3 and Classic4 DataSets by Volkan TUNALI</title>
		<link>http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/#comment-246</link>
		<dc:creator>Volkan TUNALI</dc:creator>
		<pubDate>Fri, 01 Apr 2011 07:38:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.dataminingresearch.com/?p=184#comment-246</guid>
		<description>Cedric,

This is the whole dataset; there is no separate data for training and testing.

The figures on the first line mean:
* There are 7095 documents,
* 5896 terms,
* and a total of 247158 non-zero values within the document-term matrix.

The following lines are the actual non-zero values, in the format [Document No] [Term No] [Weight].

I hope my explanation makes it clear for you. This is a common format, which is why I did not see a need to explain it.</description>
		<content:encoded><![CDATA[<p>Cedric,</p>
<p>This is the whole dataset; there is no separate data for training and testing.</p>
<p>The figures on the first line mean:<br />
* There are 7095 documents,<br />
* 5896 terms,<br />
* and a total of 247158 non-zero values within the document-term matrix.</p>
<p>The following lines are the actual non-zero values, in the format [Document No] [Term No] [Weight].</p>
<p>I hope my explanation makes it clear for you. This is a common format, which is why I did not see a need to explain it.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

