
Python - What do the categories in the NLTK Reuters corpus mean?

Question:

I ran into a problem while doing text topic classification.

I am using the data from the NLTK "reuters" corpus.

However, when I call reuters.categories(), the result is

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']

I don't know what most of these mean. Where can I find explanations?

Answer:

Information about the Reuters corpus in NLTK corpus API:

  • The Reuters-21578 "ApteMod" corpus is built for text classification.

  • ApteMod is a collection of 10,788 documents from the Reuters financial newswire service.

  • In the ApteMod corpus, each document belongs to one or more categories. There are 90 categories in the corpus.

The mapping from file IDs to categories can be found in ~/nltk_data/corpora/reuters/cats.txt:

from os.path import expanduser, join
from collections import defaultdict
from nltk.corpus import reuters

# Assumes the corpus sits unpacked in the default ~/nltk_data location.
cats_path = join(expanduser("~"), "nltk_data", "corpora", "reuters", "cats.txt")

# Map each file ID to its list of categories.
id2cat = defaultdict(list)
with open(cats_path) as f:
    for line in f:
        fid, _, cats = line.partition(' ')
        id2cat[fid] = cats.split()

# Print every sentence next to the categories of its document.
for fileid in reuters.fileids():
    for sent in reuters.sents(fileid):
        print(id2cat[fileid], sent)

[out]:

['trade'] ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
...
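The same mapping can also be turned around to list the documents under each category, which helps when judging what a label covers. The following is only a sketch: the sample lines are hypothetical and merely imitate the "fileid cat1 cat2 ..." layout that the code above assumes for cats.txt, and parse_cats and invert are illustrative helper names, not part of NLTK.

```python
from collections import defaultdict

# Hypothetical sample lines (not real corpus entries) in the
# "fileid cat1 cat2 ..." layout assumed for cats.txt.
SAMPLE_CATS = """\
test/00001 trade
test/00002 grain wheat
training/00003 grain
"""

def parse_cats(lines):
    """Build a dict mapping each file ID to its list of categories."""
    id2cat = {}
    for line in lines:
        fid, _, cats = line.partition(' ')
        if fid.strip():  # skip blank lines
            id2cat[fid] = cats.split()
    return id2cat

def invert(id2cat):
    """Build a dict mapping each category to the file IDs that carry it."""
    cat2ids = defaultdict(list)
    for fid, cats in id2cat.items():
        for cat in cats:
            cat2ids[cat].append(fid)
    return dict(cat2ids)

id2cat = parse_cats(SAMPLE_CATS.splitlines())
cat2ids = invert(id2cat)

# Show each category with its document count and file IDs.
for cat in sorted(cat2ids):
    print(cat, len(cat2ids[cat]), cat2ids[cat])
```

To run this against the real corpus, pass the open cats.txt file object to parse_cats instead of the sample string.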

Information about the categories is in the file ~/nltk_data/corpora/reuters/README:

  The Reuters-21578 benchmark corpus, ApteMod version

This is a publically available version of the well-known Reuters-21578 "ApteMod" corpus for text categorization. It has been used in publications like these:

  • Yiming Yang and X. Liu. "A re-examination of text categorization
    methods". 1999. Proceedings of 22nd Annual International SIGIR.
    http://citeseer.nj.nec.com/yang99reexamination.html

  • Thorsten Joachims. "Text categorization with support vector
    machines: learning with many relevant features". 1998. Proceedings
    of ECML-98, 10th European Conference on Machine Learning.
    http://citeseer.nj.nec.com/joachims98text.html

ApteMod is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7769 documents and a test set with 3019 documents. The total size of the corpus is about 43 MB. It is also available for download from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html , which includes a more extensive history of the data revisions.

The distribution of categories in the ApteMod corpus is highly skewed, with 36.7% of the documents in the most common category, and only 0.0185% (2 documents) in each of the five least common categories. In fact, the original data source is even more skewed---in creating the corpus, any categories that did not contain at least one document in the training set and one document in the test set were removed from the corpus by its original creator.

In the ApteMod corpus, each document belongs to one or more categories. There are 90 categories in the corpus. The average number of categories per document is 1.235, and the average number of documents per category is about 148, or 1.37% of the corpus.

-Ken Williams

     Copyright & Notification 

(extracted from the README at the UCI address above)

The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data for research purposes only.

If you publish results based on this data set, please acknowledge its use, refer to the data set by the name "Reuters-21578, Distribution 1.0", and inform your readers of the current location of the data set (see "Availability & Questions").
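The figures quoted in the README can be cross-checked with a little arithmetic. This sketch uses only the numbers stated above; the variable names are illustrative, and the derived values are rounded to the precision the README uses.

```python
# Figures quoted in the README above.
train_docs = 7769
test_docs = 3019
total_docs = 10788
num_categories = 90
avg_cats_per_doc = 1.235

# The training and test sets together make up the whole corpus.
assert train_docs + test_docs == total_docs

# Total (document, category) assignments, spread over the 90 categories.
total_labels = total_docs * avg_cats_per_doc
avg_docs_per_cat = total_labels / num_categories

# Average category size as a percentage of the corpus.
share_of_corpus = avg_docs_per_cat / total_docs * 100

# Percentage held by a category with only 2 documents.
rarest_share = 2 / total_docs * 100

print(round(avg_docs_per_cat))    # 148
print(round(share_of_corpus, 2))  # 1.37
print(round(rarest_share, 4))     # 0.0185
```

The results agree with the README: about 148 documents per category (1.37% of the corpus), and 0.0185% for each of the two-document categories.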

Answer:

Thanks, alvas, for summing it up so nicely; this will help other people too. I also found a version of the Reuters dataset that has relatively fewer categories. This article here explains it better.
