
scikit learn - sklearn: vectorizing in cross validation for text classification

Question:

I have a question about using cross-validation for text classification in sklearn. Vectorizing all the data before cross-validation is problematic, because the classifier would have "seen" the vocabulary that occurs in the test data. Weka has a FilteredClassifier to solve this problem. What is the sklearn equivalent? That is, the feature set should differ from fold to fold, because the training data differ.

Answer:

The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:

>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])

clf is now a composite estimator that does both feature extraction and SVM model fitting. Given a list of documents (an ordinary Python list of strings) named documents and their labels y, calling

>>> cross_val_score(clf, documents, y)

performs feature extraction separately in each fold, so each SVM sees only the vocabulary of the k-1 folds it was trained on.
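
As a rough illustration of the point above, the following sketch runs the same pipeline on a tiny made-up corpus and then refits the vectorizer by hand on each training split to show that the held-out documents contain words the fold never saw. The corpus, the labels, the choice of StratifiedKFold with 4 splits, and the helper names (cv, unseen_per_fold) are assumptions for the example, not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 4 "spam" (1) and 4 "work" (0) documents.
documents = [
    "cheap pills buy now",
    "win money fast",
    "limited offer click here",
    "free prize claim today",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch at noon tomorrow",
    "project deadline moved to friday",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])
cv = StratifiedKFold(n_splits=4)

# The vectorizer is refit inside every fold, on the training part only.
scores = cross_val_score(clf, documents, y, cv=cv)

# Reproduce the per-fold fitting by hand to inspect the vocabularies.
unseen_per_fold = []
for train_idx, test_idx in cv.split(documents, y):
    vect = TfidfVectorizer().fit([documents[i] for i in train_idx])
    analyzer = vect.build_analyzer()  # same tokenization as the vectorizer
    held_out_tokens = {tok for i in test_idx for tok in analyzer(documents[i])}
    # Tokens in the held-out fold that this fold's vocabulary lacks.
    unseen_per_fold.append(held_out_tokens - set(vect.vocabulary_))

for fold, unseen in enumerate(unseen_per_fold):
    print(fold, sorted(unseen))
```

Each fold reports a non-empty set of unseen tokens, which is the situation that vectorizing the whole corpus up front would hide. With so few documents the accuracy values in scores are not meaningful; the sketch is only meant to show where the fitting happens.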
