I have a question about using cross-validation for text classification in sklearn. Vectorizing all the data before cross-validation is problematic, because the classifier would have "seen" the vocabulary that occurs in the test data. Weka has FilteredClassifier to solve this problem. What is the sklearn equivalent? That is, for each fold the feature set should be different, because the training data are different.
The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:
>>> # sklearn.cross_validation was removed in 0.20; cross_val_score now lives in model_selection
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])
clf is now a composite estimator that does feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of strings) documents and their labels y, calling
>>> cross_val_score(clf, documents, y)
will do feature extraction in each fold separately, so that each of the SVMs knows only the vocabulary of its own training set (the other k-1 folds).
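To check that the vocabulary really is rebuilt per fold, here is a minimal self-contained sketch. The toy corpus, the labels and the choice of 4 stratified folds are made up for illustration; it uses cross_validate with return_estimator=True (available in scikit-learn 0.20 and later) instead of cross_val_score, only so that the fitted pipelines can be inspected afterwards.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for this example: 1 = sports, 0 = cooking.
documents = [
    "the striker scored a late goal",
    "the goalkeeper saved a penalty",
    "the team won the league match",
    "the coach praised the defence",
    "simmer the sauce with garlic",
    "bake the bread until golden",
    "whisk the eggs and sugar",
    "season the soup with pepper",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Same composite estimator as in the answer.
clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])

# return_estimator=True keeps the pipeline fitted on each training split.
cv = StratifiedKFold(n_splits=4)
results = cross_validate(clf, documents, y, cv=cv, return_estimator=True)

# Vocabulary a vectorizer would learn if fitted on ALL documents up front
# (the leaky setup described in the question).
full_vocab = set(TfidfVectorizer().fit(documents).vocabulary_)

# Vocabulary each fold's vectorizer actually learned from its training split.
fold_vocabs = [
    set(est.named_steps["vect"].vocabulary_) for est in results["estimator"]
]

for i, vocab in enumerate(fold_vocabs):
    unseen = len(full_vocab - vocab)
    print(f"fold {i}: {len(vocab)} training terms, {unseen} terms never seen")
```

Each fold's vectorizer should report fewer terms than the full vocabulary, because words that occur only in the held-out documents never reach it. The scores on such a tiny corpus are meaningless; only the vocabulary comparison matters here.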