
machine learning - simply changing the CSV header in Apache Mahout classifier input produces a different model?

Problem description:

I'm trying to go through the Mahout classifier example (donut.csv). But I found that simply changing the name of some columns in the header row, and changing the corresponding predictor variable name in the classifier command, leads to a different model. This does not make sense.

First, you obtain donut0.csv from the bundled donut.csv by:

mahout cat donut.csv |tail -40 > donut0.csv

(the tail is needed because mahout cat prints some initial info lines)

Then we train on donut0.csv with the following command (as suggested in the "Mahout in Action" book):

mahout trainlogistic --input donut0.csv \
    --output ./model \
    --target color --categories 2 \
    --predictors x y a b c --types numeric \
    --features 20 --passes 100 --rate 50

It gives the following output:

color ~ 7.068*Intercept Term + 0.581*a + -1.369*b + -25.059*c + 0.581*x + 2.319*y

Intercept Term 7.06759

a 0.58123

b -1.36893

c -25.05945

x 0.58123

y 2.31879

0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -1.368933989 0.000000000 0.000000000 0.000000000 0.000000000 0.581234210 0.000000000 0.000000000 7.067587159 0.000000000 0.000000000 0.000000000 2.318786209 0.000000000 -25.059452292

12/04/27 09:29:21 INFO driver.MahoutDriver: Program took 789 ms (Minutes: 0.01315)

But if I simply change the column "x" in the header to "xa", and the corresponding predictor name in the command, the output model changes completely.

$ head -3 donut4.csv

xa,y,shape,color,k,k0,xx,xy,yy,a,b,c,bias

0.923307513352484,0.0135197141207755,21,20,4,8,0.852496764213146,0.0124828536260896,0.000182782669907495,0.923406490600458,0.0778750292332978,0.644866125183976,1

0.711011884035543,0.909141522599384,22,20,3,9,0.505537899239772,0.64641042683833,0.826538308114327,1.15415605849213,0.953966686673604,0.46035073663368,1

mahout trainlogistic --input donut4.csv \
    --output ./model \
    --target color --categories 2 \
    --predictors xa y a b c --types numeric \
    --features 20 --passes 100 --rate 50

color ~ 6.380*Intercept Term + -1.913*a + -0.577*b + -23.236*c + 2.647*xa + 3.009*y

Intercept Term 6.38017

a -1.91308

b -0.57676

c -23.23552

xa 2.64657

y 3.00925

0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -0.576759549 0.000000000 0.000000000 2.646572912 0.000000000 -1.913075634 0.000000000 0.000000000 6.380173126 0.000000000 0.000000000 0.000000000 3.009245162 0.000000000 -23.235521029

12/04/27 10:21:10 INFO driver.MahoutDriver: Program took 728 ms (Minutes: 0.012133333333333333)

I have not verified the new model; maybe it also fits the data. But simply changing a name should have NO effect on how the algorithm works, right?

thanks

Yang

Answer:

I doubt it was the header change. I would much more readily expect it is due to different random values chosen by the algorithm. Try two runs with no changes at all to see if anything differs.

Answer:

It has to do with feature hashing. The feature names are used to determine where in the feature vector the weights are placed.
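The placement can be illustrated with a simplified Python sketch. This is not Mahout's actual implementation (Mahout hashes with MurmurHash inside hashForProbe); the MD5-based feature_index function here is just a stand-in to show how a feature's NAME determines which slot of the weight vector it occupies:

```python
# Conceptual sketch of feature hashing -- NOT Mahout's real code.
# The feature NAME is hashed to pick a slot in a fixed-size weight vector,
# so renaming a feature moves its weight to a different slot.
import hashlib

NUM_FEATURES = 20  # corresponds to --features 20


def feature_index(name: str, num_features: int = NUM_FEATURES) -> int:
    """Map a feature name deterministically to a slot in the weight vector."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_features


# Renaming "x" to "xa" changes its hash, hence the slot its weight lands in:
for name in ["x", "xa", "y", "a", "b", "c"]:
    print(name, "->", feature_index(name))
```

Because the slot depends only on the hash of the name, two different names can land on the same slot (a collision), and renaming a feature can create or remove such a collision.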

In the 20 Newsgroups example the feature vector is built in the org.apache.mahout.classifier.sgd.TrainNewsGroups class. The call

Vector v = helper.encodeFeatureVector(file, actual, leakType, overallCounts);

is what actually creates the feature vector.

It is using 'feature hashing', such that multiple features can be "hashed" into the same index in the vector. The actual feature hashing occurs in the NewsgroupHelper class using the following encoder:

    private final FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");

You pass in 20 features (via the --features 20 command-line argument), but only use 5 predictors (--predictors xa y a b c).

Looking back at the NewsgroupHelper code, the encoder.addToVector(word, Math.log1p(words.count(word)), v); call is what adds each word to the vector. What appears to be happening is that with the 'x' feature, the hash index does not collide with those of the other features; but when you use the 'xa' feature, its hash collides with that of another name, and the weights are added into the same position of the feature vector.

If you look at the org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder encoding method, it uses int n = hashForProbe(originalForm, data.size(), name, i); to calculate the feature index: originalForm is the name of the feature, data.size() is the size of the feature vector, name is the constant name of the encoder, and i is the "probe number", which varies across probes.
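The probe mechanism can be sketched like this (a simplified Python analogue; the encoder name "demo", the MD5-based hash, and the helper names are illustrative, not Mahout's real hashForProbe):

```python
# Simplified analogue of probe-based feature hashing -- NOT Mahout's real code.
import hashlib

NUM_FEATURES = 20  # vector size, as set by --features 20


def hash_for_probe(original_form: str, data_size: int,
                   encoder_name: str, probe: int) -> int:
    # Mix the feature name, encoder name and probe number into one hash,
    # then reduce modulo the vector size.
    key = f"{encoder_name}:{original_form}:{probe}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % data_size


def indices(name: str, probes: int = 2) -> set:
    # With more probes a feature is spread over more slots, which lowers
    # the chance that two features collide on ALL of their slots.
    return {hash_for_probe(name, NUM_FEATURES, "demo", i)
            for i in range(probes)}


print("x :", sorted(indices("x")))
print("xa:", sorted(indices("xa")))
```

Each extra probe gives a feature another slot, so even if two names share one slot, their other slots usually differ and the weights remain distinguishable.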

TL;DR The 'x' and 'xa' names collide in the feature hashing, and you are not looping over enough probes to find a set of encoded vectors where you do not have a collision.
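Such a collision is not unlikely: with 6 names (the intercept plus 5 predictors) hashed into 20 slots, a back-of-envelope birthday-problem estimate (assuming an idealized uniform hash, which Mahout's is not exactly) gives better-than-even odds of at least one collision per probe:

```python
# Birthday-problem estimate: probability that 6 feature names hashed
# uniformly into 20 slots produce at least one collision.
def collision_probability(n_features: int, n_slots: int) -> float:
    p_no_collision = 1.0
    for i in range(n_features):
        p_no_collision *= (n_slots - i) / n_slots
    return 1.0 - p_no_collision


print(round(collision_probability(6, 20), 3))  # → 0.564
```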
