scala - Random Forest Analysis

Question:

I have a Spark (1.5.2) DataFrame and a trained RandomForestClassificationModel. I can easily run the model on the data and get a prediction, but I want to analyze more deeply which edge values most often lead to each outcome of the binary classification.

In the past I did something similar with RDDs, tracking feature usage by calculating the prediction on my own. In the code below I track the list of features used in calculating the prediction. DataFrames don't seem to be quite as straightforward as RDDs in this regard.

def predict(node: Node, features: Vector, path_in: Array[Int]): (Double, Double, Array[Int]) = {
    if (node.isLeaf) {
        (node.predict.predict, node.predict.prob, path_in)
    } else {
        //track our path through the tree
        val path = path_in :+ node.split.get.feature
        if (node.split.get.featureType == FeatureType.Continuous) {
            if (features(node.split.get.feature) <= node.split.get.threshold) {
                predict(node.leftNode.get, features, path)
            } else {
                predict(node.rightNode.get, features, path)
            }
        } else {
            if (node.split.get.categories.contains(features(node.split.get.feature))) {
                predict(node.leftNode.get, features, path)
            } else {
                predict(node.rightNode.get, features, path)
            }
        }
    }
}

I'd like to do something similar to this code, but instead for each feature vector I return a list of all feature/edge value pairs. Note, in my data set all features are categorical, and bin settings were used appropriately when building the forest.

Answer:

I ended up building a custom udf to do this:

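import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.tree.{CategoricalSplit, InternalNode, LeafNode, Node}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

//Base prediction method. Accepts a Random Forest model and a feature Vector.
//  Returns the mean of the per-tree predictions plus an Array with one entry per tree:
//  the prediction, the impurity, the feature used on the final edge, and the feature value.
def predicForest(m: RandomForestClassificationModel, point: Vector): (Double, Array[(Double, Double, (Int, Double))]) = {
    val results = m.trees.map(t => predict(t.rootNode, point))

    (results.map(_._1).sum / results.length, results)
}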

def predict(node: Node, features: Vector): (Double, Double, (Int, Double)) = node match {
    case internalNode: InternalNode =>
      internalNode.split match {
        case split: CategoricalSplit =>
          val featureValue = features(split.featureIndex)
          //Values listed in leftCategories follow the left edge; all others go right
          val child =
            if (split.leftCategories.contains(featureValue)) internalNode.leftChild
            else internalNode.rightChild
          child match {
            //Final edge reached: report the leaf's own prediction and impurity
            case leaf: LeafNode =>
              (leaf.prediction, leaf.impurity, (split.featureIndex, featureValue))
            //Otherwise keep walking down the tree
            case _ => predict(child, features)
          }
        case _ =>
          //If we run into an unimplemented split type we just return
          (node.prediction, node.impurity, (-1, -1.0))
      }
    case _ =>
      //If we run into an unimplemented node type (or the root is a leaf) we just return
      (node.prediction, node.impurity, (-1, -1.0))
}
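
The function above records only the final edge taken in each tree. If every feature/value pair along the path is wanted, as the question asks, the same walk can accumulate the pairs as it descends. Here is a minimal sketch of that idea. It uses hypothetical stand-in types (ToyNode, ToyLeaf, ToyInternal) and a plain Seq[Double] for the features so that it runs without Spark; these names are illustrative assumptions, not part of Spark's API.

```scala
// Hypothetical stand-ins for a categorical decision tree (not Spark classes)
sealed trait ToyNode
final case class ToyLeaf(prediction: Double) extends ToyNode
final case class ToyInternal(
    featureIndex: Int,
    leftCategories: Set[Double],
    left: ToyNode,
    right: ToyNode) extends ToyNode

// Walk from the root to a leaf, appending one (featureIndex, featureValue)
// pair for every edge that is followed
def predictWithPath(
    node: ToyNode,
    features: Seq[Double],
    path: List[(Int, Double)] = Nil): (Double, List[(Int, Double)]) =
  node match {
    case ToyLeaf(prediction) => (prediction, path)
    case ToyInternal(idx, leftCats, left, right) =>
      val value = features(idx)
      val next = if (leftCats.contains(value)) left else right
      predictWithPath(next, features, path :+ (idx -> value))
  }

// A small example tree: split on feature 0, then on feature 2
val tree = ToyInternal(0, Set(0.0, 1.0),
  ToyInternal(2, Set(3.0), ToyLeaf(1.0), ToyLeaf(0.0)),
  ToyLeaf(0.0))

val (prediction, path) = predictWithPath(tree, Seq(1.0, 7.0, 3.0))
```

Porting this to the Spark version would presumably mean adding a path accumulator parameter to predict and widening its return type accordingly.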

val rfModel = yourInstanceOfRandomForestClassificationModel

//This custom UDF executes the Random Forest Classification in a trackable way
def treeAnalyzer(m:RandomForestClassificationModel) = udf((x:Vector) =>
  predicForest(m,x))

//Apply the UDF: this runs the Random Forest classification on each row and stores the per-tree results in a new column named `prediction`
val df3 = testData.withColumn("prediction", treeAnalyzer(rfModel)(testData("indexedFeatures")))
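
Once the per-tree results are available, finding the most common edge values for each class comes down to a grouped count. The sketch below shows one way this could be done with plain Scala collections. The perTreeResults data is made up for illustration and stands in for (prediction, (featureIndex, featureValue)) entries pulled out of the `prediction` column; how those entries are extracted from the DataFrame is left out here.

```scala
// Hypothetical per-tree results: (treePrediction, (featureIndex, featureValue))
val perTreeResults: Seq[(Double, (Int, Double))] = Seq(
  (1.0, (0, 2.0)),
  (1.0, (0, 2.0)),
  (0.0, (3, 1.0)),
  (1.0, (4, 0.0)),
  (0.0, (3, 1.0)),
  (0.0, (0, 2.0))
)

// For each predicted class, count how often each (feature, value) edge occurs
val edgeCounts: Map[Double, Map[(Int, Double), Int]] =
  perTreeResults
    .groupBy { case (prediction, _) => prediction }
    .map { case (prediction, rows) =>
      prediction -> rows.groupBy(_._2).map { case (edge, hits) => edge -> hits.size }
    }

// Pick the most frequent edge per class, together with its count
val topEdges: Map[Double, ((Int, Double), Int)] =
  edgeCounts.map { case (prediction, counts) => prediction -> counts.maxBy(_._2) }
```

On a large data set the same grouping would more sensibly be done in Spark itself rather than on collected rows.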