I am trying to implement a simple demo of a decision tree classifier using Java and Apache Spark 1.0.0. I am following http://spark.apache.org/docs/1.0.0/mllib-decision-tree.html . So far I have written the code below.
The following line gives me a compile error:
org.apache.spark.mllib.tree.impurity.Impurity impurity = new org.apache.spark.mllib.tree.impurity.Entropy();
Type mismatch: cannot convert from Entropy to Impurity. This is strange to me, since the Entropy class implements the Impurity interface:
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/mllib/tree/impurity/Entropy.html
My question is: why does this conversion fail, and how can I get this working?
package decisionTree;

import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.configuration.Strategy;
import org.apache.spark.mllib.tree.impurity.Gini;
import org.apache.spark.mllib.tree.impurity.Impurity;

import scala.Enumeration.Value;

public final class DecisionTreeDemo {

    static class ParsePoint implements Function<String, LabeledPoint> {
        private static final Pattern COMMA = Pattern.compile(",");
        private static final Pattern SPACE = Pattern.compile(" ");

        @Override
        public LabeledPoint call(String line) {
            String[] parts = COMMA.split(line);
            double y = Double.parseDouble(parts[0]);
            String[] tok = SPACE.split(parts[1]);
            double[] x = new double[tok.length];
            for (int i = 0; i < tok.length; ++i) {
                x[i] = Double.parseDouble(tok[i]);
            }
            return new LabeledPoint(y, Vectors.dense(x));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: DecisionTreeDemo <file>");
            System.exit(1);
        }

        JavaSparkContext ctx = new JavaSparkContext("local[4]", "Log Analyzer",
                System.getenv("SPARK_HOME"),
                JavaSparkContext.jarOfClass(DecisionTreeDemo.class));

        JavaRDD<String> lines = ctx.textFile(args[0]);
        JavaRDD<LabeledPoint> points = lines.map(new ParsePoint()).cache();

        int iterations = 100;
        int maxBins = 2;
        int maxMemory = 512;
        int maxDepth = 1;

        // This is the line that produces the compile error:
        org.apache.spark.mllib.tree.impurity.Impurity impurity =
                new org.apache.spark.mllib.tree.impurity.Entropy();

        Strategy strategy = new Strategy(Algo.Classification(), impurity,
                maxDepth, maxBins, null, null, maxMemory);

        ctx.stop();
    }
}
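My guess at the root cause (I may be wrong): in the Spark source, Entropy is declared as a Scala object, not a class, so it compiles to a JVM class with a private constructor and a static MODULE$ singleton field, which is why `new Entropy()` is rejected from Java. Roughly like this plain-Java sketch (the names Impurity/Entropy here are illustrative stand-ins, not the real Spark classes):

```java
// Illustrative sketch only: a plain-Java approximation of what a Scala
// "object Entropy extends Impurity" compiles down to. Stand-in names.
interface Impurity {
    double calculate(double c0, double c1);
}

final class Entropy implements Impurity {
    // Scala objects expose their single instance via a static MODULE$ field.
    public static final Entropy MODULE$ = new Entropy();

    private Entropy() {} // no public constructor, so "new Entropy()" fails

    @Override
    public double calculate(double c0, double c1) {
        double total = c0 + c1;
        if (total == 0) return 0.0;
        double entropy = 0.0;
        for (double c : new double[] {c0, c1}) {
            if (c != 0) {
                double f = c / total;
                entropy -= f * (Math.log(f) / Math.log(2)); // log base 2
            }
        }
        return entropy;
    }
}

public class SingletonDemo {
    public static void main(String[] args) {
        // Access the singleton instead of calling a constructor:
        Impurity impurity = Entropy.MODULE$;
        System.out.println(impurity.calculate(5, 5)); // → 1.0 for a 50/50 split
    }
}
```

If that is accurate, then from Java something like `Entropy$.MODULE$` would be the way to grab the instance, though I have not confirmed it against the 1.0.0 bytecode.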
@samthebest if I remove the impurity variable and use the following form instead:
Strategy strategy = new Strategy(Algo.Classification(), new org.apache.spark.mllib.tree.impurity.Entropy(), maxDepth, maxBins, null, null, maxMemory);
the error changes to: The constructor Entropy() is undefined.
[edit] I found what I believe is the correct way to call it (see https://issues.apache.org/jira/browse/SPARK-2197 ):
Strategy strategy = new Strategy(Algo.Classification(), new Impurity() {
    @Override
    public double calculate(double arg0, double arg1, double arg2) {
        return Gini.calculate(arg0, arg1, arg2);
    }

    @Override
    public double calculate(double arg0, double arg1) {
        return Gini.calculate(arg0, arg1);
    }
}, 5, 100, QuantileStrategy.Sort(), null, 256);
Unfortunately, this also fails with an error :(
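For what it's worth, the anonymous-class-delegating-to-static-methods pattern itself compiles and runs fine in plain Java, so I suspect my remaining error comes from somewhere else (e.g. the real Impurity trait having additional abstract members, or the Strategy constructor arguments). Here is the same shape in isolation with stand-in types (MyImpurity/MyGini are hypothetical names, nothing from Spark):

```java
// Stand-in types (hypothetical, not Spark's) demonstrating the same
// anonymous-class delegation pattern used in the Strategy snippet above.
interface MyImpurity {
    double calculate(double c0, double c1);
}

final class MyGini {
    private MyGini() {}

    // Gini impurity for a binary split with class counts c0 and c1.
    static double calculate(double c0, double c1) {
        double total = c0 + c1;
        if (total == 0) return 0.0;
        double f0 = c0 / total;
        double f1 = c1 / total;
        return 1.0 - f0 * f0 - f1 * f1;
    }
}

public class AnonymousImpurityDemo {
    public static void main(String[] args) {
        // Anonymous class delegating to static methods, same shape as above:
        MyImpurity impurity = new MyImpurity() {
            @Override
            public double calculate(double c0, double c1) {
                return MyGini.calculate(c0, c1);
            }
        };
        System.out.println(impurity.calculate(5, 5)); // → 0.5 for a 50/50 split
    }
}
```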