I am trying to implement a simple demo of a decision tree classifier using Java and Apache Spark 1.0.0. I am following http://spark.apache.org/docs/1.0.0/mllib-decision-tree.html . So far I have written the code below.
The following line gives me a compile error:
org.apache.spark.mllib.tree.impurity.Impurity impurity = new org.apache.spark.mllib.tree.impurity.Entropy();
Type mismatch: cannot convert from Entropy to Impurity. This is strange to me, since the Entropy class implements the Impurity interface:
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/mllib/tree/impurity/Entropy.html
My question is: why does this conversion fail, and how can I get this working?
package decisionTree;

import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.configuration.Strategy;
import org.apache.spark.mllib.tree.impurity.Gini;
import org.apache.spark.mllib.tree.impurity.Impurity;

import scala.Enumeration.Value;

public final class DecisionTreeDemo {

    static class ParsePoint implements Function<String, LabeledPoint> {
        private static final Pattern COMMA = Pattern.compile(",");
        private static final Pattern SPACE = Pattern.compile(" ");

        @Override
        public LabeledPoint call(String line) {
            String[] parts = COMMA.split(line);
            double y = Double.parseDouble(parts[0]);
            String[] tok = SPACE.split(parts[1]);
            double[] x = new double[tok.length];
            for (int i = 0; i < tok.length; ++i) {
                x[i] = Double.parseDouble(tok[i]);
            }
            return new LabeledPoint(y, Vectors.dense(x));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: DecisionTreeDemo <file>");
            System.exit(1);
        }

        JavaSparkContext ctx = new JavaSparkContext("local[4]", "Log Analyzer",
                System.getenv("SPARK_HOME"),
                JavaSparkContext.jarOfClass(DecisionTreeDemo.class));

        JavaRDD<String> lines = ctx.textFile(args[0]);
        JavaRDD<LabeledPoint> points = lines.map(new ParsePoint()).cache();

        int iterations = 100;
        int maxBins = 2;
        int maxMemory = 512;
        int maxDepth = 1;

        // This is the line that produces the compile error:
        org.apache.spark.mllib.tree.impurity.Impurity impurity =
                new org.apache.spark.mllib.tree.impurity.Entropy();

        Strategy strategy = new Strategy(Algo.Classification(), impurity,
                maxDepth, maxBins, null, null, maxMemory);

        ctx.stop();
    }
}
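My guess at the root cause (I may be wrong): in the Spark source, Entropy is declared as a Scala object, not a class, so it compiles to a JVM class with a private constructor and a static MODULE$ singleton field, which is why `new Entropy()` is rejected from Java. Roughly like this plain-Java sketch (the names Impurity/Entropy here are illustrative stand-ins, not the real Spark classes):

```java
// Illustrative sketch only: a plain-Java approximation of what a Scala
// "object Entropy extends Impurity" compiles down to. Stand-in names.
interface Impurity {
    double calculate(double c0, double c1);
}

final class Entropy implements Impurity {
    // Scala objects expose their single instance via a static MODULE$ field.
    public static final Entropy MODULE$ = new Entropy();

    private Entropy() {} // no public constructor, so "new Entropy()" fails

    @Override
    public double calculate(double c0, double c1) {
        double total = c0 + c1;
        if (total == 0) return 0.0;
        double entropy = 0.0;
        for (double c : new double[] {c0, c1}) {
            if (c != 0) {
                double f = c / total;
                entropy -= f * (Math.log(f) / Math.log(2)); // log base 2
            }
        }
        return entropy;
    }
}

public class SingletonDemo {
    public static void main(String[] args) {
        // Access the singleton instead of calling a constructor:
        Impurity impurity = Entropy.MODULE$;
        System.out.println(impurity.calculate(5, 5)); // → 1.0 for a 50/50 split
    }
}
```

If that is accurate, then from Java something like `Entropy$.MODULE$` would be the way to grab the instance, though I have not confirmed it against the 1.0.0 bytecode.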
@samthebest if I remove the impurity variable and use the following form instead:
Strategy strategy = new Strategy(Algo.Classification(), new org.apache.spark.mllib.tree.impurity.Entropy(), maxDepth, maxBins, null, null, maxMemory);
the error changes to: The constructor Entropy() is undefined.
[edit] I found what I believe is the correct way to call it (see https://issues.apache.org/jira/browse/SPARK-2197 ):
Strategy strategy = new Strategy(Algo.Classification(), new Impurity() {
    @Override
    public double calculate(double arg0, double arg1, double arg2) {
        return Gini.calculate(arg0, arg1, arg2);
    }

    @Override
    public double calculate(double arg0, double arg1) {
        return Gini.calculate(arg0, arg1);
    }
}, 5, 100, QuantileStrategy.Sort(), null, 256);
Unfortunately, this also fails with an error :(
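For what it's worth, the anonymous-class-delegating-to-static-methods pattern itself compiles and runs fine in plain Java, so I suspect my remaining error comes from somewhere else (e.g. the real Impurity trait having additional abstract members, or the Strategy constructor arguments). Here is the same shape in isolation with stand-in types (MyImpurity/MyGini are hypothetical names, nothing from Spark):

```java
// Stand-in types (hypothetical, not Spark's) demonstrating the same
// anonymous-class delegation pattern used in the Strategy snippet above.
interface MyImpurity {
    double calculate(double c0, double c1);
}

final class MyGini {
    private MyGini() {}

    // Gini impurity for a binary split with class counts c0 and c1.
    static double calculate(double c0, double c1) {
        double total = c0 + c1;
        if (total == 0) return 0.0;
        double f0 = c0 / total;
        double f1 = c1 / total;
        return 1.0 - f0 * f0 - f1 * f1;
    }
}

public class AnonymousImpurityDemo {
    public static void main(String[] args) {
        // Anonymous class delegating to static methods, same shape as above:
        MyImpurity impurity = new MyImpurity() {
            @Override
            public double calculate(double c0, double c1) {
                return MyGini.calculate(c0, c1);
            }
        };
        System.out.println(impurity.calculate(5, 5)); // → 0.5 for a 50/50 split
    }
}
```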