Spark efficiency on small data

I'm hoping someone familiar with Spark can give me a "gut check" on whether I'm misusing the Spark ML framework, or whether the performance I'm seeing is understandable given the context (# rows, # features).

In short, I have a small data set (~150 rows) that is fairly wide (~180 features). I have written analogous Lasso training code in Spark and scikit-learn, and both produce identical models (same coefficients and LOOCV error). However, the Spark code takes roughly 100 times longer (sklearn finishes in about 5 seconds, Spark in about 600 seconds).

I understand that Spark is optimized for large, distributed data sets, and that the difference can reasonably be attributed to overhead latency that would normally be hidden by data parallelism, but this still seems extremely sluggish.

The Spark code is essentially:

//... code to add a number of PipelineStages to a List<PipelineStage> (~90 UnaryTransformer stages), ending in a StandardScaler

// Add Lasso model
LinearRegression lasso = new LinearRegression()
                .setLabelCol(response)
                .setFeaturesCol("normed_features")
                .setMaxIter(100000)
                .setPredictionCol(response+"_prediction")
                .setElasticNetParam(1.0)
                .setFitIntercept(true)
                .setRegParam(0.2);

// stages is the List<PipelineStage> loaded with 90 or so UnaryTransformer steps
stages.add(lasso);

Pipeline pipeline = new Pipeline().setStages(stages.toArray(new PipelineStage[0]));
DataFrame df = getTrainingData(trainingData, response);
RegressionEvaluator evaluator = new RegressionEvaluator()
                .setLabelCol(response)
                .setMetricName("mae")
                .setPredictionCol(response+"_prediction");

df.cache();

ParamMap[] paramGrid = new ParamGridBuilder().build();

CrossValidator cv = new CrossValidator()
            .setEstimator(pipeline)
            .setEvaluator(evaluator)
            .setEstimatorParamMaps(paramGrid)
            .setNumFolds(20);

double cve = cv.fit(df).avgMetrics()[0];

The Python code uses Lasso and GridSearchCV with the same number of folds (20).
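For reference, here is a minimal sketch of what the scikit-learn side looks like, assuming the features are already assembled into a numeric matrix X and a response vector y (the X/y names are placeholders; alpha=0.2, max_iter=100000, MAE scoring, and 20 folds mirror the Spark settings above, and the custom preprocessing is not shown):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X: (~150, ~180) feature matrix, y: response vector -- assumed to be prebuilt
pipeline = Pipeline([
    ("scale", StandardScaler()),  # mirrors the StandardScaler stage in the Spark pipeline
    ("lasso", Lasso(alpha=0.2, max_iter=100000, fit_intercept=True)),
])

# Single-candidate "grid" and 20 folds, matching the Spark CrossValidator setup
grid = GridSearchCV(
    pipeline,
    param_grid={},                      # nothing varied, like the empty ParamGridBuilder
    scoring="neg_mean_absolute_error",  # MAE, like the RegressionEvaluator
    cv=20,
)
grid.fit(X, y)
cve = -grid.best_score_                 # mean absolute error across the 20 folds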

Unfortunately, I really can't provide an MWE, because we use a custom Transformer that I would have to include, but I wonder if anyone could weigh in on whether this timing difference between sklearn and Spark implies a user error. The only best practice I consciously apply is caching the training DataFrame before fitting the CrossValidator.

Source: https://habr.com/ru/post/1610622/

