Spark mllib predicts a strange number or NaN

Question

I am new to Apache Spark and am trying to use its machine learning library to predict some data. My data set currently has only about 350 points. Here are 7 of them:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here is my code:

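from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # Build a list (not a lazy map object) so that pop() also works on Python 3
    split = [sanitize(value) for value in line.split(',')]
    # The second-to-last column is the label; the remaining columns are the features
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    # Strip the surrounding quotes and convert the field to a float
    return float(value.strip('"'))

# textFile is assumed to be an RDD of CSV lines like the ones shown above
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)

print(model.predict(parsedData.first().features))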

The prediction is something completely insane, for example -6.92840330273e+136. If I don't pass iterations to train(), I get nan as the result. What am I doing wrong? Is it my data set (its size, maybe?) or my configuration?

python apache-spark pyspark apache-spark-mllib gradient-descent

Scot lawrie Jul 23 '15 at 10:53

1 answer

Till Rohrmann · Accepted Answer · 2015-07-24T12:09:26+0000

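The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of the linear model. SGD is very sensitive to the stepSize that is used to update the intermediate solution.

SGD computes the gradient g of the cost function from a sample of the input points and the current weights w. To update w, it moves a certain distance in the direction opposite to g. That distance is the step size s.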

w(i+1) = w(i) - s * g
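
To make the update rule concrete, here is a minimal pure-Python sketch. It is not MLlib code: the function name gradient_descent, the single-feature model y = w * x, and the full-batch mean squared error are assumptions made for illustration. The inputs are the last and second-to-last columns of four of the question's rows. Because the feature values are in the millions, the gradient is enormous, so a step size of 1 overshoots further on every iteration, while a tiny step size settles near the least-squares slope.

```python
# Toy illustration of the update rule w(i+1) = w(i) - s * g.
# Hypothetical helper for this sketch; this is not how MLlib is implemented.

def gradient_descent(xs, ys, step, iterations):
    """Fit y = w * x (approximately) by full-batch gradient descent on the MSE."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w
        g = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Move against the gradient by the step size
        w = w - step * g
    return w

# One feature on the scale of the question's largest column (millions)
xs = [5330569.0, 5946290.0, 6097388.0, 5304694.0]
ys = [41401.387, 51517.886, 55059.838, 43780.977]

w_big = gradient_descent(xs, ys, step=1.0, iterations=10)
w_small = gradient_descent(xs, ys, step=1e-14, iterations=10)

print(w_big)    # astronomically large magnitude: the iteration diverged
print(w_small)  # close to the least-squares slope sum(x*y) / sum(x*x)
```

The step size that works here is tiny only because the toy feature is not rescaled; the value itself is specific to this sketch and is not a recommendation for MLlib.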

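Since you do not specify a step size, MLlib uses stepSize = 1. That does not seem to work for your data. I recommend trying different step sizes, usually smaller values, to see how LinearRegressionWithSGD behaves (in PySpark the keyword arguments are iterations and step):

LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)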
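
One way to try different step sizes is a small sweep that trains once per candidate and keeps the value with the lowest finite training error. The sketch below is a pure-Python stand-in, not MLlib: train, mse, the single-feature model, and the candidate grid are all assumptions made for illustration. It also shows one plausible origin of the nan in the question: with 100 iterations and a step size that is too large, the weight overflows to infinity and the next update computes inf - inf.

```python
import math

def train(xs, ys, step, iterations):
    """Full-batch gradient descent for y = w * x (approximately), from w = 0."""
    w, n = 0.0, len(xs)
    for _ in range(iterations):
        g = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= step * g
    return w

def mse(w, xs, ys):
    # Multiplication instead of ** so that overflow yields inf, not an exception
    return sum((w * x - y) * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

# Last and second-to-last columns of four rows from the question
xs = [5330569.0, 5946290.0, 6097388.0, 5304694.0]
ys = [41401.387, 51517.886, 55059.838, 43780.977]

# Hypothetical candidate grid; the right range depends on the feature scale
candidates = [1.0, 1e-3, 1e-6, 1e-9, 1e-12, 1e-14, 1e-16]

errors = {}
for step in candidates:
    w = train(xs, ys, step, iterations=100)
    err = mse(w, xs, ys)
    # Keep only runs that did not blow up to inf or nan
    if math.isfinite(err):
        errors[step] = err

best_step = min(errors, key=errors.get)
print(train(xs, ys, 1.0, iterations=100))  # nan for this toy data
print(best_step)
```

With MLlib, the same loop would call LinearRegressionWithSGD.train once per candidate and compare the prediction error on held-out points rather than on the training data.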
