I am trying to build a Naive Bayes classifier on data loaded from a database as a DataFrame of (label, text) pairs. Here is a sample of the data (the label is multi-class):
+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|
I used the following transformations for tokenization, stop-word removal, n-grams and HashingTF:
import org.apache.spark.ml.feature.{HashingTF, NGram, RegexTokenizer, StopWordsRemover, Tokenizer}

val selectedData = df.select("label", "feature")

// Tokenize the text column
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val tokenized = tokenizer.transform(selectedData)
tokenized.select("words", "label").take(3).foreach(println)

// Remove stop words
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val parsedData = remover.transform(tokenized)

// N-grams
val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(parsedData)
ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(println)

// Hashing function
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
val featurizedData = hashingTF.transform(ngramDataFrame)
Conversion result:
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             feature|               words|            filtered|              ngrams|                hash|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
|    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
|    1|                    |                  []|                  []|                  []|        (1000,[],[])|
|    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
|    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|
To build a Naive Bayes model, I need to convert the label and features to LabeledPoint. Following that approach, I tried to convert the DataFrame to an RDD and create the labeled points:
val rddData = featurizedData.select("label","hash").rdd
val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0), parts(1))
}

val rddData = featurizedData.select("label","hash").rdd.map(r =>
  (Try(r(0).asInstanceOf[Integer]).get.toDouble,
   Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))
val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
}
I get the following error:
scala> val trainData = rddData.map { line =>
     |   val parts = line.split(',')
     |   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
     | }
<console>:67: error: value split is not a member of (Double, org.apache.spark.mllib.linalg.SparseVector)
         val parts = line.split(',')
                          ^
<console>:68: error: not found: value Vectors
         LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
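The two errors point at separate problems: each element of rddData is already a (Double, SparseVector) tuple rather than a String, so there is nothing to split, and Vectors was never imported. A minimal sketch of building LabeledPoints by destructuring such tuples is shown here. It assumes Spark MLlib is on the classpath; `pairs` and `points` are hypothetical names, and a local Seq stands in for the RDD:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Stand-in for the contents of rddData: (label, vector) tuples, not Strings.
// The vector mirrors the "salt preserving" row: size 1000, index 675 -> 1.0.
val pairs: Seq[(Double, Vector)] = Seq(
  (1.0, Vectors.sparse(1000, Array(675), Array(1.0)))
)

// Destructure each tuple directly instead of calling split on it
val points: Seq[LabeledPoint] = pairs.map { case (label, features) =>
  LabeledPoint(label, features)
}

points.foreach(println)
```

The same `map { case (label, features) => ... }` body should apply unchanged to an RDD of such tuples.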
Edit 1:
As shown below, I created the LabeledPoints and trained the model.
// "hash" is the HashingTF output column produced above
val trainData = featurizedData.select("label", "hash")
val trainLabel = trainData.map(line =>
  LabeledPoint(
    Try(line(0).asInstanceOf[Integer]).get.toDouble,
    Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
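For reference, accuracy over (prediction, label) pairs like predictionAndLabels is the fraction of pairs whose two values match. A minimal sketch in plain Scala, where a Seq stands in for the RDD, `accuracy` is a hypothetical helper name, and the sample pairs are made up for illustration rather than taken from the dataset:

```scala
// Hypothetical helper: fraction of (prediction, label) pairs that agree
def accuracy(predictionAndLabels: Seq[(Double, Double)]): Double =
  if (predictionAndLabels.isEmpty) 0.0
  else predictionAndLabels.count { case (prediction, label) =>
    prediction == label
  }.toDouble / predictionAndLabels.size

// Made-up pairs for illustration only, not results from the model
val sample = Seq((1.0, 1.0), (3.0, 1.0), (2.0, 2.0), (5.0, 4.0))

println(accuracy(sample)) // prints 0.5 (2 of 4 pairs match)
```

On the RDD itself, the same ratio would be `predictionAndLabels.filter { case (p, l) => p == l }.count().toDouble / test.count()`.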
I get a low accuracy of about 40%, both with and without n-grams and with different numbers of hash features. My dataset contains 5,000 rows and 45 distinct labels. Is there a way to improve the model's performance? Thanks in advance.