Multi-level text classifier using DataFrames in Scala Spark

I am trying to create a NaiveBayes classifier by loading data from a database as a DataFrame containing (label, text). Here is some sample data (multi-class labels):

    +-----+--------------------+
    |label|             feature|
    +-----+--------------------+
    |    1|combusting prepar...|
    |    1|adhesives for ind...|
    |    1|                    |
    |    1| salt for preserving|
    |    1|auxiliary fluids ...|
    +-----+--------------------+

I used the following transformations for tokenization, stop-word removal, n-grams and HashingTF:

    import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer, StopWordsRemover, NGram, HashingTF}

    val selectedData = df.select("label", "feature")

    // Tokenize
    val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
    val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
    val tokenized = tokenizer.transform(selectedData)
    tokenized.select("words", "label").take(3).foreach(println)

    // Remove stop words
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val parsedData = remover.transform(tokenized)

    // N-grams
    val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
    val ngramDataFrame = ngram.transform(parsedData)
    ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(println)

    // Hashing term frequencies
    val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
    val featurizedData = hashingTF.transform(ngramDataFrame)

Conversion result:

    +-----+--------------------+--------------------+--------------------+--------------------+--------------------+
    |label|             feature|               words|            filtered|              ngrams|                hash|
    +-----+--------------------+--------------------+--------------------+--------------------+--------------------+
    |    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
    |    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
    |    1|                    |                  []|                  []|                  []|        (1000,[],[])|
    |    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
    |    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|
    +-----+--------------------+--------------------+--------------------+--------------------+--------------------+

To build a Naive Bayes model, I need to convert the label and features into LabeledPoints. Following that approach, I tried to convert the DataFrame to an RDD and create LabeledPoints:

    // First attempt: treat each row as a string
    val rddData = featurizedData.select("label", "hash").rdd
    val trainData = rddData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0), parts(1))
    }

    // Second attempt: cast the columns explicitly
    val rddData = featurizedData.select("label", "hash").rdd.map(r =>
      (Try(r(0).asInstanceOf[Integer]).get.toDouble,
       Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))
    val trainData = rddData.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
    }

I get the following error:

    scala> val trainData = rddData.map { line =>
         |   val parts = line.split(',')
         |   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
         | }
    <console>:67: error: value split is not a member of (Double, org.apache.spark.mllib.linalg.SparseVector)
           val parts = line.split(',')
                            ^
    <console>:68: error: not found: value Vectors
           LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))

Update 1:

As shown below, I created LabeledPoints and trained the model.

    val trainData = featurizedData.select("label", "features")
    val trainLabel = trainData.map(line =>
      LabeledPoint(Try(line(0).asInstanceOf[Integer]).get.toDouble,
                   Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

    val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
    val predictionAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
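For reference, a minimal sketch of how the accuracy figure below can be computed from predictionAndLabels (the MulticlassMetrics call is just one way to report it; in Spark 1.x its precision method returns overall accuracy):

    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // Fraction of test points whose predicted label matches the true label.
    val accuracy = 1.0 * predictionAndLabels.filter { case (score, label) => score == label }.count() / test.count()

    // Equivalent via MulticlassMetrics.
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(s"Accuracy: ${metrics.precision}")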

I get low accuracy, about 40%, both with and without n-grams and with different numbers of hash features. My dataset contains 5,000 rows and 45 distinct labels. Is there a way to improve the model's performance? Thanks in advance.

1 answer

You do not need to convert featurizedData to an RDD. Apache Spark has two machine-learning libraries: ML, which works with DataFrames, and MLlib, which works with RDDs. Since you already have a DataFrame, you can work with ML directly.

To do that, you just need to rename your columns to (label, features) and fit your model, as in the NaiveBayes example below.

    from pyspark.sql import Row
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.classification import NaiveBayes

    df = sqlContext.createDataFrame([
        Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
        Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
        Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])

    nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
    model = nb.fit(df)
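Since your pipeline is in Scala, here is a minimal Scala sketch of the same approach; the column rename and the label cast are the only additions to your pipeline, and the names are assumptions based on the code you posted:

    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.sql.functions.col

    // ML expects a Double "label" column and a Vector "features" column.
    val data = featurizedData
      .withColumnRenamed("hash", "features")
      .withColumn("label", col("label").cast("double"))

    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

    val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial")
    val model = nb.fit(training)

    // transform() appends a "prediction" column you can compare against "label".
    val predictions = model.transform(test)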

As for the error: it occurs because you already have a SparseVector, and that class does not have a split method. So your RDD almost has the structure you actually need; you only have to convert each tuple into a LabeledPoint, as sketched below.
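Concretely, a minimal sketch of that conversion, using the rddData of (Double, SparseVector) pairs you already built in your second attempt:

    import org.apache.spark.mllib.regression.LabeledPoint

    // No split() needed: each element is already a (label, vector) tuple.
    val trainData = rddData.map { case (label, vector) => LabeledPoint(label, vector) }

NaiveBayes.train(trainData, lambda = 1.0, modelType = "multinomial") then works unchanged.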

There are several ways to improve performance. The first that comes to mind is removing stop words (for example a, an, to, although, etc.). The second is counting the number of distinct words in your texts and then constructing the vectors from that vocabulary instead of hashing: if the number of hash buckets is low, different words can map to the same hash, hence the poor performance.
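For the second point, CountVectorizer (in org.apache.spark.ml.feature) does that counting for you. A sketch reusing the parsedData and filtered column from your pipeline; the vocabulary-size and minimum-frequency values are placeholders to tune:

    import org.apache.spark.ml.feature.CountVectorizer

    // Build an explicit vocabulary from the filtered tokens, so two different
    // words can never collide on the same index (unlike HashingTF).
    val cv = new CountVectorizer()
      .setInputCol("filtered")
      .setOutputCol("features")
      .setVocabSize(10000) // placeholder upper bound on vocabulary size
      .setMinDF(2)         // placeholder: drop words seen in fewer than 2 rows
    val cvModel = cv.fit(parsedData)
    val vectorized = cvModel.transform(parsedData)

If you stay with HashingTF instead, simply raising setNumFeatures well above 1000 also reduces collisions.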
