How to extract character n-grams from a large text

Question

Given a large text file, I want to extract its character n-grams using Apache Spark, so that the task runs in parallel.

Input example (2 lines of text): line 1: (Hello World, it) line 2: (is a nice day)

Output n-grams: Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_ - t_i - _is - ... and so on. So I want the return value to be an RDD[String], with each row holding an n-gram.

Note that the newline is treated as a space in the output n-grams. I put each line in parentheses only to make the boundaries clear. Also, to be clear, the text is not a single entry in the RDD: I read the file using the sc.textFile() method, so each line is a separate entry.

+4

scala apache-spark

Al jenssen Jan 25 '16 at 17:49

3 answers

You could use a function like the following:

def n_gram(str:String, n:Int) = (str + " ").sliding(n)

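This function assumes that each line is a string with the newline character already stripped. If that is not the case, you could use: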

def n_gram(str:String, n:Int) = str.replace('\n', ' ').sliding(n)

Sample use:

println(n_gram("Hello World, it", 3).map(_.replace(' ', '_')).mkString(" - "))

Output:

Hel - ell - llo - lo_ - o_W - _Wo - Wor - orl - rld - ld, - d,_ - ,_i - _it - it_
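As a rough sketch of how this helper behaves when applied line by line, the snippet below uses a plain Scala Seq as a stand-in for the RDD returned by sc.textFile (the names `lines`, `perLine` and `missing`, and the sample data, are assumptions for illustration). RDD.flatMap takes the same kind of function, so the call shape should carry over.

```scala
// Same helper as in the answer: a trailing space stands in for the stripped newline.
def n_gram(str: String, n: Int): Iterator[String] = (str + " ").sliding(n)

// Stand-in for the file contents (assumption: one element per line, newline stripped).
val lines = Seq("Hello World, it", "is a nice day")

// Expand every line into its own trigrams; on an RDD[String] this would be a flatMap too.
val perLine: Seq[String] =
  lines.flatMap(line => n_gram(line, 3).map(_.replace(' ', '_')))

// Trigrams that cross the line break are not produced, because each line is handled alone.
val missing: Seq[String] = Seq("t_i", "_is").filterNot(perLine.contains)
```

Handled this way, each line contributes its own trigrams, but the ones that straddle two lines (here "t_i" and "_is") do not appear in the result.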

+1

Eduardo Jan 25 '16 at 19:37

There may be shorter ways to do this.

Assuming each whole line (including the newline) is a separate entry in the RDD, returning the following from flatMap should give you the desired result.

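// `text` is assumed to be one line of input; the window size of 3 is hard-coded.
val strings = text.foldLeft(("", List[String]())) {
  case ((s, l), c) =>
    if (s.length < 2) {
      // Window not yet full: keep accumulating characters.
      val ns = s + c
      (ns, l)
    } else if (s.length == 2) {
      // First full trigram.
      val ns = s + c
      (ns, ns :: l)
    } else {
      // Slide the window: drop the oldest character, append the new one.
      val ns = s.tail + c
      (ns, ns :: l)
    }
}._2.reverse // trigrams are prepended, so reverse to restore reading order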
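For comparison, here is a minimal sketch of one shorter way, assuming the input is a single line and the window size is 3, as hard-coded in the fold. The helper names `foldTrigrams` and `slidingTrigrams` are hypothetical. The fold prepends each new trigram, so the sketch reverses the accumulated list to get reading order before comparing it with sliding.

```scala
// Fold-based trigram extraction, following the structure of the answer's fold.
def foldTrigrams(text: String): List[String] =
  text.foldLeft(("", List.empty[String])) {
    case ((s, l), c) =>
      if (s.length < 2) (s + c, l)      // window not yet full
      else {
        val ns = s.takeRight(2) + c     // keep the last two characters, append the new one
        (ns, ns :: l)                   // prepend, so the list is built back to front
      }
  }._2.reverse                          // restore reading order

// Shorter variant. The length guard avoids the single short window that
// sliding returns when the input has fewer than three characters.
def slidingTrigrams(text: String): List[String] =
  if (text.length < 3) Nil else text.sliding(3).toList
```

For inputs of three or more characters both helpers should return the same list; the guard only matters for very short lines, where an unguarded sliding(3) would emit one window shorter than three characters.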

0

Angelo genovese Jan 25 '16 at 19:23

marios · Accepted Answer · 2016-01-26T05:05:30+0000

The main idea is to take all the lines within each partition and concatenate them into one long string. Then we replace " " with "_" and call sliding on this string to create the trigrams for each partition in parallel.

Note: The resulting trigrams may not be 100% accurate, since we will miss a few trigrams at the beginning and end of each partition, namely those that span the boundary between two partitions. Given that each partition can contain several million characters, the loss in accuracy should be negligible. The main benefit here is that each partition can be processed in parallel.

Here is some toy data. Everything below can be run in any Spark REPL:

scala> val data = sc.parallelize(Seq("Hello World, it", "is a nice day"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12]

scala> val trigrams = data.mapPartitions(_.toList.mkString(" ").replace(" ", "_").sliding(3))
trigrams: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14]

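Here the trigrams are collected to show what they look like (you may not want to do this as is if your dataset is large):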

scala> val asCollected = trigrams.collect
asCollected: Array[String] = Array(Hel, ell, llo, lo_, o_W, _Wo, Wor, orl, rld, ld,, d,_, ,_i, _it, is_, s_a, _a_, a_n, _ni, nic, ice, ce_, e_d, _da, day)
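
The gap is visible in the collected output: "it_", "t_i" and "_is" are absent, which is what one would expect if the two lines ended up in different partitions. The following plain-Scala sketch reproduces that effect without Spark; the nested Seq standing in for partitions and the helper name `partitionTrigrams` are assumptions for illustration only.

```scala
// Trigrams of one partition: join its lines with spaces, show spaces as '_', then slide.
def partitionTrigrams(lines: Seq[String]): List[String] =
  lines.mkString(" ").replace(" ", "_").sliding(3).toList

// Assumption: each inner Seq plays the role of one partition.
val twoPartitions = Seq(Seq("Hello World, it"), Seq("is a nice day"))
val onePartition  = Seq(Seq("Hello World, it", "is a nice day"))

val split: Seq[String]  = twoPartitions.flatMap(partitionTrigrams)
val joined: Seq[String] = onePartition.flatMap(partitionTrigrams)

// Trigrams that only exist when both lines are processed together.
val lost: Seq[String] = joined.filterNot(split.contains)
```

In this toy case `split` matches the collected array above, and `lost` holds the three trigrams that touch the partition boundary. With partitions of millions of characters, a handful of missing trigrams per boundary is a small fraction of the total.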
