I have a large data frame (over 1.2 GB) with this structure:
+---------+--------------+-----------------------------------------------------------------------------------------------------+
| country | date_data    | text                                                                                                |
+---------+--------------+-----------------------------------------------------------------------------------------------------+
| "EEUU"  | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n...........\nT_R: 45ee"       |
| "EEUU"  | "2016-10-03" | "T_D: QQAA\nT_NAME: name_2\nT_IN: ind_2\nT_C: c1ws12\nT_ADD: Sec_1_P\n...........\nT_R: 46ee"       |
| .       | .            | .                                                                                                   |
| .       | .            | .                                                                                                   |
| "EEUU"  | "2016-10-03" | "T_D: QQWE\nT_NAME: name_300000\nT_IN: ind_65\nT_C: c1ws12\nT_ADD: Sec_1_P\n...........\nT_R: 47aa" |
+---------+--------------+-----------------------------------------------------------------------------------------------------+
The data frame has 300,000 rows, and the text field of each row is a single string of 5,000 characters.
I would like to split the text field into new fields:
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| country | date_data  | t_d  | t_name      | t_in   | t_c    | t_add   | ...... | t_r  |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| EEUU    | 2016-10-03 | QQWE | name_1      | ind_1  | c1ws12 | Sec_1_P | ...... | 45ee |
| EEUU    | 2016-10-03 | QQAA | name_2      | ind_2  | c1ws12 | Sec_1_P | ...... | 46ee |
| .       | .          | .    | .           | .      | .      | .       | .      | .    |
| .       | .          | .    | .           | .      | .      | .       | .      | .    |
| .       | .          | .    | .           | .      | .      | .       | .      | .    |
| EEUU    | 2016-10-03 | QQWE | name_300000 | ind_65 | c1ws12 | Sec_1_P | ...... | 47aa |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
I am currently using regular expressions to solve this problem. First, I define the regular expressions (90 in total) and a function that extracts an individual field from the text:

    import org.apache.spark.sql.functions.udf
    import scala.util.matching.Regex

    // One pattern per field: everything between the "T_X: " label
    // and the next literal "\n" separator.
    val D_text    = "((?<=T_D: ).*?(?=\\\\n))".r
    val NAME_text = "((?<=T_NAME: ).*?(?=\\\\n))".r
    val IN_text   = "((?<=T_IN: ).*?(?=\\\\n))".r
    val C_text    = "((?<=T_C: ).*?(?=\\\\n))".r
    val ADD_text  = "((?<=T_ADD: ).*?(?=\\\\n))".r
    // . . . .
    val R_text    = "((?<=T_R: ).*?(?=\\\\n))".r

    // UDF factory: returns the first match of the pattern, or "NULL" if there is none.
    def getFirst(pattern2: Regex) = udf(
      (url: String) => pattern2.findFirstIn(url) match {
        case Some(texst_new) => texst_new
        case None            => "NULL"
      }
    )
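To make the behaviour of these patterns concrete, here is a minimal sketch in plain Scala, without Spark. The `RegexDemo` object, the `first` helper and the sample string are made up for illustration only; the sample assumes that the separator in the data is the two literal characters `\` and `n`, which is what the lookahead `(?=\\\\n)` looks for.

```scala
import scala.util.matching.Regex

object RegexDemo {
  // Same pattern shape as in the post: lookbehind for the label, lazy match,
  // lookahead for a literal backslash followed by 'n'.
  val D_text: Regex   = "((?<=T_D: ).*?(?=\\\\n))".r
  val IN_text: Regex  = "((?<=T_IN: ).*?(?=\\\\n))".r
  val ADD_text: Regex = "((?<=T_ADD: ).*?(?=\\\\n))".r

  // Plain-Scala stand-in for the body of the UDF (hypothetical helper).
  def first(pattern: Regex, text: String): String =
    pattern.findFirstIn(text).getOrElse("NULL")

  // Hypothetical sample row; "\\n" in source code is the two characters \ and n.
  val sample: String = "T_D: QQWE\\nT_NAME: name_1\\nT_IN: ind_1\\nT_R: 45ee"

  def main(args: Array[String]): Unit = {
    println(first(D_text, sample))   // QQWE
    println(first(IN_text, sample))  // ind_1
    println(first(ADD_text, sample)) // NULL (T_ADD is absent from the sample)
  }
}
```

Each pattern scans the 5,000-character string independently, so 90 patterns mean 90 separate scans per row.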
Then I create a new DataFrame (tbl_separate_fields) by applying the regex function to the text column once for each new field:

    val tbl_separate_fields = hiveDF.select(
      hiveDF("country"),
      hiveDF("date_data"),
      getFirst(D_text)(hiveDF("text")).alias("t_d"),
      getFirst(NAME_text)(hiveDF("text")).alias("t_name"),
      getFirst(IN_text)(hiveDF("text")).alias("t_in"),
      getFirst(C_text)(hiveDF("text")).alias("t_c"),
      getFirst(ADD_text)(hiveDF("text")).alias("t_add"),
      // . . . .
      getFirst(R_text)(hiveDF("text")).alias("t_r")
    )
Finally, I insert this dataframe into the Hive table:

    tbl_separate_fields.registerTempTable("tbl_separate_fields")
    hiveContext.sql("INSERT INTO TABLE TABLE_INSERT PARTITION (date_data) SELECT * FROM tbl_separate_fields")
This solution takes 1 hour for the entire data frame, so I want to optimize it and reduce the execution time. Is there a better approach?
We use Hadoop 2.7.1 and Apache Spark 1.5.1. Configuration for Spark:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().set("spark.storage.memoryFraction", "0.1")
    val sc = new SparkContext(conf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Thanks in advance.
EDIT DATA:
+---------+--------------+-----------------------------------------------------------------------------------------------+
| country | date_data    | text                                                                                          |
+---------+--------------+-----------------------------------------------------------------------------------------------+
| "EEUU"  | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n...........\nT_R: 45ee" |
| "EEUU"  | "2016-10-03" | "T_NAME: name_2\nT_D: QQAA\nT_IN: ind_2\nT_C: c1ws12 ...........\nT_R: 46ee"                  |
| .       | .            | .                                                                                             |
| .       | .            | .                                                                                             |
| "EEUU"  | "2016-10-03" | "T_NAME: name_300000\nT_ADD: Sec_1_P\nT_IN: ind_65\nT_C: c1ws12\n...........\nT_R: 47aa"      |
+---------+--------------+-----------------------------------------------------------------------------------------------+
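
In this sample the fields appear in a different order from row to row, and some are absent. One direction that might cope with that, and avoid scanning the string once per field, is to split each record a single time and collect the labels into a map. The following is only a sketch under assumptions, not a tested replacement: `SinglePassSketch`, `parseFields` and the sample row are hypothetical names, every field is assumed to have the form `LABEL: value`, and the separator is assumed to be the literal two characters `\` and `n`.

```scala
import java.util.regex.Pattern

object SinglePassSketch {
  // Assumed separator: a literal backslash followed by 'n', as in the sample data.
  val Separator: String = "\\n"

  // Split the record once and build a label -> value map,
  // so the order of the fields inside the text does not matter.
  def parseFields(text: String): Map[String, String] =
    Option(text).getOrElse("")
      .split(Pattern.quote(Separator))
      .iterator
      .map(_.split(": ", 2))                             // "T_D: QQAA" -> Array("T_D", "QQAA")
      .collect { case Array(k, v) => k.trim -> v.trim }  // skip pieces without a label
      .toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical row with T_NAME first and no T_ADD field.
    val row = "T_NAME: name_2\\nT_D: QQAA\\nT_IN: ind_2\\nT_R: 46ee"
    val fields = parseFields(row)
    println(fields.getOrElse("T_D", "NULL"))   // QQAA
    println(fields.getOrElse("T_R", "NULL"))   // 46ee
    println(fields.getOrElse("T_ADD", "NULL")) // NULL
  }
}
```

Whether wrapping something like this in a single UDF is actually faster than the 90 regex UDF calls on Spark 1.5.1 has not been measured here.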