Sparklyr - Decimal precision 8 exceeds maximum precision 7

I am trying to load a large dataset into Spark using spark_read_csv, but I am getting the following error:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235): java.lang.IllegalArgumentException: requirement failed: Decimal precision 8 exceeds max precision 7

data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE) 

This is a large dataset, about 5.8 million records; the columns are of types int, num and chr.

1 answer

I think you have a couple of options, depending on the Spark version you are using.

Spark >= 1.6.1

From here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html it seems you can explicitly specify your schema to force it to use doubles:

csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                                  source = "csv", header = "true", schema = csvSchema)
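Since the question uses sparklyr rather than SparkR, a similar fix there is to turn off schema inference and pass the column types to spark_read_csv yourself. A minimal sketch, assuming the columns/infer_schema arguments of spark_read_csv; the column names and types here are placeholders, substitute the real ones from base_csv:

library(sparklyr)

# Placeholder column names/types -- replace with the actual schema of base_csv
col_types <- c(id = "integer", label = "character", amount = "double")

data_tbl <- spark_read_csv(
  sc, "data", "D:/base_csv",
  delimiter    = "|",
  infer_schema = FALSE,   # stop Spark from guessing a too-narrow DecimalType
  columns      = col_types,
  overwrite    = TRUE
)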

Spark < 1.6.1

Consider test.csv:

1,a,4.1234567890
2,b,9.0987654321

You can easily make this more efficient, but I think you get the gist:

 linesplit <- function(x){ tmp <- strsplit(x,",") return ( tmp) } lineconvert <- function(x){ arow <- x[[1]] converted <- list(as.integer(arow[1]), as.character(arow[2]),as.double(arow[3])) return (converted) } rdd <- SparkR:::textFile(sc,'/path/to/test.csv') lnspl <- SparkR:::map(rdd, linesplit) ll2 <- SparkR:::map(lnspl,lineconvert) ddf <- createDataFrame(sqlContext,ll2) head(ddf) _1 _2 _3 1 1 a 4.1234567890 2 2 b 9.0987654321 

NOTE: the SparkR::: methods are private for a reason; the docs say "be careful when you use this".
