Regarding Spark DataFrameReader jdbc

I have a question about the mechanics of Spark's DataFrameReader, and I would be grateful if anyone could help me. Let me walk through the script here.

I am creating a DataFrame from a DStream like this. This is the input side:

    import scala.collection.mutable.HashMap
    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.spark.streaming.kafka.KafkaUtils

    var config = new HashMap[String, String]()
    config += ("zookeeper.connect" -> zookeeper)
    config += ("partition.assignment.strategy" -> "roundrobin")
    config += ("bootstrap.servers" -> broker)
    config += ("serializer.class" -> "kafka.serializer.DefaultEncoder")
    config += ("group.id" -> "default")

    val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
        ssc, config.toMap, Set(topic)).map(_._2)

    lines.foreachRDD { rdd =>

        if (!rdd.isEmpty()) {

            val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }

            val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)

            val rddDF = sqlContext.read.json(rddJson)

            rddDF.registerTempTable("inputData")

            val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(
                sqlContext, jdbcUrl, "ABCD", "A", numOfPartiton, lowerBound, upperBound)
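The SQLContextSingleton used above is not shown in the post; here is a minimal sketch of it, assuming the lazily-initialized singleton pattern from the Spark Streaming programming guide, returning a HiveContext because readDataFrameHelperFromDB below expects one:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // Hypothetical sketch (not shown in the original post): create a single
    // HiveContext per JVM lazily and reuse it across micro-batches.
    object SQLContextSingleton {
        @transient private var instance: HiveContext = _

        def getInstance(sparkContext: SparkContext): HiveContext = {
            if (instance == null) {
                instance = new HiveContext(sparkContext)
            }
            instance
        }
    }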

Here is the ReadDataFrameHelper code:

    def readDataFrameHelperFromDB(sqlContext: HiveContext, jdbcUrl: String, dbTableOrQuery: String,
            columnToPartition: String, numOfPartiton: Int, lowerBound: Int, highBound: Int): DataFrame = {

        val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
            columnName = columnToPartition,
            lowerBound = lowerBound,
            upperBound = highBound,
            numPartitions = numOfPartiton,
            connectionProperties = new java.util.Properties())

        jdbcDF
    }
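As an aside (a sketch, not part of the original helper): the same read can be written with the explicit-predicates overload of read.jdbc, which makes the partitioning mechanics visible. With, say, numOfPartiton = 4, lowerBound = 0 and upperBound = 100, the bounds-based call above generates roughly the WHERE clauses below; note that the bounds only set the stride on column A, they do not filter any rows out.

    import java.util.Properties
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.hive.HiveContext

    def readWithExplicitPredicates(sqlContext: HiveContext, jdbcUrl: String): DataFrame = {
        // Each predicate becomes the WHERE clause of one partition's query.
        val predicates = Array(
            "A < 25 OR A IS NULL",  // first partition also picks up rows below lowerBound
            "A >= 25 AND A < 50",
            "A >= 50 AND A < 75",
            "A >= 75"               // last partition also picks up rows above upperBound
        )
        sqlContext.read.jdbc(jdbcUrl, "ABCD", predicates, new Properties())
    }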

Finally, I do the join like this:

    val joinedData = rddDF.join(dbDF, rddDF("ID") === dbDF("ID")
            && rddDF("CODE") === dbDF("CODE"), "left_outer")
        .drop(dbDF("code"))
        .drop(dbDF("id"))
        .drop(dbDF("number"))
        .drop(dbDF("key"))
        .drop(dbDF("loaddate"))
        .drop(dbDF("fid"))

    joinedData.show()
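Incidentally (a Spark 1.6+ alternative, not from the original post), joining on the shared column names keeps a single ID/CODE column in the output, so most of those drop calls disappear:

    // Only the extra DB-side columns still need dropping.
    val joined = rddDF.join(
        dbDF.drop("number").drop("key").drop("loaddate").drop("fid"),
        Seq("ID", "CODE"),
        "left_outer")
    joined.show()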

My DStream input will have about 1000 rows per batch, while the database table contains millions of rows. So when I perform this join, will Spark read all the rows from the database and then join them, or will it read only the specific rows from the database whose CODE and ID match the input DStream?

1 answer

As zero323 pointed out, Spark will read all the rows from the database table and perform the join itself; the ID and CODE values from the input DStream are not pushed down to the database. The lowerBound and upperBound parameters do not filter rows either: they only determine how the range of the partitioning column is split into numPartitions parallel queries. To read only the matching rows, the filtering has to be pushed into the query that is sent to the database.

zero323
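A minimal sketch of such a pushdown, assuming the table and column names from the question (ABCD, CODE), that CODE is a string column, and that the distinct keys of a micro-batch are few enough to collect to the driver:

    import java.util.Properties

    // Collect the distinct join keys from the small streaming side...
    val codes = rddDF.select("CODE").distinct()
        .collect()
        .map(row => s"'${row.getString(0)}'")
        .mkString(", ")

    // ...and embed them in a subquery so the database filters the rows itself.
    // The alias ("tmp") is required by most databases. Build the IN list only
    // from trusted values; string concatenation like this risks SQL injection.
    val filteredQuery = s"(SELECT * FROM ABCD WHERE CODE IN ($codes)) tmp"
    val dbDF = sqlContext.read.jdbc(jdbcUrl, filteredQuery, new Properties())

Only the matching rows then cross the network, at the cost of one extra collect per micro-batch.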

Source: https://habr.com/ru/post/1650969/

