I have my own data source, and I want to load data into my Spark cluster to perform some calculations. For this, I see that I may need to implement a new RDD for my data source.
I am a complete Scala noob, and I hope I can implement the RDD in Java instead. I browsed the internet and could not find any resources. Any pointers?
My data is in S3 and indexed in DynamoDB. For example, if I want to load data for a given time range, I first need to query Dynamo for the S3 file keys in that time range, and then load those files into Spark. The files do not always share the same S3 path prefix, so `sc.textFile("s3://directory_path/")` will not work.
I am looking for pointers on how to implement something similar to `HadoopRDD` or `JdbcRDD`, but in Java. Something similar to what they did here: DynamoDBRDD. That one reads data from Dynamo; my custom RDD would query DynamoDB for the S3 file keys and then load the data from S3.
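To frame what I have in mind: since `sc.textFile()` accepts a comma-separated list of paths (a Hadoop `FileInputFormat` feature), one approach I'm considering is doing the Dynamo lookup myself and just joining the resulting keys into one path string, rather than writing a full custom RDD. This is only a sketch; the class name, bucket names, and the `lookupKeysForRange` stub are all made up, and the real version would call the DynamoDB SDK:

```java
import java.util.Arrays;
import java.util.List;

public class S3PathJoiner {

    // Stand-in for the DynamoDB query: a real implementation would hit the
    // index and return the S3 keys for files within the given time range.
    static List<String> lookupKeysForRange(String from, String to) {
        return Arrays.asList(
            "s3://bucket-a/2015/01/part-0001",
            "s3://bucket-b/2015/01/part-0002");
    }

    // Join the keys into the comma-separated form that textFile() accepts.
    static String buildPaths(List<String> keys) {
        return String.join(",", keys);
    }

    public static void main(String[] args) {
        String paths = buildPaths(lookupKeysForRange("2015-01-01", "2015-01-31"));
        System.out.println(paths);
        // With a JavaSparkContext sc, the load would then be:
        //   JavaRDD<String> lines = sc.textFile(paths);
    }
}
```

This avoids implementing `compute()` and `getPartitions()` by hand, but I'm not sure it scales if the key list gets very large, which is why I'm still asking about a proper custom RDD.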