The core problem is that enumerating objects in S3 is very slow, and the way S3 is made to look like a directory tree kills performance whenever something walks that tree, as wildcard pattern matching of paths does.
The code in the post lists all of the children in one flat enumeration, which gives much better performance; it is essentially what ships with Hadoop 2.8 as the s3a listFiles(path, recursive) call, see HADOOP-13208.
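For illustration, here is a minimal sketch of that Hadoop 2.8+ call; the bucket and prefix are placeholders of mine, the rest is the standard FileSystem API. On s3a, listFiles(path, true) pages through flat LIST requests instead of walking a simulated directory tree, which is where the speedup comes from:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class S3AListing {
        public static void main(String[] args) throws Exception {
            // Placeholder path; point this at your own bucket/prefix.
            Path root = new Path("s3a://my-bucket/data/");
            FileSystem fs = root.getFileSystem(new Configuration());
            // recursive = true: one flat, paginated enumeration, no treewalk
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
            while (it.hasNext()) {
                System.out.println(it.next().getPath());
            }
        }
    }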
Once you have that listing, you have strings of object paths, which you can map to s3a/s3n paths for Spark to handle as text file inputs, and to which you can then apply your work:

    val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
    sc.textFile(files).map(...)

(sc.textFile accepts a comma-separated list of paths, so the whole set is read as a single RDD.)
And, as requested, here is the Java code for building the key list:
    String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
    objectListing.getObjectSummaries().forEach(summary ->
        keys.add(prefix + summary.getKey()));
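That fragment only covers a single page of results; S3 returns at most 1000 summaries per LIST call. Below is a hedged, self-contained sketch of the full loop, assuming the AWS SDK for Java v1; the bucket name and the ListAllKeys class are placeholders of mine, not part of the original answer:

    import java.util.ArrayList;
    import java.util.List;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class ListAllKeys {
        public static void main(String[] args) {
            String bucket = "my-bucket";              // placeholder bucket name
            String prefix = "s3a://" + bucket + "/";
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            List<String> keys = new ArrayList<>();
            ObjectListing objectListing = s3.listObjects(bucket);
            while (true) {
                for (S3ObjectSummary summary : objectListing.getObjectSummaries()) {
                    keys.add(prefix + summary.getKey());
                }
                if (!objectListing.isTruncated()) {
                    break;                            // last page reached
                }
                // Fetch the next page of up to 1000 object summaries.
                objectListing = s3.listNextBatchOfObjects(objectListing);
            }
            System.out.println(keys.size() + " objects listed");
        }
    }

Note that here each entry already carries the s3a:// prefix, so the comma-joined list can be handed straight to Spark's textFile without the extra map step from the Scala example above.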
Note also that I switched from s3n to s3a: provided you have the hadoop-aws and amazon-sdk JARs on your classpath, the s3a connector is the one you should be using. It is better, and it is the one that is maintained and tested against people's workloads (mine included). See Hadoop S3 Connector History.