I would recommend just renaming them with the FileSystem class directly, without a map (i.e. only in the driver). Renaming 100 thousand files that way should not matter much; if it turns out to be too slow, you can try multithreading (see the sketch further below). Just do something like:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import java.io.File;

FileSystem fileSystem = new Path("").getFileSystem(new Configuration());
File[] files = FileUtil.listFiles(directory); // directory is a java.io.File pointing at the source folder
for (File file : files) {
    fileSystem.rename(new Path(file.getAbsolutePath()), new Path("renamed"));
}
```
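If the plain loop is too slow, a minimal multithreading sketch could look like the following. The directory path, target naming scheme, and thread count are assumptions for illustration; it also assumes the Hadoop FileSystem client tolerates concurrent calls, which DistributedFileSystem generally does.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRename {
    public static void main(String[] args) throws Exception {
        FileSystem fileSystem = FileSystem.get(new Configuration());
        // Hypothetical source directory; list its contents through the Hadoop API.
        FileStatus[] statuses = fileSystem.listStatus(new Path("/data/input"));

        ExecutorService pool = Executors.newFixedThreadPool(8); // thread count is a guess, tune it
        for (FileStatus status : statuses) {
            Path source = status.getPath();
            // Hypothetical naming scheme: append a suffix next to the original file.
            Path target = new Path(source.getParent(), source.getName() + ".renamed");
            pool.submit(() -> {
                try {
                    fileSystem.rename(source, target);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```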
BTW, the error you get in Spark is because Spark requires the objects it uses inside closures to implement Serializable, which FileSystem does not.
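For completeness, if you did want the executors to do the renaming despite the advice above, the usual pattern is to create the FileSystem inside the task so nothing non-serializable is captured from the driver. This is only a sketch, not part of the original answer; the input paths and the target naming are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class ExecutorSideRename {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rename"));
        // Hypothetical list of paths to rename, distributed across the cluster.
        JavaRDD<String> paths = sc.parallelize(Arrays.asList("/data/a.txt", "/data/b.txt"));

        paths.foreachPartition(iterator -> {
            // Created on the executor, so the FileSystem never has to be serialized.
            FileSystem fs = FileSystem.get(new Configuration());
            while (iterator.hasNext()) {
                String p = iterator.next();
                fs.rename(new Path(p), new Path(p + ".renamed")); // hypothetical target naming
            }
        });

        sc.stop();
    }
}
```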
I cannot confirm this, but it would seem that every rename in HDFS involves the NameNode, since it keeps track of the full directory structure and the location of file blocks (confirmation), which means the renames cannot be effectively parallelized. According to this answer, a rename is a metadata-only operation, so it should be performed very quickly anyway.