How Spark Streaming identifies new files

How does Spark Streaming's fileStream identify new files in the monitored directory from one interval to the next?

Does this rely on new file names or a file creation timestamp or any other approach?

What is the meaning of the argument newFilesOnly?

fileStream(String directory, Class<K> kClass, Class<V> vClass, Class<F> fClass, Function<org.apache.hadoop.fs.Path,Boolean> filter, boolean newFilesOnly, org.apache.hadoop.conf.Configuration conf)
1 answer

The short answer on monitoring is that it uses the file modification time (isNewFile uses getFileModTime).
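A minimal sketch of a modification-time check of this kind, loosely modeled on my reading of Spark's file stream logic rather than copied from it. The class name NewFileCheckSketch, the parameter names, and the set of already-selected paths are assumptions made for illustration; the sketch also folds the "remember this path" bookkeeping into the check itself.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the selection logic; not Spark's actual class.
final class NewFileCheckSketch {
    // Paths that were already handed to an earlier batch.
    private final Set<String> recentlySelected = new HashSet<>();

    // Returns true if the file should be picked up in the batch ending at currentTimeMs.
    public boolean isNewFile(String path, long modTimeMs, long currentTimeMs, long ignoreThresholdMs) {
        // Modified at or before the ignore threshold: treated as too old.
        if (modTimeMs <= ignoreThresholdMs) {
            return false;
        }
        // Modified after the current batch time: left for a later batch.
        if (modTimeMs > currentTimeMs) {
            return false;
        }
        // Set.add returns false when the path was already recorded,
        // so a file is selected at most once.
        return recentlySelected.add(path);
    }

    public static void main(String[] args) {
        NewFileCheckSketch check = new NewFileCheckSketch();
        System.out.println(check.isNewFile("/data/a.txt", 1_000L, 2_000L, 500L));
        System.out.println(check.isNewFile("/data/a.txt", 1_000L, 3_000L, 500L));
    }
}
```

In this sketch the file name matters only for remembering what was already selected; whether a file counts as new is decided by its modification time.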

As for newFilesOnly, it is not so straightforward, but you can work it out from the code.

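TL;DR: if (newFilesOnly = false), files that already exist in the directory and were modified up to about 1 minute earlier are also taken into account.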

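With newFilesOnly = false, initialModTimeIgnoreThreshold is 0. modTimeIgnoreThreshold is then (currentTime - durationToRemember.milliseconds). durationToRemember is a fixed value for the stream; as far as I can tell it works out to about 1 minute by default, so with false, files modified within roughly the last minute are still considered.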
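The threshold arithmetic can be sketched as follows. This is an illustration under assumptions, not Spark's source: the class IgnoreThresholdSketch and its method signatures are hypothetical, the initial threshold for newFilesOnly = true is assumed to be the time the stream was created, and a 60-second remember window is assumed.

```java
// Hypothetical helper illustrating the threshold arithmetic; not Spark's actual class.
final class IgnoreThresholdSketch {
    // Assumed remember window of 60 seconds (an assumption for this sketch).
    static final long DURATION_TO_REMEMBER_MS = 60_000L;

    // newFilesOnly = true: ignore anything modified before the stream was created.
    // newFilesOnly = false: no initial cutoff at all.
    public static long initialModTimeIgnoreThreshold(boolean newFilesOnly, long streamStartMs) {
        return newFilesOnly ? streamStartMs : 0L;
    }

    // The effective cutoff is the later of the initial threshold and the
    // trailing edge of the remember window.
    public static long modTimeIgnoreThreshold(long initialThresholdMs, long currentTimeMs) {
        return Math.max(initialThresholdMs, currentTimeMs - DURATION_TO_REMEMBER_MS);
    }

    public static void main(String[] args) {
        long start = 1_000_000L;
        long firstBatch = start + 10_000L;
        long withOld = modTimeIgnoreThreshold(initialModTimeIgnoreThreshold(false, start), firstBatch);
        long onlyNew = modTimeIgnoreThreshold(initialModTimeIgnoreThreshold(true, start), firstBatch);
        System.out.println("newFilesOnly=false cutoff: " + withOld);
        System.out.println("newFilesOnly=true cutoff: " + onlyNew);
    }
}
```

Under these assumptions the flag only matters for the first remember window after start-up: once currentTime - durationToRemember passes the stream's start time, both settings produce the same cutoff.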

Source: https://habr.com/ru/post/1584528/