How Spark Streaming identifies new files

Question

How does Spark Streaming's fileStream identify new files in the monitored directory from one interval to the next?

Does it rely on file names, on a file creation timestamp, or on some other approach?

What is the meaning of the argument newFilesOnly?

fileStream(String directory, Class<K> kClass, Class<V> vClass, Class<F> fClass, Function<org.apache.hadoop.fs.Path,Boolean> filter, boolean newFilesOnly, org.apache.hadoop.conf.Configuration conf)

apache-spark spark-streaming

Vijay innamuri Apr 24 '15 at 16:01

1 answer

Justin Pihony · Answer 1 · 2015-04-24T18:11:33+0000

The quick answer on monitoring is that it uses the file modification time (isNewFile uses getFileModTime).
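To make the modification-time idea concrete, here is a minimal, Spark-free sketch of such a check. It is not Spark's code: the class NewFileCheckSketch, its parameters, and the recentlySelected set are hypothetical names chosen for illustration, and the three rules (too old, modified after the current batch time, already picked up) are my reading of how the check behaves rather than a copy of the implementation.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified model of a modification-time based "is this file new?" check. */
final class NewFileCheckSketch {
    // Paths already picked up by earlier batches (hypothetical bookkeeping).
    private final Set<String> recentlySelected = new HashSet<>();

    /**
     * @param path           path of the candidate file
     * @param modTimeMs      file modification time, in epoch milliseconds
     * @param ignoreBeforeMs files modified at or before this instant are ignored
     * @param batchTimeMs    time of the batch currently being built
     */
    boolean isNewFile(String path, long modTimeMs, long ignoreBeforeMs, long batchTimeMs) {
        if (modTimeMs <= ignoreBeforeMs) {
            return false; // too old to be considered
        }
        if (modTimeMs > batchTimeMs) {
            return false; // modified after this batch's time, leave it for a later batch
        }
        // Set.add returns false when the path was already selected earlier.
        return recentlySelected.add(path);
    }

    public static void main(String[] args) {
        NewFileCheckSketch check = new NewFileCheckSketch();
        System.out.println(check.isNewFile("a.txt", 1_500L, 1_000L, 2_000L));   // true
        System.out.println(check.isNewFile("a.txt", 1_500L, 1_000L, 2_000L));   // false
        System.out.println(check.isNewFile("old.txt", 900L, 1_000L, 2_000L));   // false
    }
}
```

In this model the file name only matters for de-duplication; whether a file counts as new is decided by its modification time.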

As for newFilesOnly, this is not so straightforward, but you can work it out from the code.

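TL;DR: the flag does less than its name suggests, in particular when it is off (newFilesOnly = false).

With newFilesOnly = false, initialModTimeIgnoreThreshold is set to 0. modTimeIgnoreThreshold is then the larger of that value and (currentTime - durationToRemember.milliseconds), and the latter is JUST a fixed window. So files older than that window (about 1 minute by default) are still ignored, even when the flag is false.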
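The threshold arithmetic can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Spark's code: ModTimeThresholdSketch and its method names are hypothetical, the 60-second remember window is an assumed default, and the rule that newFilesOnly = true starts the initial threshold at the stream's start time is my understanding of the behavior.

```java
/** Models how a modification-time cutoff could be derived from newFilesOnly. */
final class ModTimeThresholdSketch {

    // Assumption: with newFilesOnly = true the initial cutoff is the stream start time,
    // with newFilesOnly = false it is 0 (no initial cutoff).
    static long initialThreshold(boolean newFilesOnly, long streamStartMs) {
        return newFilesOnly ? streamStartMs : 0L;
    }

    // The effective cutoff is the larger of the initial cutoff and
    // (current time - remember window).
    static long threshold(long initialThresholdMs, long currentTimeMs, long rememberMs) {
        return Math.max(initialThresholdMs, currentTimeMs - rememberMs);
    }

    public static void main(String[] args) {
        long start = 1_000_000L;       // stream start, epoch millis
        long remember = 60_000L;       // assumed remember window of one minute
        long firstBatch = start + 10_000L;
        long laterBatch = start + 120_000L;

        // Shortly after start, the flag changes the cutoff.
        System.out.println(threshold(initialThreshold(true, start), firstBatch, remember));  // 1000000
        System.out.println(threshold(initialThreshold(false, start), firstBatch, remember)); // 950000

        // Once the stream has run longer than the window, both cutoffs are identical.
        System.out.println(threshold(initialThreshold(true, start), laterBatch, remember));  // 1060000
        System.out.println(threshold(initialThreshold(false, start), laterBatch, remember)); // 1060000
    }
}
```

Under these assumptions, newFilesOnly = false only reaches back one remember window before the first batch; it does not pull in every file that already exists in the directory.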
