Joining streams with static datasets is a useful feature of Structured Streaming. But in each batch, the static dataset is read again from its source. Since these sources are not always dynamic, there would be a performance gain in caching the static dataset for a certain period of time (or number of batches): after the specified period / number of batches the dataset is reloaded from the source, otherwise it is served from the cache.
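The refresh-every-N-batches bookkeeping described above can be sketched independently of Spark. This is a minimal illustration, not Spark API code: `load_fn` is a hypothetical callable that would wrap the actual work (unpersisting the old DataFrame, re-reading from the source, and persisting the new one), and `refresh_every` is the chosen batch interval.

```python
class RefreshingCache:
    """Reload a dataset every `refresh_every` batches, otherwise reuse the cached copy."""

    def __init__(self, load_fn, refresh_every):
        self.load_fn = load_fn          # hypothetical loader; in Spark it would
                                        # unpersist the old DataFrame, spark.read
                                        # the source again, then persist the result
        self.refresh_every = refresh_every
        self._cached = None
        self._batches_seen = 0

    def get(self):
        # Reload on the first batch and then every `refresh_every` batches.
        if self._cached is None or self._batches_seen % self.refresh_every == 0:
            self._cached = self.load_fn()
        self._batches_seen += 1
        return self._cached
```

In Structured Streaming this logic could live inside a `foreachBatch` sink (available since Spark 2.4), which hands each micro-batch a `batchId` that can drive the refresh decision.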
With Spark Streaming (DStreams) I handled this by caching the static dataset and unpersisting it after a certain number of batches, but for some reason this no longer works with Structured Streaming.
Any suggestions on how to achieve this with Structured Streaming?
Chris