> this use case looks pretty common
Not really, because it just doesn't work the way you (probably) expect. Since each task operates on standard Scala Iterators, these operations are squashed together, which means that in practice every operation blocks. Assuming you have three URLs ["x", "y", "z"], the code executes in the following order:
```scala
Await.result(httpCall("x", 10.seconds))
Await.result(httpCall("y", 10.seconds))
Await.result(httpCall("z", 10.seconds))
```
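For reference, this is presumably the shape of the code in question; the signature of `httpCall` here is an assumption inferred from the snippet above, and `fetchAllBlocking` is just an illustrative name:

```scala
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
import org.apache.spark.rdd.RDD

// Hypothetical signature, inferred from the snippet above: an async HTTP call
// that takes a request timeout and completes with the response body.
def httpCall(url: String, timeout: FiniteDuration): Future[String] = ???

def fetchAllBlocking(urls: RDD[String]): RDD[String] =
  // Blocking inside map: the next request is not even started until the
  // previous Await.result has returned, so requests run strictly one by one.
  urls.map(url => Await.result(httpCall(url, 10.seconds), 10.seconds))
```

Because `Await.result` is called element by element inside `map`, there is never more than one request in flight per task.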
You can easily reproduce the same behavior locally. If you want to execute your code asynchronously, you have to handle it explicitly with mapPartitions:
```scala
rdd.mapPartitions(iter => {
  ??? // Submit requests
  ??? // Wait until all requests completed and return Iterator of results
})
```
but it is relatively tricky. There is no guarantee that all the data for a given partition fits into memory, so you will probably also need some batching mechanism, as sketched below.
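Here is a minimal sketch of that pattern, reusing the hypothetical `httpCall` from above and processing each partition in fixed-size batches so that only one batch of requests is in flight at a time (`fetchAllBatched` and `batchSize` are illustrative names, not part of any API):

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
import org.apache.spark.rdd.RDD

// Same hypothetical async client as above.
def httpCall(url: String, timeout: FiniteDuration): Future[String] = ???

def fetchAllBatched(urls: RDD[String], batchSize: Int = 100): RDD[String] =
  urls.mapPartitions { iter =>
    iter
      .grouped(batchSize) // bounds how many requests/responses are held at once
      .flatMap { batch =>
        // Submit every request in the batch up front so they run concurrently ...
        val inFlight = batch.map(url => httpCall(url, 10.seconds))
        // ... then block once per batch until all of them have completed.
        Await.result(Future.sequence(inFlight), 10.minutes)
      }
  }
```

Note that Future.sequence fails as soon as any request in the batch fails, so you usually want per-request error handling on top of this (see the last sketch below).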
That said, I cannot reproduce the problem you described; it may be a configuration issue or a problem with httpCall itself.
On a side note, allowing a single timeout to kill an entire task does not seem like a good idea.
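One way to avoid that, again assuming the same hypothetical `httpCall`: recover each Future individually, so a timed-out or failed request yields a None for that URL instead of an exception that fails (and re-runs) the whole task.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.Future
import scala.util.control.NonFatal

// Same hypothetical async client as above.
def httpCall(url: String, timeout: FiniteDuration): Future[String] = ???

// A failed or timed-out request becomes None for that URL instead of an
// exception, so one bad response no longer fails the whole Spark task.
def safeCall(url: String): Future[Option[String]] =
  httpCall(url, 10.seconds).map(Option(_)).recover { case NonFatal(_) => None }
```

safeCall can then replace httpCall inside the batched mapPartitions sketch above.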