I am running a local Spark job that processes some log files. The processing is done in several passes, because each month's logs need to be handled separately. Everything works fine as long as it processes one month at a time; even 3-4 iterations of the loop manage to finish. Beyond that, however, I start getting out-of-memory errors:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4e506c7000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
12288 bytes for committing reserved memory.
./runspark.sh: line 1: 6222 Aborted (core dumped)
I launch the job with:
spark-submit run.py
where run.py looks something like this:
from pyspark import SparkConf, SparkContext
def do(sc, i):
    rdd = sc.binaryFiles("cache/{}".format(i))
    rdd1 = rdd.filter(...).map(...)
    rdd2 = rdd.filter(...).map(...)
    rdd1.persist()
    rdd2.persist()
    ...
    <processing>
    ...
    rdd1.unpersist()
    rdd2.unpersist()
if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    sc = SparkContext(conf=conf)
    for i in range(12):
        do(sc, i)
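Inside do() I already unpersist both RDDs. In case the Python-side references themselves are what keeps the JVM-side objects alive, one thing I am considering is dropping them explicitly and forcing a garbage collection at the end of each call. This is only a guess on my part, not something I have verified; the body of do() is elided as above:

import gc

def do(sc, i):
    ...
    rdd1.unpersist()
    rdd2.unpersist()
    # drop the Python references so the corresponding JVM objects can be
    # released, then force a collection -- unverified guess on my part
    del rdd, rdd1, rdd2
    gc.collect()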
Watching the Spark web UI, I don't see anything lingering on the Storage tab for longer than one iteration. But looking at htop, memory usage keeps growing. Given that I unpersist the RDDs that could be holding memory, I am at a loss as to what is eating it all up. This is Spark 1.5.2.
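One workaround I am considering (untested, purely my own sketch) is giving each month a fresh SparkContext and stopping it once that month is done, in the hope that whatever the context holds onto gets released:

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    for i in range(12):
        conf = SparkConf().setMaster("local[*]")
        sc = SparkContext(conf=conf)
        try:
            do(sc, i)
        finally:
            # stop() releases the context's resources; whether the memory
            # actually goes back to the OS is exactly what I'm unsure about
            sc.stop()

Would that be expected to help, or is the growing memory coming from somewhere else entirely?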