Running out of memory with pyspark only when looping

I am running a local Spark job that processes some log files. I make several passes over the data, because the logs for each month need to be processed separately. Everything works fine when a single month is processed at a time; even 3-4 months in the loop manage to finish. Beyond that, however, I get memory errors:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4e506c7000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 12288 bytes for committing reserved memory.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f4e29ee7000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
./runspark.sh: line 1:  6222 Aborted                 (core dumped)

I launch the job with:

spark-submit --driver-memory=12g run.py

where run.py looks something like this:

from pyspark import SparkConf, SparkContext

def do(sc, i):
    rdd = sc.binaryFiles("cache/{}".format(i))
    rdd1 = rdd.filter(...).map(...)
    rdd2 = rdd.filter(...).map(...)
    rdd1.persist()
    rdd2.persist()
    ...
    <processing>
    ...
    rdd1.unpersist()
    rdd2.unpersist()

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    sc = SparkContext(conf=conf)
    for i in range(12):
        do(sc, i)

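Nothing in the loop intentionally keeps references to the processed RDDs between iterations. One thing I am considering (not in the script above; gc_both_sides is just a hypothetical helper, and sc._jvm is PySpark's internal py4j gateway) is forcing a collection on both sides of the py4j bridge after each do(sc, i), to rule out objects that simply have not been collected yet:

import gc

def gc_both_sides(sc):
    # Release Python-side objects that may still pin py4j proxies to JVM RDDs,
    # then ask the JVM itself to run a garbage collection.
    gc.collect()
    sc._jvm.System.gc()

Calling gc_both_sides(sc) at the end of each loop iteration would at least show whether the growth is just deferred collection.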
Watching the Spark web UI, I don't see anything taking up memory on the Storage tab for longer than a single iteration. But in htop, memory usage keeps growing. Given that I unpersist the RDDs that could be holding memory, I am at a loss as to what is eating it all. This is Spark 1.5.2.
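To make that growth measurable rather than just eyeballed in htop, one option is to log resident memory after every iteration. A minimal sketch, assuming a Linux box with procps ps available (log_memory is a hypothetical helper, not part of the script above):

import resource
import subprocess

def log_memory(tag):
    # Peak resident set size of this Python driver process (KB on Linux).
    py_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Current RSS of every running java process; with spark-submit in local
    # mode the Spark JVM is one of these.
    jvm = subprocess.check_output(["ps", "-C", "java", "-o", "pid=,rss="])
    print("{}: python peak RSS {} KB; java pid/rss:\n{}".format(
        tag, py_kb, jvm.decode()))

Calling log_memory("month {}".format(i)) after each do(sc, i) would show whether the growth is on the Python side, in the JVM, or both.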


Source: https://habr.com/ru/post/1616670/

