Does it make sense to use Spark in local mode on one large machine?

I work with ~120 GB of CSV files (from 1 GB to 20 GB each), on a machine with 220 GB of RAM and 36 cores.

I was wondering whether it makes sense to use Spark in local mode for this analysis. I like Spark's built-in parallelism (and its PySpark API), and I already have a good local environment set up.

I want to do operations like joins and aggregations, then run machine learning on the transformed dataset. Python tools like pandas are largely single-threaded, which seems like a massive waste: using all 36 cores should be much faster.

1 answer

Yes, it makes sense. Spark can run on a single node (in so-called local mode) and will use all of the machine's cores, so it fits this kind of workload well.

To run Spark locally on one node and use every core, pass the following option when launching ./spark-shell or ./spark-submit:

--master local[*]

For example:

./spark-submit --master local[*] <your-app-name> <your-app-args>

Since everything runs on a single node, there is no network overhead: data never has to be shuffled between machines.
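Equivalently, you can set the master programmatically from PySpark instead of on the command line. A minimal sketch, assuming PySpark is installed; the input path `/data/csv/` and the column name `category` are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# local[*] = run Spark on this machine, one worker thread per core.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-csv-aggregation")
    .getOrCreate()
)

# Read a whole directory of CSV files as one DataFrame
# (path and column names are illustrative, not from the original post).
df = spark.read.csv("/data/csv/", header=True, inferSchema=True)

# Example aggregation: row counts per group, computed in parallel.
df.groupBy("category").count().show()

spark.stop()
```

With `inferSchema=True` Spark makes an extra pass over the data to guess column types; for 120 GB of input it is usually faster to declare an explicit schema instead.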

Note that by default Spark allocates very little memory; the driver gets only 512 MB. Increase it, for example by setting spark.driver.memory, either on the spark-submit command line or programmatically via SparkConf.
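Concretely, the memory setting can be raised with the standard --driver-memory flag of spark-submit. A sketch for the 220 GB machine described above; the 200g value and the script name are example choices, not prescriptions:

```shell
# In local mode the driver and the executors share a single JVM,
# so --driver-memory is the setting that controls the heap size.
# Leave some headroom below the machine's 220 GB of physical RAM.
./spark-submit \
  --master "local[*]" \
  --driver-memory 200g \
  your_app.py your-args
```

Note that spark.driver.memory must be set before the JVM starts, so when running a driver locally it is safest to set it via spark-submit (or spark-defaults.conf) rather than on a SparkConf after the session exists.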


Source: https://habr.com/ru/post/1598245/
