I work with ~120 GB of CSV files (from 1 GB to 20 GB each). I am using a machine with 220 GB of RAM and 36 cores.
I was wondering whether it makes sense to use Spark locally for this analysis. I really like Spark's natural parallelism (and the pyspark API), and I have a good laptop environment.
I want to do things like unions and aggregations and then run machine learning on the transformed dataset. Python tools like pandas use only one thread, which seems like a massive waste: using all 36 threads should be much faster.
Yes, it makes sense. Spark runs on a single node just fine; in effect you run a one-node cluster. When you launch your application with ./spark-submit, pass:
--master local[*]
For example:
./spark-submit --master local[*] <your-app-name> <your-apps-args>
This runs Spark on your single node with as many worker threads as there are logical cores on the machine.
Note, however, that by default the driver gets only 512 MB of memory; for data of this size you will want to raise it, either with a command-line option or programmatically via SparkConf.
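For completeness, here is a minimal PySpark sketch of the same setup done from inside a script instead of through spark-submit. The file path, column name, and memory size below are illustrative assumptions, not values from the question or the answer:

from pyspark.sql import SparkSession

# Build a local-mode session: local[*] gives one worker thread per logical core.
# spark.driver.memory must be set before the JVM starts; when launching through
# spark-submit, set it on the command line instead.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("csv-analysis")
    .config("spark.driver.memory", "180g")  # assumed size, leaving headroom out of 220 GB
    .getOrCreate()
)

# Reading with a glob unions all the CSV files into one DataFrame.
# "/data/csv/*.csv" and "some_key" are hypothetical placeholders.
df = spark.read.csv("/data/csv/*.csv", header=True, inferSchema=True)

# The kind of aggregation the question mentions.
summary = df.groupBy("some_key").count()
summary.show()

spark.stop()

One detail worth knowing: in local mode the driver and the executors share a single JVM, so spark.driver.memory is the one memory setting that matters.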
Source: https://habr.com/ru/post/1598245/