pyspark mllib vs pyspark ml packages

What is the difference between pyspark.mllib and pyspark.ml?

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html

pyspark.mllib appears to implement its algorithms at the RDD level, while pyspark.ml appears to target the DataFrame level.

One difference I found is that pyspark.ml implements pyspark.ml.tuning.CrossValidator, while pyspark.mllib does not.
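
For reference, here is a minimal sketch of my own (the estimator, grid values, and the assumed `train_df` DataFrame are purely illustrative, not taken from either package's docs) of how the pyspark.ml cross-validation pieces fit together:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# CrossValidator wraps an estimator, a parameter grid, and an evaluator,
# and operates on a DataFrame with "features" and "label" columns.
lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# train_df is assumed to be a pyspark.sql.DataFrame prepared elsewhere:
# cv_model = cv.fit(train_df)
```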

As far as I understand, mllib is the library that is supposed to be used when implementing algorithms on the Apache Spark framework, but there seems to be a split?

There doesn't seem to be any interoperability between the two without converting types, since each contains a different package structure.

1 answer

From my experience, pyspark.mllib classes can only be used with pyspark.RDDs, whereas (as you mention) pyspark.ml classes can only be used with pyspark.sql.DataFrames. This is reflected in the documentation for pyspark.ml, where the very first entry in the package description reads:

DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
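
To make that split concrete, here is a rough sketch of my own (the toy data is invented) training the same kind of logistic regression through each API, once on an RDD of LabeledPoint and once on a DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# pyspark.mllib: RDD-based API, trained on an RDD of LabeledPoint.
rdd = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                      LabeledPoint(1.0, [1.0, 0.0])])
mllib_model = LogisticRegressionWithLBFGS.train(rdd, iterations=10)

# pyspark.ml: DataFrame-based API, trained on a DataFrame with
# "label" and "features" columns.
df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 1.0])),
                            (1.0, Vectors.dense([1.0, 0.0]))],
                           ["label", "features"])
ml_model = LogisticRegression(maxIter=10).fit(df)
```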

Now, I recall an article I read a while back about the three APIs available in Spark 2.0, their relative strengths/weaknesses, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the middle of performance testing on new client servers and was curious whether there would ever be a scenario where it would be worth developing an RDD-based approach as opposed to a DataFrame-based one (my approach of choice), but I digress.

The gist was that there are situations in which each is well suited, and others in which it may not be. One example I remember is that if your data is already structured, DataFrames offer some performance benefits over RDDs, and the advantage apparently grows as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when cached than RDDs. In summary, the author concluded that RDDs are great for low-level operations, but for higher-level operations, viewing data, and tying in with other APIs, DataFrames and Datasets are superior.
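
As a self-contained sketch of how one might reproduce the caching observation (the record count and column names are arbitrary, my own), cache the same records once as a plain RDD and once as a DataFrame, then compare the two entries on the Spark UI "Storage" tab:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The same records cached twice: once as an RDD of Python tuples, once as a
# DataFrame (which Spark stores in a compressed columnar format when cached).
records = [(i, float(i) * 0.5) for i in range(100000)]

rdd = sc.parallelize(records).cache()
rdd.count()   # materialize the RDD cache

df = spark.createDataFrame(records, ["id", "value"]).cache()
df.count()    # materialize the DataFrame cache

# Compare the cached sizes on the Spark UI "Storage" tab.
```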

So, to bring your question full circle, I believe the answer is a resounding pyspark.ml, since the classes in that package are designed to utilize pyspark.sql.DataFrames. I would imagine that the performance difference for complex algorithms implemented in each of these packages would be significant if you tested them against the same data as a DataFrame vs. an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and perform better.
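
As an illustration of that DataFrame-first workflow, here is a minimal pyspark.ml pipeline sketch of my own (the column names and data are invented): assemble features, fit a model, and inspect the predictions with ordinary DataFrame operations.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 0.5, 1.0), (0.5, 2.5, 0.0), (3.0, 0.1, 1.0)],
    ["x1", "x2", "label"])

# A two-stage pipeline: combine raw columns into a feature vector, then classify.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LogisticRegression(maxIter=10),
])

model = pipeline.fit(df)
model.transform(df).select("x1", "x2", "label", "prediction").show()
```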



