From my experience, pyspark.mllib classes can only be used with pyspark.RDD, whereas (as you say) pyspark.ml classes can only be used with pyspark.sql.DataFrame. The first entry in the pyspark.ml package documentation reads:
DataFrame-based machine learning APIs allow users to quickly build and configure practical machine learning pipelines.
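To make the distinction concrete, here is a minimal sketch contrasting the two packages on the same toy data. The app name and the tiny two-row dataset are just placeholders for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

spark = SparkSession.builder.appName("ml-vs-mllib").getOrCreate()

# pyspark.ml: expects a DataFrame with "label" and "features" columns
df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])
ml_model = LogisticRegression(maxIter=10).fit(df)

# pyspark.mllib: expects an RDD of LabeledPoint
rdd = spark.sparkContext.parallelize(
    [LabeledPoint(0.0, [0.0, 1.1]),
     LabeledPoint(1.0, [2.0, 1.0])])
mllib_model = LogisticRegressionWithLBFGS.train(rdd, iterations=10)
```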
Now, I recall an article I read some time ago about the three APIs available in Spark 2.0, their relative strengths and weaknesses, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the middle of performance testing on new client servers at the time and was curious whether there would ever be a scenario in which it would be worth developing an RDD-based approach rather than a DataFrame-based approach (my approach of choice), but I digress.
The bottom line was that there are situations in which each is highly suited and others in which it might not be. One example I remember is that if your data is already structured, DataFrames confer some performance advantages over RDDs, and the gap appears to grow as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory than RDDs when cached. In summation, the author concluded that RDDs are excellent for low-level operations, but for high-level operations, viewing data, and tying in with the other APIs, DataFrames and Datasets are superior.
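You can see the caching difference for yourself with a sketch like the following, assuming an existing SparkSession named `spark` and a hypothetical file "data.csv"; after both counts, the Storage tab of the Spark UI shows the in-memory size of each, and the DataFrame typically occupies far less space thanks to its compact Tungsten binary encoding:

```python
# Cache the same data as an RDD of split lines and as a DataFrame,
# then materialize both so they actually land in memory.
rdd = spark.sparkContext.textFile("data.csv").map(lambda line: line.split(","))
rdd.cache()
rdd.count()

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.cache()
df.count()
```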
So, to bring this back to your question, I believe the answer is a resounding pyspark.ml, since the classes in this package are designed to work with pyspark.sql.DataFrames. I would also surmise that the performance of complex algorithms implemented in each of these packages would differ significantly if you tested them against the same data as a DataFrame vs. an RDD. In addition, viewing the data and developing compelling visuals will be both more intuitive and perform better.
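As a closing illustration of why pyspark.ml pairs so naturally with DataFrames, here is a minimal Pipeline sketch. It assumes a DataFrame `df` with hypothetical raw feature columns "x1" and "x2" plus a "label" column; the column names are placeholders, not anything from your data:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble raw columns into a feature vector, then fit a classifier,
# all expressed as stages of a single Pipeline.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(df)
predictions = model.transform(df)

# Inspecting results stays in DataFrame land, which keeps exploration easy.
predictions.select("label", "prediction", "probability").show(5)
```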