How do select expressions affect the performance of Spark DataFrames?

I use a lot of select statements on Spark DataFrames, and I wonder what effect they have on the performance of subsequent transformations once an action triggers execution.

For example, take a DataFrame df with 10 columns, a to j.

  • What is the effect of using as to rename every column?

    df.select(df("a").as("1"), ..., df("j").as("10"))

  • What if I select a subset (e.g. 5 columns)?

    val df2 = df.select(df("a"), ..., df("e"))

    How is this projection handled? Is df preserved (since df2 is a projection of it), so that df can serve as a kind of reference? Or is df2 created fresh and df discarded? (Ignoring any persistence here.)

  • What about general Column expressions used in select?

  • Are performance tests available for the above cases? Are performance measurements generally available? If not, what is the best way to measure performance?
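
To make the cases above concrete, here is a minimal sketch of how the two selects could be set up and their query plans compared. It is only an illustration under assumptions that are not part of the question: Spark 2.x or later (for SparkSession), a local master, a made-up one-row df, and the hypothetical object name SelectPlans. explain(true) prints the logical plans and the physical plan, which is one way to see what each select turns into before timing anything.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only for this illustration
object SelectPlans {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to have something to run against
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-plans")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for df: one row, 10 columns named a to j
    val df = Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
      .toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")

    // Case 1: rename every column with as ("1" to "10")
    val renamed = df.select(
      df.columns.zipWithIndex.map { case (name, i) => df(name).as((i + 1).toString) }: _*
    )

    // Case 2: project a subset of 5 columns
    val df2 = df.select(df("a"), df("b"), df("c"), df("d"), df("e"))

    // Print parsed, analyzed and optimized logical plans plus the physical plan
    renamed.explain(true)
    df2.explain(true)

    spark.stop()
  }
}
```

Comparing the printed plans for renamed and df2 against the plan for df itself shows how each select is represented, which may help decide which cases are worth benchmarking at all.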


Source: https://habr.com/ru/post/1609540/

