How to get an Iterator of Rows from a DataFrame in SparkSQL

I have a SparkSQL application that returns a large number of rows that do not fit in memory, so I cannot call collect on the DataFrame. Is there a way to get all these rows as an Iterator instead of collecting all of them into a list?

Note: I am running this SparkSQL application in yarn-client mode.

2 answers

Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution, but if you really want to do this, you can use the toLocalIterator method on the RDD:

val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
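
For example, here is a minimal sketch of consuming such an iterator on the driver (the DataFrame contents and the "value" column are hypothetical, and Spark 2.x is assumed); rows are pulled to the driver one partition at a time, so only the largest partition has to fit in driver memory:

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("toLocalIterator-example").getOrCreate()

// Hypothetical DataFrame; replace with your own query result
val df = spark.range(0, 1000000).toDF("value")

df.cache() // avoid recomputing the whole lineage once per fetched partition
val iter: Iterator[Row] = df.rdd.toLocalIterator

// The iterator streams rows partition by partition, never all at once
iter.foreach { row =>
  val v = row.getAs[Long]("value")
  // process v incrementally here, e.g. write it to a local file
}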

In fact, you can simply use df.toLocalIterator; here is the relevant code from the Spark source:

/**
 * Return an iterator that contains all of [[Row]]s in this Dataset.
 *
 * The iterator will consume as much memory as the largest partition in this Dataset.
 *
 * Note: this results in multiple Spark jobs, and if the input Dataset is the result
 * of a wide transformation (e.g. join with different partitioners), to avoid
 * recomputing the input Dataset should be cached first.
 *
 * @group action
 * @since 2.0.0
 */
def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
  withNewExecutionId {
    queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
  }
}
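
For completeness, a minimal sketch of calling it directly on the DataFrame from the first answer (Spark 2.0+ assumed). In Scala the method returns a java.util.Iterator, so converting it with asScala is convenient:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

df.cache() // as the doc comment above says, cache first if df comes from a wide transformation
val rows: Iterator[Row] = df.toLocalIterator().asScala
rows.take(10).foreach(println) // e.g. inspect the first ten rows on the driver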
