I have a large distributed file on HDFS, and every time I use sqlContext with the spark-csv package, it first downloads the whole file, which takes quite a lot of time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load("file_path")
Now, since I sometimes just want to do a few quick checks, all I need is some/any n lines of the file. I tried:
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load("file_path").head(n)
But these still start only after the whole file has been downloaded. Can't I limit the number of lines while reading the file itself? I mean the equivalent of pandas' nrows in spark-csv, for example:
import pandas
pd_df = pandas.read_csv("file_path", nrows=20)
Or could it be that Spark doesn't actually download the file in that first step, and the load is lazy? But in that case, why does the load step itself take so long?
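To pin down where the time goes, I timed the load step and the first action separately. This is a minimal sketch; "file_path" and n are the same placeholders as above:

import time

t0 = time.time()
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load("file_path")
t1 = time.time()  # the load alone is already slow for me
rows = df.take(n)  # first action on the DataFrame
t2 = time.time()
print("load: %.1f s, take(%d): %.1f s" % (t1 - t0, n, t2 - t1))

My suspicion is that inferSchema='true' forces a pass over the data already at load time, since Spark has to look at the values to guess the column types; that would explain why the load step is slow even before any action runs.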
In short, I want to read only n rows of the file, so that, for example, df.count() would return n rather than the total row count. Is that possible?
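For what it's worth, the closest I can express with real DataFrame methods is limit(), though I don't know whether it actually avoids reading the whole file (n is a placeholder, df is the DataFrame loaded above):

df_n = df.limit(n)   # limit() is a transformation, so it should be lazy
print(df_n.count())  # prints n (or fewer, if the file has fewer rows)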