PySpark distinct().count() on a CSV file

I am new to Spark and I am trying to do a distinct().count() based on some fields of a CSV file.

CSV structure (no header):

 id,country,type
 01,AU,s1
 02,AU,s2
 03,GR,s2
 03,GR,s2

To load the .csv I typed:

 lines = sc.textFile("test.txt") 

and a distinct count on lines returned 3, as expected:

 lines.distinct().count() 

But I have no idea how to do a distinct count based on, say, id and country.

2 answers

In this case you need to select the columns you want to consider, and then count the distinct values:

 sc.textFile("test.txt") \
     .map(lambda line: (line.split(',')[0], line.split(',')[1])) \
     .distinct() \
     .count()

This version is written for clarity; the lambda can be optimized to avoid calling line.split twice.
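The optimization hinted at here can be sketched in plain Python, without a Spark cluster, to show what the per-line lambda computes: split each line once and keep only the first two fields as a hashable tuple. The sample rows match the CSV in the question; the Spark one-liner in the trailing comment assumes an existing SparkContext named `sc`:

```python
# Sample rows matching the CSV in the question (no header).
rows = ["01,AU,s1", "02,AU,s2", "03,GR,s2", "03,GR,s2"]

# Split each line once and keep only (id, country).
# A tuple is hashable, which Spark's distinct() requires of its elements.
keys = [tuple(line.split(",")[:2]) for line in rows]

# Distinct count over the (id, country) pairs.
distinct_count = len(set(keys))
print(distinct_count)  # 3

# Equivalent Spark pipeline (assumes an existing SparkContext `sc`):
# sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:2])).distinct().count()
```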


The split line can be optimized as follows:

 sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()

(Note: line.split returns a list, which is not hashable; wrapping it in tuple is needed for distinct() to work.)
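The slice used here, and the hashability caveat it runs into, can be checked in plain Python without Spark (the sample line is taken from the question's CSV):

```python
# One sample line from the CSV in the question.
line = "03,GR,s2"

# [:-1] drops the last column (type), keeping id and country.
fields = line.split(",")[:-1]
print(fields)  # ['03', 'GR']

# Lists are not hashable, so Spark's distinct() cannot deduplicate them;
# converting each element to a tuple avoids that.
key = tuple(fields)
print(key)  # ('03', 'GR')
```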

Source: https://habr.com/ru/post/981117/
