PySpark distinct().count() on a CSV file

I am new to Spark and I am trying to do a distinct().count() based on some fields of a CSV file.

CSV structure (no header):

 id,country,type
 01,AU,s1
 02,AU,s2
 03,GR,s2
 03,GR,s2

To load the .csv I typed:

 lines = sc.textFile("test.txt") 

and a distinct count on lines returned 3, as expected:

 lines.distinct().count() 

But I have no idea how to do a distinct count based on, say, id and country.

2 answers

In this case you need to select the columns you want to consider, and then count the distinct values:

 sc.textFile("test.txt") \
     .map(lambda line: (line.split(',')[0], line.split(',')[1])) \
     .distinct() \
     .count()

This version is written for clarity; the lambda can be optimized to avoid calling line.split twice.
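The optimization hinted at here can be sketched in plain Python, without a Spark cluster, to show what the per-line lambda computes: split each line once and keep only the first two fields as a hashable tuple. The sample rows match the CSV in the question; the Spark one-liner in the trailing comment assumes an existing SparkContext named `sc`:

```python
# Sample rows matching the CSV in the question (no header).
rows = ["01,AU,s1", "02,AU,s2", "03,GR,s2", "03,GR,s2"]

# Split each line once and keep only (id, country).
# A tuple is hashable, which Spark's distinct() requires of its elements.
keys = [tuple(line.split(",")[:2]) for line in rows]

# Distinct count over the (id, country) pairs.
distinct_count = len(set(keys))
print(distinct_count)  # 3

# Equivalent Spark pipeline (assumes an existing SparkContext `sc`):
# sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:2])).distinct().count()
```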


The split line can be optimized as follows:

 sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()

(Note: line.split returns a list, which is not hashable; wrapping it in tuple is needed for distinct() to work.)
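The slice used here, and the hashability caveat it runs into, can be checked in plain Python without Spark (the sample line is taken from the question's CSV):

```python
# One sample line from the CSV in the question.
line = "03,GR,s2"

# [:-1] drops the last column (type), keeping id and country.
fields = line.split(",")[:-1]
print(fields)  # ['03', 'GR']

# Lists are not hashable, so Spark's distinct() cannot deduplicate them;
# converting each element to a tuple avoids that.
key = tuple(fields)
print(key)  # ('03', 'GR')
```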

Source: https://habr.com/ru/post/981117/
