How to compare two data frames and print the columns that differ in Scala

Question

We have two data frames:

expected data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two data frames is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We use the except function, df1.except(df2). The problem is that it returns each differing row as a whole. We want to see which columns differ within such a row (in this case, "romin" and "romino" in emp_name). We have had a lot of difficulty with this, and any help would be appreciated.
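
For anyone who wants to reproduce the situation, here is a minimal sketch of the setup described above. It assumes a local SparkSession and string-typed phone numbers; the object name RowDiffDemo, the helper rowDiff, and the variable names expected and actual are illustrative and not taken from the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowDiffDemo {
  // Rows that occur in `left` but not in `right`. This compares whole rows,
  // so it cannot say which column caused the mismatch.
  def rowDiff(left: DataFrame, right: DataFrame): DataFrame = left.except(right)

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for experimentation only.
    val spark = SparkSession.builder().master("local[*]").appName("row-diff-demo").getOrCreate()
    import spark.implicits._

    val cols = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

    val expected = Seq(
      (3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
      (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
      (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
      (4, "sanjose", "romin", "9848022331", 45123, "SanRamon")
    ).toDF(cols: _*)

    val actual = Seq(
      (3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
      (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
      (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
      (4, "sanjose", "romino", "9848022331", 45123, "SanRamon")
    ).toDF(cols: _*)

    // Shows the full row for emp_id 4, without pointing at emp_name.
    rowDiff(actual, expected).show()

    spark.stop()
  }
}
```

Note that except is not symmetric: actual.except(expected) yields the "romino" row, while expected.except(actual) would yield the "romin" row.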

scala compare bigdata apache-spark spark-dataframe

rominoushana Jun 2 '17 at 10:47

1 answer

himanshuIIITian · Accepted Answer · 2017-06-03T08:22:57+0000

From the scenario described in the question above, it seems the difference should be found between columns, not rows.

For this we need a selective (per-column) difference, which gives us the columns whose values differ, along with those values.

To apply the selective difference, we can write code like this:

First, we collect the column names shared by the expected and actual data frames.

    val columns = df1.schema.fields.map(_.name)

Then we compute the difference column by column. Judging by the output below, df1 is the actual data frame and df2 the expected one.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

Finally, we show only the columns that contain differing values.

    selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())

And we get only columns that contain different values. Like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+
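
One limitation of comparing each column in isolation is that row identity is lost: a changed value goes unreported if the same value appears in any other row of that column in the other data frame. If the data has a unique key (emp_id is assumed to be one here), an alternative is to join on the key and compare cell by cell. The following is only a sketch of that idea, not the method from the answer above; the object CellDiff, the function cellDiff, and the "_actual" suffix are made-up names for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object CellDiff {
  // One output row per (key, column) pair whose value differs.
  // Assumptions: both frames share the same columns, `key` is unique in each,
  // and there is at least one non-key column. The inner join means keys
  // missing from either side are not reported.
  def cellDiff(expected: DataFrame, actual: DataFrame, key: String): DataFrame = {
    val valueCols = expected.columns.filter(_ != key)

    // Suffix the non-key columns of `actual` so names stay unambiguous after the join.
    val renamed = valueCols.foldLeft(actual)((df, c) => df.withColumnRenamed(c, c + "_actual"))
    val joined = expected.join(renamed, Seq(key))

    valueCols.map { c =>
      joined
        .filter(!(col(c) <=> col(c + "_actual"))) // null-safe "not equal"
        .select(
          col(key),
          lit(c).as("column"),
          col(c).cast("string").as("expected"),
          col(c + "_actual").cast("string").as("actual"))
    }.reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for experimentation only.
    val spark = SparkSession.builder().master("local[*]").appName("cell-diff-demo").getOrCreate()
    import spark.implicits._

    val expected = Seq((2, "Hyderabad", "robin"), (4, "sanjose", "romin"))
      .toDF("emp_id", "emp_city", "emp_name")
    val actual = Seq((2, "Hyderabad", "robin"), (4, "sanjose", "romino"))
      .toDF("emp_id", "emp_city", "emp_name")

    cellDiff(expected, actual, "emp_id").show()

    spark.stop()
  }
}
```

For the sample data this produces a single row: emp_id 4, column emp_name, expected "romin", actual "romino". Casting the values to string lets columns of different types share the two output columns.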

Hope this helps!
