How to compare two data frames and print the columns that differ in Scala

We have two data frames:

expected data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two data frames is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We use the except function, df1.except(df2). The problem is that it returns every differing row in full. We want to see which columns differ within such a row (in this case, “romin” and “romino” in “emp_name”). We have had great difficulty with this, and any help would be appreciated.

1 answer

From the scenario described in the question above, it seems the difference should be found between columns, not rows.

For this we can apply a selective difference: compare the data frames one column at a time, which gives us the columns that contain different values along with those values.

To apply the selective difference, the code looks something like this:

  • First we need to find the columns in the expected and actual data frames.

    val columns = df1.schema.fields.map(_.name)

  • Then we need to find the difference for each column.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  • Finally, we need to find out which columns contain different values.

    selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())

This prints only the columns that contain different values, like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+
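
Putting the three steps together, a minimal self-contained sketch might look like the following. It assumes that df1 is the actual data frame and df2 the expected one (that direction is what yields “romino”). The object name ColumnDiff, the helper differingColumns, and the local SparkSession setup are illustrative names introduced here, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ColumnDiff {
  // For each column, compute the values present in `actual` but absent from
  // `expected`, and keep only the columns where at least one such value exists.
  // (Hypothetical helper wrapping the three steps of the answer.)
  def differingColumns(actual: DataFrame, expected: DataFrame): Seq[(String, DataFrame)] =
    actual.columns.toSeq
      .map(name => name -> actual.select(name).except(expected.select(name)))
      .filter { case (_, diff) => diff.count() > 0 }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("column-diff")
      .getOrCreate()
    import spark.implicits._

    val names = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

    val expected = Seq(
      (3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
      (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
      (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
      (4, "sanjose", "romin", "9848022331", 45123, "SanRamon")
    ).toDF(names: _*)

    val actual = Seq(
      (3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
      (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
      (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
      (4, "sanjose", "romino", "9848022331", 45123, "SanRamon")
    ).toDF(names: _*)

    // Show one single-column table per differing column (here only emp_name).
    differingColumns(actual, expected).foreach { case (_, diff) => diff.show() }

    spark.stop()
  }
}
```

Note that except compares each column as a set of distinct values, so the link back to the row is lost. If the row matters, selecting a key such as emp_id together with each column (for example df1.select("emp_id", name)) would probably preserve it, assuming emp_id is unique in both data frames.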

Hope this helps!


Source: https://habr.com/ru/post/1268496/

