How to remove duplicate records from a Hive table?

I am trying to learn about deleting duplicate records from a Hive table.

My Hive table "dynpart" has the columns Id, Name, Technology:

Id  Name  Technology
1   Abcd  Hadoop
2   Efgh  Java
3   Ijkl  MainFrames
2   Efgh  Java

We have options such as DISTINCT to use in a select query, but a select query only retrieves data from the table. Can anyone tell me how to use a delete query to remove duplicate rows from a Hive table?

I know that deleting or updating entries is not the recommended or standard way of working with Hive, but I want to know how to do it.

3 answers

You can use an insert overwrite statement to replace the table's contents with its distinct rows.

insert overwrite table dynpart select distinct * from dynpart;
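Before overwriting the table, it can help to check which rows are actually duplicated. A minimal sketch, assuming the dynpart table from the question with the columns Id, Name, Technology (the alias copies is only illustrative):

```sql
-- List every row that occurs more than once, with its number of copies.
-- Grouping by all columns means only exact duplicates are reported.
select Id, Name, Technology, count(*) as copies
from dynpart
group by Id, Name, Technology
having count(*) > 1;
```

Running the same query again after the insert overwrite should return no rows.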

create table temp as select distinct * from dynpart;
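The statement above leaves the deduplicated rows in a separate table. If the goal is to end up with a clean dynpart, the copy can be written back afterwards. A rough sketch, assuming dynpart is a plain, non-partitioned table and that temp is free to use as a staging table name:

```sql
-- Step 1: copy the distinct rows into a staging table.
create table temp as select distinct * from dynpart;

-- Step 2: replace the contents of the original table with the staged rows.
insert overwrite table dynpart select * from temp;

-- Step 3: remove the staging table once the original has been rewritten.
drop table temp;
```

Keeping the staging table until the rewritten data has been checked gives you a fallback copy of the deduplicated rows if the overwrite fails partway.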

This covers the case where rows are duplicated only in some selected columns rather than in all of them. Suppose you have a table like the one shown below:

id  Name    Technology
1   Abcd    Hadoop
2   Efgh    Java       --> Duplicate
3   Ijkl    Mainframe
2   Efgh    Python     --> Duplicate

Here the id and Name columns contain duplicate values. You can use an analytic function to find the duplicate row:

select * from
(select Id,Name,Technology,
row_Number() over (partition By Id,Name order by id desc) as row_num
from yourtable)tab
where row_num > 1;

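This gives output like the following (which of the two rows receives row_num 2 is arbitrary, because the order by column is constant within each partition):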

id  Name    Technology  row_num
2   Efgh    Python           2

If you need all of the duplicate rows instead:

select * from
(select Id,Name,Technology,
count(*) over (partition By Id,Name order by id desc) as duplicate_count
from yourtable)tab
where duplicate_count> 1;

Output:

id  Name    Technology  duplicate_count
2   Efgh    Java             2
2   Efgh    Python           2
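The queries above only find the duplicates. To actually remove them, the same row_number() idea can be combined with insert overwrite so that one row per (Id, Name) survives. This is a sketch, not a definitive recipe: it assumes yourtable is non-partitioned, and ordering by Technology is an arbitrary choice made here only so that the surviving row is deterministic.

```sql
-- Keep exactly one row for each (Id, Name) pair and overwrite the table with the result.
-- Ordering by Technology is an assumption: it keeps the alphabetically first value,
-- so for the sample data the "Java" row is kept and the "Python" row is dropped.
insert overwrite table yourtable
select Id, Name, Technology
from (
  select Id, Name, Technology,
         row_number() over (partition by Id, Name order by Technology) as row_num
  from yourtable
) tab
where row_num = 1;
```

Change the order by expression to control which copy is kept, for example a timestamp column in descending order if the table has one and the newest row should win.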

Source: https://habr.com/ru/post/1016392/

