How to remove duplicate records from a Hive table?

Question

I am trying to learn about deleting duplicate records from a Hive table.

My Hive table "dynpart" has the columns Id, Name, Technology:

Id  Name  Technology
1   Abcd  Hadoop
2   Efgh  Java
3   Ijkl  MainFrames
2   Efgh  Java

We can use options such as DISTINCT in a select query, but a select query only retrieves data from the table. Can anyone tell me how to use a delete query to remove duplicate rows from a Hive table?

I understand that deleting or updating entries this way is not recommended or standard practice in Hive, but I want to know how to do it.

+7

hadoop hive

Metadata Apr 7 '17 at 13:59

3 answers

create table temp as select distinct * from dynpart
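
The statement above only creates a deduplicated copy in temp; dynpart itself still holds the duplicates. A minimal sketch of the follow-up step, assuming dynpart is an unpartitioned table and that temp was created by the statement above:

```sql
-- Assumption: dynpart is unpartitioned and temp holds the distinct rows.
-- Replace the contents of dynpart with the deduplicated copy.
INSERT OVERWRITE TABLE dynpart SELECT * FROM temp;

-- Remove the staging table once the overwrite has succeeded.
DROP TABLE temp;
```

Keeping temp around until the row counts have been checked gives a way back if something looks wrong.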

+3

Shalaj Shukla 07 . '17 15:33

This covers the case where rows are duplicated only on some selected columns rather than on every column. Suppose you have a table like the one below:

id  Name    Technology
1   Abcd    Hadoop
2   Efgh    Java       --> Duplicate
3   Ijkl    Mainframe
2   Efgh    Python     --> Duplicate

Here the id and Name columns have duplicate values. You can use an analytic function to get the extra duplicate rows:

select * from
(select Id,Name,Technology,
row_Number() over (partition By Id,Name order by id desc) as row_num
from yourtable)tab
where row_num > 1;

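This will give you output like the following (because the ORDER BY column is also a partition column, which of the two rows gets row_num 2 is not guaranteed):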

id  Name    Technology  row_num
2   Efgh    Python           2

When you need to get all of the duplicate rows:

select * from
(select Id,Name,Technology,
count(*) over (partition By Id,Name order by id desc) as duplicate_count
from yourtable)tab
where duplicate_count> 1;

Output as:

id  Name    Technology  duplicate_count
2   Efgh    Java             2
2   Efgh    Python           2
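
The queries above only list the duplicates. To actually remove them while keeping one row per (Id, Name), something along these lines could work. This is a sketch that assumes the table is the unpartitioned dynpart from the question and that it does not matter which Technology value survives; ordering by Technology is just one way to make the choice repeatable.

```sql
-- Assumption: dynpart is unpartitioned with columns Id, Name, Technology.
-- Keep exactly one row per (Id, Name) and overwrite the table with the result.
INSERT OVERWRITE TABLE dynpart
SELECT Id, Name, Technology
FROM (
  SELECT Id, Name, Technology,
         -- Ordering by Technology makes the surviving row deterministic.
         ROW_NUMBER() OVER (PARTITION BY Id, Name ORDER BY Technology) AS row_num
  FROM dynpart
) tab
WHERE row_num = 1;
```

With the sample data, the rows sharing id 2 and Name Efgh are ordered Java, then Python, so the Java row is the one that is kept.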

0

vikrant rana May 28 '19 at 9:41

fi11er · Accepted Answer · 2017-04-11T17:00:20+0000

You can use an insert overwrite statement to update the data.

insert overwrite table dynpart select distinct * from dynpart;
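
Note that select distinct * only collapses rows that are identical in every column. To check the result afterwards, a query like this sketch (using the column names from the question) should return no rows if all exact duplicates are gone:

```sql
-- List any row that still appears more than once in dynpart.
SELECT Id, Name, Technology, COUNT(*) AS copies
FROM dynpart
GROUP BY Id, Name, Technology
HAVING COUNT(*) > 1;
```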
