Safely normalizing data with a SQL query

Suppose I have a customer table:

 CREATE TABLE customers
 (
     customer_number   INTEGER,
     customer_name     VARCHAR(...),
     customer_address  VARCHAR(...)
 )

There is no primary key on this table. However, customer_name and customer_address must be unique for any given customer_number (that is, a single customer_number should never map to more than one name/address pair).

This table often contains many duplicate customer rows. To work around the duplication, the following query is used to isolate only the unique customers:

 SELECT DISTINCT customer_number, customer_name, customer_address
   FROM customers

Fortunately, the table has historically contained accurate data; that is, there has never been a conflicting customer_name or customer_address for any customer_number. But suppose conflicting data did make it into the table. I want to write a query that fails outright rather than returning multiple rows for the customer_number in question.

For example, I tried this query without success:

 SELECT customer_number, DISTINCT(customer_name, customer_address)
   FROM customers
  GROUP BY customer_number

Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?

EDIT: Justification for the strange requirement:

In truth, this customer table does not actually exist (thank goodness). I made it up, hoping it would be simple enough to demonstrate the need behind the requirement. However, based on this example, people will (understandably) conclude that the need for such a query is the least of my worries. So I must now lift part of the abstraction and, hopefully, restore my reputation for having offered up such an abomination of a table...

I receive a flat file containing invoices (one per line) from an external system. I parse this file and insert its fields into the following table:

 CREATE TABLE unprocessed_invoices
 (
     invoice_number    INTEGER,
     invoice_date      DATE,
     ...               -- other invoice columns ...
     customer_number   INTEGER,
     customer_name     VARCHAR(...),
     customer_address  VARCHAR(...)
 )

As you can see, the data coming from the external system is denormalized: each line contains both invoice data and the related customer data. Since multiple invoices can share the same customer, the file can contain duplicate customer data.

The system cannot start processing the invoices until all of the customers have been registered in it. The system must therefore identify the unique customers and register them as necessary. Hence my need for a query like this: I am working with denormalized data that I have no control over.

 SELECT customer_number, DISTINCT(customer_name, customer_address)
   FROM unprocessed_invoices
  GROUP BY customer_number
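
To make the registration step concrete, here is a rough sketch of what it boils down to (an illustration only; the real logic has more to it, and the NOT EXISTS guard assumes the staging data is conflict-free):

 -- Sketch: register each distinct customer from the staging table,
 -- skipping customer_numbers that are already registered.
 INSERT INTO customers (customer_number, customer_name, customer_address)
 SELECT DISTINCT ui.customer_number, ui.customer_name, ui.customer_address
   FROM unprocessed_invoices ui
  WHERE NOT EXISTS (SELECT 1
                      FROM customers c
                     WHERE c.customer_number = ui.customer_number)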

Hope this helps clarify the original meaning of the question.

EDIT: Good / Bad Data Examples

To clarify: customer_name and customer_address must be unique for a particular customer_number.

  customer_number | customer_name | customer_address
  ----------------------------------------------------
                1 | 'Bob'         | '123 Street'
                1 | 'Bob'         | '123 Street'
                2 | 'Bob'         | '123 Street'
                2 | 'Bob'         | '123 Street'
                3 | 'Fred'        | '456 Avenue'
                3 | 'Fred'        | '789 Crescent'

The first two rows are fine because they have the same customer_name and customer_address for customer_number 1.

The middle two rows are fine because they have the same customer_name and customer_address for customer_number 2 (even though another customer_number happens to have the same customer_name and customer_address).

The last two rows are not okay, because customer_number 3 has two different customer_address values.

The query I'm looking for should fail if it is run against all six of these rows. However, if only the first four rows existed, the query should return:

  customer_number | customer_name | customer_address
  ----------------------------------------------------
                1 | 'Bob'         | '123 Street'
                2 | 'Bob'         | '123 Street'

Hope this clarifies what I meant by "conflicting customer_name and customer_address". They must be unique to customer_number.

I appreciate the answers explaining how to properly import data from external systems; in fact, I already do most of that. I deliberately hid the details of what I am doing to keep the focus on the question. This query would not be the only form of validation; I just thought it would be a nice extra touch (a last line of defense, so to speak). The question was simply meant to explore what is possible with SQL alone. :)

+4
8 answers

A scalar subquery must return at most one row (for each row of the outer result set), so you can do something like:

  select distinct
        customer_number,
        (
        select distinct
               customer_address
          from customers c2
         where c2.customer_number = c.customer_number
        ) as customer_address
   from customers c
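
The same idea extends to both columns. A sketch of that extension (mine, not part of the answer above): each scalar subquery raises an error, ORA-01427 ("single-row subquery returns more than one row") in Oracle, as soon as a customer_number has conflicting values, which is exactly the "fail on conflict" behaviour the question asks for.

  select distinct
         customer_number,
         (select distinct customer_name
            from customers c2
           where c2.customer_number = c.customer_number) as customer_name,
         (select distinct customer_address
            from customers c3
           where c3.customer_number = c.customer_number) as customer_address
    from customers c
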
+2

Your approach is backwards. You do not want data to be saved successfully and then throw an error on SELECT - that is a land mine waiting to go off, and it means you never know when a SELECT might fail.

What I recommend is that you add a unique key to the table and slowly begin modifying the application to use that key, instead of relying on any combination of meaningful data.

Then you can stop worrying about duplicate data that isn't really duplicate data in the first place: it is perfectly possible for two different people with the same name to live at the same address.

You will also get performance gains from this approach.

As an aside, I strongly recommend that you normalize the data further: break the name into FirstName and LastName (optionally MiddleName too), and break the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip or whatever else).

Update: If I understand your situation correctly (I'm not sure I do), you want to prevent duplicate name-and-address combinations from ever occurring in the table (even though that can legitimately happen in real life). That is best done with a unique constraint or index on those two fields, so the bad data is rejected at insert time. In other words, catch the error before you insert. That tells you the import file or your application logic is bad, and you can take appropriate action.
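
For example, a minimal sketch of that constraint (the constraint name is a placeholder of mine):

 -- Rejects a second row with an existing name/address combination at insert
 -- time instead of surfacing the problem later at query time.
 ALTER TABLE customers
   ADD CONSTRAINT uq_customer_name_address UNIQUE (customer_name, customer_address)

Note that adding the constraint will only succeed once the existing duplicate rows have been cleaned up.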

I still maintain that an error raised at query time is too late in the game to do anything about it.

+3

Getting the query itself to fail could be complicated...

This will show you if the table has duplicate entries:

 select customer_number, customer_name, customer_address
   from customers
  group by customer_number, customer_name, customer_address
 having count(*) > 1

If you simply add a unique index on all three fields, nobody will be able to create a duplicate record in the table.
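
For example (the index name is a placeholder; this just illustrates the suggestion above):

 -- Prevents exact duplicate rows from ever being inserted again.
 CREATE UNIQUE INDEX customers_uq
     ON customers (customer_number, customer_name, customer_address)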

0

The de facto key is name + address, so that is what you need to group by.

 SELECT Customer_Name,
        Customer_Address,
        CASE WHEN Count(DISTINCT Customer_Number) > 1 THEN 1/0 ELSE 0 END as LandMine
   FROM Customers
  GROUP BY Customer_Name, Customer_Address

If you would rather approach it from the Customer_Number side, that works too.

 SELECT *,
        CASE WHEN Exists((
                 SELECT top 1 1
                   FROM Customers c2
                  WHERE c1.Customer_Number != c2.Customer_Number
                    AND c1.Customer_Name = c2.Customer_Name
                    AND c1.Customer_Address = c2.Customer_Address
             )) THEN 1/0 ELSE 0 END as LandMine
   FROM Customers c1
  WHERE Customer_Number = @Number
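
A variant of the same divide-by-zero trick, adapted to the question's grouping (my adaptation, not part of the original answer): grouped by customer_number, it only blows up when a customer_number has more than one distinct name or address.

 SELECT customer_number,
        MAX(customer_name)    AS customer_name,
        MAX(customer_address) AS customer_address,
        CASE WHEN COUNT(DISTINCT customer_name) > 1
               OR COUNT(DISTINCT customer_address) > 1
             THEN 1/0 ELSE 0 END AS LandMine
   FROM Customers
  GROUP BY customer_number
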
0

For this to work you will need an index. If you don't want an index on the table itself, you can simply create a temporary table to do all of this:

 CREATE TABLE #temp_customers
 (
     customer_number   int,
     customer_name     varchar(50),
     customer_address  varchar(50),
     PRIMARY KEY (customer_number),
     UNIQUE (customer_name, customer_address)
 )


 INSERT INTO #temp_customers
 SELECT DISTINCT customer_number, customer_name, customer_address
   FROM customers

 SELECT customer_number, customer_name, customer_address
   FROM #temp_customers

 DROP TABLE #temp_customers

The insert will fail if there is conflicting data (the primary key on customer_number rejects a second, different name/address pair for the same customer), while the DISTINCT weeds out the exact duplicate records.

0

If you have dirty data, I would clean it up first.

Use this to find duplicate customer records ...

 Select *
   From customers
  Where customer_number in (Select customer_number
                              From customers
                             Group by customer_number
                            Having count(*) > 1)
0

Put the data into a temp table (or table variable) using your DISTINCT query:

 select distinct customer_number, customer_name, customer_address,
        IDENTITY(int, 1, 1) AS ID_Num
   into #temp
   from unprocessed_invoices

Personally, I would add an identity column to unprocessed_invoices if at all possible. I never do an import without creating a staging table that has an identity column, because it makes it easier to delete duplicate records.

Now query the table to find your problem records. I assume you will want to know what is causing the problem, not just silently reject the records.

 Select t1.*
   from #temp t1
   join #temp t2
     on t1.customer_name = t2.customer_name
    and t1.customer_address = t2.customer_address
  where t1.customer_number <> t2.customer_number

 select t1.*
   from #temp t1
   join (select customer_number
           from #temp
          group by customer_number
         having count(*) > 1) t2
     on t1.customer_number = t2.customer_number

You can use a variant of these queries to remove the problem records from #temp (depending on whether you want to keep one of them or drop every possible problem), and then insert from #temp into your production table. You can also send a report on the problem records back to whoever provides you the data, so it can be fixed at their end.
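
A sketch of that last step (column list taken from the question; the clean-up of #temp beforehand is assumed to have already happened):

 -- After the problem records have been removed from #temp:
 INSERT INTO customers (customer_number, customer_name, customer_address)
 SELECT customer_number, customer_name, customer_address
   FROM #temp

 DROP TABLE #temp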

0
