Best practice for SQL lookup tables

I am new to SQL, so forgive me if these questions sound weird.

I keep running into the issue of bad data. For example, London could be stored as LON, London UK, London England, etc. Before using SQL, I had many Excel lookup tables in which the first column held the original value and the second column held the revised version. As an example:

 Name             Name_1
 London, UK       London
 Lon              London
 LON              London
 London           London
 London, England  London
 LND              London

Is there an easy way to do this in SQL? I am currently trying to create lookup tables and then use joins. This becomes complicated because I do not always have a correction for each instance, so in most scenarios my lookup tables have fewer entries than the table I join them to.

I taught myself stored procedures, and I wondered whether they could fit what I need. The problem is that my searches on the topic of lookup tables come up empty.

Any advice or pointers would be greatly appreciated, even if the answer is that this is simply impossible to do.

Thank you, as always, for your help, and apologies for the long message.

+6
4 answers

You do not need to do anything special: just return the original value when you do not have a translation for it.

 SELECT t1.FirstName,
        t1.LookupField,
        CASE WHEN t2.Name_1 IS NULL THEN t1.LookupField ELSE t2.Name_1 END AS Name_1
 FROM People AS t1
 LEFT JOIN TableLookupCities AS t2  -- LEFT JOIN keeps rows that have no lookup match
     ON t1.LookupField = t2.Name
+1

You can join to the lookup table and prefer the value it supplies; if no match is found, fall back to the original:

 SELECT t1.FirstName,
        LookupField = ISNULL(t2.Name_1, t1.LookupField)
 FROM People AS t1
 LEFT JOIN TableLookupCities AS t2
     ON t1.LookupField = t2.Name

Make sure that each name has at most one match in TableLookupCities, otherwise the join will produce duplicate rows. Create a unique index on TableLookupCities.Name:

 -- CLUSTERED is optional; note that INCLUDE is only valid on nonclustered indexes
 CREATE UNIQUE INDEX IX_TableLookupCities_Name
     ON TableLookupCities (Name)
     INCLUDE (Name_1)
+1

The bottom line: bad data is bad data, and it takes a lot of work to turn bad data into clean data.

- Update -

Create your own ETL (Extract, Transform, Load) process to handle all the input variants. Your ETL process is likely to change with each new batch of data you receive, because you will have to trap new bad-data variants.

Import the data into an all-VARCHAR staging table, then run the ETL process:

  • Good data goes into the real data tables
  • Bad data goes into an exception table

Then repeat the cycle until a run produces no new exceptions:

  • Adjust the ETL process to trap the new exceptions
  • Run the ETL process again

A sketch of the good-data / bad-data split is shown below.
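For concreteness, here is a minimal T-SQL sketch of that split. The names Staging_People, People_Clean, and Staging_Exceptions are hypothetical; only TableLookupCities comes from the examples above.

 -- Hypothetical all-VARCHAR staging table that raw imports land in
 CREATE TABLE Staging_People (
     FirstName   VARCHAR(100),
     LookupField VARCHAR(100)
 );

 -- Good data: rows with a lookup match go into the real table
 INSERT INTO People_Clean (FirstName, City)
 SELECT s.FirstName, t2.Name_1
 FROM Staging_People AS s
 JOIN TableLookupCities AS t2
     ON s.LookupField = t2.Name;

 -- Bad data: rows without a match land in an exception table,
 -- so you can extend the lookup table and re-run the process
 INSERT INTO Staging_Exceptions (FirstName, LookupField)
 SELECT s.FirstName, s.LookupField
 FROM Staging_People AS s
 LEFT JOIN TableLookupCities AS t2
     ON s.LookupField = t2.Name
 WHERE t2.Name IS NULL;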

- End of update

If you use a LEFT JOIN, you can easily identify the missing values.

 SELECT t1.FirstName,
        t1.LookupField,
        t2.Name_1
 FROM People AS t1
 LEFT JOIN TableLookupCities AS t2
     ON t1.LookupField = t2.Name

Wherever t2.Name_1 comes back NULL, you know you need to add that LookupField value to your lookup table. A good book for learning database design is Database Design for Mere Mortals.

 -- Group By to find the missing unique values
 SELECT t1.LookupField, t2.Name_1
 FROM People AS t1
 LEFT JOIN TableLookupCities AS t2
     ON t1.LookupField = t2.Name
 GROUP BY t1.LookupField, t2.Name_1
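Going one step further, the same LEFT JOIN can seed the lookup table with the unmatched raw values, ready for you to fill in corrections later. A sketch, assuming Name_1 is nullable:

 -- Insert every unmatched value into the lookup table with no
 -- correction yet; you then fill in Name_1 by hand
 INSERT INTO TableLookupCities (Name)
 SELECT DISTINCT t1.LookupField
 FROM People AS t1
 LEFT JOIN TableLookupCities AS t2
     ON t1.LookupField = t2.Name
 WHERE t2.Name IS NULL;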
0

As mentioned above, bad data is its own problem. Data cleansing is an industry in itself, so you have a huge range of options for this kind of problem, from quick and simple fixes to complex, all-bells-and-whistles solutions. Which is "better" depends on your situation and needs.

Of course, you can keep expanding this lookup table to cover the growing number of standard errors / variations, but if there is a constant stream of incoming data, there is a maintenance overhead. It may still be adequate for your needs, so do not shy away from it just because fancier alternatives exist.

It is fairly common to trade the reliability of manual human intervention for the scalability of an automated approach; the latter is much easier to maintain and grow, but (depending on the nature of your problem) it may sometimes be wrong.

Example 1: use a pattern-based approach (CONTAINS, LIKE, regex) to find something that looks like a reasonable match. This can work well in some situations, for example when Name_1 is a static, well-understood list, so you can verify that the results are usually pretty good. A sketch follows below.

  + easy to set up / understand
  + more flexible than an exhaustive list
  - some maintenance is still required
  - hopeless in complex / poorly understood situations
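A minimal sketch of the pattern-based idea, using the London variants from the question; the patterns themselves are illustrative, not a complete set:

 SELECT t1.FirstName,
        CASE
            WHEN t1.LookupField LIKE 'Lon%' THEN 'London'  -- Lon, LON, London UK, London, England, ...
            WHEN t1.LookupField = 'LND'     THEN 'London'
            ELSE t1.LookupField                            -- no pattern matched; keep the original
        END AS City
 FROM People AS t1;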

Example 2: more generally, you can use the text-search capabilities offered by the database to "score" how well one value matches another, and pick the best-scoring match. Again, this is not foolproof or safe in every context, and it is a bit more setup work, but it is much more robust. It is also more computationally intensive, so the size of your data sets, the time window you have to work in, and the available infrastructure are considerations as well. A sketch follows below.

  + pretty good success rates
  - slower setup
  - higher performance overhead
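As one concrete SQL Server flavour of this, the built-in SOUNDEX/DIFFERENCE functions give a crude phonetic score from 0 to 4; a real deployment would more likely use full-text search or an edit-distance function, so treat this purely as a sketch of "pick the best match":

 SELECT t1.LookupField,
        best.Name_1
 FROM People AS t1
 CROSS APPLY (
     SELECT TOP (1) t2.Name_1
     FROM TableLookupCities AS t2
     ORDER BY DIFFERENCE(t1.LookupField, t2.Name) DESC  -- 4 = closest phonetic match
 ) AS best;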

Example 3: another option is something more domain-specific. In this case the data is spatial, so you could use a third-party geocoding service as a means of validation.

  + high success rates
  + copes with a huge range of values
  - may cost extra
  - the most complex / slowest setup

0

Source: https://habr.com/ru/post/956166/

