MySQL string comparison

I asked a similar question a couple of months ago: MySQL Query Based on a String.

The problem I am facing is that the solution only works when the words come in one specific order, and in some cases it works too well: it merges names that should stay separate.

Here is a sample of the data that the query has to filter (the duplicates are intentional; this is the actual data):

 - BELLMORE
 - ATLANTIC BCH
 - ATLANTIC BEACH
 - E HILLS
 - EAST HILLS
 - EAST ROCKAWAY
 - FAR ROCKAWAY
 - FLORAL PARK
 - FLORAL PARK
 - HIGHLAND HEIGHTS
 - N HIGHLAND HGTS
 - NORTH HIGHLAND HEIGHTS

One query that helped in my last question (MySQL Query Based on a String) worked well for one case and failed for another. Here is the query:

 select names from tablename group by substring_index(names," ",1) 

Which returns:

 - BELLMORE
 - ATLANTIC BEACH
 - EAST HILLS
 - FAR ROCKAWAY
 - FLORAL PARK
 - HIGHLAND HEIGHTS
 - N HIGHLAND HGTS
 - NORTH HIGHLAND HEIGHTS

The problem, as you can see, is that it dropped a city it should not have, because it groups on the first word only. The one it dropped was:

 - EAST ROCKAWAY 

It was GROUPed BY EAST, the same key as EAST HILLS.
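To make the collision concrete, here is a minimal Python sketch (not MySQL) that builds the same grouping key as substring_index(names," ",1) over the sample data. The NAMES list and the group_by_first_word helper are made up for illustration only.

```python
# Sketch: group names by their first word, the same key that
# substring_index(names, " ", 1) produces in the query.
NAMES = [
    "BELLMORE", "ATLANTIC BCH", "ATLANTIC BEACH", "E HILLS",
    "EAST HILLS", "EAST ROCKAWAY", "FAR ROCKAWAY", "FLORAL PARK",
    "FLORAL PARK", "HIGHLAND HEIGHTS", "N HIGHLAND HGTS",
    "NORTH HIGHLAND HEIGHTS",
]


def group_by_first_word(names):
    """Map each first word to the distinct names that share it."""
    groups = {}
    for name in names:
        key = name.split(" ", 1)[0]  # whole name if there is no space
        bucket = groups.setdefault(key, [])
        if name not in bucket:
            bucket.append(name)
    return groups


groups = group_by_first_word(NAMES)
print(groups["EAST"])      # two different cities end up under one key
print(groups["ATLANTIC"])  # two spellings of one city (the intended case)
```

The key cannot tell "two spellings of one city" apart from "two cities that share a first word", which is why EAST ROCKAWAY disappears.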

As I continue writing this, I feel that it is nearly impossible, because the positions of the fixed city name and of the variable parts always change. Unless you could compare a certain number of characters, which is far from perfect. If anyone has some insight, or has worked on this and solved it, I would appreciate the feedback and advice. The end result would be:

 - BELLMORE
 - ATLANTIC BEACH
 - EAST HILLS
 - EAST ROCKAWAY
 - FAR ROCKAWAY
 - FLORAL PARK
 - HIGHLAND HEIGHTS
2 answers

My suggestion is an expensive query, but hopefully you can run this kind of operation periodically to "clean" your data, so that it is not needed every time you query this data.

You might consider looking into the Levenshtein distance ... a string metric that measures the amount of difference between two sequences: the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other.
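As a quick illustration of the metric, here is a minimal plain-Python sketch of the standard dynamic-programming calculation, applied to a few names from the question. It is only meant to show what the numbers look like; the levenshtein function name here is an assumption for the example and is separate from any MySQL function.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("ATLANTIC BCH", "ATLANTIC BEACH"))  # 2
print(levenshtein("E HILLS", "EAST HILLS"))           # 3
print(levenshtein("EAST HILLS", "EAST ROCKAWAY"))     # 8
```

The two spellings of the same city are only a few edits apart, while the two different cities that share the word EAST are far apart, which is the distinction the first-word grouping cannot make.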

To avoid having to calculate the distance for the Cartesian product of your table, you could first narrow down the set of cities to be compared with a cheaper check ... for example, that they start with the same letter and have a similar length.
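A rough Python sketch of such a pre-filter follows. The candidate_pairs helper and the max_len_diff cutoff of 3 are assumptions chosen for illustration, and names are assumed to be non-empty. Only the pairs it returns would then be passed to the expensive distance calculation.

```python
from itertools import combinations

NAMES = [
    "BELLMORE", "ATLANTIC BCH", "ATLANTIC BEACH", "E HILLS",
    "EAST HILLS", "EAST ROCKAWAY", "FAR ROCKAWAY", "FLORAL PARK",
    "FLORAL PARK", "HIGHLAND HEIGHTS", "N HIGHLAND HGTS",
    "NORTH HIGHLAND HEIGHTS",
]


def candidate_pairs(names, max_len_diff=3):
    """Pairs of distinct names with the same first letter and similar length."""
    distinct = sorted(set(names))  # exact duplicates need no comparison
    return [
        (a, b)
        for a, b in combinations(distinct, 2)
        if a[0] == b[0] and abs(len(a) - len(b)) <= max_len_diff
    ]


pairs = candidate_pairs(NAMES)
for pair in pairs:
    print(pair)
print(len(pairs), "candidate pairs instead of 55")
```

On this sample the filter keeps 4 of the 55 possible pairs. Note the trade-off: a same-first-letter check can never pair HIGHLAND HEIGHTS with N HIGHLAND HGTS, so the cheap check has to be chosen with the data in mind.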

Initially, you could start by returning only records with a very small Levenshtein distance ... Then you could pick one variant from each set of matches and apply it to the other records, in order to normalize your data.

Then you can gradually increase the distance until you start getting too many false positives.
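The effect of raising the threshold can be sketched in Python on a handful of names from the question. The matches_within helper and the thresholds 1 to 3 are assumptions for illustration, and a plain-Python Levenshtein helper is included so that the sketch is self-contained.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


NAMES = ["ATLANTIC BCH", "ATLANTIC BEACH", "E HILLS",
         "EAST HILLS", "EAST ROCKAWAY", "FAR ROCKAWAY"]


def matches_within(names, max_distance):
    """All pairs of names whose distance is at most max_distance."""
    return [(a, b) for a, b in combinations(names, 2)
            if levenshtein(a, b) <= max_distance]


# Raise the threshold step by step and review what each step adds.
for max_distance in (1, 2, 3):
    print(max_distance, matches_within(NAMES, max_distance))
```

On this sample, a threshold of 2 pairs only the two ATLANTIC spellings. A threshold of 3 also pairs E HILLS with EAST HILLS, but it brings in EAST ROCKAWAY / FAR ROCKAWAY as well, which is the kind of false positive that tells you to stop.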

Here is an implementation directly in MySQL:

 CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
   RETURNS INT
   DETERMINISTIC
 BEGIN
   DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
   DECLARE s1_char CHAR;
   -- max strlen=255
   DECLARE cv0, cv1 VARBINARY(256);
   SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
   IF s1 = s2 THEN
     RETURN 0;
   ELSEIF s1_len = 0 THEN
     RETURN s2_len;
   ELSEIF s2_len = 0 THEN
     RETURN s1_len;
   ELSE
     WHILE j <= s2_len DO
       SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
     END WHILE;
     WHILE i <= s1_len DO
       SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
       WHILE j <= s2_len DO
         SET c = c + 1;
         IF s1_char = SUBSTRING(s2, j, 1) THEN
           SET cost = 0;
         ELSE
           SET cost = 1;
         END IF;
         SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
         IF c > c_temp THEN SET c = c_temp; END IF;
         SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
         IF c > c_temp THEN SET c = c_temp; END IF;
         SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
       END WHILE;
       SET cv1 = cv0, i = i + 1;
     END WHILE;
   END IF;
   RETURN c;
 END;

A tough nut ...

I would definitely go with Michael's suggestion, and add to it the option of storing the unique place names in the database.

This will allow you to use the string distance calculation when adding new places. Then you can manage the places by assigning an associate_id to the places that levenshtein identifies as matches.
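One way this bookkeeping might look, sketched in Python rather than in the database. Only associate_id comes from the text above; the places list of (name, associate_id) rows, the add_place helper, and the max_distance of 2 are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def add_place(places, name, max_distance=2):
    """Store name, reusing the associate_id of the first close match."""
    for existing_name, associate_id in places:
        if levenshtein(existing_name, name) <= max_distance:
            places.append((name, associate_id))
            return associate_id
    # No close match: this name starts a new group of places.
    new_id = max((pid for _, pid in places), default=0) + 1
    places.append((name, new_id))
    return new_id


places = []  # hypothetical stand-in for a table of (name, associate_id)
for name in ["ATLANTIC BEACH", "BELLMORE", "ATLANTIC BCH"]:
    add_place(places, name)
print(places)
```

Here ATLANTIC BCH receives the same associate_id as ATLANTIC BEACH, so selecting one name per associate_id yields the de-duplicated list. The distance check runs once, when a place is added, instead of on every query.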

Perhaps you could use some other data (such as geolocation) to further refine how you link places. Perhaps relying on the place name alone is not the best solution to your problem ...


Source: https://habr.com/ru/post/911810/

