MySQL - how to find the word with the most similar beginning

How to find varchar -word that has the most similar beginning of a specified word in a MySQL database?

For instance:

 +-------------------+ | word_column | +-------------------+ | StackOferflow | | StackExchange | | MetaStackExchange | | .... | +-------------------+ 

request: call get_with_similar_begin('StackExch_bla_bla_bla');
output: 'StackExchange'

request: call get_with_similar_begin('StackO_bla_bla_bla');
output: 'StackOferflow'


UPDATE:

Select * from words where word_column like 'StackExch_bla_bla_bla' will not give the correct result, because 'StackExchange' does not match this filter.

Additional info: I have a BTREE-index on word_column and I would like to use it word_column possible

+5
source share
5 answers

In SQL Server, we can use the CTE query similar to below to achieve what you want:

 declare @search nvarchar(255) = 'StackExch_bla_bla_bla'; -- A cte that contains `StackExch_bla_bla_bla` sub-strings: {`StackExch_bla_bla_bla`, `StackExch_bla_bla_bl`, ..., `S`} with cte(part, lvl) as ( select @search, 1 union all select substring(@search, 1, len(@search) - lvl), lvl + 1 from cte where lvl < len(@search) ), t as ( -- Now below cte will find match level of each word_column select t.word_column, min(cte.lvl) matchLvl from yourTable t left join cte on t.word_column like cte.part+'%' group by t.word_column ) select top(1) word_column from t where matchLvl is not null -- remove non-matched rows order by matchLvl; 

SQL Server Script Demo

I need more time to find a way in MySQL for it, hope some MySQL experts will answer faster;).

My best attempt at MySQL is this:

 select tt.word_column from ( select t.word_column, min(lvl) matchLvl from yourTable t join ( select 'StackExch_bla_bla_bla' part, 1 lvl union all select 'StackExch_bla_bla_bl', 2 union all select 'StackExch_bla_bla_b', 3 union all select 'StackExch_bla_bla_', 4 union all select 'StackExch_bla_bla', 5 union all select 'StackExch_bla_bl', 6 union all select 'StackExch_bla_b', 7 union all select 'StackExch_bla_', 8 union all select 'StackExch_bla', 9 union all select 'StackExch_bl', 10 union all select 'StackExch_b', 11 union all select 'StackExch_', 12 union all select 'StackExch', 13 union all select 'StackExc', 14 union all select 'StackEx', 15 union all select 'StackE', 16 union all select 'Stack', 17 union all select 'Stac', 18 union all select 'Sta', 19 union all select 'St', 20 union all select 'S', 21 ) p on t.word_column like concat(p.part, '%') group by t.word_column ) tt order by matchLvl limit 1; 

I think by creating a stored procedure and using a temporary table to store values ​​in p sub-select, you can achieve what you want --HTH;).

MySQL Fiddle Demo

+2
source

This is a slight deviation from @ shA.t's answer. Aggregation is not required:

 select t.*, p.lvl from yourTable t join (select 'StackExch_bla_bla_bla' as part, 1 as lvl union all select 'StackExch_bla_bla_bl', 2 union all select 'StackExch_bla_bla_b', 3 union all select 'StackExch_bla_bla_', 4 union all select 'StackExch_bla_bla', 5 union all select 'StackExch_bla_bl', 6 union all select 'StackExch_bla_b', 7 union all select 'StackExch_bla_', 8 union all select 'StackExch_bla', 9 union all select 'StackExch_bl', 10 union all select 'StackExch_b', 11 union all select 'StackExch_', 12 union all select 'StackExch', 13 union all select 'StackExc', 14 union all select 'StackEx', 15 union all select 'StackE', 16 union all select 'Stack', 17 union all select 'Stac', 18 union all select 'Sta', 19 union all select 'St', 20 union all select 'S', 21 ) p on t.word_column like concat(p.part, '%') order by matchLvl limit 1; 

A faster way is to use case :

 select t.*, (case when t.word_column like concat('StackExch_bla_bla_bla', '%') then 'StackExch_bla_bla_bla' when t.word_column like concat('StackExch_bla_bla_bl', '%') then 'StackExch_bla_bla_bl' when t.word_column like concat('StackExch_bla_bla_b', '%') then 'StackExch_bla_bla_b' . . . when t.word_column like concat('S', '%') then 'S' else '' end) as longest_match from t order by length(longest_match) desc limit 1; 

None of them will use the index effectively.

If you want a version using an index, run a loop at the application level and retry the query as follows:

 select t.* from t where t.word_column like 'StackExch_bla_bla_bla%' limit 1; 

Then stop when you hit the first match. MySQL should use an index to compare like .

You can get closer to this using union all :

 (select t.*, 'StackExch_bla_bla_bla' as matching from t where t.word_column like 'StackExch_bla_bla_bla%' limit 1 ) union all (select t.*, 'StackExch_bla_bla_bl' from t where t.word_column like 'StackExch_bla_bla_bl%' limit 1 ) union all (select t.*, 'StackExch_bla_bla_b' from t where t.word_column like 'StackExch_bla_bla_b%' limit 1 ) union al . . . (select t.*, 'S' from t where t.word_column like 'S%' limit 1 ) order by length(matching) desc limit 1; 
+2
source

Create a table / insert data.

 CREATE DATABASE IF NOT EXISTS stackoverflow; USE stackoverflow; DROP TABLE IF EXISTS word; CREATE TABLE IF NOT EXISTS word( word_column VARCHAR(255) , KEY(word_column) ) ; INSERT INTO word (`word_column`) VALUES ('StackOverflow'), ('StackExchange'), ('MetaStackExchange') ; 

This decision depends on creating a large list of numbers. We can do this with this query. This query generates numbers from 1 to 1000. I make this query support searching up to 1000 characters.

Query

 SELECT @row := @row + 1 AS ROW FROM ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row1 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row2 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row3 CROSS JOIN ( SELECT @row := 0 ) AS init_user_param 

result

  row -------- 1 2 3 4 5 6 7 8 9 10 ... ... 990 991 992 993 994 995 996 997 998 999 1000 

Now we use the last query as the supplied table in combination with DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, [number]) to find a unique list of words.

Query

 SELECT DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, rows.row) AS word FROM ( SELECT @row := @row + 1 AS ROW FROM ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row1 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row2 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row3 CROSS JOIN ( SELECT @row := 0 ) AS init_user_param ) ROWS 

Result

 word ----------------------- S St Sta Stac Stack StackE StackEx StackExc StackExch StackExch_ StackExch_b StackExch_bl StackExch_bla StackExch_bla_ StackExch_bla_b StackExch_bla_bl StackExch_bla_bla StackExch_bla_bla_ StackExch_bla_bla_b StackExch_bla_bla_bl StackExch_bla_bla_bla 

Now you can join and use REPLACE(word_column, word, '') and CHAR_LENGTH(REPLACE(word_column, word, '')) to create a list.

Query

 SELECT * , REPLACE(word_column, word, '') AS replaced , CHAR_LENGTH(REPLACE(word_column, word, '')) chars_afterreplace FROM ( SELECT DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word FROM ( SELECT @row := @row + 1 AS row_number FROM ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row1 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row2 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row3 CROSS JOIN ( SELECT @row := 0 ) AS init_user_param ) ROWS ) words INNER JOIN word ON word.word_column LIKE CONCAT(words.word, '%') 

Result

 word word_column replaced chars_afterreplace ---------- ------------- ------------- -------------------- S StackExchange tackExchange 12 S StackOverflow tackOverflow 12 St StackExchange ackExchange 11 St StackOverflow ackOverflow 11 Sta StackExchange ckExchange 10 Sta StackOverflow ckOverflow 10 Stac StackExchange kExchange 9 Stac StackOverflow kOverflow 9 Stack StackExchange Exchange 8 Stack StackOverflow Overflow 8 StackE StackExchange xchange 7 StackEx StackExchange change 6 StackExc StackExchange hange 5 StackExch StackExchange ange 4 StackExch_ StackExchange StackExchange 13 

Now we can clearly see that we need the word with the smallest chars_afterreplace. So we want to do ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC LIMIT 1

Query

 SELECT word.word_column FROM ( SELECT DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word FROM ( SELECT @row := @row + 1 AS row_number FROM ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row1 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row2 CROSS JOIN ( SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 ) row3 CROSS JOIN ( SELECT @row := 0 ) AS init_user_param ) ROWS ) words INNER JOIN word ON word.word_column LIKE CONCAT(words.word, '%') ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC LIMIT 1 

results

 word_column --------------- StackExchange 
+2
source

The following solutions need a table containing serial numbers from 1 to (at least) the length of your word_column . Assuming word_column VARCHAR(190) you need a table with numbers from 1 to 190. If you use MariaDB with a sequence plugin, you can use the table seq_1_to_190 . If you do not have it, there are many ways to create it. One easy way is to use the information_schema.columns table:

 create table if not exists seq_1_to_190 (seq tinyint unsigned auto_increment primary key) select null as seq from information_schema.columns limit 190; 

You can also create it on the fly in a subquery, but that will complicate your queries.

I will use the @word session @word to store the search string.

 set @word = 'StackExch_bla_bla_bla'; 

But you can replace all its occurrences with a constant search string.

Now we can use the sequence table to create all the substrings using

 select seq as l, left(@word, seq) as substr from seq_1_to_190 s where s.seq <= char_length(@word) 

http://rextester.com/BWU18001

and use it for the LIKE condition when you join it to the words table:

 select w.word_column from ( select seq as l, left(@word, seq) as substr from seq_1_to_190 s where s.seq <= char_length(@word) ) s join words w on w.word_column like concat(replace(s.substr, '_', '\_'), '%') order by sl desc limit 1 

http://rextester.com/STQP82942

Note that _ is a placeholder, and you need to escape it in the search bar with \_ . You should also do this for % if your line can contain it, but I will skip this part in my answer.

A request can also be written without a subquery:

 select w.word_column from seq_1_to_190 s join words w on w.word_column like concat(replace(left(@word, seq), '_', '\_'), '%') where s.seq <= char_length(@word) order by s.seq desc limit 1 

http://rextester.com/QVZI59071

These queries do the job, and in theory they should also be fast. But MySQL (in my case, MariaDB 10.0.19) creates a poor execution plan and does not use the index for the ORDER BY . Both queries are executed after approximately 1.8 seconds with a data set of 100 thousand lines.

The best I could do to improve performance with a single query is

 select ( select word_column from words w where w.word_column like concat(replace(left(@word, s.seq), '_', '\_'), '%') limit 1 ) as word_column from seq_1_to_190 s where s.seq <= char_length(@word) having word_column is not null order by s.seq desc limit 1 

http://rextester.com/APZHA8471

It's faster, but it still takes 670 ms. Note that the CORD request for Gordons is completed in 125 ms, although this requires a full table / index scan and file server.

However, I managed to get the engine to use the index for the ORDER BY with an indexed temporary table:

 drop temporary table if exists tmp; create temporary table tmp( id tinyint unsigned auto_increment primary key, pattern varchar(190) ) engine=memory select null as id, left(@word, seq) as pattern from seq_1_to_190 s where s.seq <= char_length(@word) order by s.seq desc; select w.word_column from tmp force index for order by (primary) join words w on w.word_column >= tmp.pattern and w.word_column < concat(tmp.pattern, char(127)) order by tmp.id asc limit 1 

http://rextester.com/OOE82089

This query is "instantaneous" (less than 1 ms) in my test pattern of rows of 100 thousand rows. If I delete FORCE INDEX or use the LIKE clause, it will be slow again.

Note that char(127) seems to work for ASCII strings. You may need to find another character according to your character set.

After all this, I have to say that my first thought was to use the UNION ALL query that Gordon Linoff also suggested. However, here is just the SQL solution:

 set @subquery = '( select word_column from words where word_column like {pattern} limit 1 )'; set session group_concat_max_len = 1000000; set @sql = ( select group_concat( replace( @subquery, '{pattern}', replace(quote(concat(left(@word, seq), '%')), '_', '\_') ) order by s.seq desc separator ' union all ' ) from seq_1_to_190 s where s.seq <= char_length(@word) ); set @sql = concat(@sql, ' limit 1'); prepare stmt from @sql; execute stmt; 

http://rextester.com/OPTJ37873

He is also "instant."

If you like stored procedures / functions, here is the function:

 create function get_with_similar_begin(search_str text) returns text begin declare l integer; declare res text; declare pattern text; set l = char_length(search_str); while l > 0 and res is null do set pattern = left(search_str, l); set pattern = replace(pattern, '_', '\_'); set pattern = replace(pattern, '%', '\%'); set pattern = concat(pattern, '%'); set res = (select word_column from words where word_column like pattern); set l = l - 1; end while; return res; end 

Use it like

 select get_with_similar_begin('StackExch_bla_bla_bla'); select get_with_similar_begin('StackO_bla_bla_bla'); 

http://rextester.com/CJTU4629

This is most likely the fastest way. Although for long lines, some sort of split and conquer algorinthm can reduce the average number of searches. But it can also be just redundant.

If you want to test your queries in a large table, I used the following code to create my test table (for MariaDB with a sequence plugin):

 drop table if exists words; create table words( id mediumint auto_increment primary key, word_column varchar(190), index(word_column) ); insert into words(word_column) select concat('Stack', rand(1)) as word_column from seq_1_to_100000; insert into words(word_column)values('StackOferflow'),('StackExchange'),('MetaStackExchange'); 
0
source

You can assign a like operator to enter a value

 select * from words where 'your_input_value' like concat(word_column,'%') 
-1
source

Source: https://habr.com/ru/post/1272803/


All Articles