Storing n-grams in a database in fewer than n tables

If I were writing a piece of software that tried to predict which word the user was going to type next, using the two previous words that the user typed, I would create two tables.

Like so:

== 1-gram table ==
Token | NextWord | Frequency
------+----------+-----------
"I"   | "like"   | 15
"I"   | "hate"   | 20

== 2-gram table ==
Token    | NextWord   | Frequency
---------+------------+-----------
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

With this implementation, when the user types "I", the software looks it up in the 1-gram table and predicts that the next word will be "hate". If the user does type "hate", the software looks up "I hate" in the 2-gram table and predicts that the next word will be "tomatoes".
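To make the lookup concrete, here is a minimal sketch of the two-table version using Python's built-in sqlite3 module. The table names `one_gram` and `two_gram` and the helper `predict_from` are my own placeholders, not something from the question; the rows are the ones shown above.

```python
import sqlite3

# In-memory database holding the two example tables (names are hypothetical).
two_table_db = sqlite3.connect(":memory:")
two_table_db.executescript("""
    CREATE TABLE one_gram (Token TEXT, NextWord TEXT, Frequency INTEGER);
    CREATE TABLE two_gram (Token TEXT, NextWord TEXT, Frequency INTEGER);
""")
two_table_db.executemany(
    "INSERT INTO one_gram VALUES (?, ?, ?)",
    [("I", "like", 15), ("I", "hate", 20)],
)
two_table_db.executemany(
    "INSERT INTO two_gram VALUES (?, ?, ?)",
    [
        ("I like", "apples", 8),
        ("I like", "tomatoes", 12),
        ("I hate", "tomatoes", 20),
        ("I hate", "apples", 2),
    ],
)

ALLOWED_TABLES = ("one_gram", "two_gram")


def predict_from(table, token):
    """Return the most frequent NextWord for token in the given table, or None."""
    # Table names cannot be bound as SQL parameters, so check against a fixed list.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    row = two_table_db.execute(
        f"SELECT NextWord FROM {table} WHERE Token = ? "
        "ORDER BY Frequency DESC LIMIT 1",
        (token,),
    ).fetchone()
    return row[0] if row else None


print(predict_from("one_gram", "I"))       # hate
print(predict_from("two_gram", "I hate"))  # tomatoes
```

Note that the caller has to know which table to query for each context length, which is exactly what gets unwieldy as n grows.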

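With larger n-grams, however, this does not scale. If I wanted to use the previous 5 or 6 words, I would need 5-6 n-gram tables.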

Is there a way to store these n-grams in fewer than n tables?


-

phrase, frequency

"" , . "is not" to "is not".

MD5, CRC32 .


?

Token    | NextWord   | Frequency
---------+------------+-----------
"I"      | "like"     | 15
"I"      | "hate"     | 20
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

, "", (.. ). , , , ( + 1 - )


, , . , . , . Ad infinitum.

So you can put all the 1-grams, 2-grams, and so on in the Token field, and none of them will ever collide.
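As a rough sketch of that single-table idea, again with Python's sqlite3: the table name `ngram`, the helper names, and the composite primary key are my assumptions. The fallback from the longest context to shorter ones is also my own addition for illustration; nothing in the thread prescribes it.

```python
import sqlite3

EXAMPLE_ROWS = [
    ("I", "like", 15),
    ("I", "hate", 20),
    ("I like", "apples", 8),
    ("I like", "tomatoes", 12),
    ("I hate", "tomatoes", 20),
    ("I hate", "apples", 2),
]


def build_ngram_db(rows):
    """Create one table that holds contexts of every length in the Token column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ngram ("
        " Token TEXT NOT NULL,"
        " NextWord TEXT NOT NULL,"
        " Frequency INTEGER NOT NULL,"
        " PRIMARY KEY (Token, NextWord))"
    )
    conn.executemany("INSERT INTO ngram VALUES (?, ?, ?)", rows)
    return conn


def predict_next(conn, words, max_context=2):
    """Try the longest available context first, then drop the oldest word."""
    context = list(words[-max_context:])
    while context:
        row = conn.execute(
            "SELECT NextWord FROM ngram WHERE Token = ? "
            "ORDER BY Frequency DESC LIMIT 1",
            (" ".join(context),),
        ).fetchone()
        if row:
            return row[0]
        context = context[1:]  # back off to a shorter context
    return None


ngram_db = build_ngram_db(EXAMPLE_ROWS)
print(predict_next(ngram_db, ["I"]))          # hate
print(predict_next(ngram_db, ["I", "hate"]))  # tomatoes
```

Because a context of k words always contains k - 1 spaces, a 1-gram key such as "I" can never equal a 2-gram key such as "I like", so a plain equality match on Token is enough. Supporting longer contexts only means raising `max_context` and inserting more rows, not adding tables.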


Source: https://habr.com/ru/post/1745710/

