Efficient data structure / algorithm for transliteration-based word search

I am looking for an efficient data structure / algorithm for storing and searching transliteration-based word lookups (e.g. what Google does: http://www.google.com/transliterate/ , though I'm not trying to use the Google Transliteration API). Unfortunately, the natural language I'm working with does not have any soundex implementation, so I'm on my own.

For an open source project, I currently use simple arrays to store the word list and dynamically generate regular expressions (based on user input) to match against it. It works fine, but regexes are more powerful and resource-intensive than what I actually need. For example, I'm afraid this solution will drain the battery if I try to port it to handheld devices, because scanning thousands of words with a regular expression is too expensive.

There must be a better way to accomplish this for complex languages. How does the Pinyin input method work, for example? Any suggestions on where to start?

Thanks in advance.


Edit: if I understand correctly, this is what @Dialecticus suggested -

I want to transliterate from Language1, which has 3 characters a,b,c, to Language2, which has 6 characters p,q,r,x,y,z. Because the two languages differ in the number of characters and in their phonetics, it is often impossible to define a one-to-one mapping.

Suppose, going by phonetics, this is our associative array / transliteration table:

 a -> p, q
 b -> r
 c -> x, y, z

We also have a list of valid words for Language2, kept in simple arrays:

 ... px qy ... 

If the user types ac , the possible combinations after transliteration step 1 are px, py, pz, qx, qy, qz. In step 2 we must search the actual word list and eliminate every combination except px and qy .
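These two steps could be sketched like this (a toy illustration using my own class and variable names, not code from the project in question):

```java
import java.util.*;

public class ExpandAndFilter {
    public static void main(String[] args) {
        // Step 0: the transliteration table a -> {p,q}, b -> {r}, c -> {x,y,z}
        Map<Character, String> table = new HashMap<>();
        table.put('a', "pq");
        table.put('b', "r");
        table.put('c', "xyz");

        // The list of valid words in Language2
        Set<String> validWords = new HashSet<>(Arrays.asList("px", "qy"));

        // Step 1: expand the input "ac" into all possible combinations
        List<String> combos = new ArrayList<>();
        combos.add("");
        for (char ch : "ac".toCharArray()) {
            List<String> next = new ArrayList<>();
            for (String prefix : combos) {
                for (char target : table.get(ch).toCharArray()) {
                    next.add(prefix + target);
                }
            }
            combos = next;
        }
        System.out.println(combos); // [px, py, pz, qx, qy, qz]

        // Step 2: keep only the combinations that are real words
        combos.retainAll(validWords);
        System.out.println(combos); // [px, qy]
    }
}
```

Note that step 1 is exponential in the length of the input (the number of candidates multiplies at every character), which is exactly why filtering against the word list this way gets expensive.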


What I am currently doing is no different from the above approach. Instead of generating the possible combinations from the transliteration table, I build the regular expression [pq][xyz] and match it against my word list, which yields px and qy .
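For comparison, here is a minimal sketch of that regex approach with the same toy table (the class and variable names are mine, not from the actual project):

```java
import java.util.*;
import java.util.regex.Pattern;

public class RegexTransliterate {
    public static void main(String[] args) {
        // Transliteration table a -> {p,q}, b -> {r}, c -> {x,y,z}
        Map<Character, String> table = new HashMap<>();
        table.put('a', "pq");
        table.put('b', "r");
        table.put('c', "xyz");

        // Build the pattern [pq][xyz] from the user input "ac"
        StringBuilder sb = new StringBuilder();
        for (char ch : "ac".toCharArray()) {
            sb.append('[').append(table.get(ch)).append(']');
        }
        Pattern pattern = Pattern.compile(sb.toString());
        System.out.println(pattern); // prints [pq][xyz]

        // Match it against every word in the list - the expensive part,
        // since this runs the regex engine once per stored word
        List<String> words = Arrays.asList("px", "qy", "rr", "zz");
        for (String w : words) {
            if (pattern.matcher(w).matches()) {
                System.out.println(w); // prints px, then qy
            }
        }
    }
}
```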

I really want to know if there is any better way than this.

+6
2 answers

From what I understand, you have an input string S in an alphabet (let's call it A1), and you want to convert it to the string S', its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1, S'2, ..., S'n] of output strings that could potentially be equivalent to S.

One approach that comes to mind: for each word in the list of valid words in A2, generate the list of strings in A1 that map to it. Using the example in your edit, we have

 px -> ac
 qy -> ac
 pr -> ab

(I've added the extra valid word pr for clarity)

Now that we know which sequences of input characters map to a real word, we can use this table to build a trie.

Each node holds a pointer to a list of the valid words in A2 that map to the sequence of A1 characters forming the path from the trie root to that node.

So for our example, the trie would look something like this:

           Root (empty)
                |
                a
                |
                V
        +--- Node (empty) ---+
        |                    |
        b                    c
        |                    |
        V                    V
    Node (pr)           Node (px,qy)

Starting from the root node, as characters are consumed we move from the current node to the child labeled with the consumed character, until the entire input string has been read. If at any point no transition is defined for the current character, the input string does not exist in our trie and thus does not map to a valid word in the target language. Otherwise, at the end of the process, the list of words attached to the current node is the list of valid words the input string maps to.

Besides the initial cost of building the trie (which can be built ahead of time if the list of valid words never changes), finding the list of matching words takes O(n) in the length of the input.

Using a trie also has the advantage that you can use it to find the list of all valid words that can be produced by adding more characters to the end of the input, that is, a prefix match. For example, given the single input character "a", we can use the trie to find all valid words that can start from "a" ('px', 'qy', 'pr'). Doing that is not as fast as finding an exact match, though.
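Such a prefix search could be sketched as follows; this is my own compact trie variant with an added findPrefixMatches method, not part of the answer's code, so treat it as an illustration:

```java
import java.util.*;

class PrefixTrie {
    // One node per consumed input character; words holds the target-language
    // words mapped to the exact path from the root to this node.
    static class Node {
        Node[] next = new Node[26]; // lowercase English input alphabet
        List<String> words = new ArrayList<>();
    }

    private final Node root = new Node();

    void addWord(String source, String target) {
        Node cur = root;
        for (char c : source.toCharArray()) {
            int i = c - 'a';
            if (cur.next[i] == null) cur.next[i] = new Node();
            cur = cur.next[i];
        }
        cur.words.add(target);
    }

    // Walk down to the node for the prefix, then collect every word below it.
    List<String> findPrefixMatches(String prefix) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.next[c - 'a'];
            if (cur == null) return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        collect(cur, out);
        return out;
    }

    private void collect(Node node, List<String> out) {
        out.addAll(node.words);
        for (Node child : node.next) {
            if (child != null) collect(child, out);
        }
    }

    public static void main(String[] args) {
        PrefixTrie t = new PrefixTrie();
        t.addWord("ac", "px");
        t.addWord("ac", "qy");
        t.addWord("ab", "pr");
        System.out.println(t.findPrefixMatches("a")); // [pr, px, qy]
    }
}
```

The collection step visits the whole subtree under the prefix node, which is why a prefix match costs more than an exact match.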

Here's a quick hack solution (in Java):

    import java.util.*;

    class TrieNode {
        // Child nodes - the size of the array depends on your alphabet size;
        // here we are only using the lowercase English characters 'a'-'z'.
        TrieNode[] next = new TrieNode[26];
        List<String> words;

        public TrieNode() {
            words = new ArrayList<String>();
        }
    }

    class Trie {
        private TrieNode root = null;

        public void addWord(String sourceLanguage, String targetLanguage) {
            root = add(root, sourceLanguage.toCharArray(), 0, targetLanguage);
        }

        private static int convertToIndex(char c) {
            // You need to change this for your alphabet.
            return (c - 'a');
        }

        private TrieNode add(TrieNode cur, char[] s, int pos, String targ) {
            if (cur == null) {
                cur = new TrieNode();
            }
            if (s.length == pos) {
                cur.words.add(targ);
            } else {
                cur.next[convertToIndex(s[pos])] =
                    add(cur.next[convertToIndex(s[pos])], s, pos + 1, targ);
            }
            return cur;
        }

        public List<String> findMatches(String text) {
            return find(root, text.toCharArray(), 0);
        }

        private List<String> find(TrieNode cur, char[] s, int pos) {
            if (cur == null) {
                return new ArrayList<String>();
            } else if (pos == s.length) {
                return cur.words;
            } else {
                return find(cur.next[convertToIndex(s[pos])], s, pos + 1);
            }
        }
    }

    class MyMiniTransliterator {
        public static void main(String[] args) {
            Trie t = new Trie();
            t.addWord("ac", "px");
            t.addWord("ac", "qy");
            t.addWord("ab", "pr");
            System.out.println(t.findMatches("ac")); // prints [px, qy]
            System.out.println(t.findMatches("ab")); // prints [pr]
            System.out.println(t.findMatches("ba")); // prints [] since "ba" matches nothing
        }
    }

This is a very simple program, with no compression or other speed-ups, and it only works for lowercase English characters in the input language. But it can easily be modified for other character sets.

+4

I would transliterate the sentence one character at a time, instead of one word at a time. For most languages each character can be transliterated independently of the other characters in the word. You can still have exceptions, i.e. whole words that must be transliterated as complete units, but a table of character transliterations plus an exceptions table will surely be smaller than a table of all existing words.

The best structure for the transliteration table is some kind of associative array , probably implemented with hash tables . In C++ there is std::unordered_map , and in C# you would use Dictionary .
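A rough sketch of that idea in Java, using HashMap as the associative array. The one-candidate-per-character table and the whole-word exceptions map below are hypothetical examples of mine (a real table for an ambiguous script would map each character to several candidates):

```java
import java.util.*;

public class CharTransliterator {
    // Per-character table - stays small regardless of vocabulary size.
    static final Map<Character, String> CHAR_TABLE = new HashMap<>();
    // Whole words that must be transliterated as complete units.
    static final Map<String, String> EXCEPTIONS = new HashMap<>();
    static {
        CHAR_TABLE.put('a', "p");
        CHAR_TABLE.put('b', "r");
        CHAR_TABLE.put('c', "x");
        EXCEPTIONS.put("cab", "zip"); // hypothetical irregular word
    }

    static String transliterate(String word) {
        // Check the (small) exceptions table first.
        String exception = EXCEPTIONS.get(word);
        if (exception != null) return exception;
        // Otherwise transliterate character by character.
        StringBuilder sb = new StringBuilder();
        for (char ch : word.toCharArray()) {
            sb.append(CHAR_TABLE.getOrDefault(ch, String.valueOf(ch)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("abc")); // prints prx
        System.out.println(transliterate("cab")); // prints zip (whole-word exception)
    }
}
```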

+2

Source: https://habr.com/ru/post/897955/

