An effective way to create vocabulary graphics

Question

An effective way to create vocabulary graphics

What is the most effective way to create a graph of words in a dictionary with distance from interference = 1 ??

+4

c ++ dictionary algorithm graph distance

copperhead Dec 17 '10 at 8:52

2 answers

I would suggest an adjacency map<string, list<string> to represent the graph. It maps the node graph (a word in our case) to a list of neighboring nodes (i.e., All words within a distance == 1).

How to fill this card? First, put the dictionary into sets of words with the same length set<string> wordset[MAX_WORD_LENGTH] and fill in the map as follows.

 map<string, list<string>> adjacency_map for each wordset in wordset array for each w1 in wordset for each w2 in wordset if one_char_off(w1, w2) { // if the distance between w1 and w2 == 1 update(adjacency_map, w1, w2) update(adjacency_map, w2, w1) }

The one_char_off(string w1, string w2) function checks if the distance between w1 and w2 is 1.

 int diff = 0 for i in [0, w1.length) // both w1 and w2 have the same length if w1[i] != w2[i] if ++diff > 1 return false return diff == 1

The update(map<string, list<string>> adjacency_map, string w1, string w2) function update(map<string, list<string>> adjacency_map, string w1, string w2) adds a pair of adjacent words w1 and w2 to the adjacent map.

0

Michael Dec 17 '10 at 11:13

source share

Nick johnson · Accepted Answer · 2010-12-17T10:59:07+0000

Hamming distance is determined only for words of equal length, so you will actually have one disjoint graph for each word length in the dictionary. If you meant levenshtein a distance that allows you to insert and delete, then you really will have one graph.

One option is to build a BK-tree from your dictionary. Although this does not mean that this is a graph, it allows you to ask the same questions (getting a list of elements with a given distance) and taking out the time O (n log n).

Another option is brute force: for each word, check its distance to all candidate words. You can narrow down the words of the candidate to the same lengths (or one length is shorter or longer for levenshtein). This is O (n ^ 2) in the worst case, but it may be acceptable if you don't build the graph more than once.

Theoretically, there probably exists an O (n log n) plotting method — in the trivial case, building a BK tree and then plotting from O (mn log n), where m is the average number of edges per node — but I don’t know about elegant.

An effective way to create vocabulary graphics

More articles: