Morse code translation without spaces

I have a Morse code that has lost spaces between letters, my task is to find out what the message says. So far, I have been lost due to the large number of combinations that may be.

Here is all the information about the messages that I have.

  • The output will be in English
  • There will always be a translation that makes sense
  • Here is an example of a message -..-...-...-...-..-.-.-.-.-..-.-.-.-.-.-.-.-.-.-..-...-.
  • Messages must be no more than 70 characters
  • The morse code was taken from a longer stream, so it’s possible that the first or last groups can be truncated and therefore do not have a valid translation

Does anyone have a smart solution?

+4
source share
3 answers

This is not an easy task because, as Ruach suggested, there are many viable suggestions for this message. For example, "JACK AND JILL WENT UP THE HILL" has the same encoding as "JACK AND JILL WALK CHISELED". Since these words are both grammatical sentences and the words in each of them are common, it is not clear how to choose one or the other (or any other of 40141055989476564163599 different sequences of English words that have the same encoding as this message), without delving into natural language processing.

In any case, here we propose a dynamic programming solution to the problem of finding the shortest sentence (with the smallest characters, if there is a connection). He can also calculate the total number of sentences that have the same encoding as this message. He needs a dictionary of English words in a file.

The following improvements should be the best indicator of how likely the sentence is: possible word frequencies, false positive rates in morse (for example, β€œI” is a common word, but it is often found as part of other morse code sequences). The hard part will be to formulate a good evaluation function, which can be expressed in such a way that it can be calculated using dynamic programming.

 MORSE = dict(zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ', [ '.-', '-...', '-.-.', '-..', '.', '..-.', '--.', '....', '..', '.---', '-.-', '.-..', '--', '-.', '---', '.--.', '--.-', '.-.', '...', '-', '..-', '...-', '.--', '-..-', '-.--', '--..' ])) # Read a file containing AZ only English words, one per line. WORDS = set(word.strip().upper() for word in open('dict.en').readlines()) # A set of all possible prefixes of English words. PREFIXES = set(word[:j+1] for word in WORDS for j in xrange(len(word))) def translate(msg, c_sep=' ', w_sep=' / '): """Turn a message (all-caps space-separated words) into morse code.""" return w_sep.join(c_sep.join(MORSE[c] for c in word) for word in msg.split(' ')) def encode(msg): """Turn a message into timing-less morse code.""" return translate(msg, '', '') def c_trans(morse): """Construct a map of char transitions. The return value is a dict, mapping indexes into the morse code stream to a dict of possible characters at that location to where they would go in the stream. Transitions that lead to dead-ends are omitted. """ result = [{} for i in xrange(len(morse))] for i_ in xrange(len(morse)): i = len(morse) - i_ - 1 for c, m in MORSE.iteritems(): if i + len(m) < len(morse) and not result[i + len(m)]: continue if morse[i:i+len(m)] != m: continue result[i][c] = i + len(m) return result def find_words(ctr, i, prefix=''): """Find all legal words starting from position i. We generate all possible words starting from position i in the morse code stream, assuming we already have the given prefix. ctr is a char transition dict, as produced by c_trans. """ if prefix in WORDS: yield prefix, i if i == len(ctr): return for c, j in ctr[i].iteritems(): if prefix + c in PREFIXES: for w, j2 in find_words(ctr, j, prefix + c): yield w, j2 def w_trans(ctr): """Like c_trans, but produce a word transition map.""" result = [{} for i in xrange(len(ctr))] for i_ in xrange(len(ctr)): i = len(ctr) - i_ - 1 for w, j in find_words(ctr, i): if j < len(result) and not result[j]: continue result[i][w] = j return result def shortest_sentence(wt): """Given a word transition map, find the shortest possible sentence. We find the sentence that uses the entire morse code stream, and has the fewest number of words. If there are multiple sentences that satisfy this, we return the one that uses the smallest number of characters. """ result = [-1 for _ in xrange(len(wt))] + [0] words = [None] * len(wt) for i_ in xrange(len(wt)): i = len(wt) - i_ - 1 for w, j in wt[i].iteritems(): if result[j] == -1: continue if result[i] == -1 or result[j] + 1 + len(w) / 30.0 < result[i]: result[i] = result[j] + 1 + len(w) / 30.0 words[i] = w i = 0 result = [] while i < len(wt): result.append(words[i]) i = wt[i][words[i]] return result def sentence_count(wt): result = [0] * len(wt) + [1] for i_ in xrange(len(wt)): i = len(wt) - i_ - 1 for j in wt[i].itervalues(): result[i] += result[j] return result[0] msg = 'JACK AND JILL WENT UP THE HILL' print sentence_count(w_trans(c_trans(encode(msg)))) print shortest_sentence(w_trans(c_trans(encode(msg)))) 
+7
source

I don't know if this is smart, but I would try a width search (as opposed to a depth search implicit in the idea of ​​Regex BRPocock). Suppose your line looks like this:

 .---.--.-.-.-.--.-...---...-...-.. JACKANDJILL 

You run in state ('', 0) ( '' what you have decoded so far; 0 is your position in the Morse code line). Starting at zero, the possible starting characters are:. . E , .- A , .-- W , .--- J and .---- 1 . So, press the states ('E', 1) , ('A', 2) , ('W', 3) , ('J', 4) and ('1', 5) in turn. After deleting the state ('E', 1) you queue the states ('ET', 2) , ('EM', 3) and ('EO', 4) .

Now your turn of possible states will grow rapidly - both of { . , - } are letters, like all { .. , .- , -. , -- } and all { ... , ..- , .-. , .-- , -.. , -.- , --. , --- }, so in each pass, your number of states will increase by at least three times; therefore, you must have a user feedback mechanism. In particular, you need to somehow ask your user: β€œIs it possible that this line starts with EOS3AIOSF ?”, And if the user says β€œno”, you need to abandon the state ("EOS3AIOSF", 26) from your queue. It would be ideal to present the user with a graphical interface that so often displays all current states and allows him to choose which ones should be continued. (The β€œUser” will also be for you, of course. English has a lack of pronouns: if β€œyou” refers to a program, then what pronoun refers to a user programmer ?!)

0
source

Maintain 3 things: the word list is still S, the current word is still W, and the current character is C.

  • S should only be good words, for example. FAST
  • W must be a valid word prefix, for example. ['BRO']
  • C must be a valid prefix of some letter, for example. '.-'

Now, given the new character, say '-', we continue C with it (in this case we get '.--'). If C is a full letter (in this case, the letter "W"), we have the choice to add it to W, or to continue expanding the letter by adding more characters. If we continue W, we have a choice to add it to S (if it is a real word) or to continue its further progress.

This is a search, but most paths end quickly (as soon as you have W not a valid prefix of any word that you can stop, and once C is not a prefix of any letter you can stop).

To improve your work efficiency, you can use dynamic programming to avoid redundant work and use attempts to effectively test prefixes.

What does the code look like? Omitting the is_word functions, which check if a string is an English word, and the 'is_word_prefix' function, which checks if a string is the beginning of any real word, is something like this:

 morse = { '.-': 'A', '-...': 'B', etc. } def is_morse_prefix(C): return any(k.startswith(C) for k in morse) def break_words(input, S, W, C): while True: if not input: if W == C == '': yield S return i, input = input[0], input[1:] C += i if not is_morse_prefix(C): return ch = morse.get(C, None) if ch is None or not is_word_prefix(W + ch): continue for result in break_words(input, S, W + ch, ''): yield result if is_word(W + ch): for result in break_words(input, S + ' ' + W + ch, '', ''): yield result for S in break_words('....--', [], '', ''): print S 
0
source

Source: https://habr.com/ru/post/1384222/


All Articles