How can I uniquely shorten a list of strings so that each has no more than x characters?

I am looking for an algorithm that takes a vector of strings v1 and returns a similar vector of strings v2, where each string is less than x characters long and is unique. The strings in v1 may not be unique.

While I need to accept ASCII in v1, I would prefer to use only alphanumeric characters ([A-Za-z0-9]) when new characters have to be inserted.

Obviously, there are three caveats here:

  • For some values of v1 and x there is no valid v2, for example when v1 has 37 elements and x == 1.

  • “Similar,” as used in the question, is subjective. The strings will be displayed to the user and are presumably short natural-language phrases (for example, "number of colors"). I want a person to be able to match the original to its shortened string as easily as possible. This probably means using heuristics such as disemvoweling. Since there is probably no objective measure for my notion of similarity (string length will probably not be the most useful one here, although it may be), my judgment of what is good will be arbitrary.

  • The method should be suitable for English; other languages do not matter.

Obviously, this is a (programming) language-agnostic problem, but I would look favorably on an implementation in Python (because I find its string processing straightforward).

+6
3 answers

Some notes and pointers for doing this in Python:

  • Use the bisect module to keep the array of results sorted, so that potential duplicates are easy to spot. This is useful even if v1 is already sorted (e.g. name and enemy will collide after disemvoweling).
  • In Python 2, disemvoweling can be done by simply calling .translate(None, "aeiouyAEIOUY") on the string. In Python 3 the equivalent call is .translate(str.maketrans("", "", "aeiouyAEIOUY")).
  • In the case of duplicates, you can first try to resolve conflicts by lowercasing all the results and using swapcase as a “bitmask”, i.e. multiple occurrences of aaa become ["aaa", "aaA", "aAa", "aAA"], etc. If this is not enough, "increment" the characters, starting from the end, until a non-colliding identifier is found. For example, ["aa"]*7 will become ["aa", "aA", "Aa", "AA", "ab", "aB", "Ab"].
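
The pointers above can be combined into a rough Python 3 sketch. The names disemvowel, case_variants and shorten_unique are hypothetical helpers, and the sketch assumes "at most x characters"; it only covers the case "bitmask" step, not the "increment" fallback.

```python
import bisect
from itertools import product

VOWELS = "aeiouyAEIOUY"
# Python 3 spelling of the Python 2 call s.translate(None, VOWELS).
_DROP_VOWELS = str.maketrans("", "", VOWELS)


def disemvowel(s):
    """Remove every vowel (including y) from s."""
    return s.translate(_DROP_VOWELS)


def case_variants(s):
    """Yield case "bitmask" variants of s, starting with all-lowercase."""
    letters = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in s]
    for combo in product(*letters):
        yield "".join(combo)


def shorten_unique(strings, x):
    """Hypothetical helper: disemvowel, cut to x chars, fix clashes by case."""
    seen = []    # kept sorted so bisect can test membership
    result = []
    for s in strings:
        base = disemvowel(s)[:x]
        for candidate in case_variants(base):
            i = bisect.bisect_left(seen, candidate)
            if i == len(seen) or seen[i] != candidate:
                seen.insert(i, candidate)   # keeps seen sorted
                result.append(candidate)
                break
        else:
            # Every case variant is taken; the "increment" step would go here.
            raise ValueError("ran out of case variants for %r" % s)
    return result
```

Note that a string made only of vowels disemvowels to the empty string, which has a single case variant, so a real implementation would need the "increment" fallback (or a gentler reducer) for such inputs.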
+1

Sketch -

Develop a list of functions that each reduce the size of an English string. Order the functions from least to most obscuring.

For each string in v1, apply an obscuring function repeatedly until it can no longer reduce the size of the string, then move on to the next function.

When the desired size x has been reached, check that the resulting string is unique with respect to the strings already in v2. If it is, add it to v2; if not, continue applying obscuring functions.
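
The loop described above might look roughly like the following sketch. It assumes "at most x characters" and uses three deterministic placeholder reducers (collapse_spaces, drop_last_vowel, drop_last_char) that are illustrative assumptions rather than the functions proposed in this answer.

```python
def collapse_spaces(s):
    """Replace the first run of two spaces with one space."""
    return s.replace("  ", " ", 1)


def drop_last_vowel(s):
    """Remove the last vowel in s, if there is one."""
    for i in range(len(s) - 1, -1, -1):
        if s[i] in "aeiouAEIOU":
            return s[:i] + s[i + 1:]
    return s


def drop_last_char(s):
    """Most obscuring fallback: chop off the final character."""
    return s[:-1]


def shorten(strings, x, reducers):
    """Apply each reducer repeatedly, in order, until the string has at most
    x characters and does not clash with anything already emitted."""
    taken = set()
    result = []
    for s in strings:
        current = s
        for reduce_once in reducers:
            while len(current) > x or current in taken:
                smaller = reduce_once(current)
                if len(smaller) >= len(current):
                    break          # this reducer is exhausted; try the next
                current = smaller
            if len(current) <= x and current not in taken:
                break              # short enough and unique
        if len(current) > x or current in taken:
            raise ValueError("could not shorten %r uniquely" % s)
        taken.add(current)
        result.append(current)
    return result
```

A call such as shorten(["number of colors", "number of colours"], 12, [collapse_spaces, drop_last_vowel, drop_last_char]) keeps removing vowels from the second string after it collides with the first, which is the "continue applying obscuring functions" step.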

Below are some ideas for size-reducing functions, subjectively ordered from least to most obscuring. (The random selections are intended to increase the likelihood that a given string ends up unique.)

  • Replace a random occurrence of two space characters with a single space.
  • Replace a random occurrence of punctuation followed by a space with a single space.
  • Delete a random one-character word that is also a member of a kill list (for example, "I", "a").
  • Delete a random two-character word that is also a member of a kill list (for example, "an", "of").
  • Delete a random three-character word that is also a member of a kill list (for example, "the", "and").
  • Replace a word of five or more characters with a word made from its first and last few characters (for example, “number” becomes “numr”, “colors” becomes “colrs”).
  • Remove a random vowel.
  • Delete a word that occurs in a large number of strings in v1. The idea is that very common words carry little meaning.
  • Translate a word or phrase into a shorter "license plate" word using a dictionary (thesaurus) (for example, http://www.baac.net/michael/plates/index.html ).

(Note: the last two functions will require access to the original unchanged string and the correspondence between unchanged and changed words.)
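
A few of the simpler ideas from the list could be sketched as follows. The function names, the KILL_LIST contents (taken from the examples above) and the rng parameter are assumptions made for illustration; each function returns its input unchanged when it has nothing left to remove.

```python
import random
import re

# Example kill list built from the words mentioned in the list above.
KILL_LIST = {"i", "a", "an", "of", "the", "and"}


def collapse_double_space(s, rng=random):
    """Replace one randomly chosen run of two spaces with a single space."""
    spots = [m.start() for m in re.finditer("  ", s)]
    if not spots:
        return s
    i = rng.choice(spots)
    return s[:i] + s[i + 1:]


def drop_punctuation(s, rng=random):
    """Replace one randomly chosen "punctuation + space" pair with a space."""
    spots = [m.start() for m in re.finditer(r"[^\w\s] ", s)]
    if not spots:
        return s
    i = rng.choice(spots)
    return s[:i] + s[i + 1:]


def drop_kill_word(s, length, rng=random):
    """Delete one random kill-list word that has exactly `length` characters."""
    words = s.split(" ")
    hits = [i for i, w in enumerate(words)
            if len(w) == length and w.lower() in KILL_LIST]
    if not hits:
        return s
    del words[rng.choice(hits)]
    return " ".join(words)


def drop_vowel(s, rng=random):
    """Remove one randomly chosen vowel."""
    spots = [i for i, c in enumerate(s) if c in "aeiouAEIOU"]
    if not spots:
        return s
    i = rng.choice(spots)
    return s[:i] + s[i + 1:]
```

Passing a seeded random.Random instance as rng makes the choices reproducible, which helps when comparing shortened strings across runs.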

+1
    def split_len(seq, length):
        return [seq[i:i+length] for i in range(0, len(seq), length)]

    newListOfString = []
    for item in listOfStrings:
        newListOfString.append(split_len(item, 8)[0])

This returns the first eight characters of each string.
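
For nonempty strings, plain slicing gives the same prefix, and it also copes with empty strings. The sketch below (truncate_all and duplicates are hypothetical names) additionally reports which truncated strings collide, since truncation alone does not guarantee uniqueness.

```python
from collections import Counter


def truncate_all(strings, x):
    """Keep the first x characters of every string."""
    return [s[:x] for s in strings]


def duplicates(strings):
    """Return the strings that occur more than once, sorted."""
    return sorted(s for s, n in Counter(strings).items() if n > 1)
```

For example, "number of colors" and "number of columns" both truncate to the same eight characters, so duplicates() would flag that prefix.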

-1

Source: https://habr.com/ru/post/912201/

