Save sentence as server file name

I am saving recordings of a set of sentences to a corresponding set of audio files.

Sentences include:

 Ich weiß es nicht!
 ¡No lo sé!
 Ég veit ekki!

How would you recommend converting each sentence into a human-readable file name that will later be served from an online server? I do not know yet which languages I may have to handle in the future.

UPDATE:

Please note that two different sentences must not collide with each other. For instance:

 É bär icke dej.
 E bår icke dej.

must not resolve to the same file name, since one file would overwrite the other. This is a problem with the slugify function mentioned here: Turn the string into a valid file name?

The best I have come up with is urllib.parse.quote. However, I think the result is harder to read than I would have hoped. Any suggestions?

 Ich%20wei%C3%9F%20es%20nicht%21
 %C2%A1No%20lo%20s%C3%A9%21
 %C3%89g%20veit%20ekki%21
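For reference, here is a minimal sketch of what the quoting approach does with these sentences. It uses only the standard library; the variable names are purely illustrative. It is meant to show why the result cannot collide (the encoding is reversible), not to argue that it is readable.

```python
from urllib.parse import quote, unquote

sentences = [
    "Ich weiß es nicht!",
    "¡No lo sé!",
    "Ég veit ekki!",
    "É bär icke dej.",
    "E bår icke dej.",
]

# quote() percent-encodes the UTF-8 bytes of every character
# outside its small safe set (ASCII letters, digits, "_.-~" and "/").
names = [quote(s) for s in sentences]

# The encoding is reversible: unquote() restores each sentence exactly.
restored = [unquote(n) for n in names]

# Because it is reversible, distinct sentences always get distinct names.
all_distinct = len(set(names)) == len(sentences)

for name in names:
    print(name)
```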
3 answers

What about unidecode?

 import unidecode

 a = [u'Ich weiß es nicht!', u'¡No lo sé!', u'Ég veit ekki!']
 for s in a:
     print(unidecode.unidecode(s).replace(' ', '_'))

This gives clean ASCII strings that are easy to process further if they still contain unwanted characters. Keeping the spaces visible as underscores helps readability.

 Ich_weiss_es_nicht!
 !No_lo_se!
 Eg_veit_ekki!

If uniqueness is a problem, a hash or something like that can be added to the strings.

Edit:

Some clarification seems to be needed regarding hashing. Many hash functions are explicitly designed to produce very different outputs for similar inputs. For example, Python's built-in hash function gives:

 In [1]: hash('¡No lo sé!')
 Out[1]: 6428242682022633791

 In [2]: hash('¡No lo se!')
 Out[2]: 4215591310983444451

With this you can do something like

 unidecode.unidecode(s).replace(' ', '_') + '_' + str(hash(s))[:10] 

to keep the names from getting too long. Even with hashes truncated like this, collisions are pretty unlikely.
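One caveat with the snippet above: in Python 3 the built-in hash() of a string is salted per process unless PYTHONHASHSEED is set, so the suffix changes between runs, and it can also be negative. A sketch of the same idea with a stable digest from hashlib might look like the following. The function names are hypothetical, and ascii_fallback is only a standard-library stand-in so the example runs without extra packages; it is not equivalent to unidecode.

```python
import hashlib
import unicodedata


def ascii_fallback(text):
    """Stdlib stand-in for unidecode (an assumption, not the same thing).

    Strips accents via NFKD decomposition, then drops whatever is still
    non-ASCII. Unlike unidecode it loses characters such as 'ß' entirely.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def stable_name(sentence, digest_len=10):
    """Readable ASCII part plus a short digest of the ORIGINAL sentence."""
    readable = ascii_fallback(sentence).replace(" ", "_")
    # sha256 of the UTF-8 bytes is the same in every process and on every machine.
    digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()[:digest_len]
    return f"{readable}_{digest}"


# The two colliding sentences share a readable part but differ in the suffix.
first = stable_name("É bär icke dej.")
second = stable_name("E bår icke dej.")
print(first)
print(second)
```

The digest is computed from the original sentence rather than from the transliterated text, which is what keeps the two Swedish examples apart.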


You could convert the spaces to another character, so that your string becomes É-bär-icke-dej.

If you use Python, I would do it as follows.

  • Replace spaces with another character such as a hyphen (-), but not a slash (/), which cannot appear in a file name.
 mystring = mystring.replace(' ', '-')

  • Detect the character encoding with the chardet Python package.

  • Decode your string with Python's decode method (this applies only while it is still a byte string):

 mystring = mystring.decode(encoding)  # encoding: the one chardet detected

  • Check whether the file name already exists in your directory, using Python's os module. Something like:
 files = os.listdir(directory)  # directory: path to the target directory
 # count how many existing file names contain this one
 redundance = 0
 for name in files:
     if mystring in name:
         redundance += 1
  • Append that count to the string:
 if redundance != 0:
     mystring = mystring + str(redundance)

  • Use the resulting string as the file name!
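The existence check in the steps above can be sketched as a small self-contained function. This is only one possible reading of the idea: the function name, the "-1", "-2" suffix scheme, and the ".wav" extension are assumptions for illustration. It tests for an exact name match instead of counting substring matches, so unrelated files that merely contain the name do not affect the counter.

```python
import os
import tempfile


def next_free_name(directory, base, ext=".wav"):
    """Return base+ext, or base-1+ext, base-2+ext, ... if that name is taken.

    ext=".wav" is a hypothetical default; use whatever your audio format is.
    """
    candidate = base + ext
    counter = 1
    # Keep trying numbered variants until one does not exist yet.
    while os.path.exists(os.path.join(directory, candidate)):
        candidate = f"{base}-{counter}{ext}"
        counter += 1
    return candidate


# Usage: a throwaway directory stands in for the real upload folder.
with tempfile.TemporaryDirectory() as folder:
    name = next_free_name(folder, "É-bär-icke-dej")
    open(os.path.join(folder, name), "w").close()
    print(name)
    print(next_free_name(folder, "É-bär-icke-dej"))
```

Note that checking and then creating the file is not atomic, so this is only safe when a single process writes to the directory.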

Hope this helps!


The only illegal characters in traditional Unix/Linux file names are the slash (/, U+002F) and the null character (U+0000). There is no need to convert anything else in your human-readable examples.
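If you want to guard against those two characters before writing a file, a minimal check could look like this. The function name is hypothetical, and the extra rejection of the empty string, "." and ".." is an addition here, since those cannot be used as ordinary file names either. It does not cover the stricter rules of other systems such as Windows.

```python
def is_valid_unix_filename(name):
    """True if name can be used as a single path component on Unix/Linux."""
    if name in ("", ".", ".."):
        # Empty names and the two directory entries are not usable file names.
        return False
    # Slash separates path components; NUL terminates the string in the C API.
    return "/" not in name and "\x00" not in name


for sentence in ["Ich weiß es nicht!", "¡No lo sé!", "Ég veit ekki!"]:
    print(sentence, is_valid_unix_filename(sentence))
```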

If you need to make the files available to systems that do not use the same file name encoding, for example for download over FTP or from a web server, you may want to expose the names explicitly as UTF-8. On most modern U*xes this should be the default anyway. This is consistent with the results you get from urllib.parse.quote: percent-encoding is a safe and reasonably standard way to get a machine-readable, unambiguous representation of the encoded name. If you embed the names in an HTML snippet or something similar, you can keep the displayed text human-readable and keep only the link machine-readable.

 <a href="%C3%89g%20veit%20ekki%21">Ég veit ekki!</a> 
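A link like the one above could be generated along these lines. This is a sketch with a hypothetical function name, using only the standard library: quote() for the href and html.escape() for the visible text.

```python
import html
from urllib.parse import quote


def audio_link(sentence):
    """Build an anchor whose href is percent-encoded and whose text stays readable."""
    href = quote(sentence)        # machine-readable, ASCII-only
    text = html.escape(sentence)  # human-readable, safe to place inside HTML
    return f'<a href="{href}">{text}</a>'


print(audio_link("Ég veit ekki!"))
```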

Source: https://habr.com/ru/post/1273714/

