How to convert strings in any language and character set to valid file names in Java?

I need to generate file names from user-entered names. These names can be in any language. For instance:

  • "John Smith"
  • "高 岡 和 子"
  • "محمد سعيد بن عبد العزيز الفلسطيني"

These are user-entered values, so I cannot guarantee that the names contain only characters that are valid in a file name.

Users will download these files from their browser, so I need to make sure that the file names are valid on all operating systems in all configurations.

I currently handle this for English-speaking countries by deleting every character that is not alphanumeric or white space, and then replacing white space with underscores:

    string = string.replaceAll("[^a-zA-Z0-9\\s]", "");
    string = string.replaceAll("\\s+", "_");

Some conversion examples:

  • "John Smith" → "John_Smith.ext"
  • "John O'Henry" → "John_OHenry.ext"
  • "John van Smith III" → "John_van_Smith_III.ext"

Obviously, this does not work internationally.

I considered finding or building a blacklist of all characters that are invalid on any file system and stripping them from the names, but I could not find an exhaustive list.

I would rather use existing code in a shared library if possible. I assume this is a solved problem, but I can't find a solution that works internationally.

The file name is for the user downloading the file, not for me. I am not going to store these files. They are generated dynamically by the server on request from data in the database. The file names are for the convenience of the user downloading the file.

+6
6 answers

The regex [^a-zA-Z0-9] removes everything except ASCII letters and digits, so it also strips every Unicode character with a code point above 127.

Assuming you want to turn user input into valid file names by replacing the invalid characters ? \ / : | < > * with an underscore ( _ ):

    public class ReplaceI18N {
        public static void main(String[] args) {
            String[] names = {
                "John Smith",
                "高岡和子",
                "محمد سعيد بن عبد العزيز الفلسطيني",
                "|J:o<h>n?Sm\\it/h*",
                "高?岡和\\子*",
                "محمد /سعيد بن عبد ?العزيز :الفلسطيني\\"
            };
            for (String s : names) {
                // Java strings are already Unicode, so no byte-level
                // re-encoding is needed before filtering.
                // Replace each of ? \ / : | < > * with a space ...
                String u = s.replaceAll("[\\?\\\\/:|<>\\*]", " ");
                // ... then turn every run of white space into one underscore.
                u = u.replaceAll("\\s+", "_");
                System.out.println(s + " = " + u);
            }
        }
    }

Output:

    John Smith = John_Smith
    高岡和子 = 高岡和子
    محمد سعيد بن عبد العزيز الفلسطيني = محمد_سعيد_بن_عبد_العزيز_الفلسطيني
    |J:o<h>n?Sm\it/h* = _J_o_h_n_Sm_it_h_
    高?岡和\子* = 高_岡和_子_
    محمد /سعيد بن عبد ?العزيز :الفلسطيني\ = محمد_سعيد_بن_عبد_العزيز_الفلسطيني_

The resulting file names, including those with Unicode characters, display correctly on any web page served as UTF-8, provided a font covering those characters is available.

In addition, each of them is a valid file name on any OS file system that supports Unicode (tested OK on Windows XP and Windows 7).


If you need to pass a file name as part of a URL, encode it with URLEncoder and decode it again on the receiving side with URLDecoder.
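A minimal sketch of that round trip follows. The class name FileNameUrlCodec is made up for illustration, and the Charset overloads used here assume Java 10 or later; on older versions the overloads taking the charset name as a String would be needed instead.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class FileNameUrlCodec {

    // Percent-encodes the UTF-8 bytes of the name.
    // Note that URLEncoder writes a space as '+', not "%20".
    static String encode(String fileName) {
        return URLEncoder.encode(fileName, StandardCharsets.UTF_8);
    }

    // Reverses encode(): '+' becomes a space, %XX sequences become UTF-8 bytes.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

With this sketch, encode("John Smith") gives "John+Smith", and decode(encode(name)) returns the original name.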

+3

Accepting a user-entered file name without proper sanitization is prone to security attacks. You can use a hash function (SHA-1, MD5) to generate a valid file name. Just keep in mind that you cannot recover the original name from the hash.
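As a rough sketch of the hashing idea, assuming SHA-1 over the UTF-8 bytes of the name and a hex rendering of the digest (the class and method names are invented for this example):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any library.
final class HashedFileName {

    // Returns 40 lowercase hex digits plus the given extension.
    static String of(String name, String extension) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex + "." + extension;
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
    }
}
```

The result contains only the characters 0-9, a-f and a dot, so it is valid everywhere, but it tells the downloading user nothing about the original name.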

In addition, if you can have a simple lookup table, you can assign special identifiers for names (for example, sequential numbers or GUIDs) and use the identifier as the file name.
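The lookup-table variant could look roughly like this. The in-memory map is an assumption made to keep the sketch self-contained (a real application would more likely keep the mapping in its database), and FileNameRegistry is a made-up name.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory lookup table from user-entered names to generated IDs.
final class FileNameRegistry {

    private final Map<String, String> idByName = new ConcurrentHashMap<>();

    // Returns the same generated file name every time the same name is requested.
    String fileNameFor(String userName, String extension) {
        String id = idByName.computeIfAbsent(userName, n -> UUID.randomUUID().toString());
        return id + "." + extension;
    }
}
```

Unlike a hash, the identifier can be mapped back to the original name through the table.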

Another thing, have you thought about homonyms?

0

Encode the file name as UTF-8, and then URL encode the result.

    '高岡和子' -> '%E9%AB%98%E5%B2%A1%E5%92%8C%E5%AD%90'

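The answer does not say where the encoded name should go. One common place, assumed here, is the filename* parameter of the Content-Disposition response header (RFC 6266 with the RFC 5987 encoding). The sketch below builds only the header value; DownloadHeader is a made-up name, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a Content-Disposition header value.
final class DownloadHeader {

    static String contentDisposition(String fileName) {
        String encoded = URLEncoder.encode(fileName, StandardCharsets.UTF_8)
                .replace("+", "%20")   // URLEncoder writes a space as '+'
                .replace("*", "%2A");  // URLEncoder leaves '*' unencoded
        return "attachment; filename*=UTF-8''" + encoded;
    }
}
```

The two replace calls are there because URLEncoder targets HTML form encoding, which differs from plain percent-encoding in exactly these two characters.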
0

Windows appears to support Unicode file names, I know that Linux does, and apparently OS X does too. Presumably, well-written browsers would fix invalid characters in the file name before saving.

Looks like you should just use Unicode file names. Is there any OS or browser on which this does not work?

0

My advice would be to make it a requirement that your application runs on a platform that supports Unicode file names. Most do these days.

I don't think it is possible to map from Unicode to an (unspecified) restricted character set while maintaining human readability AND preserving the original value AND avoiding collisions. In fact, this is not even possible when mapping Latin-1 to ASCII.

If your application needs to run on platforms that do not support Unicode file names, in some cases you will have to sacrifice readability and/or the original value in file names. Also, consider whether (for example) ASCII renderings of Chinese or Cyrillic characters, or letters stripped of their accents, are acceptable to your end users.
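To illustrate the trade-off, here is a small sketch of one such lossy mapping: accent stripping via java.text.Normalizer, followed by dropping whatever is still not ASCII. The class and method names are invented for this example.

```java
import java.text.Normalizer;

// Hypothetical helper showing a lossy Unicode-to-ASCII mapping.
final class AsciiFallback {

    // NFD splits a letter like "é" into "e" plus a combining accent;
    // \p{M} then matches and removes the combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Drops every character that is still outside ASCII after accent stripping.
    static String toAscii(String s) {
        return stripAccents(s).replaceAll("[^\\x00-\\x7F]", "");
    }
}
```

Here "Renée" and "Renee" both map to "Renee", which is a collision, and a name written entirely in Chinese characters maps to the empty string, so readability and the original value are both lost.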


What I would do is offer the user two choices:

  • An option that uses Unicode file names for downloaded files. This should be the default, as most user machines will support this.

  • An alternative that uses generated names that are unrelated to the source strings/text.

Realistically, if the user's machine does not support Unicode, they will have serious problems with text file names that are not encoded in the machine's native encoding. There is no fully reliable way to find out what that encoding is. Even if you had a semi-reliable way to determine it on the server side, mapping all of Unicode onto that encoding is an unsolvable problem.

It is better to recommend that the user upgrade to a Unicode-capable operating system.

0

To summarize and rephrase @eee's answer ...

    String sanitizeFilename(String unsanitized) {
        return unsanitized
            .replaceAll("[\\?\\\\/:|<>\\*]", " ") // filter out ? \ / : | < > *
            .replaceAll("\\s", "_");              // white space as underscores
    }

(Note that this version does not collapse several consecutive spaces into one underscore.)
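The character list used above leaves out a few things Windows also rejects. A somewhat broader sketch, assuming Windows naming rules are the strictest target, additionally handles the double quote, control characters, reserved device names, and trailing dots. WindowsSafeFileName is a made-up name, and unlike the method above this version does collapse white-space runs.

```java
import java.util.regex.Pattern;

// Hypothetical sanitizer based on the Windows file naming rules.
final class WindowsSafeFileName {

    // Characters Windows rejects: < > : " / \ | ? * and control characters 0-31.
    private static final Pattern FORBIDDEN =
            Pattern.compile("[<>:\"/\\\\|?*\\x00-\\x1F]");

    // Device names Windows reserves, with or without an extension.
    private static final Pattern RESERVED =
            Pattern.compile("(?i)(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)?");

    static String sanitize(String name) {
        String s = FORBIDDEN.matcher(name).replaceAll(" ")
                .replaceAll("\\s+", "_");   // runs of white space become one underscore
        s = s.replaceAll("[. ]+$", "");     // names must not end with a dot or space
        if (s.isEmpty() || RESERVED.matcher(s).matches()) {
            s = "_" + s;                    // avoid empty and reserved names
        }
        return s;
    }
}
```

For example, sanitize("nul.txt") returns "_nul.txt" and sanitize("report.") returns "report". Length limits are not handled here.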

0

Source: https://habr.com/ru/post/913179/

