Get file names as UTF-8? (ä, ü, ö ... always show up as "?")

I need to read the names of some files and put them in a list as strings. That part is not difficult, but I have problems with certain characters such as ä, ö, ü: they always show up as "?" in my strings.

What is the problem? The encoding, presumably. That should be easy to fix, or so I thought. So I tried things like:

new String(insert.getBytes("UTF-8")) or new String(insert.getBytes("ISO-8859-1"), "UTF-8"), because most of the files are ISO-8859-1.

Neither helps. This is my code:

    ...
    File[] fileList = dir.listFiles();
    String insert;
    for (File f : fileList) {
        ...
        insert = f.getName().substring(0, f.getName().length() - 4);
        insert = insert.charAt(0) + insert.substring(1, insert.length()).toLowerCase()
                .replaceFirst("([0-9]*(_s?(i)?(_dat)?)*$)", "")
                .replaceFirst("_", " ");
        ...
        System.out.println("test UTF8: " + new String(insert.getBytes("UTF-8"))); // not helping
        System.out.println("test ISO , UTF8: " + new String(insert.getBytes("ISO-8859-1"), "UTF-8")); // not helping
        ...
        names.add(insert);
    }

At the end there are many strings with '?' characters in my list. How can I solve this? And what is the best approach if the files are not all ISO-8859-1? (Say there are many files with unknown encodings.)

Thanks!

5 answers

Given the extended comments on this question, it looks like this is either a font problem or (perhaps more likely) a problem with the encoding of the file names.

I asked Lissy to run the following commands to narrow down the problem. If she is sure that the file names contain "ä", but the character does not appear when she lists them with ls, these commands will tell us whether this is a font problem or an encoding problem.

    touch filenäme
    ls filen*me

If ls shows "filenäme" in its output, then we know that the problem lies in how the files were created on or copied to this system. This can happen if the program that created the files did not know what the file system's encoding was, or was too dumb to do the right thing. The convmv program is likely to be the best way to fix this.

 convmv -f ENCODING -t utf8 -r . 

The question is what the correct source encoding is. Candidates include UTF-16, cp850, or possibly iso8859-1. convmv --list will show you the encodings known on your system. Since the above command only shows what it would do (nothing is renamed unless you add --notest), you can safely run it several times with different encodings until you find the one that works for all files.

If this is a font problem, we will need to study what

+3
source

The encoding of a file's contents has nothing to do with the encoding of the file name itself.

You should get the correct results from System.out.println(insert).

If you do not, the shell is using a character encoding different from your system's default (this is rare, and is usually the result of an explicit command to switch encodings in the shell).

If the file names are displayed correctly when you list the directory in the shell, I expect them to be displayed correctly without specifying any encoding in your Java program.


If the shell is unable to display the character (it replaces the replacement character 0xFFFD (& # xFFFD;) for these non-printable characters), you cannot do anything from your Java application to change this. You need to change the character encoding of the terminal, install the correct fonts, etc .; this is an operating system problem, not a Java problem.

At the same time, even if your terminal cannot display the correct results, the Java program must correctly process character encodings without your intervention.

The API File library calculates the correct character encoding for your system and makes the necessary character decoding. Likewise, the database driver must negotiate with the database to determine the correct encoding and do any necessary encoding in bytes on behalf of your application.
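To illustrate why the conversions tried in the question cannot help, here is a minimal sketch (the class and method names are made up for this example). It assumes the String returned by getName() already holds the right characters; re-encoding it as ISO-8859-1 and decoding those bytes as UTF-8 then damages the name rather than repairing it.

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Mirrors the question's second attempt: encode as ISO-8859-1, decode as UTF-8.
    static String latin1ThenUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Encodes and decodes with the same charset, so the text comes back unchanged.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e4 is "ä" written as an escape, so the source file encoding does not matter.
        String name = "filen\u00e4me";
        System.out.println(utf8RoundTrip(name).equals(name));  // true
        System.out.println(latin1ThenUtf8(name).equals(name)); // false
    }
}
```

The question's first attempt, new String(insert.getBytes("UTF-8")), decodes with the platform default charset, so it only returns the original text when that default happens to be UTF-8.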

+1
source

Unexpected question marks, splats, etc. in a String are a sign that something somewhere does not recognize a particular character when converting from one character set to another.

In your case, the problem could arise in either of two places:

  • This can happen when your Java program reads file names from a directory (in a call to dir.listFiles() ).

  • This can happen when you print characters to the console stream.

In either case, the root cause is most likely a mismatch between what Java thinks the locale settings are and the settings of the operating system and/or shell.
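One way to see what Java believes the settings are is to print them and compare them with the shell's locale. This is only a sketch: the class name is invented for the example, and sun.jnu.encoding is a JDK-internal property that may be null on some runtimes.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class EncodingSettings {
    // Collects the encoding-related settings as the JVM sees them.
    static String describe() {
        return "default charset:  " + Charset.defaultCharset().name() + "\n"
             + "file.encoding:    " + System.getProperty("file.encoding") + "\n"
             // Implementation-specific property; may print "null" on some JVMs.
             + "sun.jnu.encoding: " + System.getProperty("sun.jnu.encoding") + "\n"
             + "default locale:   " + Locale.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the values printed here disagree with what the terminal reports for its own locale, that mismatch is a plausible suspect.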

As an experiment, try listing a directory containing problematic file names from the command line. Do you see question marks or other signs there?

The second experiment is to modify your Java program to dump one of the problem strings as a sequence of numbers representing the character code of each character. Do you see the ASCII/Unicode character code for '?' (63, or 0x3F)?
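The second experiment could look roughly like this. The class and method names are hypothetical; in the real program you would pass in one of the strings returned by f.getName().

```java
public class CodePointDump {
    // Returns the Unicode code points of s in U+XXXX notation, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e4 is "ä" written as an escape, so the source file encoding does not matter.
        System.out.println(dump("filen\u00e4me"));
        // prints: U+0066 U+0069 U+006C U+0065 U+006E U+00E4 U+006D U+0065
    }
}
```

If the dump shows U+00E4 where the "ä" should be, Java has the right character and the loss happens later, at display or storage time. If it shows U+003F or U+FFFD, the character was already lost when the name was read.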

+1
source

In the comment you wrote:

@mdrg: well, here is the problem. I have to read the name of the files and then put them in the database. And there is a lot of “?” That shouldn't be ... - Lissy 27 minutes ago

My guess is that the column into which you insert the file names specifies US-ASCII as its encoding, and characters outside that range are replaced with a substitute character, which in your case is a question mark.

So, you need to find out the encoding for the column in the database table where the file names are stored. Different products have different syntaxes for retrieving this information.
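The same kind of substitution can be reproduced in plain Java, which may help to recognize the symptom. This sketch only imitates an ASCII-only sink; it is not what any particular database does internally, and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

public class AsciiLoss {
    // Pushes a string through US-ASCII and back, as an ASCII-only store would.
    static String throughAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII); // unmappable characters become '?'
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e4 is "ä" written as an escape, so the source file encoding does not matter.
        System.out.println(throughAscii("filen\u00e4me")); // prints: filen?me
    }
}
```

Once a character has been replaced this way, the original cannot be recovered from the stored value, so the fix has to be on the column or connection encoding.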

0
source

In Java 1.6, you can use System.console() instead of System.out.println() to display accented characters on the console.

    public class Test {
        public static void main(String args[]) {
            String s = "caractères français : à é \u00e9"; // \u00e9 is the Unicode escape for "é"
            System.console().writer().println(s);
        }
    }

and the output is

    C:\temp>java Test
    caractères français : à é é
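Note that System.console() returns null when no console is attached, for example when output is redirected or the program runs inside some IDEs. A slightly more defensive sketch follows; the class name and the UTF-8 fallback are assumptions, and the fallback only displays correctly if whatever reads the output expects UTF-8.

```java
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class ConsoleOut {
    // Uses the console's writer when a console is attached,
    // otherwise wraps System.out in a writer with an explicit charset.
    static PrintWriter writer() {
        if (System.console() != null) {
            return System.console().writer();
        }
        // Assumption: the consumer of the redirected output expects UTF-8.
        return new PrintWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        // Escapes keep the example independent of the source file encoding.
        writer().println("caract\u00e8res fran\u00e7ais : \u00e0 \u00e9");
    }
}
```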
0
source

Source: https://habr.com/ru/post/889373/