Convert HTML character encoding in Java

We are trying to download the source of web pages, but some specific characters (for example ü, ö, ş, ç) are not displayed properly because of the character encoding. We tried the following code to convert the encoding of a string (the "text" variable):

byte[] xyz = text.getBytes(); text = new String(xyz,"windows-1254"); 

We noticed that if the page's encoding is UTF-8, we still cannot see the pages correctly. What should we do?

2 answers

Tell the String constructor to use UTF-8 to interpret the bytes if you know that the page encodes its content as UTF-8.

However, I am not sure that this is the full extent of your problem. You already have a "text" before you try to "convert" it. This means that something has already interpreted the page bytes as a string, according to some encoding. If that was the wrong encoding, then nothing you do later is guaranteed to fix it, because the original bytes may already be lost.

Instead, you need to fix this upstream.

 byte[] bytesOfThePage = ...; String text = new String(bytesOfThePage, "UTF-8"); 
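To make the difference concrete, here is a minimal sketch. The class and method names are made up for illustration, and US-ASCII is picked as the "wrong" encoding only because it makes the loss easy to see. It decodes the same bytes once correctly and once incorrectly, then tries to repair the second result afterwards:

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    /** Decodes the raw page bytes as UTF-8 (assumes the page really is UTF-8). */
    static String decodeUtf8(byte[] pageBytes) {
        return new String(pageBytes, StandardCharsets.UTF_8);
    }

    /** Decodes with the wrong charset first, then tries to "convert" afterwards. */
    static String decodeWrongThenConvert(byte[] pageBytes) {
        // Bytes that are not valid US-ASCII are replaced during decoding.
        String garbled = new String(pageBytes, StandardCharsets.US_ASCII);
        // Getting bytes back from the garbled string does not restore the originals.
        byte[] reEncoded = garbled.getBytes(StandardCharsets.US_ASCII);
        return new String(reEncoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "üöşç" written with escapes so the source file encoding does not matter.
        byte[] pageBytes = "\u00fc\u00f6\u015f\u00e7".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(pageBytes));
        System.out.println(decodeWrongThenConvert(pageBytes));
    }
}
```

With some wrong encodings (ISO-8859-1, for example) the round trip happens to be reversible, but you should not rely on that. Decoding the raw bytes once with the right encoding is the dependable fix.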

The problem most likely lies in where you read, write and/or display these characters.

If you are reading these characters using a Reader, you first need to construct an InputStreamReader using the 2-argument constructor, in which you pass the correct encoding (here UTF-8) as the second argument. For instance:

 reader = new InputStreamReader(url.openStream(), "UTF-8"); 
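A slightly fuller sketch of the reading side follows. The PageReader class and readAll method are hypothetical names, and a ByteArrayInputStream stands in for url.openStream() so that the example runs without network access:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PageReader {

    /** Reads the whole stream into a String, decoding the bytes as UTF-8. */
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        // The second constructor argument fixes the charset instead of using the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): UTF-8 bytes of a tiny page containing "üöşç".
        byte[] pageBytes = "<p>\u00fc\u00f6\u015f\u00e7</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(pageBytes)));
    }
}
```

This assumes the page really is UTF-8. For a real page, the charset declared in the Content-Type header or in the HTML meta tag tells you which encoding to pass.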

If you write these characters to a file, you need to construct an OutputStreamWriter using the 2-argument constructor, in which you pass the correct encoding (here UTF-8) as the second argument. For instance:

 writer = new OutputStreamWriter(new FileOutputStream("/page.html"), "UTF-8"); 
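As a rough sketch of the writing side, the example below writes to a temporary file (not the "/page.html" path above) and reads the raw bytes back to show what ended up on disk. PageWriter and its methods are illustrative names only:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageWriter {

    /** Writes the text to the target file, encoding the characters as UTF-8. */
    static void writeUtf8(Path target, String text) {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the text to a temporary file and returns the raw bytes that were stored. */
    static byte[] writeToTempFileAndReadBytes(String text) {
        try {
            Path tmp = Files.createTempFile("page", ".html");
            try {
                writeUtf8(tmp, text);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeToTempFileAndReadBytes("\u015f"); // the single character ş
        System.out.println(bytes.length + " bytes written");
    }
}
```

Any program that later opens the file has to read it as UTF-8 as well, otherwise the same garbling reappears on the reading side.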

If you simply print these characters to stdout (for example with System.out.println(line)), then you need to make sure that stdout itself uses the correct encoding (here UTF-8). In an IDE such as Eclipse, you can configure this via Window > Preferences > General > Workspace > Text file encoding.
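On the Java side, one possible approach is to wrap System.out in a PrintStream with an explicit encoding. This is a sketch with made-up class and method names. It only controls the bytes Java emits, so the console or terminal still has to be set to UTF-8 for the output to look right:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Stdout {

    /** Wraps a stream in an auto-flushing PrintStream that encodes text as UTF-8. */
    static PrintStream utf8PrintStream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the bytes that printing the given text through the UTF-8 stream produces. */
    static byte[] printedBytes(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream ps = utf8PrintStream(sink);
        ps.print(text);
        ps.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        PrintStream out = utf8PrintStream(System.out);
        out.println("\u00fc\u00f6\u015f\u00e7"); // "üöşç", written with escapes
    }
}
```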


Source: https://habr.com/ru/post/1299446/

