Problem with Java and UTF-8

I use an HTML parser called HtmlCleaner to parse HTML pages. The problem is that each page uses a different encoding. My question:

Can I change any character encoding to UTF-8?

4 answers

Where do you get the HTML page from? If you get it from a servlet request, you can call getReader() and pass the result to clean(); the reader will use the correct encoding. If you get it from the download, pass the input stream to clean(). If you get it through the HTTP client, you need to check the charset of the Content-Type response header using getResponseCharSet().
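As a rough illustration of the last case, the charset can be read out of a raw Content-Type header value with the standard library alone. This is only a sketch: the class name ContentTypeCharset, the method charsetOf and the fallback parameter are made up for this example and are not part of HtmlCleaner or any HTTP client.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner or any HTTP client library.
final class ContentTypeCharset {
    // Matches charset=xxx or charset="xxx" anywhere in the header value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback if the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return fallback;
    }
}
```

For example, charsetOf("text/html; charset=iso-8859-1", StandardCharsets.UTF_8) yields the ISO-8859-1 charset, which can then be used to build the Reader that is handed to the parser.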


You cannot "convert" from encoding X to encoding Y without knowing in advance what encoding X is. If you fetch these HTML pages over HTTP, check which encoding the HTTP response header declares, and then use that encoding in the HTML parser.
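Once the source encoding is known, the conversion itself is a decode followed by an encode. A minimal sketch, assuming the whole page is already in memory as a byte array; the class name Reencode and the method toUtf8 are illustrative names only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the decode-then-encode step.
final class Reencode {
    // Decodes the bytes with the known source charset, then encodes the
    // resulting text as UTF-8. Passing the wrong source charset silently
    // produces wrong characters, which is why it must be known up front.
    static byte[] toUtf8(byte[] input, String sourceCharsetName) {
        Charset source = Charset.forName(sourceCharsetName);
        String text = new String(input, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For instance, the single ISO-8859-1 byte 0xE9 ("é") comes back from toUtf8 as the two UTF-8 bytes 0xC3 0xA9.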


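Can I change any character encoding to UTF-8?

Yes, any text can be converted to UTF-8, as long as you know the encoding it is currently in.

Note that an HTML page can also declare its "charset" in a meta tag, for example:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

If you convert the page, you need to update this tag to match the actual encoding.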
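One possible way to update the tag after converting a page to UTF-8 is a regular-expression replacement on the decoded HTML text. This is a sketch rather than a robust HTML rewrite: the class name MetaCharset and the method declareUtf8 are invented for this example, and it assumes the charset appears inside a meta tag in one of the two common forms.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a real page may need a proper HTML parser instead.
final class MetaCharset {
    // Captures everything from "<meta" up to and including "charset=" (plus an
    // optional opening quote), then matches the charset name that follows.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(<meta[^>]*charset\\s*=\\s*[\"']?)[^\\s\"'>;/]+",
            Pattern.CASE_INSENSITIVE);

    // Rewrites the declared charset of every matching meta tag to UTF-8.
    static String declareUtf8(String html) {
        return META_CHARSET.matcher(html).replaceAll("$1UTF-8");
    }
}
```

Applied to the tag shown above, this produces <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">; the shorter <meta charset="..."> form is handled the same way.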

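// Replaces every character in the range 161-255 with its numeric HTML
// entity (for example "é" becomes "&#233;"), so those characters survive
// regardless of the encoding the page is later served in.
public String arreglarString(String cadena) {
    for (int i = 161; i < 256; i++) {
        char car = (char) i;
        // replace() treats its arguments literally; no regex is needed here.
        cadena = cadena.replace(String.valueOf(car), "&#" + i + ";");
    }

    return cadena;
}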

Source: https://habr.com/ru/post/1733004/

