Need help getting the HTML of a website in Java

Question

I took some code from another question about HttpURLConnection cutting off HTML, and I use almost the same code to fetch HTML from websites in Java. It works for every site I have tried except one.

I am trying to get HTML from this site:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting garbage characters, although the same code works fine with any other site, such as http://www.google.com.

And this is the code I'm using:

public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n"); 
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I do not understand why it does not work with the above URL.

Any help would be appreciated.

java html httpurlconnection

bits Aug 4 '10 at 14:02

BalusC · Accepted Answer · 2010-08-04T14:06:46+0000

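That website sends a gzip-compressed response regardless of what the client supports. Normally a server should gzip the response only when the client announces support for it (Accept-Encoding: gzip). You need to decompress it with GZIPInputStream.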

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));

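Note that I also passed the charset to the InputStreamReader constructor. Normally you would extract it from the Content-Type header of the response.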
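A more defensive variant of the line above could be sketched as follows. It is only a sketch under some assumptions: the class ResponseDecoder and its methods charsetOf and readBody are hypothetical names, it assumes the server labels the compressed body with a Content-Encoding: gzip header, and it falls back to UTF-8 when the Content-Type header carries no usable charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the original answer.
final class ResponseDecoder {

    // Pulls the charset out of a Content-Type value such as "text/html; charset=UTF-8".
    // Falls back to UTF-8 (an assumption) when the header is missing or unusable.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Reads the whole body as text, decompressing only when the
    // Content-Encoding value says the body is gzipped.
    static String readBody(InputStream raw, String contentEncoding, String contentType)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charsetOf(contentType)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }
}
```

With the connection from the question, this would be called as ResponseDecoder.readBody(connection.getInputStream(), connection.getContentEncoding(), connection.getContentType()). Checking the header first means the same code keeps working for sites that do not compress their responses.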

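For more hints, see also: How to use URLConnection to fire and handle HTTP requests? If all you ultimately want is to parse or extract information from the HTML, I would recommend using an HTML parser such as Jsoup instead.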
