Ideal Java library to clean html and eliminate bad snippets

Question

I have some HTML files that need to be parsed and cleaned, and they occasionally contain special characters such as <, >, ", etc. that have not been properly escaped.

I tried running the files through jTidy, but the best I can get it to do is omit the content it sees as malformed HTML. Is there another library that will escape the malformed fragments instead of dropping them? If not, which library would be the easiest to modify?

Clarification:

Input Example: <p> blah blah <M + 1> blah </p>

Required Output: <p> blah blah &lt;M + 1&gt; blah </p>

+3

java html parsing

Tyler Mar 01 '10 at 19:12

4 answers

I eventually solved this by running a regex pass first and an unmodified TagSoup second.

Here is my regex code to escape unknown tags like <M+1>

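import java.util.Scanner;
import java.util.regex.MatchResult;
import javax.swing.text.html.HTML;

// Walks the input one tag-like chunk at a time and escapes chunks whose tag name is unknown.
private static String escapeUnknownTags(String input) {
    Scanner scan = new Scanner(input);
    StringBuilder builder = new StringBuilder();

    while (scan.hasNext()) {
        // Leading text, then something tag-like: "<", optional "/", body, optional ">".
        String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);

        if (s == null) {
            // No "<" left: escape the remaining text in one piece.
            // (scan.next(".*") returned one token at a time and dropped the whitespace between tokens.)
            builder.append(escape(scan.findWithinHorizon("(?s).+", 0)));
        } else {
            processMatch(s, builder);
        }
    }

    return builder.toString();
}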

private static void processMatch(String s, StringBuilder builder) {

    if (!isKnown(s)) {
        String escaped = escape(s);

        builder.append(escaped);
    }
    else {
        builder.append(s);
    }

}

private static String escape(String s) {
    s = s.replaceAll("<", "&lt;");
    s = s.replaceAll(">", "&gt;");
    return s;
}

private static boolean isKnown(String s) {
    Scanner scan = new Scanner(s);
    if (scan.findWithinHorizon("[^<]*</?([^<> ]*)[^<>]*>?", 10000) == null) {

        return false;
    }

    MatchResult mr = scan.match();

    try {

        String tag = mr.group(1).toLowerCase();

        if (HTML.getTag(tag) != null) {
            return true;
        }
    }
    catch (Exception e) {
        // Should never happen
        e.printStackTrace();
    }

    return false;
}
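
For anyone who wants to try the idea quickly, here is a compact, self-contained sketch of the same approach using java.util.regex.Matcher instead of Scanner. It is not the code above, only an illustration: the class name UnknownTagEscaper and the exact pattern are my own assumptions, and "known" still means whatever javax.swing.text.html.HTML.getTag recognizes.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.html.HTML;

// Hypothetical helper class, named for illustration only.
final class UnknownTagEscaper {
    // "<", optional "/", a tag name (group 2), any attributes, then ">".
    private static final Pattern TAG = Pattern.compile("<(/?)([^<>\\s/]*)[^<>]*>");

    static String escapeUnknownTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Text between tags never contains a real tag, so escape it as-is.
            out.append(escape(input.substring(last, m.start())));
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean known = !name.isEmpty() && HTML.getTag(name) != null;
            // Known tags pass through untouched; everything else is escaped.
            out.append(known ? m.group() : escape(m.group()));
            last = m.end();
        }
        out.append(escape(input.substring(last)));
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeUnknownTags("<p> blah blah <M + 1> blah </p>"));
    }
}
```

Two caveats with this kind of check: a fragment such as <I + 1> would be left alone because "i" is a real tag name, and a bare & is not escaped here.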

0

Tyler Mar 03 '10 at 22:39

HtmlCleaner

HtmlCleaner is an open-source HTML parser written in Java. Given an HTML document, HtmlCleaner reorders the individual elements and produces well-formed XML.

0

Fakrudeen 16 . '10 10:11

javax.swing.text.html.HTML
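
That class ships with the JDK, so a tag-name lookup needs no extra dependency. A minimal sketch, assuming all you need is a yes/no answer for a tag name (the class and method names here are made up for illustration):

```java
import java.util.Locale;
import javax.swing.text.html.HTML;

// Hypothetical wrapper around HTML.getTag, named for illustration only.
final class KnownTagCheck {
    // HTML.getTag returns null for names it does not recognize.
    // The lookup is by lower-case name, so normalize the case first.
    static boolean isKnownTag(String name) {
        return HTML.getTag(name.toLowerCase(Locale.ROOT)) != null;
    }

    public static void main(String[] args) {
        System.out.println("p    -> " + isKnownTag("p"));
        System.out.println("M+1  -> " + isKnownTag("M+1"));
    }
}
```

Keep in mind that the tag list in Swing is fairly old, so newer element names may not be recognized.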

0

Chris 16 . '10 10:34

Adam batkin · Accepted Answer · 2010-03-01T19:17:09+0000

You could also try TagSoup. TagSoup emits plain old SAX events, so you end up with what looks like a well-formed XML document.

I have had very good luck with TagSoup, and I'm always surprised at how well it handles poorly constructed HTML files.
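
Because the output is ordinary SAX, any ContentHandler can consume it. Below is a rough sketch of a handler that just records the events it receives. To keep it runnable without extra jars it is driven by the JDK's own SAX parser, which only accepts well-formed input. With TagSoup on the classpath you would pass new org.ccil.cowan.tagsoup.Parser() as the XMLReader instead. The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, named for illustration only.
final class SaxEventDump {
    // Records each SAX event as one line, e.g. "start:p" or "text:blah".
    static final class Recorder extends DefaultHandler {
        final StringBuilder log = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            log.append("start:").append(qName).append('\n');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            log.append("end:").append(qName).append('\n');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver one run of text in several calls.
            log.append("text:").append(ch, start, length).append('\n');
        }
    }

    // Works with any XMLReader, which is the interface TagSoup's Parser implements.
    static String dump(XMLReader reader, String markup) throws Exception {
        Recorder recorder = new Recorder();
        reader.setContentHandler(recorder);
        reader.parse(new InputSource(new StringReader(markup)));
        return recorder.log.toString();
    }

    // Stand-in driver: the JDK parser, which needs well-formed input.
    static String dumpWithJdkParser(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return dump(reader, markup);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(dumpWithJdkParser("<p>blah <b>bold</b></p>"));
    }
}
```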
