Java: remove HTML from a string without regular expressions

Question


I am trying to remove all HTML elements from a string. Unfortunately, I cannot use regular expressions because I am developing on the Blackberry platform and regular expressions are not yet supported.

Is there any other way to remove HTML from a string? I read somewhere that you can use the DOM Parser, but I could not find much on it.

HTML text:

<![CDATA[As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (<a href="http://www.netflix.com/RoleDisplay/Billy_Bob_Thornton/20000303">Billy Bob Thornton</a>) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (<a href="http://www.netflix.com/RoleDisplay/Bruce_Willis/99786">Bruce Willis</a>) -- the world finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. <a href="http://www.netflix.com/RoleDisplay/Ben_Affleck/20000016">Ben Affleck</a> and <a href="http://www.netflix.com/RoleDisplay/Liv_Tyler/162745">Liv Tyler</a> co-star.]]>

Text without HTML:

As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. Ben Affleck and Liv Tyler co-star.

Thanks!

+4

java html parsing

littleK Mar 21 '10 at 22:17


4 answers

"I can't use regular expressions because I'm developing on the Blackberry Platform"

You couldn't use regular expressions anyway: HTML is a recursive (nested) language, and regular expressions cannot process recursive languages.

You need a parser.

+4

Ejp Mar 22 '10 at 9:25


If you can add external jars, you can try these two small libraries:

tagsoup, a SAX parser
jericho html, another small HTML parser

They both let you strip everything.

I have used jericho many times; to strip tags, you define the extractor however you like:

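    import net.htmlparser.jericho.*;

    class HTMLStripExtractor extends TextExtractor {
        public HTMLStripExtractor(Source src) {
            super(src);
            src.setLogger(null);  // silence the parser's log output
        }

        // Exclude every element except <a>.
        @Override
        public boolean excludeElement(StartTag startTag) {
            return !HTMLElementName.A.equals(startTag.getName());
        }
    }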

+1

Jack Mar 21 '10 at 23:10


I would tackle this the other way around: create a DOM tree from the HTML, and then extract the string from the tree:

Use a library such as TagSoup to parse the HTML, cleaning it up so that it is close to XHTML.
As the cleaned-up XHTML streams through, extract the text you want.
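The second step can be sketched with a plain SAX ContentHandler that collects text nodes and ignores tags. TagSoup's org.ccil.cowan.tagsoup.Parser implements the standard XMLReader interface, so a handler like this should plug into it unchanged. To stay runnable without the extra jar, the sketch below swaps in the JDK's built-in SAX parser, which only accepts markup that is already well-formed. The class names TextCollector and SaxTextDemo are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps every text node, drops every tag. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called by the parser for each run of character data.
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}

class SaxTextDemo {
    /** Parses well-formed markup and returns only its text content. */
    static String extractText(String xhtml) {
        try {
            // Assumption: JDK parser used as a stand-in. With TagSoup on the
            // classpath, the reader would instead be
            // new org.ccil.cowan.tagsoup.Parser().
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TextCollector collector = new TextCollector();
            reader.setContentHandler(collector);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return collector.getText();
        } catch (Exception e) {
            throw new RuntimeException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p>Dan Truman (<a href=\"http://example.com/role\">"
                     + "Billy Bob Thornton</a>) hatches a plan.</p>";
        System.out.println(extractText(xhtml));
    }
}
```

Because the handler never looks at element names, it does not matter which tags the input contains; filtering specific elements would mean overriding startElement() and endElement() as well.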

+1

Jim Ferrans Mar 21 '10 at 23:14


tucuxi · Accepted Answer · 2010-03-21T23:24:31+0000

There are a lot of nuances to parsing HTML in the wild, one of the funniest being that many pages out there do not follow any standard. That said, if all your HTML is as simple as your example, something like this is more than enough:

    char[] cs = s.toCharArray();
    StringBuilder sb = new StringBuilder();
    boolean tag = false;  // true while we are between '<' and '>'
    for (int i = 0; i < cs.length; i++) {
        switch (cs[i]) {
            case '<':
                tag = true;  // start of a tag: stop copying
                break;
            case '>':
                if (tag) tag = false;   // end of the tag: resume copying
                else sb.append('>');    // stray '>' in text is kept
                break;
            case '&':
                // decode the escape into sb and skip past it
                if (!tag) i += interpretEscape(cs, i, sb);
                break;
            default:
                if (!tag) sb.append(cs[i]);
        }
    }
    System.err.println(sb);

Here interpretEscape() is expected to know how to convert HTML escapes such as &gt; into their character counterparts, append the result to sb, and return how many characters to skip to reach the closing ; .
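The answer leaves interpretEscape() unspecified, so here is one possible sketch of it together with the stripping loop in runnable form. The signature is inferred from the call site; the names HtmlStripper, strip and decode are hypothetical. It covers only a handful of named entities plus numeric escapes. On a Java ME runtime that lacks StringBuilder, StringBuffer should be usable in the same way.

```java
class HtmlStripper {
    /** Removes tags and decodes a few common escapes. No regular expressions. */
    static String strip(String s) {
        char[] cs = s.toCharArray();
        StringBuilder sb = new StringBuilder();
        boolean tag = false;  // true while between '<' and '>'
        for (int i = 0; i < cs.length; i++) {
            switch (cs[i]) {
                case '<':
                    tag = true;
                    break;
                case '>':
                    if (tag) tag = false;
                    else sb.append('>');
                    break;
                case '&':
                    if (!tag) i += interpretEscape(cs, i, sb);
                    break;
                default:
                    if (!tag) sb.append(cs[i]);
            }
        }
        return sb.toString();
    }

    /**
     * cs[start] is '&'. Appends the decoded character (or a literal '&' if
     * this is not a recognised escape) and returns how many extra characters
     * the caller should skip.
     */
    static int interpretEscape(char[] cs, int start, StringBuilder sb) {
        int end = -1;
        // Look for the closing ';' within a short window.
        for (int j = start + 1; j < cs.length && j <= start + 10; j++) {
            if (cs[j] == ';') { end = j; break; }
        }
        String decoded = (end == -1)
                ? null
                : decode(new String(cs, start + 1, end - start - 1));
        if (decoded == null) {
            sb.append('&');  // not an escape: keep the ampersand as text
            return 0;
        }
        sb.append(decoded);
        return end - start;  // the loop's i++ then steps past the ';'
    }

    /** Returns the text for an entity name, or null if it is unknown. */
    static String decode(String name) {
        if (name.equals("amp"))  return "&";
        if (name.equals("lt"))   return "<";
        if (name.equals("gt"))   return ">";
        if (name.equals("quot")) return "\"";
        if (name.equals("apos")) return "'";
        if (name.equals("nbsp")) return "\u00A0";
        if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1));
                return new String(Character.toChars(code));
            } catch (RuntimeException e) {
                return null;  // malformed number or invalid code point
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "Harry Stamper (<a href=\"http://example.com/role\">Bruce Willis</a>)"
            + " &amp; crew"));
    }
}
```

Like the loop in the answer, this treats everything between a '<' and the next '>' as a tag, so it will misbehave on input such as attribute values containing '>', comments, or script blocks.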
