How to get text inside html / text content?

Question


Hi everyone, I have HTML text something like this:

<html><head><style type="text/css">
</style></head>
<body><div style="font-family:times new roman,new york,times,serif;font-size:14pt">first text<br><div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 14pt;"><br><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold;">one:</span></b> second text<br><b><span style="font-weight: bold;">two:</span></b> third text<br><b><span style="font-weight: bold;">three:</span></b> fourth text<br><b><span style="font-weight: bold;">five:</span></b> fifth text<br></font><br>

I want to extract the text "first text" from the HTML above. Note: this HTML content is not static but dynamic, so the general idea is to get the first plain text in the HTML.

+3

java html-parsing jsoup

Mahmoud saleh Feb 10 '11 at 15:43


3 answers

You can use a SAX-style HTML parser such as TagSoup.

Extend DefaultHandler and override its characters(...) method to receive the text.

See http://sax.sourceforge.net/quickstart.html for a quick-start example.
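As a rough illustration of the DefaultHandler / characters(...) idea, here is a minimal sketch. It makes several assumptions that are not in the answer: it uses the JDK's built-in SAX parser instead of TagSoup, so the input has to be well-formed XHTML (a simplified version of the question's markup with closed br tags), and the names FirstTextSax, FirstTextHandler and extractFirstText are made up for the example. TagSoup exposes a SAX interface, so the same handler should plug in there for real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class FirstTextSax {

    // Buffers character data and remembers the first non-blank run of text inside <body>.
    static class FirstTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inBody;   // becomes true once <body> has started
        private String firstText; // stays null until non-blank text is seen

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver one run of text in several chunks, so buffer them.
            if (inBody && firstText == null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
            if ("body".equalsIgnoreCase(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        // Any tag boundary ends the current run of text.
        private void flush() {
            String text = buffer.toString().trim();
            buffer.setLength(0);
            if (firstText == null && !text.isEmpty()) {
                firstText = text;
            }
        }
    }

    // Returns the first non-blank text inside <body>, or null if there is none.
    static String extractFirstText(String xhtml) {
        try {
            FirstTextHandler handler = new FirstTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.firstText;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // Assumed, simplified and well-formed variant of the markup from the question.
        String xhtml = "<html><head><style type=\"text/css\"></style></head>"
                + "<body><div>first text<br/><div><br/></div>"
                + "<b>one:</b> second text<br/></div></body></html>";
        System.out.println(extractFirstText(xhtml));
    }
}
```

With the sample input above this prints first text. Text is buffered because SAX does not guarantee that one run of text arrives in a single characters(...) call.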

+1

Edwin Buck Feb 10 '11 at 15:59

If you need something fairly simple, look at my PageScraper class, which was designed for use on Java ME platforms and should therefore work pretty much anywhere. It's nothing fancy, just an easy way to split a text stream into tags and non-tags. It loads attributes lazily, so it's pretty fast if you mostly ignore tags.

0

Eric giguere Feb 10 '11 at 15:54


Balusc · Accepted Answer · 2011-02-10T21:00:36+0000

You tagged the question jsoup, so you're using Jsoup. This is a good choice ;)

Here's how you can do it with Jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><style type=\"text/css\"></style></head><body><div style=\"font-family:times new roman,new york,times,serif;font-size:14pt\">first text<br><div><br></div><div style=\"font-family: times new roman,new york,times,serif; font-size: 14pt;\"><br><div style=\"font-family: times new roman,new york,times,serif; font-size: 12pt;\"><font size=\"2\" face=\"Tahoma\"><hr size=\"1\"><b><span style=\"font-weight: bold;\">one:</span></b> second text<br><b><span style=\"font-weight: bold;\">two:</span></b> third text<br><b><span style=\"font-weight: bold;\">three:</span></b> fourth text<br><b><span style=\"font-weight: bold;\">five:</span></b> fifth text<br></font><br>";
Document document = Jsoup.parse(html);
// :containsOwn(text) matches elements whose own text contains the literal string "text"
String firstText = document.select(":containsOwn(text)").first().ownText();
System.out.println(firstText);

Result:

first text
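Since the question says the content is dynamic, the first text may not always contain the word "text". One possible way to handle that, sketched here under assumptions, is to walk the parsed tree in document order and return the first non-blank text node. The class name FirstTextJsoup, the helper firstText and the shortened sample markup are made up for this example; it needs jsoup on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, not part of jsoup.
class FirstTextJsoup {

    // Depth-first search in document order; returns the first non-blank text, or null.
    static String firstText(Node node) {
        if (node instanceof TextNode) {
            String text = ((TextNode) node).text().trim();
            return text.isEmpty() ? null : text;
        }
        for (Node child : node.childNodes()) {
            String found = firstText(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed, shortened variant of the markup from the question.
        String html = "<body><div>first text<br><div><br></div>"
                + "<b>one:</b> second text<br></div>";
        // Start at <body> so that <title> text in the head is not picked up.
        System.out.println(firstText(Jsoup.parse(html).body()));
    }
}
```

For the sample markup this prints first text. The helper returns null when the body has no non-blank text, so callers should check for that.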
