Cut HTML from web page and calculate word frequency?

Question

In Groovy, how can I fetch a web page and strip out the HTML tags, etc., leaving only the text of the document? I would like the results dumped into a collection so I can build a word frequency counter.

Lastly, let me say again that I would like to do this in Groovy.

+4

java html groovy html-content-extraction text-extraction

anon Oct 16 '08 at 4:02

3 answers

If you need a collection of tokenized words from HTML, can't you just parse it as XML (it must be well-formed XML) and capture all the text between the tags? How about something like this:

    def records = new XmlSlurper().parseText(YOURHTMLSTRING)
    def allNodes = records.depthFirst().collect{ it }
    def list = []
    allNodes.each {
        it.text().tokenize().each {
            list << it
        }
    }
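Once the tokens are in a list, the frequency count itself is short. The following is a minimal sketch, assuming Groovy 1.8 or later for `countBy`; the `wordFrequency` helper and the sample sentence are made up for illustration, and the regular expression only treats ASCII letters, digits and apostrophes as word characters. One caveat about the snippet above: `text()` on a parent element also includes the text of its descendants, so walking every node may count the same word more than once.

```groovy
// Hypothetical helper: counts how often each word occurs in a piece of text.
Map<String, Integer> wordFrequency(String text) {
    // Lower-case, split on runs of non-word characters, drop empty strings
    def words = text.toLowerCase().split(/[^a-z0-9']+/).findAll { it }
    // countBy groups equal words and counts each group;
    // sort orders the entries by descending count
    words.countBy { it }.sort { -it.value }
}

// Sample input standing in for the extracted page text
def frequency = wordFrequency('The cat sat. The dog sat, too!')
frequency.each { word, count -> println "$word: $count" }
```

Lower-casing before counting treats "The" and "the" as the same word; drop that step if case matters.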

+1

mbrevoort Oct 16 '08 at 16:08

You can use the Lynx web browser to dump the text of a document and save it.

Do you want this to be automated? Should it be a standalone application, or do you want to embed it in your own application? Which platforms (Windows desktop, web server, etc.) does it need to run on?
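If the automated route is what you are after, Lynx can be driven from a Groovy script. This is only a sketch, assuming `lynx` is installed and on the PATH; `lynxCommand` and `pageText` are hypothetical helper names, and the URL in the usage comment is a placeholder.

```groovy
// Build the Lynx command line: -dump prints the rendered page as plain text,
// -nolist leaves out the numbered list of links at the end of the dump.
List<String> lynxCommand(String url) {
    ['lynx', '-dump', '-nolist', url]
}

// Run Lynx and return what it wrote to standard output.
// Assumes the lynx executable can be found on the PATH.
String pageText(String url) {
    def process = lynxCommand(url).execute()
    def text = process.text   // reads stdout until the stream is closed
    process.waitFor()
    text
}

// Usage (placeholder URL):
// def words = pageText('http://example.com/').tokenize()
```

Passing the command as a list avoids quoting problems if the URL contains characters the shell would otherwise interpret.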

0

moogs Oct 16 '08 at 4:12

Jay · Accepted Answer · 2008-10-16T04:35:58+0000

Assuming you want to do this with Groovy (guessing based on the groovy tag), your options are likely to be either heavily shell-script oriented or based on Java libraries. For the shell-scripting route, I agree with moogs: using Lynx or Elinks is probably the easiest way to do it. Otherwise, look at HTMLParser, and see "Processing each word in a File" (scroll down to find the relevant code fragment).

You are probably stuck looking for Java libraries to use with Groovy for parsing HTML, since there don't appear to be any Groovy libraries for it. If you are not using Groovy, please post the language you want, as there are many HTML-to-text tools, depending on which language you work in.
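As an illustration of the Java-library route, here is a sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html`) rather than HTMLParser, only because it needs no extra download. `TextCollector` and `htmlToText` are names invented for this example, and the URL in the usage comment is a placeholder.

```groovy
import javax.swing.text.html.HTMLEditorKit
import javax.swing.text.html.parser.ParserDelegator

// Collects the text runs the parser reports and ignores the tags themselves
class TextCollector extends HTMLEditorKit.ParserCallback {
    final StringBuilder text = new StringBuilder()

    @Override
    void handleText(char[] data, int pos) {
        // Append a space so adjacent text runs do not fuse into one word
        text.append(new String(data)).append(' ')
    }
}

// Hypothetical helper: returns the text content of an HTML string
String htmlToText(String html) {
    def collector = new TextCollector()
    // The third argument tells the parser to ignore charset declarations
    new ParserDelegator().parse(new StringReader(html), collector, true)
    collector.text.toString().trim()
}

// Usage (placeholder URL; URL.text is Groovy's shortcut for reading the page):
// def words = htmlToText(new URL('http://example.com/').text).tokenize()
```

Unlike an XML parser, this one does not require the page to be well-formed, which matters for most real-world HTML.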
