Generic Lisp package for parsing invalid HTML?

Question

As a training exercise, I am writing a web scraper in Common Lisp. (Rough) plan:

1. Use Quicklisp to manage dependencies.
2. Use Drakma to load pages.
3. Parse the pages with xmls.

I've just run into a problem: the website I am scraping doesn't always produce valid XHTML. This means that step 3 (parsing the pages with xmls) does not work. And I really don't want to resort to regular expressions for this :-)

So, can anyone recommend a generic Lisp package for parsing invalid XHTML? I imagine something similar to the HTML Agility Pack for .NET ...

+3

web scraping common-lisp quicklisp

Duncan Bayne Jan 05 '11 at 0:46

3 answers

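For future visitors: today we have Plump: https://shinmera.imtqy.com/plump

Plump is a parser for HTML/XML-like documents that focuses on being lenient towards invalid markup. It can handle things like invalid attributes, bad closing tag order, unencoded entities, nonexistent tag types, self-closing tags, and so on. It parses documents to a class representation and offers a small set of DOM functions to manipulate it. You are free to change it to parse to your own classes, though.

There are also libraries for querying the parsed document, such as lquery (jQuery-like) and CLSS (simple CSS selectors), by the same author.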
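For readers who want to try this, here is a minimal sketch of parsing a malformed fragment with Plump and collecting link targets. It assumes Quicklisp is installed and loaded; the function name `link-targets` and the sample markup are made up for illustration and are not part of Plump.

```lisp
;; Assumption: Quicklisp is available in the running image.
(ql:quickload :plump)

(defun link-targets (html)
  "Return the href attribute of every <a> element found in HTML, a string.
LINK-TARGETS is a hypothetical helper, not a Plump function."
  (let ((root (plump:parse html)))
    ;; MAP works whether the result is a list or a vector.
    (map 'list
         (lambda (a) (plump:attribute a "href"))
         (plump:get-elements-by-tag-name root "a"))))

;; Neither the <p> nor the <a> tag is closed here.
(link-targets "<p>Hi <a href=\"/next\">link")
```

To scrape a real page, the string returned by `(drakma:http-request url)` could be passed to `link-targets` in place of the literal.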

There is also a short tutorial in the Common Lisp Cookbook: https://lispcookbook.imtqy.com/cl-cookbook/web-scraping.html

See also the Common Lisp wiki: http://www.cliki.net/Web

+2

Ehvince 18 . '16 22:14

Duncan, so far I have successfully used Clozure Common Lisp under Ubuntu Linux and Windows (7 and XP), so if you are looking for an implementation that will work anywhere, you can try this.

0

Razvanp Apr 13 '11 at 14:55

Xach · Accepted Answer · 2011-01-05T01:11:36+0000

The closure-html project (available in Quicklisp) will recover from bogus HTML and produce something you can work with. I use closure-html along with CXML to process arbitrary web pages, and it works well. http://common-lisp.net/project/closure/closure-html/
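As a rough illustration of that approach, the sketch below feeds a malformed fragment to closure-html and gets back a plain list structure (LHTML). It assumes Quicklisp is installed and loaded; `parse-sloppy-html` is a hypothetical helper name, not part of the library.

```lisp
;; Assumption: Quicklisp is available in the running image.
(ql:quickload :closure-html)

(defun parse-sloppy-html (html)
  "Parse HTML, a possibly malformed string, into an LHTML list structure.
PARSE-SLOPPY-HTML is a hypothetical helper, not a closure-html function."
  (chtml:parse html (chtml:make-lhtml-builder)))

;; The <p> and <b> tags are never closed; the parser repairs the tree
;; and wraps it in a full document with HTML, HEAD and BODY elements.
(parse-sloppy-html "<p>unclosed <b>bold")
```

To work with CXML as the answer describes, `(cxml-dom:make-dom-builder)` can be passed as the second argument instead (with the `:cxml` system loaded), which yields a DOM document rather than a list. The input string could come from `(drakma:http-request url)`.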
