Common Lisp package for parsing invalid HTML?

As a training exercise, I am writing a web scraper in Common Lisp. (Rough) plan:

I just discovered that the website I am scraping doesn't always produce valid XHTML. This means that step 3 (parse pages with xmls) does not work. And I'm not about to use regular expressions like this guy :-)

So, can anyone recommend a Common Lisp package for parsing invalid XHTML? I imagine something similar to the HTML Agility Pack for .NET...

+3
3 answers

The closure-html project (available in Quicklisp) will recover from bogus HTML and produce something you can work with. I use closure-html together with CXML to process arbitrary web pages, and it works well. http://common-lisp.net/project/closure/closure-html/
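A minimal sketch of that approach, assuming Quicklisp is installed and the forms are evaluated at the REPL or loaded as source (not compiled as a file, since the package only exists after the quickload). The sample string and the helper `collect-text` are hypothetical, made up for illustration. It parses into LHTML, closure-html's plain list representation, rather than a CXML DOM, to keep the example short:

```lisp
;; Load closure-html through Quicklisp.
(ql:quickload :closure-html)

;; Hypothetical sample input: unclosed <p> and <b>, no <html> or <body>.
(defparameter *broken-html* "<p>First<p>Second <b>bold")

;; Parse into LHTML: each element is a list (tag attributes child ...).
(defparameter *tree*
  (chtml:parse *broken-html* (chtml:make-lhtml-builder)))

;; Hypothetical helper: walk the LHTML tree and collect every text
;; string in document order. Children start at the third list element.
(defun collect-text (node)
  (cond ((stringp node) (list node))
        ((consp node) (mapcan #'collect-text (cddr node)))
        (t nil)))

(print *tree*)
(print (collect-text *tree*))
```

To get a DOM to use with CXML instead, the builder argument can be swapped for `(cxml-dom:make-dom-builder)`, which requires loading CXML as well.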

+10

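For future visitors: today we have Plump: https://shinmera.imtqy.com/plump

Plump is a parser for HTML/XML-like documents that is lenient towards invalid markup. It copes with things like invalid attributes, bad closing tag order, unencoded entities, nonexistent tag types, self-closing tags and so on. It parses documents into a class representation and offers a small set of DOM functions to manipulate it.

There are companion libraries for working with the parsed result, such as lquery (jQuery-like) and CLSS (CSS selectors).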
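As a rough sketch of how Plump and CLSS fit together, assuming Quicklisp is installed and the forms are evaluated at the REPL or loaded as source. The sample markup is hypothetical: it leaves a `<p>` and an `<a>` unclosed before the closing `</div>`.

```lisp
;; Load Plump (parser) and CLSS (CSS selectors) through Quicklisp.
(ql:quickload '(:plump :clss))

;; Hypothetical sample input with a bad closing tag order.
(defparameter *page*
  "<div><a href=\"/one\">One</a><p>unclosed <a href=\"/two\">Two</div>")

;; plump:parse returns the root node of the parsed document.
(defparameter *root* (plump:parse *page*))

;; clss:select returns a vector of the nodes matching the selector.
(defparameter *links* (clss:select "a" *root*))

;; Print the text and href attribute of every link that was found.
(loop for link across *links*
      do (format t "~a -> ~a~%"
                 (plump:text link)
                 (plump:attribute link "href")))
```

lquery offers a more compact, jQuery-style syntax over the same nodes, so which of the two to use is mostly a matter of taste.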

Common Lisp Cookbook: https://lispcookbook.imtqy.com/cl-cookbook/web-scraping.html

See also the Common Lisp wiki: http://www.cliki.net/Web

+2

Duncan, so far I have successfully used Clozure Common Lisp under Ubuntu Linux and Windows (7 and XP), so if you are looking for an implementation that will work anywhere, you can try this.

0

Source: https://habr.com/ru/post/1783530/

