Node.js module for extracting web page content?

Can someone recommend a Node.js module or a JavaScript library (not based on Readability) that can be used to extract content from web pages and RSS feeds?

I found a good PHP library that can do the job - http://fivefilters.org/content-only/ - but I am looking for a Node.js module that does the same thing.

Thanks!

+4
3 answers

I wrote a Node.js module just for this purpose, called "unfluff":

https://github.com/ageitgey/node-unfluff

Hope this solves your problem.

Unfluff is based on "python-goose", which is itself based on "goose" (Scala).
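For anyone who wants to see roughly what using it looks like, here is a minimal sketch. It assumes the package is installed with `npm install unfluff` and that the HTML has already been fetched into a string. `extractArticle` is a hypothetical helper name, not part of the library, and the extractor is loaded lazily through a default parameter so that a different extractor function can be swapped in.

```javascript
// Sketch only: wraps unfluff's extractor(html, language) call.
// extractArticle is a hypothetical helper, not part of unfluff itself.
function extractArticle(html, language = 'en', extractor = require('unfluff')) {
  // unfluff takes the raw HTML string and an optional language code
  // and returns an object describing the page.
  const data = extractor(html, language);

  // Keep only the two fields most people want: the title and the main text.
  return { title: data.title, text: data.text };
}
```

Calling `extractArticle(htmlString)` should return the page title and the cleaned body text. The object returned by unfluff carries more fields than the two kept here, so check the project README for the full list.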

+12

extract-main-text can also extract content from HTML. In my case, node-unfluff was not stable for Japanese (and possibly other CJK) content.

+1

Source: https://habr.com/ru/post/1532875/

