How to implement an HTML page cleaner similar to Arc90's Readability or Instapaper?

I would like to learn how to clean up an HTML page and present it nicely: strip out the clutter and reformat the main text into a highly readable layout, as http://lab.arc90.com/experiments/readability or Instapaper do.

Is it simply a matter of parsing the page and removing the elements that are not included in [...]?

Has this been discussed elsewhere?

+3
3 answers

https://github.com/jiminoc/goose/wiki does something like what you are asking for. The source code is openly available, along with unit tests.
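
To give a rough idea of what a content extractor has to do, here is a minimal sketch using only the Python standard library. It is not Goose's or Readability's actual algorithm: the list of clutter tags, the length threshold and the link-density limit are all assumptions chosen for illustration, and `MainTextExtractor` and `extract_main_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collects the text of <p> elements that sit outside clutter tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0       # > 0 while inside a clutter tag
        self.in_paragraph = False
        self.in_link = False
        self.text_parts = []      # text chunks of the current paragraph
        self.link_chars = 0       # characters of link text in the current paragraph
        self.paragraphs = []      # (text, link_density) tuples

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self._flush()         # a new <p> implicitly closes the previous one
            self.in_paragraph = True
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.text_parts.append(data)
            if self.in_link:
                self.link_chars += len(" ".join(data.split()))

    def _flush(self):
        # Normalize whitespace and record the paragraph with its link density.
        text = " ".join("".join(self.text_parts).split())
        if text:
            self.paragraphs.append((text, self.link_chars / len(text)))
        self.text_parts, self.link_chars, self.in_paragraph = [], 0, False


def extract_main_text(html, min_length=40, max_link_density=0.5):
    """Return paragraphs that are long enough and not mostly link text."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()               # pick up a trailing unclosed <p>
    return [text for text, density in parser.paragraphs
            if len(text) >= min_length and density <= max_link_density]
```

Calling `extract_main_text(page_source)` returns a list of paragraph strings that can then be placed into a clean template. Real extractors score whole DOM subtrees rather than individual paragraphs, so treat this only as a starting point.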

+4

Readability provides an API (http://www.readability.com/publishers/api).

A request to the parser endpoint looks like this:

https://www.readability.com/api/content/v1/parser?url={URL}&token={API token}

Example response:

HTTP/1.0 200 OK

{
    "domain": "blog.readability.com",
    "": " ",
    "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
    "short_url": "http://rdd.me/kbgr5a1k",
    "title": "Step Up & Be Heard: Readability Ideas",
    "total_pages": 1,
    "word_count": 175,
    "content": "<div>\n  \n<div class=\"entry\">\n\t<p>When we launched Readability [snip] ...</div>\n</div>",
    "date_published": "2011-02-22 00:00:00",
    "next_page_id": null,
    "rendered_pages": 1
}

Also check out the Node.js, Ruby or Python versions, e.g. http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html
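
As a sketch of how such a request could be put together in Python, the snippet below builds the query string for the parser endpoint quoted above and picks a few fields out of a JSON response. The endpoint and the `url` and `token` parameters come from this answer and the service may no longer be available. The helper names `build_parser_url` and `summarize_response` are hypothetical.

```python
import json
from urllib.parse import urlencode

# Endpoint as quoted in the answer; it may no longer be reachable.
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(article_url, token):
    """Return the request URL, with the article URL and token percent-encoded."""
    return PARSER_ENDPOINT + "?" + urlencode({"url": article_url, "token": token})


def summarize_response(body):
    """Pick a few fields out of a JSON response body (a str or bytes)."""
    data = json.loads(body)
    return {
        "title": data.get("title"),
        "word_count": data.get("word_count"),
        "content": data.get("content"),  # cleaned-up HTML of the article
    }
```

The resulting URL could be fetched with `urllib.request.urlopen` and the response body passed to `summarize_response`. Encoding the article URL matters, because it usually contains `:` and `/` characters of its own.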

+7

If the web page or website in question makes good use of semantic elements and structure, you can simply apply a different CSS stylesheet, which can radically change the layout and rendering.
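
A minimal sketch of that idea in Python: remove the page's own stylesheets and inject a reading-oriented one. The regular expressions are a deliberate simplification (a real tool should use an HTML parser), inline `style` attributes are left untouched, and both `READING_CSS` and `restyle` are hypothetical names.

```python
import re

# Simplified patterns; regexes are not a robust way to process arbitrary HTML.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
REL_STYLESHEET = re.compile(r"""\brel\s*=\s*["']?[^"'>]*\bstylesheet\b""", re.IGNORECASE)
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)

# An example reading stylesheet (the values are arbitrary).
READING_CSS = "body { max-width: 40em; margin: 2em auto; font: 1.1em/1.6 Georgia, serif; }"


def restyle(html, css=READING_CSS):
    """Drop <style> blocks and stylesheet <link> tags, then inject our own CSS."""
    html = STYLE_BLOCK.sub("", html)
    # Keep <link> tags that are not stylesheets (icons, feeds, ...).
    html = LINK_TAG.sub(
        lambda m: "" if REL_STYLESHEET.search(m.group(0)) else m.group(0), html
    )
    style = "<style>" + css + "</style>"
    if HEAD_CLOSE.search(html):
        # Insert just before the first </head>.
        return HEAD_CLOSE.sub(lambda m: style + m.group(0), html, count=1)
    return style + html
```

This only works well when the markup is already semantic. On pages laid out with nested tables or presentational wrappers, swapping the stylesheet will not remove the clutter itself.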

+1

Source: https://habr.com/ru/post/1765046/

