I am interested in learning how to clean up an HTML page and present it nicely: strip out all the clutter and reformat the main text into a very readable form, the way http://lab.arc90.com/experiments/readability or Instapaper do.
Is this just a simple analysis of the page that removes the elements that are not part of the main text?
Has this been discussed elsewhere?
https://github.com/jiminoc/goose/wiki does something like what you ask. The source code is openly available, along with unit tests.
Readability has an API (http://www.readability.com/publishers/api). A request looks like this:
https://www.readability.com/api/content/v1/parser?url={URL}&token={token}
Response:
HTTP/1.0 200 OK
{
    "domain": "blog.readability.com",
    "": " ",
    "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
    "short_url": "http://rdd.me/kbgr5a1k",
    "title": "Step Up & Be Heard: Readability Ideas",
    "total_pages": 1,
    "word_count": 175,
    "content": "<div>\n \n<div class=\"entry\">\n\t<p>When we launched Readability [snip] ...</div>\n</div>",
    "date_published": "2011-02-22 00:00:00",
    "next_page_id": null,
    "rendered_pages": 1
}
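The request and response above can be handled from Python roughly as follows. This is only a sketch: the helper names build_parser_url and summarize_response are made up for illustration, the sample body is a trimmed stand-in for the response shown above, and no network call is made, so it does not check whether the service still answers at that address.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the request URL quoted above.
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(article_url, token):
    """Return the request URL with both query parameters percent-encoded."""
    query = urlencode({"url": article_url, "token": token})
    return f"{PARSER_ENDPOINT}?{query}"


def summarize_response(body):
    """Pick a few fields out of a JSON body shaped like the response above."""
    data = json.loads(body)
    return {
        "title": data["title"],
        "word_count": data["word_count"],
        "content": data["content"],  # cleaned-up HTML of the article
    }


# Trimmed stand-in for the response body (hypothetical, for illustration only).
sample_body = json.dumps({
    "title": "Step Up & Be Heard: Readability Ideas",
    "word_count": 175,
    "content": "<div><p>When we launched Readability ...</p></div>",
})

if __name__ == "__main__":
    print(build_parser_url("http://example.com/post?id=1", "TOKEN"))
    print(summarize_response(sample_body)["title"])
```

Encoding the article URL matters because it usually contains characters such as ? and & that would otherwise be read as part of the API's own query string.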
Alternatively, check out the server-side options for Node.js, Ruby, or Python: http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html
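For a feel of what doing it yourself involves, here is a deliberately naive sketch using only the Python standard library. It is not the Readability algorithm and not the code from the linked post; it assumes that main text lives in p elements outside navigation-like containers and that very short paragraphs are noise. The names SKIP_TAGS, ParagraphCollector, extract_main_text and the min_chars threshold are all invented for this example.

```python
from html.parser import HTMLParser

# Containers whose contents are assumed to be clutter (an assumption, not a rule).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphCollector(HTMLParser):
    """Collects the text of <p> elements that are not inside a skipped container."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # how many skipped containers we are inside
        self.in_paragraph = False
        self.buffer = []
        self.paragraphs = []

    def _flush(self):
        # Collapse runs of whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.paragraphs.append(text)
        self.buffer = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            if self.in_paragraph:  # previous <p> was never closed
                self._flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self._flush()

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.buffer.append(data)


def extract_main_text(html, min_chars=40):
    """Return the paragraphs that look like body text, in document order."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if parser.in_paragraph:        # document ended inside an unclosed <p>
        parser._flush()
    return [p for p in parser.paragraphs if len(p) >= min_chars]
```

Real pages defeat this quickly (text in div elements, comment threads made of long paragraphs), which is why the tools mentioned above score candidate blocks instead of applying a fixed filter.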
If the web page or website in question makes good use of semantic elements and structure, you can simply apply a different CSS stylesheet, which can radically change the layout and the way the page renders.
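A rough sketch of that stylesheet swap, assuming you already have the page's HTML as a string: drop the page's own style blocks and stylesheet links, then inject a reading-friendly stylesheet. The function name restyle and the rules in READABLE_CSS are illustrative choices, and the regular expressions only cope with reasonably tidy markup; a real HTML parser would be more robust.

```python
import re

# An illustrative reading stylesheet; the specific rules are arbitrary choices.
READABLE_CSS = """
body { max-width: 40em; margin: 2em auto; font: 1.1em/1.6 Georgia, serif; color: #222; }
nav, aside, footer { display: none; }
"""

STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
REL_STYLESHEET = re.compile(r"""rel\s*=\s*["']?[^"'>]*\bstylesheet\b""", re.IGNORECASE)
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def restyle(html, css=READABLE_CSS):
    """Remove the page's own styles and inject `css` in their place."""
    # Drop inline <style> blocks.
    html = STYLE_BLOCK.sub("", html)
    # Drop <link> tags whose rel attribute mentions "stylesheet"; keep the rest.
    html = LINK_TAG.sub(
        lambda m: "" if REL_STYLESHEET.search(m.group(0)) else m.group(0), html
    )
    style = f"<style>{css}</style>"
    # Insert before </head> when there is one, otherwise prepend to the document.
    if HEAD_CLOSE.search(html):
        return HEAD_CLOSE.sub(lambda m: style + m.group(0), html, count=1)
    return style + html
```

This only helps when the markup is semantic to begin with: the display: none rule for nav, aside and footer hides clutter only if the page actually uses those elements.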
Source: https://habr.com/ru/post/1765046/