Get only the relevant part of the website

How do the Evernote Web Clipper plugin or the Announce plugin extract only the relevant article / post / content portion of a page? Here is a screenshot from the Evernote plugin:

[screenshot]

No matter which website you visit, and however different its layout is from the others, they always manage to pick out the article / post / content on the page.

Each website has a different layout: some have a sidebar, some do not, and they use different tags for the main / article / content part. Some use <article>, <section> or other HTML5 tags, some use <h1> > <p>, some use <h2> > <p>, and others use none of these at all. So there are many combinations of tags as well as site layouts.

Can someone suggest a way to get the main article / post / content via JavaScript or PHP?

+6
2 answers

You can do a simple DOM analysis and look for the <div> and <p> elements that contain the most text (text, not HTML markup!). However, whatever clever method you choose to work out where the content is, you have to start by parsing the DOM, so take a look at the DOM parsing libraries for PHP.
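As a rough illustration of the "most text" idea, here is a minimal JavaScript sketch. It is not how Evernote or any other plugin actually works; it is just one simple heuristic. The function name findMainContent is made up for this example, and the scoring rule (total text length of an element's direct <p> children) is an assumption chosen for brevity.

```javascript
// Heuristic sketch: return the element whose direct <p> children
// contain the most text (text length, not HTML length).
// Works on any node that exposes children, tagName and textContent.
function findMainContent(root) {
  let best = null;
  let bestScore = 0;
  const stack = [root];
  while (stack.length > 0) {
    const el = stack.pop();
    let score = 0;
    for (const child of el.children) {
      // In HTML documents tagName is reported in upper case.
      if (child.tagName === 'P') {
        score += child.textContent.trim().length;
      }
      stack.push(child);
    }
    if (score > bestScore) {
      best = el;
      bestScore = score;
    }
  }
  // null when the tree contains no non-empty <p> at all.
  return best;
}
```

In a browser you would call findMainContent(document.body). Real pages usually need extra rules on top of this, for example penalising containers that are mostly links.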

Anyway, you can start with this:

http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/

It looks good and gives technical explanations if you want to write something of your own.

+7

Most blog engines give the main div the id "content".

  • In JavaScript (with jQuery) you would just do $('#content')
  • In PHP, you would call $doc->getElementById('content') on a DOMDocument.
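Since not every site uses that id, one option is to try a few selectors in order. The sketch below uses the standard querySelector method; the function name findBySelector and the fallback selectors 'article' and 'main' are assumptions for illustration, and only '#content' comes from the suggestion above.

```javascript
// Try candidate selectors in order; return the first match or null.
// Only '#content' is from the answer; the others are assumed fallbacks.
const CANDIDATE_SELECTORS = ['#content', 'article', 'main'];

function findBySelector(doc, selectors = CANDIDATE_SELECTORS) {
  for (const selector of selectors) {
    const el = doc.querySelector(selector);
    if (el) {
      return el;
    }
  }
  return null;
}
```

In a browser this would be called as findBySelector(document).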
0

Source: https://habr.com/ru/post/908021/

