Intelligently capture the first paragraph / opening text

I would like to have a script where I can enter a URL and it will intelligently grab the first paragraph of the article. I'm not sure where to start, other than just pulling text from <p> tags. Do you know of any tips or tutorials on how to do this?

Update

To clarify: I am building a section of my site where users can submit links. As on Facebook, it should capture an image from the linked site, along with some text to go with the link. I am using PHP and trying to determine the best way to do this.

I say “intelligently” because I would like to get the content that matters on the page: not just the first paragraph in the markup, but the first paragraph of the most important content.

3 answers

If the page you want to capture is external, or even if it is local but you do not know its structure in advance, I would say the best way to achieve this is with PHP's DOM functions.

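function get_first_paragraph($url)
{
  /* Fetch the page; file_get_contents() returns false on failure */
  $page = file_get_contents($url);
  if ($page === false || $page === '') {
    return null;
  }
  $doc = new DOMDocument();
  /* Real-world HTML is rarely valid, so silence the parser warnings */
  libxml_use_internal_errors(true);
  $doc->loadHTML($page);
  libxml_clear_errors();
  /* Gets all the paragraphs */
  $p = $doc->getElementsByTagName('p');
  /* Extracts the first one; item() returns null if there is none */
  $p = $p->item(0);
  /* Returns the paragraph content, or null if the page has no <p> */
  return $p === null ? null : $p->textContent;
}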
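
The first <p> on a page is often a menu item, a byline or a cookie notice rather than the article itself. One rough way to get closer to "the first paragraph of the important content" is to skip paragraphs that are too short to be body text. The sketch below is only a heuristic under that assumption: the function name first_substantial_paragraph and the 80-byte threshold are made up for illustration, and it takes the HTML as a string so it can be tried without a network request.

```php
<?php
// Hypothetical helper: returns the first <p> whose visible text is at least
// $minLength bytes long, or null when no paragraph qualifies.
function first_substantial_paragraph(string $html, int $minLength = 80): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('p') as $p) {
        // Collapse runs of whitespace so the length check reflects visible text.
        $text = trim((string) preg_replace('/\s+/', ' ', $p->textContent));
        if (strlen($text) >= $minLength) {
            return $text;
        }
    }
    return null;
}
```

It could be called as first_substantial_paragraph(file_get_contents($url)). The threshold is a guess and will need tuning; pages with very short opening paragraphs will be missed.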

Short answer: you cannot.

For a PHP script to “intelligently” retrieve the “most important” content from a page, the script would have to understand the content of the page. PHP is not a natural language processor, and natural language processing is not a trivial area of study. There may be some NLP tools for PHP, but I still doubt it will be easy.

... HTML ... hAtom ...
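
For context on the hAtom mention: hAtom is a microformat in which a page labels its own entries with class names such as hentry, entry-summary and entry-content. Where a site uses it, a script can read the author's own idea of the main content instead of guessing. The following is a minimal sketch assuming such markup is present (on many sites it is not); the function name hatom_excerpt is hypothetical.

```php
<?php
// Hypothetical helper: returns the entry summary if the page marks one,
// otherwise the first paragraph inside the marked entry content, else null.
function hatom_excerpt(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    // Match a whole class token, so "entry-summary-wide" does not count.
    $hasClass = function (string $name): string {
        return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
    };
    $queries = [
        '//*[' . $hasClass('entry-summary') . ']',
        '//*[' . $hasClass('entry-content') . ']//p',
        '//*[' . $hasClass('entry-content') . ']',
    ];
    foreach ($queries as $query) {
        foreach ($xpath->query($query) as $node) {
            $text = trim((string) preg_replace('/\s+/', ' ', $node->textContent));
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null;
}
```

When the function returns null, the page has no usable hAtom markup and a fallback such as plain <p> extraction is still needed.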


Python script ...

Of course, this method has its limitations, and no method will work on 100% of web pages. This is just one approach, and there are many other ways you could do this. You can also look at similar past questions on this.
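
One of those other ways, given that the goal is a Facebook-style link preview with text and an image, is to read the summary the page publishes about itself. Many sites include Open Graph meta tags (og:description, og:image) or a plain meta description in their <head>. The sketch below assumes such tags exist and returns null values when they do not; the function name page_preview and the shape of the returned array are hypothetical.

```php
<?php
// Hypothetical helper: returns ['text' => ?string, 'image' => ?string]
// taken from the page's own meta tags.
function page_preview(string $html): array
{
    $preview = ['text' => null, 'image' => null];
    if (trim($html) === '') {
        return $preview; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    // Queries are tried in order; the first non-empty value wins.
    $candidates = [
        'text' => [
            '//meta[@property="og:description"]/@content',
            '//meta[@name="description"]/@content',
        ],
        'image' => [
            '//meta[@property="og:image"]/@content',
        ],
    ];
    foreach ($candidates as $key => $queries) {
        foreach ($queries as $query) {
            // string() yields '' when the node set is empty.
            $value = trim($xpath->evaluate("string($query)"));
            if ($value !== '') {
                $preview[$key] = $value;
                break;
            }
        }
    }
    return $preview;
}
```

A caller could try page_preview() first and fall back to paragraph extraction when 'text' comes back null.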


Source: https://habr.com/ru/post/1784758/