Find important text in arbitrary HTML using PHP?

I have some random HTML layouts that contain important text I would like to extract. I can't just use strip_tags(), as that would leave a bunch of extra junk from the sidebar, footer, header, and so on.

I found a method implemented in Python, and I was wondering if there is anything like it in PHP.

The concept is quite simple: use information about the density of text versus HTML code to work out whether a line of text is worth outputting. (This is not a new idea, but it works!) The core process works as follows:

  • Parse the HTML code and keep track of the number of bytes processed.
  • Store the text output on a per-line or per-paragraph basis.
  • Associate with each line of text the number of bytes of HTML required to describe it.
  • Compute the text density of each line by calculating the ratio of text to bytes.
  • Then decide whether the line is part of the content by using a neural network.

You can get pretty good results just by checking whether a line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning, not to mention that it is more fun to implement!
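A minimal sketch of the fixed-threshold variant in PHP, assuming the page keeps one block of markup per source line. The function name `extractDenseLines()`, the 0.5 threshold, and the 20-byte minimum are illustrative choices, not values from the Python method, and the threshold check stands in for the neural network.

```php
<?php
// Simplified sketch: score each source line by text bytes / total bytes.
// A fixed threshold replaces the neural network described above.

/**
 * @return string[] text of the lines whose density meets the threshold
 */
function extractDenseLines(string $html, float $threshold = 0.5, int $minTextBytes = 20): array
{
    // Drop script and style blocks first; their bodies are not visible text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    $kept = [];
    foreach (preg_split('/\R/', $html) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $text = trim(html_entity_decode(strip_tags($line)));
        // Density: bytes of visible text divided by bytes of the whole line.
        $density = strlen($text) / strlen($line);
        if (strlen($text) >= $minTextBytes && $density >= $threshold) {
            $kept[] = $text;
        }
    }
    return $kept;
}

// Made-up sample: navigation and footer are mostly markup, the article is mostly text.
$html = <<<HTML
<div id="sidebar"><ul><li><a href="/about">About</a></li><li><a href="/archive">Archive</a></li></ul></div>
<p>This paragraph is the main content of the page and contains mostly plain text.</p>
<div id="footer"><a href="/rss"><img src="rss.png" alt=""></a> <a href="/top">Top</a></div>
HTML;

print_r(extractDenseLines($html));
```

On pages that put the whole document on a single line this sketch will not work as is; splitting on block-level tags instead of newlines would be one way around that.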

Update: I started a bounty for an answer that can pull the main content out of an arbitrary HTML template. Since I can't share the documents I will be using, just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may also contain text. See the link above for ideas.

+4
5 answers
  • phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library.

UPDATE 2

  • many blogs use a CMS;
  • so their HTML structure is almost always the same;
  • avoid common selectors like #sidebar, #header, #footer, #comments, etc.;
  • avoid any widget by tag name: script, iframe;
  • strip out well-known boilerplate text, for example:
    • /\d+\scomment(?:[s])/im
    • /(read the rest|read more).*/im
    • /(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
    • /[^a-z0-9]+/im
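As a rough sketch of these clean-up rules, here is one way to apply them with PHP's built-in DOM extension instead of phpQuery. `stripNoise()` and the sample markup are made up for illustration, and only the first two patterns from the list are used.

```php
<?php
// Sketch of the clean-up rules above using PHP's built-in DOM extension.
// stripNoise() and the sample markup are illustrative, not from any library.

function stripNoise(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Widgets by tag name, layout containers by their usual ids.
    $xpath = new DOMXPath($doc);
    $query = '//script | //iframe'
           . ' | //*[@id="sidebar" or @id="header" or @id="footer" or @id="comments"]';

    // Copy the matches into an array so removals cannot disturb the loop.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $text = $doc->documentElement !== null ? $doc->documentElement->textContent : '';

    // Two of the boilerplate patterns from the list above, copied verbatim.
    $patterns = [
        '/\d+\scomment(?:[s])/im',
        '/(read the rest|read more).*/im',
    ];
    $text = preg_replace($patterns, '', $text);

    // Collapse the whitespace left behind by removed nodes.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<div id="header"><h1>My Blog</h1></div>' . "\n"
      . '<div class="post"><p>Actual article text.</p><p>12 comments</p>'
      . '<a href="/more">Read more about this</a></div>' . "\n"
      . '<script>trackVisitor();</script>' . "\n"
      . '<div id="sidebar">Recent posts</div>';

echo stripNoise($html), "\n";
```

Note that loadHTML() assumes ISO-8859-1 unless the document declares a charset, so non-ASCII pages may need an encoding hint first.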

Search for well-known classes and IDs:

  • typepad.com .entry-content
  • wordpress.org .post-entry .entry .post
  • movabletype.com .post
  • blogger.com .post-body .entry-content
  • drupal.com .content
  • tumblr.com .post
  • squarespace.com .journal-entry-text
  • expressionengine.com .entry
  • gawker.com .post-body

  • Link: Blogging Platforms of Choice Among the Top 100 Blogs


 $selectors = array('.post-body', '.post', '.journal-entry-text', '.entry-content', '.content');
 // find() expects a selector string, so join the list with commas.
 $doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(', ', $selectors))->children('p,div');

Alternatively, search based on the common HTML structure, which looks like this:

 <div> <h1|h2|h3|h4|a /> <p|div /> </div> 

 $doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div'); 
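
If phpQuery is not available, the class lookup can be approximated with the built-in DOM extension. This is only a sketch: `findPostParagraphs()` is a hypothetical helper, the class names are passed without the leading dot, and they are assumed to be trusted strings since they are placed directly into the XPath expression.

```php
<?php
// Sketch: find the first element carrying a well-known content class and
// return the text of its direct <p> children.

function findPostParagraphs(string $html, array $classes): array
{
    $doc = new DOMDocument();
    // Silence libxml warnings about imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($classes as $class) {
        // Match the class as a whole word inside the class attribute.
        $query = sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
            $class
        );
        $container = $xpath->query($query)->item(0);
        if ($container === null) {
            continue; // this class is not used on the page, try the next one
        }
        $paragraphs = [];
        foreach ($xpath->query('./p', $container) as $p) {
            $paragraphs[] = trim($p->textContent);
        }
        return $paragraphs;
    }
    return [];
}

$html = '<div id="sidebar"><p>Blogroll</p></div>'
      . '<div class="hentry post"><h2>Title</h2>'
      . '<p>First paragraph.</p><p>Second paragraph.</p></div>';
$classes = ['post-body', 'post', 'journal-entry-text', 'entry-content', 'content'];

print_r(findPostParagraphs($html, $classes));
```

Unlike the phpQuery call, this stops at the first class that matches rather than merging the results of every selector.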
+5

DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
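
A minimal sketch of what that looks like, using made-up sample markup:

```php
<?php
// Sketch: parse a document with the built-in DOM extension and read
// the text of every <p> element. The sample markup is made up.

$html = '<html><head><title>Demo</title></head><body>'
      . '<h1>Heading</h1><p>First paragraph.</p><p>Second <b>paragraph</b>.</p>'
      . '</body></html>';

$doc = new DOMDocument();
// Suppress the warnings libxml raises for imperfect real-world markup.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // textContent flattens nested tags such as <b> into plain text.
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```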

Edit: wikied

+3

I worked on a similar project a while ago. It is not as sophisticated as the Python script, but it does a good job. Check out PHP Simple HTML DOM Parser:

http://simplehtmldom.sourceforge.net/

+2

Depending on your HTML structure, and on whether you have IDs or classes in place, you can get a little more involved and use preg_match() to grab the information between a specific start and end tag. This means you need to know how to write regular expressions.
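
For instance, a rough sketch that assumes the content sits in a `div` with the class `entry-content` and contains no nested `div` (the markup is invented for the example):

```php
<?php
// Sketch: grab whatever sits between a known start tag and its end tag.
// Assumes the container holds no nested <div>; regular expressions
// cannot balance nested tags reliably.

$html = '<div id="header">Site name</div>'
      . '<div class="entry-content"><p>The article body.</p></div>'
      . '<div id="footer">Copyright</div>';

// Non-greedy (.*?) stops at the first closing </div>; the s flag lets
// the dot match newlines so multi-line content is captured too.
$pattern = '#<div class="entry-content">(.*?)</div>#s';

$content = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // $matches[1] holds the captured inner HTML; strip the remaining tags.
    $content = trim(strip_tags($matches[1]));
}

var_dump($content);
```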

You can also look into a browser emulation PHP class. I have done this for page scraping, and it works quite well, depending on how well formed the DOM is. I personally like SimpleBrowser:
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html

+1

I developed an HTML parser and filter package in PHP that can be used for this purpose.

It consists of a set of classes that can be chained together to perform a series of parsing, filtering, and transformation operations on HTML/XML code.

It was designed to work with real-world pages, so it can cope with malformed tags and data structures while preserving as much of the original document as possible.

One of the filter classes it comes with can perform DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all the links from a document.

All of these filter classes are optional. You can chain them together any way you want, if you need any at all.

So, to solve your problem, I don't think a specific solution already exists in PHP, but a special filter class could be developed for it. Take a look at the package. It is fully documented.

If you need help, just check my profile and contact me, and I may even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.

+1

Source: https://habr.com/ru/post/1344318/

