HTML text analysis

I have a crawler that collects articles from the Internet and stores the title and body in a database. Until now, a programmer had to write a set of rules for each source (usually XPath, sometimes regular expressions) that points at the article title and the sections making up the body of the page. Now I'm trying to go one step further and have the program determine the title and body text automatically. My first approach assigns a weight to each element based on some common criteria. For instance:

//@x-weight = 1.0

//h1/@x-weight * 2.0

//h2/@x-weight * 1.8
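
For illustration, here is a minimal sketch of how such tag-based weights could be attached to every element, assuming Python with lxml; the file name article.html is a placeholder and the only multipliers shown are the two from the rules above:

    from lxml import html

    # Multipliers taken from the rules above; the real rule set is much larger.
    TAG_MULTIPLIERS = {"h1": 2.0, "h2": 1.8}

    def assign_weights(root):
        """Give every element the base weight 1.0, multiplied for heading-like tags."""
        weights = {}
        for el in root.iter():
            if not isinstance(el.tag, str):      # skip comments and processing instructions
                continue
            weights[el] = 1.0 * TAG_MULTIPLIERS.get(el.tag.lower(), 1.0)
        return weights

    tree = html.fromstring(open("article.html", "rb").read())
    for el, w in assign_weights(tree).items():
        el.set("x-weight", str(w))               # mirrors the //@x-weight annotation
    print(tree.xpath("//h1/@x-weight"))          # e.g. ['2.0']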

There are many more rules, but you get the idea. After assigning weights based on the markup, I take some other signals into account, such as similarity with /head/title and the number of keywords. While this approach gets decent results for most web pages (thanks to SEO experts :P), it fails catastrophically on some others. I have thought about using an artificial neural network, but I can't find enough evidence that it would give much better results. Another option is to bring CSS into play and adjust the weights according to the font.
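
A hedged sketch of the /head/title similarity part, using difflib as one plausible similarity measure; adding it to the markup weight is an assumption, not the original formula:

    import difflib
    from lxml import html

    def title_similarity(root, candidate_text):
        """Ratio in [0, 1] between a candidate's text and <head><title>."""
        titles = root.xpath("/html/head/title/text()")
        if not titles:
            return 0.0
        return difflib.SequenceMatcher(None, titles[0].strip().lower(),
                                       candidate_text.strip().lower()).ratio()

    tree = html.fromstring(open("article.html", "rb").read())
    for h1 in tree.xpath("//h1"):
        score = float(h1.get("x-weight", "1.0")) + title_similarity(tree, h1.text_content())
        print(h1.text_content()[:60], round(score, 2))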

Question(s):

  • Which way should I choose?
  • Did I miss something?
  • Is there a better way to do this?

PS: I know that there is no perfect solution for such a problem.

+3
2 answers

[Answer text not recoverable from the translation; only its references to CSS and the h1, h2, h3 heading tags survive.]

+1

[Most of this answer was lost in translation; the recoverable part is a list of signals to look at (see the sketch below for one way to combine them):]

  • HTML markup such as h1, h2, etc.
  • The page title.
  • CSS fonts/styles.
  • Position on the page (the first 1/3).
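
One way such signals might be folded into a single score, as a toy sketch only; the coefficients, the 16 px base font size, and the exact 1/3 position cutoff are made-up values:

    def combined_score(markup_weight, title_similarity, font_size_px, relative_position):
        """Toy combination of the signals listed above; tune the numbers per corpus."""
        font_bonus = font_size_px / 16.0                  # relative to a typical base font size
        position_bonus = 1.0 if relative_position <= 1 / 3 else 0.5
        return markup_weight * (1.0 + title_similarity) * font_bonus * position_bonus

    # e.g. an <h1> that closely matches the page title, rendered large, near the top of the page
    print(round(combined_score(2.0, 0.9, 24, 0.05), 2))   # -> 5.7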

+1

Source: https://habr.com/ru/post/1793253/

