Find the HTML element that contains the most references to a given word

I have an HTML document and I would like to find the HTML element that is the closest wrapper around the largest cluster of references to a given word.

In the following HTML:

    <body>
      <p>
        Hello <b>foo</b>, I like foo, because foo is the best.
      <p>
      <div>
        <blockquote>
          <p><strong>Foo</strong> said: foo foo!</p>
          <p>Smurfs ate the last foo and turned blue. Foo!</p>
          <p>Foo foo.</p>
        </blockquote>
      </div>
    </body>

I would like to have a function

    find_largest_cluster_wrapper(html, word='foo')

... which would parse the DOM tree and return the <blockquote> element to me, since it contains the highest density of references to foo and is their closest wrapper.

The first <p> contains foo 3 times, <b> only once; the inner <p> elements contain foo 3, 2, and 2 times; <strong> only once. <blockquote> contains foo 7 times in total. <div> contains the same occurrences, but it is not the closest wrapper. The <body> element has the most references of all, but they are too sparse to count as the cluster.

A naive implementation without clustering would always give me <html> or <body> or something like that, because those elements contain the most references overall and would technically be their closest wrapper. However, I need the element wrapping the largest cluster, since I am only interested in the part of the web page with the highest word density.
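
To illustrate the problem, here is a minimal sketch of such a naive implementation (using beautifulsoup4; the function name is just illustrative). Because an element's text includes all of its descendants' text, the outermost elements always win:

    from bs4 import BeautifulSoup

    def naive_most_references(html, word='foo'):
        # Picks the element whose text mentions the word most often.
        # get_text() includes descendant text, so <html>/<body> always
        # matches at least as often as anything nested inside them.
        soup = BeautifulSoup(html, 'html.parser')
        best, best_count = None, -1
        for element in soup.find_all(True):   # every tag in the document
            count = element.get_text().lower().count(word)
            if count > best_count:
                best, best_count = element, count
        return best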

I am not that interested in the parsing part; it can be solved well with beautifulsoup4 or other libraries. I'm curious about an efficient algorithm for the clustering. I googled for a while and I think the clustering package in scipy might be useful, but I have no idea how to use it. Can someone recommend a better solution or point me in the right direction? Examples would be awesome.


As you pointed out, the question is hard to answer as stated because the conditions are vague. So, more specifically:

  • Typically, a document will contain only one such cluster. My intention is to find that cluster and get its wrapper so that I can manipulate it. The word may be mentioned elsewhere on the page, but I am looking for the one notable cluster.
  • If there are two or more notable clusters, then I have to use an external bias to decide (examine headings, the page title, etc.).
  • What does it mean for a cluster to be notable? Exactly what I presented above: that it has no "serious" competitors. Whether a competitor is serious or not could be decided by a threshold (a ratio) that I supply: for example, with a cluster of 10 and a cluster of 2, the difference is 80%. I could say that a cluster with a difference of more than 50% wins. This means that with one cluster of 5 and another of 5, the function would return None (cannot decide).
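
To make that rule concrete, here is a small sketch of the tie-breaking decision (the helper name and input format are illustrative; the 50% threshold is the one mentioned above):

    def pick_notable_cluster(scored_clusters, min_difference=0.5):
        # scored_clusters: list of (score, wrapper_element) pairs.
        # Returns the wrapper of the clearly dominant cluster, or None
        # when the runner-up is too close (e.g. 5 vs 5 -> 0% difference).
        if not scored_clusters:
            return None
        ranked = sorted(scored_clusters, key=lambda c: c[0], reverse=True)
        if len(ranked) == 1:
            return ranked[0][1]
        best, second = ranked[0][0], ranked[1][0]
        # e.g. clusters of 10 and 2 -> (10 - 2) / 10 = 80% difference
        if best > 0 and (best - second) / best > min_difference:
            return ranked[0][1]
        return None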

3 answers

So here is the approach:

    fitness(node, word) = count of word in node text            if node is a leaf
    fitness(node, word) = sum(fitness(child, word) for child in children)
                          / count of overall elements in node's subtree

Here it is:

    import lxml.html

    html = """<html><body>
    <p>
     Hello <b>foo</b>, I like foo, because foo is the best.
    <p>
    <div>
     <blockquote>
      <p><strong>Foo</strong> said: foo foo!</p>
      <p>Smurfs ate the last foo and turned blue. Foo!</p>
      <p>Foo foo.</p>
     </blockquote>
    </div>
    </body></html>"""
    node = lxml.html.fromstring(html)

    def suitability(node, word):
        # Returns the node with the best fitness found during the recursion.
        mx = [0.0, None]  # [best fitness so far, corresponding node]
        _suitability(node, word, mx)
        return mx[1]

    def _suitability(node, word, mx):
        children = node.getchildren()
        sparsity = 1
        result = float(node.text_content().lower().count(word))
        for child in children:
            res, spars = _suitability(child, word, mx)
            result += res
            sparsity += spars
        # Normalize by the number of elements in this subtree.
        result /= sparsity
        current_max, max_node = mx
        if current_max < result:
            mx[0] = result
            mx[1] = node
        return result, sparsity

    print(suitability(node, 'foo'))

It gives us the blockquote element as the fittest. And by tweaking the fitness function, you can change the parameters of the desired cluster.
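
For instance, to inspect and manipulate the winning wrapper (a small usage sketch; per the answer, the result should be the blockquote):

    best = suitability(node, 'foo')
    print(best.tag)                    # expected: 'blockquote'
    print(lxml.html.tostring(best))    # markup of the wrapper, ready for manipulation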


So this is not a library, but I have an idea.

What if you build the HTML parse tree and then annotate each node with two things:

  • The total number of words it contains.
  • The number of times it contains your target word.

You can then search the tree for the node that maximizes target_count / total_count . This gives you the smallest-containing-element property you want, because an element higher up the tree contains more words in total, which dilutes its density. In effect, this gives you the node with the highest density of the target word.
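
A minimal sketch of that search, assuming lxml for the parse tree as in the other answer (the function name and the crude whitespace tokenization are my own):

    import lxml.html

    def densest_element(html, word='foo'):
        root = lxml.html.fromstring(html)
        best, best_density = None, 0.0
        for element in root.iter():                 # visit every node of the parse tree
            text = element.text_content().lower()
            total_count = len(text.split())         # total words in this subtree
            target_count = text.count(word)         # occurrences of the target word
            if total_count == 0:
                continue
            density = target_count / total_count
            if density > best_density:
                best_density, best = density, element
        return best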

You may find that this simple division does not give quite the results you want. For example, if a node contains only a single copy of the target word, it will have a very high density, but may not match the notion of a cluster you have in mind. In that case, I would define some function that maps the number of words contained in an element to a size. If you want clusters to have a certain minimum size, and to penalize clusters that are too large (for example <body>), it could be something like this:

    def size(num_words):
        num_words = max(num_words, 40)  # where 40 is the min size of a cluster
        if num_words > 1000:            # this is too large, so we penalize it
            num_words *= 1.5
        return num_words

Now you can rank nodes by target_count / size(total_count) instead.
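
Plugging size() into the search sketched earlier would then look roughly like this (same assumptions as above):

    def densest_element_sized(html, word='foo'):
        root = lxml.html.fromstring(html)
        best, best_score = None, 0.0
        for element in root.iter():
            text = element.text_content().lower()
            # Adjusted density: raw count divided by the clamped/penalized size.
            score = text.count(word) / size(len(text.split()))
            if score > best_score:
                best_score, best = score, element
        return best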

Re: scipy clustering

scipy's clustering works on vectors. So to use that package, you would need to come up with a way to turn the occurrences of your target word into vectors. I can't think of a good way to do that off the top of my head, but that doesn't mean such a way doesn't exist.
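
Purely to illustrate the kind of input scipy's clustering expects (this is an assumption, not a recommendation): one could represent each occurrence of the word by its position in the page's word sequence and cluster those one-dimensional points, e.g. hierarchically with a distance threshold.

    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata

    def occurrence_vectors(document_text, word='foo'):
        # One observation per occurrence: its index in the word sequence.
        words = document_text.lower().split()
        positions = [i for i, w in enumerate(words) if word in w]
        return np.array(positions, dtype=float).reshape(-1, 1)

    # Usage sketch (document_text and the 30-word threshold are placeholders):
    # X = occurrence_vectors(document_text, 'foo')
    # labels = fclusterdata(X, t=30.0, criterion='distance')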


The scipy clustering package will not help you much here, because this is not cluster analysis.

This is more along the lines of frequent pattern mining. You might want to look into that.


Source: https://habr.com/ru/post/1445547/

