I have an HTML document and a word, and I would like to find the HTML element that is the closest wrapper of the largest cluster of occurrences of that word.
In the following HTML:
<body>
    <p>
        Hello <b>foo</b>, I like foo, because foo is the best.
    </p>
    <div>
        <blockquote>
            <p><strong>Foo</strong> said: foo foo!</p>
            <p>Smurfs ate the last foo and turned blue. Foo!</p>
            <p>Foo foo.</p>
        </blockquote>
    </div>
</body>
I would like to have a function
find_largest_cluster_wrapper(html, word='foo')
... which would parse the DOM tree and return the <blockquote> element, since it contains the highest density of occurrences of foo and is their closest wrapper.
Counting case-insensitively: the first <p> contains foo 3 times and its <b> only once. The inner <p> elements contain foo 3 times, twice, and twice, and <strong> only once. But <blockquote> contains foo 7 times. So does <div>, but it is not the closest wrapper. The <body> element has the most occurrences (10), but they are spread too sparsely to count as a cluster.
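To make these counts easy to check, here is a minimal sketch that tallies the occurrences of a word in every element's subtree. It uses only the standard library's html.parser so that it is self-contained; beautifulsoup4 would do the parsing just as well. The names Node, TreeBuilder and subtree_counts are made up for this illustration, and the sketch assumes well-formed markup (every opened tag is closed, no void elements such as <br>), with the first paragraph of the example closed properly.

```python
import re
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: tag name, child nodes, and directly owned text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text chunks that sit directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a naive tree. Assumption: every opened tag is closed properly."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Naively step back to the parent without checking the tag name.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text.append(data)


def subtree_counts(html, word):
    """Return (tag, count) pairs in post-order; count covers the whole subtree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Whole-word, case-insensitive match, applied to each text chunk separately.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    counts = []

    def visit(node):
        own = sum(len(pattern.findall(chunk)) for chunk in node.text)
        total = own + sum(visit(child) for child in node.children)
        counts.append((node.tag, total))
        return total

    visit(builder.root)
    return counts


EXAMPLE_HTML = """
<body>
  <p>Hello <b>foo</b>, I like foo, because foo is the best.</p>
  <div>
    <blockquote>
      <p><strong>Foo</strong> said: foo foo!</p>
      <p>Smurfs ate the last foo and turned blue. Foo!</p>
      <p>Foo foo.</p>
    </blockquote>
  </div>
</body>
"""

for tag, count in subtree_counts(EXAMPLE_HTML, "foo"):
    print(tag, count)
```

For the example document this prints b 1, p 3, strong 1, p 3, p 2, p 2, blockquote 7, div 7, body 10 and [document] 10, one pair per line. The per-element totals are only the raw input for the clustering question; they do not by themselves say which wrapper to return.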
A naive implementation without clustering would always give me <html> or <body> or something similar, because such elements always contain the most occurrences and are probably also their closest wrapper. However, I need the element that wraps the largest cluster, since I am only interested in the part of the web page with the highest density of the word.
I am not very interested in the parsing part; that is well handled by beautifulsoup4 or other libraries. I am curious about an efficient algorithm for the clustering. I googled for a while and I think the clustering package in scipy might be helpful, but I have no idea how to use it. Can someone recommend a better solution or point me in the right direction? Examples would be awesome.
Well, it was hard to answer this question at all, because the conditions were vague, as you pointed out. So, more specifically:
Typically, a document will contain only one such cluster. My intention is to find that cluster and get its wrapper so that I can manipulate it. The word may be mentioned elsewhere on the page, but I am looking for the one notable cluster. If there are two or more notable clusters, I have to use external heuristics to decide (examine headings, the page title, etc.). What does it mean for a cluster to be notable? Exactly what I just described: that it has no "serious" competitors. Whether a competitor is serious could be decided by a threshold on a ratio. For example, if there is a cluster of 10 and a cluster of 2, the difference is 80%. I could say that a cluster leading by more than 50% is the winner. This means that for a cluster of 5 and another of 5, the function would return None (cannot decide).
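The "no serious competitors" rule can be sketched as a small function. This is only an illustration of the threshold described above, not a clustering algorithm: pick_dominant and min_margin are hypothetical names, and the margin is assumed to be (largest - runner_up) / largest, which is the reading under which 10 versus 2 gives 80%.

```python
def pick_dominant(cluster_sizes, min_margin=0.5):
    """Return the index of the dominant cluster, or None if there is no clear winner.

    Assumption: the margin is (largest - runner_up) / largest,
    so sizes 10 and 2 give a margin of 0.8.
    """
    if not cluster_sizes:
        return None
    # Indices ordered from the largest cluster to the smallest.
    ranked = sorted(
        range(len(cluster_sizes)),
        key=lambda i: cluster_sizes[i],
        reverse=True,
    )
    best = ranked[0]
    if cluster_sizes[best] <= 0:
        return None  # the word does not occur at all
    if len(ranked) == 1:
        return best  # a single cluster has no competitors
    runner_up = ranked[1]
    margin = (cluster_sizes[best] - cluster_sizes[runner_up]) / cluster_sizes[best]
    # "More than 50%" is read as a strict inequality.
    return best if margin > min_margin else None


print(pick_dominant([10, 2]))  # 0: the margin is 0.8
print(pick_dominant([5, 5]))   # None: the margin is 0.0
```

With the strict comparison, a lead of exactly 50% (for example 10 versus 5) also returns None; whether that boundary case should count as a win is a choice I have not fixed.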