I am working on a web application where I display HTML from other sites. Before displaying the final version, I would like to get rid of the ads.
Any ideas or suggestions on how to do this? It does not need to be a super-efficient filtering tool. I thought of porting the filter definitions used by Adblock Plus to Ruby and returning the processed document using Nokogiri.
Let's say I use a catch-all "ad" filter. This is not an official Adblock Plus filter, but for simplicity I will use it here. The idea is to remove every element that has any attribute whose value matches the filter, for example: src="http://ad.foo.com?my-ad.gif", href="http://ad.foo.com", class="annoying-ad", etc.
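To make the porting idea concrete, here is a minimal sketch of what I have in mind. It covers only two simplified shapes of Adblock Plus filters (a "||host^" domain anchor and a "##selector" element hiding rule) and skips everything else. The method name translate_filter and the sample filters are made up for illustration, and the domain match is cruder than what Adblock Plus actually does (it ignores subdomains, for instance).

```ruby
# Translate a small, simplified subset of Adblock Plus filter syntax into
# [method, query] pairs that can be sent to a Nokogiri document.
# translate_filter is a hypothetical helper, not part of any library.
def translate_filter(filter)
  case filter
  when /\A\|\|([^\^\/*]+)\^?\z/
    # "||ad.foo.com^": treat as "src or href points at this host".
    # Simplification: a plain substring test, so subdomains are not matched.
    host = Regexp.last_match(1)
    [:xpath, "//*[contains(@src, '//#{host}') or contains(@href, '//#{host}')]"]
  when /\A##(.+)\z/
    # "##.annoying-ad": element hiding rule, the part after ## is a CSS selector.
    [:css, Regexp.last_match(1)]
  else
    nil # comments, options, exceptions, etc. are ignored in this sketch
  end
end

# Sample filters, invented for this example.
FILTERS = ["||ad.foo.com^", "##.annoying-ad", "! a comment line"]
QUERIES = FILTERS.map { |f| translate_filter(f) }.compact
```

With a parsed document in doc, each pair could then be applied with something like QUERIES.each { |method, query| doc.public_send(method, query).remove }, since both doc.css and doc.xpath return a node set.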
The Nokogiri command for this filter will be:
doc.xpath("//*[@*[contains(., 'ad')]]").each { |element| element.remove }
I applied the filter to this page:

And the result:

Not so bad. Note that the catch-all filter also got rid of valid elements such as the header, because they have attributes like id="masthead", which contains "ad".
So I think this approach is suitable for my case. Now the question is: which filters should I use? Adblock Plus has a huge list of filters, and I don't want to go through all of them. I'm going to pick the top 10-20 and process the documents with those. Is there a list of the most popular ones? If there is, I could not find it.
Thanks!