I want to classify a large number of websites (millions). I can use Nutch to crawl them and fetch their content, but I'm looking for the best (and cheapest, or free) tool to categorize them.
One option is to write regular expressions that search for specific keywords and classify sites that way, but there are also high-end tools that use LSI (latent semantic indexing), such as Autonomy. Are there any open-source or cheaper tools that will take the text of a web page / website and classify it for me? I need to be able to customize the set of categories. As part of the categorization, I'd also like to detect "fake" sites that are really just parked pages, or domains whose owners simply run ads on them, in addition to the plain old categories such as news, sports, science, health, nutrition, entertainment, etc.
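For what it's worth, here is a minimal sketch of the regex/keyword approach I described above, including a crude parked-page check. The keyword lists and parked-page phrases are hypothetical placeholders and would obviously need real tuning; this is just to show the shape of the idea, not a production classifier:

```python
import re

# Hypothetical keyword patterns per category -- these would need real tuning.
CATEGORY_KEYWORDS = {
    "sports": [r"\bfootball\b", r"\bscore(s)?\b", r"\bleague\b"],
    "health": [r"\bdiet\b", r"\bsymptom(s)?\b", r"\bnutrition\b"],
    "news":   [r"\bbreaking\b", r"\breporter\b", r"\bheadline(s)?\b"],
}

# Phrases that commonly show up on parked / ad-only pages (also assumptions).
PARKED_PATTERNS = [
    r"this domain (is|may be) for sale",
    r"buy this domain",
    r"related searches",
]

def classify(text: str) -> str:
    """Return 'parked', the category with the most keyword hits, or 'unknown'."""
    lowered = text.lower()
    # Check for parked-page markers first, before topical categories.
    if any(re.search(p, lowered) for p in PARKED_PATTERNS):
        return "parked"
    # Score each category by how many of its patterns match at least once.
    scores = {
        cat: sum(bool(re.search(p, lowered)) for p in patterns)
        for cat, patterns in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

if __name__ == "__main__":
    print(classify("Breaking: the league announced the final scores today"))
```

This obviously won't give the quality of an LSI tool like Autonomy, which is why I'm asking; on the open-source side, I'm aware that libraries like scikit-learn offer LSA (TruncatedSVD) that could feed a trained classifier, but I don't know how well that scales to millions of sites.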