In fact, there are many freely available open source natural language processing toolkits. Here is a short list, organized by the language each toolkit is implemented in:
If you don't know which one to go with, I would recommend starting with NLTK. The package is quite easy to use and has excellent online documentation, including a free book.
You should be able to use NLTK to easily perform the NLP tasks you listed, for example recognizing person names (NER), retrieving tags for documents, and categorizing a document.
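As a rough illustration of the person-name case, here is a minimal sketch using NLTK's stock tokenizer, POS tagger, and named entity chunker. The function names `names_from_chunks` and `person_names` are my own, and the code assumes the required NLTK data packages have already been fetched with `nltk.download(...)` (the exact resource names vary between NLTK versions).

```python
def names_from_chunks(tree, label="PERSON"):
    """Collect the text of every chunk in `tree` that carries `label`.

    `tree` is expected to behave like an nltk.Tree as returned by
    nltk.ne_chunk: it offers subtrees(), and each subtree offers
    label() and leaves(), where leaves are (token, pos_tag) pairs.
    """
    names = []
    for subtree in tree.subtrees():
        if subtree.label() == label:
            names.append(" ".join(token for token, _tag in subtree.leaves()))
    return names


def person_names(text):
    """Run NLTK's default NER pipeline on `text` and return person names."""
    # Imported here so the helper above stays usable without NLTK installed.
    import nltk

    tokens = nltk.word_tokenize(text)   # needs the punkt tokenizer data
    tagged = nltk.pos_tag(tokens)       # needs the perceptron tagger data
    tree = nltk.ne_chunk(tagged)        # needs the NE chunker and words data
    return names_from_chunks(tree)
```

Calling `person_names("...")` on a sentence of your own returns a list of strings. The stock chunker is a reasonable starting point, but expect to check its output on your own documents before relying on it.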
What the Alchemy people call structured data mining looks like it is just HTML scraping that is robust against changes to the underlying HTML, as long as the page still renders the same visually. So that part is not an NLP task.
To extract text from HTML, simply use boilerpipe. It is fast, good, and free.
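boilerpipe itself is a Java library, so the following is not its API or its algorithm. Purely to show what the basic step of pulling visible text out of HTML involves, here is a naive sketch using only Python's standard-library `html.parser`. The names `TextExtractor` and `html_to_text` are my own, and unlike boilerpipe this does not try to detect and drop navigation, ads, or other boilerplate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script/style tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        # One text chunk per line, in document order.
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This keeps every piece of visible text, menus and footers included, which is exactly the noise a dedicated tool like boilerpipe is meant to filter out.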