I use Lucene.NET for full-text search. So far I have indexed PDF documents, but now I also have some web pages to index. What is the best or easiest way to index HTML documents and add them to my Lucene index? I am using .NET / C#.
I am currently working on this issue. The best answer I have found so far is to use HTML Agility Pack to extract the plain-text content from the HTML.
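For illustration, here is a minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is installed. The class and method names (`HtmlTextExtractor`, `ExtractText`) are hypothetical, and dropping `script` and `style` elements is my own assumption about what should not be indexed:

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns roughly the readable text of an HTML string.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var nonContent = doc.DocumentNode.SelectNodes("//script|//style");
        if (nonContent != null)
        {
            // Copy to a list first so removal does not disturb the iteration.
            foreach (var node in nonContent.ToList())
            {
                node.Remove();
            }
        }

        // InnerText concatenates the text nodes; DeEntitize decodes entities such as &amp;.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse the runs of whitespace left behind by the markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

The returned string can then be added to the Lucene index in the same way as the text extracted from the PDF documents; the exact field type depends on the Lucene.NET version in use.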
Google may index your content for you.
Source: https://habr.com/ru/post/1725915/