How can I index HTML documents?

I use Lucene.NET for full-text search. So far I have indexed PDF documents, but now I have some web pages that I need to index as well. What is the best or easiest way to index HTML documents and add them to my Lucene index? I am using .NET / C#.

+3
2 answers

I am currently working on the same problem. The best approach I have found so far is to use HTML Agility Pack to extract the plain-text content from the HTML, and then index that text.
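For what it's worth, here is a minimal sketch of that idea, not a definitive implementation. It assumes the HtmlAgilityPack NuGet package and the Lucene.NET 3.x `Field` API (newer Lucene.NET versions use `TextField` / `StringField` instead). The class name `HtmlIndexing` and the field names `url` and `content` are made up for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Lucene.Net.Documents;

public static class HtmlIndexing
{
    // Returns the text content of an HTML string with the markup removed.
    public static string ExtractText(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Drop <script> and <style> elements so their contents are not indexed.
        // SelectNodes returns null when nothing matches, so guard against that.
        var hidden = htmlDoc.DocumentNode.SelectNodes("//script|//style");
        if (hidden != null)
        {
            // Copy to an array first so removal cannot disturb the enumeration.
            foreach (var node in hidden.ToArray())
            {
                node.Remove();
            }
        }

        // Decode entities such as &amp; and collapse runs of whitespace.
        string text = HtmlEntity.DeEntitize(htmlDoc.DocumentNode.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Builds a Lucene document for one page.
    // "url" and "content" are hypothetical field names, pick your own.
    public static Document ToLuceneDocument(string url, string html)
    {
        var doc = new Document();
        // Store the URL as-is so a search hit can link back to the page.
        doc.Add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Analyze the extracted text for full-text search without storing it.
        doc.Add(new Field("content", ExtractText(html), Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}
```

The resulting `Document` can be passed to the same `IndexWriter.AddDocument` call you presumably already use for the PDF documents. Note that `InnerText` simply concatenates text nodes, so words in adjacent block elements can run together unless the markup has whitespace between them.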

+1

Google may index your content for you.

-3

Source: https://habr.com/ru/post/1725915/

