How can I index HTML documents?

Question

I use Lucene.NET for full-text search. So far I have indexed PDF documents, but now I have some web pages that I need to index as well. What is the best or easiest way to index HTML documents and add them to my Lucene index? I am using .NET / C#.

.net indexing full-text-search lucene lucene.net

Prabhu Dec 17 '09 at 1:57

2 answers

Adam pope · Answer 1 · 2010-03-23T09:57:31+0000

I am currently working on the same problem. The best approach I have found so far is to use the HTML Agility Pack to extract the plain-text content from the HTML.
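As a rough illustration of that approach, here is a minimal sketch that strips an HTML string down to its visible text with the HTML Agility Pack. The class name `HtmlTextExtractor` and the method `ExtractText` are made up for this example, and dropping `script` and `style` elements is my own assumption about what you would not want in the index.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper for illustration; not part of HTML Agility Pack or Lucene.NET.
public static class HtmlTextExtractor
{
    // Returns the visible text of an HTML string as a single whitespace-normalized line.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: script and style contents should not be indexed.
        // ToList() copies the matches so nodes can be removed while iterating.
        var hidden = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in hidden)
        {
            node.Remove();
        }

        // Collect the remaining text nodes and decode entities such as &amp;.
        // Joining with a space keeps words from adjacent elements apart.
        var parts = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText))
            .ToArray();
        string text = string.Join(" ", parts);

        // Collapse runs of whitespace (a null separator splits on any whitespace).
        return string.Join(" ", text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

The returned string can then be added to your Lucene index as an analyzed text field, in the same way as the text you already extract from PDFs. One caveat of joining text nodes with spaces is that a word split across inline tags (for example `<b>Hel</b>lo`) ends up as two tokens.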

Pierreten · Answer 2 · 2009-12-17T02:01:33+0000

Google may index your content for you.
