I got lost in: Hadoop, HBase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI ...
When you read about one of them, you can be sure each of the other tools will be mentioned too.
I don't expect you to explain all of these tools to me, of course. If you could help me narrow this set down to my specific scenario, that would be great. So far I'm not sure which of the above fit, and it looks (as always) like there is more than one way to do what needs to be done.
Scenario: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in several formats: email, doc, pdf, odt. Metadata about these documents is stored in an SQL db (sender, recipients, date, department, etc.). The main source of documents will be Exchange Server (emails and attachments), but not only. Now for the search: the user should be able to perform complex full-text queries over these documents. He will be presented with a search-config panel (a standalone Java application, not a webapp) where he sets the date range, document types, senders/recipients, keywords, etc., starts the search and gets back a list of matching documents (and, for each document, information about why it was included in the results, i.e. which keywords occur in it).
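To make that concrete, this is roughly the kind of query I picture the search panel producing, assuming Solr/SolrJ turns out to be the right tool here; the core name, field names and filter values are just made-up examples:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

import java.util.List;
import java.util.Map;

public class DocumentSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core "documents"; schema/field names are assumptions.
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build();

        SolrQuery query = new SolrQuery();
        // Keywords entered in the search-config panel.
        query.setQuery("content:(contract AND termination)");
        // Metadata filters: date range, sender, document type.
        query.addFilterQuery("date:[2011-01-01T00:00:00Z TO 2011-06-30T23:59:59Z]");
        query.addFilterQuery("sender:\"john.doe@example.com\"");
        query.addFilterQuery("doctype:(pdf OR doc)");
        // Highlighting is what would tell the user *why* a document matched.
        query.setHighlight(true);
        query.addHighlightField("content");
        query.setRows(50);

        QueryResponse response = solr.query(query);
        Map<String, Map<String, List<String>>> highlights = response.getHighlighting();
        for (SolrDocument doc : response.getResults()) {
            String id = (String) doc.getFieldValue("id");
            Map<String, List<String>> hl = highlights.get(id);
            System.out.println(id + " matched: "
                    + (hl != null ? hl.get("content") : "(no snippet)"));
        }
        solr.close();
    }
}
```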
Which tools should I consider and which not? The point is to develop such a solution with the minimum of necessary "glue" code. I'm comfortable with SQL databases, but rather inexperienced with the Apache-related technologies.
The main workflow is as follows: Exchange Server / other source → conversion from doc/pdf/... → deduplication → Hadoop + SQL (metadata) → index build/update ← document search (which has to be fast) → return search results
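And a minimal sketch of the conversion + indexing step as I currently imagine it, assuming Tika for text extraction and SolrJ for feeding the index; field names, the Solr URL and the metadata values are made up, and deduplication plus the SQL metadata write are left out:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

import java.io.File;

public class IngestPipeline {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format (doc, pdf, odt, eml, ...) and returns plain text.
        Tika tika = new Tika();
        File source = new File(args[0]);
        String text = tika.parseToString(source);

        // Hypothetical schema: id, content, plus metadata that also lives in the SQL db.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", source.getName());           // in practice: a dedup hash of the content
        doc.addField("content", text);
        doc.addField("doctype", "pdf");                 // would come from the conversion step
        doc.addField("sender", "john.doe@example.com"); // would come from the Exchange metadata

        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {
            solr.add(doc);
            solr.commit();
        }
    }
}
```

If something like this is roughly the intended way to glue Tika and Solr together, that would already answer a big part of my question.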
Thanks!