Search for documents stored in Hadoop - which tool to use?

I got lost in: Hadoop, HBase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI ...

Whenever you read about one of these tools, you can be sure each of the others will be mentioned.

I do not expect you to explain all the tools to me, of course. If you could help me narrow this set down to my specific scenario, that would be great. So far I'm not sure which of the above fit, and it looks (as always) like there is more than one way to do what needs to be done.

Scenario: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in several formats: email, doc, pdf, odt. Metadata about these documents is stored in an SQL DB (sender, recipients, date, department, etc.). The main source of documents will be Exchange Server (emails and attachments), but not the only one.

Now for the search: the user should be able to run complex full-text queries over these documents. Basically, he will be presented with a search-config panel (a desktop Java application, not a webapp); he will set the date range, document types, senders/recipients, keywords, etc., start the search, and get the resulting list of documents (and, for each document, information about why it was included in the results, i.e. which keywords occur in it).

Which tools should I consider and which should I rule out? The point is to develop such a solution with the minimum necessary "glue" code. I'm at home with SQL DBs, but rather green with the Apache-related technologies.

The main workflow is as follows: Exchange Server / other source → conversion from doc/pdf/... → deduplication → Hadoop + SQL (metadata) → building / updating the index ← querying the index (and doing it fast) → presenting search results

Thanks!

+6
5 answers

As a side note, strictly speaking you cannot say the documents are stored "in Hadoop"; they are stored in a distributed file system (most likely HDFS, since you mentioned Hadoop).

Regarding search/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching; it is a Java library. There is also a related project (called Solr) that exposes the indexing/search system through web services. You should also take a look at Solr, since it makes it easier to handle various document types (with raw Lucene, the responsibility for translating a document (PDF, Word, etc.) into indexable text is on your shoulders, but you can probably do that already).
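
To give a feel for how small the Lucene surface is, here is a minimal indexing-and-search sketch; the index path and the field names ("sender", "body") are made up for illustration:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/doc-index")); // hypothetical index location
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one document: "body" holds the extracted text, "sender" is a metadata field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("sender", "alice@example.com", Field.Store.YES));
                doc.add(new TextField("body", "quarterly report attached", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Full-text search over the "body" field.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer).parse("quarterly AND report");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("sender"));
                }
            }
        }
    }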

+2

We did just that for some of our customers, using Solr as a "secondary indexer" for HBase. Updates to HBase are forwarded to Solr, and then you can query Solr. Typically people start with HBase and graft search on afterwards. It sounds like you know from the start that search is what you want, so you can probably embed the secondary indexing into the pipeline that feeds HBase.
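
A rough sketch of that dual-write pattern, using the HBase client and SolrJ; the table name, column family, Solr core URL, and field names are all assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SecondaryIndexWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("documents"));
                 HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {

                String docId = "msg-0001";
                String body = "extracted plain text of the document";

                // 1) Store the raw document in HBase (column family "d").
                Put put = new Put(Bytes.toBytes(docId));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes(body));
                table.put(put);

                // 2) Forward the searchable fields to Solr as the secondary index.
                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.addField("id", docId);
                solrDoc.addField("body_txt", body);
                solr.add(solrDoc);
                solr.commit();
            }
        }
    }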

You may find that just using Solr does everything you need.

+2

Going with Solr is a good option. I have used it for a scenario similar to the one described above. You can use Solr as your distributed index server for really huge data.

But to get metadata out of all those document formats, you will need another tool. Basically your workflow will look like this (a sketch of the extraction step follows the list):

1) Use the Hadoop cluster to store the data.

2) Process the data in the cluster using map/reduce.

3) Identify each document (detect the document type).

4) Extract the metadata from the document.

5) Index the metadata on the Solr server; store the remaining retrieval information in the database.

6) The Solr server is a distributed indexing server, so you can create a new shard or index for each batch.

7) When a search comes in, query across all the indexes.

8) Solr supports all the complex search queries, so you do not need to build your own search engine.

9) It also does paging of the results for you.
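
Steps 3 and 4 are exactly what Apache Tika (already on the question's list) is for; a minimal sketch, assuming a local sample.pdf as input:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractStep {
        public static void main(String[] args) throws Exception {
            Path file = Paths.get("sample.pdf"); // hypothetical input document
            AutoDetectParser parser = new AutoDetectParser(); // steps 3-4: detect the type, then parse
            BodyContentHandler text = new BodyContentHandler(-1); // -1: no limit on extracted text
            Metadata metadata = new Metadata();

            try (InputStream in = Files.newInputStream(file)) {
                parser.parse(in, text, metadata);
            }

            // Detected content type plus extracted metadata and text, ready for Solr / the SQL db.
            System.out.println("type: " + metadata.get("Content-Type"));
            System.out.println("text: " + text.toString());
        }
    }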

+2

Another project worth looking at is Lily, http://www.lilyproject.org/lily/index.html , which has already done the work of integrating Solr with a distributed database.

Also, I do not understand why you would not want to use a browser for this application. What you are describing is exactly faceted search. While you certainly can build a desktop application that talks to the server (parses JSON) and displays the results in a thick-client GUI, all of that has already been done for you in the browser. And Solr comes with faceted search out of the box: just follow the tutorial.
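
If you do stay with a thick client, SolrJ keeps the glue code small. A sketch of a faceted query with highlighting (so each hit can report which keywords matched); the core URL and field names are hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetSearch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
                SolrQuery query = new SolrQuery("body_txt:report");
                query.addFilterQuery("date:[2011-01-01T00:00:00Z TO 2011-12-31T23:59:59Z]"); // the panel's date range
                query.setFacet(true);
                query.addFacetField("sender", "doc_type"); // facet counts for the metadata fields
                query.setHighlight(true).addHighlightField("body_txt"); // "why is this document a hit"

                QueryResponse rsp = solr.query(query);
                for (FacetField ff : rsp.getFacetFields()) {
                    System.out.println(ff.getName() + ": " + ff.getValues());
                }
                System.out.println(rsp.getHighlighting()); // matched-keyword snippets per document
            }
        }
    }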

+1

Going with Solr ( http://lucene.apache.org/solr ) is a good solution, but be prepared to deal with some non-obvious things. First, plan your indexes properly: multiple terabytes of data will almost certainly require multiple shards in Solr for any reasonable level of performance, and you will be managing those yourself. Solr does provide distributed search (querying across multiple shards), but that is only half the battle.
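
To illustrate what "managing it yourself" means: with classic (pre-SolrCloud) distributed search, every query has to list the shards explicitly. A sketch with hypothetical hosts:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ShardedQuery {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://host1:8983/solr/docs").build()) {
                SolrQuery q = new SolrQuery("body_txt:report");
                // Manual shard management: every query must name every shard it should fan out to.
                q.set("shards", "host1:8983/solr/docs,host2:8983/solr/docs");
                System.out.println(solr.query(q).getResults().getNumFound());
            }
        }
    }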

ElasticSearch ( http://www.elasticsearch.org/ ) is another popular alternative, but I do not have much experience scaling it. It uses the same Lucene engine underneath, so I would expect search functionality to be similar.
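
For comparison, a minimal full-text query against ElasticSearch's REST _search endpoint (the "docs" index and "body" field are made up; Java's built-in HTTP client is used just to keep the sketch self-contained):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class EsQuery {
        public static void main(String[] args) throws Exception {
            // A match query against a hypothetical "docs" index, via the REST _search endpoint.
            String body = "{ \"query\": { \"match\": { \"body\": \"quarterly report\" } } }";
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/docs/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> rsp = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(rsp.body()); // JSON hits; same Lucene scoring underneath
        }
    }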

A different kind of solution is something like SenseiDB - open sourced by LinkedIn - which provides full-text search (also built on Lucene) along with proven scalability to large amounts of data:

http://senseidb.com

They have definitely done a lot of work on search there, and my casual use of it has been quite promising.

Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull it into a schema-consistent format that SenseiDB can consume. SenseiDB already provides a Hadoop MR indexer that you can look at.

The only caveat is that it is a little harder to set up, but you will save yourself the scaling trouble many times over, especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you, which is still in alpha for Solr (Solr 4.x is alpha at the moment).

Hope this helps and good luck!

Update:

I asked a friend who is more familiar with ElasticSearch than I am, and it does have the advantage of clustering and rebalancing based on the number of machines and shards you have. That is a definite win over Solr, especially if you are dealing with TBs of data. The only drawback is that the current state of the ElasticSearch documentation is poor.

0

Source: https://habr.com/ru/post/920822/

