Hadoop: search for words from one file in another file

I want to create a Hadoop application that reads words from one file and searches for them in another file.

If a word exists, it must be written to one output file; if it does not exist, it must be written to another output file.

I have tried a few Hadoop examples, and I have two questions:

The two files are approximately 200 MB each. Checking every word of one file against the other might run out of memory. Is there an alternative way to do this?

How do I write data to different files? The output of the Hadoop reduce phase goes to a single file. Is it possible to filter in the reduce phase so that data is written to different output files?

Thanks.

+2
3 answers

Here is how I would do it:

  • in the map, split the value into words and emit (<word>, <source>) (*1)
  • in the reduce you receive (<word>, <list of sources>)
  • check the source list (it may contain both/all sources)
  • if NOT all sources are listed, emit (<missingsource>, <word>) for each missing source
  • job 2: call job.setNumReduceTasks(<numberofsources>)
  • job 2: in the map, emit (<missingsource>, <word>)
  • job 2: in the reduce, for each <missingsource>, emit all (null, <word>)

You will end up with as many reduce outputs as there are <missingsource> values, each containing the missing words of one document. You could write out the <missingsource> ONCE at the beginning of the reduce to mark the files.
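A minimal sketch of what the job-1 reduce step could look like, assuming exactly two known input paths (the ALL_SOURCES values and the class name are placeholders of mine, not from the answer):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Job-1 reducer: input (<word>, <list of sources>), output
// (<missingsource>, <word>) for every source the word is absent from.
public class MissingWordReducer extends Reducer<Text, Text, Text, Text> {
    // assumed: the full paths of the two input files
    private static final String[] ALL_SOURCES = {
        "hdfs:///input/a.txt", "hdfs:///input/b.txt" };

    private final Text outkey = new Text();

    @Override
    public void reduce(Text word, Iterable<Text> sources, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<String>();
        for (Text source : sources) {
            seen.add(source.toString());
        }
        // emit the word once for every source that did NOT contain it
        for (String source : ALL_SOURCES) {
            if (!seen.contains(source)) {
                outkey.set(source);
                context.write(outkey, word);
            }
        }
    }
}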

(*1) How to get the source of a value in the map (Hadoop 0.20 API):

private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...
public void setup(Context context) throws InterruptedException, IOException {
    super.setup(context);

    // remember which input file this mapper is reading from
    localname = ((FileSplit) context.getInputSplit()).getPath().toString();
}

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
...
    outkey.set(...);         // the word
    outvalue.set(localname); // the file it came from
    context.write(outkey, outvalue);
}
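The answer does not show job 2, but a rough sketch of how it could be wired up follows; the class names, the two-source count, and the path arguments are assumptions of mine:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MissingWordsJob2 {

    // Re-parse job 1's tab-separated output back into (<missingsource>, <word>).
    public static class Job2Mapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // All words of one <missingsource> arrive at the same reducer; write them out.
    public static class Job2Reducer extends Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text missingSource, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            for (Text word : words) {
                context.write(NullWritable.get(), word);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "missing-words");
        job.setJarByClass(MissingWordsJob2.class);
        job.setMapperClass(Job2Mapper.class);
        job.setReducerClass(Job2Reducer.class);
        job.setNumReduceTasks(2); // assumed: two sources -> two output files
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // output of job 1
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the default hash partitioner does not guarantee that two keys land on different reducers; a small custom Partitioner would make the source-to-reducer mapping explicit.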
+8

Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something better suited to a Lucene-based application than to Hadoop.

If you have to use Hadoop, I have a few suggestions:

  • Your "documents" need to be in a format that MapReduce can deal with. The easiest would be a CSV/text-based file with one word of the document per line. Having PDF etc. will not work.

  • To take a set of words as input to your MapReduce job and compare it against the data the job processes, you could use the Distributed Cache to let each mapper build a set of the words you want to find in the input (a sketch follows this list). However, if your list of words is large (you mention 200 MB), I doubt this would work. This method is, however, one of the main ways to do a join in MapReduce.
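A minimal sketch of that Distributed Cache approach, assuming the word list fits in memory (the class name and file handling are mine, not from the answer):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads a (small) word list from the Distributed Cache in setup() and
// emits only those input words that appear in the list.
public class WordFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> wordsToFind = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // files registered on the driver via DistributedCache.addCacheFile(uri, conf)
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) return;
        for (Path p : cached) {
            BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                wordsToFind.add(line.trim());
            }
            reader.close();
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (wordsToFind.contains(token)) {
                context.write(new Text(token), NullWritable.get());
            }
        }
    }
}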

The indexing method mentioned in the other answer also offers possibilities. Again, though, indexing documents makes me think of Lucene rather than Hadoop. If you do use that method, make sure the key contains a document identifier as well as the word, so that you have word counts per document.

I don't think I have ever produced multiple output files from a MapReduce job. You would need to write some (very simple) code to process the indexed output into multiple files.
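For that last point, the post-processing outside MapReduce could be as simple as this sketch, which splits a tab-separated (label, word) output file into one local file per label (the file naming is assumed):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Split one "label<TAB>word" file into a separate file per label,
// e.g. found.txt and missing.txt.
public class SplitByLabel {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> writers = new HashMap<String, PrintWriter>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length != 2) continue;
            PrintWriter out = writers.get(parts[0]);
            if (out == null) {
                out = new PrintWriter(new FileWriter(parts[0] + ".txt"));
                writers.put(parts[0], out);
            }
            out.println(parts[1]);
        }
        in.close();
        for (PrintWriter out : writers.values()) out.close();
    }
}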

+1

You want to do this in two stages, in my opinion. Run the wordcount program (included in the Hadoop examples jar) on the two input files; this gives you two outputs, each containing a unique list (with counts) of the words in one document. From there, rather than using Hadoop, do a simple comparison of the two files, which should answer your question.
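A sketch of what that second, non-Hadoop stage could look like, assuming the usual wordcount output format of one word<TAB>count pair per line (the class name and the FOUND/MISSING labels are mine):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Compare two wordcount outputs: for every word in the first file,
// report whether it also occurs in the second.
public class CompareWordCounts {
    private static Set<String> loadWords(String path) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            words.add(line.split("\t")[0]); // keep the word, drop the count
        }
        in.close();
        return words;
    }

    public static void main(String[] args) throws IOException {
        Set<String> first = loadWords(args[0]);
        Set<String> second = loadWords(args[1]);
        for (String word : first) {
            // two result streams: present in both vs. only in the first file
            System.out.println((second.contains(word) ? "FOUND\t" : "MISSING\t") + word);
        }
    }
}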

0

Source: https://habr.com/ru/post/1730135/

