Hadoop: search for words from one file in another file

I want to create a Hadoop application that reads words from one file and searches for them in another file.

If a word exists, it must be written to one output file; if it does not exist, it must be written to another output file.

I have tried a few Hadoop examples, and I have two questions:

The two files are approximately 200 MB each. Checking every word of one file against the other might run out of memory. Is there an alternative way to do this?

How do I write data to different files? The output of the Hadoop reduce phase goes to a single file. Is it possible to filter in the reduce phase so that data is written to different output files?

Thanks.

+2
3 answers

Here is how I would do it:

  • in the map, split the value into words and emit (<word>, <source>) (*1)
  • in the reduce you receive (<word>, <list of sources>)
  • check the source list (it may contain both/all sources)
  • if NOT all sources are listed, emit (<missingsource>, <word>) for each missing source
  • job 2: call job.setNumReduceTasks(<numberofsources>)
  • job 2: in the map, emit (<missingsource>, <word>)
  • job 2: in the reduce, for each <missingsource>, emit all (null, <word>)

You will end up with as many reduce outputs as there are <missingsource> values, each containing the missing words of one document. You could write out the <missingsource> ONCE at the beginning of the reduce to mark the files.
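A minimal sketch of what the job-1 reduce step could look like, assuming exactly two known input paths (the ALL_SOURCES values and the class name are placeholders of mine, not from the answer):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Job-1 reducer: input (<word>, <list of sources>), output
// (<missingsource>, <word>) for every source the word is absent from.
public class MissingWordReducer extends Reducer<Text, Text, Text, Text> {
    // assumed: the full paths of the two input files
    private static final String[] ALL_SOURCES = {
        "hdfs:///input/a.txt", "hdfs:///input/b.txt" };

    private final Text outkey = new Text();

    @Override
    public void reduce(Text word, Iterable<Text> sources, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<String>();
        for (Text source : sources) {
            seen.add(source.toString());
        }
        // emit the word once for every source that did NOT contain it
        for (String source : ALL_SOURCES) {
            if (!seen.contains(source)) {
                outkey.set(source);
                context.write(outkey, word);
            }
        }
    }
}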

(*1) How to get the source of a value in the map (Hadoop 0.20 API):

private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...
public void setup(Context context) throws InterruptedException, IOException {
    super.setup(context);

    // remember which input file this mapper is reading from
    localname = ((FileSplit) context.getInputSplit()).getPath().toString();
}

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
...
    outkey.set(...);         // the word
    outvalue.set(localname); // the file it came from
    context.write(outkey, outvalue);
}
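The answer does not show job 2, but a rough sketch of how it could be wired up follows; the class names, the two-source count, and the path arguments are assumptions of mine:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MissingWordsJob2 {

    // Re-parse job 1's tab-separated output back into (<missingsource>, <word>).
    public static class Job2Mapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // All words of one <missingsource> arrive at the same reducer; write them out.
    public static class Job2Reducer extends Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text missingSource, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            for (Text word : words) {
                context.write(NullWritable.get(), word);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "missing-words");
        job.setJarByClass(MissingWordsJob2.class);
        job.setMapperClass(Job2Mapper.class);
        job.setReducerClass(Job2Reducer.class);
        job.setNumReduceTasks(2); // assumed: two sources -> two output files
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // output of job 1
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the default hash partitioner does not guarantee that two keys land on different reducers; a small custom Partitioner would make the source-to-reducer mapping explicit.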
+8

Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something better suited to a Lucene-based application than to Hadoop.

If you have to use Hadoop, I have a few suggestions:

  • Your "documents" need to be in a format that MapReduce can deal with. The easiest would be a CSV/text-based file with one word of the document per line. Having PDF etc. will not work.

  • To take a set of words as input to your MapReduce job and compare it against the data the job processes, you could use the Distributed Cache to let each mapper build a set of the words you want to find in the input (a sketch follows this list). However, if your list of words is large (you mention 200 MB), I doubt this would work. This method is, however, one of the main ways to do a join in MapReduce.
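A minimal sketch of that Distributed Cache approach, assuming the word list fits in memory (the class name and file handling are mine, not from the answer):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads a (small) word list from the Distributed Cache in setup() and
// emits only those input words that appear in the list.
public class WordFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> wordsToFind = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // files registered on the driver via DistributedCache.addCacheFile(uri, conf)
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) return;
        for (Path p : cached) {
            BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                wordsToFind.add(line.trim());
            }
            reader.close();
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (wordsToFind.contains(token)) {
                context.write(new Text(token), NullWritable.get());
            }
        }
    }
}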

The indexing method mentioned in the other answer also offers possibilities. Again, though, indexing documents makes me think of Lucene rather than Hadoop. If you do use that method, make sure the key contains a document identifier as well as the word, so that you have word counts per document.

I don't think I have ever produced multiple output files from a MapReduce job. You would need to write some (very simple) code to process the indexed output into multiple files.
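For that last point, the post-processing outside MapReduce could be as simple as this sketch, which splits a tab-separated (label, word) output file into one local file per label (the file naming is assumed):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Split one "label<TAB>word" file into a separate file per label,
// e.g. found.txt and missing.txt.
public class SplitByLabel {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> writers = new HashMap<String, PrintWriter>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length != 2) continue;
            PrintWriter out = writers.get(parts[0]);
            if (out == null) {
                out = new PrintWriter(new FileWriter(parts[0] + ".txt"));
                writers.put(parts[0], out);
            }
            out.println(parts[1]);
        }
        in.close();
        for (PrintWriter out : writers.values()) out.close();
    }
}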

+1

You want to do this in two stages, in my opinion. Run the wordcount program (included in the Hadoop examples jar) on the two input files; this gives you two outputs, each containing a unique list (with counts) of the words in one document. From there, rather than using Hadoop, do a simple comparison of the two files, which should answer your question.
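A sketch of what that second, non-Hadoop stage could look like, assuming the usual wordcount output format of one word<TAB>count pair per line (the class name and the FOUND/MISSING labels are mine):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Compare two wordcount outputs: for every word in the first file,
// report whether it also occurs in the second.
public class CompareWordCounts {
    private static Set<String> loadWords(String path) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            words.add(line.split("\t")[0]); // keep the word, drop the count
        }
        in.close();
        return words;
    }

    public static void main(String[] args) throws IOException {
        Set<String> first = loadWords(args[0]);
        Set<String> second = loadWords(args[1]);
        for (String word : first) {
            // two result streams: present in both vs. only in the first file
            System.out.println((second.contains(word) ? "FOUND\t" : "MISSING\t") + word);
        }
    }
}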

0

Source: https://habr.com/ru/post/1730135/

