Hadoop: How can I give each value a globally unique identification number as a key in Mapper?

Here is what I want to do. I have text files like this:

<page> <url>xxx.example.com</url> <title>xxx</title> <content>abcdef</content> </page> <page> <url>yyy.example.com</url> <title>yyy</title> <content>abcdef</content> </page> ... 

I want to read the file splits in the mapper and convert them to key-value pairs, where each value is the content of a single <page> tag.

My problem is with the key. I could use the URLs as keys because they are globally unique. However, due to the context of my work, I want to create a globally unique number as the key for each key-value pair. I know this somewhat goes against the horizontal scalability of Hadoop. But is there any solution?

2 answers

If you are going to process such files using MapReduce, I would take the following strategy:

  • Use the usual text input format, line by line. This causes each different file to go to a different mapper task.
  • In the mapper, build a loop that reads the following lines through context.nextKeyValue() instead of relying on a separate call for each line.
  • Feed the lines to some parser (maybe you just need to read 6 non-empty lines, maybe you will use something like libxml); in the end you will produce a number of objects.
  • If you intend to pass the objects that you read to the reducer, you need to wrap them in something that implements the Writable interface.
  • To generate keys, I would use Java's java.util.UUID implementation. Something like:

    UUID key = UUID.randomUUID();

    This is sufficient as long as you do not generate billions of records per second and your job does not run for 100 years :-)

  • Just note that the UUID should probably be encoded in the ImmutableBytesWritable class, which is useful for such things.

  • That is all: call context.write(key, object), with the key as the first argument.
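
The parsing and key-generation steps above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under assumptions: the class name PageKeySketch and its methods are hypothetical, and the code assumes that every record ends with a line that finishes in </page>. It groups lines into page records, gives each one a random UUID key, and encodes the key as the 16 bytes that a byte-oriented Writable would carry.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper, not part of Hadoop: illustrates UUID keys for <page> records.
final class PageKeySketch {

    // Encode a UUID as 16 big-endian bytes (most significant half first).
    static byte[] toBytes(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.array();
    }

    // Decode the 16 bytes produced by toBytes back into the same UUID.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    // Group non-empty lines into records and assign each record a random UUID.
    // Assumption: a record is complete when a line ends with "</page>".
    static Map<UUID, String> keyPages(List<String> lines) {
        Map<UUID, String> pages = new LinkedHashMap<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank separator lines
            }
            current.append(trimmed);
            if (trimmed.endsWith("</page>")) {
                pages.put(UUID.randomUUID(), current.toString());
                current.setLength(0);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<page>", "<url>xxx.example.com</url>", "<title>xxx</title>",
                "<content>abcdef</content>", "</page>");
        for (Map.Entry<UUID, String> e : keyPages(lines).entrySet()) {
            // Print the key, the length of its byte encoding, and the record.
            System.out.println(e.getKey() + " (" + toBytes(e.getKey()).length
                    + " bytes) -> " + e.getValue());
        }
    }
}
```

In a real mapper the loop would pull lines from the context rather than from a list, and the byte array would be wrapped in whatever Writable the job uses for its keys.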

OK, your reducer (if any) and output format are another story. You will definitely need an output format to store your objects if you do not convert them to something like Text during the map phase.


I am not sure this directly answers your question, but I would take advantage of the input file format.

You can use NLineInputFormat and set N = 6, since each record includes 6 lines:

 <page> <url>xxx.example.com</url> <title>xxx</title> <content>abcdef</content> </page>

With each record, the mapper will receive the offset position in the file. This offset will be unique for each record.

PS: It will only work if the schema is fixed. I doubt that it will work correctly for multiple text input files.
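
To make the offset idea concrete, here is a minimal plain-Java sketch that computes the byte offset at which each N-line record starts. It does not use Hadoop. The class name RecordOffsetSketch is hypothetical, and the sample data assumes that the sixth line of each record is a blank separator line, which the question does not confirm.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: byte offsets of fixed-size line records.
final class RecordOffsetSketch {

    // Return the byte offset of the first line of every group of linesPerRecord lines.
    static List<Long> recordOffsets(byte[] data, int linesPerRecord) {
        List<Long> offsets = new ArrayList<>();
        int lineIndex = 0;
        boolean atLineStart = true;
        for (int i = 0; i < data.length; i++) {
            if (atLineStart) {
                if (lineIndex % linesPerRecord == 0) {
                    offsets.add((long) i); // this line opens a new record
                }
                lineIndex++;
                atLineStart = false;
            }
            if (data[i] == '\n') {
                atLineStart = true; // the next byte, if any, starts a new line
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Assumption: five tag lines plus one blank line make up a 6-line record.
        String record = "<page>\n<url>xxx.example.com</url>\n<title>xxx</title>\n"
                + "<content>abcdef</content>\n</page>\n\n";
        byte[] file = (record + record).getBytes(StandardCharsets.UTF_8);
        System.out.println(recordOffsets(file, 6));
    }
}
```

Offsets restart at 0 in every file, so an offset alone identifies a record only within a single file. This is consistent with the caveat above about multiple input files.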


Source: https://habr.com/ru/post/1482800/

