Here is what I want to do. I have text files that look like this:
<page>
  <url>xxx.example.com</url>
  <title>xxx</title>
  <content>abcdef</content>
</page>
<page>
  <url>yyy.example.com</url>
  <title>yyy</title>
  <content>abcdef</content>
</page>
...
I want to read the file splits in the mapper and convert them into key-value pairs, where each value is the content of a single <page> tag.
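To make the goal concrete, here is a rough sketch of the record extraction in plain Java, without any Hadoop classes. The class name `PageRecords` and the method `parse` are made up for illustration, and the sketch assumes the whole text is available as one string, so it does not deal with a <page> element that straddles a split boundary. It uses the URL as the key for now.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns the raw text into (url, page body) pairs.
final class PageRecords {
    // DOTALL lets "." match newlines, so multi-line pages are handled too.
    private static final Pattern PAGE = Pattern.compile("<page>(.*?)</page>", Pattern.DOTALL);
    private static final Pattern URL = Pattern.compile("<url>(.*?)</url>", Pattern.DOTALL);

    static List<Map.Entry<String, String>> parse(String text) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        Matcher page = PAGE.matcher(text);
        while (page.find()) {
            // Value: everything between <page> and </page>, trimmed.
            String body = page.group(1).trim();
            // Key: the URL inside the page, or an empty string if none is found.
            Matcher url = URL.matcher(body);
            String key = url.find() ? url.group(1).trim() : "";
            records.add(new SimpleEntry<>(key, body));
        }
        return records;
    }
}
```

Calling `PageRecords.parse(...)` on the sample above should give two pairs, one keyed by xxx.example.com and one by yyy.example.com.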
My problem is with the key. I could use the URLs as keys because they are globally unique. However, because of the context of my work, I want to generate a globally unique number as the key for each key-value pair. I know this somewhat goes against the horizontal scalability of Hadoop, but is there any solution?
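One idea I have considered, though I am not sure it is the right approach, is to avoid any shared counter and instead combine a per-task number with a local counter. Below is a minimal sketch in plain Java. The class `UniqueIdGenerator` and the 23/40 bit split are my own assumptions for illustration. In a real mapper the task number would presumably come from the task attempt ID in the mapper context; here it is simply passed to the constructor.

```java
// Hypothetical sketch: IDs are unique as long as each task has a distinct taskId.
final class UniqueIdGenerator {
    // Assumed layout: high 23 bits hold the task number, low 40 bits a local counter.
    private static final int COUNTER_BITS = 40;
    private static final long MAX_COUNTER = (1L << COUNTER_BITS) - 1;
    private static final int MAX_TASKS = 1 << 23;

    private final long taskId;
    private long counter = 0;

    UniqueIdGenerator(int taskId) {
        if (taskId < 0 || taskId >= MAX_TASKS) {
            throw new IllegalArgumentException("taskId out of range: " + taskId);
        }
        this.taskId = taskId;
    }

    // Returns the next ID for this task; no coordination with other tasks is needed.
    long next() {
        if (counter > MAX_COUNTER) {
            throw new IllegalStateException("local counter exhausted");
        }
        return (taskId << COUNTER_BITS) | counter++;
    }
}
```

Each mapper would create one generator and call `next()` once per <page>. The resulting numbers are unique but not consecutive across tasks, which may or may not be acceptable.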