I have several million words that I want to look up in a corpus of a billion words. What would be an efficient way to do this?
I'm thinking of using a trie, but is there an open-source trie implementation available?
thanks
- Updated -
Let me add a few more details about what exactly is required.
We have a system that scans news sources and extracts popular words based on their frequency, maybe a million words in total.
Our data looks something like this (tab-separated):

Word1   Frequency1
Word2   Frequency2
We also receive the most popular words (about 1 billion) from another source, which provides data in the same format.
Here is what I would like to get as output:
- Words common to both sources
- Words that appear only in our source, not in the other source
- Words that appear only in the other source, not in ours
I can get the above information with the comm (bash) command, but only when the files contain just the words. I do not know how to make comm compare on a single column rather than on the whole line (both columns).
The system must be scalable, and we would like to run it every day and compare the results. I would also like to support approximate matching.
So I am thinking of writing a MapReduce job. I plan to write the map and reduce functions as outlined below, but I have a few questions.
Map:
    For each word in the input line:
        output key = word, value = struct { filename, frequency }

Reduce:
    For each key (word):
        Iterate through all the values and check whether both file1 and file2 appear.
        If the word is in both, write it to the "common" output file.
        If it is only in file1, write it to the "file1only" output file.
        If it is only in file2, write it to the "file2only" output file.
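To make the plan concrete, here is a rough sketch of the mapper and reducer in Java (new Hadoop API). The file names "ourwords.txt" and "theirwords.txt" are placeholders, reading the current file name from the input split in setup() is only my assumption (that is essentially question 1 below), and for now the reducer just tags each word instead of splitting the output into three files (that is question 2 below).

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordCompare {

    public static class CompareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String fileName;

        @Override
        protected void setup(Context context) {
            // Assumption: the name of the file currently being read can be
            // taken from the input split (this is what question 1 asks about).
            fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");   // word <TAB> frequency
            if (fields.length < 2) return;                    // skip malformed lines
            // key = word, value = source file name + frequency
            context.write(new Text(fields[0]), new Text(fileName + "\t" + fields[1]));
        }
    }

    public static class CompareReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> sources = new HashSet<>();
            for (Text value : values) {
                sources.add(value.toString().split("\t")[0]); // file the word came from
            }
            boolean inOurs   = sources.contains("ourwords.txt");   // placeholder name
            boolean inTheirs = sources.contains("theirwords.txt"); // placeholder name
            String tag = (inOurs && inTheirs) ? "common"
                       : inOurs ? "file1only" : "file2only";
            // For now everything goes to the default part-xxxxx output;
            // writing to three separate files is question 2 below.
            context.write(word, new Text(tag));
        }
    }
}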
I have two questions.

1. In MapReduce I can pass the directory containing my two files as the input path, but I do not know how to find out, inside the map function, which file the current word was read from. How do I get this information?
2. The reduce phase automatically writes only to the default files named part-xxxxx. How do I write to different output files instead?
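Independently of those two questions, this is roughly how I expect to set up and run the job each day; the directory holding both frequency files is passed as a single input path, and all class names and paths here are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCompareDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-compare");
        job.setJarByClass(WordCompareDriver.class);
        job.setMapperClass(WordCompare.CompareMapper.class);
        job.setReducerClass(WordCompare.CompareReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // args[0]: directory containing both word/frequency files
        // args[1]: output directory for this day's run
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}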
Thanks for reading this.