Effective corpus search

I have several million words that I want to look up in a billion-word corpus. What would be an effective way to do this?

I'm thinking of a trie, but is there an open-source trie implementation available?

thanks

- Updated -

Let me add a few more details about what exactly is required.

We have a system that scans news sources and extracts popular words based on word frequency, maybe a million words in total.

Our data looks something like this, one word per line, tab-separated:

 Word1 <TAB> Frequency1
 Word2 <TAB> Frequency2

We also received the most popular words (about 1 billion) from another source, which contains data in the same format.

Here is what I would like to get as output:

  • Words common to both sources
  • Words that appear only in our source, but not in the other source
  • Words that appear only in the other source, but not in our source

I could use the comm (bash) command for this if I were comparing only the words. I do not know how to make comm compare on just one column rather than the whole line.

The system must be scalable; we would like to run it every day and compare the results. I would also like to support approximate matching.
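(By approximate matching I mean fuzzy matching of similar words, roughly the kind of thing Python's difflib.get_close_matches does; the vocabulary below is just a made-up example:)

 # Example of the kind of approximate matching I mean (Python standard library).
 import difflib

 vocabulary = ["recession", "recessions", "procession", "obsession"]
 # Returns the vocabulary entries most similar to the misspelled query word.
 print(difflib.get_close_matches("recesion", vocabulary, n=3, cutoff=0.6))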

So I am thinking of writing a MapReduce job. I plan to write the map and reduce functions as shown below, but I have a few questions.

 Map:
     For each word:
         output key = word and value = structure{ filename, frequency }
     Done

 Reduce:
     For each key:
         Iterate through all the values and check whether both file1 and file2 are present.
         If yes, write it to the appropriate (common) file.
         If only in file1, write it to the file1only file.
         If only in file2, write it to the file2only file.
     Done

I have two questions:

  • In MapReduce I can pass the directory containing my two files as the input, but I do not know how to find out, inside the mapper, which file the current word was read from. How do I get this information?
  • How do I write to different output files? The reduce phase automatically writes only to the default files named part-xxxxx. How can I split the results across several output files?
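For reference, here is roughly what I plan to write, sketched as a Hadoop Streaming job in Python. It is only a sketch: I am assuming the name of the current input file is exposed to the mapper as the mapreduce_map_input_file environment variable (map_input_file on older Hadoop versions); that assumption is exactly what my first question is about.

 #!/usr/bin/env python
 # mapper.py - tag each word with the file it was read from (sketch only).
 import os
 import sys

 # Assumption: Hadoop Streaming exposes the current split's file name as an
 # environment variable; the property name differs between Hadoop versions.
 path = os.environ.get("mapreduce_map_input_file",
                       os.environ.get("map_input_file", "unknown"))
 source = os.path.basename(path)

 for line in sys.stdin:
     fields = line.rstrip("\n").split("\t")
     if len(fields) != 2:
         continue
     word, freq = fields
     # key = word, value = source file name and frequency
     print("%s\t%s:%s" % (word, source, freq))

 #!/usr/bin/env python
 # reducer.py - for each word, decide which input file(s) it occurred in.
 # Relies on Hadoop Streaming delivering the mapper output sorted by key.
 import sys

 def flush(word, sources):
     if word is None:
         return
     # Prefix each output line with a category so a later step can split the
     # single part-xxxxx output into common / file1-only / file2-only files.
     if len(sources) > 1:
         print("common\t%s" % word)
     else:
         print("only-%s\t%s" % (next(iter(sources)), word))

 current, sources = None, set()
 for line in sys.stdin:
     word, value = line.rstrip("\n").split("\t", 1)
     if word != current:
         flush(current, sources)
         current, sources = word, set()
     sources.add(value.split(":", 1)[0])
 flush(current, sources)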

Thanks for reading this.

+4
9 answers

With MapReduce, you should not try to do everything in a single step or job. It looks like you should split this problem into several stages. Since you are generating data stored on HDFS and you need to know which source each record came from, you should probably go with a format something like:

{SOURCE}, {WORD}, {FREQUENCY}

Remember that you are talking about a distributed file system, so referring to your inputs as file1 and file2 is not technically correct. Both your reference data and the source data will be distributed throughout the cluster, with pieces of each located on every node.

Next, starting from your pseudo-code example, you will need to create a job that correlates each word with its source and frequency. Your mapper will work just fine, but the reducer will need to link the words to their sources. You will need to create your own Writable object that contains a Map<source, frequency>. This will be written to HDFS as intermediate data that your follow-on filter jobs can work with.

You can then use the output of this step as the input to three different MapReduce jobs, each looking for a different combination of sources. These jobs will be very simple, since the mapper just passes the data through while the reducer checks each value for the relevant combination of sources.

So if you take this approach, you will need four MapReduce jobs. You do not need to run each of them by hand; you can have a single driver that runs them in sequence. Alternatively, since the last three jobs use the same input, you could start those three at the same time once the first one has completed. Which option is best will probably depend on the amount of data and intermediate data your cluster can manage, and on the number of mappers/reducers each job requires.
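To make the first (join) stage concrete, here is a rough Hadoop Streaming reducer sketched in Python. Treat it only as an illustration: it assumes the mapper emits tab-separated word, source, frequency records, and it encodes the source -> frequency map as a plain string where the Java version would use a custom Writable.

 #!/usr/bin/env python
 # join_reducer.py - collapse (word, source, frequency) records into one line
 # per word that carries a source -> frequency map (illustrative sketch).
 import sys

 def flush(word, freqs):
     if word is None:
         return
     # Encode the map as "source=freq,source=freq"; the three follow-up jobs
     # only need to check which sources are present for each word.
     encoded = ",".join("%s=%s" % (s, f) for s, f in sorted(freqs.items()))
     print("%s\t%s" % (word, encoded))

 current, freqs = None, {}
 for line in sys.stdin:
     # Mapper output is assumed to be: word <TAB> source <TAB> frequency
     word, source, freq = line.rstrip("\n").split("\t")
     if word != current:
         flush(current, freqs)
         current, freqs = word, {}
     freqs[source] = freq
 flush(current, freqs)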

Hope this suggestion helps.

+2

This looks like the kind of job the Aho-Corasick string search algorithm was designed for. I have never coded it myself, but a little Googling should turn up some code.

Rabin-Karp may also work, but I have no idea how it behaves with multiple patterns of different lengths. Note: the multi-pattern pseudocode in the Wikipedia article seems to be incorrect, but it should give you a starting point.
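For example, the pyahocorasick package wraps this algorithm for Python; a minimal sketch (the file names are placeholders, and the API calls are quoted from memory, so double-check against the package docs):

 # Sketch: match a large set of query words against a corpus with Aho-Corasick.
 import ahocorasick

 automaton = ahocorasick.Automaton()
 with open("query-words.txt") as f:           # placeholder: word<TAB>frequency
     for line in f:
         word = line.split("\t")[0].strip()
         if word:
             automaton.add_word(word, word)
 automaton.make_automaton()

 with open("corpus.txt") as f:                # placeholder corpus file
     text = f.read()

 # iter() yields (end_index, value) for every occurrence found in the text.
 # Note: this matches substrings, so whole-word hits need an extra check.
 found = set(value for _end, value in automaton.iter(text))
 print("%d of the query words occur in the corpus" % len(found))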

+1

In the spirit of quick and dirty:

 fgrep --mmap -f query-file corpus-file 
+1

If I were doing this in Java, I would use a HashMap. Wikipedia suggests that a trie is slightly better in some cases, but I'm not sure you would see a big difference.

0

The data structure used in text search engines is called an inverted index. And, as has been said, a very good open source search engine is Lucene.
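As a toy illustration of the idea (not of Lucene itself), an inverted index can be as simple as a map from each word to the set of sources it appears in; the file names here are placeholders:

 # Minimal inverted index: word -> set of sources containing that word.
 from collections import defaultdict

 index = defaultdict(set)
 for source in ("ours.txt", "theirs.txt"):    # placeholder file names
     with open(source) as f:
         for line in f:
             word = line.rstrip("\n").split("\t")[0]
             if word:
                 index[word].add(source)

 # Words present in both sources:
 common = [w for w, sources in index.items() if len(sources) == 2]
 print("%d words appear in both sources" % len(common))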

0

I'm not sure about its performance, but Python's NLTK was designed for exactly this kind of thing: tokenizing large text corpora and letting you make comparisons between them. The book Natural Language Processing with Python uses this toolkit and has many examples; it is available online for free.
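A minimal sketch of the kind of thing NLTK does (the file name is a placeholder; word_tokenize and FreqDist are NLTK's standard tokenizer and frequency-distribution helpers):

 # Sketch: tokenize a text with NLTK and count word frequencies.
 import nltk

 nltk.download("punkt")                        # tokenizer data, first run only
 text = open("news_article.txt").read()        # placeholder file name
 tokens = nltk.word_tokenize(text)
 freq = nltk.FreqDist(token.lower() for token in tokens)
 print(freq.most_common(20))                   # the 20 most frequent tokens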

0

A tokenizer.c compiled to a.out can tokenize the corpus; then pipe its output through a shell pipeline like the one below for efficient processing:

 ./a.out < /live/memory/var/cache/man/whatis | sort | awk '{print $1}' | uniq -c | sort -rn > file.txt
0

A desktop PC can do this. The smaller dataset fits in memory, and that's all you need.

In Python:

 # Load the words from the small file into one big hash set.
 small_set = set(line.split()[0] for line in open("small.txt", "r"))

 # Open 3 output files.
 f1 = open("common.txt", "w")
 f2 = open("large_only.txt", "w")
 f3 = open("small_only.txt", "w")

 # Find all words in the large set that aren't in the small set.
 for line in open("large.txt", "r"):
     word = line.split()[0]
     if word in small_set:
         f1.write(line)          # word is in both sets
         small_set.remove(word)
     else:
         f2.write(line)          # word is in large but not small

 # Everything left over in small_set wasn't in the large set.
 for word in small_set:
     f3.write(word + "\n")

A cluster could do it faster, but you can try this at home.

0

Since you can use comm, I assume your input files are sorted.

Here's a comm-like program that compares only the first column but whose output contains the entire input line. It only works on sorted input!

This is a complete program. All you have to do is put this in a text file and run it from the command line.

 #!/usr/bin/env python
 #
 # comm.py - Compare 2 sorted files line by line, based on the first column.
 # Usage: python compare.py FILE1 FILE2 OUTFILE1 OUTFILE2 OUTFILE12
 # OUTFILE1 receives all entries that are only in FILE1, etc.

 import sys

 def compare(f1, f2, out1, out2, out12):
     def get(f):
         line = f.readline()
         if line == '':
             return None
         first, rest = line.rstrip('\n').split('\t', 1)
         return first, rest, line

     e1 = get(f1)
     e2 = get(f2)
     while e1 and e2:
         if e1[0] == e2[0]:
             # common entry
             out12.write(e1[0] + "\t" + e1[1] + "\t" + e2[1] + "\n")
             e1 = get(f1)
             e2 = get(f2)
         elif e1[0] < e2[0]:
             # e1 is not in f2
             out1.write(e1[2])
             e1 = get(f1)
         else:
             # e2 is not in f1
             out2.write(e2[2])
             e2 = get(f2)
     if e1:
         buf = e1[2]
         while buf:
             out1.write(buf)
             buf = f1.read(8192)
     if e2:
         buf = e2[2]
         while buf:
             out2.write(buf)
             buf = f2.read(8192)

 compare(open(sys.argv[1], "r"), open(sys.argv[2], "r"),
         open(sys.argv[3], "w"), open(sys.argv[4], "w"),
         open(sys.argv[5], "w"))
0

Source: https://habr.com/ru/post/1299276/

