Finding the n-gram that is most frequent among all words

I ran into the following programming interview problem:

Task 1: N-grams

An n-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:

 • write a function that finds the n-gram that is the most frequent one among all the words
 • print the result to the standard output (stdout)
 • if there are multiple n-grams having the same maximum frequency, please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

 • text: a string containing words separated by whitespace
 • ngramLength: an integer value giving the length of the n-gram

Data limitations

 • the length of the text string will not exceed 250,000 characters
 • all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Performance limitations

 • your function is expected to print the result in less than 2 seconds

Example input:

    text: "aaaab a0a baaab c"
    ngramLength: 3

Output: aaa

Explanation

For the input presented above, the 3-grams sorted by frequency are:

 β€’ "aaa" with a frequency of 3 β€’ "aab" with a frequency of 2 β€’ "a0a" with a frequency of 1 β€’ "baa" with a frequency of 1 

If I have only one hour to solve the problem and I decide to use C: is it worth using a hash table to count n-gram frequencies in that amount of time, given that there is no hash table implementation in the C standard library?

If so, I was thinking of implementing a hash table with separate chaining using ordered linked lists. Such an implementation should reduce the time needed to solve the problem.

Would that be the fastest option?

Thanks!!!

+6
7 answers

If implementation efficiency is important and you are using C, I would initialize an array of pointers to the start of each n-gram in the string, use qsort to sort the pointers according to the n-gram they point to, and then loop over the sorted array and count occurrences.

This should be fast enough, and there is no need to code up any fancy data structures.
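
A rough, illustrative C sketch of that approach (not the answerer's code): ngramLength is hard-coded to 3 here, and an n-gram is counted only when all N characters are alphanumeric, i.e. it lies entirely inside a word.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N 3   /* ngramLength, fixed for the sketch */

    /* order two n-gram start pointers by the N characters they point to */
    static int cmp_ngram(const void *a, const void *b)
    {
        const char *x = *(const char * const *)a;
        const char *y = *(const char * const *)b;
        return strncmp(x, y, N);
    }

    static void most_frequent_ngram(const char *text)
    {
        size_t len = strlen(text);
        const char **starts = malloc(len * sizeof *starts);
        size_t count = 0;

        /* collect a pointer to every position that starts an in-word n-gram */
        for (size_t i = 0; i + N <= len; i++) {
            int ok = 1;
            for (int j = 0; j < N; j++)
                if (!isalnum((unsigned char)text[i + j])) { ok = 0; break; }
            if (ok)
                starts[count++] = text + i;
        }

        qsort(starts, count, sizeof *starts, cmp_ngram);

        /* walk the sorted pointers and remember the longest run; because the
           array is sorted, the first maximal run is also the lexicographically
           smallest n-gram, which matches the tie-breaking rule */
        size_t best = 0, run = 0;
        const char *best_ptr = NULL;
        for (size_t i = 0; i < count; i++) {
            if (i > 0 && strncmp(starts[i], starts[i - 1], N) == 0)
                run++;
            else
                run = 1;
            if (run > best) { best = run; best_ptr = starts[i]; }
        }

        if (best_ptr)
            printf("%.*s\n", N, best_ptr);
        free(starts);
    }

    int main(void)
    {
        most_frequent_ngram("aaaab a0a baaab c");   /* prints "aaa" */
        return 0;
    }

On the OP's example this prints aaa, and sorting at most 250,000 pointers should fit comfortably within the 2-second budget.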

+5

Sorry to submit Python, but this is what I would do; you may get some ideas for the algorithm from it. Note that the program handles an input an order of magnitude larger than required.

    from itertools import groupby

    someText = "thibbbs is a test and aaa it may haaave some abbba reptetitions "
    someText *= 40000
    print len(someText)

    n = 3
    ngrams = []
    for word in filter(lambda x: len(x) >= n, someText.split(" ")):
        for i in range(len(word) - n + 1):
            ngrams.append(word[i:i+n])
            # you could inline all logic here
            # add to an ordered list for which the frequency is the key
            # for ordering and the payload the actual word

    ngrams_freq = list([[len(list(group)), key]
                        for key, group in groupby(sorted(ngrams, key=str.lower))])
    ngrams_freq_sorted = sorted(ngrams_freq, reverse=True)

    popular_ngrams = []
    for freq in ngrams_freq_sorted:
        if freq[0] == ngrams_freq_sorted[0][0]:
            popular_ngrams.append(freq[1])
        else:
            break

    print "Most popular ngram: " + sorted(popular_ngrams, key=str.lower)[0]

    # > 2560000
    # > Most popular ngram: aaa
    # > [Finished in 1.3s]
+1

So, the main recipe for this problem will be:

  • Find all n-grams in the string
  • Map all duplicate entries into a new structure that holds each n-gram and the number of times it occurs

You can find my C++ solution here: http://ideone.com/MNFSis

Given:

    const unsigned int   MAX_STR_LEN = 250000;
    const unsigned short NGRAM       = 3;
    const unsigned int   NGRAMS      = MAX_STR_LEN - NGRAM;
    // we will need at most "the length of our string" - "the length of our n-gram"
    // places to store our n-grams, and each n-gram needs NGRAM+1 chars for the trailing '\0'
    char ngrams[NGRAMS][NGRAM + 1] = { 0 };

Then for the first step, this is the code:

    const char *ptr = str;
    int idx = 0;

    // notTerminated checks ptr[0] to ptr[NGRAM-1] are not '\0'
    while (notTerminated(ptr)) {
        // noSpace checks ptr[0] to ptr[NGRAM-1] are isalpha()
        if (noSpace(ptr)) {
            // safely copy our current n-gram over to the ngrams array;
            // we're iterating over ptr and, because we're here, we know ptr and the
            // next NGRAM positions are valid letters
            for (int i = 0; i < NGRAM; i++) {
                ngrams[idx][i] = ptr[i];
            }
            ngrams[idx][NGRAM] = '\0'; // important to zero-terminate
            idx++;
        }
        ptr++;
    }

At this point we have a list of all n-grams. Now find the most popular one:

    FreqNode head = { "HEAD", 0, 0, 0 }; // the start of our list

    for (int i = 0; i < NGRAMS; i++) {
        if (ngrams[i][0] == '\0') break;

        // insertFreqNode takes a start node; this is where we begin the search for duplicates.
        // The simplest description is:
        //   1. search from head down each child; if we find a node whose text equals
        //      ngrams[i], update its frequency count
        //   2. if the frequency is >= the current winner's, place this node as head.next
        //   3. after the program is complete, the most popular nodes will be the first nodes
        //      (I have not implemented sorting of these - it's an exercise for the reader ;))
        insertFreqNode(&head, ngrams[i]);
    }

    // as the list is ordered, head.next will always be the most popular n-gram
    cout << "Winner is: " << head.next->str << " with " << head.next->freq << " occurrences" << endl;
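
The full definitions of FreqNode and insertFreqNode live at the ideone link and are not reproduced above; judging by the { "HEAD", 0, 0, 0 } initializer, the real node has four members. Purely as an illustration, a minimal three-member node and a plain unordered insert/count routine might look like the C sketch below; it omits the "promote the current winner to head.next" step described in the comments, so it is an assumption, not the linked implementation.

    #include <stdlib.h>
    #include <string.h>

    typedef struct FreqNode {
        const char      *str;   /* the n-gram text             */
        int              freq;  /* how many times it was seen  */
        struct FreqNode *next;  /* next node in the list       */
    } FreqNode;

    /* walk the list from head; bump the counter of a duplicate, or append a new node */
    void insertFreqNode(FreqNode *head, const char *ngram)
    {
        FreqNode *node = head;
        while (node->next != NULL) {
            if (strcmp(node->next->str, ngram) == 0) {
                node->next->freq++;
                return;
            }
            node = node->next;
        }
        FreqNode *fresh = malloc(sizeof *fresh);
        fresh->str  = ngram;
        fresh->freq = 1;
        fresh->next = NULL;
        node->next  = fresh;
    }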

Good luck to you!

+1

Just for fun, I wrote the SQL version (SQL Server 2012):

    if object_id('dbo.MaxNgram','IF') is not null
        drop function dbo.MaxNgram;
    go

    create function dbo.MaxNgram(
         @text   varchar(max)
        ,@length int
    )
    returns table with schemabinding as
    return
        with
        Delimiter(c) as ( select ' ' ),
        E1(N) as (
            select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) T(N)
        ),
        E2(N) as ( select 1 from E1 a cross join E1 b ),
        E6(N) as ( select 1 from E2 a cross join E2 b cross join E2 c ),
        tally(N) as (
            select top(isnull(datalength(@text),0))
                   ROW_NUMBER() over (order by (select NULL))
            from E6
        ),
        cteStart(N1) as (
            select 1
            union all
            select t.N + 1
            from tally t
            cross join delimiter
            where substring(@text, t.N, 1) = delimiter.c
        ),
        cteLen(N1,L1) as (
            select s.N1,
                   isnull(nullif(charindex(delimiter.c, @text, s.N1), 0) - s.N1, 8000)
            from cteStart s
            cross join delimiter
        ),
        cteWords as (
            select ItemNumber = row_number() over (order by l.N1),
                   Item       = substring(@text, l.N1, l.L1)
            from cteLen l
        ),
        mask(N) as (
            select top(@length) row_Number() over (order by (select NULL))
            from E6
        ),
        topItem as (
            select top 1
                   substring(Item, m.N, @length) as Ngram,
                   count(*)                      as Length
            from cteWords w
            cross join mask m
            where m.N <= datalength(w.Item) + 1 - @length
              and @length <= datalength(w.Item)
            group by substring(Item, m.N, @length)
            order by 2 desc, 1
        )
        select d.s
        from (
            select top 1 NGram, Length
            from topItem
        ) t
        cross apply (values (cast(NGram as varchar)), (cast(Length as varchar))) d(s)
    ;
    go

which, when called with the sample input provided by the OP:

    set nocount on;
    select s as [ ]
    from MaxNgram(
        'aaaab a0a baaab c aab'
       ,3
    );
    go

gives the desired result:

    ------------------------------
    aaa
    3
+1

You can convert each trigram to a RADIX50 code. See http://en.wikipedia.org/wiki/DEC_Radix-50

In radix50, the output value for a trigram fits into a 16-bit unsigned value.

You can then use the radix-encoded trigram as an index into an array.

So your code will look like this:

    uint16_t counters[1 << 16]; // 64K counters

    bzero(counters, sizeof(counters));

    for (const char *p = txt; p[2] != 0; p++)
        counters[radix50(p)]++;

After that, just find the maximum value in the array and decode its index back into a trigram.
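
The mapping below is only a sketch of the idea, not the linked implementation: it folds the problem's alphabet into at most 40 symbols (case-folded letters, digits, and an "other" slot), since 40^3 = 64,000 codes fit into 16 bits, whereas a case-sensitive 62-symbol alphabet would not; upper and lower case are therefore collapsed here.

    #include <ctype.h>
    #include <stdint.h>

    /* map one character into 0..39 (0 = "other", 1..10 = digits, 11..36 = letters) */
    static unsigned radix50_char(char c)
    {
        if (isdigit((unsigned char)c)) return 1 + (c - '0');
        if (isalpha((unsigned char)c)) return 11 + (tolower((unsigned char)c) - 'a');
        return 0;
    }

    /* pack a trigram into a 16-bit code: base-40 positional encoding */
    static uint16_t radix50(const char *p)
    {
        return (uint16_t)(radix50_char(p[0]) * 1600 +
                          radix50_char(p[1]) * 40 +
                          radix50_char(p[2]));
    }

    /* decode the winning index back into its trigram */
    static void radix50_decode(uint16_t code, char out[4])
    {
        static const char alphabet[] = " 0123456789abcdefghijklmnopqrstuvwxyz???";
        out[0] = alphabet[(code / 1600) % 40];
        out[1] = alphabet[(code / 40)   % 40];
        out[2] = alphabet[ code         % 40];
        out[3] = '\0';
    }

One caveat: the uint16_t counters in the snippet above can themselves overflow if a single trigram occurs more than 65,535 times, which is possible in a 250,000-character input, so a wider counter type may be safer.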

I used this trick when I implemented the Wilber-Hovayko algorithm for fuzzy searches ~ 10 years ago.

You can download the source code here: http://itman.narod.ru/source/jwilbur1.tar.gz .

0

If you are not tied to C: I wrote this Python script in about 10 minutes; it processes a 1.5 MB file containing more than 265,000 words, looking for 3-grams, in 0.4 s (excluding the time to print the values to the screen).
The text used for the test is Ulysses by James Joyce; you can find it for free here: https://www.gutenberg.org/ebooks/4300

The word separators considered here are the space and the newline \n.

    import sys

    text = open(sys.argv[1], 'r').read()
    ngram_len = int(sys.argv[2])

    text = text.replace('\n', ' ')
    words = [word.lower() for word in text.split(' ')]

    ngrams = {}
    for word in words:
        word_len = len(word)
        if word_len < ngram_len:
            continue
        for i in range(0, (word_len - ngram_len) + 1):
            ngram = word[i:i+ngram_len]
            if ngram in ngrams:
                ngrams[ngram] += 1
            else:
                ngrams[ngram] = 1

    ngrams_by_freq = {}
    for key, val in ngrams.items():
        if val not in ngrams_by_freq:
            ngrams_by_freq[val] = [key]
        else:
            ngrams_by_freq[val].append(key)

    ngrams_by_freq = sorted(ngrams_by_freq.items())
    for key in ngrams_by_freq:
        print('{} with frequency of {}'.format(key[1:], key[0]))
0

You can solve this problem in O(nk) time, where n is the number of words and k is the average number of n-grams per word.

You are right in thinking that a hash table is a good solution to the problem.

However, since you have limited time to code a solution, I would suggest using open addressing instead of linked lists. The implementation can be simpler: if you hit a collision, you just probe further along the table.

Also, be sure to allocate enough memory for your hash table: about two times the expected number of n-grams should be fine. Since the expected number of n-grams is <= 250,000, a hash table of 500,000 should be more than sufficient.
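
A minimal sketch of that suggestion in C (illustrative names and sizes, assuming ngramLength of at most 7 and a power-of-two table of 2^19 slots, i.e. more than twice the 250,000 bound):

    #include <stdint.h>
    #include <string.h>

    #define TABLE_SIZE (1u << 19)          /* ~524k slots, power of two */

    struct slot {
        char     key[8];                   /* the n-gram, NUL-terminated */
        uint32_t count;                    /* 0 marks an empty slot      */
    };

    static struct slot table[TABLE_SIZE];

    /* FNV-1a hash over the n characters of the n-gram */
    static uint32_t hash_ngram(const char *s, int n)
    {
        uint32_t h = 2166136261u;
        for (int i = 0; i < n; i++) {
            h ^= (unsigned char)s[i];
            h *= 16777619u;
        }
        return h;
    }

    /* count one occurrence of the n-gram starting at s; returns its new count */
    static uint32_t bump(const char *s, int n)
    {
        uint32_t i = hash_ngram(s, n) & (TABLE_SIZE - 1);
        for (;;) {
            if (table[i].count == 0) {             /* empty slot: insert  */
                memcpy(table[i].key, s, n);
                table[i].key[n] = '\0';
                table[i].count = 1;
                return 1;
            }
            if (strncmp(table[i].key, s, n) == 0)  /* found: just count   */
                return ++table[i].count;
            i = (i + 1) & (TABLE_SIZE - 1);        /* collision: probe on */
        }
    }

The caller tracks the largest count returned and, on ties, keeps the lexicographically smaller key, which satisfies the tie-breaking rule.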

In terms of coding speed, the small input length (250,000) makes sorting and counting a valid option. The quickest to write is probably to create an array of pointers to each n-gram, sort the array with an appropriate comparator, and then walk it, keeping track of which n-gram appears the most.

0

Source: https://habr.com/ru/post/974783/

