How to find the number of occurrences of a string in a huge string, like a big book

Question

How to find the number of occurrences of a string in a huge string, like a big book

I recently asked this question during an interview session in C #:

How would you effectively find the number of occurrences of a word in a huge text, like a big book (Bible, dictionary, etc.).

I wonder what would be the most efficient data structure for storing the contents of a book. The dirtiest soul I could think of was to store it in a StringBuilder and find the count of substrings, but I'm sure there should be a much better way to do this.

And for a string with a sufficient size, there are several ways to do this using substrings, regular expressions, etc., but for the most complex string in the most efficient way.

Update: I am looking for the following:

Assuming there is a text file, let's say the Bible is 20 MB in size again, and I want to find the number of times the word “Jesus” appears in the text, except loading only 20 MB into a string or StringBuilder and using a substring or regular expression to find a match, is there any other data structure that can be used to store all the contents of the text. Actual searches can be done in several ways, and I'm looking for the most efficient “data structure” for temporary storage.

+4

performance string c #

dsharma Feb 09 '11 at 2:11

source share

4 answers

aquinas · Answer 1 · 2011-02-09T03:53:54+0000

Assuming you don't need substrings, but just complete words, I would use a hash table. It can be built in linear time, and the size is proportional to the number of different words. Dictionary<string,int> in particular. On my machine, it took about 450 ms to load the entire Bible into a hash table and find all the entries for the word "God."

Aryabhatta · Answer 2 · 2011-02-09T02:14:10+0000

Assuming you are doing a complete word match (can be done to work with prefix matches as well).

Create a Bible trie with quantity information.

If you need to request a word, go on trie, get an account.

If you need to perform subscript matches, you can try using the suffix tree (which is basically a trie, but you also include suffixes).

This assumes that the words are asking for change, the bible remains fixed ...

joshcartme · Answer 3 · 2011-02-09T02:18:13+0000

Wikipedia has an interesting article on finding strings: http://en.wikipedia.org/wiki/String_searching_algorithm and according to this article this algorithm is a kind of standard: http://en.wikipedia.org/wiki/Boyer%E2%80 % 93Moore_string_search_algorithm

deepee1 · Answer 4 · 2011-02-09T02:55:49+0000

Something that the size of the bible is not large enough to prevent caching of the entire string in memory, so with the assumption you can ... I used this method before, but it obviously would not have been lightning fast. Strictly speaking, from the point of view of efficiency from a computational point of view, this is not the fastest, but with coding speed and reasonable speed, I think it works until nanoseconds are counted.

  string text = "a set of text to search in. fast to implement."; string key = "to"; MessageBox.Show(text.Split(" ',.".ToCharArray()).Where(a => a == key).Count().ToString());

Edit: Does not solve the final version of the question and may misinterpret the original question. Hang up.

How to find the number of occurrences of a string in a huge string, like a big book

More articles: