What is the most efficient C# data structure for searching a pair of strings for substrings?

Question

I have a data structure consisting of pairs of values: the first is an integer, and the second is an alphanumeric string (which may begin with digits):

+--------+-----------------+
| Number | Name            |
+--------+-----------------+
| 15     | APPLES          |
| 16     | APPLE COMPUTER  |
| 17     | ORANGE          |
| 21     | TWENTY-1        |
| 291    | 156TH ELEMENT   |
+--------+-----------------+

The table will contain up to 100,000 rows.

I would like to provide a search function in which the user can search for either the number (as if it were a string) or fragments of the string. Ideally, the search will be "live" as the user types; after each keystroke (or perhaps after a short delay of ~250-500 ms), a new search will be run to find the most likely candidates. For example, searching for:

1 will return 15 APPLES, 16 APPLE COMPUTER, 17 ORANGE and 291 156TH ELEMENT
15 will narrow the search to 15 APPLES and 291 156TH ELEMENT
AP will return 15 APPLES and 16 APPLE COMPUTER
(ideally, but not required) ELEM will return 291 156TH ELEMENT

I was thinking about using two Dictionary<string, string> instances, since in the end the integers are compared as strings: one would be keyed by the integer part and the other by the string.

But searching by substring should not really go through a hash function, and it seems wasteful to use twice the memory I feel I should need.

Ultimately, the question is: is there a well-performing way to search two large lists of text for substrings at the same time?

Failing that, how about a SortedDictionary? It might improve performance, but it still would not solve the hashing problem.

I thought about building a regex on the fly, but I suspect it would perform terribly.

I'm new to C# (I come from the Java world), so I haven't looked at LINQ yet; is that the answer?

EDIT 18:21 EST: None of the strings in the "Name" field will be longer than 12-15 characters, if that affects your potential solution.

+6

substring dictionary c# .net search

Tenner Jan 24 '12 at 10:54

3 answers

If possible, I would not load all 100,000 records into memory. I would use either a database or Lucene.Net to index values. Then use the appropriate query syntax to efficiently find results.

+6

Phil Bolduc Jan 24 '12 at 23:06

Since you are searching for the beginnings of words, key-based collections will not work unless you store every possible prefix of each word, such as "a", "ap", "app", "appl", "apple".

My suggestion is to use System.Collections.Generic.List<T> in conjunction with binary search. You will need to provide your own IComparer<T>, one that also matches the beginning of words. You would use two data structures.

One List<KeyValuePair<string,int>> containing single words or a number as a key, and a number as a value.

One Dictionary<int,string> containing the entire name.

You would do the following:

Split each sentence (the entire name) into separate words.
Add each word to the list as the key of a KeyValuePair, with the number as the value.
Also add the number itself to the list, as both the key and the value of a KeyValuePair.
Once the list is full, sort it so that binary search can be used.

Search for the beginning of a word:

Search the list using BinarySearch in conjunction with your IComparer<T>.
The index returned by the search may not be the first one that matches, so step backwards through the list until you reach the first matching entry.
Using the number stored as the value in the list, look up the entire name in the dictionary with that number as the key.
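The steps above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: PrefixIndex and KeyComparer are hypothetical names, the keys are assumed to be upper-case as in the sample data, and the comparer here is a plain ordinal key comparer. Prefix matching is done by scanning forward from the position BinarySearch reports, which works because all keys sharing a prefix sit next to each other in an ordinally sorted list.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch; PrefixIndex and KeyComparer are hypothetical names.
public sealed class PrefixIndex
{
    // Orders entries by key only, using ordinal (code-unit) comparison.
    private sealed class KeyComparer : IComparer<KeyValuePair<string, int>>
    {
        public static readonly KeyComparer Instance = new KeyComparer();

        public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
        {
            return string.CompareOrdinal(x.Key, y.Key);
        }
    }

    // Word (or number as text) -> number.
    private readonly List<KeyValuePair<string, int>> _entries =
        new List<KeyValuePair<string, int>>();

    // Number -> entire name.
    private readonly Dictionary<int, string> _names = new Dictionary<int, string>();

    private bool _sorted = true;

    public void Add(int number, string name)
    {
        _names[number] = name;

        // The number itself, as text, is one key ...
        _entries.Add(new KeyValuePair<string, int>(number.ToString(), number));

        // ... and every word of the name is another.
        foreach (string word in name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            _entries.Add(new KeyValuePair<string, int>(word, number));

        _sorted = false;
    }

    public List<string> Search(string prefix)
    {
        if (!_sorted)
        {
            _entries.Sort(KeyComparer.Instance);
            _sorted = true;
        }

        var probe = new KeyValuePair<string, int>(prefix, 0);
        int i = _entries.BinarySearch(probe, KeyComparer.Instance);
        if (i < 0) i = ~i; // no exact key: start at the insertion point

        // An exact hit may land among duplicate keys, so step back to the first one.
        while (i > 0 && _entries[i - 1].Key.StartsWith(prefix, StringComparison.Ordinal)) i--;

        var results = new List<string>();
        var seen = new HashSet<int>(); // a row can match through several keys
        for (; i < _entries.Count && _entries[i].Key.StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            int number = _entries[i].Value;
            if (seen.Add(number)) results.Add(number + " " + _names[number]);
        }
        return results;
    }
}
```

Because every word of a name becomes its own key, a search for the start of a later word (such as ELEM in the question's sample) is matched as well. The cost is one list entry per word plus one per number, rather than one per row.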

+1

Olivier Jacot-Descombes Jan 24 '12 at 23:22

doblak · Accepted Answer · 2012-01-24T23:10:33+0000

I would consider using a Trie data structure.

How to do it? The leaves would represent your "row", but you would have "two paths" to each in-memory instance of a "row" (one for the number and one for the name).

You can then either sacrifice your condition:

 (ideally, but not required) ELEM will return 291 156TH ELEMENT.

Or provide even more paths to your row instances.
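One way the idea might look in code is sketched below. It is an illustration only: Row, RowTrie and the indexEachWord flag are hypothetical names, and storing children in a Dictionary<char, Node> is just one possible node layout. Each row is reachable through its number and its full name; the optional flag adds one extra path per word of the name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type for the (Number, Name) pairs in the question.
public sealed class Row
{
    public int Number { get; }
    public string Name { get; }

    public Row(int number, string name)
    {
        Number = number;
        Name = name;
    }
}

// Illustrative trie; all names here are assumptions, not part of the answer.
public sealed class RowTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<Row> Rows = new List<Row>(); // rows whose key ends at this node
    }

    private readonly Node _root = new Node();
    private readonly bool _indexEachWord;

    public RowTrie(bool indexEachWord)
    {
        _indexEachWord = indexEachWord;
    }

    public void Add(Row row)
    {
        // Two paths to the same Row instance: the number and the full name.
        Insert(row.Number.ToString(), row);
        Insert(row.Name, row);

        // Optional extra paths, one per word, so that later words can be found too.
        if (_indexEachWord)
            foreach (string word in row.Name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                Insert(word, row);
    }

    private void Insert(string key, Row row)
    {
        Node node = _root;
        foreach (char c in key)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        if (!node.Rows.Contains(row)) node.Rows.Add(row);
    }

    // Returns every row reachable below the node that the prefix leads to.
    public List<Row> Search(string prefix)
    {
        var results = new List<Row>();
        Node node = _root;
        foreach (char c in prefix)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next)) return results;
            node = next;
        }
        Collect(node, results, new HashSet<Row>());
        return results;
    }

    private static void Collect(Node node, List<Row> results, HashSet<Row> seen)
    {
        foreach (Row row in node.Rows)
            if (seen.Add(row)) results.Add(row); // a row may be reached via several paths
        foreach (Node child in node.Children.Values)
            Collect(child, results, seen);
    }
}
```

With indexEachWord set to false, only prefixes of the number or of the whole name match, which corresponds to giving up the ELEM case. Setting it to true trades extra nodes for word-level matches. The order of the returned rows is not specified in this sketch.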
