C# efficient substring search with many inputs

Question

Assuming I don't want to use external libraries or more than a dozen or so extra lines of code (that is, clear code, not code-golfed code), can I do better than string.Contains to process a set of input lines and a set of keywords to check for?

Obviously, one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms that do better than this under special circumstances, especially when working with multiple strings. But including such an algorithm in my code is likely to add length and complexity, so I would rather use some kind of shortcut based on a built-in function.

E.g. the input would be a set of lines, a set of positive keywords and a set of negative keywords. The output would be the subset of the first set (the input lines) in which each line contains at least 1 positive keyword and 0 negative keywords.

Oh, and please don't mention regular expressions as suggested solutions.

Perhaps my requirements are mutually exclusive (not a lot of additional code, no external libraries or regular expressions, better than String.Contains), but I thought I would ask.

Edit:

Many people are only offering silly improvements that won't beat an intelligently used call to Contains by much, if at all. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So, here is an example of a problem that needs to be solved. LBushkin's answer is an example of someone offering a solution that is probably asymptotically better than the standard Contains:

Suppose you have 10,000 positive keywords 5-15 characters long, 0 negative keywords (this seems to confuse people) and one 1,000,000-character string. Check whether the 1,000,000-character string contains at least 1 of the positive keywords.

I believe one solution is to build an FSA (finite-state automaton). Another is to split on spaces and use hashes.

Tags: string, c#, .net, .net-2.0

Asked by Brian, Jul 24 '09 at 18:16

7 answers

LBushkin · Answer 1 · 2009-07-24T18:32:21+0000

" " - , .

, , , , , - . ...

( "" - , ) - .

( ), . , , - - O (1) , , , .

:

, :

, , ( )
( )
"-", ,
" ", , ( )

C# 2.0 version:

. string[] List<string>, .

string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords, 
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster 
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

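       // if any negative keywords exist, skip to the next search string...
       // (a bare continue inside the inner foreach would only advance that loop)
       bool hasNegative = false;
       foreach( string negKeyword in negativeKeywords )
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
           {
               hasNegative = true;
               break;
           }
       if( hasNegative )
           continue; // skip to next search string...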

       // for each positive keyword, add to dictionary to keep track of it
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeywords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1; 
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
   return foundKeywordsArray;
}

And a LINQ version in C# 3.0:

. LINQ-y , Union() SelectMany(), LINQ, , .

public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
    {
        var foundKeywordsDict = new Dictionary<string, int>();
        foreach( var searchIn in searchStrings )
        {
            // tokenize the search string (Distinct() avoids a duplicate-key exception)...
            var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
            // skip if any negative keywords exist...
            if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
                continue;
            // merge found positive keywords into dictionary...
            // an example of where Enumerable.ForEach() would be nice...
            var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey);
            foreach (var keyword in found)
                foundKeywordsDict[keyword] = 1;
        }
        return foundKeywordsDict.Keys;
    }
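
Both versions above return the keywords that were found, while the question asks for the lines that match. A minimal sketch of that variant follows, assuming (as the code above does) that keywords are whole words separated by single spaces. The names KeywordFilter and FilterLines are hypothetical and not part of the original answer. It hashes the keyword sets once and probes them per token, so the work per line depends on the number of tokens rather than the number of keywords.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordFilter
{
    // Returns the lines that contain at least one positive keyword and
    // no negative keyword. Assumption: keywords are whole words and the
    // lines are delimited by single spaces.
    public static List<string> FilterLines( IEnumerable<string> lines,
                                            IEnumerable<string> positiveKeywords,
                                            IEnumerable<string> negativeKeywords )
    {
        var positive = new HashSet<string>( positiveKeywords );
        var negative = new HashSet<string>( negativeKeywords );
        var result = new List<string>();
        foreach( var line in lines )
        {
            string[] tokens = line.Split( ' ' );
            // reject the line as soon as one token is a negative keyword
            if( tokens.Any( t => negative.Contains( t ) ) )
                continue;
            // keep the line if at least one token is a positive keyword
            if( tokens.Any( t => positive.Contains( t ) ) )
                result.Add( line );
        }
        return result;
    }
}
```

This only works for whole-word matches; a keyword that appears inside a longer word is not found, which is the main trade-off against String.Contains.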

Reed Copsey · Answer 2 · 2009-07-24T18:27:59+0000

:

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

:

var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));

, , , - LINQ , ... , , , .

BlueMonkMN · Answer 3 · 2009-07-24T21:59:43+0000

, , :

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

If you only need a "yes/no" result, there is another HashSet method, "Overlaps", that is probably better suited to that:

    private static bool FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }

devuxer · Answer 4 · 2009-07-24T18:25:56+0000

, Split(), . , Split(), . , , Contains().

Nick · Answer 5 · 2009-07-24T18:37:06+0000

, . . , Contains() , , .

plinth · Answer 6 · 2009-07-24T19:19:50+0000

, - ( , ) . , n ( 10 5-15) . -, , . .

, - :

IList<Bucket> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
// use <= so a match that ends exactly at the end of the input is not skipped
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        if (i + b.Length > inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}
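
The loop above relies on BuildBuckets, ComputeHash, Bucket and MatchString, none of which are shown. A minimal self-contained sketch of the same bucket-by-length idea is given here, under the assumption that a HashSet<string> per keyword length can stand in for the explicit hash comparison. LengthBucketSearch and ContainsAny are hypothetical names, and the sketch handles positive keywords only.

```csharp
using System;
using System.Collections.Generic;

public static class LengthBucketSearch
{
    // Returns true if text contains any of the keywords as a substring.
    public static bool ContainsAny( string text, IEnumerable<string> keywords )
    {
        // group the keywords by length: one hash set per distinct length
        var buckets = new Dictionary<int, HashSet<string>>();
        foreach( string keyword in keywords )
        {
            if( keyword.Length == 0 )
                continue; // empty keywords are ignored in this sketch
            HashSet<string> bucket;
            if( !buckets.TryGetValue( keyword.Length, out bucket ) )
            {
                bucket = new HashSet<string>( StringComparer.Ordinal );
                buckets[keyword.Length] = bucket;
            }
            bucket.Add( keyword );
        }

        // at every position, test one window per distinct keyword length
        for( int i = 0; i < text.Length; i++ )
        {
            foreach( KeyValuePair<int, HashSet<string>> entry in buckets )
            {
                int length = entry.Key;
                if( i + length > text.Length )
                    continue; // window would run past the end of the text
                if( entry.Value.Contains( text.Substring( i, length ) ) )
                    return true;
            }
        }
        return false;
    }
}
```

With keywords of 5-15 characters there are at most 11 buckets, so the number of lookups per position does not grow with the number of keywords. Substring allocates a new string for every window; a rolling hash in the style of Rabin-Karp would avoid that at the cost of more code.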

Ray · Answer 7 · 2009-07-24T22:49:22+0000

Contains(), ( trie) / .

This should speed things up (O(n) vs O(nm), where n = length of the string and m = average word size), and the code is relatively small and simple.
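
A trie-based check along those lines could look like the minimal sketch below. KeywordTrie and MatchesAny are hypothetical names, and the details are assumptions rather than the answer's own code. This simple version restarts from the root at every start position, so its cost is roughly the text length times the longest keyword length, independent of how many keywords there are; adding failure links (the Aho-Corasick construction) would remove the restarts.

```csharp
using System;
using System.Collections.Generic;

public sealed class KeywordTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsKeywordEnd;
    }

    private readonly Node root = new Node();

    // Builds the trie once; each keyword adds one path of nodes.
    public KeywordTrie( IEnumerable<string> keywords )
    {
        foreach( string keyword in keywords )
        {
            if( keyword.Length == 0 )
                continue; // empty keywords are ignored in this sketch
            Node node = root;
            foreach( char c in keyword )
            {
                Node next;
                if( !node.Children.TryGetValue( c, out next ) )
                {
                    next = new Node();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.IsKeywordEnd = true;
        }
    }

    // Returns true if any keyword occurs in text as a substring.
    public bool MatchesAny( string text )
    {
        for( int start = 0; start < text.Length; start++ )
        {
            Node node = root;
            for( int i = start; i < text.Length; i++ )
            {
                if( !node.Children.TryGetValue( text[i], out node ) )
                    break; // no keyword continues with this character
                if( node.IsKeywordEnd )
                    return true;
            }
        }
        return false;
    }
}
```

Build the trie once from the keyword set and reuse it for every input string, for example `new KeywordTrie( positiveKeywords ).MatchesAny( inputString )`.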
