Efficient C# Substring Validation Method

I have a bunch of txt files containing 300k lines each. Each line has a URL, for example: http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718

In a string[] array, I have a list of websites:

amazon.com
google.com
ieee.org
...

I need to check whether each URL contains one of these websites and, if it does, increment a counter for that website.

I am currently using the Contains method, but it is very slow. There are ~900 entries in the array, so the worst case is 900 × 300K checks (for one file). I expect IndexOf would be just as slow.

Can someone suggest a faster approach? Thank you in advance.

+4
6 answers

One approach:

  • Hash each entry of your website list (the string[] array).
  • Store the hashes in a List<int> (hashes.Add("www.ieee.com".GetHashCode())).
  • Sort the list once (hashes.Sort()).
  • Then, for each URL:
    • Extract the host part (e.g. ieee.com from http://www.ieee.com/...). new Uri("http://www.ieee.com/...").Host returns www.ieee.com.
    • Normalize it, e.g. lowercase it (so http://www.IEee.COM/ becomes www.ieee.com).
    • Look its hash up in hashes with BinarySearch; a hit means a candidate match.
    • If found, increment the counter for that website.

You could also look at Bloom filters; there are C# implementations on CodePlex. A Bloom filter can report false positives (it may claim an item is present when it is not), but never false negatives, so any hit should be verified against the real list; in exchange, it rejects non-matches very quickly in a small, fixed amount of memory.
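To illustrate the idea only (this is not the CodePlex library; the bit-array size and the choice of two hash functions are arbitrary toy parameters):

```csharp
using System;
using System.Collections;

// Minimal Bloom filter: k = 2 hash functions over an m-bit array.
// Membership tests can yield false positives but never false negatives,
// so a "maybe" answer still needs verification against the real list.
var bits = new BitArray(1 << 16);

int H1(string s) => (s.GetHashCode() & 0x7fffffff) % bits.Length;
int H2(string s) => (StringComparer.OrdinalIgnoreCase.GetHashCode(s) & 0x7fffffff) % bits.Length;

void Add(string s) { bits[H1(s)] = true; bits[H2(s)] = true; }
bool MightContain(string s) => bits[H1(s)] && bits[H2(s)];

Add("ieee.org");
Add("google.com");

Console.WriteLine(MightContain("ieee.org"));    // True (guaranteed)
Console.WriteLine(MightContain("example.net")); // almost certainly False
```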


A Dictionary<TKey, TValue> would also work, and has the advantage of holding the per-site counter as the value in the same structure.

+3

Use a Dictionary, keyed by website, with the counter as the value.

Extract the host from each URL (for example with new Uri(url).Host), look it up in the Dictionary, and increment the value on a hit.


This works if you only ever need to match whole hosts. If you need to match arbitrary substrings anywhere in the URL, a trie (prefix tree) built over your patterns may be a better fit.
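A sketch of the Dictionary-based tally (the sample URLs and the www-stripping rule are my own assumptions for illustration):

```csharp
using System;
using System.Collections.Generic;

// One counter per known website; a case-insensitive comparer makes
// http://www.IEee.ORG/... and ieee.org compare equal.
var tally = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
{
    ["amazon.com"] = 0, ["google.com"] = 0, ["ieee.org"] = 0
};

string[] lines =
{
    "http://www.ieee.org/conferences_events/index.html?Conf_ID=30718",
    "https://www.google.com/search?q=bloom+filter",
    "http://unknown.example/page"
};

foreach (var line in lines)
{
    string host = new Uri(line).Host;          // e.g. "www.ieee.org"
    if (host.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
        host = host.Substring(4);              // -> "ieee.org"

    if (tally.TryGetValue(host, out var n))    // O(1) lookup, no 900-way scan
        tally[host] = n + 1;
}

Console.WriteLine(tally["ieee.org"]);   // 1
Console.WriteLine(tally["google.com"]); // 1
```

Each line now costs one hash lookup instead of 900 Contains scans.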

+1

Take a look at related questions on fast multi-substring search in C#.

0

If you want to stay with plain scanning, you can at least hand-roll the matching loop. Something like this:

int l = url.Length;
int position = 0;
while (position < l)
{
   if (url[position] == website[0])
   {
      // test the rest of the website from this position in another loop
      if (exactMatch(url, position, website))
         counter++;
   }
   position++;
}

In a quick test with a small list (about 10 sites) against a 1.2 MB file, this ran roughly 3 times faster, finishing in under 1 second.

0

You should profile, of course, but assuming the part you match is always the domain portion of the URL, you can extract it with a simple splitter, tally hits in a Dictionary<string, int>, and generate representative test data like this:

var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x => x.Split('/')[0];
foreach (var line in source)
{
    var domain = naiveDomainExtractor(line);
    if (tally.ContainsKey(domain)) tally[domain]++;
}

...which completes quickly even over 300,000 lines.

In truth, your domain extractor may need to be a little more complicated, but it probably won't be very expensive, and if you have multiple cores you can speed things up with ConcurrentDictionary<string, int> and Parallel.ForEach.
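The parallel variant might be sketched like this (same toy data generator as above; the naive splitter stands in for a real domain extractor):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Toy data: 300k fake URL-ish lines and up to 900 target domains.
var source = Enumerable.Range(0, 300000)
    .Select(x => Guid.NewGuid().ToString())
    .Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10))
    .ToList();

var targets = Enumerable.Range(0, 900)
    .Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com")
    .Distinct();

// Pre-seed a thread-safe dictionary with all counters at zero.
var tally = new ConcurrentDictionary<string, int>(
    targets.ToDictionary(x => x, x => 0));

Parallel.ForEach(source, line =>
{
    var domain = line.Split('/')[0];
    // Only count domains we are tracking; AddOrUpdate is atomic per key.
    if (tally.ContainsKey(domain))
        tally.AddOrUpdate(domain, 1, (_, n) => n + 1);
});

Console.WriteLine(tally.Values.Sum());
```

Pre-seeding the keys means the ContainsKey filter never races with key insertion; only the per-key increments contend.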

0

You will need to check the performance, but you can try converting the URLs to actual System.Uri objects.

Save the list of websites as a HashSet<string>, then use the HashSet to look up each Uri's Host:

IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
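Since the question asks for per-site counters rather than just the matching lines, the same idea extends to a GroupBy (a sketch with inline sample data in place of the file read):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var urls = new HashSet<string>(new[] { "amazon.com", "google.com", "stackoverflow.com" });

var inputUrls = new[]
{
    "http://google.com/a", "https://google.com/b", "http://amazon.com/item"
}.Select(e => new Uri(e));

// Keep only URLs whose host is in the set, then count hits per host.
var counts = inputUrls
    .Where(e => urls.Contains(e.Host))
    .GroupBy(e => e.Host)
    .ToDictionary(g => g.Key, g => g.Count());

Console.WriteLine(counts["google.com"]); // 2
Console.WriteLine(counts["amazon.com"]); // 1
```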
0

Source: https://habr.com/ru/post/1528790/

