Comparing 2 huge lists using C # several times (with a twist)

Hello everyone, great community that you are here. I am an electrical engineer doing some “programming” work on the side to help pay bills. I say this because I want you to consider that I do not have proper computer science training, but I have been encoded for the last 7 years.

I have several excel tables with information (all numeric), basically these are “dialed phone numbers” in one column and the number of minutes for each of these numbers on another. Separately, I have a list of “operator prefix code numbers” for different operators in my country. What I want to do is split all the "traffic" into the carrier. Here is the scenario:

Number of the first number dialed : 123456789ABCD, 100 <- This will be a 13-digit phone number and 100 minutes.

I have a list of 12,000+ prefix codes for carrier 1, these codes vary in length, and I need to check all of them:

Prefix code 1 : 1234567 <- this code has 7 digits.

I need to check the first 7 digits for the dialed number, compare it with the dialed number, if a match is found, I would add the number of minutes for the subtotal for later use. Please note that not all prefix codes are the same length, sometimes they are shorter or longer.

Most of this should be a piece of cake, and I could do it, but I'm always afraid of a huge amount of data; In some cases, lists of dialed numbers consist of 30,000 numbers, and the “carrier prefix code” contains about 13,000 lines, and I usually check 3 operators, which means that I have to make many “matches”.

Does anyone have an idea how to do this effectively with C #? Or any other language, to be kind honest. I need to do this quite often, and developing a tool for this makes sense. I need a good perspective for someone who has this "computer scientist."

Lists do not have to be in Excel sheets, I can export them to a csv file and work from there, I don’t need the MS Office interface.

Thank you for your help.

Update:

Thank you all for your time in answering my question. I think, in my ignorance, I exaggerate the word "effective." I do not complete this task every few seconds. This is what I have to do once a day, and I don’t like doing it with Excel and VLOOKUP etc.

I learned about new concepts from you guys and I hope that I can build solution (s) using your ideas.

+4
source share
7 answers

UPDATE

You can do a simple trick - group the prefixes with your first digits in the dictionary and match numbers only with the correct subset. I tested it with the following two LINQ operations, assuming that each prefix has at least three digis.

const Int32 minimumPrefixLength = 3; var groupedPefixes = prefixes .GroupBy(p => p.Substring(0, minimumPrefixLength)) .ToDictionary(g => g.Key, g => g); var numberPrefixes = numbers .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)] .First(n.StartsWith)) .ToList(); 

So how fast does this happen? 15,000 prefixes and 50,000 numbers took less than 250 milliseconds. fast enough for two lines of code?

Note that performance is highly dependent on the minimum prefix length (MPL), therefore, on the number of prefix groups that you can create.

  MPL Runtime
     -----------------
      1 10.198 ms
      2 1.179 ms
      3 205 ms
      4 130 ms
      5 107 ms

Just to give a rough idea - I did just one race and did a lot of things.

Original answer

I would not really like the performance - a regular desktop PC can easily cope with database tables with 100 million rows. It may take five minutes, but I suppose you do not want to complete this task every second.

I just did a test. I created a list with 15,000 unique prefixes with 5-10 digits. From these prefixes, I generated 50,000 numbers with a prefix and another 5 to 10 digits.

 List<String> prefixes = GeneratePrefixes(); List<String> numbers = GenerateNumbers(prefixes); 

Then I used the following LINQ to Object query to find the prefix of each number.

 var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList(); 

Well, it took about a minute on my 2.0 GHz Core 2 Duo laptop. Therefore, if a one-minute processing time is acceptable, perhaps two or three, if you turn on aggregation, I would not optimize anything. Of course, it would be nice if the program could complete the task in a second or two, but this will add a bit of complexity and a lot to make a mistake. And it takes time to develop, write, and test. The LINQ statement took only my seconds.

test application

Note that generating many prefixes is very slow and can take a minute or two.

 using System; using System.Collections.Generic; using System.Diagnostics; using System.Linq; using System.Text; namespace Test { static class Program { static void Main() { // Set number of prefixes and calls to not more than 50 to get results // printed to the console. Console.Write("Generating prefixes"); List<String> prefixes = Program.GeneratePrefixes(5, 10, 15); Console.WriteLine(); Console.Write("Generating calls"); List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50); Console.WriteLine(); Console.WriteLine("Processing started."); Stopwatch stopwatch = new Stopwatch(); const Int32 minimumPrefixLength = 5; stopwatch.Start(); var groupedPefixes = prefixes .GroupBy(p => p.Substring(0, minimumPrefixLength)) .ToDictionary(g => g.Key, g => g); var result = calls .GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)] .First(c.Number.StartsWith)) .Select(g => new Call(g.Key, g.Sum(i => i.Duration))) .ToList(); stopwatch.Stop(); Console.WriteLine("Processing finished."); Console.WriteLine(stopwatch.Elapsed); if ((prefixes.Count <= 50) && (calls.Count <= 50)) { Console.WriteLine("Prefixes"); foreach (String prefix in prefixes.OrderBy(p => p)) { Console.WriteLine(String.Format(" prefix={0}", prefix)); } Console.WriteLine("Calls"); foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration)) { Console.WriteLine(String.Format(" number={0} duration={1}", call.Number, call.Duration)); } Console.WriteLine("Result"); foreach (Call call in result.OrderBy(c => c.Number)) { Console.WriteLine(String.Format(" prefix={0} accumulated duration={1}", call.Number, call.Duration)); } } Console.ReadLine(); } private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count) { Random random = new Random(); List<String> prefixes = new List<String>(count); StringBuilder stringBuilder = new StringBuilder(maximumLength); while (prefixes.Count < count) { stringBuilder.Length = 0; for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++) { stringBuilder.Append(random.Next(10)); } String prefix = stringBuilder.ToString(); if (prefixes.Count % 1000 == 0) { Console.Write("."); } if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p))) { prefixes.Add(stringBuilder.ToString()); } } return prefixes; } private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count) { Random random = new Random(); List<Call> calls = new List<Call>(count); StringBuilder stringBuilder = new StringBuilder(); while (calls.Count < count) { stringBuilder.Length = 0; stringBuilder.Append(prefixes[random.Next(prefixes.Count)]); for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++) { stringBuilder.Append(random.Next(10)); } if (calls.Count % 1000 == 0) { Console.Write("."); } calls.Add(new Call(stringBuilder.ToString(), random.Next(1000))); } return calls; } private class Call { public Call (String number, Decimal duration) { this.Number = number; this.Duration = duration; } public String Number { get; private set; } public Decimal Duration { get; private set; } } } } 
+11
source

It seems to me that you need to create a trie from carrier prefixes. In the end, you will get the only three where the end nodes inform you of the carrier for this prefix.

Then create a dictionary from carrier to int or long (total).

Then, for each line of the dialed number, simply swipe your way down three, until you find the carrier. Find the total number of minutes for the medium and add the current line - then go.

+8
source

The simplest data structure that will do this quite effectively is a list of sets. Create a set for each carrier containing all the prefixes.

Now, to associate the call with the carrier:

 foreach (Carrier carrier in carriers) { bool found = false; for (int length = 1; length <= 7; length++) { int prefix = ExtractDigits(callNumber, length); if (carrier.Prefixes.Contains(prefix)) { carrier.Calls.Add(callNumber); found = true; break; } } if (found) break; } 

If you have 10 agents, there will be 70 requests in the set for each call. But the search in the set is not too slow (much faster than the linear search). Thus, this should give you a fairly high linear search speed of brute force.

You can go further and group the prefixes for each carrier according to length. Thus, if the operator has only prefixes of length 7 and 4, you would know that you only need to extract and search for these lengths, each time looking at a set of prefixes of this length.

+1
source

How to dump data into a couple of database tables and then query them using SQL? Easy!

 CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT ) CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) ) -- now populate the tables, create indexes etc -- and then just run your query... SELECT p.prefix, SUM(n.minutes) AS total_minutes FROM dbo.dialled_numbers AS n INNER JOIN dbo.prefixes AS p ON n.number LIKE p.prefix + '%' GROUP BY p.prefix 

(This was written for SQL Server, but it should be very easy to translate for any other DBMS.)

+1
source

It might be easier (not necessarily more efficient) to do this in the database instead of C #.

You can insert rows into the database, and insert to define the media and include it in the record (possibly in the insert trigger).

Then your report will be a request for the amount in the table.

0
source

I would just put entries in a list, sort it, and then use binary search to find matches. Define binary search matching criteria to return the first item that matches, then iterate over the list until you find the one that doesn't match. Binary search takes only about 15 comparisons to search a list of 30,000 items.

0
source

You can use HashTable in C #.

Thus, you have key-value pairs, and your keys can be phone numbers, and your values ​​can be total minutes. If a match is found in the key set, change the total minutes, otherwise add a new key.

Then you just need to change your search algorithm so as not to look at the whole key, but only at the first 7 digits.

0
source

Source: https://habr.com/ru/post/1286777/


All Articles