C# data type for large sorted collection with position?

Question

I am trying to compare two large datasets from an SQL query. Currently, the SQL query is executed externally, and the results of each dataset are saved in their own CSV file. My small C# console application loads the two text/CSV files, compares them, and saves the differences in a text file.

It is a very simple application: it loads all the data from the first file into an ArrayList and does a .compare() on the ArrayList as each line is read from the second CSV file. It then saves the entries that do not match.

The application works, but I would like to improve performance. I believe I can improve it significantly by taking advantage of the fact that both files are sorted, but I don't know of a data type in C# that keeps order and lets me select a specific position. There's the basic array, but I don't know how many items will be in each list. I could have over a million records. Is there a data type that I should look at?

comparison c# types sorted

Maxgeek Sep 16 '08 at 21:46

11 answers

David J. Sokol · Answer 1 · 2008-09-16T21:56:47+0000

If the data in both of your CSV files is already sorted and the files have the same number of records, you can skip the data structure entirely and compare the files line by line as you read them.

    // Verbatim strings (@"...") keep "\f" from being read as a form-feed escape.
    StreamReader one = new StreamReader(@"C:\file1.csv");
    StreamReader two = new StreamReader(@"C:\file2.csv");
    String lineOne;
    String lineTwo;

    StreamWriter differences = new StreamWriter("Output.csv");
    while (!one.EndOfStream)
    {
        // Both files are assumed to have the same number of lines.
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();

        // do your comparison (placeholder: replace with the real test).
        bool areDifferent = true;

        if (areDifferent)
            differences.WriteLine(lineOne + lineTwo);
    }

    one.Close();
    two.Close();
    differences.Close();

cranley · Answer 2 · 2008-09-16T21:54:19+0000

System.Collections.Specialized.StringCollection lets you add a range of values and, through the .IndexOf(string) method, get the index of a given element.

That said, you could probably just load a couple of byte[] buffers from a file stream and compare the bytes. Don't even worry about loading the data into a formal data structure such as StringCollection or string[]; if all you are doing is checking for differences and you want speed, I would bet that comparing bytes is the way to go.

Jonathan Rupp · Answer 3 · 2008-09-16T22:28:35+0000

This is an adaptation of David Sokol's code to work with a different number of lines, displaying lines that are in one file but not in another:

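    StreamReader one = new StreamReader(@"C:\file1.csv");
    StreamReader two = new StreamReader(@"C:\file2.csv");
    StreamWriter differences = new StreamWriter("Output.csv");

    String lineOne = one.ReadLine();
    String lineTwo = two.ReadLine();

    // ReadLine() returns null at end of file, so loop until both are exhausted.
    while (lineOne != null || lineTwo != null)
    {
        // C# strings have no < operator, so compare explicitly.
        // Ordinal order is assumed to match the sort order of the files.
        // A file that has run out is treated as sorting after everything.
        int cmp = lineOne == null ? 1
                : lineTwo == null ? -1
                : String.CompareOrdinal(lineOne, lineTwo);

        if (cmp == 0)
        {
            // lines match, read next line from each and continue
            lineOne = one.ReadLine();
            lineTwo = two.ReadLine();
        }
        else if (cmp < 0)
        {
            // lineOne exists only in the first file
            differences.WriteLine(lineOne);
            lineOne = one.ReadLine();
        }
        else
        {
            // lineTwo exists only in the second file
            differences.WriteLine(lineTwo);
            lineTwo = two.ReadLine();
        }
    }

    one.Close();
    two.Close();
    differences.Close();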

The standard disclaimer for code written off the top of my head applies: you may need a special case for when one file runs out of lines while the other still has some, but I think this basic approach should do what you are looking for.

Magickat · Answer 4 · 2008-09-16T21:50:39+0000

Well, there are several approaches that would work. You could write your own data structure that does this. Or you could try using a SortedList. You could also return the results to your code as DataSets and then use .Select() on the table. Of course, you would need to do this on both tables.

Sam · Answer 5 · 2008-09-16T21:50:50+0000

You can easily use a SortedList for fast lookups. If the data you are loading is already sorted, inserts into the SortedList should not be slow.
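As a rough illustration of the SortedList idea, here is a minimal sketch that also exposes the position of an entry, which is what the question asks for. The class and method names (SortedLookup, Build, PositionOf) are made up for this example, and ordinal ordering is an assumption about how the files are sorted.

```csharp
using System;
using System.Collections.Generic;

public static class SortedLookup
{
    // Builds a SortedList keyed by the line itself; the bool value is unused.
    // Ordinal comparison is an assumption about the files' sort order.
    public static SortedList<string, bool> Build(IEnumerable<string> lines)
    {
        var list = new SortedList<string, bool>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            // The indexer overwrites duplicates instead of throwing like Add().
            list[line] = true;
        }
        return list;
    }

    // Returns the zero-based position of a line, or -1 when it is absent.
    public static int PositionOf(SortedList<string, bool> list, string line)
    {
        return list.IndexOfKey(line);
    }
}
```

The entry at a known position can be read back through list.Keys[index]. Note that this still holds the whole first file in memory, so it may not suit files with millions of rows.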

Mitchel Sellers · Answer 6 · 2008-09-16T21:52:24+0000

If you just need to see whether all the lines in FileA are included in FileB, you can read the files in and compare the streams inside a loop.

File 1
entry1
entry2
Entry3

File 2
entry1
Entry3

You can run a loop with two counters and find the omissions by walking through each file in turn and checking whether you get the entry you expect.

Arno · Answer 7 · 2008-09-16T21:52:47+0000

Perhaps I misunderstand, but an ArrayList keeps its elements in the same order in which you added them. This means that you can compare two ArrayLists in a single pass - just advance the two scan indexes according to how the current elements compare.
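A minimal sketch of that single pass over two in-memory lists might look like the following. It uses the generic IList<string> rather than ArrayList, assumes both lists are sorted in ordinal order, and the names SortedDiff and Unmatched are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class SortedDiff
{
    // Returns the lines that appear in exactly one of the two lists.
    // Assumes both lists are already sorted in ordinal order.
    public static List<string> Unmatched(IList<string> first, IList<string> second)
    {
        var result = new List<string>();
        int i = 0, j = 0;

        while (i < first.Count && j < second.Count)
        {
            int cmp = string.CompareOrdinal(first[i], second[j]);
            if (cmp == 0)
            {
                // Same line in both lists: advance both indexes.
                i++;
                j++;
            }
            else if (cmp < 0)
            {
                // first[i] sorts earlier, so it cannot appear in second.
                result.Add(first[i++]);
            }
            else
            {
                // second[j] sorts earlier, so it cannot appear in first.
                result.Add(second[j++]);
            }
        }

        // Whatever remains in either list has no counterpart.
        while (i < first.Count) result.Add(first[i++]);
        while (j < second.Count) result.Add(second[j++]);
        return result;
    }
}
```

Each element is visited once, so the cost grows with the combined length of the two lists instead of with their product.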

Shane Courtrille · Answer 8 · 2008-09-16T21:53:22+0000

One question I have: have you considered outsourcing the comparison? There are many good diff tools that you could simply call. I would be surprised if there wasn't one that let you specify two files and get back only the differences. Just a thought.

skb · Answer 9 · 2008-09-16T22:13:31+0000

I think the reason everyone has so many different answers is that you have not specified your problem clearly enough for it to be answered. First of all, it depends on what kind of differences you want to track. Do you want the differences shown as in WinDiff, where the first file is the "original" and the second is the "modified" one, so that you can list the changes as INSERT, UPDATE or DELETE? Do you have a primary key that lets you match two lines as different versions of the same record (when fields other than the primary key differ)? Or is this some kind of reconciliation, where you just want the output to say something like "RECORD IN FILE 1 AND NOT IN FILE 2"?

I think that the answers to these questions will help everyone to give you a suitable answer to your problem.

Jason Jackson · Answer 10 · 2008-09-16T22:23:30+0000

If you have two files of a million lines each, as mentioned in your post, you could be using a lot of memory. Part of the performance problem may be that you are swapping to disk. If you are simply comparing line 1 of file A with line 1 of file B, line 2 of file A with line 2 of file B, and so on, I would recommend a technique that does not keep so much in memory. You could read directly from the two file streams, as a previous commenter posted, and write your results "in real time" as you find them. This does not explicitly store anything in memory. You could also read chunks of each file into memory, say a thousand lines at a time, into something like a list. This can be fine-tuned to suit your needs.

Shane Courtrille · Answer 11 · 2008-09-18T14:05:53+0000

To address question #1, I would recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.

To address question #2, one quick and dirty solution would be to use an IDictionary<string, string>, with the itemId as the first string type and the rest of the line as the second string type. Then you can quickly find out whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
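The dictionary idea could be sketched roughly as follows. The record layout is an assumption: each line is taken to start with the itemId, followed by a comma and the remaining fields. The names KeyedCompare, Index and Classify are hypothetical, and the split is naive (it does not handle quoted CSV fields).

```csharp
using System;
using System.Collections.Generic;

public static class KeyedCompare
{
    // Splits a line at the first comma into (itemId, rest of line).
    // Assumed layout: "itemId,field,field,..." with no quoting.
    private static KeyValuePair<string, string> Split(string line)
    {
        int comma = line.IndexOf(',');
        return comma < 0
            ? new KeyValuePair<string, string>(line, "")
            : new KeyValuePair<string, string>(
                line.Substring(0, comma), line.Substring(comma + 1));
    }

    // Indexes the lines of the first file by itemId.
    public static Dictionary<string, string> Index(IEnumerable<string> lines)
    {
        var byId = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            KeyValuePair<string, string> parts = Split(line);
            byId[parts.Key] = parts.Value; // a later duplicate id wins
        }
        return byId;
    }

    // Classifies a line from the second file against the index:
    // "missing" (id not found), "changed" (id found, rest differs), or "same".
    public static string Classify(Dictionary<string, string> byId, string line)
    {
        KeyValuePair<string, string> parts = Split(line);
        string rest;
        if (!byId.TryGetValue(parts.Key, out rest)) return "missing";
        return rest == parts.Value ? "same" : "changed";
    }
}
```

Unlike the sorted single-pass approaches, this does not depend on the files being sorted, but it keeps every record of the first file in memory.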
