How do the LINQ methods OrderByDescending and OrderBy work when ordering by string length? Are they faster than doing this with a loop?

My question is based on another question; I posted my answer to that question here.

This is the code.

    var lines = System.IO.File.ReadAllLines(@"C:\test.txt"); // ReadLines returns IEnumerable<string>, which cannot be indexed
    var Minimum = lines[0]; // default length set
    var Maximum = "";
    foreach (string line in lines)
    {
        if (Maximum.Length < line.Length)
        {
            Maximum = line;
        }
        if (Minimum.Length > line.Length)
        {
            Minimum = line;
        }
    }

And the alternative to this code with LINQ (my approach):

    var lines = System.IO.File.ReadLines(@"C:\test.txt");
    var Maximum = lines.OrderByDescending(a => a.Length).First();
    var Minimum = lines.OrderBy(a => a.Length).First();

LINQ is easy to read and implement.

I want to know which is better for performance, and how LINQ works internally in OrderByDescending and OrderBy when ordering by length.

+6
4 answers

You can read the source code for OrderBy .

Stop doing micro-optimization or premature optimization on your code. Try writing code that works correctly, and then if you encounter a performance problem, then profile the application and see where the problem is. If you have a piece of code that has performance issues due to finding the shortest and longest string, start optimizing this part.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth

File.ReadLines returns an IEnumerable<string> . This means that if you do a foreach over it, it will return the data one line at a time. I think the best performance improvement you can make here is to improve the reading of the file from disk. If it is small enough to load the entire file into memory, use File.ReadAllLines ; if it is not, try to read the file in large chunks that fit in memory. Reading line by line will result in poor performance due to disk I/O. So the problem here is not how LINQ or the loop executes; the problem is the number of disk reads.
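Incidentally, both extremes can still be found in one enumeration even with LINQ. A minimal sketch (the input array stands in for the file; in the real scenario it would come from File.ReadLines), using Aggregate to fold shortest and longest in a single pass:

```csharp
using System;
using System.Linq;

class SinglePass
{
    static void Main()
    {
        // Illustrative data; a stand-in for System.IO.File.ReadLines(@"C:\test.txt").
        var lines = new[] { "bb", "a", "dddd", "ccc" };

        // One enumeration: carry (shortest, longest) through the fold.
        var (min, max) = lines.Aggregate(
            (Min: lines[0], Max: lines[0]),
            (acc, line) => (
                line.Length < acc.Min.Length ? line : acc.Min,
                line.Length > acc.Max.Length ? line : acc.Max));

        Console.WriteLine($"{min} {max}"); // a dddd
    }
}
```

Like the foreach version, this never sorts and never buffers the whole sequence; it only keeps the two current candidates.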

+17

In my opinion, you need to understand some points in order to decide which is best.

First, suppose we want to solve the problem with LINQ. Then, to write the most optimized code, you must understand deferred execution. Most LINQ methods, such as Select , Where , OrderBy , Skip , Take , and some others, use deferred execution. So what is deferred execution? It means these methods do not execute until the user actually needs the results. They merely create an iterator, and that iterator is ready to execute when we need it. So how does the user execute them? The answer is with foreach (which calls GetEnumerator ) or with other LINQ methods. For example, ToList() , First() , FirstOrDefault() , Max() , and some others.
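A minimal sketch of deferred execution (the sample strings are made up for illustration): the Select delegate does not run when the query is built, only when something pulls from the iterator:

```csharp
using System;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        var words = new[] { "one", "two", "three" };

        // Building the query runs nothing: Select only creates an iterator.
        var query = words.Select(w =>
        {
            Console.WriteLine($"projecting {w}");
            return w.Length;
        });

        Console.WriteLine("query built, nothing projected yet");

        // Only now, when First() pulls from the iterator, does the delegate run,
        // and only for as many elements as First() needs (one).
        int firstLength = query.First();
        Console.WriteLine(firstLength); // 3
    }
}
```

Note that "projecting one" prints after "query built, nothing projected yet", and the other two words are never projected at all.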

This behavior can help us gain some performance.
Now back to your problem. File.ReadLines returns an IEnumerable<string> , which means it will not read lines until we need them. In your example, you called a sorting method twice on this object, which means it will sort this collection twice. Instead, you can sort the collection once and then call ToList() , which executes the OrderedEnumerable iterator; then take the first and last elements of the collection that we now physically hold in memory.

    var orderedList = lines
        .OrderBy(a => a.Length) // This method uses deferred execution, so it is not executed yet
        .ToList();              // But ToList() makes it execute
    var Maximum = orderedList.Last();
    var Minimum = orderedList.First();

By the way, you can find OrderBy source code here .

It returns an OrderedEnumerable instance and the sorting algorithm is here:

    public IEnumerator<TElement> GetEnumerator()
    {
        Buffer<TElement> buffer = new Buffer<TElement>(source);
        if (buffer.count > 0)
        {
            EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
            int[] map = sorter.Sort(buffer.items, buffer.count);
            sorter = null;
            for (int i = 0; i < buffer.count; i++)
                yield return buffer.items[map[i]];
        }
    }

And now back to another aspect that affects performance. As you can see, LINQ uses another buffer to store the sorted collection. Of course, this requires additional memory, which tells us that this is not the most memory-efficient way.

I just tried to explain to you how LINQ works. But I very much agree with @Dotctor's answer as your overall solution. Remember that you can use File.ReadAllLines , which returns not an IEnumerable<string> but a string[] . What does that mean? As I tried to explain at the beginning, the difference is that with an IEnumerable , .NET reads the lines one by one as the enumerator advances; but with a string[] , all the lines are already in our application's memory.

+8

In the second method, you not only sort the lines twice... you read the file twice. This is because File.ReadLines returns an IEnumerable<string> . It clearly shows why you should never enumerate an IEnumerable<> twice unless you know how it was created. If you really need to, add .ToList() or .ToArray() , which materialize the IEnumerable<> into a collection. And while the first method has a memory footprint of a single line of text (since it reads the file one line at a time), the second method loads the entire file into memory to sort it, so its memory footprint is much larger; if the file is several hundred MB, the difference will be large. (Note that technically you could have a file with a single line of text 1 GB long, so this rule is not absolute... it holds for large files with lines up to a few hundred characters long. :-))
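The double read can be made visible with a small sketch: an iterator method stands in for File.ReadLines (the data and the counter are made up for illustration), and each sorting query that enumerates it re-runs the iterator from the start:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DoubleEnumeration
{
    static int reads;

    // Stand-in for File.ReadLines: yields lazily, like a StreamReader would.
    static IEnumerable<string> FakeReadLines()
    {
        foreach (var line in new[] { "bb", "a", "ccc" })
        {
            reads++; // count how many "disk reads" happen
            yield return line;
        }
    }

    static void Main()
    {
        var lines = FakeReadLines();

        // Each OrderBy/First pair consumes the whole sequence again.
        var longest = lines.OrderByDescending(l => l.Length).First();
        var shortest = lines.OrderBy(l => l.Length).First();

        Console.WriteLine(reads); // 6: every line was produced twice
    }
}
```

With a real file, each of those six yields would be a line read from disk; a .ToList() after FakeReadLines() would cut the count to three.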

Now... someone will tell you that premature optimization is evil, but I will tell you that ignorance is twice as evil.

If you know the difference between the two blocks of code, then you can make an informed choice between them... Otherwise, you are just randomly throwing stones until it seems to work.

+8

The most efficient approach is to avoid LINQ here; the foreach approach requires only one enumeration.

If you want to put the whole file in a collection, you can use this:

    List<string> orderedLines = System.IO.File.ReadLines(@"C:\test.txt")
        .OrderBy(l => l.Length)
        .ToList();
    string shortest = orderedLines.First();
    string longest = orderedLines.Last();

In addition, you should read about deferred execution in LINQ.

Also note that your LINQ approach not only orders all lines twice to get the longest and shortest, it also needs to read the whole file twice, because File.ReadLines uses a StreamReader (unlike ReadAllLines , which reads all lines into an array first).

MSDN :

When you use ReadLines , you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines , you must wait for the whole array of strings to be returned before you can access the array.

In general, this can make your LINQ queries more efficient, e.g. if you filter the lines with Where , but in this case it makes things worse.
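As a sketch of where ReadLines does shine (the data and counter are made up): a streaming Where + First query stops pulling lines as soon as a match is found, so the rest of the file is never read:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class EarlyStop
{
    static int produced;

    // Stand-in for File.ReadLines on a large file.
    static IEnumerable<string> FakeReadLines()
    {
        foreach (var line in new[] { "a", "bbbb", "cc", "ddddd" })
        {
            produced++; // count lines actually pulled from the "file"
            yield return line;
        }
    }

    static void Main()
    {
        // First() short-circuits: lines after the first match are never read.
        var firstLong = FakeReadLines().Where(l => l.Length >= 4).First();

        Console.WriteLine($"{firstLong}, lines read: {produced}"); // bbbb, lines read: 2
    }
}
```

Sorting, by contrast, can never short-circuit: OrderBy must buffer every line before it can yield even the first one, which is exactly why it makes things worse here.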

As Jeppe Stig Nielsen mentions in a comment, since OrderBy needs to create another buffer collection internally (and a second one with ToList ), there is another approach that might be more efficient:

    string[] allLines = System.IO.File.ReadAllLines(@"C:\test.txt");
    Array.Sort(allLines, (x, y) => x.Length.CompareTo(y.Length));
    string shortest = allLines.First();
    string longest = allLines.Last();

The only drawback of Array.Sort is that it performs an unstable sort, unlike OrderBy . So if two lines have the same length, their original order may not be preserved.
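A small sketch of the stability difference (data made up for illustration). OrderBy is documented as stable, so equal-length strings keep their input order; Array.Sort gives no such guarantee for ties:

```csharp
using System;
using System.Linq;

class StabilityDemo
{
    static void Main()
    {
        var lines = new[] { "bb", "aa", "c" };

        // OrderBy is a stable sort: among equal lengths, "bb" stays before "aa".
        var stable = lines.OrderBy(l => l.Length).ToArray();
        Console.WriteLine(string.Join(" ", stable)); // c bb aa

        // Array.Sort is not guaranteed stable: for ties the relative order is
        // unspecified, so "aa" could end up before "bb". Only the unique
        // shortest element is certain to come first.
        var unstable = (string[])lines.Clone();
        Array.Sort(unstable, (x, y) => x.Length.CompareTo(y.Length));
        Console.WriteLine(unstable[0]); // c
    }
}
```

For this problem stability is irrelevant anyway: only the shortest and longest lines are kept, so which of several equal-length candidates wins usually does not matter.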

+7

Source: https://habr.com/ru/post/989713/

