Why does Enumerable <T> .ToArray () use an intermediate buffer when it can just call Count ()?
I read the question with the question Is it better to call ToList () or ToArray () in LINQ queries? and it turned out why Enumerable.ToArray() wouldn't first just call the Count() method to find the size of the collection, and not use the inner class Buffer{T} , which dynamically resizes. Something like the following:
T[] ToArray<T>(IEnumerable<T> source) { var count = source.Count(); var array = new T[count]; int index = 0; foreach (var item in source) array[index++] = item; return array; } I know that we cannot understand what is happening with the minds of designers and performers, and I am sure that they are much smarter than me. So the best way to ask this question is what is wrong with the above approach? It seems to have less memory and still works in O (n) time.
Firstly, the constructor of the Buffer<T> class also optimizes if the specified sequence can be distinguished to ICollection (for example, an array or list), which has the Count property:
TElement[] array = null; int num = 0; ICollection<TElement> collection = source as ICollection<TElement>; if (collection != null) { num = collection.Count; if (num > 0) { array = new TElement[num]; collection.CopyTo(array, 0); } } else // now we are going the long way ... So, if this is not a collection, the request must be completed in order to receive the total bill. But using Enumerable.Count only to initialize an array with the correct sizes can be very expensive and, more importantly, can have dangerous side effects. Therefore, it is unsafe.
Consider this simple File.ReadLines example:
var lines = File.ReadLines(path); int count = lines.Count(); // executes the query which also disposes the underlying IO.TextReader var array = new string[count]; int index = 0; foreach (string line in lines) array[index++] = line; This will ObjectDisposedException "Unable to read from a private TextReader," since lines.Count() has already completed the request, and meanwhile the reader is in foreach .
The Buffer<T> class has optimizations for the case where the source sequence implements ICollection<T> :
internal Buffer(IEnumerable<TElement> source) { int length = 0; TElement[] array = null; ICollection<TElement> collection = source as ICollection<TElement>; if (collection != null) { length = collection.Count; if (length > 0) { array = new TElement[length]; collection.CopyTo(array, 0); } } else { ... If the sequence does not implement ICollection<T> , the code cannot assume that it is safe to enumerate the sequence twice, so it returns to resizing the array as necessary.
Because Count () lists the source to the end. Thus, it will at least perform 2 iterations, one for counting and one for copying elements.
Now consider that the enumerated question is the database cursor or something else similar, which includes non-trivial operations during iteration. This will be a performance killer.
It is a better idea to simply memcopy and expand the buffer. It may have a small effect on performance, but it is very small and more important for a known quantity .
If IEnumerable<T> and / or IEnumerator<T> included a property to ask if he knows his score, as well as the Count property, it might be useful for ToArray() use such a thing [having Count will be part of IEnumerator<T> , will be useful in cases where, for example, calling GetEnumerator in a thread-safe mutable type will enumerate a snapshot]. The lack of such an ability, even if the code has an ICollection or ICollection<T> , there is no way to find out if the Count call will take more or less time than creating additional time arrays.
If it were said, I would expect that the optimal implementation for something like ToArray probably be related to a linked list of things, each of which contains a certain number of elements, so that each element would be counted into the space that it would occupy before until it is copied to the destination array. The List<T> doubling strategy does not seem particularly appropriate here, since it would be better to have information shared between several small arrays than to create an array of more than 85,000 bytes (since temporary arrays will be useless after it leaves, having them get into the heap large object, especially bad).