Optimized General List Separator

Question

Optimized General List Separator

Read the editing below for more information.

I have the code below, which I use to share a common list of objects when an element is of a specific type.

public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) { List<List<object>> t = new List<List<object>>(); int currentT = 0; t.Add(new List<object>()); foreach (object list in tokens) { if ((list is Token) && (list as Token).TokenType == type) { currentT++; t.Add(new List<object>()); } else if ((list is TokenType) && ((TokenType)list )== type) { currentT++; t.Add(new List<object>()); } else { t[currentT].Add(list); } } return t.ToArray(); }

I don't have a clear question, as I'm curious if anyone knows of any ways that I can optimize this code. I call it many times, and it looks like a monster when there are clock cycles. Any ideas? I can also do this wiki, if anyone is interested, maybe we can track the latest changes.

Update: I am trying to parse specific tokens. This is a list of some other classes and tokens. The token has the TokenType property (enumeration). I need to find all classes of tokens and divide them into each of them.

 {abc T de T fgh T ijkl T m}

split as

 {abc}{de}{fgh}{ijkl}{m}

EDIT UPDATE: It seems that all my speed issues are related to constantly creating and adding shared lists. Does anyone know how I can do without this? This is a profile of what happens if it helps someone.

+4

optimization list generics split c #

Tested Feb 16 '10 at 1:41

source share

9 answers

~~Your code looks good.~~

My only suggestion would be to replace IEnumerable<object> with a non-generic IEnumerable . (In System.Collections )

EDIT

With further verification, you do more time than necessary.

Replace if with the following code:

 var token = list as Token; if (token != null && token.TokenType == type) {

Alternatively, you can get rid of the currentT variable by writing t[t.Count - 1] or t.Last() . This will make the code clearer, but may have a slight negative impact on performance.
In addition, you can save the link to the internal list in a variable and use it directly. (This will slightly improve performance)

Finally, if you can change the return type to List<List<Object>> , you can return t directly; this will avoid copying the array and will be noticeably faster for large lists.

By the way, your variable names are confusing; you must change the names t and list .

+4

SLaks Feb 16 '10 at 1:45

source share

Sample testing and throws can be a killer of performance. If at all possible, your token types should implement a common interface or abstract class. Instead of passing and an object you should pass an IToken that wraps your object.

Here is the code you can use to get started:

 using System; using System.Collections.Generic; namespace Juliet { interface IToken<T> { bool IsDelimeter { get; } T Data { get; } } class DelimeterToken<T> : IToken<T> { public bool IsDelimeter { get { return true; } } public T Data { get { throw new Exception("No data"); } } } class DataToken<T> : IToken<T> { public DataToken(T data) { this.Data = data; } public bool IsDelimeter { get { return false; } } public T Data { get; private set; } } class TokenFactory<T> { public IToken<T> Make() { return new DelimeterToken<T>(); } public IToken<T> Make(T data) { return new DataToken<T>(data); } } class Program { static List<List<T>> SplitTokens<T>(IEnumerable<IToken<T>> tokens) { List<List<T>> res = new List<List<T>>(); foreach (IToken<T> token in tokens) { if (token.IsDelimeter) { res.Add(new List<T>()); } else { if (res.Count == 0) { res.Add(new List<T>()); } res[res.Count - 1].Add(token.Data); } } return res; } static void Main(string[] args) { TokenFactory<string> factory = new TokenFactory<string>(); IToken<string>[] tokens = new IToken<string>[] { factory.Make("a"), factory.Make("b"), factory.Make("c"), factory.Make(), factory.Make("d"), factory.Make("e"), factory.Make(), factory.Make("f"), factory.Make("g"), factory.Make("h"), factory.Make(), factory.Make("i"), factory.Make("j"), factory.Make("k"), factory.Make("l"), factory.Make(), factory.Make("m") }; List<List<string>> splitTokens = SplitTokens(tokens); for (int i = 0; i < splitTokens.Count; i++) { Console.Write("{"); for (int j = 0; j < splitTokens[i].Count; j++) { Console.Write("{0}, ", splitTokens[i][j]); } Console.Write("}"); } Console.ReadKey(true); } } }

Basically, you can create IToken<object> instances so that they are generalized to tokens of several classes.

+3

Juliet Feb 19 '10 at 4:31

source share

A: A full implementation will suffice if you simply iterate over the results in a nested foreach:

 using System; using System.Collections.Generic; public static class Splitter { public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> source, Predicate<T> match) { using (IEnumerator<T> enumerator = source.GetEnumerator()) { while (enumerator.MoveNext()) { yield return Split(enumerator, match); } } } static IEnumerable<T> Split<T>(IEnumerator<T> enumerator, Predicate<T> match) { do { if (match(enumerator.Current)) { yield break; } else { yield return enumerator.Current; } } while (enumerator.MoveNext()); } }

Use it as follows:

 using System; using System.Collections.Generic; using System.Linq; using System.Text; namespace MyTokenizer { class Program { enum TokenTypes { SimpleToken, UberToken } class Token { public TokenTypes TokenType = TokenTypes.SimpleToken; } class MyUberToken : Token { public MyUberToken() { TokenType = TokenTypes.UberToken; } } static void Main(string[] args) { List<object> objects = new List<object>(new object[] { "A", Guid.NewGuid(), "C", new MyUberToken(), "D", new MyUberToken(), "E", new MyUberToken() }); var splitOn = TokenTypes.UberToken; foreach (var list in objects.Split(x => x is Token && ((Token)x).TokenType == splitOn)) { foreach (var item in list) { Console.WriteLine(item); } Console.WriteLine("=============="); } Console.ReadKey(); } } }

B: If you need to process the results after a while, and you want to do it out of order, or you split it into one thread and then maybe send segments to multiple threads, then this will probably give a good starting point:

 using System; using System.Collections.Generic; using System.Linq; public static class Splitter2 { public static IEnumerable<IEnumerable<T>> SplitToSegments<T>(this IEnumerable<T> source, Predicate<T> match) { T[] items = source.ToArray(); for (int startIndex = 0; startIndex < items.Length; startIndex++) { int endIndex = startIndex; for (; endIndex < items.Length; endIndex++) { if (match(items[endIndex])) break; } yield return EnumerateArraySegment(items, startIndex, endIndex - 1); startIndex = endIndex; } } static IEnumerable<T> EnumerateArraySegment<T>(T[] array, int startIndex, int endIndex) { for (; startIndex <= endIndex; startIndex++) { yield return array[startIndex]; } } }

C: If you really have to return the List <T> -s collection - which I doubt if you obviously do not want to mutate them some time later - try initializing them to the given capacity before copying:

 public static List<List<T>> SplitToLists<T>(this IEnumerable<T> source, Predicate<T> match) { List<List<T>> lists = new List<List<T>>(); T[] items = source.ToArray(); for (int startIndex = 0; startIndex < items.Length; startIndex++) { int endIndex = startIndex; for (; endIndex < items.Length; endIndex++) { if (match(items[endIndex])) break; } List<T> list = new List<T>(endIndex - startIndex); list.AddRange(EnumerateArraySegment(items, startIndex, endIndex - 1)); lists.Add(list); startIndex = endIndex; } return lists; }

D: If this is still not enough, I suggest you run your own lightweight List implementation, which can copy the range directly into the internal array from another instance.

+3

Andras vass Feb 19 '10 at 21:44

source share

My first thought would be instead of searching t[currentT] all the time, just save the currentList and add directly to it.

+2

Anon. Feb 16 '10 at 1:47

source share

I think that for these scenarios there are broken cases when it is assumed that the list items are lowercase letters and the item with the corresponding marker type is T:

{T ab c ...};
{... xyz T};
{... jkl TT mn o ...};
{T}; and
{}

This will lead to:

{{} {ab c ...}};
{{... xyz} {}};
{{... jkl} {} {} {mn o ...}};
{{}}; and
{}

Performing direct refactoring:

 public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) { var outer = new List<List<object>>(); var inner = new List<object>(); foreach (var item in tokens) { Token token = item as token; if (token != null && token.TokenType == type) { outer.Add(inner); inner = new List<object>(); continue; } inner.Add(item); } outer.Add(inner); return outer.ToArray(); }

To fix broken cases (provided that they are really broken), I recommend:

 public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) { var outer = new List<List<object>>(); var inner = new List<object>(); var enumerator = tokens.GetEnumerator(); while (enumerator.MoveNext()) { Token token = enumerator.Current as token; if (token == null || token.TokenType != type) { inner.Add(enumerator.Current); } else if (inner.Count > 0) { outer.Add(inner); inner = new List<object>(); } } return outer.ToArray(); }

This will lead to:

{{ab c ...}};
{{... xyz}};
{{... jkl} {mn o ...}};
{}; and
{}

+1

Ants Feb 16 '10 at 5:14

source share

Using LINQ, you can try the following: (I have not tested it ...)

  public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) { List<List<object>> l = new List<List<object>>(); l.Add(new List<object>()); return tokens.Aggregate(l, (c, n) => { var t = n as Token; if (t != null && t.TokenType == type) { t.Add(new List<object>()); } else { l.Last().Add(n); } return t; }).ToArray(); }

Second attempt:

 public static IEnumerable<object>[] Split(this IEnumerable<object> tokens, TokenType type) { var indexes = tokens.Select((t, index) => new { token = t, index = index }).OfType<Token>().Where(t => t.token.TokenType == type).Select(t => t.index); int prevIndex = 0; foreach (int item in indexes) { yield return tokens.Where((t, index) => (index > prevIndex && index < item)); prevIndex = item; } }

+1

moi_meme Feb 19 '10 at 20:26

source share

Here is one of the possibilities

Token class (may be what it ever was)

 public class Token { public string Name { get; set; } public TokenType TokenType { get; set; } }

Now type enumeration (it could be some other grouping factor)

 public enum TokenType { Type1, Type2, Type3, Type4, Type5, }

Extension Method (declare it anyway)

 public static class TokenExtension { public static IEnumerable<Token>[] Split(this IEnumerable<Token> tokens) { return tokens.GroupBy(token => ((Token)token).TokenType).ToArray(); } }

Usage example (I used a web project for this)

 List<Token> tokens = new List<Token>(); tokens.Add(new Token { Name = "a", TokenType = TokenType.Type1 }); tokens.Add(new Token { Name = "b", TokenType = TokenType.Type1 }); tokens.Add(new Token { Name = "c", TokenType = TokenType.Type1 }); tokens.Add(new Token { Name = "d", TokenType = TokenType.Type2 }); tokens.Add(new Token { Name = "e", TokenType = TokenType.Type2 }); tokens.Add(new Token { Name = "f", TokenType = TokenType.Type3 }); tokens.Add(new Token { Name = "g", TokenType = TokenType.Type3 }); tokens.Add(new Token { Name = "h", TokenType = TokenType.Type3 }); tokens.Add(new Token { Name = "i", TokenType = TokenType.Type4 }); tokens.Add(new Token { Name = "j", TokenType = TokenType.Type4 }); tokens.Add(new Token { Name = "k", TokenType = TokenType.Type4 }); tokens.Add(new Token { Name = "l", TokenType = TokenType.Type4 }); tokens.Add(new Token { Name = "m", TokenType = TokenType.Type5 }); StringBuilder stringed = new StringBuilder(); foreach (Token token in tokens) { stringed.Append(token.Name); stringed.Append(", "); } Response.Write(stringed.ToString()); Response.Write("</br>"); var q = tokens.Split(); foreach (var list in tokens.Split()) { stringed = new StringBuilder(); foreach (Token token in list) { stringed.Append(token.Name); stringed.Append(", "); } Response.Write(stringed.ToString()); Response.Write("</br>"); }

So, everything I do uses Linq, feel free to add or remove, you can actually go crazy about it and group by many other properties.

+1

Oakcool Feb 19 '10 at 23:27

source share

Do you need to convert it to an array? You can use LINQ and deferred execution to return results.

EDIT:
With a clarified question, it would be difficult to bend LINQ so that it returns the results the way you want. If you still want the execution of each loop to be delayed, you could write your own counter.

I recommend first checking this against other options to see if there is an increase in performance for your scenario if you try to use this approach. This can lead to more overhead for managing the iterator, which would be bad for cases with small data.

Hope this helps.

 // This is the easy way to make your own iterator using the C# syntax // It will return sets of separated tokens in a lazy fashion // This sample is based on the version provided by @Ants public static IEnumerable<IEnumerable<object>> Split(this IEnumerable<object> tokens, TokenType type) { var current = new List<object>(); foreach (var item in tokens) { Token token = item as Token; if (token != null && token.TokenType == type) { if( current.Count > 0) { yield return current; current = new List<object>(); } } else { current.Add(item); } } if( current.Count > 0) yield return current; }

Warning. This is a compilation, but may still have hidden errors. Now it's getting late.

 // This is doing the same thing but doing it all by hand. // You could use this method as well to lazily iterate through the 'current' list as well // This is probably overkill and substantially more complex public class TokenSplitter : IEnumerable<IEnumerable<object>>, IEnumerator<IEnumerable<object>> { IEnumerator<object> _enumerator; IEnumerable<object> _tokens; TokenType _target; List<object> _current; bool _isDone = false; public TokenSplitter(IEnumerable<object> tokens, TokenType target) { _tokens = tokens; _target = target; Reset(); } // Cruft from the IEnumerable and generic IEnumerator public IEnumerator<IEnumerable<object>> GetEnumerator() { return this; } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return GetEnumerator(); } public IEnumerable<object> Current { get { return _current; } } public void Dispose() { } #region IEnumerator Members object System.Collections.IEnumerator.Current { get { return Current; } } // See if there is anything left to get public bool MoveNext() { if (_isDone) return false; FillCurrent(); return !_isDone; } // Reset the enumerators so that you could reuse this structure if you wanted public void Reset() { _isDone = false; _enumerator = _tokens.GetEnumerator(); _current = new List<object>(); FillCurrent(); } // Fills the current set of token and then begins the next set private void FillCurrent() { // Try to accumulate as many tokens as possible, this too could be an enumerator to delay the process more bool hasNext = _enumerator.MoveNext(); for( ; hasNext; hasNext = _enumerator.MoveNext()) { Token token = _enumerator.Current as Token; if (token == null || token.TokenType != _target) { _current.Add(_enumerator.Current); } else { _current = new List<object>(); } } // Continue removing matching tokens and begin creating the next set for( ; hasNext; hasNext = _enumerator.MoveNext()) { Token token = _enumerator.Current as Token; if (token == null || token.TokenType != _target) { _current.Add(_enumerator.Current); break; } } _isDone = !hasNext; } #endregion }

0

smaclell Feb 16 '10 at 1:47

source share

Grant peters · Accepted Answer · 2010-02-22T03:44:03+0000

This is the best thing I could do to exclude as much distribution time as possible for the function (you need to allocate it only when it moves in capacity, which should be no more than what is required to create the largest subscriptions in the results). I tested this implementation and it works the way you described.

Please note that the results of the previous auxiliary list are destroyed when accessing the next list in the group.

 public static IEnumerable<IEnumerable> Split(this IEnumerable tokens, TokenType type) { ArrayList currentT = new ArrayList(); foreach (object list in tokens) { Token token = list as Token; if ((token != null) && token.TokenType == type) { yield return currentT; currentT.Clear(); //currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance } else if ((list is TokenType) && ((TokenType)list) == type) { yield return currentT; currentT.Clear(); //currentT = new ArrayList(); <-- Use this instead of 'currentT.Clear();' if you want the returned lists to be a different instance } else { currentT.Add(list); } } }

EDIT Here is another version that does not use another list at all (should not make any allocations). Not sure how good this will compare, but it works on demand (also I have no idea how this will happen if you try to cache an auxiliary array).

In addition, both of them require the expression "using System.Collections" (in addition to the Generic namespace).

 private static IEnumerable SplitInnerLoop(IEnumerator iter, TokenType type) { do { Token token = iter.Current as Token; if ((token != null) && token.TokenType == type) { break; } else if ((iter.Current is TokenType) && ((TokenType)iter.Current) == type) { break; } else { yield return iter.Current; } } while (iter.MoveNext()); } public static IEnumerable<IEnumerable> Split(this IEnumerable tokens, TokenType type) { IEnumerator iter = tokens.GetEnumerator(); while (iter.MoveNext()) { yield return SplitInnerLoop(iter, type); } }

Optimized General List Separator

More articles: