Group by unknown initial prefix

Let's say I have the following array of strings as input:

foo-139875913 foo-aeuefhaiu foo-95hw9ghes barbazabejgoiagjaegioea barbaz8gs98ghsgh9es8h 9a8efa098fea0 barbaza98fyae9fghaefag bazfa90eufa0e9u bazgeajga8ugae89u bazguea9guae aifeaufhiuafhe 

Three different prefixes are used here: "foo-", "barbaz" and "baz", however these prefixes are not known in advance (they can be completely different).

How could you establish what are the different common prefixes so that they can be grouped? This is a bit complicated, because in the data I provided there are two that start with "bazg" and one that starts "bazf", where, of course, "baz" is the prefix.

What I have tried so far, sorts them in alphabetical order, and then iterates over them in order and counts how many characters in the string are identical to the previous ones. If the number is different or when 0 characters are identical, it starts a new group. The problem with this is that it falls on the "bazg" and "bazf" problems that I mentioned earlier, and splits them into two different groups (one with only one element)

Edit: Ok, let's throw some more rules in:

  • Longer potential groups should generally be preferred over shorter ones unless there is a closely comparable group with a difference in length of less than X characters. (So ​​where X is 2, baz will be preferable to bazg)
  • A group must have at least Y elements in it or not be a group at all
  • Everything is in order to simply throw away elements that do not correspond to any of the β€œgroups” within the framework of the above rules.

To clarify the first rule regarding the second, if X is 0 and Y is 2, then two bazg entries will be in the group, and bazf will be thrown out, because of itself.

+4
source share
3 answers

Well, here is a quick hack, possibly O(something_bad) :

 IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1) { // TODO: error checking return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize); } IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize) { if(source.Any()) { var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source); if (tuple.Item2.Count() >= minGroupSize) yield return tuple; foreach (var element in GuessGroups(source, minNameLength, minGroupSize)) yield return element; } } Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source) { return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable()); } IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source) { while (source.Any() && source.Peek().StartsWith(prefix)) yield return source.Pop(); } string GetBestGroup(IEnumerable<string> source, int minNameLength) { var s = new Stack<string>(source); var counter = new DictionaryWithDefault<string, int>(0); while(s.Any()) { var g = GetCommonPrefix(s); if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength) counter[g]++; s.Pop(); } return counter.OrderBy(c => c.Value).Last().Key; } string GetCommonPrefix(IEnumerable<string> coll) { return (from len in Enumerable.Range(0, coll.Min(s => s.Length)).Reverse() let possibleMatch = coll.First().Substring(0, len) where coll.All(f => f.StartsWith(possibleMatch)) select possibleMatch).FirstOrDefault(); } public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue> { TValue _default; public TValue DefaultValue { get { return _default; } set { _default = value; } } public DictionaryWithDefault() : base() { } public DictionaryWithDefault(TValue defaultValue) : base() { _default = defaultValue; } public new TValue this[TKey key] { get { return base.ContainsKey(key) ? base[key] : _default; } set { base[key] = value; } } } 

Usage example:

 string[] input = { "foo-139875913", "foo-aeuefhaiu", "foo-95hw9ghes", "barbazabejgoiagjaegioea", "barbaz8gs98ghsgh9es8h", "barbaza98fyae9fghaefag", "bazfa90eufa0e9u", "bazgeajga8ugae89u", "bazguea9guae", "9a8efa098fea0", "aifeaufhiuafhe" }; GuessGroups(input, 3, 2).Dump(); 

enter image description here

+4
source

Well, as discussed, the problem was not initially clearly defined, but here's how I do it.

 Create a tree T Parse the list, for each element: for each letter in that element if a branch labeled with that letter exists then Increment the counter on that branch Descend that branch else Create a branch labelled with that letter Set its counter to 1 Descend that branch 

This gives you a tree in which each leaf represents a word at your entrance. Each of the non-leaf nodes has a counter representing how many sheets (eventually) are attached to this node. Now you need a formula to weight the length of the prefix (depth node) by the size of the prefix group. Until:

 S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour 

So now you can iterate over each of the non-leaf nodes and assign them points S. Then, to work out your groups, you

 For each non-leaf node Assign score S Insertion sort the node in to a list, so the head is the highest scoring node Starting at the root of the tree, traverse the nodes If the node is the highest scoring node in the list Mark it as a prefix Remove all nodes from the list that are a descendant of it Pop itself off the front of the list Return up the tree 

This should give you a list of prefixes. The last part seems that some smart data structures or algorithms can speed it up (the last part of deleting all the children feels especially weak, but if you enter the size is small, I think the speed is not too important).

+1
source

I am wondering if your requirements are not turned off. It seems that you are looking for a specific grouping size, rather than specific key size requirements. I have a program below that, based on a given group size, will also break the lines into the largest groups, including the group size. Therefore, if you specify the size of group 5, then it will group the elements on the smallest key to make the group size 5. In your example, it groups foo- like f, since there is no need to make a more complex key, because the identifier.

 using System; using System.Collections.Generic; using System.Linq; using System.Text; namespace ConsoleApplication2 { class Program { /// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks> public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items) { var smallItems = from item in items where item.Length < keySize select item; var largeItems = from item in items where keySize < item.Length select item; var largeItemsq = (from item in largeItems let key = item.Substring(0, keySize) group item by key into x select new { Key = x.Key, Items = x.ToList() } into aGrouping group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2 select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items)); if (smallItems.Any()) { var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length)); var smallItemsq = (from item in smallItems let key = item.Substring(0, smallestLength) group item by key into x select new { Key = x.Key, Items = x.ToList() } into aGrouping group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2 select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items)); return Combine(smallItemsq, largeItemsq); } return largeItemsq; } static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) { var x = new Dictionary<bool,Dictionary<string,List<string>>> { { true, null }, { false, null } }; foreach(var condition in new bool[] { true, false }) { var hasA = a.ContainsKey(condition); var hasB = b.ContainsKey(condition); x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value) : hasA ? a[condition] : hasB ? b[condition] : new Dictionary<string, List<string>>(); } return x; } public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize) { var toReturn = new Dictionary<string, List<string>>(); var both = Split(maxGroupSize, keySize, items); if (both.ContainsKey(false)) foreach (var key in both[false].Keys) toReturn.Add(key, both[false][key]); if (both.ContainsKey(true)) { var keySize_ = keySize + 1; var xs = from needsFix in both[true] select needsFix; foreach (var x in xs) { var fixedGroup = Group(maxGroupSize, x.Value, keySize_); toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value); } } return toReturn; } static Random rand = new Random(unchecked((int)DateTime.Now.Ticks)); const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"; static readonly int maxAllowed = allowedChars.Length - 1; static IEnumerable<string> GenerateText() { var list = new List<string>(); for (int i = 0; i < 100; i++) { var stringLength = rand.Next(3,25); var chars = new List<char>(stringLength); for (int j = stringLength; j > 0; j--) chars.Add(allowedChars[rand.Next(0, maxAllowed)]); var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString(); list.Add(newString); } return list; } static void Main(string[] args) { // runs 1000 times over autogenerated groups of sample text. for (int i = 0; i < 1000; i++) { var s = GenerateText(); Go(s); } Console.WriteLine(); Console.WriteLine("DONE"); Console.ReadLine(); } static void Go(IEnumerable<string> items) { var dict = Group(3, items, 1); foreach (var key in dict.Keys) { Console.WriteLine(key); foreach (var item in dict[key]) Console.WriteLine("\t{0}", item); } } } } 
0
source

Source: https://habr.com/ru/post/1479353/


All Articles