Insert an item into a case-sensitive list in Python

I have a list of strings that are already sorted in a case-insensitive manner. I would like to insert a new line into the list. One way to do this is to add an item and then sort the list, for example:

myList.append('Something') myList.sort(key=lambda s: s.lower()) 

But I was wondering if there is a way to simply insert an element into the correct position without sorting it all again.

I found this question: Insert an element into a sorted list in Python . It points to the Python bisect module. But this module does not look like it can support case insensitivity.


Change I checked a few answers given here.

  • Adding an item to the end and sorting the entire list (as suggested in the original question) was the slowest.
  • Moinuddin Quadri's answer was faster than sorting the entire list, but it was still quite slow due to lower() running for each item in the list.
  • Stefan Pohman's answer was an order of magnitude faster than sorting the entire list.
  • Jared Goguen's answer was the fastest for repeated inserts. The first time, however, it runs lower() for each element.

It was a close call to accept the answer. In the end, I went with Stefan Pohmann because it was the best for a one-time insertion, and access to the resulting list does not require access to the member variable. However, the use cases vary, so be sure to study all the answers.

+6
source share
6 answers

This is a good opportunity to use binary search again (or just copy and paste & modify bisect.insort , which I did):

 def insort_case_insensitive(a, x): key = x.lower() lo, hi = 0, len(myList) while lo < hi: mid = (lo + hi) // 2 if key < a[mid].lower(): hi = mid else: lo = mid + 1 a.insert(lo, x) 

Demo:

 myList = ['a', 'b', 'c', 'd', 'e'] for x in 'A', 'B', 'C', 'D', 'E': insort_case_insensitive(myList, x) print myList 

Print

 ['a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E'] 

This is O (n), just like append + sort will, but only because of a.insert(lo, x) at the end. Which is dead-simple and made in C, so it is very fast. Binary search, of course, only takes O (log n) steps, so it's super fast as well. The append + sort method will call .lower() for all elements and compare them, both of which are much slower. @MoinuddinQuadri the first solution is also much slower due to calling .lower() for all elements.

See my other answer for a comparative comparison.

+2
source

You can use bisect.bisect in a sorted list with a bottom line like:

 from bisect import bisect my_list = ["aa", "bb", "Dd", "ee"] insert_string = "CC" # convert all the items in list to lower case for # v finding the correct location via. bisect index = bisect([i.lower() for i in my_list], insert_string.lower()) # bisect based on lower-cased string for ^ # case-insensitive behavior my_list.insert(index, insert_string) 

where the updated my_list content will be:

 ['aa', 'bb', 'CC', 'Dd', 'ee'] 
+3
source

You can create your own type to encapsulate this behavior (in combination with bisect as suggested in another answer).

 from bisect import bisect class CaseInsensitiveSortedList: def __init__(self, iterable): self.with_case = list(sorted(iterable, key=lambda s: s.lower())) self.without_case = [s.lower() for s in self.with_case] def insert_in_order(self, s): s_lower = s.lower() index = bisect(self.without_case, s_lower) self.without_case.insert(index, s_lower) self.with_case.insert(index, s) test_list = CaseInsensitiveSortedList(['a', 'B', 'cd', 'E', 'fff']) test_list.insert_in_order('D') print(test_list.with_case) # ['a', 'B', 'cd', 'D', 'E', 'fff'] test_list.insert_in_order('ee') print(test_list.with_case) # ['a', 'B', 'cd', 'D', 'E', 'ee', 'fff'] 

You can directly expand the list and make it a little more β€œnatural” or do whatever you want. It is just an idea to avoid calling str.lower for each item for each insert.

+3
source

I have never worked with bisect, but here is my hit. I take the first function directly from the bisect page that you linked:

 def index(a, x): 'Locate the leftmost value exactly equal to x' i = bisect_left(a, x) if i != len(a) and a[i] == x: return i raise ValueError def insert_into_sorted(char, my_list): marker = chr(ord(char) + 1) spot = index(my_list, marker) my_list[spot:spot] = char return my_list x = ['a', 'b', 'd', 'e'] insert_into_sorted('c', x) >>['a', 'b', 'c', 'd', 'e'] 
+1
source

Based on comments from OP, as it’s normal to maintain an intermediate list for storing strings with a lower window (single operation); this can be achieved as:

 from bisect import bisect my_list = ["aa", "bb", "Dd", "ee"] lower_list = [i.lower() for i in my_list] # list of lower-cased strings. # one time operation insert_string = "CC" # word to insert # find index based on lower-cased list index = bisect(lower_list, insert_string.lower()) my_list.insert(index, insert_string) # insert word in original list lower_list.insert(index, insert_string.lower()) # insert lower-cased word # in lower-cased list 

where the final value of my_list will be lower_list will be:

 >>> my_list # original list ['aa', 'bb', 'CC', 'Dd', 'ee'] >>> lower_list ['aa', 'bb', 'cc', 'dd', 'ee'] 

Here we divide in half the list of words with a bottom box to find the index and based on the index by inserting a string into the original list.

+1
source

I see that you have added test results to you. I also did some tests and got a similar picture:

 Insorting 20000 words: 80.224 seconds with insort_sorting 0.166 seconds with insort_own_binary_search 70.294 seconds with insort_lower_all 0.094 seconds with insort_keep_lower 

However, you are a little mistaken in the fast two. With a lot of inserts mine is getting faster. About twice as fast:

 Insorting 1000000 words: 92.712 seconds with insort_own_binary_search 173.577 seconds with insort_keep_lower 

This is because the time O (log n) for index search becomes negligible, and during time O (n) for insert calls, time O (n) predominates. And my solution has only one of them, while the other solution has two.

Another difference is the complexity of the space, so the extra list of shortened versions of all lines is not so good.

Here is my benchmarking:

 import random, string, time #-------------------------------------------------------------- def insort_sorting(a, x): a.append(x) a.sort(key=str.lower) #-------------------------------------------------------------- def insort_own_binary_search(a, x): key = x.lower() lo, hi = 0, len(myList) while lo < hi: mid = (lo + hi) // 2 if key < a[mid].lower(): hi = mid else: lo = mid + 1 a.insert(lo, x) #-------------------------------------------------------------- from bisect import bisect def insort_lower_all(a, x): index = bisect([i.lower() for i in a], x.lower()) a.insert(index, x) #-------------------------------------------------------------- from bisect import bisect def insort_keep_lower(a, x, lower=[]): x_lower = x.lower() index = bisect(lower, x_lower) a.insert(index, x) lower.insert(index, x_lower) #-------------------------------------------------------------- # Generate random words words = [''.join(random.choice(string.ascii_letters) for _ in range(10)) for _ in range(20000)] # for _ in range(1000000)] # Compare the solutions print 'Insorting', len(words), 'words:' reference = None for insort in insort_sorting, insort_own_binary_search, insort_lower_all, insort_keep_lower: #for insort in insort_own_binary_search, insort_keep_lower: t0 = time.time() myList = [] for word in words: insort(myList, word) print '%7.3f seconds with %s' % (time.time() - t0, insort.__name__) if reference is None: reference = myList else: assert myList == reference 
+1
source

Source: https://habr.com/ru/post/1014501/


All Articles