The Pythonic way to do "size-safe" slicing

Question

Here is a quote from the answer by Greg Hewgill (https://stackoverflow.com/users/893/greg-hewgill) to a question about Python's slice notation:

Python is kind to the programmer if there are fewer items than you ask for. For example, if you ask for a[:-2] and a only contains one element, you get an empty list instead of an error. Sometimes you would prefer the error, so you have to be aware that this may happen.
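A quick illustration of the behaviour described in the quote (a minimal sketch written for Python 3; the list `a` is just an example value):

```python
a = [1]

# Slicing never raises for out-of-range bounds; it returns what is available.
print(a[:-2])    # []
print(a[5:10])   # []

# Indexing, by contrast, raises IndexError for an out-of-range position.
try:
    a[5]
except IndexError as exc:
    print("IndexError:", exc)
```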

So, when an error is preferred, what is the Pythonic way to proceed? Is there a more Pythonic way to rewrite this example?

    class ParseError(Exception):
        pass

    def safe_slice(data, start, end):
        """0 <= start <= end is assumed"""
        r = data[start:end]
        if len(r) != end - start:
            raise IndexError
        return r

    def lazy_parse(data):
        """extract (name, phone) from a data buffer.
        If the buffer could not be parsed, a ParseError is raised.
        """
        try:
            name_length = ord(data[0])
            extracted_name = safe_slice(data, 1, 1 + name_length)
            phone_length = ord(data[1 + name_length])
            extracted_phone = safe_slice(data, 2 + name_length, 2 + name_length + phone_length)
        except IndexError:
            raise ParseError()
        return extracted_name, extracted_phone

    if __name__ == '__main__':
        print lazy_parse("\x04Jack\x0A0123456789") # OK
        print lazy_parse("\x04Jack\x0A012345678") # should raise ParseError

edit: the example was easier to write using byte strings, but my real code uses lists.

+6

python slice

kasyc Nov 10 '11 at 9:53

4 answers

Here is one way that is perhaps more Pythonic. If you want to parse a byte string, you can use the struct module, which is provided for this purpose:

    import struct
    from collections import namedtuple

    Details = namedtuple('Details', 'name phone')

    def lazy_parse(data):
        """extract (name, phone) from a data buffer.
        If the buffer could not be parsed, a ParseError is raised.
        """
        try:
            name = struct.unpack_from("%dp" % len(data), data)[0]
            phone = struct.unpack_from("%dp" % (len(data)-len(name)-1), data, len(name)+1)[0]
        except struct.error:
            raise ParseError()
        return Details(name, phone)

What I would still consider unpythonic is throwing away the useful struct.error traceback and replacing it with a ParseError, whatever that is: the original tells you what is wrong with the string, while the latter only tells you that something is wrong.
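One way to keep that information, assuming Python 3 (the code above is Python 2), is to chain the exceptions with `raise ... from ...`, so the original struct.error stays attached as `__cause__`. This is only a sketch; `parse_u16` is a hypothetical helper used for illustration and is not part of the code above:

```python
import struct


class ParseError(Exception):
    pass


def parse_u16(data):
    """Hypothetical helper: read one big-endian unsigned 16-bit integer."""
    try:
        return struct.unpack_from(">H", data)[0]
    except struct.error as exc:
        # "from exc" keeps the original error attached as __cause__,
        # so the traceback shows both exceptions.
        raise ParseError("could not parse buffer") from exc


print(parse_u16(b"\x01\x02"))  # 258

try:
    parse_u16(b"\x01")  # buffer is one byte too short
except ParseError as err:
    print(isinstance(err.__cause__, struct.error))  # True
```

With chaining, callers can still catch a single ParseError while the traceback reports what struct complained about.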

+5

Duncan Nov 10 '11 at 10:19

Here is a more Pythonic, more general rewrite of your code:

    class ParseError(Exception):
        pass

    def safe_slice(data, start, end, exc=IndexError):
        """0 <= start <= end is assumed"""
        r = data[start:end]
        if len(r) != end - start:
            raise exc()
        return r

    def lazy_parse(data):
        """extract (name, phone) from a data buffer.
        If the buffer could not be parsed, a ParseError is raised."""
        results = []
        ptr = 0
        while ptr < len(data):
            length = ord(data[ptr])
            ptr += 1
            results.append(safe_slice(data, ptr, ptr + length, exc=ParseError))
            ptr += length
        return tuple(results)

    if __name__ == '__main__':
        print lazy_parse("\x04Jack\x0A0123456789") # OK
        print lazy_parse("\x04Jack\x0A012345678") # should raise ParseError

Most of the changes are in the body of lazy_parse: it now works with any number of values instead of just two, and the correctness of the whole thing still depends on the last element being parsed exactly.

Also, instead of safe_slice raising an IndexError that lazy_parse converts to a ParseError, lazy_parse passes the desired exception to safe_slice, to be raised in case of an error (safe_slice defaults to IndexError if nothing is passed to it).

Finally, lazy_parse is not lazy: it processes the entire string at once and returns all the results. "Lazy" in Python means doing only what is needed to return the next piece. In the case of lazy_parse, that would mean returning the name and then, on a later call, returning the phone. With a minor modification, we can make lazy_parse lazy:

    def lazy_parse(data):
        """extract (name, phone) from a data buffer.
        If the buffer could not be parsed, a ParseError is raised."""
        ptr = 0
        while ptr < len(data):
            length = ord(data[ptr])
            ptr += 1
            result = (safe_slice(data, ptr, ptr + length, ParseError))
            ptr += length
            yield result

    if __name__ == '__main__':
        print list(lazy_parse("\x04Jack\x0A0123456789")) # OK
        print list(lazy_parse("\x04Jack\x0A012345678")) # should raise ParseError (the exception passed to safe_slice)

lazy_parse is now a generator that returns one piece at a time. Note that we had to put list() around the lazy_parse call in the main section to make lazy_parse give us all the results so we could print them.
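To see the one-piece-at-a-time behaviour directly, a generator can be driven with next(). The sketch below assumes Python 3 and bytes input, and `fields` is a hypothetical, simplified stand-in for lazy_parse that leaves out the length check:

```python
def fields(data):
    """Hypothetical generator: yield length-prefixed fields from bytes."""
    ptr = 0
    while ptr < len(data):
        length = data[ptr]      # indexing bytes yields an int in Python 3
        ptr += 1
        yield data[ptr:ptr + length]
        ptr += length


gen = fields(b"\x04Jack\x0a0123456789")
first = next(gen)               # only the name has been parsed so far
second = next(gen)              # now the phone has been parsed as well
print(first, second)            # b'Jack' b'0123456789'
print(next(gen, None))          # None: the generator is exhausted
```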

Depending on what you are doing, this may not be the desired approach, as it can be more difficult to recover from errors:

    for item in lazy_parse(some_data):
        result = do_stuff_with(item)
        make_changes_with(result)
        ...

By the time the ParseError is raised, you may have made changes that are difficult or impossible to undo. The solution in this case is the same one used in the print statements of the main section:

    for item in list(lazy_parse(some_data)):
        ...

Calling list completely consumes lazy_parse and gives us a list of results, and if an error has been raised, we will find out about this before processing the first element in the loop.
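A small sketch of the difference, written for Python 3 and bytes input (an assumption; the code above targets Python 2). The lists `processed` and `processed_eager` are hypothetical stand-ins for real side effects:

```python
class ParseError(Exception):
    pass


def lazy_parse(data):
    """Yield each length-prefixed field of a bytes buffer in turn."""
    ptr = 0
    while ptr < len(data):
        length = data[ptr]          # indexing bytes yields an int in Python 3
        ptr += 1
        field = data[ptr:ptr + length]
        if len(field) != length:    # the slice came back short
            raise ParseError()
        ptr += length
        yield field


bad = b"\x04Jack\x0a012345678"      # phone field is one byte short

# Lazy iteration: the first item is processed before the error shows up.
processed = []
try:
    for item in lazy_parse(bad):
        processed.append(item)
except ParseError:
    pass
print(processed)                    # [b'Jack']

# Eager iteration: list() hits the error before the loop body ever runs.
processed_eager = []
try:
    for item in list(lazy_parse(bad)):
        processed_eager.append(item)
except ParseError:
    pass
print(processed_eager)              # []
```

The trade-off is memory: list() holds every parsed item at once, which matters only for large inputs.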

+2

Ethan Furman Nov 14 '11 at 18:21

Here is a complete SafeSlice class that reuses ideas from https://stackoverflow.com/users/107660/duncan and https://stackoverflow.com/users/190597/unutbu . The class is quite large because it has full slice support (start, stop, and step). This may be overkill for the simple job done in the example, but it might prove useful for a more complete real-world problem.

    from __future__ import division

    from collections import MutableSequence
    from collections import namedtuple
    from math import ceil

    class ParseError(Exception):
        pass

    Details = namedtuple('Details', 'name phone')

    def parse_details(data):
        safe_data = SafeSlice(bytearray(data)) # because SafeSlice expects a mutable object
        try:
            name_length = safe_data.pop(0)
            name = safe_data.popslice(slice(name_length))
            phone_length = safe_data.pop(0)
            phone = safe_data.popslice(slice(phone_length))
        except IndexError:
            raise ParseError()
        if safe_data:
            # safe_data should be empty at this point
            raise ParseError()
        return Details(name, phone)

    def main():
        print parse_details("\x04Jack\x0A0123456789") # OK
        print parse_details("\x04Jack\x0A012345678") # should raise ParseError

    SliceDetails = namedtuple('SliceDetails', 'first last length')

    class SafeSlice(MutableSequence):
        """This implementation of a MutableSequence gives IndexError with invalid slices"""

        def __init__(self, mutable_sequence):
            self._data = mutable_sequence

        def __str__(self):
            return str(self._data)

        def __repr__(self):
            return repr(self._data)

        def __len__(self):
            return len(self._data)

        def computeindexes(self, ii):
            """Given a slice or an index, this method computes what would ideally be
            the first index, the last index and the length if the SafeSequence was
            accessed using this parameter.
            None indexes will be returned if the computed length is 0.
            First and last indexes may be negative. This means that they are invalid indexes.
            (e.g. range(2)[-4:-3] will return first=-2, last=-2 and length=1)
            """
            if isinstance(ii, slice):
                start, stop, step = ii.start, ii.stop, ii.step
                if start is None:
                    start = 0
                elif start < 0:
                    start = len(self._data) + start
                if stop is None:
                    stop = len(self._data)
                elif stop < 0:
                    stop = len(self._data) + stop
                if step is None:
                    step = 1
                elif step == 0:
                    raise ValueError("slice step cannot be zero")
                length = ceil((stop - start) / step)
                length = int(max(0, length))
                if length:
                    first_index = start
                    last_index = start + (length - 1) * step
                else:
                    first_index, last_index = None, None
            else:
                length = 1
                if ii < 0:
                    first_index = last_index = len(self._data) + ii
                else:
                    first_index = last_index = ii
            return SliceDetails(first_index, last_index, length)

        def slicecheck(self, ii):
            """Check if the first and the last item of parameter could be accessed"""
            slice_details = self.computeindexes(ii)
            if slice_details.first is not None:
                if slice_details.first < 0:
                    # first is *really* negative
                    self._data[slice_details.first - len(self._data)]
                else:
                    self._data[slice_details.first]
            if slice_details.last is not None:
                if slice_details.last < 0:
                    # last is *really* negative
                    self._data[slice_details.last - len(self._data)]
                else:
                    self._data[slice_details.last]

        def __delitem__(self, ii):
            self.slicecheck(ii)
            del self._data[ii]

        def __setitem__(self, ii, value):
            self.slicecheck(ii)
            self._data[ii] = value

        def __getitem__(self, ii):
            self.slicecheck(ii)
            r = self._data[ii]
            if isinstance(ii, slice):
                r = SafeSlice(r)
            return r

        def popslice(self, ii):
            """Same as pop but a slice may be used as index."""
            self.slicecheck(ii)
            r = self._data[ii]
            if isinstance(ii, slice):
                r = SafeSlice(r)
            del self._data[ii]
            return r

        def insert(self, i, value):
            length = len(self._data)
            if -length <= i <= length:
                self._data.insert(i, value)
            else:
                self._data[i]

    if __name__ == '__main__':
        main()

+2

kasyc Nov 15 '11 at 17:25

unutbu · Accepted Answer · 2011-11-10T11:24:35+0000

Using a function like safe_slice will be faster than creating an object just to perform a slice, but if speed is not a bottleneck and you are looking for a more convenient interface, you can define a class with __getitem__ to perform checks before returning the slice.

This lets you use the usual slice notation instead of passing start and stop arguments to safe_slice.

    class SafeSlice(object):
        # slice rules: http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange
        def __init__(self,seq):
            self.seq=seq
        def __getitem__(self,key):
            seq=self.seq
            if isinstance(key,slice):
                start,stop,step=key.start,key.stop,key.step
                if start:
                    seq[start]
                if stop:
                    if stop<0:
                        stop=len(seq)+stop
                    seq[stop-1]
            return seq[key]

    seq=[1]
    print(seq[:-2])
    # []
    print(SafeSlice(seq)[:-1])
    # []
    print(SafeSlice(seq)[:-2])
    # IndexError: list index out of range

If speed is an issue, I suggest just checking the endpoints instead of doing arithmetic. Item access on Python lists is O(1). The version of safe_slice below also accepts 2, 3 or 4 arguments. With only two arguments, the second is interpreted as the stop value (similar to range).

    def safe_slice(seq, start, stop=None, step=1):
        if stop is None:
            stop=start
            start=0
        else:
            seq[start]
        if stop<0:
            stop=len(seq)+stop
        seq[stop-1]
        return seq[start:stop:step]
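For illustration, here is how the 2, 3 and 4 argument forms might be called. The function is repeated (with added comments) so the snippet runs on its own, and the list `seq` holds made-up example values:

```python
def safe_slice(seq, start, stop=None, step=1):
    if stop is None:      # two-argument form: safe_slice(seq, stop)
        stop = start
        start = 0
    else:
        seq[start]        # raises IndexError if start is out of range
    if stop < 0:
        stop = len(seq) + stop
    seq[stop - 1]         # raises IndexError if the last item is out of range
    return seq[start:stop:step]


seq = [10, 20, 30, 40, 50]
print(safe_slice(seq, 3))          # [10, 20, 30]
print(safe_slice(seq, 1, 4))       # [20, 30, 40]
print(safe_slice(seq, 0, 5, 2))    # [10, 30, 50]

try:
    safe_slice(seq, 1, 7)          # stop is past the end of the list
except IndexError as exc:
    print("IndexError:", exc)
```

One edge case to be aware of: with a stop of 0 the endpoint check becomes seq[-1], so it inspects the last element and raises IndexError on an empty sequence.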
