Dissecting a string with a Python regex using named groups and substitution

I have a special case that I do not yet know how to handle. I want to parse a fixed-width string based on field_name / field_length pairs. To do this, I define the regex for each field as follows:

'(?P<%s>.{%d})' % (field_name, field_length) 

And this is repeated for all fields.
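
For illustration, here is a minimal sketch of how the per-field pieces might be joined into one compiled pattern. The `FIELDS` layout and the sample row are assumptions made up for this example, not part of the original code.

```python
import re

# Hypothetical layout: (field_name, field_length) pairs.
FIELDS = [("lang", 9), ("desc", 19), ("license", 12)]

# One named, fixed-width group per field, concatenated into a single pattern.
pattern = "".join("(?P<%s>.{%d})" % (name, length) for name, length in FIELDS)
compiled = re.compile(pattern)

# Build a sample row by padding each value to its field width.
row = "python".ljust(9) + "language".ljust(19) + "opensource".ljust(12)

m = compiled.match(row)
raw = m.groupdict()  # values still carry their right padding at this point
```

Each captured value is exactly as wide as its field, which is why a separate stripping step is needed afterwards.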

I also have a regex to remove spaces to the right of each field:

 self.re_remove_spaces = re.compile(' *$') 

Thus, I can get each field as follows:

 def dissect(self, str):
     data = {}
     m = self.compiled.search(str)
     for field_name in self.fields:
         # Fetch the named group, then strip its trailing spaces.
         value = m.group(field_name)
         value = re.sub(self.re_remove_spaces, '', value)
         data[field_name] = value
     return data

I need to do this processing for millions of rows, so it should be efficient.

It bothers me that I cannot do the dissection and the space removal in one step, using compiled.sub instead of compiled.search, but I don't know how to do it.

In particular, my question is:

How to perform regular expression substitution by combining it with named groups in Python regular expressions?

+4
2 answers

As I understand it, the fields sit next to each other in each row, as in a table, for example:

 name     description        license
 python   language           opensource
 windows  operating system   proprietary

So, assuming you know the length of each field in advance, you can do this much more simply without using a regular expression at all. (By the way, str is not a good name for a variable, since it shadows the built-in type str.)

 def dissect(text):
     data = {}
     for name, length in fields:
         # Slice off the next fixed-width field and drop its right padding.
         data[name] = text[:length].rstrip()
         text = text[length:]
     return data

Then, if fields = [('lang', 9), ('desc', 19), ('license', 12)] :

 >>> dissect('python   language           opensource  ')
 {'lang': 'python', 'license': 'opensource', 'desc': 'language'}

Is that what you are trying to do though?

+4

Why use sub at all when you can directly match the part you need?

You can use something like:

 (?P<name>.{0,N}(?<! )) 

But if each match must still consume exactly N characters, you can use a lookahead, for example:

 (?=(?P<name>.{0,N}(?<! ))).{N} 

Whether this is any better than stripping in a separate step is doubtful. You can try it and let us know.
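
A minimal sketch of the lookahead variant applied to several fields at once might look like this. The `FIELDS` layout, the helper name `field_pattern`, and the sample row are assumptions for illustration only.

```python
import re

# Hypothetical layout: (field_name, field_length) pairs.
FIELDS = [("lang", 9), ("desc", 19), ("license", 12)]

def field_pattern(name, length):
    # The lookahead captures up to `length` characters that do not end in a
    # space; the trailing `.{length}` then consumes the whole fixed-width field.
    return "(?=(?P<%s>.{0,%d}(?<! ))).{%d}" % (name, length, length)

compiled = re.compile("".join(field_pattern(n, l) for n, l in FIELDS))

# Sample row with every value padded to its field width.
row = "python".ljust(9) + "language".ljust(19) + "opensource".ljust(12)

# groupdict() returns the already-stripped values in a single match.
data = compiled.match(row).groupdict()
```

Here the captured groups come back without trailing spaces, so no second pass is needed. Whether that beats a plain slice plus rstrip() is something to measure on real data.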

These expressions will not match when a field consists entirely of spaces and the character before it is also a space. If you need this case to work, you can add | at the end of the group, so that it can also match the empty string:

 (?P<name>.{0,N}(?<! )|) 
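
The blank-field case can be checked with a small sketch like the following. The `FIELDS` layout and the `build` helper are hypothetical names introduced for this example; the row has an all-space middle field preceded by padding from the first field.

```python
import re

# Hypothetical layout: (field_name, field_length) pairs.
FIELDS = [("lang", 9), ("desc", 19), ("license", 12)]

def build(fields, allow_blank):
    # With allow_blank, each group gets a trailing "|" so it may match "".
    alt = "|" if allow_blank else ""
    parts = ["(?=(?P<%s>.{0,%d}(?<! )%s)).{%d}" % (n, l, alt, l)
             for n, l in fields]
    return re.compile("".join(parts))

# The middle field is all spaces, and the character before it is a space too.
row = "python".ljust(9) + " " * 19 + "opensource".ljust(12)

strict = build(FIELDS, allow_blank=False).match(row)   # no match for this row
lenient = build(FIELDS, allow_blank=True).match(row)   # blank field becomes ""
```

In this sketch the strict pattern fails on the row as a whole, while the lenient one yields an empty string for the blank field.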
0

Source: https://habr.com/ru/post/1400962/

