Most efficient way to find a substring in a string in Python?

I need to search a rather long string for CPV codes (Common Procurement Vocabulary).

I'm currently doing this with simple loops and str.find().

The problem is that if a CPV code is written in a slightly different format, this algorithm will not find it.

What is the most efficient way to find all the different variants of a code within a string? Is this just a case of reformatting each of up to 10,000 CPV codes and using str.find() for each variant?

The same code may be formatted in any of the following ways:

30124120-1 
301241201 
30124120 - 1
30124120 1
30124120.1

etc.

Thanks :)

+3
3 answers

Try regex:

>>> import re
>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']

(Adjust the pattern until it closely matches the CPV codes in your data.)
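One possible way to tighten this and to avoid running str.find() once per code: match any CPV-shaped token, normalize it to a single canonical form, and check it against a set of known codes. This is only a sketch under assumptions: the shape (eight digits, an optional hyphen or period possibly padded with spaces, one check digit) is inferred from the examples in the question, and find_known_cpv and the known set are hypothetical names made up for illustration.

```python
import re

# Assumed shape: 8 digits, an optional hyphen or period (possibly padded
# with whitespace), then one check digit. Lookarounds keep the match from
# starting or ending in the middle of a longer run of digits.
CPV_RE = re.compile(r'(?<!\d)(\d{8})\s*[-.]?\s*(\d)(?!\d)')

def find_known_cpv(text, known_codes):
    """Return canonical 'NNNNNNNN-N' codes from text that are in known_codes."""
    found = []
    for main, check in CPV_RE.findall(text):
        code = f"{main}-{check}"      # normalize to one canonical format
        if code in known_codes:       # set lookup instead of one find() per code
            found.append(code)
    return found

# Hypothetical sample; in practice this would hold the full list of codes.
known = {"30124120-1", "21966823-1"}
text = "foo 30124120 - 1 bar 301241201 baz 99999999.9"
print(find_known_cpv(text, known))
```

The text is scanned once, and each candidate costs one set lookup, so the number of known codes has little effect on the running time.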

+4

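Use the re module (Python's regular expressions). See the docs.

For example, re.findall should do it. It would also help to know the exact CPV format (have you tried looking it up on Google?)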

+1
import re

cpv = re.compile(r'(\d{8})(?:[ .\t/\\-]*)(\d{1}\b)')
ex = "30124120-1 301241201 30124120 - 1 30124120 1 30124120.1"  # samples from the question

for m in cpv.finditer(ex):
    cpval, chk = m.groups()
    print("{0}-{1}".format(cpval, chk))  # print in one normalized format

Applied to your sample data, this prints

30124120-1
30124120-1
30124120-1
30124120-1
30124120-1

The regular expression can be read as

(\d{8})         # eight digits

(?:             # followed by a sequence which does not get returned
  [ .\t/\\-]*   #   consisting of 0 or more
)               #   spaces, periods, tabs, forward- or backslashes, hyphens
                #   (the hyphen comes last in the class so it is read literally, not as a range)

(\d{1}\b)       # followed by one digit, ending at a word boundary
                #   (i.e. before a non-word character or at the end of the string)
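The annotated breakdown can also be kept inside the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and # comments outside character classes. A minimal sketch, assuming the same eight-digits-plus-check-digit shape; cpv_verbose and samples are names chosen here for illustration:

```python
import re

cpv_verbose = re.compile(r"""
    (\d{8})         # eight digits
    (?:             # non-capturing group: matched but not returned
      [ .\t/\\-]*   #   zero or more separator characters
    )
    (\d\b)          # one check digit, ending at a word boundary
""", re.VERBOSE)

# The formatting variants listed in the question.
samples = ["30124120-1", "301241201", "30124120 - 1", "30124120 1", "30124120.1"]

# Join the two captured groups with a hyphen to get one canonical form.
normalized = ["-".join(cpv_verbose.search(s).groups()) for s in samples]
print(normalized)
```

The space inside the character class is still significant in verbose mode, so the pattern should behave the same as a one-line version with the same character class.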

Hope this helps!

+1

Source: https://habr.com/ru/post/1785038/

