Python - how to split a string on non-alpha characters

I am trying to use Python to parse lines of C++ source code. The only thing I am interested in is include directives.

    #include "header.hpp"

I want it to be flexible and still work with poor coding styles, such as:

          #   include"header.hpp"  

I have got to the point where I can read lines and trim the whitespace before and after the #. However, I still need to find out which directive it is by reading the line until a non-alpha character is found, regardless of whether that is a space, quote, tab or angle bracket.

So basically my question is: how can I split a string that starts with alpha characters at the first non-alpha character?

I think I could do this with a regex, but I did not find anything in the documentation that looks like what I want.

Also, if someone has tips on how to get the file name inside quotation marks or angle brackets, that would be a plus.

+8
9 answers

You can do this with a regex. However, you can also use a simple while loop.

def splitnonalpha(s):
    # Start at index 1 so the leading '#' stays in the first part.
    pos = 1
    while pos < len(s) and s[pos].isalpha():
        pos += 1
    return (s[:pos], s[pos:])

Test:

>>> splitnonalpha('#include"blah.hpp"')
('#include', '"blah.hpp"')
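If the '#' and the whitespace around it have not been trimmed yet, the same loop can be wrapped in a small helper. This is only a sketch: the names split_leading_alpha and parse_directive are made up for illustration, and it assumes one directive per line.

```python
def split_leading_alpha(s):
    """Split s into (leading alphabetic run, remainder)."""
    pos = 0
    while pos < len(s) and s[pos].isalpha():
        pos += 1
    return s[:pos], s[pos:]


def parse_directive(line):
    """Return (directive name, rest of line), or None if the line is not a directive."""
    stripped = line.strip()
    if not stripped.startswith('#'):
        return None
    # Drop the '#' and any whitespace between it and the directive name.
    body = stripped[1:].lstrip()
    name, rest = split_leading_alpha(body)
    return name, rest.strip()


print(parse_directive('   #   include"header.hpp"  '))
# ('include', '"header.hpp"')
```

Note that str.isalpha() is Unicode-aware, so it also accepts non-ASCII letters.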
+5

Your regex instinct is correct.

import re
re.split('[^a-zA-Z]', string_to_split)

The part [^a-zA-Z] means "non-alphabetic characters."
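Since re.split splits at every non-alphabetic character, the question's case (cut only at the first one) may be easier with re.match. A minimal sketch, assuming the '#' and surrounding whitespace have already been stripped; the function name split_at_first_non_alpha is invented for this example:

```python
import re


def split_at_first_non_alpha(s):
    """Return (leading ASCII-letter run, remainder) of s."""
    m = re.match(r'[A-Za-z]*', s)  # always matches, possibly the empty string
    return s[:m.end()], s[m.end():]


directive, rest = split_at_first_non_alpha('include"header.hpp"')
print(directive)  # include
print(rest)       # "header.hpp"

# With maxsplit=1, re.split stops after the first separator,
# but the separator itself (here the opening quote) is dropped.
print(re.split('[^a-zA-Z]', 'include"header.hpp"', maxsplit=1))
# ['include', 'header.hpp"']
```

The re.match version keeps the delimiter in the remainder, which is useful if you later need to tell "..." from <...>.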

+19

You can use either re.split or re.findall:

>>> import re
>>> re.split(r'\W+', '#include "header.hpp"')
['', 'include', 'header', 'hpp', '']
>>> re.findall(r'\w+', '#include "header.hpp"')
['include', 'header', 'hpp']

A quick benchmark:

>>> import timeit
>>> setup = "import re; word_pattern = re.compile(r'\w+'); sep_pattern = re.compile(r'\W+')"
>>> iterations = 10**6
>>> timeit.timeit("re.findall(r'\w+', '#header foo bar!')", setup=setup, number=iterations)
3.000092029571533
>>> timeit.timeit("word_pattern.findall('#header foo bar!')", setup=setup, number=iterations)
1.5247418880462646
>>> timeit.timeit("re.split(r'\W+', '#header foo bar!')", setup=setup, number=iterations)
3.786440134048462
>>> timeit.timeit("sep_pattern.split('#header foo bar!')", setup=setup, number=iterations)
2.256173849105835

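As the timings show, re.split is slower. To get the same output as re.findall, you also have to filter out the empty strings:

>>> list(filter(bool, re.split(r'\W+', '#include "header.hpp"')))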
['include', 'header', 'hpp']
+4

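You can use a regex. \W matches any non-word character (roughly the same as non-alphanumeric). Word characters are A-Z, a-z, 0-9 and _. To match the underscore as well, use [\W_].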

>>> import re
>>> line = '#   include"header.hpp"  ' 
>>> m = re.match(r'^\s*#\s*include\W+([\w\.]+)\W*$', line)
>>> m.group(1)
'header.hpp'
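The pattern above accepts any non-word characters around the name and does not allow paths such as sys/types.h. If you need to know whether the header was quoted or in angle brackets, a variant along these lines may help. This is a sketch, not a full preprocessor parser; INCLUDE_RE and parse_include are hypothetical names.

```python
import re

# Group 1 captures a "quoted" name, group 2 an <angle-bracket> name.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*(?:"([^"]+)"|<([^>]+)>)')


def parse_include(line):
    """Return (filename, 'quoted' or 'angle'), or None if line is not an include."""
    m = INCLUDE_RE.match(line)
    if m is None:
        return None
    if m.group(1) is not None:
        return m.group(1), 'quoted'
    return m.group(2), 'angle'


print(parse_include('#   include"header.hpp"  '))  # ('header.hpp', 'quoted')
print(parse_include('#include <sys/types.h>'))     # ('sys/types.h', 'angle')
print(parse_include('#define FOO 1'))              # None
```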
+2

import re
s = 'foo bar- blah/hm.lala'
print(re.findall(r"\w+",s))

Output: ['foo', 'bar', 'blah', 'hm', 'lala']

+1

An example that matches the whole directive:

import re

test_str = '    #   include "header.hpp"'

# Captures the quoted file name, including the quotes.
match = re.match(r'\s*#\s*include\s*("[\w.]*")', test_str)
if match:
    print(match.group(1))
0

Alternatively, a pattern such as:

(?m)^\h*#\h*include\h*["<](\w[\w.]*)\h*[">]

Here (?m) turns on multiline mode and \h matches horizontal whitespace (aka [^\S\r\n]).
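Python's built-in re module does not accept \h, so in Python the pattern has to be spelled with [^\S\r\n] instead. A rough translation, where the names H and INCLUDE_RE and the sample source text are made up for the example:

```python
import re

# Horizontal whitespace: any whitespace except line breaks.
H = r'[^\S\r\n]'

INCLUDE_RE = re.compile(
    r'^' + H + r'*#' + H + r'*include' + H + r'*["<](\w[\w.]*)' + H + r'*[">]',
    re.MULTILINE,  # same effect as the inline (?m) flag
)

source = '#include "a.hpp"\n  #  include<vector>\nint main() {}\n'
print(INCLUDE_RE.findall(source))
# ['a.hpp', 'vector']
```

Because of re.MULTILINE, ^ matches at the start of every line, so the whole file can be scanned in one call instead of line by line.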

0

Using re.split with a compiled pattern:

import re

# '\W' matches any non-word character; '+' matches one or more of them, as many as possible (greedy)
pattern = re.compile(r'\W+')
string = 'digital camera, LCD TV, books, DVD, low prices, video games, pc games, software, electronics, home, garden, video, amazon'

result = re.split(pattern, string)

print(result)
# ['digital', 'camera', 'LCD', 'TV', 'books', 'DVD', 'low', 'prices', 'video', 'games', 'pc', 'games', 'software', 'electronics', 'home', 'garden', 'video', 'amazon']
0

import re
re.split('[^a-zA-Z0-9]', string_to_split)

This splits on every non-alphanumeric character.

-1

Source: https://habr.com/ru/post/1627492/

