Remove everything that follows the first character that is not a letter in a string in Python

There are several questions about stripping non-alphanumeric characters from a string using regex. I want something different: cut the string at the first character that is not a letter or a single space, and drop everything from that point on, letters included. Digits and double spaces both count as cut points.

For instance:

My string is #not very beautiful 

should become

My string is

or

Are you 9 years old?

should become

Are you

and

this is the last  example

should become

this is the last

How can I do this?

+4
6 answers

How about splitting on [^A-Za-z ]|   (the alternative after the pipe is two spaces) and taking the first element? You can strip any leftover whitespace afterwards:

import re
re.split("[^A-Za-z ]|  ", "My string is #not very beautiful")[0].strip()
# 'My string is'

re.split("[^A-Za-z ]|  ", "this is the last  example")[0].strip()
# 'this is the last'

re.split("[^A-Za-z ]|  ", "Are you 9 years old?")[0].strip()
# 'Are you'

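The pattern has two alternatives: [^A-Za-z ] matches a single character that is neither a letter nor a space, and the part after the pipe matches two consecutive spaces. Splitting on whichever comes first leaves the wanted text as the first element.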
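If this is needed in more than one place, the split can be wrapped in a small helper. The following is only a sketch: the names _STOP and truncate_at_first_invalid are made up for illustration and do not come from the answer; the pattern itself is the one shown above.

```python
import re

# Hypothetical names; only the pattern comes from the answer above.
# Alternative 1: any character that is not an ASCII letter or a space.
# Alternative 2: two consecutive spaces.
_STOP = re.compile(r"[^A-Za-z ]|  ")


def truncate_at_first_invalid(text):
    """Return the part of text before the first stop pattern, stripped."""
    # maxsplit=1 stops splitting after the first separator is found
    return _STOP.split(text, maxsplit=1)[0].strip()


for sample in (
    "My string is #not very beautiful",
    "Are you 9 years old?",
    "this is the last  example",
):
    print(repr(truncate_at_first_invalid(sample)))
```

Note that strip() also removes leading whitespace; use rstrip() instead if leading spaces should be kept.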

+5

A version without regex, using itertools.takewhile:

import itertools
import string

def rstrip(s, whitelist=None):
    if whitelist is None:
        whitelist = set(string.ascii_letters + ' ')  # default whitelist: the letters A-Z and a-z plus a space
    # Split on a double space and keep the first piece (this works even if there is no double space in the string),
    # use `itertools.takewhile` to keep the leading characters that are in the whitelist,
    # then use `join` to combine them into a single string.
    # Note: a single trailing space is kept, e.g. 'Are you '.

    return ''.join(itertools.takewhile(whitelist.__contains__, s.split('  ', 1)[0]))

+1

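import re

str1 = "this is the last  example"
# One or more letters, each optionally preceded by a single whitespace character
regex = re.compile(r"(([a-zA-Z]|(\s[a-zA-Z]))+)")
capture = regex.match(str1)
# match() returns None when the string does not start with a letter (or whitespace plus a letter)
res = capture.group(1) if capture else ""
# 'this is the last'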

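The pattern matches, from the start of the string, a run of letters in which each letter may be preceded by one whitespace character. Matching stops at the first character that is not a letter, and also at a double space, because a whitespace character only matches when a letter follows it directly.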

+1

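def truncate_nonalpha_space(s):
    head = s.split("  ")[0]  # part before the first double space
    cut = next((x for x, a in enumerate(head) if not a.isalpha() and a != " "), len(head))
    return head[:cut].rstrip()  # slice at the first invalid character, or keep the whole head

Explanation:

  • The generator yields the index of every character that fails .isalpha() and is not " ".

  • s is first split on "  " (two spaces), so only the part before the first double space is scanned.

  • next() takes the first such index, with len(head) as the default when every character is allowed.

  • The head is sliced at that index; if there is no such index, the whole head, head[:len(head)], is returned. Either way, .rstrip() removes any trailing space.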

0
^.+?(?=[^A-Za-z ]|$|\s{2})

You can capture the result with this pattern. Use re.findall to extract it.

See the demo.

https://regex101.com/r/INzotJ/1
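For readers who want to try the pattern without the demo page, here is a minimal sketch. The helper name leading_letters is hypothetical, and it uses match() rather than re.findall because only one result is wanted. The pattern can leave one trailing space before the cut point, so the sketch adds rstrip(), which is an addition of this sketch and not part of the answer.

```python
import re

# Pattern from the answer: lazily consume characters until the next one is
# a non-letter/non-space, the end of the string, or two whitespace characters.
_PREFIX = re.compile(r"^.+?(?=[^A-Za-z ]|$|\s{2})")


def leading_letters(text):  # hypothetical helper name
    match = _PREFIX.match(text)
    # match is None for an empty string, because .+? needs at least one character
    return match.group(0).rstrip() if match else ""


for sample in (
    "My string is #not very beautiful",
    "Are you 9 years old?",
    "this is the last  example",
):
    print(repr(leading_letters(sample)))
```

One caveat: because .+? must consume at least one character, a string that starts with a disallowed character is not cut there. For example, "#abc" comes back unchanged.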

0

Hacky, but it uses yield:

import string

li_test = [
    ("My string is #not very beautiful","My string is"),
    ("Are you 9 years old?","Are you "),
    ("this is the last  example","this is the last "),
]

tolerated = string.ascii_letters

def rstrip_(s_in):
    last = None
    for char in s_in:
        if char in tolerated:
            last = char
            yield char
        elif char == ' ':
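            if last == ' ':
                return  # second consecutive space: stop (raising StopIteration in a generator is a RuntimeError since PEP 479)
            last = char
            yield char
        else:
            return  # any other character ends the output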

for input_, exp in li_test:
    got = "".join(rstrip_(input_))
    msg = ":%s:<>:%s:" % (exp, got)
    print(":%s:=>:%s:" % (input_, got))
    # Cheating a bit because I don't know whether the trailing space is wanted.
    assert exp.rstrip() == got.rstrip(), msg

output:

 :My string is #not very beautiful:=>:My string is :
 :Are you 9 years old?:=>:Are you :
 :this is the last  example:=>:this is the last :

And yes, I had to wrap it all in a second function and join the characters there ...

-1

Source: https://habr.com/ru/post/1665946/

