Parse from the 4th capital letter of a string in Python?

Question

How can I parse lines of text from the 4th occurrence of a capital letter? For example, for strings:

adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ

I would like to capture:

    ZsdalkjgalsdkjTlaksdjfgasdkgj
    PlsdakjfsldgjQ

I'm sure there is probably a better way than regular expressions, but I tried a non-greedy match, something like this:

    match = re.search(r'[A-Z].*?$', line).group()

+4

python

drbunsen Jan 18 '12 at 16:25

11 answers

Avoiding regular expressions may look a little verbose here, but at the bytecode level this is a very simple algorithm, and therefore lightweight.

Regexps may well be faster, since they are implemented in native code, but the “one obvious way to do it”, boring as it is, certainly beats any suitable regexp in readability:

    def find_capital(string, n=4):
        count = 0
        for index, letter in enumerate(string):
            # The boolean value counts as 0 for False or 1 for True
            count += letter.isupper()
            if count == n:
                return string[index:]
        return ""
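As a quick check, the function above can be run against the two lines from the question. This is a minimal sketch for Python 3; the function is restated only so the snippet runs on its own, and the last two calls are extra illustrative inputs that do not come from the question.

```python
def find_capital(string, n=4):
    """Return the part of `string` that starts at its n-th uppercase letter."""
    count = 0
    for index, letter in enumerate(string):
        # bool is a subclass of int, so True adds 1 and False adds 0
        count += letter.isupper()
        if count == n:
            return string[index:]
    return ""


lines = [
    "adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
    "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ",
]
for line in lines:
    print(find_capital(line))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
# PlsdakjfsldgjQ

print(repr(find_capital("abcDef")))   # fewer than four capitals -> ''
print(find_capital("ABCDEFGH", n=2))  # BCDEFGH
```

Since `n` is a parameter, the same loop handles "from the N-th capital letter" without any change to the logic.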

+4

jsbueno Jan 18 '12 at 16:40

I find it simpler to use a regular expression to split the string and then slice the resulting list:

    import re

    text = ["adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj",
            "oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ"]

    for t in text:
        # The capturing group keeps each capital letter in the result list,
        # so index 7 is the fourth capital letter.
        print("".join(re.split("([A-Z])", t, maxsplit=4)[7:]))

Conveniently, this gives you an empty string if there are not enough capital letters.
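To see why index 7 is the right place to slice, it may help to look at the list that `re.split` returns when the pattern contains a capturing group. The short strings below are made up for illustration and are not from the question.

```python
import re

# With a capturing group, the separators (the capitals) stay in the result,
# alternating with the text between them.
parts = re.split("([A-Z])", "aXbYcZdWe", maxsplit=4)
print(parts)               # ['a', 'X', 'b', 'Y', 'c', 'Z', 'd', 'W', 'e']
print("".join(parts[7:]))  # We

# With fewer than four capitals the list is shorter than 8 items,
# so the slice is empty and the join yields an empty string.
short = re.split("([A-Z])", "aXbYc", maxsplit=4)
print(short)                      # ['a', 'X', 'b', 'Y', 'c']
print(repr("".join(short[7:])))   # ''
```

The capitals sit at the odd indices 1, 3, 5 and 7, which is why the fourth one is found at index 7.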

+3

kindall Jan 18 '12 at 18:49

A nice one-line solution could be:

    >>> import re
    >>> s1 = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
    >>> s2 = 'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ'
    >>> s1[list(re.finditer('[A-Z]', s1))[3].start():]
    'ZsdalkjgalsdkjTlaksdjfgasdkgj'
    >>> s2[list(re.finditer('[A-Z]', s2))[3].start():]
    'PlsdakjfsldgjQ'

Why does it work (all in one line)?

Finds all capital letters in the string: list(re.finditer('[A-Z]', s1))
Picks the fourth capital letter found: [3]
Returns the position of the fourth capital letter: .start()
Using slice notation, we get the part of the string we need: s1[position:]
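One caveat: indexing with `[3]` raises `IndexError` when a line has fewer than four capital letters. A possible variation, sketched here with a hypothetical helper name `from_nth_capital`, uses `itertools.islice` to stop at the wanted match and fall back to an empty string.

```python
import re
from itertools import islice


def from_nth_capital(s, n=4):
    """Hypothetical helper: slice `s` from its n-th capital letter, or return ''."""
    # islice lazily skips the first n-1 matches; next() returns the default
    # (None) when the iterator is exhausted instead of raising an exception.
    match = next(islice(re.finditer('[A-Z]', s), n - 1, None), None)
    return s[match.start():] if match else ''


print(from_nth_capital('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
print(repr(from_nth_capital('only Two Capitals')))  # ''
```

This also avoids building a list of every match when only the fourth one is needed.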

+2

juliomalegria Jan 18 '12 at 16:57

I believe this will work for you, and be fairly easy to extend in the future:

    import re

    check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
    print(re.match('([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check).group(2))

The first part of the regular expression, ([^A-Z]*[A-Z]){3}, is the real key: it matches the first three uppercase letters together with the characters before each of them (group 1 retains only the last repetition). After that we skip any number of non-uppercase characters following the third uppercase letter, and finally we capture the rest of the line in group 2.
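The group behaviour described above can be checked directly, and the repeat count is the natural place to extend the pattern. The following is a sketch only; `after_nth_capital` is a hypothetical helper that is not part of the answer, and it uses a non-capturing group so that the tail becomes group 1.

```python
import re

check = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'
m = re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', check)
print(m.group(1))  # U  (a repeated group keeps only its last repetition)
print(m.group(2))  # ZsdalkjgalsdkjTlaksdjfgasdkgj


def after_nth_capital(line, n=4):
    """Hypothetical generalisation: text from the n-th capital letter, or ''."""
    # Skip n-1 capitals (with whatever precedes them), then capture from the n-th.
    pattern = r'(?:[^A-Z]*[A-Z]){%d}[^A-Z]*([A-Z].*)' % (n - 1)
    found = re.match(pattern, line)
    return found.group(1) if found else ''


print(after_nth_capital(check, n=5))         # Tlaksdjfgasdkgj
print(repr(after_nth_capital('abc', n=4)))   # ''
```

Returning an empty string on a failed match is a design choice made for this sketch; the original one-liner raises `AttributeError` in that case because `re.match` returns `None`.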

+1

Mike buland Jan 18 '12 at 16:41

Testing the various methods. I had originally written string_after_Nth_upper and not posted it; jsbueno's method turned out to be similar, except that by doing an addition/comparison for every character (even lowercase letters) his method is slightly slower.

    import re

    s = 'adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

    def string_after_Nth_upper(your_str, N=4):
        upper_count = 0
        for i, c in enumerate(your_str):
            if c.isupper():
                upper_count += 1
                if upper_count == N:
                    return your_str[i:]
        return ""

    def find_capital(string, n=4):
        count = 0
        for index, letter in enumerate(string):
            # The boolean value counts as 0 for False or 1 for True
            count += letter.isupper()
            if count == n:
                return string[index:]
        return ""

    def regex1(s):
        return re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)

    def regex2(s):
        return re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', s).group(2)

    def regex3(s):
        return s[list(re.finditer('[A-Z]', s))[3].start():]

    if __name__ == '__main__':
        from timeit import Timer
        t_simple = Timer("string_after_Nth_upper(s)",
                         "from __main__ import s, string_after_Nth_upper")
        print('simple:', t_simple.timeit())
        t_jsbueno = Timer("find_capital(s)",
                          "from __main__ import s, find_capital")
        print('jsbueno:', t_jsbueno.timeit())
        t_regex1 = Timer("regex1(s)", "from __main__ import s, regex1; import re")
        print("Regex1:", t_regex1.timeit())
        t_regex2 = Timer("regex2(s)", "from __main__ import s, regex2; import re")
        print("Regex2:", t_regex2.timeit())
        t_regex3 = Timer("regex3(s)", "from __main__ import s, regex3; import re")
        print("Regex3:", t_regex3.timeit())

Results:

    Simple: 4.80558681488
    jsbueno: 5.92122507095
    Regex1: 3.21153497696
    Regex2: 2.80767202377
    Regex3: 6.64155721664

So regex2 is the fastest.
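Timings like these depend heavily on the machine, the Python version and the input, so they are worth re-measuring rather than taking as fixed. Below is a compact sketch of how such a comparison might be repeated on Python 3; the names `loop_version` and `regex_version` and the precompiled `PATTERN` are assumptions made for this illustration, not the functions timed above.

```python
import re
import timeit

s = 'adsasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

# Compiling once keeps pattern lookup out of the measured call.
PATTERN = re.compile(r'(?:[^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)')


def loop_version(text, n=4):
    """Plain loop: count capitals and slice at the n-th one."""
    count = 0
    for i, c in enumerate(text):
        if c.isupper():
            count += 1
            if count == n:
                return text[i:]
    return ""


def regex_version(text):
    """Regex: skip three capitals, capture from the fourth."""
    m = PATTERN.match(text)
    return m.group(1) if m else ""


# Sanity check before timing: both approaches must agree.
assert loop_version(s) == regex_version(s)

for func in (loop_version, regex_version):
    # Take the best of three runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: func(s), number=10_000, repeat=3))
    print(f"{func.__name__}: {best:.4f} s per 10,000 calls")
```

No expected timings are shown because they will differ from run to run.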

+1

dr jimbob Jan 18 '12 at 17:12

This is not the most beautiful approach, but:

    re.match(r'([^A-Z]*[A-Z]){3}[^A-Z]*([A-Z].*)', line).group(2)

0

Paul eastlund Jan 18 '12 at 16:29

    import re

    strings = [
        'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
        'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
    ]

    for s in strings:
        # Three (lowercase run, capital) pairs, then capture from the fourth capital
        m = re.match('[a-z]*[A-Z][a-z]*[A-Z][a-z]*[A-Z][a-z]*([A-Z].+)', s)
        if m:
            print(m.group(1))

0

Paulo scardine Jan 18 '12 at 16:30

Parsing almost always involves regular expressions. However, a regular expression by itself does not make a parser. In the simplest sense, a parser consists of:

 text input stream -> tokenizer

Usually it has an extra step:

 text input stream -> tokenizer -> parser

The tokenizer handles the input stream and collects text appropriately, so that the programmer does not have to think about it. It consumes text elements until only one match is available, and then runs the code associated with that “token”. If you do not have a tokenizer, you have to roll it by hand (in pseudocode):

    while stuffInStream:
        currChars += getNextCharFromString()
        if regex('firstCase'):
            do stuff
        elif regex('other stuff'):
            do more stuff

This kind of loop code is full of gotchas unless you write such loops all the time. It is also easy to have a computer generate it from a set of rules, which is how Lex/flex works. You can have rules associated with tokens and pass each token to yacc/bison as your parser, which adds structure.

Note that a lexer is just a state machine. It can do anything when it moves from state to state. I have written lexers that were used to strip characters from the input stream, open files, print text, send email, and so on.
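To make the tokenizer idea concrete without pulling in a lexer generator, here is a minimal sketch that builds one from a set of rules using only the standard `re` module. The rule names (`UPPER`, `LOWER`, `SPACE`, `OTHER`) and both function names are hypothetical, chosen for this illustration.

```python
import re

# Hypothetical token rules: (name, pattern) pairs, tried in this order.
TOKEN_RULES = [
    ("UPPER", r"[A-Z]"),
    ("LOWER", r"[a-z]+"),
    ("SPACE", r"\s+"),
    ("OTHER", r"."),
]
# One master pattern with a named group per rule.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(text):
    """Yield (kind, value, position) for each token found in `text`."""
    for m in MASTER.finditer(text):
        # lastgroup is the name of the rule that matched this token
        yield m.lastgroup, m.group(), m.start()


def from_nth_upper(text, n=4):
    """The 'parser' step: a tiny state machine that counts UPPER tokens."""
    seen = 0
    for kind, _value, pos in tokenize(text):
        if kind == "UPPER":
            seen += 1
            if seen == n:
                return text[pos:]
    return ""


print(list(tokenize("abC d")))
# [('LOWER', 'ab', 0), ('UPPER', 'C', 2), ('SPACE', ' ', 3), ('LOWER', 'd', 4)]
print(from_nth_upper("adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj"))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
```

The point of the split is that new rules go into `TOKEN_RULES` while the consuming loop stays the same, which is the separation a generated lexer provides at larger scale.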

So, if you just want to collect the text after the fourth capital letter, a regular expression is not only suitable, it is the right solution. BUT, if you want to parse text input, with different rules for what to do and an unknown amount of input, then you need a lexer/parser. I suggest PLY, since you are using Python.

0

Spencer rathbun Jan 18 '12 at 16:43

    caps = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    temp = ''
    for char in inputStr:  # inputStr is the line being examined
        if char in caps:
            temp += char
            if len(temp) == 4:
                print(temp[-1])  # this is the answer that you are looking for
                break

Alternatively, you can use re.sub to get rid of everything that is not a capital letter and take the 4th character of what is left.
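The `re.sub` alternative might look like the sketch below. The variable names are assumptions for illustration, and the length check is added so that short inputs do not raise `IndexError`.

```python
import re

input_str = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

# Delete every character that is not an ASCII capital letter.
capitals_only = re.sub(r'[^A-Z]', '', input_str)
print(capitals_only)  # YUUZT

# The 4th capital, if the line has at least four of them.
fourth = capitals_only[3] if len(capitals_only) >= 4 else None
print(fourth)  # Z
```

Note that this recovers only the letter itself, not its position. Because capitals can repeat (as `U` does here), searching for the letter afterwards would not reliably locate the fourth capital, so getting the rest of the line still requires one of the position-based approaches.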

0

inspectorG4dget Jan 18 '12 at 16:56

Another version, not so pretty, but it does the job.

    def stringafter4thupper(s):
        i, r = 0, ''
        for c in s:
            # Count capitals until the fourth one has been seen
            if c.isupper() and i < 4:
                i += 1
            # From the fourth capital onwards, collect every character
            if i == 4:
                r += c
        return r

Examples:

    stringafter4thupper('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj')
    stringafter4thupper('oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ')
    stringafter4thupper('')
    stringafter4thupper('abcdef')
    stringafter4thupper('ABCDEFGH')

The corresponding results:

    'ZsdalkjgalsdkjTlaksdjfgasdkgj'
    'PlsdakjfsldgjQ'
    ''
    ''
    'DEFGH'

0

Carlos Quintanilla Jan 18 '12 at 17:37

NPE · Accepted Answer · 2012-01-18T16:29:55+0000

I present two approaches.

Approach 1: a lengthy regex

    In [1]: import re

    In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

    In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
    Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

.*?[A-Z] consumes characters up to, and including, the first uppercase letter.

(?: ... ){3} repeats this three times without creating a capture group.

The following .*? matches the remaining characters before the fourth uppercase letter.

Finally, ([A-Z].*) captures the fourth uppercase letter and everything that follows it into a capture group.
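The same pattern can be written with `re.VERBOSE` so that each piece carries its explanation inline. This is a sketch; the names `FOURTH_CAPITAL` and `tail_from_fourth_capital` are hypothetical, and returning `None` on a failed match is a choice made here rather than part of the answer.

```python
import re

FOURTH_CAPITAL = re.compile(r"""
    (?: .*? [A-Z] ){3}   # three times: lazily skip up to and including a capital
    .*?                  # lazily skip whatever precedes the fourth capital
    ( [A-Z] .* )         # capture the fourth capital and the rest of the line
""", re.VERBOSE)


def tail_from_fourth_capital(line):
    """Return the text from the fourth capital onwards, or None if there is none."""
    m = FOURTH_CAPITAL.match(line)
    return m.group(1) if m else None


print(tail_from_fourth_capital('adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
print(tail_from_fourth_capital('only Three Capital Letters'))
# None
```

Checking the match object before calling `.group(1)` matters because `re.match` returns `None` for lines with fewer than four capitals.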

Approach 2: a simpler regex

    In [1]: import re

    In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

    In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
    Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

This attacks the problem directly, and it seems to me that it is easier to read.
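To see what the second approach does step by step, it helps to print the intermediate list. The short input strings here are made up for illustration.

```python
import re

# Each chunk starts at a capital letter and runs up to the next capital.
chunks = re.findall(r'[A-Z][^A-Z]*', 'abXcdYZefWghVi')
print(chunks)               # ['Xcd', 'Y', 'Zef', 'Wgh', 'Vi']

# Dropping the first three chunks leaves everything from the fourth capital on.
print(''.join(chunks[3:]))  # WghVi

# With fewer than four capitals the slice is empty, so the result is ''.
print(repr(''.join(re.findall(r'[A-Z][^A-Z]*', 'abXcdY')[3:])))  # ''
```

Any text before the first capital is never part of a chunk, which is why no special handling of the leading lowercase run is needed.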
