Python re.search() and re.findall()

I am trying to solve this problem from HackerRank. It is a machine learning problem. As a first step, I tried to read all the words from a corpus file to build unigram frequencies. According to the problem, a word is defined as follows:

A word is a sequence of characters containing only letters from a to z (lowercase only); it may also contain hyphens ( - ) and apostrophes ( ' ). A word must begin and end with a lowercase letter.

I wrote a regular expression in Python as follows:

 pat = "[a-z]+( ['-]+[a-z]+ ){0,}"

I tried both re.search() and re.findall(), and I have problems with each.

  • The problem with re.findall():

     string = "HELLO W-O-R-L-D"

    re.findall() output:

     [('Hello', ''), ('W', '-D')]

    I could not get the word W-O-R-L-D. With re.search() I was able to get it correctly.

  • The problem with re.search():

     string = "123hello456world789" 

    re.search() output:

     'hello' 

    In this case, with re.findall() I could get both 'hello' and 'world'.
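
The two behaviors described above can be reproduced with a short sketch. It assumes the pattern from the question with the inner spaces removed, and the input string is only illustrative. re.search() returns the first match only, while re.findall() returns every match but, when the pattern contains a capturing group, it returns the group's text instead of the whole match. re.finditer() is shown as one way to get every whole match without changing the pattern.

```python
import re

# Assumed form of the question's pattern: [a-z] ranges, no inner spaces.
pat = r"[a-z]+(['-]+[a-z]+){0,}"
text = "123hello456w-o-r-l-d789"  # illustrative input, not from the question

# search() returns a match object for the first match only.
first = re.search(pat, text).group()

# findall() with one capturing group returns that group's text for each
# match: the last repetition, or '' if the group did not take part.
groups = re.findall(pat, text)

# finditer() yields match objects, so group() gives each whole match.
whole = [m.group() for m in re.finditer(pat, text)]

print(first)   # hello
print(groups)  # ['', '-d']
print(whole)   # ['hello', 'w-o-r-l-d']
```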

1 answer

As I posted on your previous question, you should use re.findall(). Regardless of that, your problem is that your regular expression is incorrect. See the example below:

 >>> import re
 >>> regex = re.compile(r'([a-z][a-z-\']+[a-z])')
 >>> regex.findall("HELLO W-O-R-L-D")  # this has uppercase
 []  # there are no results here, because the string is uppercase
 >>> regex.findall("HELLO W-O-R-L-D".lower())  # let's lowercase
 ['hello', 'w-o-r-l-d']  # now we have results
 >>> regex.findall("123hello456world789")
 ['hello', 'world']

As you can see, the reason your first example failed is that the string is in uppercase. You could simply add the re.IGNORECASE flag, although you mentioned that matches should be lowercase only.
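
To make the trade-off between lowercasing the input and using re.IGNORECASE concrete, here is a minimal sketch. The pattern in it is my own assumption based on the word definition quoted in the question, not the regex shown earlier in this answer, and the sample text is made up. It uses a non-capturing group, (?:...), so that findall() returns whole matches rather than group contents.

```python
import re

# Assumed pattern: one or more lowercase letters, optionally followed by
# runs of hyphens/apostrophes that are each followed by more letters.
# (?:...) is non-capturing, so findall() returns the whole match.
WORD = re.compile(r"[a-z]+(?:['-]+[a-z]+)*")

text = "HELLO W-O-R-L-D don't stop"  # made-up sample input

strict = WORD.findall(text)           # uppercase runs are skipped entirely
lowered = WORD.findall(text.lower())  # normalize first, then match

# Same pattern with the flag: matches keep their original casing.
WORD_ANYCASE = re.compile(WORD.pattern, re.IGNORECASE)
anycase = WORD_ANYCASE.findall(text)

print(strict)   # ["don't", 'stop']
print(lowered)  # ['hello', 'w-o-r-l-d', "don't", 'stop']
print(anycase)  # ['HELLO', 'W-O-R-L-D', "don't", 'stop']
```

Which variant is appropriate depends on how the problem wants uppercase input treated: lowercasing first yields lowercase tokens for counting, while the flag preserves the original casing in the results.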


Source: https://habr.com/ru/post/958917/

