I am trying to solve this problem from HackerRank. It is a machine learning problem. I initially tried to read all the words from a corpus file to build unigram frequencies. According to the problem, a word is defined as follows:
A word is a sequence of characters containing only letters from a to z (lowercase only), and it may contain hyphens (-) and apostrophes ('). The word should begin and end only with lowercase letters.
I wrote a regular expression in Python as follows:
pat = "[a-z]+(['-]+[a-z]+){0,}"
I tried using both re.search() and re.findall(), and I have problems with both.
The problem with re.findall():
string = "HELLO WORLD"
re.findall() output:
[('Hello', ''), ('W', '-D')]
I could not get the word WORLD. When using re.search(), I was able to get it correctly.
The problem with re.search():
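The tuple-like output above is consistent with how re.findall() treats capturing groups: when the pattern contains groups, it returns the group contents rather than the whole match. Here is a minimal sketch of that behavior, assuming the pattern is the lowercase-word pattern with one hyphen/apostrophe group; the sample text and variable names are hypothetical and not taken from the HackerRank input.

```python
import re

# Assumed pattern: lowercase letters, optionally continued by
# hyphen/apostrophe runs followed by more lowercase letters.
capturing = r"[a-z]+(['-]+[a-z]+){0,}"

# Same pattern, but the group is non-capturing: (?:...)
non_capturing = r"[a-z]+(?:['-]+[a-z]+)*"

text = "hello w-o-r-l-d"  # hypothetical sample input

# With a capturing group, findall returns only what the group
# captured last for each match (empty string if it never matched).
with_groups = re.findall(capturing, text)

# With a non-capturing group, findall returns the whole matches.
whole_matches = re.findall(non_capturing, text)

print(with_groups)
print(whole_matches)
```

With the non-capturing form, each list item is the full word, which is usually what a unigram count needs.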
string = "123hello456world789"
re.search() output:
'hello'
In this case, when using re.findall(), I could get both 'hello' and 'world'.
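For reference, re.search() stops at the first match, which is why only 'hello' comes back. A small sketch of the difference, using re.finditer() to collect the full text of every match; the pattern is my assumed non-capturing version of the one above, and the variable names are illustrative only.

```python
import re

# Assumed word pattern with a non-capturing group, so group(0)
# is always the complete word.
word_pattern = re.compile(r"[a-z]+(?:['-]+[a-z]+)*")

text = "123hello456world789"

# search() returns a match object for the first match only (or None).
first = word_pattern.search(text)
first_word = first.group(0) if first else None

# finditer() yields one match object per non-overlapping match.
all_words = [m.group(0) for m in word_pattern.finditer(text)]

print(first_word)
print(all_words)
```

Using finditer() with group(0) sidesteps the capturing-group behavior of findall() while still visiting every match.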