Matching names in the form Firstname Lastname with international characters

I am trying to extract first names, on the assumption that names have the form Firstname Lastname. The code below works well for that, but I would also like to match international names, for example Pär Åberg. I found some solutions, but unfortunately they do not work with Python's regular expression flavor. Does anyone know how to do this?

#!/usr/bin/python
# -*- coding: utf-8 -*- 
import re

text = """
This is a text containing names of people in the text such as 
Hillary Clinton or Barack Obama. My problem is with names that uses stuff 
outside A-Z like Swedish names such as Pär Åberg."""

# findall returns one tuple of groups per match; name[0] is the full name.
for name in re.findall(r"(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text):
    firstname = name[0].split()[0]
    print firstname
1 answer

You need an alternative regex library, such as the regex module from PyPI, because there you can use \p{L} (any Unicode letter) and \p{Lu} (any uppercase Unicode letter).

Then use

ur'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'

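With a Unicode string as the pattern, the UNICODE flag applies by default:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it's a bytestring.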
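
For reference, here is a rough sketch of how this could look in Python 3. That is an assumption on my part, since the code in the question is Python 2. It uses the third-party regex module when it is installed. Otherwise it falls back to the standard re module, with \p{Lu} approximated by a character class built from unicodedata. The names TEXT, uppercase_class and find_names are made up for illustration.

```python
import re
import sys
import unicodedata

try:
    import regex  # third-party module (pip install regex); supports \p{Lu}
except ImportError:
    regex = None  # not installed: use the standard-library fallback below

TEXT = """
This is a text containing names of people in the text such as
Hillary Clinton or Barack Obama. My problem is with names that use stuff
outside A-Z like Swedish names such as Pär Åberg."""


def uppercase_class():
    """Build a character class listing every code point in category Lu."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    return "[" + "".join(c for c in chars if unicodedata.category(c) == "Lu") + "]"


def find_names(text):
    """Return runs of two or more capitalized words, e.g. 'Pär Åberg'."""
    if regex is not None:
        # Non-capturing group, so findall returns the full match as a string.
        return regex.findall(r"\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+", text)
    # Fallback: re has no \p{Lu}, so substitute an explicit character class.
    upper = uppercase_class()
    pattern = upper + r"[\w-]*(?:\s+" + upper + r"[\w-]*)+"
    return re.findall(pattern, text)


names = find_names(TEXT)
first_names = [name.split()[0] for name in names]
print(first_names)  # ['Hillary', 'Barack', 'Pär']
```

In Python 3 the pattern is a str, so \w already matches letters such as ä in both modules. The only piece re lacks is \p{Lu}, which is why the fallback lists the uppercase letters explicitly.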

Source: https://habr.com/ru/post/1616089/

