How to reasonably parse a surname

Assuming the US naming convention of FirstName MiddleName(s) LastName,

What would be the best way to correctly parse a surname from a full name?

For instance:

    John Smith --> 'Smith'
    John Maxwell Smith --> 'Smith'
    John Smith Jr --> 'Smith Jr'
    John van Damme --> 'van Damme'
    John Smith, IV --> 'Smith, IV'
    John Mark Del La Hoya --> 'Del La Hoya'

... and the many other permutations of these.

3 answers

Probably the best answer here is not to try. Names are individual and idiosyncratic, and even if you limit yourself to the Western tradition, you can never be sure you have thought of every case. A friend of mine legally changed his name to a single word, and he had a hard time dealing with various institutions whose procedures could not cope with it. You are in the unusual position of writing the software that implements the procedure, so you have the opportunity to design something that will not annoy the crap out of people with unconventional names. Think about why you need to parse out the last name in the first place, and see whether there is something else you could do instead.

That said, as a purely technical matter, the best approach is probably to strip strings such as "Jr", "Junior", "III", ", III", etc. from the end of the string containing the name, and then take everything after the last space in the resulting string (the new one, after removing Jr and the like). This would not get, say, "Del La Hoya" from your example, but you cannot even count on a human to get that right. I make the reasonable guess that John Mark Del La Hoya's surname is "Del La Hoya" and not "Mark Del La Hoya" because I am a native English speaker and have some intuition about what Spanish surnames look like. If the name were, say, "Gauthip Yeidze Ka Illunyepsi", I would have absolutely no idea whether to count that Ka as part of the surname, because I have no idea what language it is.
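The heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `naive_surname` and the `SUFFIXES` set are hypothetical and not from any library, and the suffix list is deliberately incomplete.

```python
# Hypothetical, deliberately incomplete suffix list; extend it for your data.
SUFFIXES = {"jr", "sr", "junior", "senior", "ii", "iii", "iv"}


def naive_surname(full_name):
    """Strip trailing suffixes, then return the last remaining token."""
    # Treat commas as separators so "Smith, IV" splits into "Smith" and "IV".
    tokens = full_name.replace(",", " ").split()
    # Drop suffixes from the end, but never remove the only remaining token.
    while len(tokens) > 1 and tokens[-1].rstrip(".").lower() in SUFFIXES:
        tokens.pop()
    # Everything after the last space is simply the final token.
    return tokens[-1] if tokens else ""


for name in ["John Smith", "John Maxwell Smith", "John Smith Jr",
             "John Smith, IV", "John van Damme", "John Mark Del La Hoya"]:
    print("{} --> {}".format(name, naive_surname(name)))
```

As expected from the caveat above, this sketch returns only "Damme" and "Hoya" for the multi-word surnames, and it discards the suffix rather than keeping it as part of the surname.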


I agree with Tnekutippa here, but you should look into named entity recognition. It can help automate part of the process. However, as already noted, this is rather complicated. I am not sure whether the Stanford NER can extract first and last names out of the box, but a machine learning approach could be very useful for this task. The Stanford NER could be a good starting point, or you could try building your own classifiers and training examples.


I came across a library called "nameparser" at https://pypi.python.org/pypi/nameparser. It handles four of the six cases above:

    #!/usr/bin/env python
    from nameparser import HumanName


    def get_lname(somename):
        name = HumanName(somename)
        return name.last


    people_names = [
        ('John Smith', 'Smith'),
        ('John Maxwell Smith', 'Smith'),
        # ('John Smith Jr', 'Smith Jr'),
        ('John van Damme', 'van Damme'),
        # ('John Smith, IV', 'Smith, IV'),
        ('John Mark Del La Hoya', 'Del La Hoya')
    ]

    for name, target in people_names:
        print('{} --> {} <-- {}'.format(name, get_lname(name), target))
        assert get_lname(name) == target

Source: https://habr.com/ru/post/1369340/

