Parsing a person’s name

I have a bunch of human names. They are all "Western" names, and I only need American conventions and abbreviations (for example, Mr. instead of Sr. for señor). Unfortunately, the people I am sending things to did not enter their own names, so I cannot ask them what they would like to be called. I know each person's gender and full name, but I have not parsed the names any more specifically than that.

Some examples:

  • John Smith
  • John Smith, Jr.
  • John Smith Jr.
  • John Smith XIV
  • Dr. John Smith, Ph.D.

I would like to be able to parse parts of each name:

    name = Name.new("John Smith Jr.")
    name.first_name # => John
    name.greeting   # => Mr. Smith

By "greeting" (perhaps not the best term) I mean that for examples 1-4 I want "Mr. Smith". For example 5 I would like "Dr. Smith", but I will settle for "Mr. Smith".

A Ruby gem for this would be ideal. I was inspired to ask for something this unusual by Chronic, a Ruby gem that handles times in a remarkably human way: I can say "last Tuesday" and it comes up with something reasonable. A few rule-of-thumb algorithms would be enough.

I'm trying to deal with some of the problems presented in "Falsehoods Programmers Believe About Names".

+6
5 answers

Since you are limited to Western-style names, I think a few rules will get you most of the way:

  • If a comma appears, delete it and everything after it.
  • Keep deleting words from the beginning for as long as, after converting to lower case and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } plus any others you can think of. Using a table of title "scores" (for example, [mr=1, mrs=1, rev=2, dr=3, prof=4], ordered however you wish), record the highest-scoring title that gets deleted.
  • Keep deleting words from the end for as long as they belong to the set { jr phd } or are Roman numerals with a value of about 50 or less ( /[XVI]+/ is probably a good enough regular expression).
  • If one or more titles with a non-zero score were deleted in step 2, use the highest-scoring one. Otherwise use "Mr." or "Mrs." according to the gender provided.
  • As the surname, use the last remaining word.
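The rules above can be sketched in Ruby roughly as follows. This is only an illustration, not an existing gem: the `greeting` method, the constant names, the `:male`/`:female` symbols and the scores given to "miss" and "ms" are all assumptions made for the example.

```ruby
# Hypothetical sketch of the five rules; none of these names come from a real library.
TITLE_SCORES = { "mr" => 1, "mrs" => 1, "miss" => 1, "ms" => 1,
                 "rev" => 2, "dr" => 3, "prof" => 4 }.freeze
TITLE_LABELS = { "mr" => "Mr.", "mrs" => "Mrs.", "miss" => "Miss", "ms" => "Ms.",
                 "rev" => "Rev.", "dr" => "Dr.", "prof" => "Prof." }.freeze
SUFFIXES = %w[jr phd].freeze
ROMAN_NUMERAL = /\A[XVI]+\z/

# Lower-case a word and drop its full stops, e.g. "Ph.D." -> "phd".
def normalize(word)
  word.downcase.delete(".")
end

def greeting(full_name, gender)
  # Rule 1: drop the first comma and everything after it.
  words = full_name.split(",").first.to_s.split

  # Rule 2: strip leading titles, remembering the highest-scoring one.
  best = nil
  while words.any? && TITLE_SCORES.key?(normalize(words.first))
    title = normalize(words.shift)
    best = title if best.nil? || TITLE_SCORES[title] > TITLE_SCORES[best]
  end

  # Rule 3: strip trailing suffixes and Roman numerals (always keep one word).
  while words.size > 1 &&
        (SUFFIXES.include?(normalize(words.last)) || words.last.match?(ROMAN_NUMERAL))
    words.pop
  end

  # Rule 4: prefer a title found in the name, else fall back on gender.
  prefix = best ? TITLE_LABELS[best] : (gender == :male ? "Mr." : "Mrs.")

  # Rule 5: the last remaining word is taken as the surname.
  "#{prefix} #{words.last}"
end

puts greeting("John Smith XIV", :male)        # prints "Mr. Smith"
puts greeting("Dr. John Smith, Ph.D.", :male) # prints "Dr. Smith"
```

The title tables are plain hashes so that extra titles or a different ordering can be added without touching the method itself.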

There is no way to guarantee that a name like "John Baxter Smith" will be parsed correctly, since not all double-barreled surnames use hyphens. Is "Baxter Smith" the surname, or is "Baxter" a middle name? I think it is safe to assume that middle names are more common than double-barreled surnames written without a hyphen, which means it is better to report the last word as the surname by default. However, you could also compile a list of common double-barreled surnames and check against it.
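As a rough illustration of the lookup idea, the sketch below takes the name words that remain after titles and suffixes have been stripped. The `surname` method and the `DOUBLE_BARRELED` list are hypothetical, and the list entries are placeholders rather than real data.

```ruby
require "set"

# Hypothetical, hand-maintained list of unhyphenated double-barreled surnames.
# The entries are placeholders for illustration only.
DOUBLE_BARRELED = Set["baxter smith", "lloyd webber"].freeze

# words: the name split into words, with titles and suffixes already removed.
def surname(words)
  # With fewer than three words there is no room for a given name plus a two-word surname.
  return words.last.to_s if words.size < 3

  pair = words.last(2).join(" ")
  # Use the two-word surname only if it is on the list; otherwise default to the last word.
  DOUBLE_BARRELED.include?(pair.downcase) ? pair : words.last
end

puts surname(%w[John Baxter Smith]) # prints "Baxter Smith"
puts surname(%w[John Quincy Adams]) # prints "Adams"
```

Defaulting to the last word keeps the common middle-name case correct, and the list only overrides it for surnames you have explicitly chosen to recognize.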

+6

Take a look at Lufthansa. They ask people which "title" they want to use. I had never seen this idea before.

I do not recommend using a gem or anything similar in this case: English, Spanish, French and other languages handle gender differently, so if you try to work the title out yourself you are unlikely to succeed.

I hope this helps.

+2

There is a Perl-based parser available for this type of extraction: http://search.cpan.org/~kimryan/Lingua-EN-NameParse/

I ran your examples through it and got the results below. It handles ordinal suffixes only up to 12 (XII), and it does not recognize the periods in Ph.D., so I had to change both of those in your input.

    JOHN SMITH             John Smith
    JOHN SMITH, JR.        John Smith Jr
    JOHN SMITH JR.         John Smith Jr
    JOHN SMITH XII         John Smith XII
    DR. JOHN SMITH, PHD    Dr. John Smith Phd

+2

humanparser

Parses a person's name string into salutation, first name, middle name, last name and suffix.

Install

 npm install humanparser 

Usage

    var human = require('humanparser');

    var fullName = 'Mr. William R. Jenkins, III',
        attrs = human.parseName(fullName);

    console.log(attrs);

    // produces the following output
    {
      saluation: 'Mr.',
      firstName: 'William',
      suffix: 'III',
      lastName: 'Jenkins',
      middleName: 'R.',
      fullName: 'Mr. William R. Jenkins, III'
    }

+1

Have you tried the Ruby gem Namae?

It should handle most Western names well and comes with several configuration options for trickier scenarios (multiple family names, or commas used both to separate names in a list and to separate the parts of a single name). That said, it is a deterministic parser (built on a grammar), so there are cases it will not cover.

Here is your example:

    require('namae')

    Namae.parse 'John Smith and John Smith, Jr. and John Smith Jr and John Smith XIV'
    #=> [
    #     #<Name family="Smith" given="John">,
    #     #<Name family="Smith" given="John" suffix="Jr.">,
    #     #<Name family="Smith" given="John" suffix="Jr">,
    #     #<Name family="Smith" given="John" suffix="XIV">
    #   ]

It struggles with the "Dr." example, but that is something we can fix.

+1

Source: https://habr.com/ru/post/948645/

