Comparing 2 lists in Excel with VBA Regex

I want to use them to compare two lists (columns) in Excel to find matches. Since this is a rather complicated operation, I performed it in the past using several different functions (not VBA) in Excel, but at best it turned out to be inconvenient, and therefore I wanted to try all-in-one VBA, if possible.

The first column contains names with irregularities (for example, quoted aliases, suffixes such as "jr" or "sr", and parentheses around the "preferred" versions of the first names). In addition, if middle names are present, they can be either a name or a starting one.

The order in the first column:

<first name or initial> <space> <any parenthetical 'preferred' names - if they exist> <space> <middle name or initial - if it exists> <space> <quoted nickname or initial - if it exists> <space> <last name> <comma - if necessary><space - if necessary><suffix - if it exists> 

The order in the second column:

  `<lastname><space><suffix>,<firstname><space><middle name, if it exists>` 

none of the “bumps” that the first column has.

My main goal is to “clear” the first column in the following order:

  `lastname-space-suffix,firstname-space-preferred name-space- middle name-space-nickname` 

Although I keep the “bumps” here, I could use some kind of “flags” in my code for comparison, to warn me about them on a case by case basis.

I tried several templates and this is the last:

 ["]?([A-Za-z]?)[.]?["]?[.]?[\s]?[,]?[\s]? 

However, I would like to allow the last name and suffixes (if they exist). I tested it with "global", but I couldn’t figure out how to separate the last name and suffixes, for example by calling a callback.

I would like to compare the last, first, middle initial (since most names are only initials in the first list) between the two lists.

  An example would be: (1st list) John (Johnny) B. "Abe" Smith, Jr. turned into: Smith Jr,John (Johnny) B "Abe" or Smith Jr,John B and (2nd list) Smith Jr,John Bertrand turned into: Smith Jr,John B Then run a comparison between the two columns. 

What would be a good start or continuation to compare this list?


April 10, 2012 update:

As a side note, I will need to exclude quotation marks from aliases and parentheses from preferred names. Can I just put grouped links further into subgroups (in the examples below)?

  (?: ([ ] \( [^)]* \)))? # (2) parenthetical 'preferred' name (optional) (?: ([ ] (["'] ) .*?) \6 )? # (5,6) quoted nickname or initial (optional) 

Can I group them as follows:

  (?:(([ ])(\()([^)]*)(\))))? # (2) parenthetical 'preferred' name (optional) not sure how to do this one - # (5,6) quoted nickname or initial (optional) 

I tried them in "Regex Coach" and "RegExr", and they worked fine, but in VBA, when I wanted backlinks to return as in \ 11 \ 5 all that was returned was Name, number 1 and comma ( e.g. "Carl1"). I come back to check for typos. Thanks for any help.


April 17, 2012 update:

There was one name "situation" that I forgot, and this is a last name consisting of two or more words, for example. "St. Cyrus" or "Von Wilhelm".
Will the next addition

  `((St|Von)[ ])? 

works in this regex that you suggested?

  `((St|Von)[ ])?([^\,()"']+) 

My tests in Regex Coach and RegExr didn’t quite work, because the replacement returns “St” with a preceding space.

+6
source share
2 answers

Repeat -

This is a different approach. This may work in your VBA, and this is just an example. I tested this on Perl and it worked great. But I will not show perl code,
just a regular expression and some clarifications.

This is a two-step process.

  • Normalize column text
  • Perform basic analysis

Normalize the process

  • Get column value
  • Cut all the dots . - globally search \. , replace with nothing. ''
  • Turn spaces to spaces - globally search for \s+ , replace with one space [ ]

(Note that if it cannot be normalized, I don’t see much chance of success, no matter what was tried)

Main analysis process

After normalizing the column value (do for both columns) run it through these regular expressions.

Reexpressing Column 1

 ^ [ ]? ([^\ ,()"']+) # (1) first name or initial (required) (?: ([ ] \( [^)]* \)) )? # (2) parenthetical 'preferred' name (optional) (?: ([ ] [^\ ,()"'] ) # (3,4) middle initial OR name (optional) ([^\ ,()"']*) # name and initial are both captured )? (?: ([ ] (["'] ) .*?) \6 )? # (5,6) quoted nickname or initial (optional) [ ] ([^\ ,()"']+) # (7) last name (required) (?: [, ]* ([ ].+?) [ ]? # (8) suffix (optional) | .*? )? $ 

Replacement depends on what you want.
Three types are defined (replace $ with \ as needed):

  • type 1a full middle - $7$8,$1$2$3$4$5$6
  • type 1b average initial - $7$8,$1$2$3$5$6
  • type 2 average elementary - $7$8,$1$3

Conversion Example:

 Input (raw) = 'John (Johnny) Bertrand "Abe" Smith, Jr. ' Out type 1 full middle = 'Smith Jr,John (Johnny) Bertrand "Abe"' Out type 1 middle initial = 'Smith Jr,John (Johnny) B "Abe"' Out type 2 middle initial = 'Smith Jr,John B' 

Regular expression for column 2

 ^ [ ]? ([^\ ,()"']+) # (1) last name (required) (?: ([ ] [^\ ,()"']+) )? # (2) suffix (optional) , ([^\ ,()"']+) # (3) first name or initial (required) (?: ([ ] [^\ ,()"']) # (4,5) middle initial OR name (optional) ([^\ ,()"']*) )? .* $ 

Replacement depends on what you want.
Two types are defined (replace $ with \ as necessary):

  • type 1a full medium - $1$2,$3$4$5
  • type 1b average initial - $1$2,$3$4

Conversion Example:

 Input = 'Smith Jr.,John Bertrand ' Out type 1 full middle = 'Smith Jr,John Bertrand' Out type 1 middle initial = 'Smith Jr,John B' 

VBA Replacement Help

This works on a very old copy of Excel, creating a VBA project.
These are two modules created to show an example.
They both do the same.

The first is a detailed example of all possible replacement options.
The second is an abridged version using only type 2 comparison.

I have not done VB before, as you can tell, but it should be simple enough

for you to find out how the replacement works, and how to tie the columns together.

If you just do the comparison equally, you can do col 1 val
once, then check each value in column 2 on it, then go to the next value in
column 1, then repeat.

For the fastest way to do this, create 2 additional columns, convert respected
vals columns for type 2 (strC1_2 and strC2_2 variables, see example), then copy them
to the new columns.
After that, you don’t need regex, just compare the columns, find the corresponding rows,
then delete type 2 columns.

Detailed -

 Sub RegexColumnValueComparison() ' Column 1 and 2 , Sample values ' These should probably be passed in values ' ============================================ strC1 = "John (Johnny) Bertrand ""Abe"" Smith, Jr. " strC2 = "Smith Jr.,John Bertrand " ' Normalization Regexs for whitespace and period's ' (use for both column values) ' ============================================= Set rxDot = CreateObject("vbscript.regexp") rxDot.Global = True rxDot.Pattern = "\." Set rxWSp = CreateObject("vbscript.regexp") rxWSp.Global = True rxWSp.Pattern = "\s+" ' Column 1 Regex ' ================== Set rxC1 = CreateObject("vbscript.regexp") rxC1.Global = False rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$" ' Column 2 Regex ' ================== Set rxC2 = CreateObject("vbscript.regexp") rxC2.Global = False rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$" ' Normalize column 1 and 2, Copy to new var ' ============================================ strC1_Normal = rxDot.Replace(rxWSp.Replace(strC1, " "), "") strC2_Normal = rxDot.Replace(rxWSp.Replace(strC2, " "), "") ' ------------------------------------------------------ ' This section is informational ' Shows some sample replacements before comparison ' Just pick 1 replacement from each column, discard the rest ' ------------------------------------------------------ ' Create Some Replacement Types for Column 1 ' ===================================================== strC1_1a = rxC1.Replace(strC1_Normal, "$7$8,$1$2$3$4$5$6") strC1_1b = rxC1.Replace(strC1_Normal, "$7$8,$1$2$3$5$6") strC1_2 = rxC1.Replace(strC1_Normal, "$7$8,$1$3") ' Create Some Replacement Types for Column 2 ' ===================================================== strC2_1b = rxC2.Replace(strC2_Normal, "$1$2,$3$4$5") strC2_2 = rxC2.Replace(strC2_Normal, "$1$2,$3$4") ' Show Types in Message Box ' ===================================================== c1_t1a = "Column1 Types:" & Chr(13) & "type 1a full middle - " & strC1_1a c1_t1b = "type 1b middle initial - " & strC1_1b c1_t2 = "type 2 middle initial - " & strC1_2 c2_t1b = "Column2 Types:" & Chr(13) & "type 1b middle initial - " & strC2_1b c2_t2 = "type 2 middle initial - " & strC2_2 MsgBox (c1_t1a & Chr(13) & c1_t1b & Chr(13) & c1_t2 & Chr(13) & Chr(13) & c2_t1b & Chr(13) & c2_t2) ' ------------------------------------------------------ ' Compare a Value from Column 1 vs Column 2 ' For this we will compare Type 2 values ' ------------------------------------------------------ If strC1_2 = strC2_2 Then MsgBox ("Type 2 values are EQUAL: " & Chr(13) & strC1_2) Else MsgBox ("Type 2 values are NOT Equal:" & Chr(13) & strC1_2 & " != " & strC1_2) End If ' ------------------------------------------------------ ' Same comparison (Type 2) of Normalized column 1,2 values ' In esscense, this is all you need ' ------------------------------------------------------ If rxC1.Replace(strC1_Normal, "$7$8,$1$3") = rxC2.Replace(strC2_Normal, "$1$2,$3$4") Then MsgBox ("Type 2 values are EQUAL") Else MsgBox ("Type 2 values are NOT Equal") End If End Sub 

Type 2 only -

 Sub RegexColumnValueComparison() ' Column 1 and 2 , Sample values ' These should probably be passed in values ' ============================================ strC1 = "John (Johnny) Bertrand ""Abe"" Smith, Jr. " strC2 = "Smith Jr.,John Bertrand " ' Normalization Regexes for whitespace and period's ' (use for both column values) ' ============================================= Set rxDot = CreateObject("vbscript.regexp") rxDot.Global = True rxDot.Pattern = "\." Set rxWSp = CreateObject("vbscript.regexp") rxWSp.Global = True rxWSp.Pattern = "\s+" ' Column 1 Regex ' ================== Set rxC1 = CreateObject("vbscript.regexp") rxC1.Global = False rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$" ' Column 2 Regex ' ================== Set rxC2 = CreateObject("vbscript.regexp") rxC2.Global = False rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$" ' Normalize column 1 and 2, Copy to new var ' ============================================ strC1_Normal = rxDot.Replace(rxWSp.Replace(strC1, " "), "") strC2_Normal = rxDot.Replace(rxWSp.Replace(strC2, " "), "") ' Comparison (Type 2) of Normalized column 1,2 values ' ============================================ strC1_2 = rxC1.Replace(strC1_Normal, "$7$8,$1$3") strC2_2 = rxC2.Replace(strC2_Normal, "$1$2,$3$4") If strC1_2 = strC2_2 Then MsgBox ("Type 2 values are EQUAL") Else MsgBox ("Type 2 values are NOT Equal") End If End Sub 

Paren / quote response

As a side note, I will need to eliminate the quotes from the nicknames and the parentheses from the preferred names.

If I understood correctly.

Yes, you can grab content inside quotes and brackets separately.
It just requires some changes. The regular expression below has the ability to formulate a replacement with or without quotation marks and / or brackets,
or other forms.

The following are examples of wording for substitutions.

Very important to note here

IF you are talking about excluding quotes "" and parentheses () from the list The corresponding regular expression can also be executed. This requires a new regular expression.

The only problem is that ALL the difference between preferred / average / nick
throws out the window because they were positional, as well (i.e.: (preferred) average "nickname").

Distracting attention to this will require regular expression subexpressions like this

 (?:[ ]([^ ,]+))? # optional preferred (?:[ ]([^ ,]+))? # optional middle (?:[ ]([^ ,]+))? # optional nick 

And, being optional, loses the entire positional link and displays the average initial version of the expression invalid.

Final note

Regular expression pattern (use to form replacement strings)

 ^ [ ]? # (required) # First # $1 name # ----------------------------------------- ([^\ ,()"']+) # (1) name # (optional) # Parenthetical 'preferred' # $2 all # $3$4 name # ----------------------------------------- (?: ( # (2) all ([ ]) \( ([^)]*) \) # (3,4) space and name ) )? # (optional) # Middle # $5 initial # $5$6 name # ----------------------------------------- (?: ([ ] [^\ ,()"'] ) # (5) first character ([^\ ,()"']*) # (6) remaining characters )? # (optional) # Quoted nick # $7$8$9$8 all # $7$9 name # ----------------------------------------- (?: ([ ]) # (7) space (["']) # (8) quote (.*?) # (9) name \8 )? # (required) # Last # $10 name # ----------------------------------------- [ ] ([^\ ,()"']+) # (10) name # (optional) # Suffix # $11 suffix # ----------------------------------------- (?: [, ]* ([ ].+?) [ ]? # (11) suffix | .*? )? $ 

VBA regex (second edition tested in my VBA project from above)

 rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:(([ ])\(([^)]*)\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ])([""'])(.*?)\8)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$" strC1_1a = rxC1.Replace( strC1_Normal, "$10$11,$1$2$5$6$7$8$9$8" ) strC1_1aa = rxC1.Replace( strC1_Normal, "$10$11,$1$3$4$5$6$7$9" ) strC1_1b = rxC1.Replace( strC1_Normal, "$10$11,$1$2$5$7$8$9$8" ) strC1_1bb = rxC1.Replace( strC1_Normal, "$10$11,$1$3$4$5$7$9" ) strC1_2 = rxC1.Replace( strC1_Normal, "$10$11,$1$5" ) 

Examples of I / O Capabilities

 Input (raw) = 'John (Johnny) Bertrand "Abe" Smith, Jr. ' Out type 1a full middle = 'Smith Jr,John (Johnny) Bertrand "Abe"' Out type 1aa full middle = 'Smith Jr,John Johnny Bertrand Abe' Out type 1b middle initial = 'Smith Jr,John (Johnny) B "Abe"' Out type 1bb middle initial = 'Smith Jr,John Johnny B Abe' Out type 2 middle initial = 'Smith Jr,John B' Input (raw) = 'John (Johnny) Smith, Jr.' Out type 1a full middle = 'Smith Jr,John (Johnny)' Out type 1aa full middle = 'Smith Jr,John Johnny' Out type 1b middle initial = 'Smith Jr,John (Johnny)' Out type 1bb middle initial = 'Smith Jr,John Johnny' Out type 2 middle initial = 'Smith Jr,John' Input (raw) = 'John (Johnny) "Abe" Smith, Jr.' Out type 1a full middle = 'Smith Jr,John (Johnny) "Abe"' Out type 1aa full middle = 'Smith Jr,John Johnny Abe' Out type 1b middle initial = 'Smith Jr,John (Johnny) "Abe"' Out type 1bb middle initial = 'Smith Jr,John Johnny Abe' Out type 2 middle initial = 'Smith Jr,John' Input (raw) = 'John "Abe" Smith, Jr.' Out type 1a full middle = 'Smith Jr,John "Abe"' Out type 1aa full middle = 'Smith Jr,John Abe' Out type 1b middle initial = 'Smith Jr,John "Abe"' Out type 1bb middle initial = 'Smith Jr,John Abe' Out type 2 middle initial = 'Smith Jr,John' 

Re: 4/17 concern

last names that have 2 or more words. Would the allowance for certain literal names, rather than generic word patterns, be the solution?

Actually, no, it’s not. In this case, for your form that allows you to use several words in the name, it introduces a spatial field separator in the name field.

However, this can be done for your specific form, since the only drawback is when the "nick" field is absent. When it is absent, and provided that the application has only one word, the middle name, 2 permutations.

We hope you can get a solution from 3 regular expressions and test output. Regular expressions cut out spaces of space from capture. So, you can either make replacements using the Replace method or just save capture buffers for comparison with
capture results from other columns.

 Nick_rx.Pattern (template) * This pattern is multi-word last name, NICK is required ^ [ ]? # First (req'd) ([^\ ,()"']+) # (1) first name # Preferred first (?: [ ] ( # (2) (preferred), -or- \( ([^)]*?) \) # (3) preferred ) )? # Middle (?: [ ] ( # (4) full middle, -or- ([^\ ,()"']) # (5) initial [^\ ,()"']* ) )? # Quoted nick (req'd) [ ] ( # (6) "nick", (["']) # (7) n/a -or- (.*?) # (8) nick \7 ) # Single/Multi Last (req'd) [ ] ( # (9) multi/single word last name [^\ ,()"']+ (?:[ ][^\ ,()"']+)* ) # Suffix (?: [ ]? , [ ]? (.*?) )? # (10) suffix [ ]? $ ----------------------------------- FLs_rx.Pattern (template) * This pattern has no MIDDLE/NICK, is single-word last name, * and has no permutations. ^ [ ]? # First (req'd) ([^\ ,()"']+) # (1) first name # Preferred first (?: [ ] ( # (2) (preferred), -or- \( ([^)]*?) \) # (3) preferred ) )? # Single Last (req'd) [ ] ([^\ ,()"']+) # (4) single word last name # Suffix (?: [ ]? , [ ]? (.*?) )? # (5) suffix [ ]? $ ----------------------------------- FLm_rx.Pattern (template) * This pattern has no NICK, is multi-word last name, * and has 2 permutations. * 1. Middle as part of Last name. * 2. Middle is separate from Last name. ^ [ ]? # First (req'd) ([^\ ,()"']+) # (1) first name # Preferred first (?: [ ] ( # (2) (preferred), -or- \( ([^)]*?) \) # (3) preferred ) )? # Multi Last (req'd) [ ] ( # (4) Multi, as Middle + Last, # -or- (?: # Middle ( # (5) full middle, -or- ([^\ ,()"']) # (6) initial [^\ ,()"']* ) [ ] ) # Last (req'd) ( # (7) multi/single word last name [^\ ,()"']+ (?:[ ][^\ ,()"']+)* ) ) # Suffix (?: [ ]? , [ ]? (.*?) )? # (8) suffix [ ]? $ ----------------------------------- Each of these regexes are mutually exclusive and should be checked in an if-then-else like this (Pseudo code): str_Normal = rxDot.Replace(rxWSp.Replace(str, " "), "") If Nick_rx.Test(str_Normal) Then N_1a = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $2 $4 $6 "), " ") N_1aa = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $3 $4 $8 "), " ") N_1b = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $2 $5 $6 "), " ") N_1bb = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $3 $5 $8 "), " ") N_2 = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $5 "), " ") ' see test case results in output below Else If FLs_rx.Test(str_Normal) Then FLs_1a = rxWSp.Replace( FLs_rx.Replace(str_Normal, "$4 $5 , $1 $2 "), " ") FLs_1aa = rxWSp.Replace( FLs_rx.Replace(str_Normal, "$4 $5 , $1 $3 "), " ") FLs_2 = rxWSp.Replace( FLs_rx.Replace(str_Normal, "$4 $5 , $1 "), " ") Else If FLm_rx.Test(str_Normal) Then ' Permutation 1: FLm1_1a = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$4 $8 , $1 $2 "), " ") FLm1_1aa = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$4 $8 , $1 $3 "), " ") FLm1_2 = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$4 $8 , $1 "), " ") ' Permutation 2: FLm2_1a = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $2 $5 "), " ") FLm2_1aa = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $3 $5 "), " ") FLm2_1b = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $2 $6 "), " ") FLm2_1bb = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $3 $6 "), " ") FLm2_2 = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $6 "), " ") ' At this point, the odds are that only one of these permutations will match ' a different column. Else ' The data could not be matched against a valid form End If ----------------------------- Test Cases Found form 'Nick' Input (raw) = 'John1 (JJ) Bert "nick" St Van Helsing ,Jr ' Normal = 'John1 (JJ) Bert "nick" St Van Helsing ,Jr ' Out type 1a full middle = 'St Van Helsing Jr , John1 (JJ) Bert "nick" ' Out type 1aa full middle = 'St Van Helsing Jr , John1 JJ Bert nick ' Out type 1b middle initial = 'St Van Helsing Jr , John1 (JJ) B "nick" ' Out type 1bb middle initial = 'St Van Helsing Jr , John1 JJ B nick ' Out type 2 middle initial = 'St Van Helsing Jr , John1 B ' ======================================================= Found form 'Nick' Input (raw) = 'John2 Bert "nick" Helsing ,Jr ' Normal = 'John2 Bert "nick" Helsing ,Jr ' Out type 1a full middle = 'Helsing Jr , John2 Bert "nick" ' Out type 1aa full middle = 'Helsing Jr , John2 Bert nick ' Out type 1b middle initial = 'Helsing Jr , John2 B "nick" ' Out type 1bb middle initial = 'Helsing Jr , John2 B nick ' Out type 2 middle initial = 'Helsing Jr , John2 B ' ======================================================= Found form 'Nick' Input (raw) = 'John3 Bert "nick" St Van Helsing ,Jr ' Normal = 'John3 Bert "nick" St Van Helsing ,Jr ' Out type 1a full middle = 'St Van Helsing Jr , John3 Bert "nick" ' Out type 1aa full middle = 'St Van Helsing Jr , John3 Bert nick ' Out type 1b middle initial = 'St Van Helsing Jr , John3 B "nick" ' Out type 1bb middle initial = 'St Van Helsing Jr , John3 B nick ' Out type 2 middle initial = 'St Van Helsing Jr , John3 B ' ======================================================= Found form 'First-Last (single)' Input (raw) = 'John4 Helsing ' Normal = 'John4 Helsing ' Out type 1a no middle = 'Helsing , John4 ' Out type 1aa no middle = 'Helsing , John4 ' Out type 2 = 'Helsing , John4 ' ======================================================= Found form 'First-Last (single)' Input (raw) = 'John5 (JJ) Helsing ' Normal = 'John5 (JJ) Helsing ' Out type 1a no middle = 'Helsing , John5 (JJ) ' Out type 1aa no middle = 'Helsing , John5 JJ ' Out type 2 = 'Helsing , John5 ' ======================================================= Found form 'First-Last (multi)' Input (raw) = 'John6 (JJ) Bert St Van Helsing ,Jr ' Normal = 'John6 (JJ) Bert St Van Helsing ,Jr ' Permutation 1: Out type 1a no middle = 'Bert St Van Helsing Jr , John6 (JJ) ' Out type 1aa no middle = 'Bert St Van Helsing Jr , John6 JJ ' Out type 2 = 'Bert St Van Helsing Jr , John6 ' Permutation 2: Out type 1a full middle = 'St Van Helsing Jr , John6 (JJ) Bert ' Out type 1aa full middle = 'St Van Helsing Jr , John6 JJ Bert ' Out type 1b middle initial = 'St Van Helsing Jr , John6 (JJ) B ' Out type 1bb middle initial = 'St Van Helsing Jr , John6 JJ B ' Out type 2 middle initial = 'St Van Helsing Jr , John6 B ' ======================================================= Found form 'First-Last (multi)' Input (raw) = 'John7 Bert St Van Helsing ,Jr ' Normal = 'John7 Bert St Van Helsing ,Jr ' Permutation 1: Out type 1a no middle = 'Bert St Van Helsing Jr , John7 ' Out type 1aa no middle = 'Bert St Van Helsing Jr , John7 ' Out type 2 = 'Bert St Van Helsing Jr , John7 ' Permutation 2: Out type 1a full middle = 'St Van Helsing Jr , John7 Bert ' Out type 1aa full middle = 'St Van Helsing Jr , John7 Bert ' Out type 1b middle initial = 'St Van Helsing Jr , John7 B ' Out type 1bb middle initial = 'St Van Helsing Jr , John7 B ' Out type 2 middle initial = 'St Van Helsing Jr , John7 B ' ======================================================= Form *** (unknown) Input (raw) = ' do(e)s not. match ,' Normal = ' do(e)s not match ,' ======================================================= 
+2
source

Here is a regular expression that may be useful, it will give you 6 capture groups in the following order: first name, preferred name, middle name, alias, last name, suffix.

 ([az]+)\.?\s(?:(\([az]+\))\s)?(?:([az]+)\.?\s)?(?:("[az]+")\s)?([az]+)(?:,\s([az]+))? 

Here is an explanation:

 ([az]+)\.?\s # First name, followed by optional '.' (required) (?:(\([az]+\))\s)? # Preferred name, optional (?:([az]+)\.?\s)? # Middle name, optional (?:("[az]+")\s)? # Nickname, optional ([az]+) # Last name, required (?:,\s([az]+))? # Suffix, optional 

So for example, you can turn John (Johnny) B. "Abe" Smith, Jr. in Smith Jr,John (Johnny) B "Abe" by combining groups like this \5 \6,\1 \2 \3 \4 , or you can include it in Smith Jr,John B with \5 \6,\1 \3 .

+2
source

Source: https://habr.com/ru/post/912091/


All Articles