Repeat -
This is a different approach. This may work in your VBA, and this is just an example. I tested this on Perl and it worked great. But I will not show perl code,
just a regular expression and some clarifications.
This is a two-step process.
- Normalize column text
- Perform basic analysis
Normalize the process
- Get column value
- Cut all the dots
.
- globally search \.
, replace with nothing. ''
- Turn spaces to spaces - globally search for
\s+
, replace with one space [ ]
(Note that if it cannot be normalized, I don’t see much chance of success, no matter what was tried)
Main analysis process
After normalizing the column value (do for both columns) run it through these regular expressions.
Reexpressing Column 1
^ [ ]? ([^\ ,()"']+) # (1) first name or initial (required) (?: ([ ] \( [^)]* \)) )? # (2) parenthetical 'preferred' name (optional) (?: ([ ] [^\ ,()"'] ) # (3,4) middle initial OR name (optional) ([^\ ,()"']*)
Replacement depends on what you want.
Three types are defined (replace $
with \
as needed):
- type 1a full middle -
$7$8,$1$2$3$4$5$6
- type 1b average initial -
$7$8,$1$2$3$5$6
- type 2 average elementary -
$7$8,$1$3
Conversion Example:
Input (raw) = 'John (Johnny) Bertrand "Abe" Smith, Jr. ' Out type 1 full middle = 'Smith Jr,John (Johnny) Bertrand "Abe"' Out type 1 middle initial = 'Smith Jr,John (Johnny) B "Abe"' Out type 2 middle initial = 'Smith Jr,John B'
Regular expression for column 2
^ [ ]? ([^\ ,()"']+) # (1) last name (required) (?: ([ ] [^\ ,()"']+) )? # (2) suffix (optional) , ([^\ ,()"']+)
Replacement depends on what you want.
Two types are defined (replace $
with \
as necessary):
- type 1a full medium -
$1$2,$3$4$5
- type 1b average initial -
$1$2,$3$4
Conversion Example:
Input = 'Smith Jr.,John Bertrand ' Out type 1 full middle = 'Smith Jr,John Bertrand' Out type 1 middle initial = 'Smith Jr,John B'
VBA Replacement Help
This works on a very old copy of Excel, creating a VBA project.
These are two modules created to show an example.
They both do the same.
The first is a detailed example of all possible replacement options.
The second is an abridged version using only type 2 comparison.
I have not done VB before, as you can tell, but it should be simple enough
for you to find out how the replacement works, and how to tie the columns together.
If you just do the comparison equally, you can do col 1 val
once, then check each value in column 2 on it, then go to the next value in
column 1, then repeat.
For the fastest way to do this, create 2 additional columns, convert respected
vals columns for type 2 (strC1_2 and strC2_2 variables, see example), then copy them
to the new columns.
After that, you don’t need regex, just compare the columns, find the corresponding rows,
then delete type 2 columns.
Detailed -
Sub RegexColumnValueComparison() ' Column 1 and 2 , Sample values ' These should probably be passed in values ' ============================================ strC1 = "John (Johnny) Bertrand ""Abe"" Smith, Jr. " strC2 = "Smith Jr.,John Bertrand " ' Normalization Regexs for whitespace and period's ' (use for both column values) ' ============================================= Set rxDot = CreateObject("vbscript.regexp") rxDot.Global = True rxDot.Pattern = "\." Set rxWSp = CreateObject("vbscript.regexp") rxWSp.Global = True rxWSp.Pattern = "\s+" ' Column 1 Regex ' ================== Set rxC1 = CreateObject("vbscript.regexp") rxC1.Global = False rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$" ' Column 2 Regex ' ================== Set rxC2 = CreateObject("vbscript.regexp") rxC2.Global = False rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$" ' Normalize column 1 and 2, Copy to new var ' ============================================ strC1_Normal = rxDot.Replace(rxWSp.Replace(strC1, " "), "") strC2_Normal = rxDot.Replace(rxWSp.Replace(strC2, " "), "") ' ------------------------------------------------------ ' This section is informational ' Shows some sample replacements before comparison ' Just pick 1 replacement from each column, discard the rest ' ------------------------------------------------------ ' Create Some Replacement Types for Column 1 ' ===================================================== strC1_1a = rxC1.Replace(strC1_Normal, "$7$8,$1$2$3$4$5$6") strC1_1b = rxC1.Replace(strC1_Normal, "$7$8,$1$2$3$5$6") strC1_2 = rxC1.Replace(strC1_Normal, "$7$8,$1$3") ' Create Some Replacement Types for Column 2 ' ===================================================== strC2_1b = rxC2.Replace(strC2_Normal, "$1$2,$3$4$5") strC2_2 = rxC2.Replace(strC2_Normal, "$1$2,$3$4") ' Show Types in Message Box ' ===================================================== c1_t1a = "Column1 Types:" & Chr(13) & "type 1a full middle - " & strC1_1a c1_t1b = "type 1b middle initial - " & strC1_1b c1_t2 = "type 2 middle initial - " & strC1_2 c2_t1b = "Column2 Types:" & Chr(13) & "type 1b middle initial - " & strC2_1b c2_t2 = "type 2 middle initial - " & strC2_2 MsgBox (c1_t1a & Chr(13) & c1_t1b & Chr(13) & c1_t2 & Chr(13) & Chr(13) & c2_t1b & Chr(13) & c2_t2) ' ------------------------------------------------------ ' Compare a Value from Column 1 vs Column 2 ' For this we will compare Type 2 values ' ------------------------------------------------------ If strC1_2 = strC2_2 Then MsgBox ("Type 2 values are EQUAL: " & Chr(13) & strC1_2) Else MsgBox ("Type 2 values are NOT Equal:" & Chr(13) & strC1_2 & " != " & strC1_2) End If ' ------------------------------------------------------ ' Same comparison (Type 2) of Normalized column 1,2 values ' In esscense, this is all you need ' ------------------------------------------------------ If rxC1.Replace(strC1_Normal, "$7$8,$1$3") = rxC2.Replace(strC2_Normal, "$1$2,$3$4") Then MsgBox ("Type 2 values are EQUAL") Else MsgBox ("Type 2 values are NOT Equal") End If End Sub
Type 2 only -
Sub RegexColumnValueComparison() ' Column 1 and 2 , Sample values ' These should probably be passed in values ' ============================================ strC1 = "John (Johnny) Bertrand ""Abe"" Smith, Jr. " strC2 = "Smith Jr.,John Bertrand " ' Normalization Regexes for whitespace and period's ' (use for both column values) ' ============================================= Set rxDot = CreateObject("vbscript.regexp") rxDot.Global = True rxDot.Pattern = "\." Set rxWSp = CreateObject("vbscript.regexp") rxWSp.Global = True rxWSp.Pattern = "\s+" ' Column 1 Regex ' ================== Set rxC1 = CreateObject("vbscript.regexp") rxC1.Global = False rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:([ ]\([^)]*\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ]([""']).*?)\6)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$" ' Column 2 Regex ' ================== Set rxC2 = CreateObject("vbscript.regexp") rxC2.Global = False rxC2.Pattern = "^[ ]?([^ ,()""']+)(?:([ ][^ ,()""']+))?,([^ ,()""']+)(?:([ ][^ ,()""'])([^ ,()""']*))?.*$" ' Normalize column 1 and 2, Copy to new var ' ============================================ strC1_Normal = rxDot.Replace(rxWSp.Replace(strC1, " "), "") strC2_Normal = rxDot.Replace(rxWSp.Replace(strC2, " "), "") ' Comparison (Type 2) of Normalized column 1,2 values ' ============================================ strC1_2 = rxC1.Replace(strC1_Normal, "$7$8,$1$3") strC2_2 = rxC2.Replace(strC2_Normal, "$1$2,$3$4") If strC1_2 = strC2_2 Then MsgBox ("Type 2 values are EQUAL") Else MsgBox ("Type 2 values are NOT Equal") End If End Sub
Paren / quote response
As a side note, I will need to eliminate the quotes from the nicknames and the parentheses from the preferred names.
If I understood correctly.
Yes, you can grab content inside quotes and brackets separately.
It just requires some changes. The regular expression below has the ability to formulate a replacement with or without quotation marks and / or brackets,
or other forms.
The following are examples of wording for substitutions.
Very important to note here
IF you are talking about excluding quotes "" and parentheses () from the list The corresponding regular expression can also be executed. This requires a new regular expression.
The only problem is that ALL the difference between preferred / average / nick
throws out the window because they were positional, as well (i.e.: (preferred) average "nickname").
Distracting attention to this will require regular expression subexpressions like this
(?:[ ]([^ ,]+))? # optional preferred (?:[ ]([^ ,]+))? # optional middle (?:[ ]([^ ,]+))? # optional nick
And, being optional, loses the entire positional link and displays the average initial version of the expression invalid.
Final note
Regular expression pattern (use to form replacement strings)
^ [ ]? # (required) # First # $1 name # ----------------------------------------- ([^\ ,()"']+) # (1) name # (optional) # Parenthetical 'preferred' # $2 all # $3$4 name # ----------------------------------------- (?: ( # (2) all ([ ]) \( ([^)]*) \) # (3,4) space and name ) )? # (optional) # Middle # $5 initial # $5$6 name # ----------------------------------------- (?: ([ ] [^\ ,()"'] ) # (5) first character ([^\ ,()"']*) # (6) remaining characters )? # (optional) # Quoted nick # $7$8$9$8 all # $7$9 name # ----------------------------------------- (?: ([ ]) # (7) space (["']) # (8) quote (.*?) # (9) name \8 )? # (required) # Last # $10 name # ----------------------------------------- [ ] ([^\ ,()"']+) # (10) name # (optional) # Suffix # $11 suffix # ----------------------------------------- (?: [, ]* ([ ].+?) [ ]? # (11) suffix | .*? )? $
VBA regex (second edition tested in my VBA project from above)
rxC1.Pattern = "^[ ]?([^ ,()""']+)(?:(([ ])\(([^)]*)\)))?(?:([ ][^ ,()""'])([^ ,()""']*))?(?:([ ])([""'])(.*?)\8)?[ ]([^ ,()""']+)(?:[, ]*([ ].+?)[ ]?|.*?)?$" strC1_1a = rxC1.Replace( strC1_Normal, "$10$11,$1$2$5$6$7$8$9$8" ) strC1_1aa = rxC1.Replace( strC1_Normal, "$10$11,$1$3$4$5$6$7$9" ) strC1_1b = rxC1.Replace( strC1_Normal, "$10$11,$1$2$5$7$8$9$8" ) strC1_1bb = rxC1.Replace( strC1_Normal, "$10$11,$1$3$4$5$7$9" ) strC1_2 = rxC1.Replace( strC1_Normal, "$10$11,$1$5" )
Examples of I / O Capabilities
Input (raw) = 'John (Johnny) Bertrand "Abe" Smith, Jr. ' Out type 1a full middle = 'Smith Jr,John (Johnny) Bertrand "Abe"' Out type 1aa full middle = 'Smith Jr,John Johnny Bertrand Abe' Out type 1b middle initial = 'Smith Jr,John (Johnny) B "Abe"' Out type 1bb middle initial = 'Smith Jr,John Johnny B Abe' Out type 2 middle initial = 'Smith Jr,John B' Input (raw) = 'John (Johnny) Smith, Jr.' Out type 1a full middle = 'Smith Jr,John (Johnny)' Out type 1aa full middle = 'Smith Jr,John Johnny' Out type 1b middle initial = 'Smith Jr,John (Johnny)' Out type 1bb middle initial = 'Smith Jr,John Johnny' Out type 2 middle initial = 'Smith Jr,John' Input (raw) = 'John (Johnny) "Abe" Smith, Jr.' Out type 1a full middle = 'Smith Jr,John (Johnny) "Abe"' Out type 1aa full middle = 'Smith Jr,John Johnny Abe' Out type 1b middle initial = 'Smith Jr,John (Johnny) "Abe"' Out type 1bb middle initial = 'Smith Jr,John Johnny Abe' Out type 2 middle initial = 'Smith Jr,John' Input (raw) = 'John "Abe" Smith, Jr.' Out type 1a full middle = 'Smith Jr,John "Abe"' Out type 1aa full middle = 'Smith Jr,John Abe' Out type 1b middle initial = 'Smith Jr,John "Abe"' Out type 1bb middle initial = 'Smith Jr,John Abe' Out type 2 middle initial = 'Smith Jr,John'
Re: 4/17 concern
last names that have 2 or more words. Would the allowance for certain literal names, rather than generic word patterns, be the solution?
Actually, no, it’s not. In this case, for your form that allows you to use several words in the name, it introduces a spatial field separator in the name field.
However, this can be done for your specific form, since the only drawback is when the "nick"
field is absent. When it is absent, and provided that the application has only one word, the middle name, 2 permutations.
We hope you can get a solution from 3 regular expressions and test output. Regular expressions cut out spaces of space from capture. So, you can either make replacements using the Replace method or just save capture buffers for comparison with
capture results from other columns.
Nick_rx.Pattern (template) * This pattern is multi-word last name, NICK is required ^ [ ]? # First (req'd) ([^\ ,()"']+) # (1) first name # Preferred first (?: [ ] ( # (2) (preferred), -or- \( ([^)]*?) \) # (3) preferred ) )? # Middle (?: [ ] ( # (4) full middle, -or- ([^\ ,()"']) # (5) initial [^\ ,()"']* ) )? # Quoted nick (req'd) [ ] ( # (6) "nick", (["']) # (7) n/a -or- (.*?) # (8) nick \7 ) # Single/Multi Last (req'd) [ ] ( # (9) multi/single word last name [^\ ,()"']+ (?:[ ][^\ ,()"']+)* ) # Suffix (?: [ ]? , [ ]? (.*?) )? # (10) suffix [ ]? $ ----------------------------------- FLs_rx.Pattern (template) * This pattern has no MIDDLE/NICK, is single-word last name, * and has no permutations. ^ [ ]? # First (req'd) ([^\ ,()"']+) # (1) first name # Preferred first (?: [ ] ( # (2) (preferred), -or- \( ([^)]*?) \) # (3) preferred ) )? # Single Last (req'd) [ ] ([^\ ,()"']+) # (4) single word last name # Suffix (?: [ ]? , [ ]? (.*?) )? # (5) suffix [ ]? $ ----------------------------------- FLm_rx.Pattern (template) * This pattern has no NICK, is multi-word last name, * and has 2 permutations. * 1. Middle as part of Last name. * 2. Middle is separate from Last name. ^ [ ]? # First (req'd) ([^\ ,()"']+) # (1) first name # Preferred first (?: [ ] ( # (2) (preferred), -or- \( ([^)]*?) \) # (3) preferred ) )? # Multi Last (req'd) [ ] ( # (4) Multi, as Middle + Last, # -or- (?: # Middle ( # (5) full middle, -or- ([^\ ,()"']) # (6) initial [^\ ,()"']* ) [ ] ) # Last (req'd) ( # (7) multi/single word last name [^\ ,()"']+ (?:[ ][^\ ,()"']+)* ) ) # Suffix (?: [ ]? , [ ]? (.*?) )? # (8) suffix [ ]? $ ----------------------------------- Each of these regexes are mutually exclusive and should be checked in an if-then-else like this (Pseudo code): str_Normal = rxDot.Replace(rxWSp.Replace(str, " "), "") If Nick_rx.Test(str_Normal) Then N_1a = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $2 $4 $6 "), " ") N_1aa = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $3 $4 $8 "), " ") N_1b = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $2 $5 $6 "), " ") N_1bb = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $3 $5 $8 "), " ") N_2 = rxWSp.Replace( Nick_rx.Replace(str_Normal, "$9 $10 , $1 $5 "), " ") ' see test case results in output below Else If FLs_rx.Test(str_Normal) Then FLs_1a = rxWSp.Replace( FLs_rx.Replace(str_Normal, "$4 $5 , $1 $2 "), " ") FLs_1aa = rxWSp.Replace( FLs_rx.Replace(str_Normal, "$4 $5 , $1 $3 "), " ") FLs_2 = rxWSp.Replace( FLs_rx.Replace(str_Normal, "$4 $5 , $1 "), " ") Else If FLm_rx.Test(str_Normal) Then ' Permutation 1: FLm1_1a = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$4 $8 , $1 $2 "), " ") FLm1_1aa = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$4 $8 , $1 $3 "), " ") FLm1_2 = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$4 $8 , $1 "), " ") ' Permutation 2: FLm2_1a = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $2 $5 "), " ") FLm2_1aa = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $3 $5 "), " ") FLm2_1b = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $2 $6 "), " ") FLm2_1bb = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $3 $6 "), " ") FLm2_2 = rxWSp.Replace( FLm_rx.Replace(str_Normal, "$7 $8 , $1 $6 "), " ") ' At this point, the odds are that only one of these permutations will match ' a different column. Else ' The data could not be matched against a valid form End If ----------------------------- Test Cases Found form 'Nick' Input (raw) = 'John1 (JJ) Bert "nick" St Van Helsing ,Jr ' Normal = 'John1 (JJ) Bert "nick" St Van Helsing ,Jr ' Out type 1a full middle = 'St Van Helsing Jr , John1 (JJ) Bert "nick" ' Out type 1aa full middle = 'St Van Helsing Jr , John1 JJ Bert nick ' Out type 1b middle initial = 'St Van Helsing Jr , John1 (JJ) B "nick" ' Out type 1bb middle initial = 'St Van Helsing Jr , John1 JJ B nick ' Out type 2 middle initial = 'St Van Helsing Jr , John1 B ' ======================================================= Found form 'Nick' Input (raw) = 'John2 Bert "nick" Helsing ,Jr ' Normal = 'John2 Bert "nick" Helsing ,Jr ' Out type 1a full middle = 'Helsing Jr , John2 Bert "nick" ' Out type 1aa full middle = 'Helsing Jr , John2 Bert nick ' Out type 1b middle initial = 'Helsing Jr , John2 B "nick" ' Out type 1bb middle initial = 'Helsing Jr , John2 B nick ' Out type 2 middle initial = 'Helsing Jr , John2 B ' ======================================================= Found form 'Nick' Input (raw) = 'John3 Bert "nick" St Van Helsing ,Jr ' Normal = 'John3 Bert "nick" St Van Helsing ,Jr ' Out type 1a full middle = 'St Van Helsing Jr , John3 Bert "nick" ' Out type 1aa full middle = 'St Van Helsing Jr , John3 Bert nick ' Out type 1b middle initial = 'St Van Helsing Jr , John3 B "nick" ' Out type 1bb middle initial = 'St Van Helsing Jr , John3 B nick ' Out type 2 middle initial = 'St Van Helsing Jr , John3 B ' ======================================================= Found form 'First-Last (single)' Input (raw) = 'John4 Helsing ' Normal = 'John4 Helsing ' Out type 1a no middle = 'Helsing , John4 ' Out type 1aa no middle = 'Helsing , John4 ' Out type 2 = 'Helsing , John4 ' ======================================================= Found form 'First-Last (single)' Input (raw) = 'John5 (JJ) Helsing ' Normal = 'John5 (JJ) Helsing ' Out type 1a no middle = 'Helsing , John5 (JJ) ' Out type 1aa no middle = 'Helsing , John5 JJ ' Out type 2 = 'Helsing , John5 ' ======================================================= Found form 'First-Last (multi)' Input (raw) = 'John6 (JJ) Bert St Van Helsing ,Jr ' Normal = 'John6 (JJ) Bert St Van Helsing ,Jr ' Permutation 1: Out type 1a no middle = 'Bert St Van Helsing Jr , John6 (JJ) ' Out type 1aa no middle = 'Bert St Van Helsing Jr , John6 JJ ' Out type 2 = 'Bert St Van Helsing Jr , John6 ' Permutation 2: Out type 1a full middle = 'St Van Helsing Jr , John6 (JJ) Bert ' Out type 1aa full middle = 'St Van Helsing Jr , John6 JJ Bert ' Out type 1b middle initial = 'St Van Helsing Jr , John6 (JJ) B ' Out type 1bb middle initial = 'St Van Helsing Jr , John6 JJ B ' Out type 2 middle initial = 'St Van Helsing Jr , John6 B ' ======================================================= Found form 'First-Last (multi)' Input (raw) = 'John7 Bert St Van Helsing ,Jr ' Normal = 'John7 Bert St Van Helsing ,Jr ' Permutation 1: Out type 1a no middle = 'Bert St Van Helsing Jr , John7 ' Out type 1aa no middle = 'Bert St Van Helsing Jr , John7 ' Out type 2 = 'Bert St Van Helsing Jr , John7 ' Permutation 2: Out type 1a full middle = 'St Van Helsing Jr , John7 Bert ' Out type 1aa full middle = 'St Van Helsing Jr , John7 Bert ' Out type 1b middle initial = 'St Van Helsing Jr , John7 B ' Out type 1bb middle initial = 'St Van Helsing Jr , John7 B ' Out type 2 middle initial = 'St Van Helsing Jr , John7 B ' ======================================================= Form *** (unknown) Input (raw) = ' do(e)s not. match ,' Normal = ' do(e)s not match ,' =======================================================