RegEx pattern for extracting authorization numbers

I am using gskinner's RegExr tool to help build a pattern that can find authorization numbers in a field that contains a lot of other garbage. An authorization number is a string containing digits (always), letters (sometimes) and hyphens (sometimes); that is, it always contains a digit somewhere, but does not always contain letters or hyphens. In addition, the authorization number can be located anywhere in the field I am searching.

Examples of valid authorization numbers include:

  5555834384734   ' All digits
  12110-AANM      ' Alpha plus digits, plus hyphens
  R-455545-AB-9   ' Alpha plus digits, plus multiple hyphens
  R-45-54A-AB-9   ' Alpha plus digits, plus multiple hyphens
  W892160         ' Alpha plus digits without hyphens

Here are some examples of data with extra garbage that is sometimes attached to a real authorization number by a hyphen or with no space at all, which makes it look like part of the number. The garbage does come in predictable forms/words, though: REF, CHEST, IP, AMB, OBV and HOLD, none of which are part of the authorization number.

  5557653700 IP
  R025257413-001
  REF 120407175
  SNK601M71016
  U0504124 AMB
  W892160
  019870270000000
  00Q926K2
  A025229563
  01615217 AMB
  12042-0148
  SNK601M71016
  12096NHP174
  12100-ACDE
  12110-AANM
  12114AD5QIP
  REF-34555
  3681869/OBV ONL

Here is the pattern I am using:

  "\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*" 

I am still learning regular expressions, so it can no doubt be improved. It works for the samples above, but not for the following cases:

  REFA5-208-4990IP   'Extract the string 'A5-208-4990' without REF or IP
  OBV1213110379      'Extract the string '1213110379' without the OBV
  5520849900AMB      'Extract the string '5520849900' without AMB
  5520849900CHEST    'Extract the string '5520849900' without CHEST
  5520849900-IP      'Extract the string '5520849900' without -IP
  1205310691-OBV     'Extract the string without the -OBV
  R-025257413-001    'Numbers of this form should also be allowed.
  NO PCT 93660       'If string contains the word NO anywhere, it is not a match
  HOLDA5-208-4990    'If string contains the word HOLD anywhere, it is not a match

Can anyone help?

For testing purposes, here is a Sub that creates a table with sample input:

  Sub CreateTestAuth()
      Dim dbs As Database
      Set dbs = CurrentDb
      With dbs
          .Execute "CREATE TABLE tbl_test_auth " _
              & "(AUTHSTR CHAR);"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('5557653700 IP');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "(' R025257413-001');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('REF 120407175');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('SNK601M71016');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('U0504124 AMB');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('3681869/OBV ONL');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('REFA5-208-4990IP');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('5520849900AMB');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('5520849900CHEST');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('5520849900-IP');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('1205310691-OBV');"
          .Execute " INSERT INTO tbl_test_auth " _
              & "(AUTHSTR) VALUES " _
              & "('HOLDA5-208-4990');"
          .Close
      End With
  End Sub
+2
5 answers

Your sample input file (the path to this file should be passed to the function GetMatches as inputFilePath):

 5557653700 IP
 R025257413-001
 REF 120407175
 SNK601M71016
 U0504124 AMB
 W892160
 019870270000000
 00Q926K2
 A025229563
 01615217 AMB
 12042-0148
 SNK601M71016
 12096NHP174
 12100-ACDE
 12110-AANM
 12114AD5QIP
 REF-34555
 3681869/OBV ONL

Here are the junk patterns, stored in a file (the path to this file should be passed to the function GetMatches as replaceDBPath):

 ^REF
 IP$
 ^OBV
 AMB$
 CHEST$
 -OBV$
 ^.*(NO|HOLD).*$

And here is the .bas module:

 Option Explicit
 'This example uses the following references:
 'Microsoft VBScript Regular Expressions 5.5 and Microsoft Scripting Runtime
 'The numeric line labels are kept on purpose: Erl reports them in the error handlers.
 Private fso As New Scripting.FileSystemObject
 Private re As New VBScript_RegExp_55.RegExp

 'Reads the junk-pattern file and returns one pattern per non-empty line.
 Private Function GetJunkList(fpath$) As String()
 0   On Error GoTo errHandler
 1   If fso.FileExists(fpath) Then
 2       Dim junkList() As String, mts As MatchCollection, mt As Match, pos&, tmp$
 3       tmp = fso.OpenTextFile(fpath).ReadAll()
 4       With re
 5           .Global = True
 6           .MultiLine = True
 7           .Pattern = "[^\r\n]+"
 8           Set mts = .Execute(tmp)
 9           ReDim junkList(mts.Count - 1)
 10          For Each mt In mts
 11              junkList(pos) = mt.Value
 12              pos = pos + 1
 13          Next mt
 14      End With
 15      GetJunkList = junkList
 16  Else
 17      MsgBox "File not found at:" & vbCr & fpath
 18  End If
 19  Exit Function
 errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
               .Description & "' occurred in " & _
               "Function<GetJunkList>" & IIf(Erl <> 0, " at line # " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
 End Function

 'Strips every junk pattern from the input file, then returns all matches.
 Public Function GetMatches(replaceDBPath$, inputFilePath$) As String()
 0   On Error GoTo errHandler
 1   Dim junks() As String, junkPat$, tmp$, results() As String, pos&, mts As MatchCollection, mt As Match
 2   junks = GetJunkList(replaceDBPath)
 3   tmp = fso.OpenTextFile(inputFilePath).ReadAll

 5   With re
 6       .Global = True
 7       .MultiLine = True
 8       .IgnoreCase = True
 9       For pos = LBound(junks) To UBound(junks)
             'read the junk pattern first, then assign it to the RegExp
 10          junkPat = junks(pos)
 11          .Pattern = junkPat
             'replace junk with an empty string
 13          tmp = .Replace(tmp, "")
 14      Next pos

         'trim lines [if all input data in one line]
 17      .Pattern = "^[ \t]*|[ \t]*$"
 18      tmp = .Replace(tmp, "")

         'create array using provided pattern
 21      pos = 0
 22      .Pattern = "\b[a-z]*[\d]+\-*\d*[a-z0-9]*\b"
 23      Set mts = .Execute(tmp)
 24      ReDim results(mts.Count - 1)
 25      For Each mt In mts
 26          results(pos) = mt.Value
 27          pos = pos + 1
 28      Next mt
 29  End With

 31  GetMatches = results
 32  Exit Function
 errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
               .Description & "' occurred in " & _
               "Function<GetMatches>" & IIf(Erl <> 0, " at line # " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
 End Function

And a sample tester

 Public Sub tester()
     Dim samples() As String, s
     samples = GetMatches("C:\Documents and Settings\Cylian\Desktop\junks.lst", "C:\Documents and Settings\Cylian\Desktop\sample.txt")
     For Each s In samples
         MsgBox s
     Next
 End Sub

It can be called from the Immediate window:

 tester 

Hope this helps.
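
For readers who want to try the flow without setting up the two files, here is a minimal Python sketch of the same idea: strip each junk pattern line by line, trim, then collect the matches. It is an illustration rather than a port of the VBA module. The names `JUNK_PATTERNS`, `AUTH` and `get_matches` are made up for this sketch, and Python's `re` module stands in for the VBScript RegExp object, so small flavor differences are possible.

```python
import re

# Junk patterns from the list above, applied in the same order.
JUNK_PATTERNS = [
    r"^REF", r"IP$", r"^OBV", r"AMB$", r"CHEST$", r"-OBV$",
    r"^.*(NO|HOLD).*$",
]

# The extraction pattern, with the letter class written as [a-z]
# and matched case-insensitively.
AUTH = re.compile(r"\b[a-z]*\d+-*\d*[a-z0-9]*\b", re.IGNORECASE)


def get_matches(text):
    """Remove junk from every line of text, then return all matches."""
    for junk in JUNK_PATTERNS:
        text = re.sub(junk, "", text, flags=re.IGNORECASE | re.MULTILINE)
    # Trim leading and trailing blanks on each line.
    text = re.sub(r"^[ \t]*|[ \t]*$", "", text, flags=re.MULTILINE)
    return [m.group(0) for m in AUTH.finditer(text)]


sample = "\n".join([
    "OBV1213110379",
    "5520849900AMB",
    "5520849900-IP",
    "1205310691-OBV",
    "U0504124 AMB",
    "NO PCT 93660",
])
print(get_matches(sample))
```

One thing this sketch makes visible: because the pattern allows only a single run of hyphens, a value such as A5-208-4990 comes back as two separate matches (A5-208 and 4990), so multi-hyphen numbers need a different extraction pattern.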

0

Well, at first I thought that the additional requirement would make the regex much longer.
But with a positive lookahead, it is actually almost the same size. Regex only this time:
\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b

Or broken down with comments (ignore whitespace):

 \b                   # Word start
 (?=.*\d)             # A number has to follow somewhere after this point
 (                    # Start capture group
   [a-z0-9]+          # At least one alphanum
   (?:-[a-z0-9]+)*    # Possibly more attached with hyphen
 )                    # End capture group
 \b                   # Word end

Please note that variable-width lookarounds are not supported by all regular expression flavors. I do not know about VBA.

Second note: the lookahead (?=...) is also satisfied if the digit appears after the end of the word. So in DONT-RECOGNIZE-ME no-1-5ay-yes,
the first word (DONT-RECOGNIZE-ME) would be matched too, even though it contains no digit.
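
The second note can be checked directly. The following Python sketch first reproduces the unwanted match, then tries one possible adjustment, which is an assumption of this sketch rather than part of the answer: restrict the lookahead to characters that can belong to the same token, so that the digit has to occur inside the word being matched. Python's `re` is used only for illustration, and the names `LOOSE` and `STRICT` are made up.

```python
import re

text = "DONT-RECOGNIZE-ME no-1-5ay-yes"

# Pattern from the answer: the lookahead may find its digit in a later word.
LOOSE = re.compile(r"\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)

# Hypothetical variant: the lookahead may only cross letters, digits and hyphens.
STRICT = re.compile(
    r"\b(?=[a-z0-9-]*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE
)

loose_hits = LOOSE.findall(text)    # includes the digit-free first word
strict_hits = STRICT.findall(text)  # keeps only the token that contains a digit
print(loose_hits)
print(strict_hits)
```

The variant still uses only a lookahead, so it should not need any feature beyond what the original pattern already requires.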

+1

The \b at the start is the problem. Some spaces and dashes also need to be taken care of. Try this: " [a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]* ". Run this only on the authorization numbers.
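
One detail to keep in mind when adapting the pattern above: inside a character class, `|` is a literal pipe rather than alternation, and `\b` stands for the backspace character rather than a word boundary. The short Python check below illustrates both points. Python's `re` module is used only for illustration, and it is an assumption here that the VBScript engine treats character classes the same way.

```python
import re

# Inside [...], the pipe is an ordinary character, not alternation.
pipe_is_literal = re.fullmatch(r"[a|b]", "|") is not None

# Inside [...], \b means backspace (\x08), so [\b]* never asserts a boundary.
backspace_in_class = re.fullmatch(r"[\b]", "\x08") is not None

# Outside a class, \b is the zero-width word-boundary assertion.
boundary_outside = re.search(r"\b4990\b", "A5-208-4990 IP") is not None

print(pipe_is_literal, backspace_in_class, boundary_outside)
```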

0

I would use a two-step approach because of this extra filtering.

 // Requires: using System; using System.Linq; using System.Text.RegularExpressions;
 var splitter = new Regex(@"[\t\n\r]+", RegexOptions.Multiline);
 const string INPUT = @"REFA5-208-4990IP
  OBV1213110379
  5520849900AMB
  5520849900CHEST
  5520849900-IP
  1205310691-OBV
  R-025257413-001
  NO PCT 93660
  HOLDA5-208-4990";
 string[] lines = splitter.Split(INPUT);

 // Step 1: lines containing a blacklisted word are dropped entirely.
 var blacklist = new[] { "NO", "HOLD" };
 // Step 2: these words are removed from the remaining lines.
 var ignores = new[] { "REF", "IP", "CHEST", "AMB", "OBV" };

 var filtered = from line in lines
                where blacklist.All(black => line.IndexOf(black) < 0)
                select ignores.Aggregate(line, (acc, remove) => acc.Replace(remove, ""));

 var authorization = new Regex(@"\b([a-z0-9]+(?:-[a-z0-9]+)*)\b", RegexOptions.IgnoreCase);

 foreach (string s in filtered)
 {
     Console.Write("'{0}' ==> ", s);
     var match = authorization.Match(s);
     if (match.Success)
     {
         Console.Write(match.Value);
     }
     Console.WriteLine();
 }

This prints:

 'A5-208-4990' ==> A5-208-4990
 ' 1213110379' ==> 1213110379
 ' 5520849900' ==> 5520849900
 ' 5520849900' ==> 5520849900
 ' 5520849900-' ==> 5520849900
 ' 1205310691-' ==> 1205310691
 ' R-025257413-001' ==> R-025257413-001
0

Sometimes it is easier to loosen up a little rather than stick rigidly to one approach. :)

Try the following:

1 - add this function

 Public Function RemoveJunk(ByVal inputValue As String, ParamArray junkWords() As Variant) As String
     Dim junkWord
     For Each junkWord In junkWords
         inputValue = Replace(inputValue, junkWord, "", , , vbBinaryCompare)
     Next
     RemoveJunk = inputValue
 End Function

2 - Now your task is simple. See the example below on how to use it:

 Sub Sample()
     Dim theText As String
     theText = " REFA5-208-4990IP blah blah "
     theText = RemoveJunk(theText, "-REF", "REF", "-IP", "IP", "-OBV", "OBV") '<-- complete this in a similar way
     Debug.Print theText
     '' -- now apply the regexp here --
 End Sub

Completing the call to the RemoveJunk function is the slightly tricky part: place longer junk words before shorter ones. For example, "-OBV" should appear before "OBV".

Try it and see if it solves your problem.
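
To illustrate why longer junk words should come first, here is a small Python sketch of the same replace-in-sequence idea. The function `remove_junk` is a hypothetical stand-in for the VBA RemoveJunk above, and sorting by length in `remove_junk_longest_first` is just one possible way to avoid ordering the list by hand.

```python
def remove_junk(value, *junk_words):
    """Remove each junk word from value, in the order given (case-sensitive)."""
    for word in junk_words:
        value = value.replace(word, "")
    return value


def remove_junk_longest_first(value, *junk_words):
    """Same as remove_junk, but tries longer junk words before shorter ones."""
    return remove_junk(value, *sorted(junk_words, key=len, reverse=True))


good = remove_junk("1205310691-OBV", "-OBV", "OBV")   # hyphen removed with OBV
bad = remove_junk("1205310691-OBV", "OBV", "-OBV")    # hyphen left behind
auto = remove_junk_longest_first("1205310691-OBV", "OBV", "-OBV")
print(repr(good), repr(bad), repr(auto))
```

With the shorter word first, "OBV" is removed before "-OBV" gets a chance to match, which leaves a stray trailing hyphen.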

0

Source: https://habr.com/ru/post/916897/