Unicode regular expression in string

I am working in C# on an OCR project and have extracted text that I now need to parse using regular expressions.

 string checkNum;
 string routingNum;
 string accountNum;

 Regex regEx = new Regex(@"\u9288\d+\u9288");
 Match match = regEx.Match(numbers);
 if (match.Success)
     checkNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

 regEx = new Regex(@"\u9286\d{9}\u9286");
 match = regEx.Match(numbers);
 if (match.Success)
     routingNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

 regEx = new Regex(@"\d{10}\u9288");
 match = regEx.Match(numbers);
 if (match.Success)
     accountNum = match.Value.Remove(match.Value.Length - 1, 1);

The problem is that the string does contain the expected Unicode characters when I call .ToCharArray() and inspect the contents, but the regular expressions never match them when I parse the string. I thought strings in C# were Unicode by default.

+4
3 answers

I figured it out. I was using the decimal values instead of the hexadecimal code points. In other words, instead of \u9288 and \u9286 I should have used \u2448 and \u2446 (9288 decimal is 0x2448, and 9286 decimal is 0x2446). http://www.ssec.wisc.edu/~tomw/java/unicode.html#x2440
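The conversion can be sketched in a few lines. This is only an illustration: the class and method names (CodePointDemo, ToRegexEscape, Between) are hypothetical, and the helper assumes the character lies in the Basic Multilingual Plane, so its value fits in four hex digits.

```csharp
using System;
using System.Text.RegularExpressions;

static class CodePointDemo
{
    // Turns a decimal character value (what (int)ch shows in the debugger)
    // into the \uXXXX escape that Regex expects, which is hexadecimal.
    // Assumption: the value is at most 0xFFFF.
    public static string ToRegexEscape(int decimalValue)
    {
        return @"\u" + decimalValue.ToString("X4");
    }

    // Hypothetical helper: returns the digits found between two copies of the
    // delimiter character, or an empty string when nothing matches.
    public static string Between(string input, int delimiterDecimalValue)
    {
        string d = ToRegexEscape(delimiterDecimalValue);
        Match m = Regex.Match(input, d + @"(\d+)" + d);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For example, CodePointDemo.ToRegexEscape(9288) yields the escape \u2448, which is the form the pattern in the question needed.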

Thanks guys for leading me in the right direction.

+3

This line:

 match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1); 

throws an exception because the string returned by the first Remove is one character shorter than the original, so the start index match.Value.Length - 1 passed to the second Remove leaves no character to remove.
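A small sketch makes the failure concrete. The names RemoveDemo, ChainedRemoveThrows and StripDelimiters are hypothetical, and StripDelimiters assumes the value has at least two characters.

```csharp
using System;

static class RemoveDemo
{
    // Reproduces the chained call from the question and reports whether it throws.
    public static bool ChainedRemoveThrows(string value)
    {
        try
        {
            // Remove(0, 1) returns a string of length value.Length - 1, so
            // removing one character at index value.Length - 1 is out of range.
            string unused = value.Remove(0, 1).Remove(value.Length - 1, 1);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    // One way to drop the first and last character without the off-by-one.
    // Assumption: value.Length >= 2.
    public static string StripDelimiters(string value)
    {
        return value.Substring(1, value.Length - 2);
    }
}
```

Substring is used here only to show a version with consistent indices; capture groups avoid the index arithmetic altogether.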

I suggest you use groups to extract the value. Example:

 Regex regEx = new Regex(@"\u9288(\d+)\u9288");
 Match match = regEx.Match(numbers);
 if (match.Success)
     checkNum = match.Groups[1].Value;

With this, I can correctly extract the values.
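Putting the group-based approach together with the hexadecimal escapes \u2448 and \u2446 (the hex forms of 9288 and 9286), all three fields from the question could be extracted roughly as follows. This is a sketch under assumptions: the MicrParser class and its method names are hypothetical, and the field layout is taken from the patterns in the question rather than from any specification.

```csharp
using System;
using System.Text.RegularExpressions;

static class MicrParser
{
    // Patterns mirror the ones in the question, with hex escapes and capture groups.
    static readonly Regex CheckPattern = new Regex(@"\u2448(\d+)\u2448");
    static readonly Regex RoutingPattern = new Regex(@"\u2446(\d{9})\u2446");
    static readonly Regex AccountPattern = new Regex(@"(\d{10})\u2448");

    // Returns the first capture group of the first match, or an empty string.
    static string FirstGroup(Regex pattern, string input)
    {
        Match match = pattern.Match(input);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static string CheckNumber(string numbers)
    {
        return FirstGroup(CheckPattern, numbers);
    }

    public static string RoutingNumber(string numbers)
    {
        return FirstGroup(RoutingPattern, numbers);
    }

    public static string AccountNumber(string numbers)
    {
        return FirstGroup(AccountPattern, numbers);
    }
}
```

Because the digits come from Groups[1], no Remove calls are needed to trim the delimiter characters.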

+1

Strings in .NET are encoded as UTF-16.

In addition, regex engines match Unicode characters, not Unicode code values. See this post.

0

Source: https://habr.com/ru/post/1309783/

