PCRE does not match UTF-8 characters

I am compiling a PCRE pattern with the UTF-8 flag turned on and trying to match a UTF-8 char* string against it, but it does not match and pcre_exec returns a negative value. I pass 65 as the subject length to pcre_exec, which is the number of characters in the string. I believe it expects the number of bytes, so I tried increasing the argument to 70, but I still get the same result. I do not know what else makes the match fail. Please help before I shoot myself.

(If I try without the PCRE_UTF8 flag, it does match, but element [1] of the offset vector is 30, which is the index of the character immediately before the Unicode character in my input line.)

    #include "stdafx.h"
    #include "pcre.h"
    #include <pcre.h>    /* PCRE lib      NONE */
    #include <stdio.h>   /* I/O lib       C89  */
    #include <stdlib.h>  /* Standard Lib  C89  */
    #include <string.h>  /* Strings       C89  */
    #include <iostream>

    int main(int argc, char *argv[])
    {
        pcre *reCompiled;
        int pcreExecRet;
        int subStrVec[30];
        const char *pcreErrorStr;
        int pcreErrorOffset;
        char* aStrRegex = "(\\?\\w+\\?\\s*=)?\\s*(call|exec|execute)\\s+(?<spName>\\w+)("
            // params can be an empty pair of parenthesis or have parameters inside them as well.
            "\\(\\s*(?<params>[?\\w,]+)\\s*\\)"
            // paramList along with its parenthesis is optional below so a SP call can be just "exec sp_name" for a stored proc call without any parameters.
            ")?";

        reCompiled = pcre_compile(aStrRegex, 0, &pcreErrorStr, &pcreErrorOffset, NULL);
        if(reCompiled == NULL) {
            printf("ERROR: Could not compile '%s': %s\n", aStrRegex, pcreErrorStr);
            exit(1);
        }

        char* line = "?rt?=call SqlTxFunctionTesting(?înFîéld?,?outField?,?inOutField?)";
        pcreExecRet = pcre_exec(reCompiled,
                                NULL,
                                line,
                                65,          // length of string
                                0,           // Start looking at this point
                                0,           // OPTIONS
                                subStrVec,
                                30);         // Length of subStrVec
        printf("\nret=%d",pcreExecRet);
        //int substrLen = pcre_get_substring(line, subStrVec, pcreExecRet, 1, &mantissa);
    }
1 answer

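1) First check what the compiler actually stored for a non-ASCII literal:

    const char *q = "î";
    printf("%d, %s", q[0], q);

Output:

    63, ?

The character has already been replaced by '?' (code 63) at compile time, so the string handed to PCRE is not UTF-8 in the first place.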
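The check in point 1 can be taken a step further. The following is a minimal sketch, not part of the original answer: it writes the non-ASCII characters of the question's input as explicit UTF-8 byte escapes, so the result does not depend on how the source file is saved, and it compares the byte length with the character count. The names utf8_code_points, hex_bytes and kLine are made up for illustration, and the counter assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// The question's subject line, with î (C3 AE) and é (C3 A9) spelled out
// as UTF-8 bytes so the compiler's source charset cannot alter them.
const char kLine[] =
    "?rt?=call SqlTxFunctionTesting(?\xC3\xAEnF\xC3\xAE\xC3\xA9ld?,"
    "?outField?,?inOutField?)";

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const char *s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Returns the bytes of s as space-separated hex, e.g. "C3 AE".
std::string hex_bytes(const char *s) {
    assert(s != nullptr);
    std::string out;
    char buf[4];
    for (; *s != '\0'; ++s) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned char>(*s));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

With this input, std::strlen(kLine) is 68 while utf8_code_points(kLine) is 65. The 8-bit pcre_exec takes its length in bytes, so 68 (or simply strlen) is the value to pass; 70 runs past the end of the string.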

2) Rebuild PCRE with PCRE_BUILD_PCRE16 (or PCRE_BUILD_PCRE32) and PCRE_SUPPORT_UTF enabled, and link against pcre16.lib and/or pcre16.dll. Then you can try this code:

    pcre16 *reCompiled;
    int pcreExecRet;
    int subStrVec[30];
    const char *pcreErrorStr;
    int pcreErrorOffset;
    // Casting wchar_t* to PCRE_SPTR16 assumes a 16-bit wchar_t (as on Windows).
    const wchar_t* aStrRegex = L"(\\?\\w+\\?\\s*=)?\\s*(call|exec|execute)\\s+(?<spName>\\w+)("
        // params can be an empty pair of parentheses or have parameters inside them as well.
        L"\\(\\s*(?<params>[?,\\w\\p{L}]+)\\s*\\)"
        // paramList along with its parentheses is optional below so a SP call can be just "exec sp_name" for a stored proc call without any parameters.
        L")?";

    reCompiled = pcre16_compile((PCRE_SPTR16)aStrRegex,
                                PCRE_UTF16,   // same value as PCRE_UTF8
                                &pcreErrorStr, &pcreErrorOffset, NULL);
    if(reCompiled == NULL) {
        // The pattern is a wide string, so report the offset instead of printing it with %s.
        printf("ERROR: Could not compile pattern at offset %d: %s\n", pcreErrorOffset, pcreErrorStr);
        exit(1);
    }

    const wchar_t* line = L"?rt?=call SqlTxFunctionTesting( ?inField?,?outField?,?inOutField?,?fd? )";
    pcreExecRet = pcre16_exec(reCompiled,
                              NULL,
                              (PCRE_SPTR16)line,
                              wcslen(line),  // length of string, in 16-bit units
                              0,             // Start looking at this point
                              0,             // OPTIONS
                              subStrVec,
                              30);           // Length of subStrVec
    printf("\nret=%d",pcreExecRet);

    for (int i=0;i<pcreExecRet;i++){
        // pcre16_get_substring allocates the result itself, so no buffer is needed up front.
        PCRE_SPTR16 mantissa = NULL;
        int substrLen = pcre16_get_substring((PCRE_SPTR16)line, subStrVec, pcreExecRet, i, &mantissa);
        if (substrLen >= 0) {
            wprintf(L"\nret string=%ls, length=%i\n", (const wchar_t*)mantissa, substrLen);
            pcre16_free_substring(mantissa);
        }
    }
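One thing the code above relies on is that the length handed to pcre16_exec is counted in 16-bit units, and wcslen only gives that number where wchar_t is 16 bits wide. The sketch below is an illustration rather than part of the original answer: it uses char16_t literals, which are an assumption swapped in here because their unit size does not depend on the platform. The names utf16_units, kParam16 and kAstral16 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The accented parameter name from the question as a UTF-16 literal.
const char16_t kParam16[] = u"?\u00EEnF\u00EE\u00E9ld?";

// A character outside the Basic Multilingual Plane, stored as a surrogate pair.
const char16_t kAstral16[] = u"\U0001F600";

// Length of a NUL-terminated UTF-16 string in 16-bit code units,
// which is the unit a 16-bit regex API counts lengths and offsets in.
std::size_t utf16_units(const char16_t *s) {
    assert(s != nullptr);
    return std::char_traits<char16_t>::length(s);
}
```

Here utf16_units(kParam16) is 9, one unit per character, whereas the single character in kAstral16 occupies 2 units. Lengths and offsets counted this way therefore match character counts only while the text stays inside the Basic Multilingual Plane.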

3) \w = [0-9A-Z_a-z] by default. It does not match non-ASCII letters such as î.
4) This may really help: http://answers.oreilly.com/topic/215-how-to-use-unicode-code-points-properties-blocks-and-scripts-in-regular-expressions/
5) From the PCRE 8.33 source (pcre_exec.c:2251):

    /* Find out if the previous and current characters are "word" characters.
    It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
    be "non-word" characters. Remember the earliest consulted character for
    partial matching. */
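Points 3 and 5 can be illustrated without PCRE at all. The sketch below is a simplified model of the default ASCII word class, not PCRE's actual implementation (which works from character tables), and the names is_ascii_word and params_class_run are made up. It walks a string byte by byte and reports how far the question's class [?\w,] would get. PCRE also documents a PCRE_UCP compile option that makes \w use Unicode properties; whether a given build supports it is something to verify.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of the default \w class: [0-9A-Za-z_].
bool is_ascii_word(unsigned char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '_';
}

// Number of leading bytes of s accepted by the class [?\w,]
// used in the <params> group of the question's pattern.
std::size_t params_class_run(const char *s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0') {
        unsigned char c = static_cast<unsigned char>(s[n]);
        if (c != '?' && c != ',' && !is_ascii_word(c)) break;
        ++n;
    }
    return n;
}
```

For the ASCII text "?inField?,?outField?" the run covers all 20 bytes, so the closing parenthesis can follow. For "?\xC3\xAEnF\xC3\xAE\xC3\xA9ld?" (the UTF-8 bytes of ?înFîéld?) it stops after the first byte, because 0xC3 is not in the class. The optional parameter group then cannot match, which is consistent with the match in the question ending at offset 30.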

Source: https://habr.com/ru/post/1502821/

