Is there a fast way to convert a multibyte character string to a Unicode wstring?

In my project I use the Aho-Corasick algorithm to implement a server-side message filter module. The messages the server receives are multibyte character strings. After several tests, I found that the bottleneck is the conversion between the multibyte string and the Unicode wstring. I currently use a pair of mbstowcs_s and wcstombs_s calls, which accounts for almost 95% of the whole module's cost. I also tried MultiByteToWideChar / WideCharToMultiByte and got the same result. Is there a more efficient way to do this? My project is built with VS2005, and the converted strings will contain Chinese characters. Many thanks.

+3
4 answers

There are several possibilities.

First, what do you mean by "multibyte character"? Do you mean UTF-8 or an ISO DBCS encoding?

If you look at the definitions of UTF-8 and UTF-16, you can write a highly optimized conversion by extracting the payload ("x") bits and repacking them. For example, http://www.faqs.org/rfcs/rfc2044.html talks about UTF-8 <==> UTF-32. Adapting that to UTF-16 is easy.
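As a rough illustration of the bit-repacking idea, here is a minimal sketch of a hand-rolled UTF-8 to UTF-16 decoder. It assumes the incoming bytes really are UTF-8 (the question does not say which encoding is used), and the function name `utf8_to_utf16` is hypothetical. It uses `std::u16string`, which needs a C++11 compiler and so is newer than VS2005; on Windows, where `wchar_t` is 16 bits, `std::wstring` could be substituted. Malformed sequences are replaced with U+FFFD, which is one possible policy among several.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Each malformed byte is replaced by U+FFFD.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more units than UTF-8 has bytes
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // code point being assembled
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        std::size_t extra = 0;     // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min_cp = 0x10000; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray continuation or invalid lead
        ++i;
        bool ok = (n - i >= extra);
        for (std::size_t k = 0; ok && k < extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;   // not a 10xxxxxx byte
            else cp = (cp << 6) | (b & 0x3F);     // append the 6 payload bits
        }
        // Reject truncated, overlong, surrogate, and out-of-range sequences.
        if (!ok || cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            continue;  // only the lead byte was consumed; resync on the next byte
        }
        i += extra;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Whether this actually beats the CRT or Win32 routines would have to be measured. The likely savings come from skipping locale lookups and from reserving the output buffer once.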

The second option may be to work entirely in UTF-16. Serve your web page (or UI dialog box, or whatever else) in UTF-16 and receive the user's input that way.

, , Aho-Corasick. , , .

[ 29 2010] . http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt, C mbtowc() wctomb(). wchar_ts. 16- wchar_ts, .

, ( ) .

+1

( ), (mbstowcs wcstombs). , . , (, a-z, 0-9), .?

0

Perhaps you can reduce the number of calls to MultiByteToWideChar?

0

Perhaps you can also adapt Aho-Corasick to work directly on multibyte strings.
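To make that suggestion concrete, here is a minimal sketch of an Aho-Corasick automaton whose alphabet is the 256 possible byte values, so both the patterns and the incoming message stay in their multibyte form and no conversion is needed. The class name `ByteAhoCorasick` and the method `contains_any` are hypothetical, and the sketch assumes the patterns are stored in the same encoding as the messages.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical byte-level Aho-Corasick matcher (dense 256-way transitions).
class ByteAhoCorasick {
public:
    explicit ByteAhoCorasick(const std::vector<std::string>& patterns) {
        nodes_.emplace_back();  // root
        for (const std::string& p : patterns) {
            if (p.empty()) continue;  // ignore empty patterns
            int cur = 0;
            for (char ch : p) {
                const unsigned char c = static_cast<unsigned char>(ch);
                if (nodes_[cur].next[c] < 0) {
                    nodes_[cur].next[c] = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                }
                cur = nodes_[cur].next[c];
            }
            nodes_[cur].terminal = true;
        }
        build_links();
    }

    // True if any pattern occurs somewhere in text (byte-wise comparison).
    bool contains_any(const std::string& text) const {
        int state = 0;
        for (char ch : text) {
            state = nodes_[state].next[static_cast<unsigned char>(ch)];
            if (nodes_[state].terminal) return true;
        }
        return false;
    }

private:
    struct Node {
        std::array<int, 256> next;
        int fail;
        bool terminal;
        Node() : fail(0), terminal(false) { next.fill(-1); }
    };

    // Breadth-first pass: set failure links and fill in missing transitions
    // so that matching needs exactly one table lookup per input byte.
    void build_links() {
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            const int v = nodes_[0].next[c];
            if (v < 0) nodes_[0].next[c] = 0;
            else { nodes_[v].fail = 0; q.push(v); }
        }
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            for (int c = 0; c < 256; ++c) {
                const int v = nodes_[u].next[c];
                const int f = nodes_[nodes_[u].fail].next[c];
                if (v < 0) {
                    nodes_[u].next[c] = f;
                } else {
                    nodes_[v].fail = f;
                    // A node also matches if its failure target matches.
                    if (nodes_[f].terminal) nodes_[v].terminal = true;
                    q.push(v);
                }
            }
        }
    }

    std::vector<Node> nodes_;
};
```

One caveat: with UTF-8 input a byte-level match always falls on character boundaries, because lead and continuation bytes use disjoint ranges. With a double-byte code page such as GBK, a trail byte can have the same value as a lead byte or an ASCII character, so a match can start in the middle of a character. In that case the matcher would need to track character boundaries as well. The dense table costs about 1 KB per node, which trades memory for lookup speed.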

0

Source: https://habr.com/ru/post/1730179/

