RE2 and UTF-16 (or UCS-2)

RE2 is excellent. Fast and deterministic.

However, it only supports UTF-8. My strings are originally UTF-16, and converting back and forth will cost performance.

How difficult would it be to implement native UTF-16 support in RE2?

How difficult would it be to implement native UCS-2 support in RE2? (This should be easier.)

That is, how many hours would an average programmer need?

This has been bothering me for a couple of weeks, so I thought I would ask!

+4
2 answers

Russ Cox, the creator of RE2, was kind enough to publish a patch that restores UCS-2 support. However, some assertions are not supported in UCS-2 mode. Russ's reply is reproduced verbatim:

Hi. RE2 had a UCS-2 mode before I open-sourced it, but it did not support assertions like ^, $ and \b, which limited its usefulness. If you do not need those operators, then it would probably work for you. I do not plan to add the UCS-2 mode back to the RE2 sources, but I just posted a diff of the change that removed it. You should be able to apply the diff to a local copy to get UCS-2 support back. The file is ucs2.diff in the root of the Mercurial repository.

Enjoy.

Link to the code: http://code.google.com/p/re2/source/list

+5

Have you asked Russ Cox? His opinion would be directly relevant to your question. I bet the work involved is more than you would want to contemplate.

I really think that you are overestimating the cost of converting from ugly UTF-16 to regular UTF-8 and underestimating the cost of converting a highly tuned library.

Just bite the bullet and use UTF-8, just like the rest of us.

I am a big fan of RE2 myself, but it never crossed my mind to use it on UTF-16. UTF-16 just does not enter my world. As with any other encoding, anything we receive in UTF-16 is immediately converted to UTF-8 so that the entire tool chain can work with it, because we run the whole chain on pure UTF-8.
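To make "convert at the boundary" concrete, here is a minimal sketch of a UTF-16 to UTF-8 conversion in C++. The helper name Utf16ToUtf8 is hypothetical and is not part of RE2; replacing unpaired surrogates with U+FFFD is an assumption of this sketch, and you may prefer to reject such input instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of RE2): converts UTF-16 code units to UTF-8.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string Utf16ToUtf8(const std::u16string& in) {
  std::string out;
  out.reserve(in.size() * 3);  // worst case is 3 bytes per 16-bit code unit
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: combine with the following low surrogate, if any.
      if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
        ++i;
      } else {
        cp = 0xFFFD;
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD;  // low surrogate without a preceding high surrogate
    }
    // Encode the code point as 1 to 4 UTF-8 bytes.
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return out;
}
```

The resulting std::string can be handed to RE2 as ordinary UTF-8 input. The conversion is a single linear pass, but note that any match offsets you get back are byte offsets into the UTF-8 string, so they would need to be mapped back if you want positions in the original UTF-16 data.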

Perhaps you live in the opposite world?

+1

Source: https://habr.com/ru/post/1395157/