Fast, optimized UTF8 decoding

Do you know the fastest way to encode and decode UTF8, given extra knowledge about the content? Here are some interesting cases that come up for me:

Serialization

I just want to round-trip an opaque buffer without validation, so I can decode it again later. Presumably the fastest way is to grab the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
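
For reference, a minimal sketch of the safe baseline using the standard text API; the question is how much of this work can be skipped:

    import Data.ByteString (ByteString)
    import Data.Text (Text)
    import qualified Data.Text.Encoding as TE

    -- Safe baseline: encodeUtf8 copies the bytes out; decodeUtf8 copies
    -- them back in and validates, throwing on malformed UTF-8.
    serialize :: Text -> ByteString
    serialize = TE.encodeUtf8

    deserialize :: ByteString -> Text
    deserialize = TE.decodeUtf8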

Probably ASCII

I assume that 99% of the time my UTF8 is actually ASCII, so it makes sense to do a first pass confirming that, and only fall back to further processing if the check fails.
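
A sketch of that fast path (the name decodeMostlyAscii is made up here): bytes below 0x80 decode identically under Latin-1 and UTF-8, so if the scan finds only ASCII, the cheap decodeLatin1 path gives the same result as a full UTF-8 decode:

    import Data.Text (Text)
    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE
    import qualified Data.Text.Encoding.Error as TEE

    -- First pass: scan for any byte >= 0x80. If there is none, the input
    -- is pure ASCII, and decodeLatin1 (a plain widening copy, no UTF-8
    -- state machine) is a valid shortcut.
    decodeMostlyAscii :: B.ByteString -> Text
    decodeMostlyAscii bs
      | B.all (< 0x80) bs = TE.decodeLatin1 bs
      | otherwise         = TE.decodeUtf8With TEE.lenientDecode bs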

Probably not ASCII

The converse of the previous case.

Probably short

A JSON key or a database key, which I expect to be 1 to 20 characters long. It would be foolish to pay the upfront cost of something like a vectorized SIMD approach.

Probably long

An HTML document. It is worth paying some upfront cost for maximum throughput.

There are a few more variations along the same lines, for example encoding JSON or a URL when you believe there are probably no characters that need escaping; a sketch of that trick follows.
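
As an illustration, a hedged sketch of that trick for emitting a JSON string; the names needsEscape and encodeJsonString are made up, and a production version would use a builder for the slow path:

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as BC
    import Data.Word (Word8)
    import Text.Printf (printf)

    -- One cheap scan; if nothing needs escaping, the bytes go out verbatim.
    needsEscape :: Word8 -> Bool
    needsEscape w = w < 0x20 || w == 0x22 || w == 0x5C  -- controls, quote, backslash

    encodeJsonString :: B.ByteString -> B.ByteString
    encodeJsonString bs
      | B.any needsEscape bs = "\"" <> B.concatMap esc bs <> "\""  -- slow path
      | otherwise            = "\"" <> bs <> "\""                  -- fast path: copy only
      where
        esc 0x22 = "\\\""
        esc 0x5C = "\\\\"
        esc w | w < 0x20  = BC.pack (printf "\\u%04x" w)
              | otherwise = B.singleton w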

I am asking this question under the [haskell] tag, since Haskell's strong typing makes some approaches easy that are, let's say, hard to implement in C. In addition, there may be GHC-specific tricks, such as using Intel SSE4 instructions, that would be interesting. But this is really a UTF8 problem in general, and good ideas would be useful in any language.

Update

After some research, I suggest implementing encode and decode for serialization purposes as follows:

    import Data.ByteString (ByteString)
    import Data.Text (Text)
    import Unsafe.Coerce (unsafeCoerce)

    myEncode :: Text -> ByteString
    myEncode = unsafeCoerce

    myDecode :: ByteString -> Text
    myDecode = unsafeCoerce

This is a great idea if you enjoy segfaults ...
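
For the record, a variant that still skips validation but cannot segfault is to copy the raw bytes into a fresh Text buffer. This is only a sketch, and it assumes text >= 2.0, where Text is UTF-8 internally (with the UTF-16 representation of older text versions, reinterpreting the bytes is simply wrong); it still yields garbage Text if the input was not valid UTF-8:

    {-# LANGUAGE BangPatterns #-}
    import Data.Text (Text)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as B
    import qualified Data.Text.Array as A
    import qualified Data.Text.Internal as TI

    -- Copy the bytes into a fresh Text array with no validation at all.
    -- Garbage in, garbage out, but no segfault.
    unvalidatedDecode :: B.ByteString -> Text
    unvalidatedDecode bs = TI.Text arr 0 len
      where
        len = B.length bs
        arr = A.run $ do
          marr <- A.new len
          let go !i
                | i >= len  = pure ()
                | otherwise = do
                    A.unsafeWrite marr i (B.unsafeIndex bs i)
                    go (i + 1)
          go 0
          pure marr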

+5
1 answer

This question involves a wide range of issues. I will interpret it as "In Haskell, how do I convert between Unicode and other character encodings?"

In Haskell, the recommended way to convert to and from Unicode is the text-icu package, which provides these basic functions:

    fromUnicode :: Converter -> Text -> ByteString
    toUnicode :: Converter -> ByteString -> Text

text-icu is a binding to the International Components for Unicode libraries, which do the hard work of, in particular, encoding and decoding character sets other than Unicode. Its website provides documentation on conversion in general and some specifics on how its converters are implemented. Note that different character sets require somewhat different converter implementations.
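
A hedged usage sketch of that API, assuming text-icu's Data.Text.ICU.Convert module, where open takes a converter name and an optional fallback flag:

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8 as BC
    import qualified Data.Text.ICU.Convert as ICU

    -- Open a converter by charset name and round-trip some text through it.
    main :: IO ()
    main = do
      conv <- ICU.open "shift_jis" Nothing  -- Nothing: default fallback behaviour
      let bytes = ICU.fromUnicode conv "konnichiwa"
      BC.putStrLn bytes
      print (ICU.toUnicode conv bytes)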

ICU can also attempt to automatically detect the character set of the input: "This is, at best, an inexact operation using statistics and heuristics." That is not a characteristic any other implementation can "fix". The Haskell bindings do not expose this functionality as of this writing; see issue #8.

I am not aware of any character set conversion routines written in native Haskell. As the ICU documentation points out, there is a lot of complexity; after all, this is a rich area of international computing history.

Performance

As the ICU FAQ succinctly notes, "Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint." Although that comment is not specific to conversions, I would expect it to hold here as well. Has your experience been otherwise?
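
One way to check that expectation against the cases in the question is a quick microbenchmark; this is a sketch assuming the criterion package and the plain text decoder:

    import Criterion.Main
    import qualified Data.ByteString.Char8 as BC
    import qualified Data.Text.Encoding as TE

    -- Compare decoding a long ASCII buffer against a short key-sized one.
    main :: IO ()
    main = do
      let longAscii = BC.replicate 100000 'a'  -- "probably long", pure ASCII
          shortKey  = BC.pack "user_id"        -- "probably short"
      defaultMain
        [ bench "long ascii" $ nf TE.decodeUtf8 longAscii
        , bench "short key"  $ nf TE.decodeUtf8 shortKey
        ]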

unsafeCoerce is not suitable here.

+4

Source: https://habr.com/ru/post/1205552/

