Do you know the fastest way to encode and decode UTF-8 when you have some extra information about the data? Here are some interesting cases that occur to me:
Serialization
I just want to encode an opaque buffer without validation, so that I can decode it again later. The fastest way would be to use the underlying memory buffer and somehow unsafely coerce it from Text to ByteString without touching the contents.
Probably ASCII
I assume that in 99% of cases my UTF-8 is actually ASCII, so it makes sense to do a first pass to confirm this, and only process further if that turns out not to be true.
Probably not ASCII
The converse of the previous case.
Probably short
A single key in JSON or a database, which I expect to be 1 to 20 characters long. It would be silly to pay an upfront cost such as setting up a vectorized SIMD approach.
Probably long
An HTML document. It is worth paying some upfront cost for maximum throughput.
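To make the "Probably ASCII" case concrete, here is a minimal sketch, assuming the `bytestring` and `text` packages. The name `decodeProbablyAscii` is hypothetical. It scans once for any byte with the high bit set, and only falls back to full UTF-8 validation when the input is not pure ASCII. Whether this actually beats a plain `decodeUtf8'` depends on the library version and the data, so it would need benchmarking.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1, decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)

-- Hypothetical helper: decode UTF-8 that is expected to be plain ASCII.
decodeProbablyAscii :: B.ByteString -> Either UnicodeException T.Text
decodeProbablyAscii bs
  -- Fast path: every byte is below 0x80, so the input is ASCII.
  -- ASCII is a subset of Latin-1, so decodeLatin1 cannot fail here.
  | B.all (< 0x80) bs = Right (decodeLatin1 bs)
  -- Slow path: full UTF-8 validation and decoding.
  | otherwise         = decodeUtf8' bs
```

The "Probably not ASCII" case would simply skip the first pass and call `decodeUtf8'` directly.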
There are a few more variants along the same lines, for example when encoding JSON or a URL and you expect that there are probably no characters that need escaping.
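The JSON variant could look roughly like the sketch below, assuming the `text` package. The names `needsEscape` and `escapeJsonString` are hypothetical. It checks once whether anything needs escaping and returns the input untouched if not. The slow path uses `\uXXXX` for all control characters, which is valid JSON but not the shortest possible encoding.

```haskell
import Data.Char (ord)
import qualified Data.Text as T
import Text.Printf (printf)

-- Characters that must be escaped inside a JSON string literal.
needsEscape :: Char -> Bool
needsEscape c = c == '"' || c == '\\' || c < '\x20'

-- Hypothetical helper: escape the body of a JSON string
-- (the surrounding quotes are not added here).
escapeJsonString :: T.Text -> T.Text
escapeJsonString t
  | T.any needsEscape t = T.concatMap escapeChar t  -- slow path: rebuild the text
  | otherwise           = t                         -- fast path: reuse the input as-is
  where
    escapeChar '"'  = T.pack "\\\""
    escapeChar '\\' = T.pack "\\\\"
    escapeChar c
      | c < '\x20' = T.pack (printf "\\u%04x" (ord c))
      | otherwise  = T.singleton c
```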
I ask this question under the [Haskell] tag, since Haskell's strong typing makes some techniques that would be easy in, say, C hard to implement. In addition, there may be some special GHC tricks, such as using SSE4 instructions on Intel platforms, that would be interesting. But this is more of a UTF-8 problem in general, and good ideas would be useful for any language.
Update
After some research, I suggest implementing encode and decode for serialization purposes as follows:
    myEncode :: Text -> ByteString
    myEncode = unsafeCoerce

    myDecode :: ByteString -> Text
    myDecode = unsafeCoerce
This is a great idea if you like segfaults, since Text and ByteString do not share the same in-memory representation ...
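For comparison, a safe baseline for the serialization case might look like the sketch below, assuming the `bytestring` and `text` packages. The names `safeEncode` and `safeDecode` are hypothetical. This is not the zero-copy trick the question is after: it still pays for a copy on encode and for validation on decode, but it cannot crash on bad input.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)
import Data.Text.Encoding.Error (UnicodeException)

-- Hypothetical helper: serialize Text as UTF-8 bytes.
safeEncode :: T.Text -> B.ByteString
safeEncode = encodeUtf8

-- Hypothetical helper: validate and decode; invalid input yields Left
-- instead of undefined behaviour.
safeDecode :: B.ByteString -> Either UnicodeException T.Text
safeDecode = decodeUtf8'
```

Round-tripping any `Text` through `safeDecode . safeEncode` should give back `Right` of the original value.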