Get `Char` from `ByteString`

Is there a way to get the first UTF-8 Char in a ByteString in O(1) time? I'm looking for something like

headUtf8 :: ByteString -> Char
tailUtf8 :: ByteString -> ByteString

I'm fine with using either strict or lazy ByteString, but I'd prefer strict. For lazy ByteString, I can cobble something together via Text, but I'm not sure how efficient this is (especially in terms of space complexity).

import Data.ByteString.Lazy (ByteString)
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (decodeUtf8With, encodeUtf8)
import Data.Text.Encoding.Error (lenientDecode)

headUtf8 :: ByteString -> Char
headUtf8 = T.head . decodeUtf8With lenientDecode

tailUtf8 :: ByteString -> ByteString
tailUtf8 = encodeUtf8 . T.tail . decodeUtf8With lenientDecode

In case anyone is interested, this problem came up while using Alex to write a lexer that supports UTF-8 characters [1].

[1] I know that with Alex 3.0 you only need to provide alexGetByte (and that's great!), but I still need to be able to get characters in other code in the lexer.


The Data.ByteString.UTF8 module from the utf8-string package provides an uncons function with this signature:

uncons :: ByteString -> Maybe (Char, ByteString)

You can then define:

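import Data.ByteString (ByteString)
import Data.ByteString.UTF8 (uncons)
import Data.Maybe (fromJust)

-- Both functions are partial: fromJust fails on an empty ByteString.
headUtf8 :: ByteString -> Char
headUtf8 = fst . fromJust . uncons

tailUtf8 :: ByteString -> ByteString
tailUtf8 = snd . fromJust . uncons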
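
Because fromJust fails on an empty ByteString, a total variant can be sketched by keeping the Maybe from uncons. The names headUtf8Safe and tailUtf8Safe are hypothetical, and the sketch assumes the utf8-string package is installed:

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.UTF8 as UTF8

-- | Total counterpart of headUtf8 (hypothetical name):
-- returns Nothing on an empty ByteString instead of failing.
headUtf8Safe :: ByteString -> Maybe Char
headUtf8Safe = fmap fst . UTF8.uncons

-- | Total counterpart of tailUtf8 (hypothetical name):
-- returns Nothing on an empty ByteString instead of failing.
tailUtf8Safe :: ByteString -> Maybe ByteString
tailUtf8Safe = fmap snd . UTF8.uncons
```

Code that needs both the character and the remainder can call uncons once and pattern match on the result, which avoids decoding the first character twice.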

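A UTF-8 encoding is at most 6 bytes long (4 bytes under RFC 3629), so if we try decoding the first 1, 2, ... bytes, we will succeed by the 6th step at the latest, which makes this O(1):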

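import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.Text as Text
import qualified Data.Text.Encoding as Text

-- Partial: assumes bs starts with a complete, valid UTF-8 character.
splitUtf8 :: ByteString -> (Char, ByteString)
splitUtf8 bs = go 1
  where
    -- Try ever longer prefixes until one decodes with no leftover bytes.
    go n | BS.null slack = (Text.head t, bs')
         | otherwise = go (n + 1)
      where
        (bs1, bs') = BS.splitAt n bs
        Text.Some t slack _ = Text.streamDecodeUtf8 bs1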

For example, splitting a ByteString that holds a 2-byte character followed by a 3-byte character:

*SO_40414452> splitUtf8 $ BS.pack [197, 145, 226, 138, 162]
('\337',"\226\138\162")

And a 3-byte character followed by a 2-byte character:

*SO_40414452> splitUtf8 $ BS.pack [226, 138, 162, 197, 145]
('\8866',"\197\145")
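
The length of a UTF-8 sequence is also encoded in its lead byte, so the split can be done without trial decoding. The following is a minimal sketch that depends only on the bytestring package; unconsUtf8 and leadInfo are hypothetical names, it handles sequences of up to 4 bytes (the RFC 3629 limit), and it does not reject overlong encodings or surrogate code points:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode the first UTF-8 character by inspecting the lead byte.
-- Returns Nothing on empty, truncated, or malformed input.
-- Sketch only: overlong encodings and surrogates are not rejected.
unconsUtf8 :: ByteString -> Maybe (Char, ByteString)
unconsUtf8 bs = do
  (b0, _) <- BS.uncons bs
  (len, initial) <- leadInfo b0
  let (hd, rest) = BS.splitAt len bs
      conts = BS.unpack (BS.drop 1 hd)   -- at most 3 continuation bytes
      cp = foldl step initial conts      -- accumulated code point
  if BS.length hd == len && all isCont conts && cp <= 0x10FFFF
    then Just (chr cp, rest)
    else Nothing
  where
    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont b = b .&. 0xC0 == 0x80
    -- Each continuation byte contributes its low 6 bits.
    step acc b = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)

-- | Sequence length and payload bits encoded in a lead byte.
leadInfo :: Word8 -> Maybe (Int, Int)
leadInfo b
  | b < 0x80           = Just (1, fromIntegral b)
  | b .&. 0xE0 == 0xC0 = Just (2, fromIntegral (b .&. 0x1F))
  | b .&. 0xF0 == 0xE0 = Just (3, fromIntegral (b .&. 0x0F))
  | b .&. 0xF8 == 0xF0 = Just (4, fromIntegral (b .&. 0x07))
  | otherwise          = Nothing
```

On a strict ByteString, splitAt, drop, and length are all O(1), and at most four bytes are ever inspected, so the whole operation is constant time. Unlike splitUtf8, this sketch returns Nothing rather than looping or throwing when the input is empty or cut off in the middle of a character.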

Source: https://habr.com/ru/post/1659783/

