Convert a Unicode Code Point to UTF-8 Bytes

Question

I went as far as looking through the C sources, but I cannot find such a function, and I would rather not write it myself, because it surely already exists.

To clarify: Unicode code points are written as U+########, and getting that number is easy. What I need is the format in which the character is written to a file (for example). UTF-8 translates the code point into bytes by packing its bits: a code point that fits in 7 bits occupies a single byte, while a larger one is split into 6-bit groups carried by a lead byte and one or more continuation bytes. Emacs, of course, knows how to do this, but I cannot find a way to get from it the UTF-8 encoding of a string as a sequence of 8-bit bytes.
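For illustration only, the bit packing can be written out by hand. This is a minimal sketch, not something Emacs ships: the name `my-utf8-encode-code-point` is hypothetical, and the function assumes its argument is a valid code point (surrogates are not rejected).

```elisp
(defun my-utf8-encode-code-point (cp)
  "Return the UTF-8 bytes of code point CP as a list of integers.
Hypothetical helper for illustration; surrogates are not rejected."
  (cond
   ;; Up to 7 bits: one byte, 0xxxxxxx.
   ((< cp #x80) (list cp))
   ;; Up to 11 bits: 110xxxxx 10xxxxxx.
   ((< cp #x800)
    (list (logior #xC0 (ash cp -6))
          (logior #x80 (logand cp #x3F))))
   ;; Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx.
   ((< cp #x10000)
    (list (logior #xE0 (ash cp -12))
          (logior #x80 (logand (ash cp -6) #x3F))
          (logior #x80 (logand cp #x3F))))
   ;; Up to U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
   ((< cp #x110000)
    (list (logior #xF0 (ash cp -18))
          (logior #x80 (logand (ash cp -12) #x3F))
          (logior #x80 (logand (ash cp -6) #x3F))
          (logior #x80 (logand cp #x3F))))
   (t (error "Not a Unicode code point: %S" cp))))

;; (my-utf8-encode-code-point #x125)  ; U+0125 ĥ => (196 165)
```

In practice there is little reason to maintain such a function by hand, since Emacs already has the encoder built in.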

Functions such as get-byte or multibyte-char-to-unibyte only work with characters that can be represented in at most 8 bits. I need the same thing as get-byte, but for multibyte characters, so that instead of an integer in 0..255 I would get either a vector of integers in 0..255 or one long integer in 0..2^32.

EDIT

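In case someone needs this later:

    (defun haxe-string-to-x-string (s)
      "Return S with every UTF-8 byte written as a \\xNN escape."
      (with-output-to-string
        (dotimes (i (length s))
          ;; `multibyte-char-to-unibyte' returns -1 for characters that
          ;; do not fit in a single byte.
          (if (> 0 (multibyte-char-to-unibyte (aref s i)))
              ;; Multibyte character: print each byte of its UTF-8 encoding.
              (let ((current (encode-coding-string
                              (char-to-string (aref s i)) 'utf-8)))
                (dotimes (j (length current))
                  (princ (format "\\x%02x" (aref current j)))))
            ;; Single-byte character: print it directly.
            (princ (format "\\x%02x" (aref s i)))))))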
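A more compact variant is possible if the whole string is encoded in one go. The sketch below uses a hypothetical name, `my-string-to-x-string`, and assumes the input contains only ordinary Unicode characters (no raw eight-bit bytes), in which case ASCII characters encode to themselves and no per-character branch is needed.

```elisp
(defun my-string-to-x-string (s)
  "Return S with each byte of its UTF-8 encoding written as \\xNN.
Hypothetical helper; assumes S holds no raw eight-bit characters."
  ;; `encode-coding-string' returns a unibyte string, so `mapconcat'
  ;; visits one byte (an integer 0..255) at a time.
  (mapconcat (lambda (byte) (format "\\x%02x" byte))
             (encode-coding-string s 'utf-8)
             ""))

;; (my-string-to-x-string "aĥ") => "\\x61\\xc4\\xa5"
```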

emacs unicode utf-8 elisp

user797257 Jun 18 '12 at 2:43

1 answer

legoscia · Accepted Answer · 2012-06-18T15:00:12+0000

encode-coding-string may be what you are looking for:

    *** Welcome to IELM ***  Type (describe-mode) for help.
    ELISP> (encode-coding-string "eĥoŝanĝo ĉiuĵaŭde" 'utf-8)
    "e\304\245o\305\235an\304\235o \304\211iu\304\265a\305\255de"

It returns a string, but you can access individual bytes with aref:

    ELISP> (aref (encode-coding-string "eĥoŝanĝo ĉiuĵaŭde" 'utf-8) 1)
    196
    ELISP> (format "%o" 196)
    "304"

Or, if you don't mind using cl functions, concatenate is your friend:

    ELISP> (concatenate 'list (encode-coding-string "eĥoŝanĝo ĉiuĵaŭde" 'utf-8))
    (101 196 165 111 197 157 97 110 196 157 111 32 196 137 105 117 196 181 97 197 173 100 101)
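The question also asked for a vector of integers or a single long integer. A minimal sketch of both follows, using only built-in functions; the names `my-utf8-bytes` and `my-utf8-bytes-as-integer` are hypothetical. In current Emacs the cl function above is spelled cl-concatenate (from cl-lib), while `append` with a final nil gives the same list without any library.

```elisp
(defun my-utf8-bytes (s)
  "Return the UTF-8 bytes of string S as a list of integers 0..255.
Hypothetical helper for illustration."
  ;; `append' with a trailing nil converts any sequence to a fresh list.
  (append (encode-coding-string s 'utf-8) nil))

(defun my-utf8-bytes-as-integer (char)
  "Pack the UTF-8 bytes of CHAR into one integer, first byte most significant.
Hypothetical helper for illustration."
  (let ((n 0))
    (dolist (byte (my-utf8-bytes (char-to-string char)) n)
      (setq n (logior (ash n 8) byte)))))

;; (my-utf8-bytes "ĥ")                            => (196 165)
;; (vconcat (encode-coding-string "ĥ" 'utf-8))    => [196 165]
;; (my-utf8-bytes-as-integer ?ĥ)                  => 50341, i.e. #xC4A5
```

Packing with the first byte most significant is a design choice made here so that the integer reads the same as the hex dump of the file; the question does not specify a byte order.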
