For this question, I wrote the following Lua code, which converts a Unicode code point to a UTF-8 encoded string. Is there a better way to do this (in Lua 5.1+)? “Better” in this case means “significantly more efficient or, preferably, far fewer lines of code.”
Note: I am not asking for a code review of this algorithm; I am asking for a better algorithm (or a built-in library).

    do
      -- Each entry: { max code point, {prefix, modulus} for each byte }
      local bytebits = {
        {0x7F,{0,128}},
        {0x7FF,{192,32},{128,64}},
        {0xFFFF,{224,16},{128,64},{128,64}},
        {0x1FFFFF,{240,8},{128,64},{128,64},{128,64}}
      }
      function utf8(decimal)
        local charbytes = {}
        for b,lim in ipairs(bytebits) do
          if decimal<=lim[1] then
            -- Fill the bytes from last to first
            for i=b,1,-1 do
              local prefix,max = lim[i+1][1],lim[i+1][2]
              local mod = decimal % max
              charbytes[i] = string.char( prefix + mod )
              decimal = ( decimal - mod ) / max
            end
            break
          end
        end
        return table.concat(charbytes)
      end
    end

    c=utf8(0x24)    print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
    c=utf8(0xA2)    print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
    c=utf8(0x20AC)  print(c.." is "..#c.." bytes.") --> € is 3 bytes.
    c=utf8(0xFFFF)  print(c.." is "..#c.." bytes.") --> is 3 bytes.
    c=utf8(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
    c=utf8(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.

I feel that there must be a way to get rid of the entire predefined bytebits table and the loop that finds the matching entry. Working from the last byte backward, I could repeatedly take %64 and add 128 to form continuation bytes until the value is below 128, but I can’t figure out how to gracefully generate the leading 11110 bits to add to the first byte.
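To make that idea concrete, here is a rough sketch of what such a table-free loop might look like. The name `utf8_sketch` is hypothetical, and two things in it are assumptions rather than part of my code above: the stopping threshold shrinks with each added byte (32, 16, 8) instead of staying at 128, and the lead-byte prefix for an n-byte sequence is computed as 256 - 2^(8-n), which gives 192, 224 and 240 for n = 2, 3 and 4. It does no range or surrogate validation.

```lua
-- Hypothetical sketch: encode a code point without a lookup table.
local function utf8_sketch(cp)
  if cp < 0x80 then return string.char(cp) end
  local tail = ""   -- continuation bytes, built from the back
  local n = 1       -- total byte count so far (lead byte only)
  local room = 64   -- halves each round: the lead byte of an n-byte
                    -- sequence can hold values below 32, 16, 8
  repeat
    local low = cp % 64
    tail = string.char(128 + low) .. tail
    cp = (cp - low) / 64
    n = n + 1
    room = room / 2
  until cp < room
  -- Assumed formula for the lead prefix (n one-bits, then a zero bit).
  return string.char(256 - 2^(8 - n) + cp) .. tail
end

print(#utf8_sketch(0x20AC)) -- length in bytes of the encoded euro sign
```

Whether this counts as "graceful" is debatable; it only trades the table for a bit of arithmetic.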
Edit: Here is a slightly reworked version with a speed optimization. However, it is not an acceptable answer, since the algorithm is still basically the same idea and about the same amount of code.

    do
      -- Each entry: { max code point, lead-byte prefix }
      local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
      function utf8(decimal)
        if decimal<128 then return string.char(decimal) end
        local charbytes = {}
        for bytes,vals in ipairs(bytemarkers) do
          if decimal<=vals[1] then
            -- Continuation bytes, from last to second
            for b=bytes+1,2,-1 do
              local mod = decimal%64
              decimal = (decimal-mod)/64
              charbytes[b] = string.char(128+mod)
            end
            charbytes[1] = string.char(vals[2]+decimal)
            break
          end
        end
        return table.concat(charbytes)
      end
    end
