A more elegant way to convert a code point to UTF-8

For this question, I created the following Lua code that converts a Unicode code point to a UTF-8 character string. Is there a better way to do this (in Lua 5.1+)? “Better” in this case means “significantly more efficient, or, preferably, much fewer lines of code.”

Note: I am not asking for a code review of this algorithm; I am asking for a better algorithm (or a built-in library).

    do
      local bytebits = {
        {0x7F,    {0,128}},
        {0x7FF,   {192,32},{128,64}},
        {0xFFFF,  {224,16},{128,64},{128,64}},
        {0x1FFFFF,{240,8}, {128,64},{128,64},{128,64}}
      }
      function utf8(decimal)
        local charbytes = {}
        for b,lim in ipairs(bytebits) do
          if decimal<=lim[1] then
            for i=b,1,-1 do
              local prefix,max = lim[i+1][1],lim[i+1][2]
              local mod = decimal % max
              charbytes[i] = string.char( prefix + mod )
              decimal = ( decimal - mod ) / max
            end
            break
          end
        end
        return table.concat(charbytes)
      end
    end

    c=utf8(0x24)    print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
    c=utf8(0xA2)    print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
    c=utf8(0x20AC)  print(c.." is "..#c.." bytes.") --> € is 3 bytes.
    c=utf8(0xFFFF)  print(c.." is "..#c.." bytes.") --> is 3 bytes.
    c=utf8(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
    c=utf8(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.

I feel that there must be a way to get rid of the entire predefined bytebits table and the loop that finds the matching entry. Looping from the back, I could repeatedly take % 64 and add 128 to form continuation bytes until the value drops below 128, but I can't figure out how to gracefully generate the leading-byte prefix (110, 1110, 11110) to add.


Edit: here is a version with slightly improved speed. It is still not an acceptable answer, though, since the algorithm is basically the same idea in about the same amount of code.

    do
      local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
      function utf8(decimal)
        if decimal<128 then return string.char(decimal) end
        local charbytes = {}
        for bytes,vals in ipairs(bytemarkers) do
          if decimal<=vals[1] then
            for b=bytes+1,2,-1 do
              local mod = decimal%64
              decimal = (decimal-mod)/64
              charbytes[b] = string.char(128+mod)
            end
            charbytes[1] = string.char(vals[2]+decimal)
            break
          end
        end
        return table.concat(charbytes)
      end
    end
+5
2 answers

When it comes to speed, it is very important to benchmark against a real-world usage pattern. But here we are working in a vacuum, so let's proceed anyway.

This algorithm is probably what you are looking for when you say you want to get rid of the bytebits table:

    do
      local string_char = string.char
      function utf8(cp)
        if cp < 128 then
          return string_char(cp)
        end
        local s = ""
        local prefix_max = 32
        while true do
          local suffix = cp % 64
          s = string_char(128 + suffix)..s
          cp = (cp - suffix) / 64
          if cp < prefix_max then
            return string_char((256 - (2 * prefix_max)) + cp)..s
          end
          prefix_max = prefix_max / 2
        end
      end
    end

It also includes a few other optimizations that are not particularly interesting; for me it is about 2 times faster than your optimized code. (As a bonus, it works up to U+7FFFFFFF.)
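The identity behind `(256 - (2 * prefix_max)) + cp` is that the leading byte of an n-byte UTF-8 sequence starts at 256 - 2^(8-n), which is what lets the loop drop the table entirely. A standalone sketch of just that identity (the function name here is mine, not part of the answer's code):

```lua
-- The leading-byte prefix of an n-byte UTF-8 sequence is 256 - 2^(8-n):
-- n=2 -> 192 (110xxxxx), n=3 -> 224 (1110xxxx), n=4 -> 240 (11110xxx).
function utf8_leading_prefix(nbytes)
  return 256 - 2 ^ (8 - nbytes)
end

assert(utf8_leading_prefix(2) == 192)
assert(utf8_leading_prefix(3) == 224)
assert(utf8_leading_prefix(4) == 240)
```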

If we want to micro-optimize further, the loop can be unrolled:

    do
      local string_char = string.char
      function utf8_unrolled(cp)
        if cp < 128 then return string_char(cp) end
        local suffix = cp % 64
        local c4 = 128 + suffix
        cp = (cp - suffix) / 64
        if cp < 32 then return string_char(192 + cp, c4) end
        suffix = cp % 64
        local c3 = 128 + suffix
        cp = (cp - suffix) / 64
        if cp < 16 then return string_char(224 + cp, c3, c4) end
        suffix = cp % 64
        cp = (cp - suffix) / 64
        return string_char(240 + cp, 128 + suffix, c3, c4)
      end
    end

This is about 5 times faster than your optimized code, but completely inelegant. I think the main gains come from not keeping intermediate results in a table and making fewer function calls.
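The 2x and 5x figures above can be reproduced with a minimal os.clock-based harness along these lines; `bench` is a hypothetical helper of mine, and absolute numbers will of course depend on the Lua version and hardware:

```lua
-- Minimal micro-benchmark sketch. Pass it any of the utf8
-- implementations above to compare them against each other.
local os_clock = os.clock

function bench(name, fn, iterations)
  local start = os_clock()
  for _ = 1, iterations do
    fn(0x24B62)  -- a 4-byte code point: the worst case for all variants
  end
  local elapsed = os_clock() - start
  print(string.format("%-10s %.3fs", name, elapsed))
  return elapsed
end

-- e.g. bench("looped", utf8, 1e6); bench("unrolled", utf8_unrolled, 1e6)
```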

However, the fastest approach (as far as I can find) is to do no calculation at all:

    do
      local lookup = {}
      for i=0,0x1FFFFF do
        lookup[i] = calculate_utf8(i) -- any of the implementations above
      end
      function utf8(cp)
        return lookup[cp]
      end
    end

This is about 30 times faster than your optimized code, which may qualify as "significantly more efficient" (although the memory usage is ridiculous). It is not very interesting, though. (A good trade-off in some cases would be memoization.)
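The memoization trade-off mentioned above can be sketched with a lazy lookup table: entries are computed once, on first access, instead of precomputing all 0x200000 of them up front. The `memoize` helper below is my own illustration, not code from the answer:

```lua
-- Wrap any of the utf8 implementations in a lazily filled cache table.
-- Indexing the table computes and stores a missing entry on first use.
function memoize(fn)
  return setmetatable({}, {
    __index = function(cache, cp)
      local s = fn(cp)
      cache[cp] = s  -- plain assignment: __index only fires on misses
      return s
    end
  })
end

-- usage: local lookup = memoize(utf8); lookup[0x20AC] is computed once
```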

Of course, any pure-C implementation will almost certainly be faster than any calculation done in Lua.

+3

Lua 5.3 provides a basic UTF-8 library, whose utf8.char function is exactly what you are looking for:

Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence, and returns a string that is the concatenation of all these sequences.

    c = utf8.char(0x24)    print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
    c = utf8.char(0xA2)    print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
    c = utf8.char(0x20AC)  print(c.." is "..#c.." bytes.") --> € is 3 bytes.
    c = utf8.char(0xFFFF)  print(c.." is "..#c.." bytes.") --> is 3 bytes.
    c = utf8.char(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
    c = utf8.char(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
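For completeness, the Lua 5.3 library also provides the inverse, utf8.codepoint, so conversions round-trip (this snippet requires Lua 5.3+):

```lua
-- Round-trip sketch: utf8.codepoint is the inverse of utf8.char.
local s = utf8.char(0x20AC)          -- the euro sign
assert(utf8.codepoint(s) == 0x20AC)  -- 8364
print(s.." round-trips to code point "..utf8.codepoint(s))
```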
+3

Source: https://habr.com/ru/post/1203522/

