Methods and user-defined functions for obtaining the length of a UTF-8 string in Lua

Time:2022-4-27
The code is as follows:

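-- Returns the number of characters in a UTF-8 encoded string.
-- @param str string UTF-8 encoded input
-- @return number character count
function utfstrlen(str)
    local len = #str
    local left = len    -- bytes still to be scanned
    local cnt = 0
    -- arr[i] is the smallest lead byte of an i-byte sequence
    local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
    -- "> 0" rather than "~= 0" so a truncated final sequence cannot loop forever
    while left > 0 do
        -- a negative index counts from the end, so -left is the next unread byte
        local tmp = string.byte(str, -left)
        local i = #arr
        while arr[i] do
            if tmp >= arr[i] then
                left = left - i    -- skip the whole i-byte sequence
                break
            end
            i = i - 1
        end
        cnt = cnt + 1
    end
    return cnt
end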


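Lua’s string library works on bytes, not characters, so it has no direct support for UTF-8 encoded text such as Chinese characters. The # operator and string.len return the number of bytes, and string.sub can cut a multi-byte character in half.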
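
To make the byte-versus-character difference concrete, here is a minimal sketch. The sample string is an assumption chosen for illustration: the two characters "中文", written with decimal byte escapes so the result does not depend on how the source file is saved.

```lua
-- "中文" as raw UTF-8 bytes: each character takes three bytes.
local s = "\228\184\173\230\150\135"

-- Both of these report bytes, not characters.
local byte_len = #s
local lib_len = string.len(s)

-- The first character is the first THREE bytes; taking fewer would split it.
local first_char = string.sub(s, 1, 3)

print(byte_len, lib_len)   -- 6 6, although the string holds two characters
```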

 

UTF-8 encoding rules:

1. The first byte of a character is in the range 0x00-0x7F (0-127) or 0xC2-0xF4 (194-244). UTF-8 is compatible with ASCII, so the values 0-127 are identical to ASCII.
2. The bytes 0xC0, 0xC1 and 0xF5-0xFF (192, 193 and 245-255) never appear in UTF-8.
3. The bytes 0x80-0xBF (128-191) appear only as the second and subsequent bytes of a multi-byte sequence (for example, a Chinese character).
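
The three rules above can be sketched as a small classifier. The function name utf8_seq_len is hypothetical and not part of the original article, and the split of 0xC2-0xF4 into 2-, 3- and 4-byte lead bytes follows the standard UTF-8 layout rather than anything stated in the text.

```lua
-- Hypothetical helper: given one byte value, return the length of the
-- sequence it starts, or nil if the byte cannot start a character.
local function utf8_seq_len(b)
    if b <= 0x7F then return 1 end                 -- rule 1: ASCII
    if b >= 0xC2 and b <= 0xDF then return 2 end   -- rule 1: 2-byte lead
    if b >= 0xE0 and b <= 0xEF then return 3 end   -- rule 1: 3-byte lead
    if b >= 0xF0 and b <= 0xF4 then return 4 end   -- rule 1: 4-byte lead
    -- rule 2 (0xC0, 0xC1, 0xF5-0xFF) and rule 3 (0x80-0xBF continuation bytes)
    return nil
end

print(utf8_seq_len(0x41), utf8_seq_len(0xE4), utf8_seq_len(0x80))
```

Unlike the lookup table in utfstrlen, which accepts any byte, this sketch rejects bytes that cannot begin a character.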
 
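Given these rules, Lua’s pattern matching is enough to get the desired result. Two idioms do the work:
1. local _, count = string.gsub(str, "[^\128-\193]", "") counts the characters in str: it removes every byte outside 128-193, that is, every byte that starts a character, and the second return value is the number of bytes removed.
2. for uchar in string.gfind(str, "[%z\1-\127\194-\244][\128-\191]*") do tab[#tab + 1] = uchar end stores each character of str in tab, one character per entry. string.gfind was renamed string.gmatch in Lua 5.1, so use string.gmatch on Lua 5.1 and later.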
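
Both idioms can be tried together in a short sketch, assuming Lua 5.1 or later (hence string.gmatch). The sample string is an assumption, and the %z class from the original pattern is left out because it is deprecated from Lua 5.2 on, so this version does not match NUL bytes.

```lua
-- Sample input: "a" followed by "中文", written as raw UTF-8 bytes.
local str = "a\228\184\173\230\150\135"

-- Idiom 1: count characters by counting the bytes that start one.
local _, count = string.gsub(str, "[^\128-\193]", "")

-- Idiom 2: split the string into a table of whole characters.
local tab = {}
for uchar in string.gmatch(str, "[\1-\127\194-\244][\128-\191]*") do
    tab[#tab + 1] = uchar
end

print(count, #tab)   -- 3 3
```

The counting idiom does not validate its input: a stray byte in the range 194-255 is still counted as the start of a character.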