A simple introduction to multibyte string operations in PHP

Time:2021-9-18

What is a multibyte string operation? Many of you have probably used these functions already, but let's still start from the most basic question.

How many bytes a character occupies is not something we can see on the surface. Normally a digit, an English letter, or an English punctuation mark occupies one byte. But the world has many languages and scripts, and characters such as Chinese and Japanese cannot fit into a single byte, so multiple bytes are needed (generally, the first byte is a lead byte that marks the start of a multibyte sequence, and the bytes that follow complete the encoding of the character). For example, a Chinese character occupies two bytes in GBK and three bytes in UTF-8. In recent years, with the rise of emoji, utf8mb4 (MySQL's name for full four-byte UTF-8) has become mainstream, because an emoji character usually needs four bytes in UTF-8.

Although multibyte encodings let us display rich content, they also make some string operations troublesome.

String operation

$str = "abc测试一下";
echo strlen($str), PHP_EOL; // 15

Everyone is familiar with strlen(), but for Chinese text the number it returns is obviously not the character count. Our default encoding is UTF-8, in which each Chinese character takes three bytes, so the three ASCII letters plus four Chinese characters are counted as 15 bytes. This is clearly not the result we want. If we need to cut the string, working out lengths this way is laborious, and cutting at the wrong byte easily produces garbled text.
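To make the garbling risk concrete, here is a minimal sketch that cuts a string at the same position with substr() and with mb_substr(). The sample string is an assumption chosen for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Assumed sample string: 3 ASCII letters + 4 Chinese characters (UTF-8 source file).
$str = "abc测试一下";

// substr() counts bytes: 4 bytes = "abc" plus only the first byte of "测",
// which leaves an incomplete (invalid) UTF-8 sequence at the end.
$byteCut = substr($str, 0, 4);

// mb_substr() counts characters: 4 characters = "abc测".
$charCut = mb_substr($str, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
var_dump($charCut);                             // string(6) "abc测"
```

The byte-based cut is what shows up as a broken character when the result is printed or stored.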

Fortunately, PHP bundles the mbstring extension, whose mb_ functions are designed specifically for this kind of multibyte string.

echo mb_strlen($str), PHP_EOL; // 7
echo mb_strlen($str, 'GB2312'), PHP_EOL; // 11

When the second parameter of mb_strlen() is not specified, the string is interpreted in the internal encoding (UTF-8 by default), so the length comes out right in a UTF-8 environment. We can also pass another encoding as the second parameter, such as GB2312 or GBK, which were commonly used in the past. In that case the bytes are read under that encoding's rules, where a Chinese character occupies two bytes, so for our UTF-8 string the returned length (11) no longer matches the real character count.
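The encoding parameter gives a meaningful count when it matches the actual encoding of the bytes. As a small sketch (the sample string and variable names are assumptions for illustration), the same text can be counted correctly in both UTF-8 and GBK:

```php
<?php
// Assumed sample string (UTF-8 source file).
$utf8 = "abc测试一下";

// Re-encode to GBK: each Chinese character now takes 2 bytes instead of 3.
$gbk = mb_convert_encoding($utf8, 'GBK', 'UTF-8');

echo strlen($utf8), PHP_EOL;             // 15 bytes in UTF-8
echo strlen($gbk), PHP_EOL;              // 11 bytes in GBK
echo mb_strlen($utf8, 'UTF-8'), PHP_EOL; // 7 characters
echo mb_strlen($gbk, 'GBK'), PHP_EOL;    // 7 characters
```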

var_dump(mb_strpos($str, "测试")); // int(3)

var_dump(mb_convert_case($str, MB_CASE_UPPER)); // string(15) "ABC测试一下"
var_dump(mb_convert_case($str, MB_CASE_LOWER)); // string(15) "abc测试一下"

var_dump(mb_substr($str, 5)); // string(6) "一下"

The mb_ string functions are fairly comprehensive: finding the position of a substring, case conversion, extracting substrings and so on are all provided. Their parameters are the same as those of the ordinary string functions, plus an optional parameter that specifies the encoding. Normally, as long as the string is in the internal encoding, this parameter can be omitted.

There are many more string functions than the ones shown here; you can look them up in the documentation.
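One practical consequence is that byte offsets and character offsets should not be mixed. The following sketch (sample string and variable names are assumptions) shows that strpos() and mb_strpos() report different positions for the same substring, and that each position only works with its own family of functions:

```php
<?php
// Assumed sample string (UTF-8 source file).
$str    = "abc测试一下";
$needle = "一下";

$bytePos = strpos($str, $needle);                // 9: byte offset (3 + 2 * 3)
$charPos = mb_strpos($str, $needle, 0, 'UTF-8'); // 5: character offset

// Byte offsets go with substr(), character offsets go with mb_substr().
var_dump(substr($str, $bytePos));                   // string(6) "一下"
var_dump(mb_substr($str, $charPos, null, 'UTF-8')); // string(6) "一下"
```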

Regular expression operations on strings

Now that we've covered string operations, regular expression functions are essential too. Let's first look at the problems the default preg_ functions have with Chinese text.

$str = iconv('UTF-8', 'GB2312', $str);

var_dump(preg_match("/[a-z]*测试/i", $str)); // int(0)
var_dump(preg_replace("/[a-z]*测试/i", "试试", $str)); // string(11) "abc����һ��"

First, we convert the test string to GB2312. This simulates, for example, an external API that returns GB2312-encoded data. In this case, using the preg_ functions directly does not give the result we want, because the pattern written in our UTF-8 file no longer matches the GB2312 bytes.

mb_regex_encoding('GB2312');
$pattern = iconv('UTF-8', 'GB2312', "[a-z]*测试");
var_dump(mb_ereg($pattern, $str)); // int(1)
var_dump(mb_eregi($pattern, $str)); // int(1)

var_dump(mb_ereg_replace($pattern, "试试", $str)); // string(10) "试试һ��"
var_dump(mb_eregi_replace($pattern, "试试", $str)); // string(10) "试试һ��"

Next, we use the mb_ereg family of functions for matching and replacement, which can operate correctly on strings in different encodings. Note that we need to call mb_regex_encoding() to declare that the regex functions should work in GB2312, and the pattern itself must also be converted to that encoding.

The mb_eregi functions are essentially the same as the mb_ereg ones, except that they are case-insensitive, just like the i modifier at the end of a pattern in the preg functions. The ereg-style functions do not need delimiter slashes around the pattern. The ordinary (non-mb) ereg functions have actually been retired: they were slower than preg, used a different syntax, and were removed in PHP 7. In most cases the preg functions are used directly. However, for multibyte problems handled inside the mbstring library, only the ereg-style functions are available.
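As a minimal sketch of the same idea with a captured match: the sample string is an assumption, and mb_convert_encoding() is used here in place of iconv() purely to keep the example within mbstring. The matched bytes come back in GB2312, so they are converted to UTF-8 before display.

```php
<?php
// Assumed sample data: a UTF-8 literal converted to GB2312 to mimic external input.
$utf8 = "abc测试一下";
$gb   = mb_convert_encoding($utf8, 'GB2312', 'UTF-8');

// Remember the current regex encoding, then switch to GB2312.
$previous = mb_regex_encoding();
mb_regex_encoding('GB2312');

// The pattern must be in the same encoding as the subject string.
$pattern = mb_convert_encoding("[a-z]*测试", 'GB2312', 'UTF-8');

$matched = null;
if (mb_ereg($pattern, $gb, $regs)) {
    // $regs[0] holds the matched GB2312 bytes; convert them back for display.
    $matched = mb_convert_encoding($regs[0], 'UTF-8', 'GB2312');
}
var_dump($matched); // string(9) "abc测试"

// Restore the previous regex encoding so later code is not affected.
mb_regex_encoding($previous);
```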

String encoding conversion

Like the iconv() function we learned before, the mbstring library also provides functions for character encoding conversion.

$phone = file_get_contents('https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=13888888888');

print_r($phone);
// __GetZoneResult_ = {
//     mts:'1388888',
//     province:'����',
//     catName:'�й��ƶ�',
//     telString:'13888888888',
// 	areaVid:'30515',
// 	ispVid:'3236139',
// 	carrier:'�����ƶ�'
// }

var_dump(mb_convert_encoding($phone, 'UTF-8', "GBK"));
// string(183) "__GetZoneResult_ = {
//     mts:'1388888',
//     province:'云南',
//     catName:'中国移动',
//     telString:'13888888888',
// 	areaVid:'30515',
// 	ispVid:'3236139',
// 	carrier:'云南移动'
// }
// "

echo mb_detect_encoding($phone, 'UTF-8,GBK'), PHP_EOL; // CP936

Here we again test against the public API for looking up mobile phone number information. The content it returns is GBK-encoded, and we can use mb_convert_encoding() to convert it. mb_detect_encoding() detects the encoding of a string. Here we pass it a list of two candidate encodings, and it returns the one that fits. CP936 is another name for GBK (the name comes from code page 936, the code page number under which GBK was registered).
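The detect-then-convert flow can be tried without calling the remote API. In this sketch the "response" is built locally, and the variable names and the GBK fallback are assumptions for illustration:

```php
<?php
// Assumed stand-in for a GBK response body, built locally instead of fetched.
$utf8 = "carrier:'云南移动'";
$gbk  = mb_convert_encoding($utf8, 'GBK', 'UTF-8');

// Ask mbstring which candidate encoding fits the bytes.
// strict = true rejects candidates for which the bytes are not valid.
$detected = mb_detect_encoding($gbk, ['UTF-8', 'GBK'], true);

// Fall back to a default when detection fails (the GBK fallback is an assumption).
$source = $detected !== false ? $detected : 'GBK';
$back   = mb_convert_encoding($gbk, 'UTF-8', $source);

var_dump($back === $utf8); // bool(true)
```

Detection is a best guess based on the bytes, so when the source encoding is known in advance it is usually simpler to pass it to mb_convert_encoding() directly.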

HTTP parameter operation

mb_internal_encoding("UTF-8");

First, let's introduce mb_internal_encoding(), which sets the default encoding that the mb_ functions use in the current runtime. If it is not set, the default comes from the PHP configuration (the default_charset setting, which is UTF-8 out of the box). It is worth knowing about, because it affects what we cover next.
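A small sketch of that effect, with an assumed sample string: the same mb_strlen() call without an encoding argument reads its input according to whatever mb_internal_encoding() currently says.

```php
<?php
// Assumed sample string (UTF-8 source file) and its GBK-encoded copy.
$utf8 = "abc测试一下";
$gbk  = mb_convert_encoding($utf8, 'GBK', 'UTF-8');

$previous = mb_internal_encoding(); // remember the current setting

mb_internal_encoding('UTF-8');
$lenUtf8 = mb_strlen($utf8);        // no encoding argument: bytes are read as UTF-8

mb_internal_encoding('CP936');
$lenGbk = mb_strlen($gbk);          // same call, bytes are now read as CP936 (GBK)

mb_internal_encoding($previous);    // restore the original setting

var_dump($lenUtf8, $lenGbk);        // int(7) int(7)
```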

// localhost:9991/?a=I'm on
var_dump(mb_http_input('GPC')); // bool(false)
var_dump(mb_http_output()); // string(5) "UTF-8"

mb_internal_encoding("CP936");
mb_parse_str($_SERVER['QUERY_STRING'], $result);
print_r($result);
// Array
// (
//     [a] => I'm on
// )

First, we run the test file on a web server and request the URL above in a browser. mb_http_input() detects the character encoding of HTTP input, but in my tests it always returned false. One likely reason is that mbstring only reports an input encoding when it converts HTTP input itself, and encoding_translation is Off in this environment (see the mb_get_info() output below). If you know more, feel free to leave a comment. mb_http_output() gets or sets the character encoding of HTTP output, and it is affected by what mb_internal_encoding() defines.

In addition, mb_parse_str() is the multibyte version of parse_str(). Because we set mb_internal_encoding() to CP936 here, the request has to be sent in GBK (for example by switching the browser's encoding to GBK) for the value to be parsed correctly. If the browser sends the request in UTF-8, which is the default, the result will be wrong. This is the effect mb_internal_encoding() has on these functions.
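For comparison, here is a sketch of the matching case, where the query string and the internal encoding are both UTF-8. The query string is hypothetical (the text "测试" percent-encoded as UTF-8), and a default mbstring configuration is assumed.

```php
<?php
// Internal encoding matches the encoding of the incoming data.
mb_internal_encoding('UTF-8');

// Hypothetical query string: "测试" percent-encoded as UTF-8, plus a plain value.
$query = 'a=%E6%B5%8B%E8%AF%95&b=1';

// Decode the query string into an array of parameters.
mb_parse_str($query, $result);
print_r($result);
```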

View other properties

Finally, let's look at a few mb_ functions that report configuration information.

var_dump(mb_language());
// string(7) "neutral"

mb_language() gets or sets the current language; pass it an argument to set the language. It mainly affects how mail is encoded: mb_send_mail() uses it to encode messages. You can try mb_send_mail() yourself; it is the multibyte version of the mail() function. "neutral" means language-neutral: with it, mail is encoded as UTF-8 (see mail_charset in the mb_get_info() output below), in line with our mb_internal_encoding().

var_dump(mb_list_encodings());
// array(86) {
//     [0]=>
//     string(4) "pass"
//     [1]=>
//     string(5) "wchar"
//     [2]=>
//     string(7) "byte2be"
//     [3]=>
//     ……
//     [65]=>
//     string(5) "CP936"
//     ……

mb_list_encodings() lists all the encodings supported on the current system. In this list we can see CP936 but no GBK. Just remember that they are the same thing.
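The relationship between the two names can be checked from code. This is a small sketch with an assumed sample string; mb_encoding_aliases() prints the alias names registered for CP936, and the last check confirms that both names produce identical bytes.

```php
<?php
// Whether the canonical name CP936 is in the list of supported encodings.
$hasCp936 = in_array('CP936', mb_list_encodings(), true);
var_dump($hasCp936);

// Alias names registered for CP936.
print_r(mb_encoding_aliases('CP936'));

// Both names select the same converter, so the resulting bytes are identical.
$sample   = "测试"; // assumed sample text (UTF-8 source file)
$sameByte = mb_convert_encoding($sample, 'GBK', 'UTF-8')
    === mb_convert_encoding($sample, 'CP936', 'UTF-8');
var_dump($sameByte); // bool(true)
```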

var_dump(mb_get_info());
// array(14) {
//     ["internal_encoding"]=>
//     string(5) "UTF-8"
//     ["http_output"]=>
//     string(5) "UTF-8"
//     ["http_output_conv_mimetypes"]=>
//     string(31) "^(text/|application/xhtml\+xml)"
//     ["func_overload"]=>
//     int(0)
//     ["func_overload_list"]=>
//     string(11) "no overload"
//     ["mail_charset"]=>
//     string(5) "UTF-8"
//     ["mail_header_encoding"]=>
//     string(6) "BASE64"
//     ["mail_body_encoding"]=>
//     string(6) "BASE64"
//     ["illegal_chars"]=>
//     int(0)
//     ["encoding_translation"]=>
//     string(3) "Off"
//     ["language"]=>
//     string(7) "neutral"
//     ["detect_order"]=>
//     array(2) {
//       [0]=>
//       string(5) "ASCII"
//       [1]=>
//       string(5) "UTF-8"
//     }
//     ["substitute_character"]=>
//     int(63)
//     ["strict_detection"]=>
//     string(3) "Off"
//   }

mb_get_info() shows the current mbstring configuration in this environment; settings such as internal_encoding and http_output can all be seen here.
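When only one setting is needed, mb_get_info() also accepts the name of a key. A minimal sketch:

```php
<?php
mb_internal_encoding('UTF-8');

// Passing a key returns just that setting instead of the whole array.
$internal = mb_get_info('internal_encoding');
var_dump($internal); // string(5) "UTF-8"

// The full array holds the same value under that key.
$info = mb_get_info();
var_dump($info['internal_encoding'] === mb_internal_encoding()); // bool(true)
```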

Summary

For those who have used these functions before, did you spot something new in today's article? Yes, the fact that GBK and CP936 are the same thing was the surprise here; I really had not noticed it before. The mb_ functions are very widely used and are basically required knowledge for learning PHP. Many functions were not covered here; if you are interested, consult the official manual for a deeper look.

Test code:

https://github.com/zhangyue0503/dev-blog/blob/master/php/202011/source/10. Simple introduction to multi byte string operation in PHP. PHP

Reference documents:

https://www.php.net/manual/zh/book.mbstring.php
