Learn the interesting character set internationalization verification function in PHP

Time:2021-9-19

Today’s topic is simple but rather fun. You have probably run into fonts in which 0 and O are hard to tell apart, or 1 and l (lowercase L) look almost identical. Most editors and IDEs now default to fonts that distinguish such look-alike characters, for example by putting a slash or a dot inside the 0. PHP also provides a class that helps us check strings for such easily confused characters.

Similar character detection

$checker = new Spoofchecker();

var_dump($checker->areConfusable('google.com', 'goog1e.com')); // true

var_dump($checker->areConfusable('google.com', 'g00g1e.com')); // false

The Spoofchecker class (part of the intl extension) is used for this kind of detection. Its areConfusable() method tells us whether two strings could be visually mistaken for each other. In the first test, l and 1 are easy to confuse if we don’t look carefully, so the result is true. The second test returns false, meaning the two strings are not considered confusable. However, if we replace the lowercase o in the first string with an uppercase O, this check will also return true. You can test it yourself.
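The uppercase-O experiment suggested above can be sketched roughly as follows. This is only a minimal sketch: it assumes the intl extension is loaded, the candidate strings are made up for illustration, and no expected output is given because the result can vary with the ICU version PHP was built against.

```php
<?php
// Requires the intl extension, which provides Spoofchecker.
$checker = new Spoofchecker();

// Illustrative inputs (not from the original test code):
// compare the digit-substituted string against a lowercase and an uppercase-O variant.
$spoof = 'g00g1e.com';
$candidates = ['google.com', 'gOOgle.com'];

$results = [];
foreach ($candidates as $candidate) {
    // areConfusable() returns true when the two strings are considered visually confusable.
    $results[$candidate] = $checker->areConfusable($candidate, $spoof);
    printf("%s vs %s => %s\n", $candidate, $spoof, var_export($results[$candidate], true));
}
```

Running it on your own machine shows directly whether the digit 0 is treated as confusable with the lowercase o, the uppercase O, or both.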

Suspicious character detection

In addition, we can use another method of the Spoofchecker class, isSuspicious(), to detect suspicious characters within a single string.

var_dump($checker->isSuspicious('google.com')); // false

var_dump($checker->isSuspicious('Рaypal.com')); // true (the first letter is Cyrillic, not Latin)

Why does 'Рaypal.com' return true? What’s suspicious about it?

In fact, isSuspicious() checks whether the string mixes characters from different Unicode scripts. Here the uppercase Р comes from the Cyrillic script rather than being the Latin letter P, even though the two look the same. As Chinese speakers, most of us are not very familiar with this; friends who specialize in foreign languages or have studied the origins of these alphabets will know more.
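To see the script mixing for yourself, a small helper can report which script each letter belongs to. The sketch below is not how Spoofchecker works internally; it is only an illustration built on PCRE’s Unicode script properties. The function name letterScripts is hypothetical, and checking only Latin and Cyrillic is an assumption made to keep the demo short.

```php
<?php
// Report which Unicode script each letter of a string belongs to.
// Only Latin and Cyrillic are checked; that short list is an assumption for this demo.
function letterScripts(string $text): array
{
    $scripts = [];
    // Split the UTF-8 string into individual code points.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (preg_match('/\p{Latin}/u', $char)) {
            $scripts[] = [$char, 'Latin'];
        } elseif (preg_match('/\p{Cyrillic}/u', $char)) {
            $scripts[] = [$char, 'Cyrillic'];
        }
        // Dots, digits and other non-letter characters are skipped.
    }
    return $scripts;
}

// The first letter is written as an escape so the difference is visible in the source:
// "\u{0420}" is CYRILLIC CAPITAL LETTER ER, while "P" is LATIN CAPITAL LETTER P.
foreach (["\u{0420}aypal.com", "Paypal.com"] as $domain) {
    $names = array_unique(array_column(letterScripts($domain), 1));
    echo $domain, ' => ', implode(', ', $names), "\n";
}
```

The first domain reports both Cyrillic and Latin, while the second reports only Latin, which is the kind of mixture that makes a string look suspicious.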

Effects in different regional languages

Since this is an internationalization class, will the detection results be different if we change the locale?

$checker->setAllowedLocales('zh_CN');

var_dump($checker->areConfusable('google.com', 'goog1e.com')); // true

var_dump($checker->areConfusable('google.com', 'g00g1e.com')); // false

var_dump($checker->isSuspicious('google.com')); // true

var_dump($checker->isSuspicious('Рaypal.com')); // true

The setAllowedLocales() method of Spoofchecker sets the locales that the checker works with. After setting it to Chinese, isSuspicious() returns true even for 'google.com'. After all, the allowed character set is now different: Latin letters are no longer accepted by default.
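To compare locales side by side, something like the following sketch could be used. It assumes the intl extension is loaded; the choice of en_US as the second locale is only an example, and no expected output is listed because results can depend on the ICU version.

```php
<?php
// Requires the intl extension. The locale list is only an example.
$results = [];
foreach (['en_US', 'zh_CN'] as $locale) {
    $checker = new Spoofchecker();
    // Restrict the allowed characters to those associated with this locale.
    $checker->setAllowedLocales($locale);

    // "\u{0420}" is the Cyrillic capital letter that looks like a Latin P.
    foreach (['google.com', "\u{0420}aypal.com"] as $domain) {
        $results[$locale][$domain] = $checker->isSuspicious($domain);
        printf("[%s] %s => %s\n", $locale, $domain, var_export($results[$locale][$domain], true));
    }
}
```

A fresh Spoofchecker is created for each locale so that the setting from one round does not carry over into the next.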

Summary

Well, this article is really just for fun. In real business code, if we need to build verification features for articles or codes, areConfusable() may offer some convenience. Treat it as something worth knowing about and experimenting with.

Test code:

https://github.com/zhangyue0503/dev-blog/blob/master/php/202011/source/9. Learn the interesting character set internationalization verification function in PHP. PHP

Reference documents:

https://www.php.net/manual/zh/class.spoofchecker.php

Official account: hard core project manager
