A disaster caused by one regular expression: online CPU stuck at 100%!

Time: 2020-1-15

Author: Chen Shuyi. This article comes from Chen Shuyi's column in the Tencent Cloud+ Community.

A few days ago, the monitoring for one of our online projects suddenly reported an exception. After logging in to the machine and checking the usage of the relevant resources, we found that CPU utilization was close to 100%. Using Java's thread dump tool, we exported the stack information for the problem.

We can see that all the stacks point to a method named validateurl, with more than 100 such error entries in the stack output. By checking the code, we learned that the main job of this method is to verify whether a URL is legal.

The check is done with a regular expression, and it seemed strange that a regular expression could cause high CPU utilization. To reproduce the problem, we extracted the key code and wrote a simple unit test.

public class BadRegexDemo {
    public static void main(String[] args) {
        // URL validation regex extracted from validateurl
        String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
        // URL that triggered the CPU spike
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
        // Warning: this call may keep one CPU core busy for a very long time
        if (bugUrl.matches(badRegex)) {
            System.out.println("match!!");
        } else {
            System.out.println("no match!!");
        }
    }
}

When we run the example above, the resource monitor shows that the CPU utilization of a process named java soars to 91.4%.

At this point, we can basically conclude that this regular expression is the culprit behind the high CPU utilization!

So we focus on that regular expression:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

At first glance this regular expression looks fine. It can be divided into three parts:

The first part matches the HTTP and HTTPS protocols, the second part matches domain name segments such as www., and the third part matches the remaining characters. I stared at this expression for a long time and could not find any big problem.

In fact, the key reason for the high CPU utilization is this: the regular expression engine used by Java is implemented as an NFA automaton, which backtracks while matching characters. Once backtracking occurs, matching can take a long time, possibly minutes or even hours, depending on how many times it backtracks and how complex the backtracking is.

At this point you may not know what backtracking is and may feel a bit confused. That is fine. Let's start from the principles of regular expressions and work through it step by step.

Regular expression engine

A regular expression is a convenient notation for matching text, but supporting such a complex and powerful matching syntax requires an algorithm, and the component that implements this algorithm is called the regular expression engine. Simply put, there are two ways to implement a regular expression engine: DFA automaton (deterministic finite automaton) and NFA automaton (non-deterministic finite automaton).

These two kinds of automata differ in many ways, and we will not go deep into their principles here. In short, the time complexity of a DFA automaton is linear and more stable, but its features are limited. The time complexity of an NFA automaton is less stable: sometimes it is very good and sometimes it is not, depending on the regular expression you write. However, NFA is more powerful than DFA, so languages such as Java, .NET, Perl, Python, Ruby, and PHP use NFA to implement their regular expressions.

So how does an NFA automaton match? Let's illustrate with the following string and expression.

text="Today is a nice day."
regex="day"

The important thing to remember is that NFA matching is driven by the regular expression. That is, the NFA automaton reads the regular expression one symbol at a time and matches it against the target string. If the match succeeds, it moves on to the next symbol of the regular expression; otherwise, it continues comparing against the next character of the target string. If this is hard to follow, that is fine. Let's walk through the example above step by step.

  • First, take the first symbol of the regular expression: d. Compare it with the characters of the string. The first character of the string is T, which does not match, so move to the next one. The second is o, which does not match either, so move to the next one. The third is d, which matches, so read the second symbol of the regular expression: a.
  • The second symbol of the regular expression is a. Compare it with the fourth character of the string, a, and it matches. Then read the third symbol of the regular expression: y.
  • The third symbol of the regular expression is y. Compare it with the fifth character of the string, y, and it matches. Try to read the next symbol of the regular expression; there is none, so the match ends.

This is how an NFA automaton matches. Real matching is much more complex than this, but the principle is the same.
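The walk-through above can be sketched in a few lines of Java. This is only a minimal illustration for a pattern made of plain characters, not how java.util.regex is implemented; the class name NaiveLiteralMatch and the method find are hypothetical names chosen for this sketch.

```java
class NaiveLiteralMatch {
    /**
     * Hypothetical helper: returns the index in text where the literal pattern
     * first matches, or -1 if it never matches. Only plain characters are
     * supported (no quantifiers or character classes).
     */
    static int find(String text, String pattern) {
        for (int start = 0; start + pattern.length() <= text.length(); start++) {
            int p = 0; // index of the pattern character currently being compared
            while (p < pattern.length() && text.charAt(start + p) == pattern.charAt(p)) {
                p++; // this pattern character matched, read the next one
            }
            if (p == pattern.length()) {
                return start; // no pattern characters left: the match is complete
            }
            // Mismatch: try again from the next character of the text.
        }
        return -1;
    }

    public static void main(String[] args) {
        // 0-based index 2 is the third character, the d of "Today"
        System.out.println(find("Today is a nice day.", "day")); // prints 2
    }
}
```

The outer loop plays the role of "move to the next character of the string", and the inner loop plays the role of "read the next symbol of the regular expression".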

Backtracking in NFA automata

Now that we know how NFA does string matching, let's talk about the focus of this article: backtracking. To explain backtracking better, we will use the following example.

text="abbc"
regex="ab{1,3}c"

The purpose of this regular expression is simple: match a string that starts with a, ends with c, and has 1 to 3 b characters in between. The NFA processes it as follows:

  • First, read the first symbol of the regular expression, a, and compare it with the first character of the string, a. It matches, so read the second symbol of the regular expression.
  • Read the second symbol of the regular expression, b{1,3}, and compare it with the second character of the string, b. It matches. But because b{1,3} stands for 1 to 3 b characters, and NFA automata are greedy (that is, they match as much as possible), we do not move on to the next symbol of the regular expression yet. Instead, we use b{1,3} again to compare with the third character of the string, b, and it matches too. So we use b{1,3} once more to compare with the fourth character of the string, c, and find that it does not match. At this point, backtracking occurs.
  • What does backtracking do? The fourth character, c, which we have already read, is given back, and the pointer returns to the position of the third character. After that, the program reads the next symbol of the regular expression, c, and compares it with the next character after the current pointer, c, and it matches. It then tries to read the next symbol, but the regular expression has ended, so the match is complete.
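The steps above can be traced with a small hand-written matcher. This is a sketch under stated assumptions, not the real engine: it only handles patterns of the fixed shape a b{min,max} followed by a literal tail, and the class GreedyBacktrackSketch, the method matches, and the tail parameter are all hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyBacktrackSketch {
    /**
     * Hypothetical helper: matches the whole text against  a b{min,max} tail,
     * where tail is a literal string, and records every step in trace.
     */
    static boolean matches(String text, int min, int max, String tail, List<String> trace) {
        if (text.isEmpty() || text.charAt(0) != 'a') {
            trace.add("'a' does not match at index 0");
            return false;
        }
        trace.add("'a' matches index 0");
        int pos = 1;   // next index of the text to read
        int taken = 0; // how many b characters b{min,max} currently holds
        // Greedy phase: keep using b{min,max} for as long as it still matches.
        while (taken < max) {
            if (pos < text.length() && text.charAt(pos) == 'b') {
                taken++;
                trace.add("b{" + min + "," + max + "} takes 'b' at index " + pos);
                pos++;
            } else {
                trace.add("b{" + min + "," + max + "} fails at index " + pos + ", backtrack");
                break;
            }
        }
        // Backtracking phase: try the tail; on failure give back one b and retry.
        while (taken >= min) {
            if (text.startsWith(tail, pos) && pos + tail.length() == text.length()) {
                trace.add("tail \"" + tail + "\" matches at index " + pos);
                return true;
            }
            trace.add("tail fails at index " + pos + ", give back one 'b'");
            pos--;
            taken--;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        System.out.println(matches("abbc", 1, 3, "c", trace)); // prints true
        trace.forEach(System.out::println);
    }
}
```

With text abbc and tail c, the trace follows the bullets above: two b characters are taken, the third attempt fails on c, and c then matches. Calling it with text abbbc and tail bc shows the other kind of backtracking, where b{1,3} first takes all three b characters and then has to give one back so that the tail can match.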

Now let's go back to the regular expression that validates the URL:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

The problem URL is:

http://www.fapiao.com/dzfp-web/pdf/download?request=6e7JGm38jfjghVrv4ILd-kEn64HcUX4qL4a4qJ4-CHLmqVnenXC692m74H5oxkjgdsYazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf

We divide this regular expression into three parts:

  • The first part verifies the protocol: ^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)
  • The second part verifies the domain name: (([A-Za-z0-9-~]+).)+
  • The third part verifies the parameters: ([A-Za-z0-9-~\\/])+$

We can see that there is no problem with the part of the regular expression that checks the protocol http://, but when checking www.fapiao.com, it uses the form xxxx. to check. In fact, the matching process is as follows:

  • Match www
  • Match fapiao
  • Match com/dzfp-web/pdf/download?request=6e7JGm38jf..... Because of greedy matching, the program keeps reading the rest of the string to match, finally finds that there is no dot, and then backtracks one character at a time.

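This is the first problem of this regular expression. Note also that the . in the second part is not escaped, so it matches any character rather than only a literal dot. Combined with the nested quantifiers in (([A-Za-z0-9-~]+).)+, this gives the engine a very large number of ways to split the same run of characters when it backtracks.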

The other problem is in the third part of the regular expression. The URL in question contains underscores (_) and percent signs (%), but the character class of the third part does not include them. As a result, only after a long string of characters has been matched does the engine discover the mismatch, and then it backtracks.

This is the second problem of this regular expression.
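To get a feel for how quickly backtracking can grow when a match finally fails, here is a minimal sketch that counts steps for a much simpler stand-in pattern, (a+)+b, rather than the URL regular expression itself. It is a hand-written backtracking matcher, not java.util.regex, so the numbers only illustrate the growth, and the names BacktrackStepCounter, countSteps, and failingInput are hypothetical.

```java
class BacktrackStepCounter {
    private static long steps;

    // Hand-written backtracking matcher for the stand-in pattern (a+)+b,
    // anchored at both ends. pos is where the next group iteration starts.
    private static boolean match(String s, int pos) {
        int end = pos;
        // Inner a+ is greedy: read as many a characters as possible.
        while (end < s.length() && s.charAt(end) == 'a') {
            end++;
            steps++;
        }
        // Try every length for this iteration, longest first (backtracking).
        for (int e = end; e > pos; e--) {
            steps++;
            if (e + 1 == s.length() && s.charAt(e) == 'b') {
                return true; // trailing b found at the very end
            }
            if (match(s, e)) {
                return true; // another iteration of the outer group succeeded
            }
        }
        return false;
    }

    static long countSteps(String s) {
        steps = 0;
        match(s, 0);
        return steps;
    }

    // n copies of a followed by c, so the match can never succeed.
    static String failingInput(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.append('c').toString();
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 20; n += 5) {
            System.out.println(n + " a's -> " + countSteps(failingInput(n)) + " steps");
        }
    }
}
```

In this sketch, each extra a roughly doubles the number of steps for a failing input, while an input that matches, such as aaab, finishes in a handful of steps. That is why the problem tends to appear only on certain inputs and not in ordinary testing.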

Solution

Once you understand that backtracking is the cause, the solution is to reduce backtracking. You will find that if the underscore and the percent sign are added to the third part, the program runs normally.

public class FixedRegexDemo {
    public static void main(String[] args) {
        // Same regex as before, with _ and % added to the third character class
        String fixedRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~_%\\\\/])+$";
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
        if (bugUrl.matches(fixedRegex)) {
            System.out.println("match!!");
        } else {
            System.out.println("no match!!");
        }
    }
}

Run the above program and it immediately prints match!!

But this is not enough. If other URLs containing unexpected characters show up in the future, we would have to modify the regular expression again, which is clearly unrealistic!

In fact, regular expressions have three matching modes: greedy mode, lazy mode, and exclusive mode (called possessive in the Java documentation).

For matching quantities, there are four quantifiers: + ? * {min,max}. Used on their own, they are greedy.

If a ? is added after them, the original greedy mode becomes lazy mode, that is, matching as little as possible. But lazy mode still backtracks. For example:

text="abbc"
regex="ab{1,3}?c"

The first symbol of the regular expression, a, matches the first character of the string, a. Then the second symbol, b{1,3}?, matches the second character of the string, b. Because of the minimum-match principle, the third symbol, c, is then compared with the third character of the string, b, and does not match. So the engine backtracks and uses the second symbol, b{1,3}?, to match the third character, b, which succeeds. Then the third symbol, c, is compared with the fourth character, c, and the match succeeds. The match ends.
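The difference between greedy and lazy quantifiers is easy to observe with java.util.regex. The snippet below is a small sketch; the class name GreedyVsLazy, the helper firstMatch, and the extra test string abbbc are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first substring of text matched by regex, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: b{1,3} takes as many b characters as it can.
        System.out.println(firstMatch("ab{1,3}", "abbbc"));  // prints abbb
        // Lazy: b{1,3}? takes as few b characters as it can.
        System.out.println(firstMatch("ab{1,3}?", "abbbc")); // prints ab
        // The lazy pattern from the example still matches, after backtracking.
        System.out.println("abbc".matches("ab{1,3}?c"));     // prints true
    }
}
```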

If a + is added after them, the original greedy mode becomes exclusive mode, that is, matching as much as possible but without backtracking.
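A short sketch makes the "no backtracking" part concrete. The patterns ab{1,3}bc and ab{1,3}+bc and the string abbbc are chosen here for illustration and do not come from the URL check; PossessiveDemo and fullMatch are hypothetical names.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True if the whole text matches regex.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Greedy: b{1,3} takes three b characters, then gives one back so that bc can match.
        System.out.println(fullMatch("ab{1,3}bc", "abbbc"));  // prints true
        // Exclusive: b{1,3}+ takes three b characters and never gives any back, so bc cannot match.
        System.out.println(fullMatch("ab{1,3}+bc", "abbbc")); // prints false
        // When nothing has to be given back, the exclusive form matches as usual.
        System.out.println(fullMatch("ab{1,3}+c", "abbc"));   // prints true
    }
}
```

The second result shows the trade-off: exclusive mode avoids backtracking, but it can also reject a string that the greedy version would accept.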

Therefore, to solve the problem thoroughly, we need to keep the functionality while avoiding backtracking. I added a + after the second part of the URL validation regular expression above, as follows:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)
(([A-Za-z0-9-~]+).)++    --->>> (a + sign is added here)
([A-Za-z0-9-~\\/])+$

After that, the original program runs without any problem.
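For reference, here is the test program again with the + added, written out as a Java string. This is a sketch: the class name PossessiveUrlCheck and the method validate are hypothetical, and the URL is the one from the unit test earlier in the article.

```java
class PossessiveUrlCheck {
    // URL regex from the article with an extra + after the second part (exclusive mode).
    static final String POSSESSIVE_REGEX =
        "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\\\/])+$";

    static boolean validate(String url) {
        return url.matches(POSSESSIVE_REGEX);
    }

    public static void main(String[] args) {
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
        // Returns quickly because the second part can no longer be re-split by backtracking.
        System.out.println(validate(bugUrl) ? "match!!" : "no match!!");
    }
}
```

As far as I can tell, this version finishes at once but prints no match!! for this URL, because the third part still does not allow _ and %. Exclusive mode removes the backtracking; it does not widen the set of accepted characters.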

Finally, I recommend a website that can check whether the regular expression you wrote has problems when matching a given string.

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, when the URL from this article is checked on the website, it reports: catastrophic backtracking.

When you click “regex debugger” in the lower left corner, it tells you how many steps were taken, lists all the steps, and marks where backtracking occurs.

The check of the regular expression in this article was stopped automatically after 110000 attempts. This shows that the regular expression does have problems and needs to be improved.

But when I tested it with the modified regular expression, shown below:

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\\/])+$

The tool reports that the check completes in only 58 steps.

A difference of a single character, and the performance gap is thousands of times or more.

A word from Shuyi

It is surprising that a small regular expression can bring down the CPU. This is also a wake-up call for us when writing programs: whenever we use regular expressions, we must pay attention to greedy mode and backtracking. Otherwise, every expression we write is a landmine.

Looking up material online, I found that colleagues on the Lazada team at Alibaba's Shenzhen center also ran into this problem in 2017. They also found no problem in the test environment, but hit 100% CPU once online, and the problem they encountered was almost identical to ours. If you are interested, you can click to read the original text for their write-up: a murder case triggered by regular expression – Mingzhi jianzhiyuan – blog Garden

Although this article is finished, the explanation of the principles of NFA automata is still not deep enough, especially regarding lazy mode and exclusive mode. NFA automata are not easy to understand, and I still need to keep studying them. Knowledgeable readers are welcome to exchange ideas so we can learn from each other.


This article was published by the Tencent Cloud+ Community with the author's authorization. Original link: https://cloud.tencent.com/dev
