Use regular expressions to find items that do not contain a specific string

Time:2020-5-27

Log analysis often means dealing with thousands of log entries. To find data matching a specific pattern in that much data, we often have to write fairly complex regular expressions: for example, listing the entries in a log file that do not contain a specific string, or finding the entries that do not start with a specific string.

Using negative lookahead

Regular expressions have two related concepts: lookahead and lookbehind. The terms describe the matching behavior of the regex engine. Note that "ahead" and "behind" here differ slightly from everyday usage. For a piece of text, we usually call the direction toward its beginning "front" and the direction toward its end "back". The regex engine, however, parses from the head of the text toward the tail (some engines let you change the parsing direction with an option), so the tail direction is "ahead", because the engine has not reached that part yet, and the head direction is "behind", because the engine has already passed it. As shown in the figure below:

[Figure: lookahead and lookbehind]

Lookahead means that when the engine reaches a certain position, it peeks at the not-yet-parsed text ahead to check whether it matches (or does not match) a pattern. Lookbehind does the same with the text the engine has already passed. Requiring a match is called positive; requiring no match is called negative.
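
As a rough illustration of the four combinations, here is a minimal Python sketch. The sample text and the function name lookaround_demo are made up for this example; the \b anchors are only there so that a number is always matched as a whole.

```python
import re

# Made-up sample text, used only for this illustration.
TEXT = "price: 100 USD, 200 EUR"


def lookaround_demo(text=TEXT):
    """Return what each of the four lookaround forms finds in `text`."""
    return {
        # Positive lookahead: a number that IS followed by " USD".
        "positive_lookahead": re.findall(r"\b\d+\b(?= USD)", text),
        # Negative lookahead: a number that is NOT followed by " USD".
        "negative_lookahead": re.findall(r"\b\d+\b(?! USD)", text),
        # Positive lookbehind: a number that IS preceded by "price: ".
        "positive_lookbehind": re.findall(r"(?<=price: )\b\d+\b", text),
        # Negative lookbehind: a number that is NOT preceded by "price: ".
        "negative_lookbehind": re.findall(r"(?<!price: )\b\d+\b", text),
    }


if __name__ == "__main__":
    for name, found in lookaround_demo().items():
        print(name, found)
```

None of the lookaround groups consume any text; they only test what is next to the current position.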

Modern regular expression engines generally support lookahead, but support for lookbehind is less widespread, so we use negative lookahead to meet our needs.

Implementation

Test data:

2009-07-07 04:38:44 127.0.0.1 GET /robots.txt
2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt
2009-07-08 04:38:44 127.0.0.1 GET /

For the simple log entries above, we want to achieve two goals:

1. Filter out the entries dated the 8th (2009-07-08).
2. Find the entries that do not contain the string robots.txt (every entry whose URL contains robots.txt is filtered out).

The negative lookahead syntax is:

(?!pattern)

Let's achieve the first goal first: matching entries that do not start with a specific string.

Here we want to exclude one contiguous string, so the pattern is simply 2009-07-08. The implementation is as follows:

^(?!2009-07-08).*?$

Testing with Expresso confirms that the entries dated the 8th are filtered out.
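
The same check can be reproduced outside Expresso. The following is a minimal sketch in Python, assuming the log is available as a string and is tested one line at a time; the names LOG, NOT_JULY_8 and kept are made up for this example.

```python
import re

# The test data from above.
LOG = """\
2009-07-07 04:38:44 127.0.0.1 GET /robots.txt
2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt
2009-07-08 04:38:44 127.0.0.1 GET /
"""

# Negative lookahead anchored at the start of the line:
# the line must not begin with "2009-07-08".
NOT_JULY_8 = re.compile(r"^(?!2009-07-08).*?$")

# Keep only the lines the pattern matches.
kept = [line for line in LOG.splitlines() if NOT_JULY_8.match(line)]

if __name__ == "__main__":
    for line in kept:
        print(line)
```

Only the two entries dated 2009-07-07 remain in kept.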

Next, let's achieve the second goal: excluding entries that contain a specific string.

Following the same approach as above, I wrote:

^.*?(?!robots\.txt).*?$

In plain language, this pattern says: start with any characters, which must not be followed by the string robots.txt, and then any number of characters up to the end of the line.
Running the test, we found that all three entries still match.

It didn’t achieve the effect we wanted. Why is that? Let’s add two capture groups to the above regular expression for debugging:

^(.*?)(?!robots\.txt)(.*?)$

In the test results, the first group matches nothing, while the second group matches the entire line. Let's go back and analyze the expression. Call the first .*? region A and the lookahead region B. As soon as the engine enters region A, it already tries the lookahead of region B. With region A empty, the match succeeds: .*? is allowed to match the empty string, and the lookahead condition is satisfied, because region A is followed by the string "2009", not robots.txt. The second .*? then consumes the rest of the line, so every entry matches.
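
The debugging step can be repeated in code. This is a small Python sketch, assuming one of the test lines as input; the variable names are made up for this example.

```python
import re

line = "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt"

# The debugging pattern from above, with two capture groups.
debug = re.compile(r"^(.*?)(?!robots\.txt)(.*?)$")

match = debug.match(line)
first_group, second_group = match.groups()

if __name__ == "__main__":
    print(repr(first_group))      # the lazy first group stays empty
    print(second_group == line)   # the second group takes the whole line
```

The lookahead is only tested at position 0, where the text starts with "2009", so it never gets a chance to reject the line.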


Having found the cause, we modify the pattern and move .*? inside the lookahead:

^(?!.*?robots).*$

Test results: the entry requesting /robots.txt is excluded, and the other two entries match.

Done.
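
To compare the two attempts side by side, here is a minimal Python sketch that runs both patterns over the test data. The names LOG_LINES, first_attempt and corrected are made up for this example.

```python
import re

LOG_LINES = [
    "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt",
    "2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt",
    "2009-07-08 04:38:44 127.0.0.1 GET /",
]

# First attempt: the lookahead is tested at one position only,
# and the lazy .*? in front of it may be empty.
first_attempt = re.compile(r"^.*?(?!robots\.txt).*?$")

# Corrected pattern: the lookahead itself scans the whole line,
# so "robots" is rejected wherever it occurs.
corrected = re.compile(r"^(?!.*?robots).*$")

first_attempt_hits = [l for l in LOG_LINES if first_attempt.match(l)]
corrected_hits = [l for l in LOG_LINES if corrected.match(l)]

if __name__ == "__main__":
    print(len(first_attempt_hits))
    for line in corrected_hits:
        print(line)
```

Note that the corrected pattern tests for robots rather than robots\.txt, so it would also reject a URL such as /robots/index.html; on this test data the result is the same either way.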

Excluding a string with a regular expression in PHP

preg_match("/^((?!abc).)*$/is", $str);

Full code example

<?php
$str = "dfadfadf765577abc55fd";
// A character is consumed only if "abc" does not start at that position.
$pattern_url = "/^((?!abc).)*$/is";
if (preg_match($pattern_url, $str))
{
    echo "does not contain abc!";
}
else
{
    echo "contains abc!";
}

The result: preg_match returns 0 (no match), so the output is "contains abc!".

Likewise, a regular expression for strings that contain "abc" but do not contain "xyz":

preg_match("/^(?=.*abc)((?!xyz).)*$/is", $str);
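
One possible way to express "contains abc but does not contain xyz" is a positive lookahead for abc combined with the ((?!xyz).)* idiom. The following Python sketch shows that combination; the function name has_abc_but_not_xyz is made up for this example, and the flags mirror the /is modifiers used above.

```python
import re

# (?=.*abc)     : "abc" must occur somewhere in the string
# ((?!xyz).)*$  : no position up to the end may start "xyz"
HAS_ABC_NO_XYZ = re.compile(r"^(?=.*abc)((?!xyz).)*$", re.IGNORECASE | re.DOTALL)


def has_abc_but_not_xyz(text):
    """True if `text` contains "abc" and does not contain "xyz"."""
    return HAS_ABC_NO_XYZ.match(text) is not None


if __name__ == "__main__":
    for sample in ["123abc456", "abc then xyz", "xyz only", "nothing here"]:
        print(sample, has_abc_but_not_xyz(sample))
```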

This method works. I use it as follows:

(?:(?!<\/div>).|\n)*?  // matches a string that does not contain </div>
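
As an illustration of how such a fragment might be used, the sketch below wraps it between an opening and a closing tag to capture the text of each block. The sample HTML, the surrounding <div> and </div> literals, and the names are assumptions made for this example, not the original usage.

```python
import re

# Made-up sample input with a newline inside the first block.
html = "<div>first\nblock</div><div>second</div>"

# (?:(?!</div>).|\n)*? consumes one character at a time, but never
# steps onto a position where "</div>" begins. The "|\n" branch is
# needed because "." does not match a newline by default.
DIV_BODY = re.compile(r"<div>((?:(?!</div>).|\n)*?)</div>")

bodies = DIV_BODY.findall(html)

if __name__ == "__main__":
    for body in bodies:
        print(repr(body))
```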

However, in practice this method turned out to be extremely inefficient. It is acceptable for very short text (where the part to be matched is at most a few dozen characters). For large-scale article analysis, or when many parts have to be matched and matching time matters, it should not be used, and other methods should be considered instead (for example, first extracting the text that the regular expression is to be applied to). Regular expressions are not very efficient at matching text segments that lack a specific string.
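
One such alternative, when the excluded text is a fixed string rather than a pattern, is a plain substring test with no regular expression at all. The following Python sketch shows the idea; the function name lines_without is made up for this example, and no performance figures are claimed here.

```python
def lines_without(lines, needle):
    """Return the lines that do not contain the fixed string `needle`."""
    return [line for line in lines if needle not in line]


if __name__ == "__main__":
    log_lines = [
        "2009-07-07 04:38:44 127.0.0.1 GET /robots.txt",
        "2009-07-07 04:38:44 127.0.0.1 GET /posts/robotfile.txt",
        "2009-07-08 04:38:44 127.0.0.1 GET /",
    ]
    for line in lines_without(log_lines, "robots.txt"):
        print(line)
```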