Learn to use six rules of regular expressions

Time:2022-11-24
Table of contents:
1. What is a regular expression
2. Character understanding
3. Cycle and repeat
4. Position boundary
5. Subexpression
6. Logic processing

1. What is a regular expression

A regular expression is essentially a tool for string pattern matching, which implements the search and replace function of strings.

We can see from its name that it is an expression used to describe a certain rule.

What we want to learn here is how to use it for searching and replacing. Its underlying implementation mechanism is out of scope.


2. Character understanding

Regular expressions are composed of characters, which fall into two kinds: ordinary characters and metacharacters.

Ordinary characters: literal characters that match themselves. Digits and English letters are the ones most often used in regular expressions.

Metacharacters: characters that carry special semantics. For example, ^ means not (inside a character set) and | means or.

Regular expressions combine these two kinds of characters to form the actual matching rules.

2.1 Single character matching

The simplest regular expression consists of digits and letters. To match the character ‘b’ in ‘banana’, use the regular expression /b/ directly. With /a/, the character ‘a’ is matched instead; in JavaScript, /a/ finds only the first ‘a’, and the global flag (/a/g) is needed to match all of them.
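
A minimal JavaScript sketch of the ‘banana’ example (the variable names are illustrative, not from the original article):

```javascript
const word = 'banana';

// Without the g flag, match() reports only the first match.
const firstB = word.match(/b/);
console.log(firstB[0], firstB.index);   // b 0

// With the g flag, match() returns every match.
const allA = word.match(/a/g);
console.log(allA);                      // [ 'a', 'a', 'a' ]
```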

If we want to match a special character such as ‘*’, which is itself a metacharacter, we put the escape character \ in front of it so that it loses its special meaning.

/\*/ matches the character ‘*’.
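
As a small sketch of escaping in JavaScript (the sample string and variable names are made up for illustration):

```javascript
// Hypothetical sample string; '*' must be escaped to be matched literally.
const formula = '3 * 4 = 12';

const starMatch = formula.match(/\*/);
console.log(starMatch[0]);      // *
console.log(starMatch.index);   // 2
```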

We can also use \ to give ordinary characters a special meaning. For example, whitespace characters such as newlines and tabs are matched with the following escape sequences:

- \n: newline
- \r: carriage return
- \t: tab
- \v: vertical tab
- \f: form feed
- [\b]: backspace
- \s: any whitespace character, including the space

2.2 Multiple character matching

Single character matching is one-to-one: each character in the pattern matches exactly one specific character, which is obviously not enough. Regular expressions therefore introduce character sets, ranges, and wildcards to achieve one-to-many matching, where one position in the regular expression can match several different characters.

A character set is written with [], for example: /[123]/ matches any of the characters 1, 2, or 3 in the string. The metacharacter - indicates a range inside the set, for example: /[0-9]/ matches any digit, and /[a-z]/ matches any lowercase English letter.
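
A short sketch of sets and ranges in JavaScript (the sample string is invented; the g flag is used so that every match is returned):

```javascript
const sample = 'a1b2c3';

// A set lists the allowed characters explicitly.
const oneOrThree = sample.match(/[13]/g);
console.log(oneOrThree);   // [ '1', '3' ]

// A range covers every character between its two endpoints.
const digits = sample.match(/[0-9]/g);
console.log(digits);       // [ '1', '2', '3' ]

const letters = sample.match(/[a-z]/g);
console.log(letters);      // [ 'a', 'b', 'c' ]
```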

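In addition, regular expressions provide a batch of convenient shorthand forms for commonly used sets:

- .: any character except line terminators
- \d: a digit, same as [0-9]
- \D: a non-digit, same as [^0-9]
- \w: a letter, digit, or underscore, same as [A-Za-z0-9_]
- \W: any character not matched by \w
- \s: a whitespace character
- \S: a non-whitespace character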
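
The shorthand classes \d (digit), \w (letter, digit, or underscore), and \s (whitespace) can be tried out with a sketch like this one (the sample string is hypothetical):

```javascript
const entry = 'Room 42, floor_3';

const digitChars = entry.match(/\d/g);
console.log(digitChars);    // [ '4', '2', '3' ]

// \w+ uses the + quantifier to grab runs of word characters.
const wordRuns = entry.match(/\w+/g);
console.log(wordRuns);      // [ 'Room', '42', 'floor_3' ]

const spaceCount = entry.match(/\s/g).length;
console.log(spaceCount);    // 2
```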


3. Cycle and repeat

The above covered one-to-one and one-to-many character matching. This section covers how many times a character may be matched.

Matching several characters at once comes down to controlling the number of occurrences of a character in the regular expression. The number of occurrences falls into four cases: 0 or 1 time, 0 or more times, 1 or more times, and a specific number of times.

3.1 Match 0 or 1

The metacharacter ‘?’ means to match the preceding character 0 or 1 time, for example: the regular expression /colou?r/ can match the strings ‘color’ and ‘colour’ at the same time, which means that ‘u’ can appear 0 times or 1 time.

3.2 Match 0 or more times

The metacharacter ‘*’ matches the preceding character 0 or more times, with no upper limit.

3.3 Match 1 or more times

The metacharacter ‘+’ matches the preceding character 1 or more times.
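
The three quantifiers ‘?’, ‘*’, and ‘+’ can be compared side by side. This is only a sketch; the sample strings other than ‘color’ and ‘colour’ are invented for illustration:

```javascript
// ? : the 'u' may appear 0 or 1 time, so 'colouur' is not matched.
const colours = 'color colour colouur'.match(/colou?r/g);
console.log(colours);    // [ 'color', 'colour' ]

// * : the 'o' may appear 0 or more times.
const starHits = 'gd god good'.match(/go*d/g);
console.log(starHits);   // [ 'gd', 'god', 'good' ]

// + : the 'o' must appear at least once.
const plusHits = 'gd god good'.match(/go+d/g);
console.log(plusHits);   // [ 'god', 'good' ]
```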

3.4 Match a specific number of times

The metacharacter ‘{}’ matches the preceding character a specific number of times. For example, /b{3}/ matches 3 consecutive b characters. It has several derived forms:

- {x}: exactly x times
- {min,max}: between min and max times
- {min,}: at least min times
- {0,max}: at most max times

Do not put a space after the comma: JavaScript does not treat a form such as {1, 3} as a quantifier.
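
A sketch of the four ‘{}’ forms in JavaScript. The sample string is invented, and each run of b characters is wrapped in x so that a match cannot start in the middle of a longer run:

```javascript
const runs = 'xbx xbbx xbbbx xbbbbx';

const exactlyThree = runs.match(/xb{3}x/g);
console.log(exactlyThree);   // [ 'xbbbx' ]

const twoToThree = runs.match(/xb{2,3}x/g);
console.log(twoToThree);     // [ 'xbbx', 'xbbbx' ]

const atLeastTwo = runs.match(/xb{2,}x/g);
console.log(atLeastTwo);     // [ 'xbbx', 'xbbbx', 'xbbbbx' ]

const atMostTwo = runs.match(/xb{0,2}x/g);
console.log(atMostTwo);      // [ 'xbx', 'xbbx' ]
```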

Finally, a summary of loops and repetition: ‘?’ means 0 or 1 time, ‘*’ means 0 or more times, ‘+’ means 1 or more times, and ‘{}’ means a specific number or range of times.


4. Position boundary

Position boundary matching restricts where in a long text the characters being searched for may appear. For example, we may only want to find characters at the beginning or end of a word.

4.1 Word boundaries

A word boundary lets a pattern match a whole word only. A common scenario is finding a specific word in an article or sentence, for example:

The cat scattered his food all over the room.

If I use /cat/ to match words in this sentence, it also matches the ‘cat’ inside ‘scattered’, which is not wanted. In this case, wrap the word in the metacharacter \b, which marks a word boundary: /\bcat\b/ matches only the word ‘cat’.
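
The difference can be checked with a short JavaScript sketch using the sentence above (the variable names are illustrative):

```javascript
const sentence = 'The cat scattered his food all over the room.';

// Without boundaries, the 'cat' inside 'scattered' is matched too.
const loose = sentence.match(/cat/g);
console.log(loose.length);    // 2

// With \b on both sides, only the standalone word is matched.
const strict = sentence.match(/\bcat\b/g);
console.log(strict.length);   // 1
console.log(sentence.search(/\bcat\b/));   // 4
```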

4.2 String boundaries

The above matches words, but sometimes we need to match a whole string. How do we do it?
The metacharacter ‘^’ matches the beginning of a string, and the metacharacter ‘$’ matches the end of a string.

When the text spans several lines and we want to match one whole line, the newline characters get in the way. In that case, add the letter ‘m’ after the closing slash of the regular expression to enable multi-line mode, in which ‘^’ and ‘$’ also match at the beginning and end of each line.

For example, I want to match the line ‘I am the rain man’ in the following text:

until that day,
he finally told me that,
I am the rain man

Use the regular expression /^I am the rain man$/m.
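
A sketch of the effect of the ‘m’ flag in JavaScript. The three lines are joined with \n into one string here, and the last line is assumed to contain exactly the text being searched for:

```javascript
const story = 'until that day,\nhe finally told me that,\nI am the rain man';

// Without m, ^ and $ only match at the start and end of the whole string.
const singleLine = /^I am the rain man$/.test(story);
console.log(singleLine);   // false

// With m, ^ and $ also match at the start and end of each line.
const multiLine = /^I am the rain man$/m.test(story);
console.log(multiLine);    // true
```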
In addition to multi-line mode, regular expressions have other matching modes. The most common ones in JavaScript are:

- g: global mode, find all matches instead of stopping at the first one
- i: ignore case
- m: multi-line mode, ‘^’ and ‘$’ match at line boundaries


5. Subexpression

The above covers the most basic character matching. The next topic is regular subexpressions.

Their core idea is to build complex regular expressions out of simpler ones. The evolution from simple to complex usually relies on three ideas: grouping, backreferencing, and logical processing. Using these three rules, arbitrarily complex regular expressions can be built.

5.1 Grouping

Grouping is the core idea of subexpressions. The principle is that the part of a regular expression enclosed in ‘()’ is treated as one subexpression (a group), and several such groups can be combined into a complex regular expression.
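
A small sketch of grouping in JavaScript. The date pattern and the ‘ab’ example are assumptions chosen for illustration; match() returns the whole match at index 0 and each group at indexes 1, 2, and so on:

```javascript
// Three groups capture the year, month, and day separately.
const dateParts = '2022-11-24'.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(dateParts[0]);   // 2022-11-24
console.log(dateParts[1]);   // 2022
console.log(dateParts[2]);   // 11
console.log(dateParts[3]);   // 24

// A quantifier after a group repeats the whole group, not just one character.
const repeated = 'ababab'.match(/(ab)+/);
console.log(repeated[0]);    // ababab
```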

5.2 Backreference

A backreference lets a later part of the expression reuse the substring that an earlier subexpression has already matched. It can be thought of as using a variable. Inside the pattern, \1 and \2 refer to the first and second subexpression respectively. Note that in JavaScript \0 does not refer to the whole expression (it matches the NUL character); in a replacement string, the whole match is written as ‘$&’.
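
As a sketch of \1 inside a pattern, the following JavaScript finds a word that is immediately repeated (the sample sentence and names are invented):

```javascript
// (\w+) captures a word; \1 requires the same text to appear again.
const doubledWord = /\b(\w+) \1\b/;

const found = 'this is is a test'.match(doubledWord);
console.log(found[0]);   // is is
console.log(found[1]);   // is

console.log(doubledWord.test('this is a test'));   // false
```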

Backreferences are often used in replacement strings, where ‘$1’ and ‘$2’ stand for the text captured by the first and second subexpression, as follows:

let str = 'cad cae gg'

let str1 = str.replace(/(ca)d/g, '$1e')

console.log(str1);   //cae cae gg

Its operation is equivalent to:

let str1 = str.replace(/(ca)d/g, 'cae')

Sometimes we want a match to succeed only when it is followed or preceded by certain text, without that text becoming part of the match. This is done with forward and backward lookup (lookahead and lookbehind).

Forward lookup

Forward lookup (lookahead) restricts the suffix. The required suffix is written as a subexpression of the form (?=regex).

For example, given the two words happy and happily, if I want to get the adverb starting with happ, I can use /happ(?=ily)/.

In plain terms, it matches the ‘happ’ that is followed by the suffix ‘ily’, and the suffix itself is not part of the match. See the following code for details:

let str = 'happy happily'

let str1 = str.replace(/happ(?=ily)/, 'haha')

console.log(str1);   //happy hahaily

Backward lookup

Backward lookup (lookbehind) specifies a prefix as a subexpression of the form (?<=regex), and matches only the text that immediately follows that prefix.

For example, given the two words apple and people, how do I match only the ‘ple’ in apple? In apple, ‘ple’ is preceded by ‘ap’, so:

/(?<=ap)ple/

The principle mirrors forward lookup. Here is another example:

let str = 'happy syppy'

let str1 = str.replace(/(?<=sy)ppy/, 'haha')

console.log(str1);   //happy syhaha

Remember the key point here: the content inside the brackets is only a condition, and the content that is matched and replaced is outside the brackets!

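Finally, to review this part: ‘()’ groups a subexpression, \1 and ‘$1’ refer back to what a group matched, (?=regex) restricts what must follow the match, and (?<=regex) restricts what must precede it.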


6. Logic processing

Logical processing refers to three logical relationships: and, or, and not.

Only the or and not relationships are discussed here.

The or relationship is usually used to list alternatives inside a subexpression. For example, to match either a or b, I can use a subexpression like (a|b).
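
A minimal sketch of the or relationship in JavaScript (the sample words are invented):

```javascript
// (a|e) accepts either 'a' or 'e' at that position.
const spellings = 'gray grey groy'.match(/gr(a|e)y/g);
console.log(spellings);   // [ 'gray', 'grey' ]
```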

The not relationship has two cases: character matching and subexpression matching.
1) Character matching: negation uses the metacharacter ^. Only a ^ placed at the start of a set, inside [ and ], means not; for example, /[^0-9]/ matches any character that is not a digit.
2) Subexpression matching: negation uses the forward negative lookup subexpression (?!regex) or the backward negative lookup subexpression (?<!regex).
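
Both kinds of negation can be sketched in JavaScript as follows. The sample strings reuse the words from the lookup examples; the replacement texts are arbitrary:

```javascript
// Character negation: [^0-9] matches anything that is not a digit.
const nonDigits = 'a1b2'.match(/[^0-9]/g);
console.log(nonDigits);      // [ 'a', 'b' ]

// Negative lookahead: 'happ' that is NOT followed by 'ily'.
const notAdverb = 'happy happily'.replace(/happ(?!ily)/g, 'haha');
console.log(notAdverb);      // hahay happily

// Negative lookbehind: 'ple' that is NOT preceded by 'ap'.
const notApple = 'apple people'.replace(/(?<!ap)ple/g, 'PLE');
console.log(notApple);       // apple peoPLE
```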

To summarize this section: ‘|’ expresses or, ‘^’ inside [] expresses not for characters, and (?!regex) and (?<!regex) express not for subexpressions.


In the end, as long as you master the above six main rules, you can deal with most matching and replacement problems implemented with regular expressions.