Think back to the first time you saw a regular expression. It probably looked like pure noise, something like `$7^(0^=]W-\^*d+`.

My first instinct was to refuse to deal with it. In day-to-day work, however, I ended up using regular expressions more and more, and they gradually became a very common tool.

To master a tool, understanding how it works is just as important as knowing how to use it. Broadly speaking, regular expression engines fall into two categories: DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton).

Tools that use an NFA include `.NET`, `PHP`, `Ruby`, `Perl`, `Python`, `GNU Emacs`, `ed`, `sed`, `vi`, most versions of `grep`, and even some versions of `egrep` and `awk`. Tools that use a DFA are mainly `egrep`, `awk`, `lex`, and `flex`. Some systems use a hybrid approach and pick the right engine for each task (or even use different engines for different parts of the same expression, to get the best balance between features and speed). (Jeffrey E.F. Friedl, *Mastering Regular Expressions*)

Both DFAs and NFAs are finite automata, and they have a lot in common. An automaton is essentially a graph, much like a state transition diagram. *(Note: this article does not define automata rigorously. For an in-depth treatment, see Introduction to Automata Theory, Languages, and Computation.)*

## NFA

An NFA consists of the following parts:

- An initial state
- One or more accepting (final) states
- A state transition function

The figure above shows an NFA with two states, `q0` and `q1`. The initial state is `q0` (it has no predecessor state) and the accepting state is `q1` (marked with a double circle). The arrow from `q0` to `q1` means that when the NFA is in state `q0` and reads the input `a`, it moves to state `q1`.

To test a string, we put the NFA in its initial state and then follow the transitions dictated by the input. If the NFA is in an accepting state once the input is exhausted, the string is accepted. If an input symbol has no matching transition, or the NFA is not in an accepting state when the input ends, the string is rejected.

As you can see, this NFA accepts the string `a` and nothing else.

So why is it called nondeterministic? Because **for the same state and the same input symbol, an NFA may move to several different states.** For example:

In `q0`, on input `a`, the NFA can either stay in `q0` or move to `q1`. So the NFA accepts `abb` (`q0 -> q1 -> q2 -> q3`) and also `aabb` (`q0 -> q0 -> q1 -> q2 -> q3`). It likewise accepts `ababb`, `aaabbbabababb`, and so on. You may have noticed that the regular expression this NFA represents is exactly `(a|b)*abb`.
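One way to make the nondeterminism concrete is to track every state the NFA could be in at once. The sketch below is illustrative only: the transition table is reconstructed from the paths quoted above (the loop `q0 -b-> q0` is an assumption needed for `ababb`), and the names `delta` and `accepts` are hypothetical.

```typescript
// Transition table for the NFA described above (q0..q3 written as 0..3).
// delta[state][symbol] lists every state reachable on that symbol.
// Assumption: q0 loops on both `a` and `b`, as the example paths suggest.
type Delta = Record<number, Record<string, number[]>>;

const delta: Delta = {
  0: { a: [0, 1], b: [0] }, // on `a`, q0 may stay put or move to q1
  1: { b: [2] },
  2: { b: [3] },
  3: {},
};
const START = 0;
const ACCEPT = 3;

// Follow all possible transitions in parallel, one input symbol at a time.
function accepts(input: string): boolean {
  let current: number[] = [START];
  for (let i = 0; i < input.length; i++) {
    const next: number[] = [];
    for (const state of current) {
      const targets = delta[state][input[i]] || [];
      for (const t of targets) {
        if (next.indexOf(t) === -1) next.push(t); // keep the set duplicate-free
      }
    }
    current = next; // an empty set means no path survived
  }
  return current.indexOf(ACCEPT) !== -1;
}
```

With this table, `accepts("abb")` and `accepts("aabb")` return `true`, while `accepts("ab")` returns `false` because no path ends in `q3`.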

## ε-NFA

Besides reaching several states on one input, an NFA can also take transitions on the empty symbol `ε`, as follows:

This is an NFA that accepts `(a+|b+)`, because it contains paths such as `q0 -ε-> q1 -a-> q2 -a-> q2`. `ε` represents the empty string and disappears under concatenation, so this path accepts `aa`.

You may wonder why `q0` is not connected directly to `q2` via `a` and to `q4` via `b`. That is because `ε` mainly serves as glue for joining automata together, which will become apparent later.

## DFA

With nondeterministic finite automata covered, deterministic finite automata are easy to understand. A DFA differs from an NFA in two ways:

- There are no `ε` transitions
- For a given state and input, there is only one transition

So a DFA is much simpler than an NFA. Why not use a DFA directly? Because for a given regular language it is often much easier to construct an NFA than a DFA. The NFA for `(a|b)*abb` mentioned above is easy to construct and understand:

But it’s not so easy to construct the corresponding DFA directly. You can try to construct it first, and the result is probably like this:

So an NFA is easy to construct, but its nondeterminism makes the state transition logic hard to implement in a program. A DFA is not easy to construct, but its determinism makes the state transition logic easy to implement. What should we do?

The magic is that every NFA has an equivalent DFA. So we usually construct an NFA from the regular expression, convert it into the corresponding DFA, and finally run recognition on the DFA.

## McNaughton-Yamada-Thompson algorithm

The McNaughton-Yamada-Thompson algorithm converts any regular expression into an NFA that accepts the same language. It consists of two kinds of rules:

### Basic rules

- For the expression `ε`, construct the following NFA:

- For a non-`ε` symbol, construct the following NFA:

### Induction rule

Suppose the NFAs for the regular expressions s and t are `N(s)` and `N(t)` respectively. For a new regular expression r, `N(r)` is constructed as follows:

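#### Union

When `r = s|t`, `N(r)` gets a new start state with `ε` transitions to the start states of `N(s)` and `N(t)`, and `ε` transitions from their accepting states to a new accepting state: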

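#### Concatenation

When `r = st`, `N(r)` joins the accepting state of `N(s)` to the start state of `N(t)`. The start state of `N(s)` becomes the start state of `N(r)`, and the accepting state of `N(t)` becomes its accepting state: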

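#### Closure

When `r = s*`, `N(r)` gets a new start state and a new accepting state. The new start state has `ε` transitions to the start state of `N(s)` and to the new accepting state, and the accepting state of `N(s)` has `ε` transitions back to its own start state and on to the new accepting state: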

Other quantifiers such as `+` and `?` can be implemented in a similar way. That is all the automata background this article needs, so we can now start building the NFA.

## Implementation Based on NFA

Ken Thompson published the paper *Regular Expression Search Algorithm* in 1968. In it he describes a regular expression compiler, which later gave rise to `qed`, `ed`, `grep`, and `egrep`. The paper is fairly hard to follow. The implementation-a-regular-expression-engine article is also based on Thompson's paper, and to some extent this article borrows its implementation ideas.

### Add connectors

Before building the NFA, we need to preprocess the regular expression. Take `(a|b)*abb` as an example: regular expressions have no explicit concatenation operator, so we cannot tell which two NFAs should be concatenated.

So we first add an explicit concatenation operator to the expression, such as `·`. The rules for where to add it are listed below:

| Left symbol \ Right symbol | `*` | `(` | `)` | `\|` | Letter |
|---|---|---|---|---|---|
| `*` | ❌ | ✅ | ❌ | ❌ | ✅ |
| `(` | ❌ | ❌ | ❌ | ❌ | ❌ |
| `)` | ❌ | ✅ | ❌ | ❌ | ✅ |
| `\|` | ❌ | ❌ | ❌ | ❌ | ❌ |
| Letter | ❌ | ✅ | ❌ | ❌ | ✅ |

After adding the connectors, `(a|b)*abb` becomes `(a|b)*·a·b·b`.

The implementation is as follows:
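A minimal sketch of this step, assuming the rule table above. The function name `insertConcat` and the constant `CONCAT` are hypothetical, not names taken from the article's repository.

```typescript
// Explicit concatenation operator, as used in the text.
const CONCAT = "·";

// Insert CONCAT between two adjacent symbols whenever the left one can end
// an operand (a letter, `)` or `*`) and the right one can start an operand
// (a letter or `(`). This mirrors the ✅ cells of the rule table.
function insertConcat(regex: string): string {
  let out = "";
  for (let i = 0; i < regex.length; i++) {
    const left = regex[i];
    out += left;
    if (i + 1 === regex.length) break; // last symbol has no right neighbour
    const right = regex[i + 1];
    const leftEndsOperand = left !== "(" && left !== "|";
    const rightStartsOperand =
      right !== "*" && right !== ")" && right !== "|";
    if (leftEndsOperand && rightStartsOperand) out += CONCAT;
  }
  return out;
}
```

For the running example, `insertConcat("(a|b)*abb")` returns `(a|b)*·a·b·b`.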

### Infix expression to suffix expression

If you have ever written a calculator, you know that infix expressions make operator precedence awkward to handle, and the same is true here. We need to convert the expression from infix to postfix notation.

The procedure used in this article is:

- If you encounter a letter, output it.
- If you encounter a left parenthesis, push it onto the stack.
- If you encounter a right parenthesis, pop stack elements and output them until you reach a left parenthesis. The left parenthesis is popped but not output.
- If you encounter an operator, pop and output operators from the top of the stack for as long as their precedence is greater than or equal to that of the current operator, then push the current operator.
- When the end of the input is reached, pop and output all remaining stack elements.

Within the scope of this article, the operator precedence from lowest to highest is:

- Union `|`
- Concatenation `·`
- Closure `*`

The implementation is as follows:

For example, `(a|b)*·c` is converted to the postfix expression `ab|*c·`.
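The steps above can be sketched as a small shunting-yard style loop. This is an illustration under stated assumptions, not the article's own code: `toPostfix` and `PRECEDENCE` are hypothetical names, and the precedence values assume union binds loosest and closure tightest.

```typescript
// Assumed precedence, lowest to highest: union, concatenation, closure.
const PRECEDENCE: Record<string, number> = { "|": 0, "·": 1, "*": 2 };

function toPostfix(infix: string): string {
  let output = "";
  const stack: string[] = [];
  for (let i = 0; i < infix.length; i++) {
    const token = infix[i];
    if (token === "(") {
      stack.push(token);
    } else if (token === ")") {
      // Pop operators to the output until the matching "(".
      while (stack.length > 0 && stack[stack.length - 1] !== "(") {
        output += stack.pop() as string;
      }
      stack.pop(); // discard the "(" itself
    } else if (PRECEDENCE[token] !== undefined) {
      // Pop operators of greater or equal precedence, then push this one.
      while (stack.length > 0) {
        const top = stack[stack.length - 1];
        if (top === "(" || PRECEDENCE[top] < PRECEDENCE[token]) break;
        output += stack.pop() as string;
      }
      stack.push(token);
    } else {
      output += token; // a letter goes straight to the output
    }
  }
  // End of input: flush whatever operators remain.
  while (stack.length > 0) output += stack.pop() as string;
  return output;
}
```

Under these assumptions `toPostfix("(a|b)*·c")` yields `ab|*c·`, matching the example in the text.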

### Building NFA

Constructing an NFA from a postfix expression is much easier. Read the expression from left to right:

- If it is a letter s, build the basic NFA `N(s)` and push it onto the stack
- If it is `|`, pop two elements `N(t)` and `N(s)` from the stack (the right operand comes off first), construct `N(r)` with `r = s|t`, and push it onto the stack
- If it is `·`, pop two elements `N(t)` and `N(s)` from the stack, construct `N(r)` with `r = st`, and push it onto the stack
- If it is `*`, pop one element `N(s)` from the stack, construct `N(r)` with `r = s*`, and push it onto the stack

See automata.ts for the code
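As a rough illustration of the stack-based construction (not a copy of automata.ts), the sketch below represents each NFA fragment by its start and end state. All names here (`State`, `Fragment`, `buildNfa`, and the helper functions) are hypothetical. One simplification to note: concatenation links the two fragments with an `ε` edge instead of merging the two states.

```typescript
// A state has at most one labelled edge plus any number of ε-edges.
interface State {
  isEnd: boolean;
  symbol?: string;
  next?: State;
  epsilon: State[];
}
interface Fragment { start: State; end: State; }

function newState(isEnd: boolean): State {
  return { isEnd, epsilon: [] };
}

// Basic rule: start --symbol--> end.
function symbolNfa(symbol: string): Fragment {
  const start = newState(false);
  const end = newState(true);
  start.symbol = symbol;
  start.next = end;
  return { start, end };
}

// r = st: link the end of N(s) to the start of N(t) with an ε-edge.
function concat(s: Fragment, t: Fragment): Fragment {
  s.end.epsilon.push(t.start);
  s.end.isEnd = false;
  return { start: s.start, end: t.end };
}

// r = s|t: new start branches into both, both ends join a new end.
function union(s: Fragment, t: Fragment): Fragment {
  const start = newState(false);
  const end = newState(true);
  start.epsilon.push(s.start, t.start);
  s.end.epsilon.push(end);
  t.end.epsilon.push(end);
  s.end.isEnd = false;
  t.end.isEnd = false;
  return { start, end };
}

// r = s*: allow skipping N(s) entirely or looping back into it.
function closure(s: Fragment): Fragment {
  const start = newState(false);
  const end = newState(true);
  start.epsilon.push(s.start, end);
  s.end.epsilon.push(s.start, end);
  s.end.isEnd = false;
  return { start, end };
}

// Assumes a well-formed postfix expression over letters, `|`, `·` and `*`.
function buildNfa(postfix: string): Fragment {
  const stack: Fragment[] = [];
  for (let i = 0; i < postfix.length; i++) {
    const token = postfix[i];
    if (token === "*") {
      stack.push(closure(stack.pop() as Fragment));
    } else if (token === "|" || token === "·") {
      const t = stack.pop() as Fragment; // right operand is popped first
      const s = stack.pop() as Fragment;
      stack.push(token === "|" ? union(s, t) : concat(s, t));
    } else {
      stack.push(symbolNfa(token));
    }
  }
  return stack.pop() as Fragment;
}
```

Calling `buildNfa("ab|*c·")` returns a fragment whose `start` is the closure's entry state and whose `end` is the accepting state reached after reading `c`.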

### Building DFA

Once we have an NFA, we can convert it to a DFA using the **subset construction**. Each state of the DFA built this way is a set of states of the original NFA. Suppose the original NFA is:

Here we need the operation `ε-closure(s)`: the set of states that can be reached from NFA state s using only `ε` transitions. For example, `ε-closure(q0) = {q0, q1, q3}`. We use this set as the start state of the DFA, `A`.

So what transitions does state A have? The set A contains `q1`, which accepts `a`, and `q3`, which accepts `b`, so A accepts both `a` and `b`. When A reads `a` we arrive at `q2`, so `ε-closure(q2)` becomes **the state B that A reaches after reading `a`**. Similarly, after reading `b`, state A reaches state C, which is `ε-closure(q4)`. When state B reads `a`, it again arrives at `ε-closure(q2)`, so B stays in B on `a`. Likewise, state C returns to C on `b`. The resulting DFA is:

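The start state of the DFA is the set that contains the start state of the NFA. Accepting states work the same way: a DFA state is accepting if its set contains an accepting state of the NFA.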
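The walkthrough above can be sketched as follows. The NFA table is the `(a+|b+)` automaton from the ε-NFA section, reconstructed from the text because the figure is not reproduced here. The names `epsilonClosure`, `move`, and `toDfa` are hypothetical, and DFA states are named by their member states (for example `"0,1,3"` for A).

```typescript
interface Nfa {
  start: number;
  accept: number[];
  epsilon: Record<number, number[]>;
  delta: Record<number, Record<string, number[]>>;
}
interface Dfa {
  start: string;
  accept: string[];
  delta: Record<string, Record<string, string>>;
}

// Assumed table for (a+|b+), with q0..q4 written as 0..4.
const nfa: Nfa = {
  start: 0,
  accept: [2, 4],
  epsilon: { 0: [1, 3] },
  delta: { 1: { a: [2] }, 2: { a: [2] }, 3: { b: [4] }, 4: { b: [4] } },
};

// States reachable through ε-edges only, sorted so equal sets compare equal.
function epsilonClosure(states: number[]): number[] {
  const seen = states.slice();
  const todo = states.slice();
  while (todo.length > 0) {
    const s = todo.pop() as number;
    for (const t of nfa.epsilon[s] || []) {
      if (seen.indexOf(t) === -1) { seen.push(t); todo.push(t); }
    }
  }
  return seen.sort((x, y) => x - y);
}

// States reachable from `states` by consuming exactly one `symbol`.
function move(states: number[], symbol: string): number[] {
  const out: number[] = [];
  for (const s of states) {
    const row = nfa.delta[s];
    const targets = row && row[symbol] ? row[symbol] : [];
    for (const t of targets) if (out.indexOf(t) === -1) out.push(t);
  }
  return out;
}

function toDfa(alphabet: string[]): Dfa {
  const key = (set: number[]): string => set.join(",");
  const startSet = epsilonClosure([nfa.start]);
  const dfa: Dfa = { start: key(startSet), accept: [], delta: {} };
  const todo: number[][] = [startSet];
  while (todo.length > 0) {
    const set = todo.pop() as number[];
    const name = key(set);
    if (dfa.delta[name] !== undefined) continue; // already expanded
    dfa.delta[name] = {};
    if (set.some((s) => nfa.accept.indexOf(s) !== -1)) dfa.accept.push(name);
    for (const symbol of alphabet) {
      const target = epsilonClosure(move(set, symbol));
      if (target.length === 0) continue; // no transition on this symbol
      dfa.delta[name][symbol] = key(target);
      todo.push(target);
    }
  }
  return dfa;
}
```

With the alphabet `["a", "b"]`, the resulting DFA has three states, `"0,1,3"`, `"2"`, and `"4"`, which correspond to A, B, and C in the walkthrough.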

### Search

In fact, we do not need to construct the DFA explicitly. We can apply the same idea while traversing the NFA, which is essentially a graph search. The implementation is as follows:

The code for `getClosure` is as follows:
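A minimal sketch of what such a search might look like, assuming each state carries at most one labelled edge plus a list of `ε` edges. The `State` shape and the body of `search` are assumptions made for illustration, and the small `(a+|b+)` automaton is wired by hand so the example stays self-contained.

```typescript
interface State {
  isEnd: boolean;
  symbol?: string;
  next?: State;
  epsilon: State[];
}

// Collect `state` and everything reachable from it through ε-edges only.
// `closure` doubles as the visited list, so ε-cycles terminate.
function getClosure(state: State, closure: State[] = []): State[] {
  if (closure.indexOf(state) !== -1) return closure;
  closure.push(state);
  for (const next of state.epsilon) getClosure(next, closure);
  return closure;
}

// Advance a whole set of NFA states one input symbol at a time.
function search(start: State, input: string): boolean {
  let current = getClosure(start);
  for (let i = 0; i < input.length; i++) {
    const reached: State[] = [];
    for (const state of current) {
      if (state.symbol === input[i] && state.next !== undefined) {
        getClosure(state.next, reached); // accumulate, skipping duplicates
      }
    }
    current = reached;
  }
  return current.some((s) => s.isEnd);
}

// Hand-wired NFA for (a+|b+): q0 -ε-> q1 -a-> q2 (loops on a),
//                             q0 -ε-> q3 -b-> q4 (loops on b).
const q2: State = { isEnd: true, symbol: "a", epsilon: [] };
q2.next = q2;
const q4: State = { isEnd: true, symbol: "b", epsilon: [] };
q4.next = q4;
const q1: State = { isEnd: false, symbol: "a", next: q2, epsilon: [] };
const q3: State = { isEnd: false, symbol: "b", next: q4, epsilon: [] };
const q0: State = { isEnd: false, epsilon: [q1, q3] };
```

Here `search(q0, "aa")` returns `true` and `search(q0, "ab")` returns `false`. Each step computes one DFA state on the fly, which is why the explicit subset construction can be skipped.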

## Summary

Overall, we went through the following steps to implement a simple NFA-based regular expression engine:

- Add connectors
- Convert to a suffix expression
- Building NFA
- Determine whether NFA accepts input strings

See GitHub for the complete code