Implementing a simple regular expression engine

Time:2019-9-28

Think back to the first time you saw a regular expression. Your eyes were probably filled with something like $7^(0^=]W-\^*d+ and your heart refused it. Yet in daily work since then, regular expressions have come up more and more often, and they have gradually become a very common tool.

To master a tool, understanding how it works is as important as knowing how to use it. Regular expression engines fall roughly into two categories: DFA (deterministic finite automaton) engines and NFA (nondeterministic finite automaton) engines.

Tools that use an NFA include .NET, PHP, Ruby, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even some versions of egrep and awk. Tools that use a DFA mainly include egrep, awk, lex, and flex. Some systems use a hybrid approach, choosing the appropriate engine for the task at hand (or even applying different engines to different parts of the same expression) to strike the best balance between features and speed. (Jeffrey E.F. Friedl, Mastering Regular Expressions)

Both DFAs and NFAs are finite automata, and they have much in common. An automaton is essentially a graph, similar to a state transition diagram. (Note: this article does not define automata rigorously. For an in-depth treatment, see Introduction to Automata Theory, Languages, and Computation.)

NFA

An NFA consists of the following parts:

  • An initial state
  • One or more accepting (final) states
  • A state transition function

[Figure: an NFA with two states, q0 and q1, and an arrow labeled a from q0 to q1]

The figure above shows an NFA with two states, q0 and q1. The initial state is q0 (it has no predecessor state) and the accepting state is q1 (drawn with a double circle). The arrow labeled a from q0 to q1 means that when the NFA is in state q0 and reads the input a, it moves to state q1.

When matching a string, we start the NFA in its initial state and follow transitions according to the input. If the NFA is in an accepting state once the input is consumed, the string is accepted. If an input symbol has no corresponding transition, or the NFA is not in an accepting state once the input is consumed, the string is rejected.

As you can see, this NFA accepts the string a and nothing else.

So why is it called nondeterministic? Because for the same state and the same input symbol, an NFA may move to several different states. For example:

[Figure: an NFA for (a|b)*abb with states q0 to q3]

In state q0, on input a, the NFA can either stay in q0 or move to q1. So this NFA accepts abb (q0 -> q1 -> q2 -> q3) and also aabb (q0 -> q0 -> q1 -> q2 -> q3). It likewise accepts ababb, aaabbbabababb, and so on. You may have noticed that the regular expression this NFA represents is exactly (a|b)*abb.
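To make the acceptance procedure concrete, here is a minimal sketch in TypeScript that tracks every state the NFA could be in after each input symbol. The transition table is reconstructed from the paths described above; the numeric state ids and the names nfaTransitions and nfaAccepts are assumptions made for illustration.

```typescript
// Transition table for the (a|b)*abb NFA, keyed by "state,symbol".
// Each entry lists every state reachable on that symbol.
const nfaTransitions: Record<string, number[]> = {
  "0,a": [0, 1], // nondeterministic choice: stay in q0 or move to q1
  "0,b": [0],
  "1,b": [2],
  "2,b": [3],
};
const nfaStart = 0;
const nfaAccepting = [3];

// Follow all possible transitions at once by keeping a set of current states.
function nfaAccepts(input: string): boolean {
  let current: number[] = [nfaStart];
  for (let i = 0; i < input.length; i++) {
    const next: number[] = [];
    for (const state of current) {
      const targets = nfaTransitions[`${state},${input.charAt(i)}`] || [];
      for (const target of targets) {
        if (next.indexOf(target) === -1) next.push(target);
      }
    }
    current = next;
  }
  // Accept if any of the possible states is an accepting state.
  return current.some((state) => nfaAccepting.indexOf(state) !== -1);
}

console.log(nfaAccepts("abb"));  // true
console.log(nfaAccepts("aabb")); // true
console.log(nfaAccepts("ab"));   // false
```

Keeping a set of states avoids backtracking: the input is read once, and the set never grows beyond the number of NFA states.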

ε-NFA

Besides being able to reach several states, an NFA can also take transitions on the empty symbol ε. For example:

[Figure: an ε-NFA for (a+|b+) with states q0 to q4]

This is an NFA that accepts (a+|b+), because there is a path q0 -ε-> q1 -a-> q2 -a-> q2. ε represents the empty string, which disappears under concatenation, so this path accepts aa.

You may wonder why q0 is not connected directly to q2 by a and to q4 by b. The reason is that ε mainly serves as glue between automata, which will become apparent later.

DFA

With nondeterministic finite automata covered, deterministic finite automata are easy to understand. A DFA differs from an NFA in that:

  • It has no ε transitions
  • For a given state and input, there is only one transition

So a DFA is much simpler than an NFA. Why not use a DFA directly? Because for a language described by a regular expression, an NFA is usually much easier to construct than a DFA. The NFA for (a|b)*abb mentioned above is easy to construct and understand:

[Figure: NFA for (a|b)*abb]

But constructing the corresponding DFA directly is not so easy. Try it yourself first; the result probably looks like this:

[Figure: DFA for (a|b)*abb]
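For comparison, a DFA can be run with a plain table lookup. The sketch below uses one possible four-state DFA for (a|b)*abb; the state names A to D and the table itself are assumptions, since the original figure is not reproduced here.

```typescript
type DfaState = "A" | "B" | "C" | "D";

// One possible DFA for (a|b)*abb (assumed for illustration).
// A: no progress, B: seen "a", C: seen "ab", D: seen "abb" (accepting).
const dfaTable: Record<DfaState, Record<string, DfaState>> = {
  A: { a: "B", b: "A" },
  B: { a: "B", b: "C" },
  C: { a: "B", b: "D" },
  D: { a: "B", b: "A" },
};

// Exactly one current state at any time, so matching is a single loop.
function dfaAccepts(input: string): boolean {
  let state: DfaState = "A";
  for (let i = 0; i < input.length; i++) {
    const next: DfaState | undefined = dfaTable[state][input.charAt(i)];
    if (next === undefined) return false; // symbol outside the alphabet
    state = next;
  }
  return state === "D";
}

console.log(dfaAccepts("ababb")); // true
console.log(dfaAccepts("abba"));  // false
```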

So an NFA is easy to construct, but its nondeterminism makes the state transition logic hard to implement in a program. A DFA is hard to construct, but its determinism makes the state transition logic easy to implement. What should we do?

The magic is that every NFA has an equivalent DFA. So we usually build an NFA from the regular expression, convert it into the equivalent DFA, and then use that for matching.

McNaughton-Yamada-Thompson algorithm

The McNaughton-Yamada-Thompson algorithm converts any regular expression into an NFA that accepts the same language. It consists of two kinds of rules:

Basic rules

  1. For the expression ε, construct the following NFA:
    [Figure: a start state with an ε transition to an accepting state]
  2. For a symbol a other than ε, construct the following NFA:
    [Figure: a start state with a transition on a to an accepting state]

Induction rules

Suppose the NFAs for regular expressions s and t are N(s) and N(t), respectively. For a new regular expression r, N(r) is constructed as follows:

Union

When r = s|t, N(r) is

[Figure: N(r) for r = s|t]

Concatenation

When r = st, N(r) is

[Figure: N(r) for r = st]

Closure

When r = s*, N(r) is

[Figure: N(r) for r = s*]
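The basic and induction rules can be sketched as small functions that combine NFA fragments. This is an illustrative sketch, not the article's automata.ts: the NfaState and NfaFragment shapes and all function names are assumptions, and concatenation is done here by linking N(s) to N(t) with an ε transition, a common variant of merging the two states.

```typescript
interface NfaState {
  isEnd: boolean;
  symbolTransitions: { [symbol: string]: NfaState };
  epsilonTransitions: NfaState[];
}

// Every fragment has exactly one start state and one accepting state.
interface NfaFragment {
  start: NfaState;
  end: NfaState;
}

function createState(isEnd: boolean): NfaState {
  return { isEnd, symbolTransitions: {}, epsilonTransitions: [] };
}

// Basic rule 1: the expression ε.
function fromEpsilon(): NfaFragment {
  const start = createState(false);
  const end = createState(true);
  start.epsilonTransitions.push(end);
  return { start, end };
}

// Basic rule 2: a single non-ε symbol.
function fromSymbol(symbol: string): NfaFragment {
  const start = createState(false);
  const end = createState(true);
  start.symbolTransitions[symbol] = end;
  return { start, end };
}

// r = s|t: a new start branches into both fragments, which rejoin at a new end.
function union(s: NfaFragment, t: NfaFragment): NfaFragment {
  const start = createState(false);
  const end = createState(true);
  start.epsilonTransitions.push(s.start, t.start);
  s.end.isEnd = false;
  t.end.isEnd = false;
  s.end.epsilonTransitions.push(end);
  t.end.epsilonTransitions.push(end);
  return { start, end };
}

// r = st: the end of N(s) leads into the start of N(t).
function concat(s: NfaFragment, t: NfaFragment): NfaFragment {
  s.end.isEnd = false;
  s.end.epsilonTransitions.push(t.start);
  return { start: s.start, end: t.end };
}

// r = s*: either skip N(s) entirely or loop through it any number of times.
function closure(s: NfaFragment): NfaFragment {
  const start = createState(false);
  const end = createState(true);
  start.epsilonTransitions.push(s.start, end);
  s.end.isEnd = false;
  s.end.epsilonTransitions.push(s.start, end);
  return { start, end };
}
```

For example, closure(union(fromSymbol("a"), fromSymbol("b"))) builds a fragment for (a|b)*. The ε transitions are what allow fragments to be glued together without touching their internals.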

Other quantifiers such as + and ? can be implemented similarly. That covers the automata background needed for this article; next we can start building the NFA.

Implementation Based on NFA

In 1968 Ken Thompson published the paper Regular Expression Search Algorithm, in which he described a regular expression compiler. It gave rise to the later qed, ed, grep, and egrep. The paper is relatively hard to follow. The implementation-a-regular-expression-engine article is also based on Thompson's paper, and this article borrows some of its implementation ideas.

Add connectors

Before building the NFA, we need to preprocess the regular expression. Take (a|b)*abb as an example: regular expressions have no explicit concatenation symbol, so we cannot tell which two NFAs should be connected.

So we first add an explicit connector to the expression, such as ·. The rules for adding it can be listed in a table:

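Left symbol \ Right symbol    *      (      )      |      Letter
*                             no     yes    no     no     yes
(                             no     no     no     no     no
)                             no     yes    no     no     yes
|                             no     no     no     no     no
Letter                        no     yes    no     no     yes

(yes = insert a connector between the left and right symbols)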

After the connectors are added, (a|b)*abb becomes (a|b)*·a·b·b. The implementation is as follows:

[Code listing: adding connectors]
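A minimal sketch of this step, assuming the expression only contains letters and the symbols ( ) | *. The function name insertConcat is hypothetical.

```typescript
const CONCAT = "·";

// Insert an explicit connector between every pair of adjacent symbols
// where a concatenation is implied.
function insertConcat(exp: string): string {
  let output = "";
  for (let i = 0; i < exp.length; i++) {
    const left = exp.charAt(i);
    output += left;
    if (i + 1 >= exp.length) break;
    const right = exp.charAt(i + 1);
    // Nothing can be concatenated right after "(" or "|" ...
    const leftAllows = left !== "(" && left !== "|";
    // ... and nothing right before "*", ")" or "|".
    const rightAllows = right !== "*" && right !== ")" && right !== "|";
    if (leftAllows && rightAllows) output += CONCAT;
  }
  return output;
}

console.log(insertConcat("(a|b)*abb")); // (a|b)*·a·b·b
```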

Infix expression to suffix expression

If you have ever written a calculator, you know that infix expressions make operator priority awkward to analyze. The same is true here, so we need to convert the expression from infix form to suffix (postfix) form.

In this article, the procedure is as follows:

  1. If you encounter a letter, output it.
  2. If you encounter a left parenthesis, push it onto the stack.
  3. If you encounter a right parenthesis, pop stack elements and output them until you reach a left parenthesis. The left parenthesis is popped but not output.
  4. If you encounter an operator, pop and output, in turn, every operator on top of the stack whose priority is greater than or equal to it, then push it onto the stack.
  5. When you reach the end of the input, pop and output all remaining elements in the stack.

Within the scope of this article, the operators in order of priority from lowest to highest are

  • Union |
  • Connector ·
  • Closure *

The implementation is as follows:

[Code listing: infix-to-suffix conversion]

For example, (a|b)*·c converts to the suffix expression ab|*c·
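The conversion can be sketched as follows. This sketch assumes that closure binds tightest and union loosest (| < · < *), and that connectors have already been inserted; the names priority and toPostfix are hypothetical.

```typescript
// Operator priority; anything that is not an operator gets -1.
function priority(op: string): number {
  if (op === "|") return 0;
  if (op === "·") return 1;
  if (op === "*") return 2;
  return -1;
}

function toPostfix(exp: string): string {
  let output = "";
  const stack: string[] = [];
  for (let i = 0; i < exp.length; i++) {
    const token = exp.charAt(i);
    if (token === "(") {
      stack.push(token);
    } else if (token === ")") {
      // Output everything back to the matching "(".
      while (stack.length > 0 && stack[stack.length - 1] !== "(") {
        output += stack.pop() as string;
      }
      stack.pop(); // discard the "(" itself
    } else if (priority(token) >= 0) {
      // Pop operators of greater or equal priority; "(" has priority -1
      // and therefore always stops the loop.
      while (stack.length > 0) {
        const top = stack[stack.length - 1] as string;
        if (priority(top) < priority(token)) break;
        output += stack.pop() as string;
      }
      stack.push(token);
    } else {
      output += token; // a letter
    }
  }
  while (stack.length > 0) output += stack.pop() as string;
  return output;
}

console.log(toPostfix("(a|b)*·c"));     // ab|*c·
console.log(toPostfix("(a|b)*·a·b·b")); // ab|*a·b·b·
```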

Building NFA

Building an NFA from a suffix expression is much easier. Read the expression from left to right:

  • If it is a letter s, build the basic NFA N(s) and push it onto the stack
  • If it is |, pop two elements N(t) and N(s) from the stack, build N(r) with r = s|t, and push it onto the stack
  • If it is ·, pop two elements N(t) and N(s) from the stack, build N(r) with r = st, and push it onto the stack
  • If it is *, pop one element N(s) from the stack, build N(r) with r = s*, and push it onto the stack

See automata.ts for the code
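The stack-based procedure above might look like the sketch below. It is not the contents of automata.ts: to stay short it stores the NFA as a flat edge list with numeric state ids, and the names Edge, Frag, BuiltNfa, and buildNfa are assumptions. A null symbol stands for an ε transition, and concatenation links the two fragments with an ε edge.

```typescript
interface Edge { from: number; symbol: string | null; to: number; }
interface Frag { start: number; end: number; }
interface BuiltNfa { start: number; end: number; edges: Edge[]; stateCount: number; }

function buildNfa(postfix: string): BuiltNfa {
  const edges: Edge[] = [];
  const stack: Frag[] = [];
  let stateCount = 0;
  const newState = (): number => stateCount++;
  const eps = (from: number, to: number): Edge => ({ from, symbol: null, to });
  const pop = (): Frag => {
    const frag = stack.pop();
    if (frag === undefined) throw new Error("malformed suffix expression");
    return frag;
  };

  for (let i = 0; i < postfix.length; i++) {
    const token = postfix.charAt(i);
    if (token === "*") {
      const s = pop();
      const start = newState();
      const end = newState();
      edges.push(eps(start, s.start), eps(start, end), eps(s.end, s.start), eps(s.end, end));
      stack.push({ start, end });
    } else if (token === "·") {
      const t = pop(); // the right operand is on top of the stack
      const s = pop();
      edges.push(eps(s.end, t.start));
      stack.push({ start: s.start, end: t.end });
    } else if (token === "|") {
      const t = pop();
      const s = pop();
      const start = newState();
      const end = newState();
      edges.push(eps(start, s.start), eps(start, t.start), eps(s.end, end), eps(t.end, end));
      stack.push({ start, end });
    } else {
      // A letter: the basic two-state NFA.
      const start = newState();
      const end = newState();
      edges.push({ from: start, symbol: token, to: end });
      stack.push({ start, end });
    }
  }

  const result = pop();
  if (stack.length !== 0) throw new Error("malformed suffix expression");
  return { start: result.start, end: result.end, edges, stateCount };
}

const built = buildNfa("ab|*c·");
console.log(built.stateCount, built.edges.length); // 10 12
```

Because every operator only adds states and ε edges around existing fragments, the number of states grows linearly with the length of the expression.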

Building DFA

With the NFA in hand, we can convert it to a DFA using the subset construction. Each state of the DFA built from an NFA is a set of states of the original NFA. For example, suppose the original NFA is

[Figure: the ε-NFA for (a+|b+) with states q0 to q4]

Here we need the operation ε-closure(s): the set of states that can be reached from NFA state s using only ε transitions, including s itself. For example, ε-closure(q0) = {q0, q1, q3}. We take this set as the start state A of the DFA.

So what transitions does state A have? The set A contains q1, which can accept a, and q3, which can accept b, so A can accept both a and b. When A accepts a we reach q2, so ε-closure(q2) becomes the state B that A reaches on a. Similarly, on b state A reaches ε-closure(q4), which is state C.

State B can accept a, which again leads to ε-closure(q2), so on a state B stays in state B. Similarly, on b state C returns to state C. The resulting DFA is

[Figure: DFA with states A, B, and C]

The start state of the DFA contains the start state of the NFA. Likewise, a DFA state is accepting if it contains an accepting state of the NFA.
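The walkthrough above can be reproduced with a short subset construction sketch. The transition tables encode the (a+|b+) ε-NFA as inferred from the text (q0 has ε moves to q1 and q3; q2 and q4 are accepting); the names epsilonMoves, symbolMoves, move, and subsetConstruction are assumptions. DFA states are named by the NFA states they contain.

```typescript
const epsilonMoves: { [state: number]: number[] } = { 0: [1, 3] };
const symbolMoves: { [key: string]: number[] } = {
  "1,a": [2], "2,a": [2], "3,b": [4], "4,b": [4],
};
const nfaAcceptStates = [2, 4];
const alphabet = ["a", "b"];

// All states reachable from the given states using only ε transitions.
function epsilonClosure(states: number[]): number[] {
  const closure = states.slice();
  const pending = states.slice();
  while (pending.length > 0) {
    const s = pending.pop() as number;
    for (const next of epsilonMoves[s] || []) {
      if (closure.indexOf(next) === -1) { closure.push(next); pending.push(next); }
    }
  }
  return closure.sort((x, y) => x - y);
}

// States reachable from the given states on one input symbol.
function move(states: number[], symbol: string): number[] {
  const result: number[] = [];
  for (const s of states) {
    for (const t of symbolMoves[`${s},${symbol}`] || []) {
      if (result.indexOf(t) === -1) result.push(t);
    }
  }
  return result;
}

interface Dfa {
  start: string;
  accepting: string[];
  transitions: { [state: string]: { [symbol: string]: string } };
}

function subsetConstruction(nfaStart: number): Dfa {
  const key = (set: number[]): string => set.join(",");
  const startSet = epsilonClosure([nfaStart]);
  const dfa: Dfa = { start: key(startSet), accepting: [], transitions: {} };
  const worklist: number[][] = [startSet];
  while (worklist.length > 0) {
    const set = worklist.pop() as number[];
    const name = key(set);
    if (dfa.transitions[name] !== undefined) continue; // already processed
    const row: { [symbol: string]: string } = {};
    dfa.transitions[name] = row;
    if (set.some((s) => nfaAcceptStates.indexOf(s) !== -1)) dfa.accepting.push(name);
    for (const symbol of alphabet) {
      const target = epsilonClosure(move(set, symbol));
      if (target.length === 0) continue; // no transition on this symbol
      row[symbol] = key(target);
      worklist.push(target);
    }
  }
  return dfa;
}

const dfa = subsetConstruction(0);
console.log(dfa.start);                       // 0,1,3
console.log(JSON.stringify(dfa.transitions)); // {"0,1,3":{"a":"2","b":"4"},"4":{"b":"4"},"2":{"a":"2"}}
```

Here "0,1,3" corresponds to state A, "2" to state B, and "4" to state C.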

Search

In fact, we do not need to construct the DFA explicitly. We can apply the same idea while traversing the NFA, which is essentially a graph search. The implementation is as follows:

[Code listing: search]

The code for getClosure is as follows:

[Code listing: getClosure]
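A minimal sketch of this search, with a getClosure helper. Apart from the name getClosure, which the article mentions, everything here is an assumption: the SearchState shape, the makeState and search names, and the hand-built (a+|b+) ε-NFA used as a test subject.

```typescript
interface SearchState {
  isEnd: boolean;
  transitions: { [symbol: string]: SearchState };
  epsilonTransitions: SearchState[];
}

function makeState(isEnd: boolean): SearchState {
  return { isEnd, transitions: {}, epsilonTransitions: [] };
}

// All states reachable from `state` through ε transitions only, itself included.
function getClosure(state: SearchState): SearchState[] {
  const visited: SearchState[] = [state];
  const stack: SearchState[] = [state];
  while (stack.length > 0) {
    const current = stack.pop() as SearchState;
    for (const next of current.epsilonTransitions) {
      if (visited.indexOf(next) === -1) {
        visited.push(next);
        stack.push(next);
      }
    }
  }
  return visited;
}

// Walk the NFA while keeping the set of states it could be in; each such set
// plays the role of one DFA state, computed on demand.
function search(start: SearchState, input: string): boolean {
  let current = getClosure(start);
  for (let i = 0; i < input.length; i++) {
    const symbol = input.charAt(i);
    const next: SearchState[] = [];
    for (const state of current) {
      const target = state.transitions[symbol];
      if (target === undefined) continue;
      for (const reachable of getClosure(target)) {
        if (next.indexOf(reachable) === -1) next.push(reachable);
      }
    }
    current = next;
  }
  return current.some((state) => state.isEnd);
}

// Hand-built ε-NFA for (a+|b+).
const q0 = makeState(false);
const q1 = makeState(false);
const q2 = makeState(true);
const q3 = makeState(false);
const q4 = makeState(true);
q0.epsilonTransitions.push(q1, q3);
q1.transitions["a"] = q2;
q2.transitions["a"] = q2;
q3.transitions["b"] = q4;
q4.transitions["b"] = q4;

console.log(search(q0, "aa"));  // true
console.log(search(q0, "bbb")); // true
console.log(search(q0, "ab"));  // false
```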

Summary

Overall, implementing a simple NFA-based regular expression engine took the following steps:

  1. Add connectors
  2. Convert to a suffix expression
  3. Building NFA
  4. Determine whether NFA accepts input strings

See GitHub for the complete code
