Elisp 05: text matching

Time: 2021-6-9

Previous chapter: Iteration

In the second chapter, Text analysis, in order to determine whether a line of text begins with ```, I defined a function:

(defun text-match (source target)
  ;; n is bound locally so that it does not leak into the global scope.
  (let ((n (length target)))
    (if (< (length source) n)
        nil
      (string= (substring source 0 n) target))))

In fact, elisp provides a more powerful text matching function. How strong is it? Powerful enough to support regular expression matching.

Regular expressions are like the portraits on the wanted notices that officials in ancient times posted beside the city gate when hunting bandits. The more distinctive a criminal’s appearance, the more useful the portrait. I also suspect that when modern machine learning programs recognize faces in photos, the principle is much like posting wanted notices beside the city gate.

How do we draw a portrait of a piece of text? Specifically, how do we draw a portrait of text that begins with ```? It’s very simple. Just draw it like this:

^```

^ means “at the beginning”, so ^ followed by ``` means “the beginning is ```”.

Elisp’s string-match function can use a string object containing a regular expression to match against another string object, for example:

(string-match "^```" "```lisp")

Note that for the sake of brevity, from now on, string objects (or instances of the string type) and list objects (or instances of the list type) are simply referred to as strings and lists unless stated otherwise. There should be no misunderstanding.

In the above example, because the string "```lisp" begins with ```, the evaluation result of string-match is not nil; otherwise it would be nil. For the elisp interpreter, anything that is not nil counts as true. That is to say, if a value is neither nil nor '(), then no matter what it is, elisp treats it as equivalent to t. Remember what I said before: nil and '() are equivalent. Keep that in mind. In fact, the evaluation result of the above example is 0, but 0 is neither nil nor '().

Why is the evaluation result of the above example 0? Because string-match found the part that matches the regular expression at the beginning of the string. The beginning of a string is the index (or subscript) of its first character, whose value is 0. Let’s take another example:

(setq r "```")
(setq x "foo```bar")
(string-match r x)

Here, string-match determines whether the string x contains text that matches the regular expression r. The result of evaluation is the index of the first character of the matched text. Because the index of the first character of ``` in x is 3, the result of string-match in the above example is 3. The implication is that text conforming to the regular expression r begins at the fourth character position of x.
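The point that 0 is not nil matters as soon as string-match is used as a condition. The following is only a minimal sketch; the function name check-fence is made up here for illustration and does not come from the parser in Chapter 2.

```elisp
;; Hypothetical helper: answer "yes" if LINE begins with ```, else "no".
(defun check-fence (line)
  ;; string-match returns an index (possibly 0) on success and nil on failure.
  ;; Since 0 is not nil, `if' takes the first branch for a match at index 0.
  (if (string-match "^```" line)
      "yes"
    "no"))

(check-fence "```lisp")          ; => "yes" (the match is at index 0)
(check-fence "foo```bar")        ; => "no"  (``` is not at the beginning)
(string-match "```" "foo```bar") ; => 3
(string-match "```" "foobar")    ; => nil
```

In other words, the result of string-match can be fed to if directly, with no need to compare it against 0 or t.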

Here’s another example:

(setq r "```$")
(setq x "foo```")
(string-match r x)

It determines whether x ends with ```. In regular expressions, $ represents the end of the text.

Can you guess what ^```$ means? If you guess right there is no reward, but you can at least be sure that you are not stupid.

Now the text-match function in the parser program from Chapter 2 can be replaced with string-match. With that, the knowledge related to the parser has been covered. The problem it solved is no longer a problem. I need to find new problems.
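Before moving on, here is a hedged sketch of what that replacement might look like. The name text-match-re is hypothetical, and two things in it go beyond what this chapter covers: regexp-quote, which escapes characters such as . and * so the target is matched literally, and \\`, which anchors at the very start of the string (unlike ^, which also matches after a newline).

```elisp
;; Hypothetical replacement for text-match; the name is made up for this sketch.
(defun text-match-re (source target)
  "Return non-nil if SOURCE begins with TARGET."
  ;; Build the regular expression: start of string, then TARGET taken literally.
  (string-match (concat "\\`" (regexp-quote target)) source))

(text-match-re "```lisp" "```") ; => 0, which counts as true
(text-match-re "foo```" "```")  ; => nil
(text-match-re "abc" "a.")      ; => nil, the . is not a wildcard here
```

One difference from the original text-match: on success this returns 0 rather than t, which is harmless as long as the result is only used as a condition.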

The new problem is still in the foo.md file. Only part of it is given below:

# Hello world!

The following is the content of the C language Hello world program source file hello.c:

```
#include <stdio.h>
... ... ...
```

... ... ...

Among them, # Hello world! is the title of a document section. The regular expression ^# can match it, but it also matches the line of text beginning with # inside the verbatim environment (the text between the two ``` lines). Now, does everyone understand why, from the second chapter until now, I have been so obsessed with lines of text that begin with ```? Only by identifying the verbatim environments and ignoring them can we reliably match the section titles of the document. As for how to ignore the verbatim environments, let’s put that aside for now. Just remember that there is a new problem, and I don’t know how many chapters will be needed to solve it.
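The false match is easy to see by evaluating string-match on the two lines in question. This is only a quick check, using lines copied from the foo.md excerpt:

```elisp
;; Both lines come from the foo.md excerpt.
(string-match "^#" "# Hello world!")     ; => 0, the section title
(string-match "^#" "#include <stdio.h>") ; => 0, a line inside the ``` block
```

Both results are 0, so ^# alone cannot tell the title from the C source line.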

Assuming the verbatim environments are ignored, ^# can match document section titles, but it is too coarse, because a section title can actually look like any of the following:

# Title
#    Title
#  Title   

That is, there should be at least one space between # and the title name. In addition, spaces are also allowed to appear after the title name, for example when they are typed by accident. Therefore, a more precise regular expression for matching document section titles is

^#[[:blank:]]+.+[[:blank:]]*$

Among them, [[:blank:]] matches blank characters, which include spaces and tabs. + indicates that the item before it occurs one or more times. * indicates that the item before it may be absent, or may occur one or more times. . can match any character except a newline. Therefore [[:blank:]]+ matches one or more spaces, .+ matches one or more characters, and [[:blank:]]* matches zero, one, or more spaces. Using this regular expression, you can match document section titles more accurately. For example:

(setq x "#                    Hello world!             ")
(setq r "^#[[:blank:]]+.+[[:blank:]]*$")
(string-match r x)

The evaluation result of string-match is 0, which is correct. Thinking about it now, if I had to define a text matching function with similar capabilities myself, I could not even estimate the workload, given my current elisp programming skills and my understanding of NFAs.
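To see what the refined regular expression accepts and rejects, it can be tried on a few lines at once. This is a minimal sketch; the variable names title-re, samples, and results, as well as the first three sample lines, are made up for illustration.

```elisp
(setq title-re "^#[[:blank:]]+.+[[:blank:]]*$")

;; Hypothetical sample lines; only the last one appears in foo.md.
(setq samples '("# Title" "#   Title   " "#Title" "#include <stdio.h>"))

;; Collect the result of string-match for each sample line.
(setq results
      (mapcar (lambda (line) (string-match title-re line)) samples))
;; results => (0 0 nil nil)
```

The last two lines are rejected because no blank follows the #. A line inside a verbatim environment that does have a blank after the # would still match, so the verbatim environments still have to be ignored.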

Regular expressions are used not only for matching, but also for text capture. For example, to capture the section title name Hello world! from the string x in the example above, the corresponding regular expression should be written as

(setq r "^#[[:blank:]]+\\(.+\\)[[:blank:]]*$")

That is, in the regular expression, \\( and \\) enclose the segment .+ that corresponds to the text to be captured. When string-match uses this regular expression for matching, it saves the text segment matched by the .+ enclosed in \\( and \\), and that text can then be extracted with (match-string 1 x). For example:

(setq x "#                    Hello world!             ")
(setq r "^#[[:blank:]]+\\(.+\\)[[:blank:]]*$")
(string-match r x)
(princ\' (match-string 1 x))

The above program outputs Hello world!
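One detail is worth checking here. Because .+ is greedy, it swallows the trailing spaces before [[:blank:]]* gets a chance to match them, so the captured text is really Hello world! followed by those invisible spaces. A possible remedy, not used in this chapter, is the non-greedy form .+? that Emacs regular expressions also support. The sketch below compares the two; the names y, greedy, lazy, greedy-title, and lazy-title are made up, and y is a shorter stand-in for x.

```elisp
;; A shorter stand-in for x: 4 spaces before the title name, 3 after it.
(setq y (concat "#" (make-string 4 ?\s) "Hello world!" (make-string 3 ?\s)))

(setq greedy "^#[[:blank:]]+\\(.+\\)[[:blank:]]*$")
(setq lazy   "^#[[:blank:]]+\\(.+?\\)[[:blank:]]*$")

(string-match greedy y)
(setq greedy-title (match-string 1 y)) ; => "Hello world!   "

(string-match lazy y)
(setq lazy-title (match-string 1 y))   ; => "Hello world!"
```

With .+? the capture stops as early as it can while still letting the rest of the regular expression match, so the trailing spaces are left to [[:blank:]]*.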

The first parameter of match-string is the serial number of a \\(...\\) group in the regular expression, counted from 1 in the order in which the groups open. Because a regular expression can contain many \\(...\\) groups, this parameter specifies which group’s captured text to fetch.

The following program uses two regular expression captures:

(setq x "############                    Hello world!             ")
(setq r "^\\(#+\\)[[:blank:]]+\\(.+\\)[[:blank:]]*$")
(string-match r x)
(princ\' (match-string 1 x))
(princ\' (match-string 2 x))

Output:

############
Hello world!
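The two captures can be wrapped into a small function that returns both pieces at once. This is only a sketch under my own assumptions: the name parse-title and the choice to return the number of # characters rather than the # characters themselves are not part of the parser from Chapter 2.

```elisp
;; Hypothetical helper: return (COUNT NAME) for a title line, or nil otherwise.
(defun parse-title (line)
  (when (string-match "^\\(#+\\)[[:blank:]]+\\(.+\\)[[:blank:]]*$" line)
    (list (length (match-string 1 line)) ; number of # characters
          (match-string 2 line))))       ; title name

(parse-title "##   Hello world!")   ; => (2 "Hello world!")
(parse-title "#include <stdio.h>")  ; => nil
```

Note that match-string must be given the same string that was passed to string-match, here line, because the match only records positions within that string.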

The above is just some basic knowledge of regular expressions, because the main question here is how to use regular expressions to match text in elisp programs. As for more knowledge of regular expressions themselves, we can cram at the last minute when we run into practical problems [1].

Next chapter: Buffer transformation


  1. See https://www.gnu.org/software/…