Elisp 02: text analysis

Time: 2021-6-21

Previous chapter: Hello world!

This chapter introduces Elisp's variables, lists, symbols, recursive functions, and some more convenient functions for moving the insertion point. These topics are developed step by step while solving a practical problem.

Problem

Suppose there is a document foo.md, which is as follows:

# Hello world!

The following is the content of the C language Hello world program source file hello.c:

```
#include <stdio.h>

int main(void) {
    printf("Hello world!\n");
    return 0;
}
```

... ... ...

Some of its text lies between two lines that begin with \`\`\`. How can an Elisp program identify those lines in foo.md?

Note: the Markdown parser of this website does not handle character escapes well, so the three consecutive backquotes written in escaped form are not displayed correctly.

Parser

Each line of text in foo.md is one of the following three cases:

  1. a line of text that begins with \`\`\`;
  2. a line of text located between two lines that begin with \`\`\`;
  3. a line of text that fits neither of the above cases.

Suppose the program I want to write is simple-md-parser.el. As long as it can determine which case each line of text belongs to and record the result, the problem is solved. Simple as it is, this program really is a parser.

Variables and lists

The results that simple-md-parser.el determines for each line of foo.md can be stored in a variable of Elisp's list type.

In elisp language, a variable is a symbol bound to a certain type of data object, so to define a variable is to bind a symbol to a data object. For example,

(setq x "Hello world!")

This binds the symbol x to a string object, which defines the variable x.

A list variable is a symbol bound to an instance of the list type, which can be created with the list function. For example,

(setq x (list 1 2 3 4 5))

This binds the symbol x to the list object (1 2 3 4 5), which defines the list variable x.

You can also define empty list variables, such as

(setq x '())

The single quotation mark ' denotes quoting in Elisp. When the Elisp interpreter encounters a quoted symbol or list, it takes that symbol or list itself as the evaluation result. This is one of the characteristic features of Lisp. The following examples may help:

(setq x (list 1 2 3 4 5))
(princ\' x)

(setq x '(list 1 2 3 4 5))
(princ\' x)

(setq x '(1 2 3 4 5))
(princ\' x)

The output of the above program is

(1 2 3 4 5)
(list 1 2 3 4 5)
(1 2 3 4 5)

From this output you can see that

(setq x '(list 1 2 3 4 5))

binds the symbol x to the list (list 1 2 3 4 5). The quote in '(list 1 2 3 4 5) prevents the Elisp interpreter from evaluating (list 1 2 3 4 5), so the interpreter takes the quoted expression itself as the evaluation result.

You can also see that the following two lines of code bind x to lists with the same contents:

(setq x (list 1 2 3 4 5))
(setq x '(1 2 3 4 5))

If you understand the above, it is not hard to see why '() denotes an empty list.
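As a small sketch of the same idea: the quote mark is shorthand for the special form quote, so the two spellings below bind equal lists. The variable names a, b and c are chosen only for this illustration.

```elisp
;; 'X is read as (quote X): the interpreter returns X itself, unevaluated.
(setq a '(list 1 2 3))         ; the list (list 1 2 3), not evaluated
(setq b (quote (list 1 2 3)))  ; the same thing, written out in full
(setq c (list 1 2 3))          ; evaluated: calls the function list

(princ (equal a b))  ; prints t
(princ "\n")
(princ (equal a c))  ; prints nil: the first element of a is the symbol list
(princ "\n")
(princ (car a))      ; prints list
(princ "\n")
```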

Lists are one-way

Elisp's lists are unidirectional: accessing the first element of a list is much easier than accessing the last. Use the car function to get the first element of a list. For example,

(setq x '(1 2 3 4 5))
(princ\' (car x))

Output: 1

The cdr function skips the first element of a list and takes the rest of the list as the evaluation result. For example,

(princ\' (cdr '(1 2 3 4 5)))

Output: (2 3 4 5)

If you want to get the last element of a list, you have to use cdr to keep cutting off the head of the list until only the last element is left. Fortunately, solving the problem raised at the beginning of this chapter does not require the last element of a list, so this can be set aside for now.
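For the curious, repeatedly cutting with cdr can be sketched as a small recursive function. The name last-element is hypothetical, and the sketch relies on if and null, which are only explained later in this chapter.

```elisp
;; Hypothetical helper, not part of Emacs: keep taking the cdr until
;; only one element is left, then return that element.
(defun last-element (x)
  (if (null (cdr x))
      (car x)
    (last-element (cdr x))))

(princ (last-element '(1 2 3 4 5)))  ; prints 5
(princ "\n")
```

For an empty list this sketch evaluates to nil, because both car and cdr of an empty list are nil.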

As with accessing the first and last elements, appending an element to the end of a list is much harder than adding one to the front. Elisp provides the cons function, which adds an element to the front of a list and evaluates to the new list. For example,

(setq x '(1 2 3 4 5))
(setq x (cons 0 x))
(princ\' x)

Output: (0 1 2 3 4 5)

Evaluation

From now on, I will not talk about the return results of functions, but about the evaluation results. Although they can be understood as one thing in most cases, some terms of LISP language should be respected.

The previous chapter vaguely mentioned that an Elisp program is interpreted and executed by the Elisp interpreter. How does this process work? In essence, the Elisp interpreter evaluates each expression in the program in order.

An expression is also called a form. In Elisp, the definition and use of variables and functions are expressions. Even a number, a string, or an instance of some other type is an expression.

Each line below is an expression:

42
"Hello world!"
(setq x 42)
(princ\' (buffer-string))

Expressions can be nested. Nesting is expressed with pairs of parentheses. For example, the definition of a function is a typical nested structure:

(defun princ\' (x)
  (princ x)
  (princ "\n"))

Yes, the elisp interpreter also evaluates the definition of a function, and the result is the name of the function.

In the view of the elisp interpreter, any expression has its value, so its interpretation and execution of the elisp program is essentially to evaluate all the expressions in the program one by one.

Note that the evaluation result of the expression (princ\' "Hello world!") is not the Hello world! printed in the terminal. When a program writes information to a terminal, it is essentially writing to a file. That work is a sideline of the Elisp interpreter's evaluation process. Its main business is to evaluate expressions, and the evaluation result is not visible outside the Elisp interpreter.

Binding a symbol to a data object or a group of expressions, that is, defining a variable or function, can also be regarded as a sideline of the elisp interpreter in a sense.
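The difference between an evaluation result and a sideline can be made visible by binding the result to a variable. This is only a sketch; the variable names value and other are chosen for illustration.

```elisp
;; princ writes its argument to the output as a side effect;
;; its evaluation result is the argument itself.
(setq value (princ "Hello world!\n"))

;; setq also has an evaluation result: the value that was bound.
(setq other (setq x 42))

(princ (string= value "Hello world!\n"))  ; prints t
(princ "\n")
(princ other)                             ; prints 42
(princ "\n")
```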

Symbol

It is now clear that a variable is a symbol bound to a certain type of data object. In fact, functions are similar. When defining a function, for example

(defun princ\' (x)
  (princ x)
  (princ "\n"))

This merely binds the symbol princ\' to a group of expressions. Defining a function is essentially binding a symbol to an anonymous function, which is called a lambda expression. It is fine not to delve into this now, but you should know that lambda expressions are part of the essence of Lisp.

Symbols can serve as the names of variables and functions, but they have another use: being used as themselves. The single quotation mark ' prevents the Elisp interpreter from interpreting a name in any way, so the name itself becomes the evaluation result. In this way a symbol itself can be used directly in a program.

Now back to the problem to be solved in this chapter. Remember that each line of text in foo.md can only be one of three cases? I can use symbols to represent these three cases:

'a-line-of-text-that-begins-with-three-consecutive-backquotes
'a-line-of-text-that-is-contained-between-two-lines-that-begin-with-three-consecutive-backquotes
'a-line-of-text-that-does-not-begin-with-three-consecutive-backquotes-and-is-not-contained-between-two-lines-that-do

No kidding: Elisp really supports such long symbols. However, symbols this long are tiring to write. To simplify, the three cases above are shortened and further subdivided into the following four:

'code-block-start
'code-block
'code-block-end
'unknown

Why is the text between two lines that begin with \`\`\` called a "code block"? Because the content of foo.md is actually Markdown markup text.

Traverse the buffer line by line

Everything seems to be on the right track, and it is time to consider how to read every line of text in foo.md.

As pointed out in the previous chapter, the find-file function reads the specified file into a buffer, and the goto-char function then moves the insertion point in the buffer to a specified position. Elisp also provides a function that moves the insertion point in larger steps, forward-line, which moves it to the beginning of a line after or before the current line. The coordinates of the beginning and end of the line where the insertion point is located can be obtained with line-beginning-position and line-end-position. Passing them as arguments to buffer-substring yields the content of that line as a string object, which becomes the evaluation result. In short, with these functions we can grab any line of text in the buffer as a string object. For example, the following program grabs the third line of foo.md:

(find-file "foo.md")
(forward-line 2)
(princ\' (buffer-substring (line-beginning-position) (line-end-position)))

Why is moving the insertion point to the third line of the current buffer written as (forward-line 2)? Because after (find-file "foo.md") opens the file, the insertion point is by default at the beginning of the first line of the current buffer, and the argument of forward-line is a line offset relative to the line where the insertion point is. Moving 2 lines forward from the first line reaches the third line. The argument of forward-line can also be negative, which moves the insertion point to a line before the current one.

To make it easy to get the content of the line where the insertion point is located, I define the current-line function:
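The relative nature of forward-line can be tried without a file. The sketch below assumes the standard macro with-temp-buffer, which this chapter does not otherwise use, to create a throwaway buffer; the variable names third and second are illustrative.

```elisp
;; The same one-line helper this chapter defines for grabbing a line.
(defun current-line ()
  (buffer-substring (line-beginning-position) (line-end-position)))

(with-temp-buffer
  (insert "line 1\nline 2\nline 3\nline 4\n")
  (goto-char (point-min))        ; start at the first line, as after find-file
  (forward-line 2)               ; first line + 2 = third line
  (setq third (current-line))
  (forward-line -1)              ; one line back = second line
  (setq second (current-line)))

(princ third)   ; prints line 3
(princ "\n")
(princ second)  ; prints line 2
(princ "\n")
```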

(defun current-line ()
  (buffer-substring (line-beginning-position) (line-end-position)))

If you define a function that internally uses (forward-line 1) to move the insertion point to the next line and then calls itself, you can read the contents of the buffer line by line. For example,

(defun every-line ()
  (princ\' (current-line))
  (forward-line 1)
  (every-line))

(find-file "foo.md")
(every-line)

every-line is a recursive function, that is, a function whose definition calls the function itself. When the interpreter of any programming language encounters a recursive function, it falls into repeatedly evaluating the function's definition. A recursive function is like the engine of a car, running round and round. That the car can carry people from one place to another is merely a side effect of the engine.

The above program does display the current buffer line by line, but it eventually crashes with the error

Lisp nesting exceeds ‘max-lisp-eval-depth’

This is because the definition of every-line never checks whether the insertion point has reached the end of the buffer content, so the recursion cannot terminate and the Elisp interpreter never obtains an evaluation result. The Lisp interpreter limits the depth of recursion, however (800 by default at the time of writing; newer Emacs versions may use a larger default). When the recursion depth exceeds this limit, the interpreter reports an error and exits.

Conditional expression

How can we tell that the insertion point has moved to the end of the current buffer? Remember the point function used in the previous chapter? It gives the current coordinate of the insertion point. And remember point-min and point-max? They give the start and end coordinates of the current buffer respectively. So when the result of point equals the result of point-max, the insertion point is at the end of the current buffer. What is still missing is Elisp's conditional expression.

In Elisp, = is a function that determines whether two numeric values are equal. For example,

(= (point) (point-max))

This determines whether the insertion point is at the end of the current buffer. If the logical expression is true, the evaluation result is t; otherwise it is nil. In Elisp, the symbol t means true and nil means false. In addition, nil is equivalent to '(), but I think it is better not to mix them.

Now it is almost clear why Elisp does not use = to define variables the way non-Lisp languages do, and uses setq instead. The variable definition syntax of non-Lisp languages is simpler, but it sacrifices =, so to judge whether two values are equal those languages have to use == or some other symbol. Don't mind what I say here; it is just my fancy.

Program branches are executed according to the evaluation result of a logical expression. In Elisp, branching can be expressed with the if expression, which has the following form:
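A minimal sketch of these evaluation results, again using the standard macro with-temp-buffer (an assumption of this example, not something the chapter relies on) so that no file is needed. The function eq, which tests whether two objects are the same object, is likewise borrowed only for illustration.

```elisp
(setq at-end
      (with-temp-buffer
        (insert "abc")
        ;; after insert, the insertion point is at the end of the buffer
        (= (point) (point-max))))

(setq at-start
      (with-temp-buffer
        (insert "abc")
        (goto-char (point-min))
        (= (point) (point-max))))

(princ at-end)        ; prints t
(princ "\n")
(princ at-start)      ; prints nil
(princ "\n")
(princ (eq nil '()))  ; prints t: nil and the empty list are the same object
(princ "\n")
```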

(if logical-expression
    program-branch-1
  program-branch-2)

If the Elisp interpreter evaluates the logical expression as true, it goes on to evaluate program branch 1; otherwise it evaluates program branch 2. With the if expression, the every-line function can be redefined:

(defun every-line ()
  (if (= (point) (point-max))
      (princ "")
  (princ\' (current-line))
  (forward-line 1)
  (every-line)))

This function does what I want: when the insertion point reaches the end of the current buffer, it terminates the recursion and outputs an empty string object. However, the semantics of this function are somewhat confusing. Its definition contains these four lines of code:

      (princ "")
    (princ\' (current-line))
    (forward-line 1)
    (every-line)))

Which of them is "program branch 1" and which is "program branch 2"? Indentation carries no meaning in Elisp's syntax, so indenting the first line deeper than the other three does nothing to set it apart from them. (In fact, if takes the first expression after the condition as branch 1 and all remaining expressions as branch 2, which is why this definition works.) To make the semantics clear, we can use progn. progn groups a sequence of expressions into one and takes the evaluation result of the last one as its own. For example,

(defun every-line ()
  (if (= (point) (point-max))
      (princ "")
    (progn 
      (princ\' (current-line))
      (forward-line 1)
      (every-line))))

Now the semantics of the conditional expression in every-line are clear: whether the logical expression is true or false, the corresponding program branch is a single expression, not several.

String matching

Now I can get any line of text in the current buffer, but to solve the problem raised at the beginning of this chapter I still need to determine whether a line of text begins with \`\`\`. Taking the first 3 characters of each line and checking whether they are \`\`\` solves this small problem. Elisp does provide full regular expressions for matching text with specific patterns, but I am not going to use them now, because regular expressions are complicated enough to deserve a chapter of their own.

The substring function takes the part of a string object that falls within a specified range as its evaluation result. For example,

(princ\' (substring "heaven and earth are one finger, all things are one horse" 0 6))

Output:

heaven

To determine whether two string objects have the same content, use string= rather than =. For example,

(string= "Hello" "Hello")

The result is t, and

(string= "Hello" "World")

The result is nil.

The following expression determines whether the line where the insertion point is located begins with \`\`\`:

(string= (substring (current-line) 0 3) "```")

In practice, however, this expression is too optimistic, because not all lines contain at least three characters. For example, foo.md has many empty lines, which contain only the line break \n, so current-line yields an empty string for them. If the current line contains fewer than 3 characters, the substring function reports an error:

Args out of range: "", 0, 3

Then the Elisp interpreter stops and the program can no longer run. To solve this problem, the special case must be handled:

(setq x (current-line))
(setq y "```")
(setq n (length y))
(if (< (length x) n)
    nil
  (string= (substring x 0 n) y))

< is also a function; it compares two numeric values. For the expression (< a b), if a is less than b the evaluation result is t, otherwise nil. The length function gets the length of a string object, that is, the number of characters it contains.

length can also get the length of a list, that is, the number of elements it contains. For example,

(length '(1 2 3))

The evaluation result is 3.
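As an aside, the length check and substring comparison shown above can also be expressed with the built-in function string-prefix-p. This is only one possible shortcut, not the approach this chapter takes. The variable name fence is illustrative, and the marker is built with make-string so the example does not have to spell out three backquotes literally.

```elisp
;; 96 is the character code of the backquote; make-string builds
;; the three-backquote marker.
(setq fence (make-string 3 96))

;; string-prefix-p takes the prefix first and the string second.
(princ (string-prefix-p fence (concat fence "c")))  ; prints t
(princ "\n")
(princ (string-prefix-p fence ""))                  ; prints nil, no error
(princ "\n")
(princ (string-prefix-p fence "# Hello world!"))    ; prints nil
(princ "\n")
```

Unlike the bare substring expression, this does not signal an error when the line is shorter than the marker.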

Implementing the parser

We can write simple-md-parser.el by using all the above knowledge. The full implementation is given below

(defun princ\' (x)
  (princ x)
  (princ "\n"))

(defun current-line ()
  (buffer-substring (line-beginning-position) (line-end-position)))

(defun text-match (src dest)
  (setq n (length dest))
  (if (< (length src) n)
      nil
    (string= (substring src 0 n) dest)))

(defun every-line (result in-code-block)
  (if (= (point) (point-max))
      result
    (progn
      (if (text-match (current-line) "```")
          (progn
            (if in-code-block
                (progn
                  (setq result (cons 'code-block-end result))
                  (setq in-code-block nil))
              (progn
                (setq result (cons 'code-block-start result))
                (setq in-code-block t))))
        (progn
          (if in-code-block
              (setq result (cons 'code-block result))
            (setq result (cons 'unknown result)))))
      (forward-line 1)
      (every-line result in-code-block))))

(progn
  (find-file "foo.md")
  (princ\' (every-line '() nil)))

At first glance the definition of every-line looks a little complicated, but the logic it expresses is simple. For each line of text in the current buffer, the function first determines whether the line begins with \`\`\`. If so, it must further check whether the previous line was inside a code block before it can decide whether the current line is 'code-block-start or 'code-block-end. The second parameter of the function records whether the line before the current one belongs to a code block. The function also shows how the list result, which serves as the evaluation result, starts as an empty list object and grows step by step during the recursion.

List inversion

In the parser implemented in the previous section, every-line evaluates to a list object. This list is actually in reverse order: the case of the last line of foo.md corresponds to the first element of the list, the case of the second-to-last line corresponds to the second element, and so on.
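Why the order comes out reversed can be seen in a stripped-down sketch that collects the line texts themselves with cons. The function name collect-lines and the variable collected are hypothetical, and the standard macro with-temp-buffer stands in for foo.md.

```elisp
(defun current-line ()
  (buffer-substring (line-beginning-position) (line-end-position)))

;; Hypothetical helper: collect every line of the current buffer with
;; cons, in the same recursive style as every-line.
(defun collect-lines (result)
  (if (= (point) (point-max))
      result
    (progn
      (setq result (cons (current-line) result))
      (forward-line 1)
      (collect-lines result))))

(setq collected
      (with-temp-buffer
        (insert "first\nsecond\nthird\n")
        (goto-char (point-min))
        (collect-lines '())))

(princ collected)  ; prints (third second first)
(princ "\n")
```

Each cons puts the newest line in front of the lines collected so far, so the line read last ends up first.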

If you want to reverse this list, you need to write another function:

(defun reverse-list (x y)
  (if (null x)
      y
    (reverse-list (cdr x) (cons (car x) y))))

The Elisp function null can be used to determine whether a list is '().

The usage of this function is as follows:

(setq x '(5 4 3 2 1))
(princ\' (reverse-list x '()))

Output: (1 2 3 4 5)

With the reverse-list function you can further improve the simple-md-parser.el implemented in the previous section. That is left as the exercise of this chapter.
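Emacs also ships a built-in reverse function. The sketch below merely checks that the hand-written reverse-list agrees with it; writing the function by hand remains the point of this section.

```elisp
(defun reverse-list (x y)
  (if (null x)
      y
    (reverse-list (cdr x) (cons (car x) y))))

(setq x '(5 4 3 2 1))

(princ (equal (reverse-list x '()) (reverse x)))  ; prints t
(princ "\n")
(princ x)  ; prints (5 4 3 2 1): neither call modifies x
(princ "\n")
```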

Epilogue

The simple-md-parser.el program implemented in this chapter is just a beginner's Elisp code; it is cumbersome and even unsafe. The next three chapters simplify and improve this code to some extent, and introduce more Elisp syntax and functions along the way.

Next chapter: variable