Elisp 08: Cross-line text extraction

Time: 2021-6-1

Previous chapter: Command line program interface

At the end of the previous chapter, I said that whether this tutorial gets a second part depends on whether I run into new text processing problems. It did not take long for one to turn up.

Problem

The following is the content of the XML file foo.xml:

<bib>
  <title>foo</title>
</bib>
<attachment>
  <resource/>
  <title>foo</title>
</attachment>
<bib>
  <title>bar</title>
</bib>
<attachment>
  <resource/>
  <title>bar</title>
</attachment>

From the <attachment>...</attachment> blocks, I need to extract the following entries:

<resource/>
<title>foo</title>
<resource/>
<title>bar</title>

Cross-line text matching

Now suppose you have used the Elisp function find-file to load the entire contents of the file foo.xml into a buffer:

(find-file "foo.xml")

At this point I found that my previous Elisp knowledge was almost useless. The text matching and extraction methods covered so far only apply to single-line text, but now we face the problem of matching and extracting multi-line text, that is, extracting the following from the current buffer:

<attachment>
  <resource/>
  <title>foo</title>
</attachment>
<attachment>
  <resource/>
  <title>bar</title>
</attachment>

Never mind extraction: even matching an <attachment>...</attachment> block is not easy. For example, the following program

(find-file "foo.xml")

(let ((x (buffer-string)))
  (string-match "<attachment>\\(.+\\)</attachment>" x)
  (princ\' (match-string 1 x)))

outputs nil, which means that string-match failed to match an <attachment>...</attachment> block in the contents of the current buffer. The reason for the failure is simple: in a regular expression, "." matches any character except a newline.
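The behaviour of "." can be checked on a small made-up string. This is only a sketch: the helper name my-one-line-match-p and the <a> tag are assumptions for illustration, not part of foo.xml.

```elisp
;; Hypothetical helper, named only for this sketch.
(defun my-one-line-match-p (text)
  "Return the match position if TEXT contains <a>...</a> matched with `.+'."
  (string-match "<a>\\(.+\\)</a>" text))

;; Everything is on one line, so the pattern matches at index 0.
(my-one-line-match-p "<a>foo</a>")
;; Here `.+' would have to cross a newline, so the result is nil.
(my-one-line-match-p "<a>\nfoo\n</a>")
```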

Crossing the sea under camouflage

It is not impossible to match text across lines, but that requires more knowledge of Elisp regular expressions than we have now [1]. What I want to say, however, is that for the problem above the Elisp we already know is enough. We just need to change our way of thinking.

Why is text multi-line? Because when the text was entered, a person or a program added a newline character at the end of each line. If these newlines are temporarily replaced with a special marker, multi-line text becomes single-line text. Once matching and processing are finished, the marker is replaced with a newline again and the single-line text is restored to multi-line text. This is the trick of crossing the sea under camouflage.

Replacing all the newline characters in the current buffer with a special marker could be implemented with the buffer transformation method described in Chapter 6. This chapter gives a quicker method. The Elisp function replace-string replaces every occurrence of a target string in the current buffer, from the insertion point to the end of the buffer, with the specified string, for example

(let ((x "")
      (y "")
      (one-line (generate-new-buffer "one-line")))
  (find-file "foo.xml")
  (setq x (buffer-string))
  (with-current-buffer one-line
    (insert x)
    (goto-char (point-min))
    (replace-string "\n" "<linebreak/>")
    (setq y (buffer-string)))
  (princ\' y))

After executing the above program, the newly created buffer one-line holds the single-line version of the foo.xml buffer. If the (princ\' y) statement is replaced with

(string-match "<attachment>\\(.+\\)</attachment>" y)
(princ\' (match-string 1 y))

then an <attachment>...</attachment> block can be extracted, although the extracted result is wrong.

To observe the error more easily, let us construct a simple example:

(setq x "abcabcabc")
(string-match "a\\(.+\\)a" x)
(princ\' (match-string 1 x))

What does this example output? Although I expected bc, it actually outputs bcabc. That is because + is greedy: it always wants the longest match, not the shortest one. The same is true of *. In Elisp regular expressions, adding a ? after them curbs their greed, for example

(setq x "abcabcabc")
(string-match "a\\(.+?\\)a" x)
(princ\' (match-string 1 x))

Now the output of the program is bc.
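The two ideas of this section, flattening the text and matching non-greedily, can be combined in one small function. This is a sketch under assumptions: the name my-first-attachment is made up, and it flattens a string with replace-regexp-in-string instead of the buffer-based replace-string used above.

```elisp
;; Hypothetical helper, named only for this sketch.
(defun my-first-attachment (text)
  "Return the body of the first <attachment> block in TEXT, flattened."
  ;; Replace every newline with the marker, so TEXT becomes one line.
  (let ((flat (replace-regexp-in-string "\n" "<linebreak/>" text)))
    ;; The non-greedy `.+?' stops at the first closing tag.
    (when (string-match "<attachment>\\(.+?\\)</attachment>" flat)
      (match-string 1 flat))))

(my-first-attachment
 "<attachment>\nfoo\n</attachment>\n<attachment>\nbar\n</attachment>")
;; => "<linebreak/>foo<linebreak/>"
```

With the greedy .+ instead, the capture would run on to the last </attachment> and swallow both blocks.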

Incremental search

The Elisp function re-search-forward searches the buffer for text matching a regular expression and, at the same time, moves the insertion point to the end of the match. Based on this function, and with the help of the capture groups of Elisp regular expressions, we can extract multiple <attachment>...</attachment> blocks from the one-line buffer constructed in the previous section.

To demonstrate re-search-forward, I changed the sample code of the previous section into the following:

(let ((x "")
      (one-line (generate-new-buffer "one-line"))
      (output (generate-new-buffer "output")))
  (find-file "foo.xml")
  (setq x (buffer-string))
  (with-current-buffer one-line
    (insert x)
    (goto-char (point-min))
    (replace-string "\n" "<linebreak/>")
    (goto-char (point-min))
    (while t
      (if (re-search-forward "\\(<attachment>.+?</attachment>\\)" nil t 1)
          Program branch 1
        Program branch 2))))

re-search-forward is the most complex Elisp function I have used so far. It has four parameters, but only the first one is required. The other three are optional: if they are not supplied, re-search-forward uses their default values. The four parameters are defined as follows:

  • The first parameter is the Elisp regular expression used for matching.
  • The second parameter sets the bound of the search. re-search-forward searches the current buffer for matching text, starting at the insertion point; where the search ends can be set with this parameter. If its value is nil, the end of the current buffer is used as the end of the search range.
  • If the third parameter is nil, re-search-forward signals an error when no matching text is found. If it is t, re-search-forward returns nil instead. If it is neither nil nor t, re-search-forward moves the insertion point to the end of the search range and returns nil.
  • The fourth parameter, COUNT, makes re-search-forward keep searching until the COUNT-th match. If it is not set, the default value is 1.
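The effect of the third and fourth parameters can be tried out in a scratch buffer. The following is only a sketch: the helper my-search-demo and the sample text are assumptions made for illustration.

```elisp
;; Hypothetical helper, named only for this sketch.
(defun my-search-demo (text regexp noerror count)
  "Search TEXT for REGEXP from the start; return (RESULT POINT-AFTERWARDS)."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((result (condition-case nil
                      (re-search-forward regexp nil noerror count)
                    ;; With NOERROR nil, a failed search signals `search-failed'.
                    (search-failed 'search-failed))))
      (list result (point)))))

(my-search-demo "a1 a2 a3" "a[0-9]" t 1)   ; => (3 3), just after "a1"
(my-search-demo "a1 a2 a3" "a[0-9]" t 2)   ; => (6 6), just after "a2"
(my-search-demo "a1 a2 a3" "zzz" nil 1)    ; => (search-failed 1)
(my-search-demo "a1 a2 a3" "zzz" t 1)      ; => (nil 1), point unchanged
(my-search-demo "a1 a2 a3" "zzz" 'move 1)  ; => (nil 9), point at buffer end
```

On success, re-search-forward returns the new position of the insertion point, which is why both list elements agree in the first two calls.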

Once re-search-forward is fully understood, the code for program branch 1 in the code above can be written without any new Elisp knowledge:

(let ((y (match-string 1)))
  (with-current-buffer output
    (insert (concat y "\n"))))

It retrieves the text captured by re-search-forward with the match-string function and inserts it into the output buffer. Note that when the text captured by the regular expression belongs to the current buffer, match-string does not need its second argument.
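The difference between the two forms of match-string can be seen side by side. This is a sketch with made-up helper names (my-capture-from-string and my-capture-from-buffer); the <title> pattern is borrowed from foo.xml.

```elisp
;; After `string-match', pass the searched string as the second argument.
(defun my-capture-from-string (s)
  "Return the title text found in the string S, or nil."
  (when (string-match "<title>\\(.+?\\)</title>" s)
    (match-string 1 s)))

;; After `re-search-forward', the text is read from the current buffer.
(defun my-capture-from-buffer (s)
  "Insert S into a scratch buffer and return the title text found there."
  (with-temp-buffer
    (insert s)
    (goto-char (point-min))
    (when (re-search-forward "<title>\\(.+?\\)</title>" nil t)
      (match-string 1))))

(my-capture-from-string "<title>foo</title>") ; => "foo"
(my-capture-from-buffer "<title>foo</title>") ; => "foo"
```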

For program branch 2, that is, the case where re-search-forward fails to match, the Elisp we already know is really not enough. This branch sits inside an infinite loop, and to jump out of the loop we need something like the return or break of other programming languages to end the iteration early.

catch/throw

Elisp has no return or break, but it does have the catch/throw expressions.

Here’s an example

(catch 'foo
  (princ\' "foo")
  (princ\' "bar"))

It outputs

foo
bar

Now, if I change the above code to

(catch 'foo
  (princ\' "foo")
  (throw 'foo nil)
  (princ\' "bar"))

then the code after the throw expression is skipped by the Elisp interpreter, so the current code only outputs

foo

If the above code is changed to

(princ\' (catch 'foo
           (princ\' "foo")
           (throw 'foo nil)
           (princ\' "bar")))

The output becomes

foo
nil

because the second argument of throw, nil, is taken by Elisp as the value of the catch expression.

In Elisp, catch/throw is called a "nonlocal exit". Based on it, we can simulate the return, the break and the exception mechanism of other programming languages.
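As a sketch of how catch/throw can stand in for an early return, consider a function that stops at the first even number in a list. The function name my-first-even, the tag 'return and the use of dolist are choices made for this illustration only.

```elisp
;; Hypothetical helper, named only for this sketch.
(defun my-first-even (numbers)
  "Return the first even element of NUMBERS, or nil if there is none."
  (catch 'return
    (dolist (n numbers)
      (when (= 0 (% n 2))
        ;; Leave the loop and the `catch' form at once, with value N.
        (throw 'return n)))
    ;; Reached only when nothing was thrown.
    nil))

(my-first-even '(1 3 4 6)) ; => 4
(my-first-even '(1 3 5))   ; => nil
```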

Based on catch/throw, the program branch 2 described in the previous section can be implemented, for example, as

(throw 'break nil)

Then we only need to put the while expression inside a catch block, which catches the 'break thrown by throw, i.e.

(catch 'break
  (while t
    (if (re-search-forward "\\(<attachment>.+?</attachment>\\)" nil t 1)
        Program branch 1
      (throw 'break nil))))
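For comparison, a common alternative (not the approach taken in this chapter) is to use the search itself as the condition of while, so the loop ends as soon as the search returns nil. The sketch below assumes a made-up helper my-collect-matches and a made-up <t> tag.

```elisp
;; Hypothetical helper, named only for this sketch.
(defun my-collect-matches (text regexp)
  "Return a list of all matches of REGEXP in TEXT, in order."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((matches '()))
      ;; With a non-nil third argument the search returns nil on
      ;; failure, which ends the loop without catch/throw.
      (while (re-search-forward regexp nil t)
        (push (match-string 0) matches))
      (nreverse matches))))

(my-collect-matches "<t>foo</t><t>bar</t>" "<t>.+?</t>")
;; => ("<t>foo</t>" "<t>bar</t>")
```

The catch/throw version remains the more general tool, since it can leave a loop from any depth.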

Restore multiline text

Now, with the following code

(let ((x "")
      (one-line (generate-new-buffer "one-line"))
      (output (generate-new-buffer "output")))
  (find-file "foo.xml")
  (setq x (buffer-string))
  (with-current-buffer one-line
    (insert x)
    (goto-char (point-min))
    (replace-string "\n" "<linebreak/>")
    (goto-char (point-min))
    (catch 'break
        (while t
          (if (re-search-forward "\\(<attachment>.+?</attachment>\\)" nil t 1)
              (let ((y (match-string 1)))
                (with-current-buffer output
                  (insert (concat y "\n"))))
            (throw 'break nil))))))

the problem raised at the beginning of this chapter is basically solved, because the output buffer contains the two <attachment>...</attachment> blocks extracted from the file foo.xml. Next, I only need to replace <linebreak/> with \n and the problem is completely solved. However, I think this task can be left as an exercise for this chapter.

save-excursion

Functions such as insert, replace-string and re-search-forward all have a side effect in the current buffer: they move the insertion point. If text processing needs the insertion point to stay where it was, then after calling these functions you have to put it back yourself. This is the main reason why

(goto-char (point-min))

appears so many times in the code of the previous sections. Elisp provides the save-excursion form, which automatically saves the position of the insertion point, performs some operations that may move it, and finally restores it to its original position. For example,

(save-excursion
  (insert x))

is equivalent, as far as the insertion point is concerned, to

(let ((p (point)))
  (insert x)
  (goto-char p))
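The restoring effect of save-excursion can be observed with a search that would otherwise move the insertion point. This is a minimal sketch; the helper my-peek-search and the sample text are assumptions for illustration.

```elisp
;; Hypothetical helper, named only for this sketch.
(defun my-peek-search (text regexp)
  "Return (MATCH-END POINT-AFTERWARDS) for REGEXP searched in TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((end (save-excursion
                 ;; This search moves point, but only inside the form.
                 (re-search-forward regexp nil t))))
      (list end (point)))))

(my-peek-search "foo bar baz" "bar") ; => (8 1)
```

The search ended at position 8, yet the insertion point is back at 1 once save-excursion has finished.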

So the second exercise of this chapter is to revise your answer to the exercise of the previous section using save-excursion.

Epilogue

This chapter introduced more buffer operations in Elisp and its nonlocal exit syntax. With this knowledge, we can extract a block made up of several lines from any text document.

Next chapter: Library


  1. https://www.emacswiki.org/ema…