Previous chapter: Command line program interface
In the conclusion of the previous chapter, I said that whether this tutorial would have a second part depended on whether I ran into new text processing problems. I soon did.
The following is the content of the XML file foo.xml:
<bib>
  <title>foo</title>
</bib>
<attachment>
  <resource/>
  <title>foo</title>
</attachment>
<bib>
  <title>bar</title>
</bib>
<attachment>
  <resource/>
  <title>bar</title>
</attachment>
From the <attachment>...</attachment> blocks I need to extract the following entries:
<resource/>
<title>foo</title>
<resource/>
<title>bar</title>
Matching text across lines
Now suppose the Elisp function find-file has been used to load the whole contents of foo.xml into a buffer.
Then I found that what I had learned about Elisp so far was almost useless. The earlier matching and extraction methods only work on single-line text, but now the problem is matching and extracting multi-line text, that is, extracting the following from the current buffer:
<attachment>
  <resource/>
  <title>foo</title>
</attachment>
<attachment>
  <resource/>
  <title>bar</title>
</attachment>
Never mind extraction; even matching an <attachment>...</attachment> block is not easy. For example, in the following program
(find-file "foo.xml")
(let ((x (buffer-string)))
  (string-match "<attachment>\\(.+\\)</attachment>" x)
  (princ\' (match-string 1 x)))
string-match fails to match the <attachment>...</attachment> block in the contents of the current buffer. The reason is simple: although . in a regular expression matches any character, that does not include line breaks.
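The failure can be reproduced without foo.xml. The sketch below uses two made-up strings and a hypothetical helper name, demo-dot-and-newline, neither of which comes from the tutorial; it only shows that the same pattern succeeds on one line and fails as soon as a newline sits between the tags.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-dot-and-newline ()
  "Return the `string-match' results for a one-line and a multi-line string."
  (let ((pattern "<a>\\(.+\\)</a>")
        (one-line "<a>x</a>")
        (multi-line "<a>\nx\n</a>"))
    (list (string-match pattern one-line)      ; 0, the match starts at index 0
          (string-match pattern multi-line)))) ; nil, `.+' cannot cross "\n"

(demo-dot-and-newline) ; => (0 nil)
```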
Crossing the sea under camouflage
Matching text across lines is not impossible, but it takes more knowledge of Elisp regular expressions than we have covered so far [1]. What I want to say, though, is that the Elisp we already know is enough for the problem above. We just need to change our way of thinking.
Why is text multi-line? Because when text is entered, a person or a program adds a newline character at the end of each line. If these newlines are temporarily replaced with a special token, multi-line text becomes single-line text. Once matching and processing are done, the token is replaced with newlines again and the single-line text is restored to multi-line text. This is a trick to hide the truth.
Replacing every newline character in the current buffer with a special token can be done with the buffer transformation method described in Chapter 6. This chapter gives a faster method. The Elisp function replace-string replaces every occurrence of a target string with a specified string, from the insertion point to the end of the current buffer, for example
(let ((x "")
      (y "")
      (one-line (generate-new-buffer "one-line")))
  (find-file "foo.xml")
  (setq x (buffer-string))
  (with-current-buffer one-line
    (insert x)
    (goto-char (point-min))
    (replace-string "\n" "<linebreak/>")
    (setq y (buffer-string)))
  (princ\' y))
After executing the above program, the newly created buffer one-line holds the single-line version of the foo.xml buffer. If the (princ\' y) statement is replaced with
(string-match "<attachment>\\(.+\\)</attachment>" y)
(princ\' (match-string 1 y))
then an <attachment>...</attachment> block can be extracted, even though the extracted result is wrong.
To observe the error more easily, we need a simpler example:
(setq x "abcabcabc")
(string-match "a\\(.+\\)a" x)
(princ\' (match-string 1 x))
What does this example output? I was hoping for bc, but what it actually outputs is bcabc. That is because + is greedy: it always matches the longest possible result, not the shortest. The same is true of *. In Elisp regular expressions, appending ? to them curbs their greed, for example
(setq x "abcabcabc")
(string-match "a\\(.+?\\)a" x)
(princ\' (match-string 1 x))
Now the output of the program is bc.
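The same effect shows up with * on tag-like text. The following is a small sketch, not part of the tutorial's program; the function name demo-greedy-vs-lazy and the sample string are made up for illustration.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-greedy-vs-lazy (text)
  "Return a list (GREEDY LAZY) of what `.*' and `.*?' capture in TEXT."
  (list (and (string-match "<b>\\(.*\\)</b>" text)   ; greedy: longest match
             (match-string 1 text))
        (and (string-match "<b>\\(.*?\\)</b>" text)  ; non-greedy: shortest match
             (match-string 1 text))))

(demo-greedy-vs-lazy "<b>one</b> and <b>two</b>")
;; => ("one</b> and <b>two" "one")
```

The greedy pattern runs on to the last closing tag, which is the same mistake the <attachment> pattern makes on the one-line buffer.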
re-search-forward searches the buffer for text matching a regular expression and, at the same time, moves the insertion point to the end of the match. Based on this function, and with the help of the capture feature of Elisp regular expressions, we can extract multiple <attachment>...</attachment> blocks from the one-line buffer constructed in the previous section.
To demonstrate re-search-forward, I changed the sample code of the previous section into the following:
(let ((x "")
      (one-line (generate-new-buffer "one-line"))
      (output (generate-new-buffer "output")))
  (find-file "foo.xml")
  (setq x (buffer-string))
  (with-current-buffer one-line
    (insert x)
    (goto-char (point-min))
    (replace-string "\n" "<linebreak/>")
    (goto-char (point-min))
    (while t
      (if (re-search-forward "\\(<attachment>.+?</attachment>\\)" nil t 1)
          Program branch 1
        Program branch 2))))
re-search-forward is the most complex Elisp function I have used so far. It takes four parameters, but only the first is required. The other three are optional: if they are not set, re-search-forward uses their default values. The four parameters are defined as follows:
- The first parameter is the Elisp regular expression used for matching.
- The second parameter bounds the search. re-search-forward searches the current buffer, starting at the insertion point; the second parameter sets where the search ends. If its value is nil, the end of the current buffer is used as the end of the search range.
- The third parameter controls what happens when no matching text is found. If its value is nil, re-search-forward signals an error. If its value is t, re-search-forward returns nil. If its value is neither nil nor t, re-search-forward moves the insertion point to the end of the search range and returns nil.
- The fourth parameter, COUNT, makes re-search-forward keep searching until the COUNT-th match is found. If this parameter is not set, the default value is 1.
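The third and fourth parameters are easier to remember after seeing them in action. The sketch below runs in a temporary buffer with made-up text; the helper names demo-search-noerror and demo-search-count are hypothetical and not part of the tutorial's program.

```elisp
;; Hypothetical helpers, for illustration only.
(defun demo-search-noerror (noerror)
  "Search for a pattern that is absent, passing NOERROR as third argument.
Return a list (RESULT POINT-AFTER)."
  (with-temp-buffer
    (insert "alpha beta gamma")
    (goto-char (point-min))
    (let ((result (condition-case nil
                      (re-search-forward "delta" nil noerror)
                    ;; A nil NOERROR makes the failed search signal an error.
                    (search-failed 'signaled))))
      (list result (point)))))

(defun demo-search-count (count)
  "Return the position after the COUNT-th match of \"a\"."
  (with-temp-buffer
    (insert "alpha beta gamma")
    (goto-char (point-min))
    (re-search-forward "a" nil t count)))

(demo-search-noerror nil)   ; => (signaled 1), error signaled, point unchanged
(demo-search-noerror t)     ; => (nil 1), returns nil, point unchanged
(demo-search-noerror 'move) ; => (nil 17), returns nil, point at end of range
(demo-search-count 1)       ; => 2, just after the first "a"
(demo-search-count 2)       ; => 6, just after the second "a"
```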
Once re-search-forward is fully understood, the code for program branch 1 above can be written without any new Elisp knowledge:
(let ((y (match-string 1)))
  (with-current-buffer output
    (insert (concat y "\n"))))
It takes the text captured by re-search-forward with the match-string function and inserts it into the output buffer. Note that if the text captured by the regular expression belongs to the current buffer, match-string does not need its second argument.
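The difference between the two ways of calling match-string can be sketched side by side. The helper names and sample text below are made up for illustration.

```elisp
;; Hypothetical helpers, for illustration only.
(defun demo-match-string-in-buffer ()
  "Capture from a temporary buffer; no second argument is needed."
  (with-temp-buffer
    (insert "<title>foo</title>")
    (goto-char (point-min))
    (when (re-search-forward "<title>\\(.+?\\)</title>" nil t)
      (match-string 1))))

(defun demo-match-string-in-string ()
  "Capture from a string; the string is passed again to `match-string'."
  (let ((s "<title>bar</title>"))
    (when (string-match "<title>\\(.+?\\)</title>" s)
      (match-string 1 s))))

(demo-match-string-in-buffer) ; => "foo"
(demo-match-string-in-string) ; => "bar"
```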
For program branch 2, that is, the case where re-search-forward fails to match, the Elisp we know so far really is not enough. Because this branch sits inside an infinite loop, getting out of it requires, as in other programming languages, something like a break statement that terminates the iteration early.
Elisp has no break, but it does have catch and throw. Here's an example:
(catch 'foo
  (princ\' "foo")
  (princ\' "bar"))
Now, if I change the above code to
(catch 'foo
  (princ\' "foo")
  (throw 'foo nil)
  (princ\' "bar"))
the code after the throw expression is skipped by the Elisp interpreter, so this code only outputs foo.
If the above code is changed to
(princ\' (catch 'foo
          (princ\' "foo")
          (throw 'foo nil)
          (princ\' "bar")))
the output becomes foo followed by nil: the second argument of throw, nil, is taken by Elisp as the result of evaluating the catch expression.
catch/throw is called a “nonlocal exit” in Elisp. With them we can simulate break and the exception mechanisms of other programming languages.
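As a small illustration of catch/throw standing in for break, the sketch below leaves a loop early and hands a value back through throw. The function name demo-first-even and the tag found are made up; dolist is a standard Elisp looping macro that the tutorial has not used.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-first-even (numbers)
  "Return the first even number in NUMBERS, or nil if there is none."
  (catch 'found
    (dolist (n numbers)
      (when (= 0 (% n 2))
        (throw 'found n)))  ; leave the loop at once; N becomes the catch value
    nil))                   ; reached only when nothing was thrown

(demo-first-even '(3 7 8 11 12)) ; => 8
(demo-first-even '(1 3 5))       ; => nil
```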
With catch/throw, the program branch 2 described in the previous section can be implemented, for example
(throw 'break nil)
Then just place the while expression inside a catch block so that the latter catches the throw:
(catch 'break
  (while t
    (if (re-search-forward "\\(<attachment>.+?</attachment>\\)" nil t 1)
        Program branch 1
      (throw 'break nil))))
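For comparison, here is a different way to end the loop, not the one this chapter uses: because re-search-forward returns nil on failure when its third argument is t, the search itself can serve as the while condition. The sketch works on a made-up one-line string in a temporary buffer, and the name demo-collect-attachments is hypothetical.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-collect-attachments (text)
  "Return a list of the <attachment>...</attachment> blocks in one-line TEXT."
  (let ((blocks '()))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; A failed search returns nil here, which ends the loop.
      (while (re-search-forward "<attachment>.+?</attachment>" nil t)
        (push (match-string 0) blocks)))
    (nreverse blocks)))

(demo-collect-attachments
 "<bib>a</bib><attachment>foo</attachment><bib>b</bib><attachment>bar</attachment>")
;; => ("<attachment>foo</attachment>" "<attachment>bar</attachment>")
```

The catch/throw version remains the more general tool, since it can leave a loop from any depth and not only when the loop condition fails.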
Restoring multi-line text
Now the following code
(let ((x "")
      (one-line (generate-new-buffer "one-line"))
      (output (generate-new-buffer "output")))
  (find-file "foo.xml")
  (setq x (buffer-string))
  (with-current-buffer one-line
    (insert x)
    (goto-char (point-min))
    (replace-string "\n" "<linebreak/>")
    (goto-char (point-min))
    (catch 'break
      (while t
        (if (re-search-forward "\\(<attachment>.+?</attachment>\\)" nil t 1)
            (let ((y (match-string 1)))
              (with-current-buffer output
                (insert (concat y "\n"))))
          (throw 'break nil))))))
has basically solved the problem raised at the beginning of this chapter, because the output buffer contains the two <attachment>...</attachment> blocks extracted from foo.xml. Next, I just need to replace the <linebreak/> tokens in them with \n and the problem is completely solved. However, I think this task can be left as an exercise for this chapter.
In the current buffer, functions such as replace-string and re-search-forward have a side effect: they move the insertion point. If you note where the insertion point is during text processing and then call these functions, you have to put the insertion point back afterwards. This is the main reason (goto-char (point-min)) appears so many times in the code of the previous sections. Elisp provides the save-excursion form, which saves the location of the insertion point, performs operations that may move it, and finally restores it to its original position. For example,
(save-excursion
  (insert x))
is roughly equivalent to
(let ((p (point)))
  (insert x)
  (goto-char p))
Therefore, the second exercise in this chapter is to revise the answer to the exercise in the previous section using save-excursion.
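Before attempting the exercise, it may help to watch save-excursion restore the insertion point after a search. This is a minimal sketch on made-up text in a temporary buffer; the name demo-save-excursion is hypothetical.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-save-excursion ()
  "Return (POINT-BEFORE POINT-INSIDE POINT-AFTER) around a search."
  (with-temp-buffer
    (insert "one two three")
    (goto-char 5)                          ; park point just before "two"
    (let ((before (point))
          inside)
      (save-excursion
        (goto-char (point-min))
        (re-search-forward "three" nil t)  ; moves point to after "three"
        (setq inside (point)))
      ;; Leaving `save-excursion' puts point back where it was.
      (list before inside (point)))))

(demo-save-excursion) ; => (5 14 5)
```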
This chapter introduced more buffer operations in Elisp, along with the nonlocal exit syntax. With this knowledge, we can extract blocks made up of multiple lines of text from any text document.