The three swordsmen of Linux text processing, awk study notes 05: getline usage in detail

Time:2021-12-4

Getline usage details

By default, awk reads data from files or from stdin. We can also use getline to read data more flexibly, for example to read a file other than the one being processed while the main code block is running, or to read the output of a shell command.

Getline has a return value:

  • 1: The data was read correctly.
  • 0: EOF encountered while reading data.
  • Negative number: an error occurred while reading. -1 means the file cannot be opened; -2 means the I/O operation should be retried. When an error occurs, gawk also sets the variable ERRNO to describe it.

For robustness, awk code usually checks the return value when using getline.

if((getline)<0){...}
if((getline)<=0){...}
if((getline)>0){...}

Remember to wrap getline in parentheses; otherwise getline < 0 is parsed as an input redirection (read from a file named 0) rather than a comparison.
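The three return values can be observed directly. The sketch below is only an illustration: it assumes a POSIX shell with mktemp, and both the temporary file and the path /nonexistent/dir/file are made up for the demonstration.

```shell
tmp=$(mktemp)                       # hypothetical one-line input file
printf 'only line\n' > "$tmp"

out=$(awk -v f="$tmp" 'BEGIN {
    r1 = (getline line < f)                          # a record was read
    r2 = (getline line < f)                          # end of file
    r3 = (getline line < "/nonexistent/dir/file")    # file cannot be opened
    print r1, r2, r3
}')
echo "$out"
rm -f "$tmp"
```

This should print 1 0 -1. Note that every getline call is parenthesized before its result is used.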

Getline without parameters

Without parameters, getline immediately reads the next record from the current input stream (file or stdin), saves it to $0, and splits it into fields. Execution then continues with the awk code that follows the getline.

In this form, getline sets $0, the fields ($1 … $NF), NF, NR, FNR and RT.

# awk '/^1/{print;getline;print}' a.txt 
1   Bob     male    28   [email protected]     18023394012
2   Alice   female  24   [email protected]  18084925203
10  Bruce   female  27   [email protected]   13942943905
10  Bruce   female  27   [email protected]   13942943905

Remember, print without arguments means print $0. In the output, line 4 looks odd. Bruce’s line is the last line of the file, so getline hits EOF and returns 0, leaving $0 unchanged as Bruce’s line. That is why Bruce’s line is printed twice.

Therefore, it is better to check getline’s return value to make the code more robust.

# awk '/^1/{print;if((getline)<=0){exit};print}' a.txt 
1   Bob     male    28   [email protected]     18023394012
2   Alice   female  24   [email protected]  18084925203
10  Bruce   female  27   [email protected]   13942943905

awk has another statement similar to getline, called next. Let’s look at the result first.

# awk '/^1/{print;next;print}' a.txt 
1   Bob     male    28   [email protected]     18023394012
10  Bruce   female  27   [email protected]   13942943905

On next, awk also reads the next record immediately, but unlike getline it does not continue from the current position. Instead it abandons the rest of the current iteration of awk’s implicit loop (similar to continue in a loop statement) and runs the main code block again from the top, which means the patterns are matched again. The first next fetches Alice’s line, which does not match the pattern; the second next, on Bruce’s line, hits EOF, so processing ends.
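The difference between the two statements can be reproduced without a.txt. This is a minimal sketch on made-up three-line input (the data and variable names are hypothetical), run through the same two programs used above.

```shell
data='1 a
2 b
10 c'

# getline: reads the next record and carries on after the getline
with_getline=$(printf '%s\n' "$data" | awk '/^1/ { print; if ((getline) <= 0) exit; print }')

# next: reads the next record and restarts pattern matching
with_next=$(printf '%s\n' "$data" | awk '/^1/ { print; next; print }')

printf 'getline:\n%s\nnext:\n%s\n' "$with_getline" "$with_next"
```

With getline the line 2 b is printed, because execution continues after the getline without re-matching the pattern. With next it is skipped, because it does not match /^1/.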

Getline with parameters

The parameterless getline assigns the next record to $0 and splits it into fields. getline with a parameter takes a variable as its argument: it assigns the next record to that variable and does not split it into fields.

Therefore, getline with a parameter only sets NR, FNR, RT and the variable var; it does not modify $0, the fields or NF.

# awk '/^1/{print;if((getline var)<=0){exit};print var;print $0;print $2}' a.txt 
1   Bob     male    28   [email protected]     18023394012
2   Alice   female  24   [email protected]  18084925203
1   Bob     male    28   [email protected]     18023394012
Bob
10  Bruce   female  27   [email protected]   13942943905

In the output above, $0 and $2 still hold the data from Bob’s line, even though getline has already consumed Alice’s line.

Let’s take another example to compare the difference between getline with parameters and getline without parameters.

# awk '/Tony/{print;getline;print $0,$2}' a.txt 
3   Tony    male    21   [email protected]    17048792503
4   Kevin   male    21   [email protected]    17023929033 Kevin
# awk '/Tony/{print;getline var;print $0,$2}' a.txt 
3   Tony    male    21   [email protected]    17048792503
3   Tony    male    21   [email protected]    17048792503 Tony

Note that, for brevity, these examples do not check the return value of getline.
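To see which variables getline var touches, the following sketch prints them right after the call. The two-line input is invented for the illustration.

```shell
out=$(printf 'a b c\nd e\n' | awk 'NR == 1 {
    # var receives the second record; $0 and NF keep the values of the first
    if ((getline var) > 0) print "var=" var, "$0=" $0, "NF=" NF, "NR=" NR
}')
echo "$out"
```

The result should be var=d e $0=a b c NF=3 NR=2: the record counter advanced, while $0 and NF still describe the first record.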

Getline from the specified file

Both forms above read the next record from the file currently being processed (or from stdin, which is less common). More often, though, getline is used to pull in data from another file while the current file is being processed. For example, suppose a.txt is a configuration file, and certain keywords in it mean that the content of another configuration file, c.txt, should be inserted.

getline without a parameter, reading from a file: the record is saved to $0 and split into fields, so $n and NF are set. NR and FNR are not set, because the data comes from a different file.

getline < filename
# awk 'NR==5{print NR,FNR,$0,$2,NF;getline <"c.txt";print NR,FNR,$0,$2,NF}' a.txt 
5 5 4   Kevin   male    21   [email protected]    17023929033 Kevin 6
5 5 aaa bbb ccc ddd bbb 4

getline with a parameter, reading from a file: the record is saved to the variable var. $0, $n, NF, NR and FNR are not set.

getline var < filename
# awk 'NR==5{print NR,FNR,$0,$2,NF;getline var <"c.txt";print NR,FNR,$0,$2,NF}' a.txt 
5 5 4   Kevin   male    21   [email protected]    17023929033 Kevin 6
5 5 4   Kevin   male    21   [email protected]    17023929033 Kevin 6
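The effect of reading another file into $0 can be checked in isolation. In this sketch the second file is a temporary file created just for the demonstration (an assumption, standing in for c.txt), and the main input is a single invented line.

```shell
tmp=$(mktemp)                       # hypothetical stand-in for c.txt
printf 'x y z w\n' > "$tmp"

out=$(printf 'a b\n' | awk -v f="$tmp" '{
    before = NR ":" NF              # NR and NF of the main input record
    if ((getline < f) > 0)          # replaces $0 and NF, leaves NR alone
        print before, NR ":" NF, $0
    close(f)
}')
echo "$out"
rm -f "$tmp"
```

This should print 1:2 1:4 x y z w. NF changes from 2 to 4 while NR stays at 1.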

When reading from a file with getline, the file name must be wrapped in double quotes; otherwise awk does not treat it as a string.

awk 'BEGIN{getline <"c.txt";print}' a.txt    # correct: "c.txt" is a string
awk 'BEGIN{getline <c.txt;print}' a.txt      # wrong: c.txt is not a string

The file path can be split into a directory and a file name stored in variables. When they are concatenated, parentheses are needed to get the right precedence.

awk 'BEGIN{dir="/root/awk";file="c.txt";getline < dir"/"file;print $0}' a.txt      # wrong: unparenthesized concatenation is ambiguous
awk 'BEGIN{dir="/root/awk";file="c.txt";getline < (dir"/"file);print $0}' a.txt    # correct
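A self-contained version of the parenthesized form might look like the sketch below. The temporary directory and the file contents are assumptions made for the demonstration; only the file name c.txt comes from the text.

```shell
dir=$(mktemp -d)                    # hypothetical directory holding c.txt
printf 'first line of c\n' > "$dir/c.txt"

out=$(awk -v dir="$dir" 'BEGIN {
    file = "c.txt"
    # Parentheses make the whole concatenation the redirection target
    if ((getline line < (dir "/" file)) > 0) print line
    close(dir "/" file)
}')
echo "$out"
rm -rf "$dir"
```

The same concatenated string is passed to close(), since awk identifies an open file by the exact name used to open it.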

The getline above reads only one record. To read the whole file, use a loop. First we change the contents of c.txt.

# cat c.txt
abc
def
ABC
DEF

Read the entire c.txt file. getline returns 0 at EOF, so the loop stops by itself.

# awk 'BEGIN{while((getline <"c.txt")>0){print}}' a.txt 
abc
def
ABC
DEF

Now we try to read and print c.txt a second time, after printing the first record of a.txt.

# awk 'BEGIN{while((getline <"c.txt")>0){print}}NR==1{print;while((getline <"c.txt")>0){print}}' a.txt 
abc
def
ABC
DEF
ID  name    gender  age  email          phone

The second attempt to print c.txt produced nothing. The reason is that each getline from c.txt reads one record and leaves a marker at the end of that record (similar to a file pointer).

abc | # marker after the first getline
def | # marker after the second getline
ABC | # marker after the third getline
DEF | # marker after the fourth getline

The loop in BEGIN runs 4 times, moving the marker each time; the next getline reads the record that follows the marker. After the BEGIN loop, the marker sits at the EOF of the file. Reaching EOF does not move it back to the start of the file; it stays there. In the loop in the main block, the very first test hits EOF, so the loop body never runs. That explains the output above.

Put another way, the file is opened only by the first getline. To move the marker back to the start of the file, we have to reopen the file, which means closing it first with the close() function.

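# awk 'BEGIN{while((getline <"c.txt")>0){print};close("c.txt")}NR==1{print;while((getline <"c.txt")>0){print};close("c.txt")}' a.txt 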
abc
def
ABC
DEF
ID  name    gender  age  email          phone
abc
def
ABC
DEF

The second close does not affect the output, but it is a good habit to close every file opened by getline; it avoids potential bugs.
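The marker behaviour can be checked by counting records instead of printing them. This sketch uses a made-up two-line temporary file in place of c.txt and reads it three times, closing it only before the third pass.

```shell
tmp=$(mktemp)                       # hypothetical stand-in for c.txt
printf 'abc\ndef\n' > "$tmp"

out=$(awk -v f="$tmp" 'BEGIN {
    while ((getline line < f) > 0) n1++    # first pass reads both records
    while ((getline line < f) > 0) n2++    # marker is still at EOF, nothing is read
    close(f)                               # next getline reopens the file
    while ((getline line < f) > 0) n3++    # reads both records again
    close(f)
    print n1, n2 + 0, n3
}')
echo "$out"
rm -f "$tmp"
```

The counts should come out as 2 0 2: the second loop reads nothing until close() resets the file.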

Getline from shell command results

"cmd" | getline

Reads one record from the output of the shell command cmd and saves it to $0. The record is split into fields, so $0, $n, NF and RT are set. NR is also incremented; FNR is not, because the data does not come from the current input file.

"cmd" | getline var

Reads one record from the output of the shell command cmd and saves it to the variable var. Only var, NR and RT are set.

As with getline from a file, cmd must be wrapped in double quotes. The output of a shell command can be treated like the contents of a file, and it should likewise be closed after reading.

# awk '/^1/{print;while("seq 1 5" | getline){print};{close("seq 1 5")}}' a.txt 
1   Bob     male    28   [email protected]     18023394012
1
2
3
4
5
10  Bruce   female  27   [email protected]   13942943905
1
2
3
4
5

Shell commands tend to be long, and each one has to be written at least twice (once to run it and once to close it), so it is convenient to save the command in a variable. If the shell command contains quotation marks, escape them as needed or, where possible, alternate between quote types.

# awk 'BEGIN{getDate="date +\"%F %T\""}/^1/{print;getDate|getline date;print date;close(getDate)}' a.txt 
1   Bob     male    28   [email protected]     18023394012
2021-01-08 10:22:05
10  Bruce   female  27   [email protected]   13942943905
2021-01-08 10:22:05

In this example, the double quotes of the date command are escaped with backslashes. Single quotes cannot be used here, because they would clash with the outer single quotes that wrap the awk code.

The shell command itself may also contain special characters such as pipes and redirections.

awk 'BEGIN{cmd="seq 1 5|xargs -i echo x{}y 2>/dev/null"}/^1/{print;while(cmd|getline){print};close(cmd)}' a.txt
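The pattern of storing the command in a variable, looping until getline stops returning a positive value, and then closing the command can be sketched as follows. The printf command here is a hypothetical stand-in chosen because its output is predictable.

```shell
out=$(awk 'BEGIN {
    # Hypothetical command; the inner double quotes and the backslash are escaped for awk
    cmd = "printf \"%s\\n\" one two"
    while ((cmd | getline line) > 0)
        seen = seen "[" line "]"
    close(cmd)                      # same string that opened the pipe
    print seen
}')
echo "$out"
```

This should print [one][two]. Comparing against > 0 rather than testing the bare value keeps the loop from spinning if getline ever returns -1.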

Getline from coprocess

The Chinese term for this concept maps to two different English terms, coroutine and coprocess. They are different concepts.

In awk we mean coprocess: a helper program that assists the main one. To explain it, let’s first look at a familiar bash construct.

cmd1 | cmd2 | cmd3 ...

This is a bash pipeline. The commands in a pipeline are tied together synchronously, whereas a coprocess runs asynchronously while still exchanging data through pipes.

cmd1 |& cmd2
cmd2 |& cmd3

This is only pseudocode, because bash implements coprocesses with the built-in command coproc. |& is the operator gawk uses for coprocesses. Here cmd2 is the coprocess.

Note that this kind of pipe is also called a two-way pipe.

When to use a coprocess: awk is powerful, but some tasks are awkward to implement in awk, or the user is simply more familiar with other commands. In that case we can send data from awk to a coprocess for processing and then read the result back into awk. The pseudocode is as follows.

awkPrint "data" |& shellCmd
shellCmd |& getline [var]

For example, if we did not know awk’s substring function substr(), we could use the shell command sed to extract the domain name from the email field.

First, we work out the sed command.

# echo "[email protected]" | sed -nr "s/.*@(.*)/\1/p"
qq.com

The awk code is fairly long, so we write it to a file and run it with the -f option. Double quotes and backslashes in the sed command must be escaped inside the awk string.

# cat getlineCoprocSed.awk 
BEGIN {
    CMD="sed -nr \"s/.*@(.*)/\\1/p\""
}

NR>1{
    print $5 |& CMD
    close(CMD,"to")
    CMD |& getline email_domain
    close(CMD)
    print email_domain
}
# awk -f getlineCoprocSed.awk a.txt 
qq.com
... ...
139.com

There are two close() calls in the code that deserve attention. Let’s look at the first one.

print $5 |& CMD
close(CMD,"to")

When the second argument of close() is "to", the pipe that writes data to the coprocess is closed, which amounts to sending EOF to the coprocess. It signals that we have finished writing and that the command in the coprocess (here, sed) can proceed. This is needed because some commands can only do their work once all of their input is available. Take sort: whatever the sort keys are, it must read all the data before it can sort, and it knows it has read everything only when it sees EOF. If a command needs EOF and never receives it, it blocks and waits. Try commenting out this close() to see the effect.

Now the second close() call.

CMD |& getline email_domain
close(CMD)

Here close() is called without a second argument, which closes the coprocess as a whole. Because the write end has already been closed, only the read end remains, so in this script the following two have the same effect.

close(CMD)
close(CMD,"from")

It closes the pipe that reads data from the coprocess. If the write end of the two-way pipe has been closed but the read end has not, the pipe still exists, and the same command string will reuse that same pipe the next time. If we comment out the second close() in getlineCoprocSed.awk, we get an error.

# awk -f getlineCoprocSed.awk a.txt 
qq.com
awk: getlineCoprocSed.awk:6: (FILENAME=a.txt FNR=3) fatal: print: attempt to write to closed write end of two-way pipe

When NR==2, qq.com is printed. When NR==3, the two-way pipe still exists because we did not close its read end while processing the previous record, and its write end was already closed, hence the error.

Therefore, the correct way to use a coprocess’s two-way pipe is:

  • After writing data to the coprocess, close the write end of the pipe (close(CMD, "to")).
  • After reading data from the coprocess, close the read end of the pipe (close(CMD[, "from"])).
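The write, close, read, close cycle can be tried on its own with a simpler helper. The sketch below assumes gawk is installed (the |& operator is a gawk extension) and uses tr as a hypothetical coprocess that upper-cases each record; the guard and the fallback message are only there so the snippet runs harmlessly where gawk is missing.

```shell
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'ab\ncd\n' | gawk '{
        cmd = "tr a-z A-Z"
        print $1 |& cmd                        # write one record to the coprocess
        close(cmd, "to")                       # send EOF so tr finishes
        if ((cmd |& getline upper) > 0) print upper
        close(cmd)                             # release the pipe before the next record
    }')
else
    out="gawk not installed"
fi
echo "$out"
```

With gawk available this should print AB and CD on separate lines. A fresh tr process is started for every record, which is simple but costly on large inputs.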

Let’s take another coprocess example. We want to sort the contents of a.txt by the age field. The output should match that of the sort command below, but we must produce it with awk.

sort -k4n a.txt

Idea: awk is the main program and sort is the helper. awk’s implicit loop sends every line from the second line onward to the coprocess. Once all the data has been sent (in the END block), sort does its work, and awk reads the sorted data back.

# cat getlineCoprocSort.awk 
BEGIN {
    cmd="sort -k4n"
}
NR==1 {
    print
}
NR>1 {
    print |& cmd
}
END {
    close(cmd, "to")    # required here; otherwise the sort coprocess blocks waiting for EOF
    while(cmd |& getline){
        print
    }
    close(cmd)          # not strictly needed at the very end of the program, but skipping it is a bad habit
}

There is one more point, which I don’t fully understand myself but list anyway.

If the command in the coprocess buffers its output by block, it needs to be switched to line buffering; otherwise getline will block.

cmd="cmdline"
cmd="stdbuf -oL cmdline"

Close() function

In awk, getline can read data from files or from command output. The file is opened (or the command run) only the first time getline is used on it. When the file or command output contains several records, each getline fetches just the next one; to read several records, use a loop.

Because of how getline works, once it has read every record of a dataset (my shorthand for the contents of a file or the output of a command), its marker stays at EOF, and getline cannot read that file or command output again. To read it again, you must close it. A closed dataset is reopened the next time it is used.

close("file")
close("cmd")

With getline from a coprocess, a two-way pipe is created: one end writes data to the coprocess and the other end reads data from it. Both ends need to be closed.

awkPrint "data" |& shellCmd    # close with close(shellCmd, "to")
shellCmd |& getline [var]      # close with close(shellCmd, "from"), or simply close(shellCmd)

Execute shell commands through the system() function

We can pipe the shell command to be executed into a shell interpreter.

# awk 'BEGIN{print "pwd" | "bash"}'
/root
# awk 'BEGIN{print "date" | "bash"}'
Sat Jan  9 15:36:49 CST 2021

The shell interpreter can be sh, bash, and so on. Either an absolute path or just the interpreter name can be used.
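A variant with sh as the interpreter, plus an explicit close(), might look like this. The echoed text is arbitrary and only serves to make the result visible.

```shell
out=$(awk 'BEGIN {
    print "echo hello from sh" | "sh"   # the printed text becomes the shell script
    close("sh")                         # wait for the shell to finish
}')
echo "$out"
```

Calling close("sh") makes awk wait for the shell before continuing, which matters when later output must appear after the command's output.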

We can also run shell commands with system(). The return value of system() is the exit status of the shell command. Commands run through system() may also include redirections and pipes.

# awk 'BEGIN{system("date +\"%F %T\"")}'
2021-01-09 15:40:14
# awk 'BEGIN{system("date +\"%F %T\">/dev/null")}'
# awk 'BEGIN{system("date +\"%F %T\"|cat")}'
2021-01-09 15:40:52

system() flushes awk’s output buffers before running the command. If the command is empty, system("") runs nothing and only flushes the buffers. For details, see the awk built-in function fflush().
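Both the flushing and the return value can be observed in one small sketch. The command string echo during; exit 3 is invented for the illustration; with gawk in its default mode the reported status is expected to be 3, though other awk implementations may encode it differently.

```shell
out=$(awk 'BEGIN {
    printf "before "                         # stays in the awk buffer, no newline yet
    status = system("echo during; exit 3")   # buffer is flushed, then the command runs
    print "after, status=" status
}')
echo "$out"
```

Because the buffer is flushed before the command starts, before appears ahead of during on the first line even though awk's output goes to a pipe here.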