Linux text-processing "three swordsmen": awk learning notes 10: Functions

Time:2021-9-15

Preface

I already covered the basic concept of a function when learning Bash functions, and I also studied the C language in college (although I have forgotten most of it), so I won't repeat the basics here.

Awk roughly divides functions into user-defined functions and built-in functions, but there is no essential difference between them. Functions we write ourselves are called user-defined functions, while the functions that ship with awk are called built-in functions. For an introduction to built-in functions, see here.

The goal of this post is to understand functions and learn how to create and use them, that is, how to write user-defined functions. Built-in functions can be used directly without understanding their internal implementation, but writing user-defined functions requires understanding how functions work.

Definition of function

function funcName([arg, ...]){
    ... function body ...
}

func funcName([arg, ...]){
    ... function body ...
}

Awk functions can be defined anywhere at the top level of the program, in any order. For example, a definition can go at any of the underscore positions below:

awk '_BEGIN{}_main{}_main{}_END{}_' ...

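This is because, before awk executes the BEGIN code block, it parses the whole program into an internal format, and function definitions are identified in this step. This is mentioned in the awk workflow notes.

Note: do not define a function inside the main code block. A definition belongs at the top level of the program, next to the code blocks rather than inside one. There is no reason to define a function again on every pass of the record loop, and awk treats a definition inside a block as a syntax error.

Therefore, a function can be called no matter where it is defined, even before its definition appears in the code.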

# awk 'BEGIN{f()}function f(){print "hello world"}'
hello world

Return value of function

A function uses a return statement to return a value. Once a return statement is executed, the statements after it in the function body are not executed.

# awk 'func re(){return 100;print "hello world"} BEGIN{a=re();print a;print re()}'
100
100

Note: the return value can also be a string.

# awk 'func re(){return "abc";print "hello world"} BEGIN{a=re();print a;print re()}'
abc
abc

If the function does not have a return statement or the return statement does not have a specific return value, an empty string is returned.

# awk 'func f(){} BEGIN{print "---"f()"---"}'
------
# awk 'func f(){return} BEGIN{print "---"f()"---"}'
------
# awk 'func f(){return 100} BEGIN{print "---"f()"---"}'
---100---

Parameters of function

Functions can take no parameters, but most of the time they take parameters, which makes the function more flexible when calling.

# cat funcArg.awk
func f(a,b){
    print a
    print b
    return a+b
}

BEGIN{
    x=10;y=20
    res=f(x,y)    # calling the function prints the values of x and y, and the return value is assigned to res
    print res
    print f(x,y)    # f() is called first, so its own print statements output x and y, and then the return value is printed
}
# awk -f funcArg.awk
10
20
30
10
20
30

The following function concatenates a string with itself repeatedly. It accepts two parameters: the string to repeat and the number of repetitions.

# cat funcCatStr.awk
func cat(str,count){
    for(i=1;i<=count;i++){
        newStr=newStr""str
    }
    return newStr
}
BEGIN{print cat("-",5)}
# awk -f funcCatStr.awk
-----

In programming languages, there are two kinds of function parameters: formal parameters and actual parameters.

The parameters named in a function definition, which declare what the function accepts, are called formal parameters. The values actually passed when the function is called are called actual parameters.

func f(x,y){}    # defines the formal parameters x and y
a=10;b=5
f(a,b)    # passes the actual parameters a and b

In English, "parameter" denotes a formal parameter and "argument" denotes an actual parameter. Where the distinction does not matter, either word can be used.

When calling a function, the number of arguments does not have to match the number of formal parameters. However, if more arguments are passed than there are formal parameters, awk reports a warning.

# awk 'func f(a,b){} BEGIN{f(1,2,3)}'
awk: cmd. line:1: warning: function `f' called with more arguments than declared
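
The opposite mismatch, fewer arguments than formal parameters, is accepted silently. A minimal sketch, wrapped in a shell command so that it can be run as-is. The function name show is made up for illustration, and the POSIX keyword function is used instead of the gawk shorthand func:

```shell
out=$(awk '
function show(a, b) {
    # b is not passed by the caller, so it is uninitialized here:
    # it reads as an empty string and as 0 in arithmetic.
    return "a=[" a "] b=[" b "] b+0=" (b + 0)
}
BEGIN { print show(1) }')
printf '%s\n' "$out"    # prints: a=[1] b=[] b+0=0
```

A parameter that receives no argument simply starts out empty, which is also what makes extra parameters usable as local variables.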

Parameter type conflict

Arguments and formal parameters must agree on whether they are scalars or arrays, otherwise awk reports an error.

# The argument is a numeric variable, but the formal parameter is used as an array.
# awk 'func f(a){a["name"]="alongdidi"} BEGIN{x=10;f(x)}'
awk: cmd. line:1: fatal: attempt to use scalar parameter `a' as an array
# The function is called first, so the argument x becomes an array through the formal parameter. Using x as a numeric variable later in BEGIN is therefore an error.
# awk 'func f(a){a["name"]="alongdidi"} BEGIN{f(x);x=10}'
awk: cmd. line:1: fatal: attempt to use array `x' in a scalar context

Parameter transfer method

First, let's review the concept of a variable in bash. A variable name refers to the address of a region of memory. When we reference a variable, we read the data stored in the memory at that address.

There are two ways to pass function parameters:

  1. Look up the data stored at the argument's address, copy it into a new region of memory, and point the formal parameter at that copy. This is called passing by value, and it allocates new memory.
  2. Pass the address of the argument's memory directly to the formal parameter, so that the argument and the formal parameter refer to the same memory. This works like a pointer and is called passing by reference.

With passing by value, the argument and the formal parameter use different memory, so even if they have the same variable name they refer to different addresses.

Therefore, when a parameter is passed by value, modifying it inside the function does not affect the variable outside the function, and vice versa.

In awk, if the argument is a number or a string, it is passed by value.

# awk 'func f(a){a=10} BEGIN{a=5;print a;f(a);print a}'
5
5
# awk 'func f(a){a="alonggege"} BEGIN{a="alongdidi";print a;f(a);print a}'
alongdidi
alongdidi

With passing by reference, the same memory is used, because the address is what gets passed. Therefore, modifying the parameter inside the function affects the variable outside the function, and vice versa. In awk, if the argument is an array, it is passed by reference.

# awk 'func f(a){a["name"]="alonggege"} BEGIN{a["name"]="alongdidi";print a["name"];f(a);print a["name"]}'
alongdidi
alonggege

Scope of variables in functions

Let's review the funcCatStr.awk code first.

# cat funcCatStr.awk
func cat(str,count){
    for(i=1;i<=count;i++){
        newStr=newStr""str
    }
    return newStr
}
BEGIN{print cat("-",5)}
# awk -f funcCatStr.awk
-----

Now we add a second call.

func cat(str,count){
    for(i=1;i<=count;i++){
        newStr=newStr""str
    }
    return newStr
}
BEGIN{print cat("-",5);print cat("+",5)}    # the second print is the new part

We might take it for granted that the output should be:

-----
+++++

However:

# awk -f funcCatStr.awk
-----
-----+++++

The reason lies in variable scope. In awk, variables used inside a function are global unless they appear in the parameter list. Therefore the variable newStr inside the function is a global variable. After the first function call its value is "-----". Because it is global, it is not released when the function returns, and the second call appends to "-----".

We can confirm this by printing newStr after the first function call.

# cat funcCatStr.awk
... ...
BEGIN{print cat("-",5);print newStr;print cat("+",5)}
# awk -f funcCatStr.awk
-----
-----
-----+++++

Awk has no keyword for declaring local variables. To make a variable behave like a local variable, list it in the parameter list of the function definition, that is, in the position of a formal parameter.

# cat funcCatStr.awk
func cat(str,count    ,newStr){
    for(i=1;i<=count;i++){
        newStr=newStr""str
    }
    return newStr
}
BEGIN{print cat("-",5);print newStr;print cat("+",5)}
# awk -f funcCatStr.awk
-----

+++++

newStr is not a real parameter. It sits in the parameter list only because we want it to behave like a local variable. By convention, the real formal parameters are written first and the extra parameters that serve as local variables are written after them, separated by several spaces.

When we see a function definition like the following, we should understand that the function takes only two arguments. Do not pass arguments for c and d, because they exist only as local variables.

func f(a,b    ,c,d){...}

Note also that all formal parameters, whether they are real parameters or stand-ins for local variables, are local to the function.

# cat funcCatStr.awk
func cat(str,count    ,newStr){
    for(i=1;i<=count;i++){
        newStr=newStr""str
    }
    return newStr
}
BEGIN{print cat("-",5);print cat("+",5);print str;print count}
# awk -f funcCatStr.awk
-----
+++++
    # Printing str outside the function gives an empty string.
    # Printing count outside the function gives an empty string.

Practice
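
One detail is easy to miss: the loop counter i in cat() is not in the parameter list, so it is still global. A minimal sketch of the difference, wrapped in a shell command so that it can be run as-is. The function names leaky and tidy are made up for illustration, and the POSIX keyword function is used instead of func:

```shell
out=$(awk '
# i is not in the parameter list, so the loop counter is a global variable.
function leaky(n,    sum) { for (i = 1; i <= n; i++) sum += i; return sum }
# i is listed after the real parameter, so it is local to the function.
function tidy(n,    sum, i) { for (i = 1; i <= n; i++) sum += i; return sum }
BEGIN {
    i = 100; r = leaky(3); print "leaky(3)=" r, "i=" i
    i = 100; r = tidy(3);  print "tidy(3)=" r, "i=" i
}')
printf '%s\n' "$out"
# prints:
# leaky(3)=6 i=4
# tidy(3)=6 i=100
```

Listing every loop counter as an extra parameter avoids surprises when a function is called from inside another loop that uses the same counter name.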

Write a function that reads all the data of the file at once

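# cat funcReadFile.awk
func readFile(file    ,RSBak,data){
    RSBak=RS
    RS="^$"    # gawk idiom: this regex can only match an empty file, so the whole file is read as one record
    if((getline data<file)<0) data=""    # assumed completion: the source listing is cut off after "getline data"
    close(file)
    RS=RSBak
    return data
}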
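
Setting RS="^$" relies on gawk treating RS as a regular expression. As a portable alternative, here is a minimal sketch that reads the file line by line with getline and joins the lines again. It is not the original listing: the function name readFilePortable, the variable path, and the scratch file are all made up for illustration.

```shell
# Hypothetical demo input: a scratch file with three lines.
tmpfile=$(mktemp)
printf 'line 1\nline 2\nline 3\n' > "$tmpfile"

out=$(awk -v path="$tmpfile" '
# Portable sketch: read the file line by line and join the lines again.
function readFilePortable(file,    line, data, rc) {
    while ((rc = (getline line < file)) > 0)
        data = data line "\n"
    close(file)
    return (rc < 0) ? "" : data    # rc < 0 means the file could not be read
}
BEGIN { printf "%s", readFilePortable(path) }')

printf '%s\n' "$out"
rm -f "$tmpfile"
```

The line-by-line version always ends the data with a newline, so it differs from a byte-exact read when the file has no trailing newline.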

Write a function that can reread the file

While processing a file, when some condition is met (for example, on reading line 3), we want to start reading the file again from the beginning.

PS: personally, I think this example is strange and the requirements are strange.

# cat funcRewind.awk 
func rewind(){
    for(i=ARGC;i>ARGIND;i--){
        ARGV[i]=ARGV[i-1]
    }
    ARGC++
    nextfile
}

NR==3{    # if this is changed to FNR, it falls into an endless loop
    print
    rewind()
}

{
    print
}

# awk -f funcRewind.awk a.txt 
ID  name    gender  age  email          phone
1   Bob     male    28   [email protected]     18023394012
2   Alice   female  24   [email protected]  18084925203
ID  name    gender  age  email          phone
1   Bob     male    28   [email protected]     18023394012
2   Alice   female  24   [email protected]  18084925203
3   Tony    male    21   [email protected]    17048792503
4   Kevin   male    21   [email protected]    17023929033
5   Alex    male    18   [email protected]    18185904230
6   Andy    female  22   [email protected]    18923902352
7   Jerry   female  25   [email protected]  18785234906
8   Peter   male    20   [email protected]     17729348758
9   Steven  female  23   [email protected]    15947893212
10  Bruce   female  27   [email protected]   13942943905

Format the output of the array

When we have an array, we can’t directly use the print statement to output all of it.

# awk 'BEGIN{arr["name"]="alongdidi";arr["age"]=29;arr["gender"]="male";print arr}'
awk: cmd. line:1: fatal: attempt to use array `arr' in a scalar context

Now we write a custom function to output the data of this array. The output format is as follows.

{
    arr["name"]="alongdidi"
    arr["age"]=29
    arr["gender"]="male"
}
# cat funcA2S.awk
func a2s(arr    ,str){
    for(i in arr){
        str=str""(sprintf("\tarr[\"%s\"]=%s\n",i,arr[i]))
    }
    return "{\n"str"}"
}

BEGIN{
    arr["name"]="alongdidi"
    arr["age"]=29
    arr["gender"]="male"
    print a2s(arr)
}
# awk -f funcA2S.awk
{
    arr["age"]=29
    arr["name"]=alongdidi
    arr["gender"]=male
}
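
The output order above (age, name, gender) differs from the assignment order because for (i in arr) visits the keys in an unspecified order. If a stable order matters, one option is to sort the keys first. A minimal sketch under that assumption, wrapped in a shell command so that it can be run as-is. The name a2sSorted and the insertion sort are additions for illustration, not part of the original script:

```shell
out=$(awk '
# Hypothetical variant of a2s(): collect the keys, sort them with a
# simple insertion sort, then format the entries in that order.
function a2sSorted(arr,    keys, n, i, j, k, tmp, str) {
    n = 0
    for (k in arr)
        keys[++n] = k
    for (i = 2; i <= n; i++) {
        tmp = keys[i]
        for (j = i - 1; j >= 1 && keys[j] > tmp; j--)
            keys[j + 1] = keys[j]
        keys[j + 1] = tmp
    }
    for (i = 1; i <= n; i++)
        str = str sprintf("\tarr[\"%s\"]=%s\n", keys[i], arr[keys[i]])
    return "{\n" str "}"
}
BEGIN {
    arr["name"] = "alongdidi"; arr["age"] = 29; arr["gender"] = "male"
    print a2sSorted(arr)
}')
printf '%s\n' "$out"
```

Every helper variable, including the keys array, is listed after the real parameter, so nothing leaks into the global scope.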

Recognize a file whose name contains an equals sign

With a command line like the first one below, awk treats a=b as a variable assignment rather than a file. If a=b really is a file, we only need to give it a relative path, as in the second line.

awk -f xxx.awk a=b a.txt c.txt
awk -f xxx.awk ./a=b a.txt c.txt
# awk '{print}' a=b
^C
# awk '{print}' ./a=b
aaa
bbb
ccc

Idea:

  • All command-line arguments are stored in ARGV, so look there for file names that look like variable assignments.
  • What a variable assignment looks like:
    • It contains an equals sign.
    • The variable name before the equals sign consists of digits, letters, and underscores, and can only start with a letter or an underscore.
  • Once such an argument is found, replace it in ARGV.
  • Provide a switch option, because file names that look like variable assignments are not encountered every time.

# cat funcRecogAssignFile.awk
func recog(argv,argc    ,i){
    for(i=1;i
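
The listing above breaks off after `for(i=1;i`. The following is a minimal sketch of the idea described in the list, not the original code: the switch variable flag, the regular expression, and the scratch directory are assumptions made for illustration. It prefixes "./" to every argument that looks like an assignment, so awk opens it as a file.

```shell
# Hypothetical demo: a scratch directory holding a file literally named "a=b".
tmpdir=$(mktemp -d)
printf 'aaa\nbbb\nccc\n' > "$tmpdir/a=b"

out=$(cd "$tmpdir" && awk -v flag=1 '
# Rewrite every argument that looks like name=value into ./name=value.
function recog(argv, argc,    i) {
    for (i = 1; i < argc; i++)
        if (argv[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=/)
            argv[i] = "./" argv[i]
}
# flag is the switch: the rewrite only happens when it is set with -v.
BEGIN { if (flag) recog(ARGV, ARGC) }
{ print }
' a=b)

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Because ARGV is an array, it is passed by reference, so the changes made inside recog() are visible to awk when it starts opening the files.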

Recognizing time formats

After studying the built-in time functions, we can use them to recognize some common time formats in log files and convert them to epoch values (that is, timestamps). This will help with later, more in-depth operations work.

Note: if you only need time comparison accurate to days, you can simply write as follows:

awk '/2019-11-08/{print}' access.log
sed -nr '/2019-11-08/p' access.log

In operations work, timestamps in log files generally take one of the following two forms:

2019-11-11T03:42:42+08:00
Sat 26. Jan 15:36:24 CET 2013

These date-time formats cannot be compared directly and must first be converted to epoch values. However, mktime() cannot convert them directly either. They have to be preprocessed into the format that mktime() accepts:

mktime("YYYY MM DD HH MM SS [DST]"[,utc-flag])

Therefore, we can customize two functions to convert the above two formats to epoch values.

str1ToTime()

2019-11-11T03:42:42+08:00

Idea:

  1. Convert the string to the "YYYY MM DD HH MM SS" format recognized by mktime(). Note: the fields must be separated by spaces, which sprintf() does here.
  2. Then use mktime() to produce the timestamp.

# cat str1ToTime.awk
func str1ToTime(str    ,newStr,Y,M,D,H,m,S,arr){
    newStr=gensub("[-:T+]+"," ","g",str) # 2019 11 11 03 42 42 08 00
    split(newStr,arr)
    Y=arr[1]
    M=arr[2]
    D=arr[3]
    H=arr[4]
    m=arr[5]
    S=arr[6]
    # Do not write mktime(Y M D H m S): juxtaposed variables are concatenated
    # without spaces ("20191111034242"), so mktime() returns -1.
    # Build the space-separated string with sprintf() instead.
    return mktime(sprintf("%d %d %d %d %d %d",Y,M,D,H,m,S))
}

BEGIN{
    print str1ToTime("2019-11-11T03:42:42+08:00")
    print str1ToTime("2021-11-11T03:42:42+08:00")
    print (str1ToTime("2019-11-11T03:42:42+08:00") < str1ToTime("2021-11-11T03:42:42+08:00"))
}

# awk -f str1ToTime.awk
1573414962
1636573362
1

Note that the date and time fields stored in the array cannot simply be written next to each other in the mktime() call. They have to be joined with spaces, for example with sprintf().

mktime(Y M D H m S)    # juxtaposition is string concatenation, so mktime() receives the single string "20191111034242" and returns -1
mktime("Y M D H m S")    # inside quotes the variables are not expanded, so mktime() receives the literal text
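
The difference is easy to see by printing the strings that would be handed to mktime(). A minimal sketch, wrapped in a shell command so that it can be run as-is. It only builds the strings and does not call mktime(), so it does not need gawk:

```shell
out=$(awk 'BEGIN {
    Y = "2019"; M = "11"; D = "11"; H = "03"; m = "42"; S = "42"
    print Y M D H m S                                      # juxtaposition: no separators at all
    print Y " " M " " D " " H " " m " " S                  # explicit spaces, leading zeros kept
    print sprintf("%d %d %d %d %d %d", Y, M, D, H, m, S)   # %d turns 03 into 3
}')
printf '%s\n' "$out"
# prints:
# 20191111034242
# 2019 11 11 03 42 42
# 2019 11 11 3 42 42
```

Only the last two forms contain the six space-separated fields that mktime() expects.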

str2ToTime()

Sat 26. Jan 15:36:24 CET 2013

Idea:

  • Similar to str1ToTime(). The difference is that the month name "Jan" has to be recognized, so a mapping function is needed.

# cat str2ToTime.awk
func str2ToTime(str    ,Y,M,D,H,m,S,arr){
    patsplit(str,arr,"[[:alnum:]]+")
    Y=arr[8]
    M=monthMap(arr[3])
    D=arr[2]
    H=arr[4]
    m=arr[5]
    S=arr[6]
    return mktime(sprintf("%d %d %d %d %d %d",Y,M,D,H,m,S))
}

func monthMap(mon    ,mrr){
    mrr["Jan"]=1
    mrr["Feb"]=2
    mrr["Mar"]=3
    mrr["Apr"]=4
    mrr["May"]=5
    mrr["Jun"]=6
    mrr["Jul"]=7
    mrr["Aug"]=8
    mrr["Sep"]=9
    mrr["Oct"]=10
    mrr["Nov"]=11
    mrr["Dec"]=12
    return mrr[mon]
}

BEGIN{
    print str2ToTime("Sat 26. Jan 15:36:24 CET 2013")
    print mktime("2013 01 26 15 36 24")
}
# awk -f str2ToTime.awk
1359185784
1359185784

str3ToTime()

The mapping function can be simplified so that it does not take up so many lines.

Every month abbreviation is 3 characters long. Based on this, we can use index() and simple arithmetic to compute the month number that corresponds to an abbreviation.

# cat str3ToTime.awk
func str3ToTime(str    ,Y,M,D,H,m,S,arr){
    patsplit(str,arr,"[[:alnum:]]+")
    Y=arr[8]
    M=monthMap(arr[3])
    D=arr[2]
    H=arr[4]
    m=arr[5]
    S=arr[6]
    return mktime(sprintf("%d %d %d %d %d %d",Y,M,D,H,m,S))
}

func monthMap(mon){
    return (index("JanFebMarAprMayJunJulAugSepOctNovDec",mon)+2)/3
}

BEGIN{
    print str3ToTime("Sat 26. Jan 15:36:24 CET 2013")
    print mktime("2013 01 26 15 36 24")
}
# awk -f str3ToTime.awk
1359185784
1359185784
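
The index arithmetic can be checked on its own. "Nov" starts at position 31 in the month string, and (31+2)/3 = 11. The sketch below, wrapped in a shell command so that it can be run as-is, adds a validity check that is not in the original monthMap(): an input that is not a real abbreviation returns 0 instead of a fraction.

```shell
out=$(awk '
function monthMap(mon,    pos) {
    pos = index("JanFebMarAprMayJunJulAugSepOctNovDec", mon)
    # A valid abbreviation is 3 characters long and starts at position 1, 4, 7, ...
    if (length(mon) != 3 || pos == 0 || pos % 3 != 1)
        return 0
    return (pos + 2) / 3
}
BEGIN {
    print monthMap("Jan"), monthMap("Nov"), monthMap("Dec"), monthMap("anF"), monthMap("Foo")
}')
printf '%s\n' "$out"    # prints: 1 11 12 0 0
```

Without the check, a string such as "anF" would be found at position 2 and yield a fractional month number.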

Real log file

The original blogger [Junma Jinlong] provided two log files whose contents are basically the same except for the time field. Let's take the same five lines from each for processing.

The log files are stored on Baidu Netdisk; the extraction code is jtlg.

# cat access1.log
111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-"
111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /videos/index/ HTTP/1.1" 301 169 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-"
50.7.235.2 - - [2019-11-07T03:11:32+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0" "-"
50.7.235.2 - - [2019-11-07T03:11:33+08:00] "GET / HTTP/1.1" 301 169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0" "-"
54.36.149.32 - - [2019-11-07T03:15:03+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"

# cat access2.log
111.202.100.141 - - [07/Nov/2019:03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-"
111.202.100.141 - - [07/Nov/2019:03:11:02+08:00] "GET /videos/index/ HTTP/1.1" 301 169 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-"
50.7.235.2 - - [07/Nov/2019:03:11:32+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0" "-"
50.7.235.2 - - [07/Nov/2019:03:11:33+08:00] "GET / HTTP/1.1" 301 169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0" "-"
54.36.149.32 - - [07/Nov/2019:03:15:03+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"

Objective: output the log entries later than "2019-11-07 03:11:32".

For access1.log:

# cat access1.awk
BEGIN{
    compareTime=mktime("2019 11 07 03 11 32")
}

{
    patsplit($4,arr,"[[:digit:]]{1,4}")
    Y=arr[1]
    M=arr[2]
    D=arr[3]
    H=arr[4]
    m=arr[5]
    S=arr[6]
    time=mktime(sprintf("%d %d %d %d %d %d",Y,M,D,H,m,S))
    if(time>compareTime){
        print
    }
}
# awk -f access1.awk access1.log 
50.7.235.2 - - [2019-11-07T03:11:33+08:00] "GET / HTTP/1.1" 301 169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0" "-"
54.36.149.32 - - [2019-11-07T03:15:03+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"

For access2.log:

# cat access2.awk
func monthMap(mon){
    return (index("JanFebMarAprMayJunJulAugSepOctNovDec",mon)+2)/3
}

BEGIN{
    compareTime=mktime("2019 11 07 03 11 32")
}

{
    patsplit($4,arr,"[[:alnum:]]{1,4}")
    Y=arr[3]
    M=monthMap(arr[2])
    D=arr[1]
    H=arr[4]
    m=arr[5]
    S=arr[6]
    time=mktime(sprintf("%d %d %d %d %d %d",Y,M,D,H,m,S))
    if(time>compareTime){
        print
    }
}
# awk -f access2.awk access2.log 
50.7.235.2 - - [07/Nov/2019:03:11:33+08:00] "GET / HTTP/1.1" 301 169 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0" "-"
54.36.149.32 - - [07/Nov/2019:03:15:03+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"