The "three swordsmen" of Linux text processing: awk learning notes 08, arrays

Time:2021-11-24

array

We have already met arrays in bash. The main difference is that awk arrays are associative arrays, whereas bash arrays are numerically indexed by default.

Suppose such an array exists.

arr=["zhangsan","lisi","wangwu"]

The subscript of the numeric index is a numeric value starting from 0.

arr[0] ==> "zhangsan"
arr[1] ==> "lisi"
arr[2] ==> "wangwu"

Each element of a numerically indexed array can only store one kind of information. If you want to store several kinds, you usually pack them into one element and join them with a common delimiter.

arr=["zhangsan:18","lisi:28","wangwu:38"]

You then have to split each element on the delimiter when processing it. The subscript of an associative array is a string, so the index itself carries information and the array naturally stores more than a numerically indexed one.

arr["zhangsan"] ==> 18
arr["lisi"] ==> 28
arr["wangwu"] ==> 38
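To make the contrast concrete, here is a minimal sketch. The function name lookup_age and the array names packed, parts and age are made up for illustration. It first unpacks the delimiter-joined elements with split(), then stores the same data with the name as the index, so that a lookup needs no further parsing.

```shell
lookup_age() {
    awk -v who="$1" 'BEGIN {
        # Packed form: each element holds "name:age" and has to be split again.
        n = split("zhangsan:18 lisi:28 wangwu:38", packed, " ")
        for (i = 1; i <= n; i++) {
            split(packed[i], parts, ":")
            age[parts[1]] = parts[2]    # associative form: the name is the index
        }
        print age[who]
    }'
}

lookup_age lisi
```

Called as above, the sketch should print 28. Note that split() numbers the elements it creates from 1, not from 0.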

The index (subscript) of the associative array can also be a numeric value, but it will be converted into a string internally.

Data structures that store data as key-value pairs, such as associative arrays, are called maps, hashes or dictionaries in other programming languages.

The order of the elements in an associative array is hard to predict.

  • Even if the string indices look sequential, each index is hashed internally to decide where it is stored.
  • The order is therefore also unrelated to the order in which the elements were stored.

Awk arrays also support multidimensional arrays and arrays of arrays.

Creation, access and assignment of arrays

There is no special statement for the creation of arrays in awk. When we access the array or assign values to the array elements for the first time, the array is automatically created.

arr[idx]            # access an array element
arr[idx] = value    # assign a value to an array element

Do not name the idx variable index: awk has a built-in function called index().

As mentioned earlier, an array index is automatically converted to a string even if it is numeric. This leads to a few pitfalls.

arr[1] and arr["1"] are equivalent. awk converts the number 1 to the string "1" and stores that as the index.

# awk 'BEGIN{arr[1]=10;arr["1"]=20;print arr[1];print arr["1"]}'
20
20

When a numeric subscript is converted to a string, the conversion follows the predefined variable CONVFMT (default %.6g). Integer-valued subscripts are always converted as integers.

# awk 'BEGIN{arr[123.45678]="alongdidi";print arr[123.45678]}'
alongdidi    # stored and retrieved with the same number, so it is found
# awk 'BEGIN{arr[123.45678]="alongdidi";print arr["123.45678"]}'
    # stored with a number, retrieved with a string: even though they look "the same", nothing is found, because the number was converted according to CONVFMT
# awk 'BEGIN{arr[123.45678]="alongdidi";print arr["123.457"]}'
alongdidi    # a string subscript must match the actual string produced by the conversion
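The effect of CONVFMT on subscripts is easiest to see by printing the index that awk actually stored. The following is a small sketch; the function name show_subscripts and the array names a, b and c are illustrative only.

```shell
show_subscripts() {
    awk 'BEGIN {
        a[123.45678] = 1
        for (k in a) print "default:", k        # converted with %.6g

        CONVFMT = "%.2f"
        b[123.45678] = 1
        for (k in b) print "two decimals:", k   # converted with %.2f

        c[12] = 1
        for (k in c) print "integer:", k        # integer values ignore CONVFMT
    }'
}

show_subscripts
```

This should print 123.457, 123.46 and 12 for the three cases. In practice it is safest to avoid non-integer numbers as subscripts altogether.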

If we access an element that does not exist in the array (the index is used for the first time), the element is created with an empty value. This is usually not what we want. An associative array stores two pieces of information per element (index and value); an element with an empty value is invalid data that wastes memory, and too many empty elements also hurt the array's performance.

# awk 'BEGIN{arr[1];arr[2];print length(arr)}'
2

Length of array

Use the length() function to get the number of elements in an array. It also returns the length of numeric values and strings.

# awk 'BEGIN{arr["name"];arr["age"]=29;print length(arr)}'
2
# awk 'BEGIN{print length(100),length("alongdidi"),length(3.1415)}'
3 9 6

Deletion of array elements

delete arr[idx]    # delete one specific element of the array
delete arr         # delete all elements of the array
# awk 'BEGIN{arr[1];arr[2];print length(arr);delete arr[1];print length(arr)}'
2
1
# awk 'BEGIN{arr[1];arr[2];print length(arr);delete arr;print length(arr)}'
2
0

Array judgment

There are two ways to determine whether a variable name is an array. Both are gawk extensions.

typeof(var)     # returns the string "array" if var is an array
isarray(var)    # returns 1 if var is an array
# awk 'BEGIN{arr[1];print typeof(arr)}'
array
# awk 'BEGIN{arr[1];print isarray(arr)}'
1

Even after all of its elements are deleted, the variable is still an array. So I think of it this way: deleting an array does not free its name, which can only be used as an array afterwards; using it as a scalar is an error.

# awk 'BEGIN{arr[1];delete arr;print typeof(arr)}'
array
# awk 'BEGIN{arr[1];delete arr;print isarray(arr)}'
1
# awk 'BEGIN{arr[1];delete arr;arr="alongdidi";print typeof(arr)}'
awk: cmd. line:1: fatal: attempt to use array `arr' in a scalar context

Judgment of array elements

As mentioned above, the value of an unassigned element is empty, so someone may use this method to judge whether an array element exists.

if(arr[idx]=="") {
    print "Element does not exist."
} else {
    print "Element exists."
}

There are two problems with this.

  • The element may exist with an empty value, in which case the test wrongly reports it as missing.
  • If the element did not exist, the reference in the test creates it with an empty value, so it now occupies space in the array.

We can use this method to determine whether the array elements exist.

if("idx" in arr) {
    print "Element exists."
} else {
    print "Element does not exist."
}

"idx" here stands for a specific index name. The expression "idx" in arr returns 1 if the element exists and 0 if it does not. The double quotes around idx matter: with quotes it is a string literal naming the index to test, whereas an unquoted idx is a variable, which is the form used to traverse the array.

# awk 'BEGIN{arr["name"];print("age" in arr)}'
0
# awk 'BEGIN{arr["name"]="alongdidi";if("name" in arr){print "exist"}}'
exist
# awk 'BEGIN{arr["name"];if("name" in arr){print "exist"}}'
exist
# awk 'BEGIN{arr["name"];if("age" in arr){print "exist"}else{print "not exist"}}'
not exist
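The difference between the two tests can be checked by watching length(arr). This is a minimal sketch (the function name probe_demo is made up): the in test leaves the array untouched, while comparing arr["age"] with "" silently creates the element.

```shell
probe_demo() {
    awk 'BEGIN {
        arr["name"] = ""                                 # exists, but its value is empty

        if ("age" in arr) print "age exists"             # membership test, creates nothing
        print length(arr)                                # still 1

        if (arr["age"] == "") print "age looks empty"    # this reference creates arr["age"]
        print length(arr)                                # now 2
    }'
}

probe_demo
```

The expected output is 1, then "age looks empty", then 2.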

Of course, once an element is removed with delete, it no longer exists. This differs from deleting the whole array, after which the variable name is still an array.

# awk 'BEGIN{arr["name"];print("name" in arr);delete arr["name"];print("name" in arr)}'
1
0

Traversal of array

Traversing an awk array is similar to traversing a bash array: both have a "for i in arr" form. Let’s set that form aside for a moment.

Let’s take a look at an example, assuming that the index of the associative array is numeric (it will eventually be converted into a string).

# awk 'BEGIN{arr[1]=10;arr[2]=20;arr[3]=30;arr[4]=40;for(i=1;i<=length(arr);i++){print i"-->"arr[i]}}'
1-->10
2-->20
3-->30
4-->40

When the values are continuous, there is no problem doing so. But what if the values are discontinuous?

# awk 'BEGIN{arr[1]=10;arr[2]=20;arr[3]=30;arr[8]=80;for(i=1;i<=length(arr);i++){print i"-->"arr[i]}}'
1-->10
2-->20
3-->30
4-->
5-->
6-->
7-->
8-->80

The reason is that during the first three iterations of the for loop, length(arr) is 4, but in the fourth iteration we try to output:

4-->arr[4]

arr[4] does not exist, but this reference creates it with an empty value. Once it has been created, length(arr) in the next iteration is 5, and so on, which eventually leads to this result.

Fortunately, the gap between arr[3] and arr[8] is small. With a large gap, many empty elements would be created (degrading performance). And because awk arrays are associative, if one of the indices is a non-numeric string, the counter never reaches it while length(arr) keeps growing, so the loop never ends.

awk 'BEGIN{arr[1]=10;arr[2]=20;arr[3]=30;arr["name"]="alongdidi";for(i=1;i<=length(arr);i++){print i"-->"arr[i]}}'

We can stop awk from creating empty elements by checking in advance whether the element exists.

# awk 'BEGIN{arr[1]=10;arr[2]=20;arr[3]=30;arr[8]=80;for(i=1;i<=length(arr);i++){if(i in arr){print i"-->"arr[i]}}}'
1-->10
2-->20
3-->30

But now arr[8] is missing from the output, so the traversal is incomplete.

However, awk has its own way of traversing arrays, which is similar to bash’s way of traversing arrays.

for(idx in arr) {
    print idx"-->"arr[idx]
}

idx here is a variable that receives each index of the associative array in turn, so do not wrap it in double quotes.

# awk 'BEGIN{arr[1]=10;arr[2]=20;arr[3]=30;arr[8]=80;for(i in arr){print i"-->"arr[i]}}'
1-->10
2-->20
3-->30
8-->80
# awk 'BEGIN{arr["name"]="alongdidi";arr["country"]="china";arr["age"]=29;arr["gender"]="male";for(i in arr){print i"-->"arr[i]}}'
age-->29
country-->china
name-->alongdidi
gender-->male

Note: as mentioned earlier, the traversal order is not necessarily the order we would expect.
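If a fixed order is needed and only the plain for-in traversal is available, one option is to let an external sort put the output in order. The following is a sketch under that assumption; the function name sorted_traverse is hypothetical, and sort -n orders the lines by their leading number.

```shell
sorted_traverse() {
    awk 'BEGIN {
        arr[1] = 10; arr[2] = 20; arr[3] = 30; arr[8] = 80; arr[10] = 100
        # The loop itself may visit the indices in any order.
        for (i in arr) print i "-->" arr[i]
    }' | sort -n
}

sorted_traverse
```

Whatever order awk visits the elements in, the lines should come out as 1, 2, 3, 8, 10.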

Traversal order

By default, the traversal order of an array should be treated as unordered and unpredictable. In gawk, however, we can set the predefined variable PROCINFO["sorted_in"] to specify the traversal order.

Its value can be the name of a user-defined function or one of awk's predefined orderings.

With a user-defined function, the traversal order follows the rules of that function. The author did not explain this part in the video; it is probably less used or more complex, so it is left blank for now.

The default is @unsorted, meaning unordered; the @ character is a fixed prefix. The remaining values are composed as follows.

@x_y_z

x: whether to sort by index (ind) or by element value (val).

y: how to compare: as strings (str), as numbers (num), or by type (type). Sorting by type in ascending order puts numbers first, then strings, then arrays.

z: ascending (asc) or descending (desc).

@unsorted: unordered.
@ind_str_asc: sort by index, compared as strings, ascending.
@ind_str_desc: sort by index, compared as strings, descending.
@ind_num_asc: sort by index, compared as numbers, ascending. Indices that cannot be converted to numbers are treated as 0.
@ind_num_desc: sort by index, compared as numbers, descending. Indices that cannot be converted to numbers are treated as 0.
@val_type_asc: sort by element value, compared by data type, ascending.
@val_type_desc: sort by element value, compared by data type, descending.
@val_str_asc: sort by element value, compared as strings, ascending.
@val_str_desc: sort by element value, compared as strings, descending.
@val_num_asc: sort by element value, compared as numbers, ascending.
@val_num_desc: sort by element value, compared as numbers, descending.

Examples are as follows.

# cat sortArray.awk 
BEGIN{
    PROCINFO["sorted_in"]="@ind_num_desc"
    arr[1]="one"
    arr[2]="two"
    arr[3]="three"
    arr["a"]="aa"
    arr["b"]="bb"
    arr[10]="ten"
    for(idx in arr){
        print idx"-->"arr[idx]
    }
}
# awk -f sortArray.awk
10-->ten
3-->three
2-->two
1-->one
b-->bb
a-->aa
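For completeness, here is a hedged sketch of the user-defined-function form of PROCINFO["sorted_in"]. It assumes gawk; the names by_value_length and cmp_len and the sample data are invented for illustration. gawk calls the comparison function with the index and value of two elements and expects a negative, zero or positive result, much like a qsort() comparator.

```shell
by_value_length() {
    gawk '
    # Order by the length of the value; break ties by comparing the indices.
    function cmp_len(i1, v1, i2, v2) {
        if (length(v1) != length(v2))
            return length(v1) - length(v2)
        return (i1 < i2) ? -1 : (i1 > i2)
    }
    BEGIN {
        PROCINFO["sorted_in"] = "cmp_len"
        arr["a"] = "three"; arr["b"] = "one"; arr["c"] = "ten"; arr["d"] = "eleven"
        for (idx in arr) print idx "-->" arr[idx]
    }'
}
```

Running by_value_length should list b-->one, c-->ten, a-->three and d-->eleven, in that order.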

Multidimensional array

A numerically indexed array stores only one piece of useful information per element, namely the element's value; the index is generally just a non-negative integer with no meaning of its own.

An associative array holds two pieces of useful information per element: the index and the element value.

When two pieces are not enough, we can put the extra information in either the index or the element value. If it goes in the element value, we have to split the value ourselves. If it goes in the index, a multidimensional array does the job.

The use method is

arr[x,y]

The array index is split into x and y, which lets us store one more piece of information. Although we separate x and y with a comma when writing, awk joins them internally with the predefined variable SUBSEP.

If we set SUBSEP to the @ sign, we can reference the element directly as arr["x@y"].

# awk 'BEGIN{SUBSEP="@";arr["x","y"]="alongdidi";print arr["x","y"];print arr["x@y"]}'
alongdidi
alongdidi

The default value of SUBSEP is "\034", a non-printable character. It is enough to know that awk multidimensional arrays work this way; in practice we keep using commas and there is no need to change this value.

Multidimensional arrays differ from one-dimensional arrays mainly in how the subscript is written; everything else is the same.

arr[x]
arr[x,y]
if(x in arr)
if((x,y) in arr)
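A short sketch of both operations on a two-dimensional index follows (the function name multi_dim_demo is made up). The membership test needs the subscripts in parentheses, and when traversing, the combined key can be split on SUBSEP to recover the individual subscripts.

```shell
multi_dim_demo() {
    awk 'BEGIN {
        arr["x", "y"] = "alongdidi"

        if (("x", "y") in arr) print "found"    # the parentheses are required

        for (key in arr) {
            n = split(key, part, SUBSEP)        # recover the two subscripts
            print n, part[1], part[2]
        }
    }'
}

multi_dim_demo
```

This should print "found" followed by "2 x y".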

Next, let’s look at an example of using a multidimensional array. Suppose we have the following data:

# cat d.txt
1 2 3 4 5 6
2 3 4 5 6 1
3 4 5 6 1 2
4 5 6 1 2 3

We want to rotate it 90° clockwise.

4 3 2 1
5 4 3 2
6 5 4 3
1 6 5 4
2 1 6 5
3 2 1 6

Talk is cheap, show me the code!

# cat multiDimensionalArray.awk 
{
    for(i=1;i<=NF;i++){
        arr[NR,i]=$i
    }
}
END{
    for(j=1;j<=NF;j++){
        for(i=NR;i>=1;i--){
            printf "%d ",arr[i,j]
        }
        printf "\n"
    }
}
# awk -f multiDimensionalArray.awk d.txt 
4 3 2 1 
5 4 3 2 
6 5 4 3 
1 6 5 4 
2 1 6 5 
3 2 1 6

Nested Array

Nested arrays are arrays of arrays. The author did not explain them in the video, so they are skipped for now. They are probably more complex and rarely needed in daily use; even the multidimensional arrays above are probably seldom used.
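As a rough illustration only, and assuming gawk 4.0 or later (arrays of arrays are a gawk extension), a nested array is written with one pair of brackets per level. The function name nested_demo and the sample data are hypothetical.

```shell
nested_demo() {
    gawk 'BEGIN {
        user["zhangsan"]["age"]  = 18
        user["zhangsan"]["city"] = "beijing"    # hypothetical sample data
        user["lisi"]["age"]      = 28

        print isarray(user["zhangsan"])         # 1: the element is itself an array
        print user["lisi"]["age"]
        print length(user["zhangsan"])          # number of entries in the subarray
    }'
}
```

Running nested_demo should print 1, 28 and 2. Unlike arr[x,y], each level here is a real array that can be traversed on its own with for-in.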

 

Arrays in practice

Remove duplicate lines

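First look at the contents of the sample file x.log (the # notes are annotations, not part of the file).

# cat x.log 
abc      # appears 3 times
def
ghi      # appears 2 times
abc
ghi
xyz
         # 2 blank lines here
mnopq
abc

Idea 1: in the main block, save $0 as an index of an associative array; then traverse the array indices in the END block.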

# cat quchong.awk 
{
    arr[$0]
}
END{
    for(i in arr){
        print i
    }
}
# awk -f quchong.awk x.log 

def
mnopq
abc
ghi
xyz

Disadvantage: the output order is not guaranteed to match the order of the original file.

Idea 2: to preserve the order, for each $0 we check whether it is already an index of the array. If it is, we do nothing; otherwise we print it and add it as an index.

# awk '{if(!($0 in arr)){print $0;arr[$0]}}' x.log 
abc
def
ghi
xyz

mnopq
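The same idea is often written as a one-line pattern. The sketch below uses a hypothetical wrapper function dedupe and array name seen; it relies on seen[$0]++ returning 0 (false) the first time a line appears, so the negated pattern is true only for first occurrences and awk's default action prints the line.

```shell
dedupe() {
    # Print each distinct line once, in order of first appearance.
    awk '!seen[$0]++' "$@"
}

printf 'abc\ndef\nabc\n\n\nxyz\n' | dedupe
```

For this input the output should be abc, def, one empty line, and xyz.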

Count the number of occurrences of each line

Referencing a new array element creates it with an empty value. Incrementing that element treats the empty value as 0, so the result is 1.

# awk 'BEGIN{arr[1];print arr[1]}'

# awk 'BEGIN{arr[1]++;print arr[1]}'
1
# awk 'BEGIN{++arr[1];print arr[1]}'
1

Based on these two characteristics, we can perform statistical operations.

# awk '{arr[$0]++}END{for(i in arr){print i" count is:"arr[i]}}' x.log
 count is:2    # note that the index here is the empty line
def count is:1
mnopq count is:1
abc count is:3
ghi count is:2
xyz count is:1

Count the number of words

Since we can count lines, we can also count words: the array index simply holds each word instead of the whole line.

Let’s rewrite x.log and add a few words.

abc
def
ghi def def def
abc
ghi abc
xyz
 abc
mnopq
abc
 mnopq
# awk '{for(i=1;i<=NF;i++){arr[$i]++}}END{for(idx in arr){print idx":"arr[idx]}}' x.log 
def:4
mnopq:2
abc:5
ghi:2
xyz:1
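Because the for-in order is arbitrary, a frequency list is usually easier to read once sorted. One portable option, sketched here with a hypothetical function word_freq, is to print the count first and let sort order the lines by count (descending) and then by word.

```shell
word_freq() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print count[w], w }' "$@" |
        LC_ALL=C sort -k1,1nr -k2,2
}

printf 'abc def\ndef ghi def\n' | word_freq
```

For this input the output should be "3 def", "1 abc" and "1 ghi".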

Count the number of TCP connection states

netstat -tanp displays network connection information. If there are too few connections to look at, open a few more SSH sessions yourself.

# netstat -tanp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:6012          0.0.0.0:*               LISTEN      9238/sshd: [email protected] 
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      715/rpcbind         
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN      1455/dnsmasq        
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1143/sshd           
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      1136/cupsd          
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1369/master         
tcp        0      0 127.0.0.1:6010          0.0.0.0:*               LISTEN      1797/sshd: [email protected] 
tcp        0      0 127.0.0.1:6011          0.0.0.0:*               LISTEN      9192/sshd: [email protected] 
tcp        0      0 192.168.152.100:22      192.168.152.1:53246     ESTABLISHED 9192/sshd: [email protected] 
tcp        0     52 192.168.152.100:22      192.168.152.1:56142     ESTABLISHED 1797/sshd: [email protected] 
tcp        0      0 192.168.152.100:22      192.168.152.1:53247     ESTABLISHED 9238/sshd: [email protected] 
tcp6       0      0 ::1:6012                :::*                    LISTEN      9238/sshd: [email protected] 
tcp6       0      0 :::111                  :::*                    LISTEN      715/rpcbind         
tcp6       0      0 :::22                   :::*                    LISTEN      1143/sshd           
tcp6       0      0 ::1:631                 :::*                    LISTEN      1136/cupsd          
tcp6       0      0 ::1:25                  :::*                    LISTEN      1369/master         
tcp6       0      0 ::1:6010                :::*                    LISTEN      1797/sshd: [email protected] 
tcp6       0      0 ::1:6011                :::*                    LISTEN      9192/sshd: [email protected]

Usually we want the states with the most connections listed first, which brings us back to controlling the traversal order of the array.

# netstat -tanp | awk '/^tcp/{arr[$6]++}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in arr){print i":"arr[i]}}'
LISTEN:15
ESTABLISHED:3

Take the maximum value according to the field

Suppose there is such a file:

# cat version.txt 
file 10
dir 10
file 20
dir 20
file 10
dir 10
file 300
dir 999
file 30
dir 99

We expect output:

file 300
dir 999

That is, for file and for dir, output the line with the largest number.

# awk 'arr[$1]
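The command above is cut off in the source. As a minimal sketch of one possible approach, not necessarily the author's, the maximum per key can be kept in one array while a second array remembers the order in which keys first appear. The function name max_by_key and the array names max and order are assumptions made for this example.

```shell
max_by_key() {
    awk '
    # First time a key is seen: remember its position and its value.
    !($1 in max)     { order[++n] = $1; max[$1] = $2 + 0; next }
    # Later occurrences: keep the larger number ($2 + 0 forces a numeric comparison).
    $2 + 0 > max[$1] { max[$1] = $2 + 0 }
    END { for (i = 1; i <= n; i++) print order[i], max[order[i]] }
    ' "$@"
}

printf 'file 10\ndir 10\nfile 300\ndir 999\nfile 30\ndir 99\n' | max_by_key
```

For this input, and for the version.txt shown earlier, the output should be "file 300" followed by "dir 999".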

 
