# What is a data structure?

A data structure is a collection of data elements that have specific relationships with one another. The relationships between elements are called the logical structure of the data; the way the elements and their relationships are stored is called the storage structure (or physical structure). In general, a carefully chosen data structure gives better running time or storage efficiency.
## Classification of data structures

The logical structures of data are mainly divided into linear structures and nonlinear structures.

The storage structures are mainly divided into sequential storage, chained storage, index storage and hash storage.

Sequential storage: the elements are stored one after another in a group of storage units with contiguous addresses (e.g. an array).

Chained storage: the elements are stored in nodes linked by pointers. The addresses of the nodes do not need to be contiguous (e.g. a linked list).

Index storage: an index table is built, and the storage address of a node is determined through its entry in the index table.

Hash storage: a search technique that establishes a correspondence between the storage location of a data element and its key. The basic idea is that the storage address of a node is determined by the key value of the node. Besides searching, it can also be used for storing data.

# Linear table

## Definition of linear table

A linear structure is a basic data structure, mainly used to describe real-world data relationships in which each element has a single predecessor and a single successor. Its characteristic is that the data elements are linearly related, that is, the elements are “arranged one after another”.

The linear table is the simplest, most basic and most commonly used linear structure. It is usually stored with sequential storage or chained storage, and its main basic operations are insertion, deletion and search.

A linear table is a finite sequence of n (n ≥ 0) data elements with the same characteristics. A non-empty linear table has the following characteristics.

(1) There is only one element called “first”.

(2) There is only one element called “last”.

(3) Except for the first element, each element in the sequence has only one direct predecessor.

(4) Except for the last element, each element in the sequence has only one direct successor.

## Storage structure of linear table

The storage structure of linear table is divided into sequential storage and chain storage.

Sequential storage of a linear table means storing its data elements one after another in a group of storage units with contiguous addresses, so that two logically adjacent elements are also physically adjacent, as shown in the figure:

The advantage of using sequential storage structure in linear table is that the elements in the table can be accessed randomly, but the disadvantage is that the elements need to be moved for insertion and deletion operations.

The chained storage of linear table uses nodes to store data elements. The basic node structure is as follows:

Here the data field stores the value of the data element, and the pointer field stores information about the direct predecessor or direct successor of the current element. The information in the pointer field is called a pointer (or link).

The nodes that store the data elements do not need to have contiguous addresses, so the logical relationship between elements must be stored along with the elements themselves.

In addition, space for a node is allocated only when it is needed; it does not have to be reserved in advance.

Nodes are linked into a list through their pointer fields. If each node has only one pointer field, the list is called a linear linked list (or singly linked list), as shown in the figure:

In the storage structure of the linked list, you only need a pointer (called the head pointer, such as the head in the above figure) to point to the first node to access any element in the table in sequence.

Insertion and deletion in the chain storage structure are essentially the modification of related pointers.

When a linear list uses a linked list as the storage structure, it cannot randomly access the data elements (it needs to traverse the data elements), but the insertion and deletion operations do not need to move the elements.
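The pointer manipulation described above can be sketched as follows. This is a minimal illustration, not a reference implementation; the names `Node`, `SinglyLinkedList`, `insert`, `delete` and `to_list` are assumptions introduced here.

```python
class Node:
    """A singly linked list node: a data field plus one pointer field."""

    def __init__(self, data, next=None):
        self.data = data
        self.next = next


class SinglyLinkedList:
    def __init__(self):
        self.head = None  # head pointer; None means the list is empty

    def _node_at(self, index):
        """Walk from the head to position index (no random access)."""
        if index < 0:
            raise IndexError("index out of range")
        node = self.head
        for _ in range(index):
            if node is None:
                break
            node = node.next
        if node is None:
            raise IndexError("index out of range")
        return node

    def insert(self, index, value):
        """Insert value so that it becomes the element at position index."""
        if index == 0:
            self.head = Node(value, self.head)
            return
        prev = self._node_at(index - 1)
        prev.next = Node(value, prev.next)  # only pointers change

    def delete(self, index):
        """Remove and return the element at position index."""
        if index == 0:
            if self.head is None:
                raise IndexError("delete from empty list")
            removed = self.head
            self.head = removed.next
            return removed.data
        prev = self._node_at(index - 1)
        if prev.next is None:
            raise IndexError("index out of range")
        removed = prev.next
        prev.next = removed.next  # unlink the node; nothing is shifted
        return removed.data

    def to_list(self):
        """Traverse from the head and collect the data fields."""
        out, node = [], self.head
        while node is not None:
            out.append(node.data)
            node = node.next
        return out
```

Reaching position `index` costs a walk from the head, while the insertion or deletion itself only rewires one or two pointers.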

There are several other linked list structures according to the setting mode of the pointer field in the node.

Doubly linked list. Each node contains two pointers, which indicate the direct predecessor and the direct successor of the current node respectively. Its characteristic is that the list can be traversed in both directions starting from any node.

Circular linked list. Starting from a singly linked list (or doubly linked list), the pointer of the tail node is made to point to the first node, forming a ring. Its characteristic is that the whole list can be traversed starting from any node.

Static linked list. The linked storage structure of a linear table is described with an array: the subscript of an array element serves as the pointer to the node held in that element.

# Stack

## Definition of stack

A stack is a linear data structure in which data can be stored and retrieved only at one end. In other words, a stack operates by the “last in, first out” rule, so it is also called a last in, first out (LIFO) linear table.

The end of the stack where insertions and deletions take place is called the top of the stack; the other end is called the bottom of the stack. A stack that contains no data elements is called an empty stack.

## Basic operation of stack

① InitStack(S): create an empty stack S.

② StackEmpty(S): return true when stack S is empty, otherwise return false.

③ Push(S, x): add element x to the top of the stack and update the top pointer.

④ Pop(S): delete the top element from the stack and update the top pointer. If the value of the top element is needed, Pop(S) can be defined as a function that returns it.

⑤ Read the top element of stack S: return the value of the top element without modifying the top pointer.

## Storage structure of stack

Sequential storage (sequential stack): the data elements from the top of the stack to the bottom are stored one after another in a group of storage units with contiguous addresses, and a pointer top indicates the position of the top element. With this storage method the space for the stack must be defined (or allocated) in advance, so its capacity is limited. Therefore, before an element is pushed onto a sequential stack, it is necessary to check whether the stack is full (no free unit is left in the stack space). If the stack is full, pushing the element causes an overflow.

Chained storage (linked stack): the data elements of the stack are stored in a linked list, which avoids the overflow problem that a fixed-capacity sequential stack can have. Because elements are inserted and deleted only at the top of the stack, no head node is needed; the head pointer of the linked list serves as the top pointer of the stack.
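A sequential stack with a fixed capacity might look like the sketch below. The class name `SequentialStack` and its method names are illustrative assumptions; the point is the full-stack check before a push and the top pointer that moves with each operation.

```python
class SequentialStack:
    """Array-backed stack whose storage is reserved in advance."""

    def __init__(self, capacity):
        self._items = [None] * capacity  # contiguous storage, fixed size
        self._top = -1                   # index of the top element; -1 means empty

    def is_empty(self):
        return self._top == -1

    def is_full(self):
        return self._top == len(self._items) - 1

    def push(self, x):
        if self.is_full():
            raise OverflowError("stack is full")
        self._top += 1
        self._items[self._top] = x

    def pop(self):
        if self.is_empty():
            raise IndexError("pop from empty stack")
        x = self._items[self._top]
        self._items[self._top] = None
        self._top -= 1
        return x

    def top(self):
        """Return the top element without changing the top pointer."""
        if self.is_empty():
            raise IndexError("stack is empty")
        return self._items[self._top]
```

A linked stack would replace the array with nodes and use the list's head pointer as the top pointer, so the capacity check disappears.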

## Application of stack

Typical applications of the stack include expression evaluation and bracket matching. The stack also plays an important role in the implementation of programming languages and in transforming recursive procedures into non-recursive ones.
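As an illustration of bracket matching, the sketch below pushes every opening bracket and pops on every closing bracket. The function name `brackets_balanced` is an assumption made for this example, and a Python list stands in for the stack.

```python
def brackets_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in text:
        if ch in '([{':
            stack.append(ch)              # push an opening bracket
        elif ch in pairs:
            # A closing bracket must match the most recent opening one.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                      # leftovers mean unclosed brackets
```

For instance, `brackets_balanced("(a[b]{c})")` is true, while `"(]"` and `"(("` are rejected.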

# Queue

## Definition of queue

A queue is a first in, first out (FIFO) linear table. Elements may be inserted only at one end of the table and deleted only at the other end. The end where insertion is allowed is called the rear of the queue, and the end where deletion is allowed is called the front.

## Basic operation of queue

① Initialize queue (Q): create an empty queue Q.

② Test for empty queue (Q): return true when queue Q is empty, otherwise return false.

③ Enqueue(Q, x): add element x to the rear of queue Q and update the rear pointer.

④ Dequeue(Q): delete the front element from queue Q and update the front pointer.

⑤ Read the front element FrontQue(Q): return the value of the front element without updating the front pointer.

## Storage structure of queue

Sequential storage: the sequential storage structure of a queue is also called a sequential queue. It likewise uses a group of storage units with contiguous addresses to store the elements of the queue. Since elements are inserted and deleted only at the two ends of the queue, a front pointer and a rear pointer are set to indicate the current front and rear elements respectively.

In a sequential queue, to keep the operations simple, only the rear pointer is modified when an element enters the queue, and only the front pointer is modified when an element leaves the queue. Let the capacity of sequential queue Q be 6, the front pointer be front and the rear pointer be rear. The relationship between the two pointers and the elements in the queue is shown in the figure below:

Since the storage space of a sequential queue is fixed in advance, the rear pointer has an upper limit. Once the rear pointer reaches that limit, a new element can no longer be enqueued simply by incrementing the rear pointer. In that case the sequential queue can be treated as a ring by using the remainder (modulo) operation; such a queue is called a circular queue.

Let the capacity of circular queue Q be maxsize. Initially the queue is empty, and Q.rear and Q.front are both 0.

When an element is enqueued, the rear pointer is updated as Q.rear = (Q.rear + 1) % maxsize.

When an element is dequeued, the front pointer is updated as Q.front = (Q.front + 1) % maxsize.

When the queue is empty and when it is full, the front and rear pointers of the circular queue point to the same position, so the state of the queue cannot be determined from the relationship between Q.rear and Q.front alone.

To distinguish an empty queue from a full one, either of the following two methods can be used:

First, set a flag that tells whether the queue is empty or full when the front and rear pointers have the same value.

Second, sacrifice one storage unit. By convention, the queue is full when the position after the one indicated by the rear pointer is the front pointer, that is, when (Q.rear + 1) % maxsize == Q.front, and the queue is empty when the front and rear pointers are equal.
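The second method (sacrificing one storage unit) can be sketched as below. The class name `CircularQueue` and its methods are assumptions for illustration; here `rear` is taken to index the next free slot, so a queue created with `maxsize` slots holds at most `maxsize - 1` elements.

```python
class CircularQueue:
    """Circular queue that keeps one slot unused to tell full from empty."""

    def __init__(self, maxsize):
        if maxsize < 2:
            raise ValueError("maxsize must be at least 2")
        self._data = [None] * maxsize
        self._maxsize = maxsize
        self.front = 0  # index of the front element
        self.rear = 0   # index of the next free slot

    def is_empty(self):
        return self.front == self.rear

    def is_full(self):
        return (self.rear + 1) % self._maxsize == self.front

    def enqueue(self, x):
        if self.is_full():
            raise OverflowError("queue is full")
        self._data[self.rear] = x
        self.rear = (self.rear + 1) % self._maxsize

    def dequeue(self):
        if self.is_empty():
            raise IndexError("dequeue from empty queue")
        x = self._data[self.front]
        self.front = (self.front + 1) % self._maxsize
        return x

    def __len__(self):
        return (self.rear - self.front + self._maxsize) % self._maxsize
```

With `maxsize = 4`, three enqueues fill the queue; after one dequeue the rear pointer can wrap around to index 0 and reuse the freed slot.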

Chained storage: chained storage of queues is also called chained queues. Here, in order to facilitate operation, add a header node to the chain queue and point the header pointer to the header node. Therefore, the judgment condition of empty queue is that the values of head pointer and tail pointer are the same, and both refer to the head node.

## Application of queue

Queue structure is often used to deal with situations that need to queue, such as print queue for processing print tasks in operating system, computer simulation of discrete events and so on.

# String

## Definition of string

A string is a special linear table whose data elements are characters. Non-numerical problems in computing often operate on string data. A string is a finite sequence consisting only of characters, that is, a linear table whose elements come from a restricted value range. It is usually written as s = 'a1a2…an', where s is the string name and the character sequence enclosed in single quotation marks is the string value.

## Some basic concepts of string

Empty string: a string with a length of zero. An empty string does not contain any characters.

Blank string: a string consisting of one or more spaces. Although a space is a white-space character, it is still a character and counts toward the length of the string.

Substring: a sequence of any number of consecutive characters in a string is called a substring of that string. The string containing the substring is called the main string. The position of a substring in the main string is the position, in the main string, of the first character of the first occurrence of the substring. The empty string is a substring of every string.

String equality: two strings are equal when they have the same length and the same character at every corresponding position.

String comparison: two strings are compared by the ASCII code values (or the values in another character encoding) of their characters. The comparison starts from the first character of both strings; at the first position where they differ, the string whose character has the larger code value is the larger string. If one string ends before any difference is found, the longer string is the larger.

## Basic operation of string

① Assignment StrAssign(s, t): assign the value of string s to string t.

② Concatenation Concat(s, t): append string t to the end of string s to form a new string.

③ String length StrLength(s): return the length of string s.

④ String comparison StrCompare(s, t): compare two strings. The return values -1, 0 and 1 represent the three cases s < t, s = t and s > t respectively.

⑤ Substring SubString(s, start, len): return the substring of length len that starts at position start in string s.

## String storage structure

Sequential storage: a group of storage units with contiguous addresses stores the character sequence of the string value. Since the elements of a string are characters, the storage space can be defined with the character array provided by the programming language, or it can be allocated dynamically according to the required string length.

Linked storage: when a linked list stores the characters of a string, each node can hold one character or several characters, so storage density has to be considered. In linked storage the choice of node size is as important as the choice of array size in sequential storage, and it directly affects the efficiency of string processing.

## Pattern matching of strings

Locating a substring is usually called string pattern matching, and it is one of the most important operations in string processing systems. The substring being searched for is also called the pattern string.

(1) Simple pattern matching algorithm:

This algorithm is also called the brute-force algorithm. The basic idea is to start at the first character of the main string and compare it with the first character of the pattern string. If they are equal, the following characters of the two strings are compared one by one; otherwise the comparison restarts from the second character of the main string and the first character of the pattern string, and so on. The match succeeds when every character of the pattern string equals the corresponding character of a contiguous character sequence in the main string. If no substring equal to the pattern string can be found in the main string, the match fails.
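A minimal sketch of the simple algorithm, assuming 0-based positions and a return value of -1 on failure (both conventions, like the name `brute_force_find`, are choices made for this example):

```python
def brute_force_find(main, pattern):
    """Return the first index where pattern occurs in main, or -1."""
    n, m = len(main), len(pattern)
    for start in range(n - m + 1):        # candidate start positions in main
        j = 0
        while j < m and main[start + j] == pattern[j]:
            j += 1                        # characters agree, keep comparing
        if j == m:                        # the whole pattern matched
            return start
    return -1
```

After a mismatch the comparison restarts one position further along the main string, so in the worst case about `len(main) * len(pattern)` character comparisons are made.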

(2) Improved pattern matching algorithm:

The improved pattern matching algorithm is also called the KMP algorithm. Its improvement is that when a mismatch occurs during matching, the position pointer of the main string does not need to move back; instead, the “partial match” already obtained is used to slide the pattern string to the right as far as possible before the comparison continues.

Let the pattern string be “$p_0 \dots p_{m-1}$”. The idea of the KMP algorithm is: when character $p_j$ in the pattern string differs from the corresponding character $s_i$ in the main string, the first j characters (“$p_0 \dots p_{j-1}$”) have already been matched, so if “$p_0 \dots p_{k-1}$” in the pattern string is the same as “$p_{j-k} \dots p_{j-1}$”, then $p_k$ can be compared with $s_i$ directly and i does not need to go back.

In the KMP algorithm, the sliding of the pattern string is driven by the next function of the pattern string. If next[j] = k, then when $p_j$ in the pattern string differs from the corresponding character in the main string, $p_k$ is the next pattern character to be compared with that main-string character.

The next function is defined as follows:
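For a 0-based pattern, the definition usually given is the following (stated here as an assumption, since indexing conventions differ between textbooks):

$$
next[j] =
\begin{cases}
-1, & j = 0 \\
\max\{\,k \mid 0 < k < j,\ p_0 \dots p_{k-1} = p_{j-k} \dots p_{j-1}\,\}, & \text{if this set is non-empty} \\
0, & \text{otherwise}
\end{cases}
$$

Under that definition, the next table and the matching loop can be sketched as below. The function names `build_next` and `kmp_find` are hypothetical, and returning -1 on failure is a convention chosen for this example.

```python
def build_next(pattern):
    """Compute the next table for a 0-based pattern, with next[0] = -1."""
    m = len(pattern)
    nxt = [-1] * m
    j, k = 0, -1
    while j < m - 1:
        if k == -1 or pattern[j] == pattern[k]:
            j += 1
            k += 1
            nxt[j] = k        # pattern[0:k] equals pattern[j-k:j]
        else:
            k = nxt[k]        # fall back to a shorter matching prefix
    return nxt


def kmp_find(main, pattern):
    """Return the first index where pattern occurs in main, or -1."""
    if not pattern:
        return 0
    nxt = build_next(pattern)
    i = j = 0
    while i < len(main) and j < len(pattern):
        if j == -1 or main[i] == pattern[j]:
            i += 1
            j += 1
        else:
            j = nxt[j]        # slide the pattern; i never moves back
    return i - j if j == len(pattern) else -1
```

For the pattern `"abaabc"` this sketch yields the table `[-1, 0, 0, 1, 1, 2]`.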

# Tree

## Definition of tree

A tree is a very important nonlinear structure. In this structure a data element can have two or more direct successors. Trees can describe the hierarchical relationships that are common in the real world.

A tree is a finite set of n (n ≥ 0) nodes. When n = 0 it is called an empty tree. Every non-empty tree (n > 0) has exactly one node called the root; the remaining nodes can be divided into m (m ≥ 0) disjoint finite sets T1, T2, …, Tm, each of which is itself a tree and is called a subtree of the root.

The definition of a tree is recursive, which reflects an inherent property of trees: a tree is composed of several subtrees, and each subtree is in turn composed of smaller subtrees.

## Basic concepts of tree

(1) Parent, child and sibling: the root of a subtree of a node is called a child of that node; correspondingly, the node is called the parent of its child. Nodes with the same parent are siblings.

(2) Degree of a node: the number of subtrees of a node is the degree of the node.

(3) Leaf node: also known as a terminal node, a node of degree 0.

(4) Internal node: a node whose degree is not 0 is called a branch node or non-terminal node. Branch nodes other than the root are called internal nodes.

(5) Level of a node: the root is on level 1, the children of the root are on level 2, and so on. If a node is on level i, its children are on level i + 1.

(6) Height of a tree: the maximum level of any node in the tree is the height (or depth) of the tree.

(7) Ordered (unordered) tree: if the subtrees of each node are regarded as ordered from left to right, that is, they cannot be exchanged, the tree is called an ordered tree; otherwise it is called an unordered tree.

## Definition of binary tree

A binary tree is a finite set of n (n ≥ 0) nodes. It is either an empty tree (n = 0), or it consists of a root node and two disjoint binary trees, called the left subtree and the right subtree of the root.

The main difference between a tree and a binary tree is that the subtrees of a binary tree node are distinguished as left and right. Even if a node has only one subtree, it must be stated whether it is the left subtree or the right subtree. In addition, the maximum degree of a node in a binary tree is 2, whereas the degree of a node in a tree is not limited.

## Properties of binary tree

(1) There are at most $2^{i-1}$ nodes on level $i$ ($i \ge 1$) of a binary tree.

(2) A binary tree of height $k$ ($k \ge 1$) has at most $2^{k}-1$ nodes.

(3) For any binary tree, if the number of terminal nodes is $n_0$ and the number of nodes of degree 2 is $n_2$, then $n_0 = n_2 + 1$.

(4) The depth of a complete binary tree with $n$ nodes is $\lfloor \log_2 n \rfloor + 1$.

A binary tree of depth $k$ that has $2^{k}-1$ nodes is called a full binary tree. The nodes of a full binary tree are numbered consecutively, starting from the root, from top to bottom and from left to right. A binary tree with depth $k$ and $n$ nodes is called a complete binary tree if and only if each of its nodes corresponds to one of the nodes numbered 1 to $n$ in the full binary tree of depth $k$. Full and complete binary trees are illustrated in the figure.

## Storage structure of binary tree

Sequential storage: a set of storage units with continuous addresses are used to store the nodes of the binary tree. The nodes must be arranged into an appropriate linear sequence, and the mutual positions of the nodes in this sequence can reflect the logical relationship between the nodes.

For a complete binary tree with depth k, every level except level k contains the maximum number of nodes, that is, each level has exactly twice as many nodes as the level above it. From the number of a node, the numbers of its parent, left child and right child can be derived.

Assume a node is numbered i in a complete binary tree with n nodes. Then:

If i = 1, the node is the root and has no parent; if i > 1, its parent is node ⌊i / 2⌋ (i / 2 rounded down).

If 2i ≤ n, the left child of the node is numbered 2i; otherwise it has no left child.

If 2i + 1 ≤ n, the right child of the node is numbered 2i + 1; otherwise it has no right child.
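These numbering rules translate directly into index arithmetic. The sketch below keeps the 1-based numbering used above and returns `None` where a relative does not exist; the function names are illustrative assumptions.

```python
def parent(i):
    """Number of the parent of node i, or None for the root (i = 1)."""
    return i // 2 if i > 1 else None


def left_child(i, n):
    """Number of the left child of node i in a tree of n nodes, or None."""
    return 2 * i if 2 * i <= n else None


def right_child(i, n):
    """Number of the right child of node i in a tree of n nodes, or None."""
    return 2 * i + 1 if 2 * i + 1 <= n else None
```

In a complete binary tree with 6 nodes, node 3 has left child 6 and no right child, and the parent of node 5 is node 2. If the nodes are kept in a Python list, slot 0 can be left unused so that list indices equal node numbers.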

The sequential storage structure of binary tree is shown in the figure

The sequential storage structure is simple and space-efficient for a complete binary tree. For a general binary tree it is not suitable, because the tree must still be stored in the shape of a complete binary tree: “virtual nodes” that do not actually exist have to be added, which wastes space.

Linked storage: a node of a binary tree holds a data element together with information such as the root of its left subtree, the root of its right subtree and its parent. A binary tree can therefore be stored as a three-pointer linked list (left child, right child and parent) or a two-pointer linked list (left child and right child), with the head pointer of the list pointing to the root of the binary tree.

## Traversal of binary tree

Traversal is the process of visiting every node in the tree exactly once according to some strategy. Because a binary tree is recursive by nature, a non-empty binary tree can be regarded as a root node, a left subtree and a right subtree. If these three parts are traversed in turn, the whole binary tree is traversed. With the convention that the left subtree is always traversed before the right subtree, three traversal methods are obtained, depending on when the root is visited.

Preorder traversal: visit in the order root, left subtree, right subtree.

Inorder (middle order) traversal: visit in the order left subtree, root, right subtree.

Postorder traversal: visit in the order left subtree, right subtree, root.

Traversing a binary tree essentially linearizes a nonlinear structure: in the resulting sequence every node (except the first and the last) has exactly one direct predecessor and one direct successor.
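The three traversal orders differ only in where the root is visited, which a recursive sketch makes visible. The names `TreeNode`, `preorder`, `inorder` and `postorder` are assumptions for this example, using a two-pointer (left and right) node.

```python
class TreeNode:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def preorder(node):
    """Root, then left subtree, then right subtree."""
    if node is None:
        return []
    return [node.data] + preorder(node.left) + preorder(node.right)


def inorder(node):
    """Left subtree, then root, then right subtree (middle order)."""
    if node is None:
        return []
    return inorder(node.left) + [node.data] + inorder(node.right)


def postorder(node):
    """Left subtree, then right subtree, then root."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.data]


# A small tree: A has children B and C; B has children D and E.
tree = TreeNode('A',
                TreeNode('B', TreeNode('D'), TreeNode('E')),
                TreeNode('C'))
```

For this tree the preorder sequence is A B D E C, the inorder sequence is D B E A C, and the postorder sequence is D E B C A.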

# Optimal binary tree

## Definition of optimal binary tree

An optimal binary tree, also known as a Huffman tree, is the kind of tree with the shortest weighted path length. A path is the route from one node to another in the tree, and the number of branches on the path is called the path length.

The path length of a tree is the sum of the path lengths from the root to each leaf. The weighted path length of a node is the product of the path length from that node to the root and the weight of the node.

The weighted path length of the tree is the sum of the weighted path lengths of all leaf nodes in the tree, written as

$$
WPL = \sum_{k=1}^{n} w_k l_k
$$

where $n$ is the number of weighted leaf nodes, $w_k$ is the weight of leaf node $k$, and $l_k$ is the path length from that leaf node to the root.

The following figure shows a binary tree with 4 leaf nodes, of which the weighted path length of the binary tree shown in figure (b) is the smallest.

The Huffman algorithm for constructing the optimal binary tree is as follows:

(1) From the given n weights {w1, w2, …, wn}, construct a set F = {T1, T2, …, Tn} of n binary trees, where each tree Ti consists of a single root node with weight wi whose left and right subtrees are empty.

(2) Select from F the two trees whose roots have the smallest weights and use them as the left and right subtrees of a new binary tree; the weight of the new root is the sum of the weights of the roots of its left and right subtrees.

(3) Delete the two trees from F and add the new binary tree to F.

Repeat steps (2) and (3) until only one tree remains in F; that tree is the optimal binary tree (Huffman tree).
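Steps (1) to (3) can be sketched with a min-heap standing in for the set F. This is one possible realization, not the only one: the names `HuffNode`, `build_huffman`, `leaf_depths` and `weighted_path_length` are assumptions, ties between equal weights are broken by insertion order, and at least one weight is assumed to be given.

```python
import heapq
from itertools import count


class HuffNode:
    def __init__(self, weight, symbol=None, left=None, right=None):
        self.weight = weight
        self.symbol = symbol   # set only on leaf nodes
        self.left = left
        self.right = right


def build_huffman(weights):
    """weights maps symbol -> weight; returns the root of a Huffman tree."""
    order = count()  # tie-breaker so the heap never compares HuffNode objects
    forest = [(w, next(order), HuffNode(w, symbol=s)) for s, w in weights.items()]
    heapq.heapify(forest)                    # step (1): n single-node trees
    while len(forest) > 1:
        w1, _, t1 = heapq.heappop(forest)    # step (2): two smallest roots
        w2, _, t2 = heapq.heappop(forest)
        merged = HuffNode(w1 + w2, left=t1, right=t2)
        heapq.heappush(forest, (merged.weight, next(order), merged))  # step (3)
    return forest[0][2]


def leaf_depths(node, depth=0):
    """Map each leaf symbol to its path length from the root."""
    if node.left is None and node.right is None:
        return {node.symbol: depth}
    depths = leaf_depths(node.left, depth + 1)
    depths.update(leaf_depths(node.right, depth + 1))
    return depths


def weighted_path_length(weights):
    depths = leaf_depths(build_huffman(weights))
    return sum(weights[s] * depths[s] for s in weights)
```

With the illustrative weights `{'a': 7, 'b': 5, 'c': 2, 'd': 4}`, the leaves end up at depths 1, 2, 3 and 3, so the weighted path length is 7×1 + 5×2 + 2×3 + 4×3 = 35.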

## Huffman coding

If a binary code of the same length is compiled for each character, it is called equal length coding. For example, the 26 characters in the English character set are represented by a 5-bit binary string, and a character coding table is constructed according to the equal length coding format. The sender encodes the original information according to the coding table and sends the message. The receiver divides the received binary code into groups of 5 bits. The corresponding characters can be obtained by looking up the coding table of characters to realize decoding.

The equal length coding scheme is relatively simple to implement, but the encoded message is long, which hurts communication efficiency, so it is desirable to shorten the total length of the code string. If codes of different lengths are designed for the characters, with codes as short as possible for the characters that occur more often, the total length of the transmitted code string can be reduced.

To design codes of different lengths, the following condition must be met: the code of any character must not be a prefix of the code of another character. Such a code is called a prefix code.

For a given character set D and usage frequencies W, the optimal prefix code is constructed as follows: take the characters of D as leaf nodes with the frequencies in W as their weights, construct an optimal binary tree, and label the left and right branch of every node with 0 and 1 respectively. The code of the character at each leaf is then the string of 0s and 1s on the path from the root to that leaf.

Huffman decoding works as follows: starting from the root, follow the left or right branch according to each bit of the binary string (the coded sequence): a 0 moves to the left subtree of the current node and a 1 to the right subtree. When a leaf is reached, one character has been decoded. If the bit string is not yet exhausted, return to the root and continue the decoding process.

For example, given the character set {a, b, c, d, e} and the corresponding weights {0.3, 0.25, 0.15, 0.22, 0.08}, the Huffman algorithm above is used to construct the optimal binary tree and, from it, the code of each character.

If the coding sequence is 101110000100, the translated character sequence is “edaac”.
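The decoding walk can be sketched as follows. The code table used here (a = 00, b = 01, c = 100, d = 11, e = 101) is an inference: it is a prefix code whose code lengths match a Huffman tree for the weights above and it reproduces the decoded result “edaac”, but it is not read from the original figure. The function name `decode` is likewise an assumption.

```python
def decode(bits, code_table):
    """Decode a bit string with a prefix code given as symbol -> code."""
    # Rebuild the code tree as nested dicts: '0' is the left branch,
    # '1' is the right branch, and a leaf is the symbol itself.
    root = {}
    for symbol, code in code_table.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol

    out = []
    node = root
    for bit in bits:
        node = node[bit]               # follow the left or right branch
        if isinstance(node, str):      # reached a leaf: one character decoded
            out.append(node)
            node = root                # go back to the root
    if node is not root:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)


# Inferred code table (an assumption, see the note above).
table = {'a': '00', 'b': '01', 'c': '100', 'd': '11', 'e': '101'}
```

Splitting 101110000100 as 101 11 00 00 100 and looking up each group gives e, d, a, a, c.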

# Binary sort tree

## Definition of binary sort tree

Binary sort tree, also known as binary search tree, is either an empty tree or a binary tree with the following properties.

(1) If its left subtree is not empty, the values of all nodes on the left subtree are less than those of the root node.

(2) If its right subtree is not empty, the values of all nodes on the right subtree are greater than the value of the root node.

(3) The left and right subtrees themselves are binary sort trees.

As shown in the figure:

## Search process of binary sort tree

When the binary sort tree is not empty, the given value is first compared with the key of the root node. If they are equal, the search succeeds. If they are not equal, the search continues in the left subtree of the root when the key of the root is greater than the given value, and in the right subtree otherwise. A successful search follows a path from the root to the node found; an unsuccessful search ends at an empty subtree.

## Insert node in binary sort tree

A binary sort tree is built by reading data elements one by one and inserting each into the appropriate position in the tree. The process is as follows: for each element read, a new node is created. If the binary sort tree is empty, the new node becomes its root. Otherwise the value of the new node is compared with the value of the root: if it is smaller, the node is inserted into the left subtree, otherwise into the right subtree. For the key sequence {46, 25, 54, 13, 29, 91}, the construction of the whole binary sort tree is shown in the figure.
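Insertion and search can be sketched as below, using the key sequence from the text. The names `BSTNode`, `bst_insert`, `bst_search` and `inorder_keys` are assumptions, and the keys are assumed to be distinct.

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return BSTNode(key)            # an empty subtree becomes the new node
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:
        root.right = bst_insert(root.right, key)
    return root


def bst_search(root, key):
    """Return the node holding key, or None if the search ends in an empty subtree."""
    node = root
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None


def inorder_keys(root):
    """Inorder traversal; for a binary sort tree the keys come out sorted."""
    if root is None:
        return []
    return inorder_keys(root.left) + [root.key] + inorder_keys(root.right)


root = None
for k in [46, 25, 54, 13, 29, 91]:
    root = bst_insert(root, k)
```

After these insertions 46 is the root, 25 (with children 13 and 29) is its left child, and 54 (with right child 91) is its right child; an inorder traversal yields 13, 25, 29, 46, 54, 91.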

## Delete node in binary sort tree

Deleting a node from a binary sort tree must not remove the whole subtree rooted at that node: only the node itself is deleted, and the result must still be a binary sort tree. In other words, deleting a node from a binary sort tree is equivalent to deleting an element from an ordered sequence. When the deleted node is a leaf, the corresponding left or right child pointer of its parent must be updated so that the structure of the whole tree stays intact.

# Balanced binary tree

A balanced binary tree is also called an AVL tree. It is either an empty tree or a binary tree with the following properties: its left and right subtrees are both balanced binary trees, and the absolute value of the difference between their heights is at most 1. If the balance factor (BF) of a node is defined as the height of its left subtree minus the height of its right subtree, the balance factor of every node in a balanced binary tree can only be -1, 0 or 1. As soon as the absolute value of the balance factor of any node in the tree is greater than 1, the binary tree is unbalanced.

An analysis of the search process of a binary sort tree shows that search efficiency is best only when the shape of the tree is relatively even. It is therefore desirable to keep the binary sort tree balanced while it is being constructed.

The basic idea for keeping a binary sort tree balanced is: whenever a node is inserted, first check whether the insertion has broken the balance. If so, find the minimum unbalanced subtree and, while preserving the properties of a binary sort tree, adjust the relationships between the nodes in that subtree to restore balance. The minimum unbalanced subtree is the subtree whose root is the node closest to the inserted node among those whose balance factor has an absolute value greater than 1.

## Insert operation on balanced binary tree

Suppose that a is the pointer to the root of the minimum unbalanced subtree produced by inserting a node into the binary sort tree, that is, the node pointed to by a is the ancestor closest to the inserted node whose balance factor has an absolute value greater than 1. The adjustment after the loss of balance then falls into one of the following four cases.

(1) LL type: single right rotation. As shown in the figure below, a new node is inserted into the left subtree of the left subtree of *a (the node pointed to by a), so the balance factor of *a increases from 1 to 2 and the subtree rooted at *a loses balance. One clockwise (right) rotation is required.

(2) RR type: single left rotation. As shown in the figure below, a new node is inserted into the right subtree of the right subtree of *a, so the balance factor of *a changes from -1 to -2 and the subtree rooted at *a loses balance. One counterclockwise (left) rotation is required.

(3) LR type: double rotation, left then right. As shown in the figure below, a new node is inserted into the right subtree of the left subtree of *a, so the balance factor of *a increases from 1 to 2 and the subtree rooted at *a loses balance. Two rotations are required (first a left rotation, then a right rotation).

(4) RL type: double rotation, right then left. As shown in the figure below, a new node is inserted into the left subtree of the right subtree of *a, so the balance factor of *a changes from -1 to -2 and the subtree rooted at *a loses balance. Two rotations are required (first a right rotation, then a left rotation).
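The four cases can be sketched with two rotation helpers and a recursive insert. This is a minimal illustration under stated assumptions: keys are distinct, each node caches its height, and the names `AVLNode`, `rotate_right`, `rotate_left` and `avl_insert` are introduced here for the example.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # height of the subtree rooted at this node


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    return height(node.left) - height(node.right)


def rotate_right(a):
    """Single right rotation (LL case): the left child becomes the new root."""
    b = a.left
    a.left = b.right
    b.right = a
    update(a)
    update(b)
    return b


def rotate_left(a):
    """Single left rotation (RR case): the right child becomes the new root."""
    b = a.right
    a.right = b.left
    b.left = a
    update(a)
    update(b)
    return b


def avl_insert(root, key):
    """Insert key (assumed distinct) and return the rebalanced subtree root."""
    if root is None:
        return AVLNode(key)
    if key < root.key:
        root.left = avl_insert(root.left, key)
    else:
        root.right = avl_insert(root.right, key)
    update(root)
    bf = balance_factor(root)
    if bf > 1:                               # left subtree too high
        if key > root.left.key:              # LR: rotate the left child left first
            root.left = rotate_left(root.left)
        return rotate_right(root)            # LL, or second step of LR
    if bf < -1:                              # right subtree too high
        if key < root.right.key:             # RL: rotate the right child right first
            root.right = rotate_right(root.right)
        return rotate_left(root)             # RR, or second step of RL
    return root


def build(keys):
    root = None
    for k in keys:
        root = avl_insert(root, k)
    return root
```

Inserting 3, 2, 1 triggers the LL case, 1, 2, 3 the RR case, 3, 1, 2 the LR case and 1, 3, 2 the RL case; in all four the result has 2 at the root with children 1 and 3.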

## Delete operation on balanced binary tree

Deletion from a balanced binary tree is more complex than insertion. If both subtrees of the node to be deleted are non-empty, the node is replaced by its inorder predecessor (the last node visited by an inorder traversal of its left subtree) or by its inorder successor (the first node visited in its right subtree). This reduces the problem to deleting a node that has at most one subtree. After a node is deleted, the balance factors of all nodes on the path from the deleted node to the root need to be updated, and every node on that path whose balance factor becomes ±2 must be rebalanced.

PS: Some data structure topics are not yet covered here; the rest will be added later.