Some summary of database

Time:2022-1-25

Article catalogue


preface

Database is something that every back-end programmer must learn. It is necessary to write an album to summarize the knowledge points. It is also convenient to review the knowledge points and future interviews. The content does not involve SQL statements.
Mysql database specification (Alibaba Songshan java development manual)


Tip: the following is the main body of this article for reference only

1、 Basic concepts of database

1. What is a database

database(DB) is actually a pile of places for storing documents and things.
Mysql, SQL server, Oracle, redis, etc., which we are familiar with, are calledDatabase management system(DBMS)。
The database can be divided into:Relational databaseandNon relational database

Some summary of database

2. Differences between SQL and MySQL

This is a concept that many novices are confused about. It can be understood as follows:
SQL isA language, it is used to operate relational database and can be used to add, delete, modify and query.
MySQL isA softwareFor example, some simple operations cannot be queried through code every time, so the function of SQL query can be realized through mouse click through this software.

3. Three paradigms of database

Some summary of database
The three paradigms of database are only the basic idea, so that a table structure with less redundancy and reasonable structure can be established. However, in the actual design, it is still necessary to analyze and design according to the actual situation of requirements, take into account the consideration of performance and requirements, and there is no need to pursue the three paradigms

2、 Database index

1. Why use index

Advantages:
1. By creating a unique index, you can ensure the uniqueness of each row of data in the database table.
2. It can improve the efficiency of data retrieval and reduce the IO cost of database, which is similar to the catalogue of books.
3. Sort the data through the index column, reduce the cost of data sorting and reduce the consumption of CPU.
4. The indexed columns will be sorted automatically, including [single column index] and [combined index], but the sorting of combined index is more complex.
5. If you sort according to the order of index columns, the efficiency of the corresponding order by statement will be greatly improved.
inferiority:
1. Spatially: indexes need to occupy physical space. In addition to the data space occupied by data tables, each index also needs to occupy a certain physical space. If cluster indexes are to be established, the space required will be larger.
2. Time: Although the index will improve the query efficiency, it will reduce the efficiency of updating the table. For example, every time a table is added, deleted or modified, MySQL should not only save the data, but also save or update the corresponding index file.

2. When to use index

  1. Do not create indexes for columns that are rarely used or referenced in queries. Because these columns are rarely used, increasing the index will reduce the maintenance speed of the system and increase the space demand.
  2. Columns with few data values should not be indexed. Because these columns have few values and low discrimination, such as gender in the personnel table, a large proportion of data rows need to be searched in the table during query. Increasing the index can not significantly speed up the retrieval speed.
  3. Columns defined as text, image and bit data types should not be indexed. This is because the data volume of these columns is either quite large or has few values.
  4. When the modification performance is much greater than the retrieval performance, the index should not be created. At this time, because the two are contradictory, when the index is added, the retrieval performance will be improved, but the modification performance will be reduced.
  5. Indexes must be created for data columns with foreign keys.

3. Data structure and algorithm in index

Database indexes are classified according to structure, mainly including b-tree index, hash index and bitmap index.

Hash index:

The hash index adopts a certain hash algorithm (common hash algorithms include direct addressing method, square centering method, folding method, divisor remainder method and random number method) to convert the database field data into a fixed length hash value, which is stored in the corresponding position of the hash table together with the row pointer of the data. In case of hash collision (the hash values of two different keywords are the same), It is stored in the form of linked list under the corresponding hash key.

When searching, it does not need to search level by level from root node to leaf node like B + tree. It only needs one hash algorithm to locate the corresponding position immediately. The speed is very fast, and the average retrieval time is O (1).

Bitmap index

B-tree index is good at dealing with columns with many different values, but it will become difficult to use when dealing with columns with small cardinality. If the cardinality of the column queried by the user is very small, that is, there are only a few fixed values, such as gender, marital status, administrative region, etc., either do not use the index, scan all records row by row during query, or consider establishing bitmap index. When searching for data, just search all 1 values in the relevant bitmap (and, or operation can be carried out according to the query requirements).

B-tree index:

Some summary of database
The characteristics of an m-order B-tree are as follows:

  • Each node has at most M child nodes;
  • In addition to the root node and leaf node, each node has at least m / 2 (rounded up) child nodes;
  • All leaf nodes are located on the same layer;
  • Each node contains k elements (keywords), where M / 2 ≤ K < m, where M / 2 is rounded down;
  • The elements (keywords) in each node are arranged from small to large;
  • The value of the child left node of each element is less than or equal to the element, and the value of the right node is greater than or equal to the element.

B + tree index:

Some summary of database
characteristic:

  • All non leaf nodes only store keyword information;
  • All specific data exist in the leaf node;
  • All leaf nodes contain the information of all elements;
  • There is a chain pointer between all leaf nodes.
    This is the most commonly used structure

Comparison between B + tree and B tree:

B + tree B tree
The node with n subtrees contains n keywords; B-tree is n sub trees with n-1 keywords
All leaf nodes contain the information of all keywords and pointers to the records containing these keywords, and the leaf nodes themselves are linked in the order of the size of the keywords from small to large. The leaf node of B-tree does not include all the information to be found
All non terminal nodes can be regarded as the index part, and the node only contains the largest (or smallest) keyword in the root node of its subtree The non terminal node of B-tree also contains valid information to be found

The difference between B + tree and hush index

  • Hash index is faster to perform equivalent query (generally), but it is unable to perform range query;
  • Hash index does not support sorting with index;
  • Hash index does not support fuzzy query and leftmost prefix matching of multi column index. The principle is also because hash function is unpredictable;
  • The hash index can not avoid querying data back to the table at any time, while the B + tree can only complete the query through the index when it meets some conditions (cluster index, overlay index, etc.);
  • Although the hash index is fast in equivalent query, it is unstable and its performance is unpredictable. When a key value has a large number of duplicates, hash collision occurs, and the efficiency may be very poor; The query efficiency of B + tree is relatively stable. For all queries, it is from root node to leaf node, and the height of the tree is low.

4. The leftmost prefix matching principle of MySQL

When building a federated index (multi column index) in mysql, it will follow the leftmost prefix matching principle, that is, the leftmost priority. When retrieving data, it will match from the leftmost side of the federated index. For example, if there is a three column index (a, B, c), the index has been established on (a), (a, b), (a, B, c). Therefore, when creating a multi column index, the where clause should be selected according to the business requirements
The most frequently used column is placed on the far left. According to the leftmost prefix matching principle, MySQL will match to the right until it meets the range query (>, <, between, like). For example, when using the query criteria where a = 1 and B = 2 and C > 3 and d = 4, if the index in the order of (a, B, C, d) is established, D cannot use the index. If the index in the order of (a, B, D, c) is established, it can be used, And the order of a, B and D in the where clause can be adjusted arbitrarily. If the index order is (a, b), the index cannot be used by directly using the query condition where b = 1 according to the leftmost prefix matching principle.

5. Cluster index

Clustered index, also known as clustered index, is not an index type, but a way of data storage. Specifically, clustering index refers to putting data storage and index together, and finding the index will find the data.

In mysql, only InnoDB table supports cluster index. The data of InnoDB table itself is a cluster index. Non leaf nodes are stored in the order of primary key, and leaf nodes store primary key and corresponding row records. Therefore, the full table sequential scanning of InnoDB table will be very fast.
characteristic
1. Because the index and data are stored together, it has higher retrieval efficiency;
2. Compared with non clustered index, clustered index can reduce the IO times of disk;
3. The physical storage of tables is based on the structure of clustered indexes, so a data table can only have one clustered index, but can have multiple non clustered indexes;
4. Generally speaking, cluster indexes will be created on frequently used and sorted fields.
Some summary of database

III Database optimization

1. Structural optimization

  1. Decompose a table with many fields into multiple tables
    For tables with many fields, if some fields are used less frequently, these fields can be separated to form a new table.
  2. Add intermediate table
    For the table that needs frequent joint query, the query efficiency is improved by establishing an intermediate table. Specifically, insert the data that needs joint query into the intermediate table, and then change the original joint query to the query of the intermediate table.
  3. Add redundant fields
    As we all know, when designing data tables, we should try to follow the specification of paradigm theory, reduce redundant fields as much as possible, and make the database design look exquisite and elegant. However, the higher the degree of normalization of the table, the more the relationship between the table and the table, the more the query needs to be connected, and the worse the performance. Therefore, adding redundant fields reasonably can improve the query speed.

2. Sub warehouse and sub table

The amount of data in the database is not necessarily controllable. With the development of time and business, there will be more and more tables in the database, and the amount of data in the tables will be larger and larger. Accordingly, the cost of data operations, such as adding, deleting, modifying and querying, will be larger and larger; In addition, if distributed deployment is not carried out, and the resources of a server (CPU, disk, memory, IO, etc.) are limited, the amount of data and data processing capacity that the database can carry will encounter bottlenecks. Therefore, from the perspective of performance and availability, the database will be split. Specifically, the data originally stored in one database will be stored in blocks on multiple databases, and the data originally stored in one table will be stored in blocks on multiple tables, that is, sub database and sub table. You can also add default query conditions, such as chat records, which are within one week by default, so as to increase query efficiency
Some summary of database

Disadvantages:
1.Transaction issues

After dividing databases and tables, it becomes a distributed transaction. If you rely on the distributed transaction management function of the database itself to execute transactions, you will pay a high performance price; If the control is assisted by the application program, the logical transaction of the program will be formed, which will cause the burden of programming.

2.Cross database and cross table join problem

After the implementation of database and table splitting, it is difficult to avoid dividing the data with strong logical correlation into different tables and different libraries. At this time, the table association operation will be limited. We cannot join tables in different databases or tables with different table granularity. As a result, the business that can be completed by one query may need to be completed by multiple queries.

3.Additional data management burden and data computing pressure

The most common problems of additional data management burden are the positioning of data and the repeated execution of data addition, deletion, modification and query. These can be solved by application programs, but they will inevitably lead to additional logical operations.

3. MySQL master-slave replication

Master-slave replication refers to transferring DDL and DML operations in the master database to the slave database through binary logs, and then re executing (redoing) these logs, so that the data in the slave database is consistent with the master database. MySQL supports one-way and asynchronous replication. During the replication process, one server acts as the master server and one or more other servers act as slave servers.

Role of master-slave replication

  1. When there is a problem with the master database, you can switch to the slave database;
  2. It can separate reading and writing at the database level to realize load balancing;
  3. Real time data backup can be performed on the slave database.

Principle of master-slave replication

The master-slave replication of MySQL is an asynchronous replication process (generally it feels real-time). The data will be copied from one MySQL database (Master) to another MySQL database (slave). The whole master-slave replication process is completed with the participation of three threads, There are two threads (SQL thread and I / O thread) on the slave side and another thread (I / O thread) on the master side.

  1. Master side: turn on the binlog recording function – record all statements that have changed the database data and put them into the binlog of the master;

  2. Slave end: start an I / O thread – responsible for pulling binlog content from the master and putting it into its own relay log;

  3. Slave side: SQL execution thread – read the relay log and execute the SQL events in the log in sequence.

Specific examples are as follows:

[document description]
Main library:
Binlog binary log file
From library:
Relaylog relay log
master. Info main library information file
relaylog. Info relaylog application information
Main library:
Binlog_ Dump Thread : DUMP_ T – binary dump thread, which sends binary logs to the slave library after the master-slave connection
From library:
slave_ IO_ Thread : IO_ T – receive the master database and request the master database to update binlog
slave_ SQL_ Thread : SQL_ T – read the relay log from the library and execute the events in the log

Some summary of database

  1. Execute the change master to command from the library (connection information of the main library + starting point of replication)
  2. The above information will be recorded from the library to the master Info file
  3. Execute the start slave command from the library to start IO immediately_ T and SQL_ T
  4. Slave library IO_ T. Read Master Get the location information of IP, port, user, pass and binlog from the information in the info file
  5. Slave library IO_ T requests to connect to the main library, which provides a dump_ T. Responsible for and IO_ T interaction
  6. IO_ T requests a new binlog from the main database according to the location information or (gtid) of the binlog
  7. Main library through dump_ T sends the latest binlog to the IO of the slave library through the network TP_ T
  8. IO_ T receives the new binlog log, stores it in the TCP / IP cache, immediately returns the ack to the main database and updates the master info
  9. IO_ T dumps the data in TCP / IP cache to disk relaylog
  10. SQL_ T read relay Info to get the location information of the relaylog that has been applied last time
  11. SQL_ T will play back the latest relaylog according to the last location point and update the relay again Info information
  12. The application relaylog will be automatically cleared from the library to avoid taking up too much disk space

Read write separation based on master-slave replication

The implementation of MySQL read-write separation is mainly based on master-slave replication. Through routing, the application’s write requests to the database are only made on the master, and the read requests are made on the slave.

Specifically, there are four implementation schemes:

Scheme 1: proxy based on MySQL

An agent layer is added between the application and the database. The agent layer receives the application’s request for the database and forwards it to different instances according to different request types (i.e. read or write). It can realize load balancing while realizing the separation of read and write. The most common MySQL proxies are MySQL proxy, Cobar, MYCAT, Atlas, etc.

Scheme 2: Based on intra application routing

The method based on intra application routing is to implement it in the application and execute SQL for different instances of different request types.

The specific implementation can be based on spring AOP: AOP is used to intercept the Dao layer method of spring project. The type to be executed can be determined according to the method name, and then the master-slave data source can be switched dynamically.

Scheme 3: JDBC driving mode based on MySQL connector Java

By configuring the addresses of the master and slave libraries in the JDBC connecting mysql, the Java program will automatically send the read request to the slave library and the write request to the master library. In addition, the jdbc driver of MySQL can also realize the load balancing of multiple slave libraries.

Scheme 4: sharding JDBC based approach

Sharding sphere is a powerful middleware that separates reading from writing and separates tables and databases. Sharding JDBC is the core module of sharding sphere.

IV Database transaction

Database transaction is a mechanism and a sequence of operations, which contains a set of database operation commands. The execution result must change the database from one consistency state to another consistency state. Transaction submits or cancels the operation request to the system as a whole, that is, this group of database commands are either executed or not executed. Therefore, transaction is an inseparable working logic unit. If any operation fails, the whole group of operations will fail and return to the state before the operation or the previous node.

Therefore, transactions are maintainedLogical data consistencyandRecoverabilityAn important weapon. Lock is the key to realize transaction, which can ensure the integrity and concurrency of transaction

Acid rule:

  • Atomicity: as the name suggests, the smallest processing unit of a transaction can no longer be divided, either all or none
  • Consistency: data integrity must be consistent before and after a transaction
  • Isolation: different concurrent transactions are independent of each other and will not be interfered with each other
  • Persistence: data changes are persistent and not affected by failures

Transaction status:

Transactions experience different states throughout their lifecycle

  • Active state: the initial state of all executing transactions.
  • Partial submission status: due to what it has donechangestore inBuffer of main memoryTherefore, it is called partial submission
  • Failure status: if a check fails in the active state, some errors occur in the active state or partial commit state, and the transaction cannot be further executed, the transaction enters the failed state.
  • Abort state: if any transaction has reached the failed state, the recovery manager rolls back the database to the original state of execution.
  • Submission status: if all operations are successfully executed, then fromPartial submission statusThe transaction entered the commit state. Cannot rollback from this state, it is a new stateConsistent state

Transaction interaction

  • Dirty read

One transaction read uncommitted data from another transaction.

  • Non repeatable read

That is, within a transaction scope, two identical queries will return two different data, because other transactions have modified the data in this interval.

  • Phantom read

Phantom reading refers to a phenomenon that occurs when a transaction is not executed independently. For example, a transaction modifies the data in the table. This modification involves all data rows in the table. At the same time, the first transaction also modifies the data in the table. This modification is to insert a new row of data into the table. Then, the user of the first transaction finds that there are no modified data rows in the table, which is like an illusion.

  • Lost update

Two transactions read the same record at the same time. Transaction a modifies the record first, and transaction B also modifies the record (B does not know that a has modified it). When B submits data, its modification result overwrites the modification result of a, resulting in the loss of update of transaction a.

Isolation level:

In order to avoid the interaction between the above transactions as much as possible and achieve the four characteristics of transactions, the SQL standard defines four different transaction isolation levels, that is, the reading depth level of concurrent transactions on the same resource, from low to high, followed by read-uncommitted, read-committed The four levels of repeatable-read and serializable correspond to the interaction between transactions as follows:
Some summary of database

  • Read uncommitted

At the lowest isolation level, one transaction can read the uncommitted results of another transaction, and all concurrent transaction problems will occur.

  • Read committed

Only after the transaction is committed, its update results will be seen by other transactions, which can solve the problem of dirty reading, but non repeatable reading or phantom reading may still occur. Oracle uses this isolation level by default.

  • Repeatable reading

In a transaction, the reading result of the same data is always the same, regardless of whether other transactions operate on the data and whether the transaction is committed, unless the data is modified by its own transaction. It can solve dirty reading and non repeatable reading. MySQL adopts the repeatable read isolation level by default.

  • Serializable

The transaction is executed serially. The isolation level is the highest and completely obeys acid, sacrificing the concurrency of the system. In other words, all transactions are executed one by one, so all problems of concurrent transactions can be solved.

V Lock mode

Classification of locks:

Shared lock(S) : tell him to read the lock again. Data can be read concurrently, but data cannot be modified. That is, when there is a shared lock on the data resource, all transactions cannot modify the data until the data is read and the shared lock is released.

Exclusive lock(10) : also called exclusive lock and write lock. When adding, deleting and modifying data resources, other transactions are not allowed to operate this resource until the exclusive lock is released, so as to prevent multiple operations on the same resource at the same time.

Update lock(U) : the lock mode to prevent deadlock. When two transactions read and then modify a data resource, deadlock sometimes occurs when using shared lock and exclusive lock, while deadlock can be avoided by using update lock.
The update lock of a resource can only be assigned to one transaction at a time. If the resource needs to be modified, the update lock will become an exclusive lock, otherwise it will become a shared lock.

Intent lock: indicates that SQL server needs to obtain shared locks or exclusive locks on some underlying resources in the hierarchy. For example, a shared intent lock placed at the table level indicates that the transaction intends to place a shared lock on a page or row in the table. Setting an intent lock at the table level prevents another transaction from subsequently acquiring an exclusive lock on the table containing that page.
Intentional locks can improve performance because SQL server only checks intentional locks at the table level to determine whether transactions can safely acquire locks on the table, rather than checking locks on each row or page in the table to determine whether transactions can lock the entire table.
Intention lock includes intention sharing (is), intention exclusive (IX) and intention exclusive sharing (six).

Architecture lock: used when performing operations that depend on the table schema. The types of schema locks are: schema modification (sch-m) and schema stability (sch-s). Schema modification locks are used when performing data definition language (DDL) operations of tables (such as adding columns or removing tables). Schema stability locks are used when compiling queries.

Mass update lock(BU): used when bulk copying data into a table and specifying the tablock prompt. The bulk update lock allows processes to concurrently bulk copy data to the same table, while preventing other processes that do not bulk copy data from accessing the table.

Relationship to transaction isolation level:

stayRead uncommittedAt the isolation level, read dataunwantedAdd a shared lock so that it will not conflict with the exclusive lock on the modified data;

stayRead committedRead operation at isolation levelneedAdd a shared lock, but after the statement is executedreleaseShared lock;

stayRepeatable readingRead operation at isolation levelneedShare lock, but not before the transaction is committedNo releaseShared lock, that is, the shared lock must be released after the transaction is completed;

SerializableIs the most restrictive isolation level because it locks the entire range of keys and holds the lock until the transaction completes.

Ways to resolve deadlocks

  • If different programs access multiple tables concurrently, try to agreeAccess tables in the same order, which can greatly reduce the chance of deadlock;
  • In the same transaction, do as much as possibleLock all resources required at once, reduce the probability of deadlock;
  • For business parts that are prone to deadlock, you can try to useUpgrade lock granularity, passTable lockTo reduce the probability of deadlock.

Optimistic lock and pessimistic lock

  • Pessimistic lock: it is assumed that concurrency conflicts will occur, and all operations that may violate data integrity are shielded. After querying the data, the transaction is locked until the transaction is committed. For long transactions, this may seriously affect the concurrent processing ability of the system. Implementation: use the lock mechanism in the database.

  • Optimistic lock: assuming that there is no concurrency conflict, only check whether there is a violation of data integrity when submitting the operation. Optimistic locking is suitable for application scenarios with more reads and less writes, which can improve throughput. Implementation method: generally, version number mechanism or CAS algorithm will be used

IV Redis

1. Simple understanding of redis

Redis, fully known as remote dictionary server, is essentially a key value type memory database. The whole database is loaded in memory for operation. The database data is written to disk or the modification operation is written to the additional record file through asynchronous operation regularly, and master-slave synchronization is realized on this basis. It supports various types of stored values, including string, list, set, Zset and hash. These data types support push / pop, add / remove, intersection, union, difference and richer operations, and these operations are atomic.

The main disadvantage of redis is that the database capacity is limited by physical memory and cannot be used for high-performance reading and writing of massive data. Therefore, the suitable scenarios for redis are mainly limited to high-performance operations and operations with a small amount of data. Although the redis file event processor runs in a single thread mode, it canI / O multiplexerTo listen to multiple sockets, the file event processor not only realizes the high-performance network communication model, but also can well connect with other modules in redis server that also run with single thread, which maintains the simplicity of single thread design in redis.

2. Redis data type

  1. String is the most basic data type of redis. It is binary safe and can contain any data, such as JPG pictures or serialized objects. The maximum storage capacity is 512 MB.
  2. Hash (hash) is a collection of key value pairs (key = > value), which is especially suitable for storing objects.
  3. List. Redis list is a simple string list. It is sorted according to the insertion order. You can add an element to the head (left) or tail (right) of the list.
  4. Set is an unordered set of string type. It is implemented through hash table. The complexity of addition, deletion and search operations is O (1).
  5. Sorted set, like set, is also a collection of string type elements and does not allow elements to be repeated. The difference is that each element is associated with a double type score (repeatable), which is used to sort the members of the set from small to large.

3. Elimination strategy of redis

LRU algorithm is still very famous. I haven’t learned much about others

  • No eviction: when the maximum memory limit is reached without deleting the policy, if more memory is needed, the error message will be returned directly;
  • Allkey LRU: randomly select multiple keys from the hash table (server. DB [i]. Dict) of all keys, and then use LRU algorithm to eliminate the least recently used data in the selected keys;
  • Volatile LRU: randomly select multiple keys from the hash table with expiration time set (server. DB [i]. Expires), and then use LRU algorithm to eliminate the least recently used data in the selected keys;
  • Volatile random: randomly select keys from the hash table (server. DB [i]. Expires) with expiration time set;
  • Allkey random: randomly select data from all key hash tables (server. DB [i]. Dict);
  • Volatile TTL: randomly select multiple keys from the hash table (server. DB [i]. Expires) with the expiration time set, and then select the data with the shortest remaining time among the selected keys to eliminate it.

summary

Database is the only way for every Summoner to win the throne as soon as possible.