Introduction and overview of NoSQL


1.1.1 Why use NoSQL in the Internet era

1 Single-machine MySQL

In the 1990s, the number of visitors to a website was generally small, and it could be easily handled with a single database.

At that time, static web pages were more common and dynamic, interactive websites were fewer.

Under this architecture, what are the bottlenecks of data storage?
 1. The total data volume no longer fits on one machine.
 2. The data index (B+ tree) no longer fits in one machine's memory.
 3. The access volume (mixed reads and writes) is more than one instance can bear.

When any one of these three conditions holds, the architecture has to evolve.

2 Memcached + MySQL + vertical splitting

Later, as traffic grew, almost every website built on MySQL began to hit database performance problems. Web programs could no longer focus only on functionality; they also had to pursue performance. Programmers began to use caching to relieve pressure on the database and to optimize table structures and indexes. At first, the popular approach was a file cache. As traffic kept growing, however, multiple web machines could not share a file cache, and a large number of small cache files added heavy IO pressure. At that point, memcached naturally became a very fashionable technology.

Memcached, as an independent distributed cache server, provides a shared, high-performance cache service for multiple web servers. On top of the memcached servers, a hash algorithm was first used to spread keys across multiple memcached instances. Consistent hashing then appeared to solve the weakness of rehashing, where adding or removing a cache server invalidates a large share of the cache at once.
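The hashing scheme described above, usually called consistent hashing, can be sketched as follows. This is a minimal illustration rather than memcached's or any client library's actual implementation: the class name `ConsistentHashRing`, the server names, and the choice of 100 virtual nodes per server are all assumptions made for the example. It compares how many keys change servers when a fifth cache server is added, against plain `hash(key) mod N`.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring using MD5 (stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hash ring with virtual nodes; names and defaults are illustrative."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            self._points.remove(point)
            del self._owner[point]

    def lookup(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


def modulo_lookup(key, servers):
    """Naive scheme: hash(key) mod N."""
    return servers[_hash(key) % len(servers)]


keys = [f"user:{i}" for i in range(10_000)]
old_servers = ["cache-a", "cache-b", "cache-c", "cache-d"]
new_servers = old_servers + ["cache-e"]

ring = ConsistentHashRing(old_servers)
before = {k: ring.lookup(k) for k in keys}
ring.add("cache-e")
moved_ring = sum(1 for k in keys if ring.lookup(k) != before[k])

moved_modulo = sum(
    1 for k in keys
    if modulo_lookup(k, old_servers) != modulo_lookup(k, new_servers)
)

print(f"consistent hashing moved {moved_ring / len(keys):.0%} of keys")
print(f"modulo hashing moved {moved_modulo / len(keys):.0%} of keys")
```

With the ring, only keys that now belong to the new server move; with modulo hashing, most keys land on a different server, so most cached entries are effectively lost.
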

3 MySQL master-slave replication with read/write splitting

As write pressure on the database grew, memcached could relieve only the read pressure. Most websites began to use master-slave replication to separate reads from writes, improving read and write performance and the scalability of the read databases. MySQL's master-slave mode became the standard configuration for websites at this stage.
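The read/write splitting described above can be sketched as a small router that sends writes to the master and spreads reads over the slaves. This is only an illustration under stated assumptions: `ReadWriteRouter`, the server names, and the verb list are hypothetical, and real deployments usually do this in a proxy or database driver.

```python
import itertools


class ReadWriteRouter:
    """Chooses a server for each SQL statement; server names are placeholders."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)  # round-robin over replicas

    def route(self, sql):
        # Classify the statement by its first keyword.
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._replicas)


router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
statements = [
    "INSERT INTO orders (id, total) VALUES (17, 77.5)",
    "SELECT total FROM orders WHERE id = 17",
    "SELECT COUNT(*) FROM orders",
    "UPDATE orders SET total = 80 WHERE id = 17",
]
routes = [router.route(sql) for sql in statements]
for sql, target in zip(statements, routes):
    print(f"{target:10} <- {sql}")
```

Because replication is asynchronous, a read routed to a slave immediately after a write may still return the old value; applications that cannot tolerate this usually read from the master for that request.
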

4 Table and database sharding + horizontal splitting + MySQL Cluster

With memcached caching, MySQL master-slave replication, and read/write splitting in place, the write pressure on the MySQL master became the next bottleneck, and data volume kept soaring. Because MyISAM uses table locks, it suffers serious lock contention under high concurrency, so many high-concurrency MySQL applications switched from MyISAM to the InnoDB engine.

At the same time, splitting tables and databases (sharding) became a popular way to relieve write pressure and data growth. Sharding became a hot technology, a common interview topic, and a widely discussed issue in the industry. Around this time MySQL introduced table partitioning, which was not yet very stable but gave hope to companies with average technical strength. MySQL also released MySQL Cluster, but its performance could not meet the demands of Internet workloads; what it mainly offered was a strong guarantee of high reliability.
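As a rough sketch of the table and database splitting described above, the function below maps a user id to one of several databases and tables with a stable hash. The database and table names, the counts (4 databases with 8 tables each), and the use of CRC32 are assumptions chosen for illustration, not a scheme taken from the original text.

```python
import zlib
from collections import Counter

NUM_DATABASES = 4        # hypothetical: 4 physical databases
TABLES_PER_DATABASE = 8  # hypothetical: 8 order tables in each database


def locate(user_id: int):
    """Return (database, table) for a user id using a stable CRC32 hash."""
    total_slots = NUM_DATABASES * TABLES_PER_DATABASE
    slot = zlib.crc32(str(user_id).encode("ascii")) % total_slots
    db_index, table_index = divmod(slot, TABLES_PER_DATABASE)
    return f"shop_db_{db_index}", f"orders_{table_index}"


# Every row of a given user always lands in the same place.
print(locate(1136))

# Different users spread over the databases.
per_database = Counter(locate(user_id)[0] for user_id in range(10_000))
for name, count in sorted(per_database.items()):
    print(name, count)
```

The routing key matters: queries that filter by `user_id` touch one shard, while queries on any other column have to fan out to every shard.
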

5 MySQL's scalability bottleneck

MySQL databases also often store large text fields, which makes the tables very large and makes database recovery slow. For example, 10 million records of 4 KB text each add up to roughly 40 GB. If this data can be moved out of MySQL, the MySQL database becomes much smaller. Relational databases are very powerful, but they do not cope well with every application scenario. Poor scalability (which takes complex technology to achieve), high IO pressure under big data, and difficult table structure changes are exactly the problems that developers using MySQL face today.

6 What does it look like today?

7 Why NoSQL

Today, third-party platforms (such as Google and Facebook) make it easy to access and collect data. Users' personal information, social networks, geographic locations, user-generated data, and operation logs have multiplied. SQL databases are no longer a good fit for mining this user data, whereas NoSQL databases have developed to handle such big data well.

1.1.2 What is NoSQL

NoSQL (NoSQL = Not Only SQL) means "not only SQL". It generally refers to non-relational databases. With the rise of Web 2.0 sites, traditional relational databases struggled to cope, especially with the very large scale and high concurrency of purely dynamic SNS-type Web 2.0 sites, and exposed many problems that were hard to overcome, while non-relational databases developed rapidly thanks to their own characteristics. NoSQL databases were created to meet the challenges of large-scale data sets and multiple data types, especially big data applications, including the storage of very large amounts of data.
(Google or Facebook, for example, collect trillions of bits of data for their users every day.) This kind of data storage does not need a fixed schema and can scale horizontally without extra operations.

1.1.3 What it can do

Easy to scale

There are many kinds of NoSQL databases, but they share one feature: they drop the relational features of relational databases.

Because there are no relationships between the data, it is very easy to scale out. This also brings scalability at the architecture level.

Large data volume and high performance

NoSQL databases have very high read and write performance, and they keep performing well even with large data volumes.

This comes from their lack of relations and their simple database structure.
In general, MySQL uses a query cache, which is invalidated every time the table is updated. It is a coarse-grained cache, so in Web 2.0 applications with frequent interaction its hit rate is low. A NoSQL cache works at the record level. It is a fine-grained cache, so NoSQL performs much better at this level.
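The difference between a table-level cache and a record-level cache described above can be illustrated with a toy simulation. Both cache classes and the workload are hypothetical and deliberately simple; they are not MySQL's query cache or any NoSQL product's cache, only a sketch of the invalidation granularity.

```python
class TableLevelCache:
    """Caches results per table; any write to the table clears all of them."""

    def __init__(self):
        self._results = {}  # table -> {key: cached result}

    def get(self, table, key):
        return self._results.get(table, {}).get(key)

    def put(self, table, key, value):
        self._results.setdefault(table, {})[key] = value

    def on_write(self, table, key):
        self._results.pop(table, None)  # the whole table is invalidated


class RecordLevelCache:
    """Caches individual records; a write invalidates only that record."""

    def __init__(self):
        self._records = {}

    def get(self, table, key):
        return self._records.get((table, key))

    def put(self, table, key, value):
        self._records[(table, key)] = value

    def on_write(self, table, key):
        self._records.pop((table, key), None)


def hit_rate(cache, operations):
    hits = reads = 0
    for op, key in operations:
        if op == "write":
            cache.on_write("users", key)
            continue
        reads += 1
        if cache.get("users", key) is not None:
            hits += 1
        else:
            cache.put("users", key, f"row-{key}")  # simulate a database load
    return hits / reads


# Synthetic workload: 50 users read in rotation, one write after every 10 reads.
operations = []
for i in range(1000):
    operations.append(("read", i % 50))
    if i % 10 == 9:
        operations.append(("write", i % 50))

table_rate = hit_rate(TableLevelCache(), operations)
record_rate = hit_rate(RecordLevelCache(), operations)
print(f"table-level hit rate:  {table_rate:.1%}")
print(f"record-level hit rate: {record_rate:.1%}")
```

In this workload every write wipes the table-level cache before any key is read again, while the record-level cache loses only the rows that were actually written.
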
Diverse and flexible data models

NoSQL does not require fields to be defined in advance for the data to be stored, and it can store custom data formats at any time. In a relational database, adding and removing fields is very troublesome. On a very large table, adding a field is a nightmare.
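A small sketch of the contrast described above, using Python's built-in `sqlite3` for the relational side and a plain dictionary of JSON strings as a stand-in for a document store. The table name, field names, and the dictionary-based "collection" are assumptions for illustration only.

```python
import json
import sqlite3

# Relational side: a column must exist before any row can use it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
try:
    conn.execute("INSERT INTO product (id, name, color) VALUES (1, 'mug', 'red')")
    insert_failed = False
except sqlite3.OperationalError:
    insert_failed = True  # the table has no "color" column yet

conn.execute("ALTER TABLE product ADD COLUMN color TEXT")  # schema change first
conn.execute("INSERT INTO product (id, name, color) VALUES (1, 'mug', 'red')")

# Document side: each record simply carries its own fields.
collection = {}  # hypothetical stand-in for a document collection
collection[1] = json.dumps({"id": 1, "name": "mug"})
collection[2] = json.dumps(
    {"id": 2, "name": "shirt", "color": "red", "sizes": ["S", "M"]}
)

print("insert before ALTER TABLE failed:", insert_failed)
print("fields of document 2:", sorted(json.loads(collection[2])))
```

The flexibility has a cost: with no schema in the database, the application has to cope with records that lack a field or use it differently.
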
Traditional RDBMS vs NoSQL

RDBMS

  • Highly organized, structured data
  • Structured Query Language (SQL)
  • Data and relationships are stored in separate tables
  • Data manipulation language
  • Strict consistency
  • Basic transactions

NoSQL

  • Stands for "not only SQL"
  • No declarative query language
  • No predefined schema
  • Key-value stores, column stores, document stores, graph databases
  • Eventual consistency rather than ACID properties
  • Unstructured and unpredictable data
  • CAP theorem
  • High performance, high availability, and scalability

1.1.4 Where to get Redis


1.1.5 How to use it: KV


1.2 3V + 3 highs

1.2.1 The 3Vs of the big data era

  Massive volume
  Diverse variety
  Real-time velocity

1.2.2 The 3 highs of Internet demand

  High concurrency
  High scalability
  High performance

1.3 Classic NoSQL applications today

Architecture development history
Evolution process

The fifth generation

Mission of the fifth-generation architecture

What concerns us: storing multiple data sources and multiple data types

1 Basic product information

Name, price, delivery date, manufacturer, and so on
  Relational database: Taobao is currently going "de-O" (removing Oracle). Note that the MySQL used inside Taobao has been modified by its own senior engineers.

Why de-IOE

In 2008, Wang Jian joined Alibaba as the group's chief architect and is now its chief technology officer. Jack Ma gave the former executive vice president of Microsoft Research Asia this brief: help Alibaba Group build a world-class technical team, and take responsibility for the group's technical architecture and infrastructure technology platform.

After joining Alibaba, Wang Jian, with his technical background and scholarly style, proposed the idea of "de-IOE" (removing IBM minicomputers, Oracle databases, and EMC storage devices from IT construction) within Alibaba Group, and began to build the essence of cloud computing into Alibaba's IT DNA.

Wang Jian sums up the relationship between the "de-IOE" campaign and Alibaba Cloud as follows: "de-IOE" completely changed the foundation of Alibaba Group's IT architecture, and it is the basis on which Alibaba embraced cloud computing and began to offer computing services. The essence of "de-IOE" is distribution, which makes it possible to use commodity PC architecture everywhere and is the first condition for cloud computing to take hold.

2 Product descriptions, details, and reviews (text-heavy)

Text-heavy descriptive information has poor IO read/write performance in a relational database
  Stored in the document database MongoDB

3 Product images

Product image display
  Distributed file systems:
  Taobao's own TFS
  Google's GFS
  Hadoop's HDFS

4 Product keywords

Search engine, used internally at Taobao

5 Periodically hot, high-frequency product information

In-memory database

6 Product transactions, price calculation, and loyalty points

External systems, external third-party payment interfaces

Summary: difficulties and solutions for large Internet applications (big data, high concurrency, diverse data types)
Difficulties:
  Diversity of data types
  Diversity of data sources, and refactoring when they change
  Changing a data source should not require large-scale refactoring of the data service platform
Solution:

Draw a diagram to introduce EAI and the unified data platform service layer
What did Alibaba and Taobao do? UDSL

What’s this:

What does it look like?



Hotspot cache:

1.4 Introduction to NoSQL data models

1.4.1 An e-commerce model of customers, orders, order items, and addresses

How would you design it in a traditional relational database?

ER diagram (1:1, 1:N, and N:N relationships, primary and foreign keys, and so on):

How would you design it in NoSQL?
What is BSON?

BSON is a binary storage format similar to JSON, short for Binary JSON. Like JSON, it supports embedded document objects and array objects. The data model built with BSON looks like this:

{
  "customer": {
    "id": 1136,
    "name": "Z3",
    "billingAddress": [{ "city": "beijing" }],
    "orders": [
      {
        "id": 17,
        "customerId": 1136,
        "orderItems": [
          { "productId": 27, "price": 77.5, "productName": "thinking in java" }
        ],
        "shippingAddress": [{ "city": "Beijing" }],
        "orderPayment": [
          {
            "ccinfo": "111-222-333",
            "txnid": "asdfadcd334",
            "billingAddress": { "city": "Beijing" }
          }
        ]
      }
    ]
  }
}

Comparing the two designs, the problems and difficulties are as follows, and the aggregate model can be used to deal with them:

Join queries are not recommended under high concurrency. Internet companies use redundant data to avoid joins.

Distributed transactions cannot sustain a high level of concurrency.

Think about how you would query this in the relational model. With the newly designed BSON document, the query becomes much simpler.
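To make the comparison above concrete, the sketch below answers the same question ("which products did customer 1136 order, and at what price?") in both designs. The document side is a Python dictionary holding part of the customer document shown earlier; the relational side uses the built-in `sqlite3` module with three hypothetical tables (`customer`, `orders`, `order_item`) whose names and columns are assumptions for illustration.

```python
import sqlite3

# Aggregate (document) form: one lookup returns the customer with its orders.
customer_doc = {
    "id": 1136,
    "name": "Z3",
    "orders": [
        {
            "id": 17,
            "orderItems": [
                {"productId": 27, "price": 77.5, "productName": "thinking in java"}
            ],
        }
    ],
}
doc_items = [
    (customer_doc["name"], item["productName"], item["price"])
    for order in customer_doc["orders"]
    for item in order["orderItems"]
]

# Relational form: the same answer needs a three-table join.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_item (order_id INTEGER, product_id INTEGER,
                             product_name TEXT, price REAL);
    INSERT INTO customer VALUES (1136, 'Z3');
    INSERT INTO orders VALUES (17, 1136);
    INSERT INTO order_item VALUES (17, 27, 'thinking in java', 77.5);
    """
)
sql_items = conn.execute(
    """
    SELECT c.name, i.product_name, i.price
    FROM customer c
    JOIN orders o ON o.customer_id = c.id
    JOIN order_item i ON i.order_id = o.id
    WHERE c.id = ?
    """,
    (1136,),
).fetchall()

print("document:", doc_items)
print("relational:", sql_items)
```

Both produce the same rows. The document form reads one aggregate and walks it in memory, which is why it suits high-concurrency reads; the price is redundancy, since the product name and price are copied into every order.
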
1.4.2 Aggregate models

KV (key-value)
Column family

As the name implies, data is stored by column. Its biggest strengths are that it is convenient for storing structured and semi-structured data, that it compresses well, and that queries touching one column or a few columns have a large IO advantage.
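The two advantages of column storage mentioned above, reading only the columns a query needs and compressing well, can be sketched with plain Python lists. The sample rows and the run-length encoding helper are illustrative assumptions; real column stores such as HBase or Cassandra use their own on-disk formats.

```python
from itertools import groupby

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "beijing", "age": 31},
    {"id": 2, "city": "beijing", "age": 25},
    {"id": 3, "city": "beijing", "age": 40},
    {"id": 4, "city": "shanghai", "age": 28},
]

# Column-oriented layout: one list per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A query on one column reads only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values of the same column are similar, so they compress well.
encoded_city = run_length_encode(columns["city"])

print("average age:", average_age)
print("encoded city column:", encoded_city)
```
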


1.5 Four categories of NoSQL databases

1.5.1 KV (key-value): typical examples

Sina: BerkeleyDB + Redis
Meituan: Redis + Tair
Alibaba and Baidu: Memcache + Redis

1.5.2 Document databases (mostly BSON format): typical examples

MongoDB is a database based on distributed file storage, written in C++. It aims to provide a scalable, high-performance data storage solution for web applications.
   MongoDB sits between relational and non-relational databases. Among non-relational databases, it is the most feature-rich and the most like a relational database.

1.5.3 Column storage databases

Cassandra, HBase
Distributed file systems

1.5.4 Graph databases

These store relationships, not graphics, for example friend-circle social networks and ad recommendation systems. The focus is on building a relationship graph.
Neo4j, InfoGrid

1.5.5 Comparison of the four

1.6 CAP + BASE in distributed databases

1.6.1 What is traditional ACID

Relational databases follow the ACID rules.

A database transaction is similar to a transaction in the real world. It has the following four properties:

1. A (atomicity)

Atomicity means that the operations in a transaction either all complete or none do. A transaction succeeds only if every operation in it succeeds; if any one operation fails, the whole transaction fails and must be rolled back. For example, a bank transfer of 100 yuan from account A to account B has two steps: 1) withdraw 100 yuan from account A; 2) deposit 100 yuan into account B. Both steps must complete together or not at all. If only the first step completes and the second fails, 100 yuan simply disappears.

2. C (consistency)

Consistency means that the database is always in a consistent state, and running a transaction does not violate the database's existing consistency constraints.

3. I (isolation)

Isolation means that concurrent transactions do not affect each other. If the data one transaction reads is being modified by another transaction, then until that other transaction commits, the data the first one reads is unaffected by the uncommitted changes. For example, while a transaction that transfers 100 yuan from account A to account B is still uncommitted, B will not see the extra 100 yuan when querying the account.

4. D (durability)

Durability means that once a transaction is committed, its changes are saved permanently in the database and are not lost even if the system crashes.

A (atomicity)
C (consistency)
I (isolation)
D (durability)
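
The bank-transfer example given under atomicity can be tried with Python's built-in `sqlite3` module. This is a minimal sketch: the `account` table, the `transfer` function, and the `fail_midway` flag that simulates a failure between the two steps are all hypothetical names introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()


def transfer(conn, source, target, amount, fail_midway=False):
    """Move money in one transaction; fail_midway simulates a crash after step 1."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            if fail_midway:
                raise RuntimeError("simulated failure before the deposit")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


failed = transfer(conn, "A", "B", 100, fail_midway=True)
after_failure = balances(conn)   # the withdrawal was rolled back
succeeded = transfer(conn, "A", "B", 100)
after_success = balances(conn)   # both steps were applied together

print(failed, after_failure)
print(succeeded, after_success)
```

After the simulated failure, the withdrawal is rolled back and neither balance changes, so the 100 yuan never goes missing.
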

1.6.2 CAP

C: consistency
A: availability
P: partition tolerance

1.6.3 CAP: pick 2 of 3

The CAP theorem says that a distributed storage system can achieve at most two of the three properties above.

However, current network hardware inevitably suffers from delay and packet loss, so partition tolerance is something we have to achieve.
Therefore, we can only trade off between consistency and availability. No NoSQL system can guarantee all three at the same time.
C: strong consistency; A: high availability; P: partition tolerance
CA: traditional Oracle databases
AP: the choice of most website architectures
CP: Redis, MongoDB
Note: trade-offs must be made in a distributed architecture.

There is a balance to strike between consistency and availability. In fact, most web applications do not need strong consistency.

So giving up C in exchange for availability, while keeping P, is the direction of current distributed database products.

Choice between consistency and availability

For Web 2.0 sites, many of the main features of relational databases are often unnecessary.

Database transaction consistency requirements

Many real-time web systems do not require strict database transactions and have low requirements for read consistency. In some cases, the requirements for write consistency are not high either. Eventual consistency is acceptable.

Real-time requirements for reads and writes

In a relational database, a query issued right after inserting a row is guaranteed to return that row, but many web applications do not need such strict real-time behavior. For example, after a user posts a message, it is perfectly acceptable for subscribers to see the update a few seconds or even ten-odd seconds later.

Requirements for complex SQL queries, especially multi-table joins

Any web system with a large amount of data strongly avoids joins across multiple large tables and complex analytical report queries. SNS-type sites in particular avoid them from the requirements and product-design stage. Most queries are primary-key lookups on a single table or simple paged queries with conditions on a single table, so the power of SQL is greatly diminished.

1.6.4 The classic CAP diagram

The core of CAP theory is that a distributed system cannot satisfy consistency, availability, and partition tolerance all at the same time. At most two can be satisfied at once.
Therefore, according to the CAP principle, NoSQL databases can be divided into three categories: those that satisfy CA, those that satisfy CP, and those that satisfy AP.

CA: single-site clusters; systems that satisfy consistency and availability, usually not very strong in scalability.
CP: systems that satisfy consistency and partition tolerance, usually with lower performance.
AP: systems that satisfy availability and partition tolerance, usually with lower requirements for consistency.
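
The difference between the CP and AP choices above shows up once the network is partitioned. The toy model below has two replicas and a switchable link; `Cluster`, `Replica`, and the `mode` flag are hypothetical names, and the model ignores everything a real system needs (quorums, timeouts, conflict resolution). It only shows which property each choice gives up.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas with a switchable network link; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.left, self.right = Replica("left"), Replica("right")
        self.partitioned = False

    def write(self, replica, key, value):
        peer = self.right if replica is self.left else self.left
        if self.partitioned:
            if self.mode == "CP":
                return False           # refuse: cannot replicate, stay consistent
            replica.data[key] = value  # AP: accept locally, replicas diverge
            return True
        replica.data[key] = value
        peer.data[key] = value         # healthy link: replicate synchronously
        return True


def run(mode):
    cluster = Cluster(mode)
    cluster.write(cluster.left, "stock", 10)  # replicated to both sides
    cluster.partitioned = True                # the link between replicas breaks
    accepted = cluster.write(cluster.left, "stock", 9)
    consistent = cluster.left.data == cluster.right.data
    return accepted, consistent


for mode in ("CP", "AP"):
    accepted, consistent = run(mode)
    print(f"{mode}: write accepted={accepted}, replicas consistent={consistent}")
```

The CP cluster rejects the write during the partition (losing availability), while the AP cluster accepts it and the replicas temporarily disagree (losing consistency).
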

1.6.5 What is BASE

BASE is a solution proposed to address the loss of availability caused by the strong consistency of relational databases.
   BASE is an abbreviation of the following three terms:

Basically Available
Soft state
Eventually consistent

Its idea is to let the system relax its data consistency requirements at a given moment in exchange for better overall scalability and performance. Why? Because large systems are geographically distributed and have extremely high performance requirements, distributed transactions cannot meet these targets. To meet them, another approach is needed, and BASE is that approach.
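
The "eventually consistent" part of BASE described above can be sketched with a primary copy and a replica that receives writes later. `EventuallyConsistentStore` and its methods are hypothetical names for illustration; the queue stands in for an asynchronous replication log.

```python
from collections import deque


class EventuallyConsistentStore:
    """Primary plus one replica that is updated asynchronously."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # replication log not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))  # replicate later, do not wait

    def read_from_replica(self, key):
        return self.replica.get(key)

    def replicate(self, max_entries=None):
        """Apply queued writes to the replica, oldest first."""
        count = len(self._pending)
        if max_entries is not None:
            count = min(count, max_entries)
        for _ in range(count):
            key, value = self._pending.popleft()
            self.replica[key] = value
        return count


store = EventuallyConsistentStore()
store.write("post:1", "hello")

stale = store.read_from_replica("post:1")  # replica has not caught up yet
applied = store.replicate()                # replication runs a moment later
fresh = store.read_from_replica("post:1")
converged = store.primary == store.replica

print("before replication:", stale)
print("after replication:", fresh, "converged:", converged)
```

The write returns immediately, a reader of the replica briefly sees old data (the soft state), and once replication runs both copies agree.
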

1.6.6 Introduction to distributed systems and clusters

Distributed system

A distributed system consists of multiple computers and communicating software components connected by a computer network (a local network or a wide area network). It is a software system built on top of a network. Because of the characteristics of software, a distributed system has a high degree of cohesion and transparency. Therefore, the difference between a network and a distributed system lies more in the high-level software (especially the operating system) than in the hardware. Distributed systems can run on different platforms such as PCs, workstations, LANs, and WANs.

In short:

1 Distributed: different service modules (projects) are deployed on different servers. They communicate and call each other through RPC / RMI, provide services externally, and cooperate within the group.
2 Cluster: the same service module is deployed on multiple servers, which are scheduled uniformly by distributed scheduling software to provide services and access externally.