Translated by: Yuan Yuhan
In March 1971, Intel launched the world’s first general-purpose microprocessor, [Intel 4004], which costs $60 and has about 2300 transistors.
Today, the latest iPhone has nearly 12 billion transistors and costs just over $60.
The changes brought about by 50 years are huge, but some things remain unchanged.
In the 1990s, a large number of programming languages began to be introduced, among which java was introduced in 1996. Years have passed, SQL still has its own place in many programming languages, even as popular as 50 years ago.
This article will share why relational databases are introduced and why SQL is becoming more and more popular, and think about what we can learn from it.
The early history of database management — IDs and CODASYL
In 1962, Charles W. Bachman worked on a small team at General Electric.
A year later, Charles W. Bachman’s team launched what people call the first database management system, the integrated data store (IDS).
Ten years later, Bachmann’s contribution to IDS computing won the Turing prize, which has become the Nobel Prize in computer science.
What is IDS?
In the early 1860s, computer science just began to become an academic field.
It was also during this period that American Standard Code for information exchange (ASCII) was introduced.
To understand IDS, we need to understand the two main forces driving its development
1、 Implementation of disk storage
Mobile RAMAC 305
In 1956, IBM introduced the first commercial hard disk drive: RAMAC 305.
In the past, when using the tape drive, we need to move in turn in the tape to retrieve specific data, but the introduction of disk drive enables programmers to jump directly to a location on the disk to retrieve and update data, which brings great convenience to programmers.
Only in this way, developers must be very aware of the storage location on the disk. Limited by the file management system in the early operating system, the convenience brought by the disk can only be enjoyed by experienced programmers.
So developers need a solution to simplify the use of disk drives.
2、 The migration of programming language from low level to high level
At the same time, on the adoption curve of computer science, people began to change from innovators to early adopters. In the early days, low-level programming languages such as assembly were popular, but with the further use of the language, more programmers switched to high-level programming languages such as COBOL for the sake of usability.
By this time, we may be able to guess why IDS exists. It solves the problem of disk storage and facilitates the use of high-level programming language.
IDS allows developers to use high-level programming languages (such as COBOL) to build applications that can input and retrieve data from disk storage. Therefore, IDS has won the honor as the first database management system.
CODASYL — a new standard of database management
In 1969, the Committee on data system language (CODASYL) issued a report in which database management standards were proposed. Bachmann was a member of the Committee. These standards mainly came from IDS.
The CODASYL data model introduces many core functions of the database management system we use today
- Data definition language (DDL)
- Data manipulation language (DML)
Most importantly, IDS and CODASYL introduce a new data modeling method, which affects the final development of SQL, that is network data model.
Network data model and relation model
Data model is the standard to describe and model world data.
Previous hierarchical data model used tree structure to describe data, so that the relationship was limited to one to many combination.
The new network model allows records to have multiple parents, thus creating a “graph structure”. After multiple parents exist, the network model can model many to one and many to many relationships.
In a network model, the relationships between tables are stored in a set. Each group has an owner and one or more members, just as a teacher has one or more students.
One of the main advantages of the network model is that the related records in the collection are directly connected through pointers. Collection is implemented with next, prior and owner responders, which is similar to the link of list, which is convenient for developers to consult.
The low-level nature of the network data model provides a performance advantage, but because each record must store additional pointers to its previous record and parent record, it costs us a lot to store.
The arrival of relational model
An example of relational model
In 1970, eight years after IDs was launched, Edgar F. Codd introduced the relational model in his pioneering paper “data relational model for large shared databases”, which also won him the Turing prize.
Edgar Codd showed that all data in a database can be represented by tuples (rows in SQL), which are grouped into relationships (tables in SQL). To describe database queries, he invented a first-order predicate logic called tuple relational calculus.
Tuple relational calculus introduces a declarative language for querying data. In a declarative programming language, programmers don’t need to say how to execute, they just need to say what they want to do to achieve their goals.
This new declarative language is much easier for developers. The relational model exposes all data. Developers can retrieve all data from a table or read a row with a single command (thanks to the query optimizer). From then on, people say goodbye to the days of using maze like pointers to find data.
Relationship and network data model
Normalization is the process of decomposing tables to eliminate redundancy and reduce the amount of data on disk. Relational database reduces the high storage cost of network database by normalizing data.
However, in order to deal with standardized data, relational database must load tables into memory, and use the computing power to “join” tables together, which increases the CPU cost.
By introducing how to use the relational model to facilitate teachers to find all classes and students, you will understand why there is such a high CPU cost
When the teacher starts to input the data to be searched, the database system will first retrieve all related classes, and then it will perform the second operation to retrieve the student data. In this process, all the data will be stored in memory, and a third operation will be run to merge the data before the result is returned.
Performance comparison between relational model and network model
In the case study of using actual data, raima found that the insertion performance of the network database model is 23 times that of the relational model, and the query performance is 123 times that of the relational model.
So why is relational database the leading database solution?
The relational model is more flexible to modify, and its declarative syntax simplifies the work of programmers.
Moore’s law greatly improves the performance of the computer, and the computing cost continues to decline. Finally, the computing cost of the relational model is offset by the increased productivity.
Today, 50 years later, the most expensive configuration in the data center is CPU.
The rise and rule of SQL
Going down time, we met our favorite SQL.
Four years after Codd published his paper, Donald Chamberlin and Raymond Boyce published a paper called “sequence: a structured English query language”.
They describe sequence as “a set of simple operations on table structure, equivalent to first-order predicate calculus”. IBM saw this potential and quickly developed the first version of sequence as part of system R in the early 1870s.
Later, due to a conflict with the trademark of UK hawk Sidley aircraft company, sequence was renamed SQL.
In 1986, American National Standards Institute (ANSI) and international organization for Standardization (ISO) issued the first formal SQL standard: sql-86, which is an important step after SQL has been used for ten years. The standard divides SQL into several parts
Data definition language (DDL): commands for defining and modifying schemas and relationships
Data manipulation language (DML): commands for querying, inserting, and deleting database information
Transaction control: command used to specify transaction time
Integrity: commands for setting database information constraints
Views: commands for defining views
Authorization: a command used to specify user access rights
Embedded SQL: a command that specifies how to embed SQL in other languages
From 1974 to today, many people have developed syntax, trying to compete with SQL for the market share of query language. These new grammars usually cater to the specific new technologies of the time
[Lisp-] > [CLSQL .NET-]> [LINQ] Ruby on [Rails-]> [ActiveRecord]
35 years later, SQL is still ubiquitous in databases. How can SQL maintain its dominant position as a query language? What can we learn from its story?
The secret of SQL’s 50 years of rule and what we can learn
Survey of stack overflow developers in 2017
We start this story with Bachmann and IDS, the first database management system. We discuss the transformation of disk storage and advanced programming, and how to need a new way of data processing. Then CODASYL carries out standardized database management, IDS and CODASYL introduce a new network data model, and finally Edgar Codd abandons the relational model.
All these changes have taken place in eight years.
In the next 50 years, how can SQL successfully persist? I think there are four main reasons
1、 Based on first principles
The first principle is a basic proposition that cannot be inferred from any other proposition or hypothesis. For example, the combination of hydrocarbons with oxygen to produce chemical reactions is the most basic principle for powering every car’s internal combustion engine.
In 1970, Codd created a new first principle for database: tuple relational calculus. This new logic preserves the generation of relational model and SQL. Among them, tuple relational calculus is chemical reaction, relational model is internal combustion engine, and SQL is automobile.
2、 Bushnell’s law
First principles alone can’t guarantee success. Assembler is very close to 1 and 0 that programmers can input with one click, but it is still replaced by COBOL (and later C) because it lacks usability.
We see the same story in the transformation from network to relational model. The network model has faster performance, but today, every company uses a relational database because it was very easy to use in the early days.
“The best games are always easy to play, but hard to master” — Nolan Bushnell
In this sentence, Nolan Bushnell tells the secret of getting people to use new products.
In the assembly which is difficult to learn and master, SQL has found a perfect balance. With about 10 SQL commands, anyone can get 80% learning efficiency from 20% knowledge. However, there is still a long way to go to master, index, view and optimization.
3、 Listening to feedback and adapting to the environment
Query language is not an eternal whole, but a standard group that adapts to the environment over time. Year after year, SQL standards are constantly adjusted to absorb users’ feedback and improve itself.
Starting from the initial idea, we have seen 10 different SQL standards, each of which has a very important update. Three of them are very big, listed here:
1. In 1999, regular expression matching, recursive queries (for example, passing closures), triggers, support for process and flow control statements, non scalar types (arrays) and some object-oriented functions (for example, structured types) were added. It supports embedding SQL in Java (SQL / OLB) and vice versa (SQL / JRT).
2. In 2003, XML related functions (SQL / XML), window functions, standardized sequences and columns with automatically generated values (including identity columns) were introduced.
3. In 2016, add row pattern matching, polymorphic table function, and JSON.
SQL also creates a rails architecture that other products can use. There is no mandatory syntax in SQL, it just provides a standard for each database to build its own (T-SQL, mysql, PSQL, etc.).
4、 Using API
The ultimate secret of SQL’s success is the rise of application programming interface (API). The abstraction of the underlying implementation of the API only exposes the objects and operations needed by developers, which greatly simplifies the programming.
Hadoop introduced the distributed file system (HDFS) in 2006. At first, SQL syntax could not be accessed. But in 2013, Apache created Apache impala. Since then, developers can use SQL to query HDFS database. API makes SQL adapt to new technology with its special syntax.
SQL is one of the most widely used programming languages today. It began with the dawn of modern computing, and was given life by the winner of the two Spirit Awards.
SQL can maintain its dominant position because it is based on first principles, the emergence of bushner’s law, its adaptability and the use of API.
If you have any other opinions, you can leave a comment and share with us the reasons why SQL has been successful for many years.