This article is the author’s first original article in InfoQ. It is mainly written in succession by using weekend time, and it is also considered as a recent work. Now reprinted back to their own blog, please give us more advice.
On November 4, Microsoft officially released its new generation of database product SQL Server 2019, which has brought heavy features such as big data cluster and data virtualization. This release is only two years away from the last large version of SQL Server 2017. Such an iterative speed is quite amazing for highly complex database systems. Two years ago, InfoQ published a long article, “SQL Server 2017 is officially released. How can Microsoft’s old database carry on the past and open up the future?”? 》, this time we will cooperate with the author of this article again to provide you with an in-depth understanding of the progress and features of SQL Server 2019.
Since the 21st century, the battlefield of data platform has been full of flames and splendors. The so-called generation of talented people, represented by mongodb, redis, neo4j and other NoSQL databases, and the big data solutions of hive, impala, Presto and other Hadoop systems are in the same fashion for a while. Under the impact of these young generations, relational database, as the backbone of data architecture, not only has not been defeated, but in recent years, there has been a trend of return of the king, more and more brave. In today’s design and architecture of various key systems, relational databases still occupy the core position with stable performance and rich characteristics.
SQL server is an outstanding representative of relational database and one of the “three giants” of enterprise level commercial database, which is as famous as Oracle and DB2. Decades of development and honing have made it very mature and stable, and with the development of the times, it is constantly integrated with new technology trends, which makes it very comprehensive. In particular, the last version of SQL Server 2017 has brought this legendary database into the wide Linux world, further expanding its potential customer base and usage scenarios.
We can simply review the rich features of SQL server. In the article two years ago, we mentioned that SQL server has 18 kinds of martial arts, including traditional row storage, updatable column storage, memory table, graph database, machine learning, etc. Many of these advanced features, some of which are still hard to catch up with open source databases, or are unable to integrate perfectly in the same database. This is the value of commercial database: win favor with high stability, high performance and high integration, help customers to support and solve key business problems, but also simplify the technical architecture and reduce the maintenance burden.
In just two years, Microsoft has developed and built a new SQL Server 2019 on the basis of the previous generation, which is amazing for highly complex database systems. Although the fast-paced release is related to the radical version strategy in the industry, you must also be curious about how a highly mature commercial database system can make progress in such a short period of time? In what ways have they responded to the changing market? This article will explore with you.
Combined with the new features of SQL Server 2019, we will analyze and discuss from three aspects: core engine enhancement, data virtualization and SQL Server big data cluster, which is the highlight of this version.
Core engine enhancements
Let’s start with the core engine. HTAP (hybrid transaction / analytical processing) hybrid load capacity is the trend of the database world today. SQL server is one of the industry leaders in this regard. The previous version has realized the simultaneous support of OLTP and OLAP workload through the perfect integration of row storage and column storage in a single engine. Users can not only query and connect row and column storage tables at the same time, but also add nonclustered column storage indexes to a row storage table, so that a single table can support both OLTP and OLAP working modes and query scenarios.
SQL Server 2019 continues to strengthen the support for the mixed load capacity. Through the silent improvement of embellishment, relevant engines will be further mature, and daily use will be more convenient. For example, in terms of column storage indexes, it is now allowed to create or rebuild clustered column storage indexes Online – which will greatly facilitate the maintenance and use of large column storage tables in the production environment, save storage space and improve subsequent query performance. In the production environment that I contact, there are often column storage tables that are fragmented due to the update of some rows. However, in order to ensure the continuity of online business, only relatively light reorganize command can be used for simple maintenance. This problem is expected to be completely solved after the database upgrade.
The graph data engine introduced in the previous generation of SQL Server 2017 has also been greatly enhanced in SQL Server 2019. The improvement includes not only supporting graph data table and index partition with multiple filegroups at the storage level, but also adding the extremely important support of arbitrary length pattern. Finally, users can express the jumping connection relationship of any number of nodes. Let’s take a look at an official query example for the figure diagram:
SELECT PersonName, Friends FROM ( SELECT Person1.name AS PersonName, STRING_AGG(Person2.name, '->') WITHIN GROUP (GRAPH PATH) AS Friends, LAST_VALUE(Person2.name) WITHIN GROUP (GRAPH PATH) AS LastNode FROM Person AS Person1, friendOf FOR PATH AS fo, Person FOR PATH AS Person2 WHERE MATCH(SHORTEST_PATH(Person1(-(fo)->Person2)+)) AND Person1.name = 'Jacob' ) AS JacobReach WHERE JacobReach.LastNode = 'Alice'
It is easy to understand that this query will help to determine whether Jacob and Alice are connected, and give the shortest relationship path between them, and the length of this path is uncertain. The key point of T-SQL syntax is the match clause: a shortest path method is used to find the shortest distance between two given nodes in the calculation graph. It should be noted that the input parameters of the method support the indefinite length pattern similar to the regular expression syntax. By a + sign, it is cleverly expressed that – (FO) – > person2 enclosed by the plus sign is the repeatable part through the friendof relationship for multiple continuous pathfinding.
The above features are one of the most common advanced query scenarios in graph database applications, which are mathematically called transitive closures. The addition of this feature means that SQL Server 2019 has finally entered the room in terms of graph query ability, and gradually has the strength to compete with special graph database.
Commercial databases have always paid more attention to the development of hardware field, and constantly maximize performance potential through the latest hardware. Persistent memory (often abbreviated as pmem) has become one of the hotspots of server-side hardware because of its far superior IO capability to SSD hard disk. Intel and other manufacturers have vigorously planned and developed enterprise level persistent memory hardware products such as octane DC. For this reason, SQL Server 2019 has launched the hybrid buffer pool feature, which enables persistent memory as the storage layer between DRAM memory and SSD hard disk to play a significant role in accelerating the performance of page buffer pool. After the user selects to turn on this feature, the page buffer pool can be extended to the storage space on the pmem device, and SQL server can directly access the data pages on the pmem device through memory mapped io. In many cases, this can avoid data pages from traditional disk copy to DRAM frequently, and bypass the storage protocol stack overhead of the operating system when accessing pages, so as to achieve huge performance improvement.
Persistent memory and memory mapping access (from Microsoft official documents)
It is worth mentioning that Oracle’s next generation of Oracle 20c, which will be released next year, will also provide support for persistent memory, which is similar to what the heroes of SQL Server think.
Let’s look at programming language integration. In the past, extending the functions of SQL Server was the patent of C ා /. Net. For example, users can call UDF written by. Net through CLR integration of SQL server. With the continuous development of Microsoft’s open strategy in recent years, more languages have entered the SQL server system. Two years ago, we introduced the python / R integrated environment in SQL Server 2017 to facilitate machine learning. In SQL Server 2019, Java became the main integration and support object. With the new language extensions, Java classes and methods can be executed directly on the SQL server. Users only need to implement the abstract class AbstractSqlServerExtensionExecutor of Microsoft Java extension SDK (Microsoft Extensibility SDK for Java) to enable the Java code encapsulated by itself to be invoked through sp_execute_external_script storage process in the database T-SQL context.
The new functions mentioned above are only part of the new capabilities of SQL Server 2019 engine. In fact, there are many remarkable improvements in the new version, such as the approximate aggregate function of “adopt” count “distinct, memory optimized tempdb metadata, UTF-8 character encoding support, batch mode on rowstore support for row storage, row mode memory grant feedback in row mode, etc. These features are distributed in all aspects of the storage execution engine, further improving the ability and depth of SQL server.
As mentioned earlier, supporting multiple models and multiple paradigms has become one of the important goals pursued by commercial databases in order to establish and maintain the core position in the overall data architecture of enterprises. However, in reality, heterogeneous data sources always exist objectively, so from another perspective, how to strengthen and facilitate the interconnection between heterogeneous data sources has gradually become an important consideration and evaluation standard in modern database products.
For data interconnection, it is easy to think of ETL tools like SSIS and azure data factory for timing data transmission. This is an effective method, but there are limitations such as data timeliness and data repetition. Now a more advanced concept than building ETL channel is data virtualization. So-calledData virtualizationAs the name implies, no matter where the data is stored in any specific format, it can be managed and accessed in a unified abstraction. Technically speaking, the data virtualization system with database as the core mainly points to and defines the underlying data with explicit external tables.
In SQL Server 2019, Microsoft put forward data virtualization as the core concept and main construction goal of the product, and made key support and strengthening through the built-in POLYBASE technology at the functional level. POLYBASE is not a new face in fact, it first appeared in SQL Server 2012 parallel data warehouse, serving the software and hardware integrated distributed MPP database version. POLYBASE component gives the ability of defining external tables pointing to Hadoop / HDFS data at the database level, and becomes an important bridge between relational database and Hadoop big data ecosystem. In SQL Server 2016, POLYBASE really became mature and well-known, officially appeared in the standard SQL server, greatly expanding the audience.
In addition to Hadoop and azure blob storage, SQL server, Oracle, Teradata, mongodb and ODBC support are added to the new version. If POLYBASE was only a minor subsidiary function before, it has been the core capability under the spotlight in SQL Server 2019, which emphasizes data virtualization.
Data virtualization capabilities (from Microsoft official documents)
Let’s see a simple example of using POLYBASE to configure remote mongodb data source in SQL Server 2019 to understand the landing form of data virtualization.
CREATE DATABASE SCOPED CREDENTIAL MongoCredential WITH IDENTITY = 'username', SECRET = 'password'; CREATE EXTERNAL DATA SOURCE MongoDBSource WITH ( LOCATION = 'mongodb://<server>[:<port>]', PUSHDOWN = ON, CREDENTIAL = MongoCredential ); CREATE EXTERNAL TABLE MyMongoCollection( [_id] NVARCHAR(24) NOT NULL, [column1] NVARCHAR(MAX) NOT NULL, [column2] INT NOT NULL -- ..., other columns to be mapped ) WITH ( LOCATION='dbname.collectionname', DATA_SOURCE= MongoDBSource );
As you can see, by defining the three core configurations of credential, data source and external table with T-SQL, you can easily map the sets and fields in mongodb to SQL server, and then query the virtual external table. POLYBASE even supports nested structures such as objects and arrays in mongodb, allowing complex fields to be flattened when external tables are defined. In addition, although the example here is for mongodb, if you need to connect to other types of data sources, the steps of configuration are similar, but the meaning and form of related parameters are different.
It is worth noting that the external tables supported by POLYBASE are the same as the general data tables, and can be joined with other tables, which greatly facilitates the integration between heterogeneous data sources, and in many cases can avoid the trouble of data handling. Of course, for some scenarios where direct query is inconvenient for performance reasons, simple SQL statements can also be used to synchronize external table data to SQL Server easily.
At the technical implementation level, POLYBASE is born out of the MPP architecture scene, so it actually has a good parallel expansion capability – this feature is particularly important when the remote data volume is huge, which can greatly accelerate the execution of queries. Users can set up multiple SQL server instances (divided into header nodes and calculation nodes) and group them into POLYBASE scale out groups to work together to read and process external big data in parallel. From this perspective, POLYBASE module has made SQL server have some typical characteristics of distributed analytical database.
POLYBASE scale out group architecture (from Microsoft official documents)
Another feature of POLYBASE is that it has certain query push down ability. When the remote end can support it, the query processor will send qualified predicates to the data source for nearby processing, which not only improves query performance but also reduces the burden of network io. For example, in the Hadoop oriented reading scenario, sometimes POLYBASE will choose to use MapReduce to read and filter the original file according to the statistical information, and finally only need to return part of the result data instead of the full data.
In conclusion, the concept of data virtualization and the enhancement of POLYBASE technology are expected to help the new generation of SQL Server become the center of data architecture. By combining and integrating a variety of heterogeneous data sources, SQL Server 2019 can effectively reduce the complexity of enterprise architecture, and also play an important role in application scenarios such as data hot and cold stratification, unified data Lake construction, etc.
SQL Server big data cluster
The most important feature of SQL Server 2019 is probably SQL Server big data cluster. With the bold design of creatively integrating open-source big data technology components such as Hadoop and spark directly into SQL server and integrating seamlessly under the kubernetes system, SQL Server big data cluster attracted wide attention after it was announced last year and started Limited Preview. Because everyone is very curious: how will big data, Hadoop, spark, containerization, cloud native, these hot technical words react with a traditional commercial database?
In essence, SQL Server big data cluster is not only a new feature of SQL Server 2019, but also a new product form and deployment mode. It has the following important characteristics: (1) deploy and link SQL Server in the form of multiple instances to realize distributed storage, processing and calculation of data; (2) fully container SQL server and realize the arrangement and management of underlying computing resources based on kubernetes; (3) additionally built-in standard HDFS distributed file system based on its own distributed storage（ 4) Standard spark is additionally provided as a distributed computing engine at the computing level. The architecture overview is as follows:
SQL Server big data cluster architecture (from Microsoft official document)
It can be seen that SQL Server big data cluster represents the latest architecture idea of Microsoft data platform, from simple and external interconnection to comprehensive integration with open source platform technology; from technology docking and compatibility to you in me and me in you. This is not only a bold attempt, but also an amazing product idea. Its advantages are obvious: from the perspective of enterprise customers, the design of all in one greatly simplifies the architecture, and users can build their own one-stop big data platform based on this, with both open source and commercial technology; from the perspective of Microsoft, it ensures the smooth operation of open source workload in SQL Server Cluster and system, similar to a commercial Hadoop release, which is undoubtedly beneficial It continues to be commercially successful in the open source era.
If you want to experience SQL Server 2019, the easiest way is to first establish an azure kubernetes service (aks) cluster (of course, it also supports other cloud or local k8s clusters), and then deploy SQL Server big data cluster to kubernetes with the help of azdata command line tool. The author has carried out relevant hands-on experiments and architecture observation, and found that SQL Server big data cluster has a strong point in technical implementation, as listed below:
- The control, calculation, storage and other nodes are fully containerized. When deployed, the corresponding image can be automatically downloaded from the Microsoft container registry and run.
- The master instance of big data cluster supports multi node deployment and high availability, which is realized by combining the underlying failover capability provided by k8s with the availability group in SQL server.
- The bottom layer of distributed storage is composed of disks attached to VM cluster, which provides two different choices, data pool and storage pool, corresponding to private and open source technologies respectively. When using, the external table is defined to point to the address under sqldatapool or sqlhdfs protocol for mounting and access. Two different kinds of storage can be used in combination and cooperate with each other.
- Data pool provides SQL Server’s own distributed storage capacity. Generally, it can provide high data loading performance in combination with the round-robin data distribution strategy. In the actual scenario, it can be used as the landing choice for external data access, or as the persistent storage of large query result set.
- The pod corresponding to the storage pool is highly integrated with spark, HDFS datanode and SQL server instances, and provides a complete HDFS file system, which is perfectly compatible with the column storage format of parquet and other open-source systems. It can also mount and use Amazon S3, azure data Lake storage Gen2 and other cloud storage services through the HDFS tiering function. When querying, SQL server can use The information provided by namenode can be read locally at high speed with respect to data locality, and it can also support predicate pushdown in many cases.
- It is of great significance for big data cluster to fully integrate spark operation environment, which means that it can use standard spark technology stack to read and write storage pool and share the same data with SQL Server in place. It is verified that the integrated spark version of this release is 2.4, which is the latest large version.
- Big data cluster automatic installation includes elastic search and kibana components to help monitor the key indicators and health status of each link of the system.
- In terms of tool support, cross platform azure Data Studio can be used to connect SQL Server big data cluster. SQL Server 2019 special plug-in greatly facilitates self-service query, cluster management, external table creation, etc. You can also use the popular jupyter notebook in azure Data Studio to connect to the cluster, and conduct exploratory data analysis and machine learning model training through SQL, python / pyspark or Scala / spark scripts.
Limited to space, more content will not be expanded here. If you are interested in some key details and practical hands-on exercises, you can pay more attention to the follow-up articles of the WeChat public number “cloud pick up” for more information.
In terms of pricing, although SQL Server big data cluster still belongs to the category of commercial database and takes up a large number of CPU cores, users do not need to worry too much about the high expenditure on authorization fees. The SQL Server team has carefully designed a cost-friendly pricing strategy, which is mainly reflected in that in addition to the master instance requiring enterprise or Standard Version authorization, most of the other computer / data / storage nodes only need to be charged in the way of “big data node” which is specially designed and much cheaper, which will greatly reduce the user’s choice of SQL Server big data cluster The cost burden of.
Looking back, although SQL Server big data cluster is a brand new capability, Microsoft may have started the relevant layout early. Because it is easy to find some achievements of previous versions of SQL server, which is exactly the technical premise for the emergence of big data cluster. For example, the POLYBASE technology that has been accumulated for many years mentioned above is the key to seamless interaction between SQL server and big data technology stack; for example, the Linux version that SQL Server 2017 began to introduce is an important foundation for successful realization of containerized encapsulation.
In recent years, Microsoft is gradually getting returns after embracing open source. Embracing open source can not only shorten the distance with the community and users, but also win more design flexibility for the latest technology and product development. The SQL Server big data cluster, which is completely containerized, arranged with kubernetes and integrated with spark, HDFS and other open-source components, is undoubtedly a successful model of this “reform and opening up” and “fetchism” strategy.
Of course, everything has two sides. For the integrated architecture mode such as SQL Server big data cluster, some people in the industry hold different views. They believe that over integration and encapsulation is not necessarily the direction of architecture evolution in the cloud era. They prefer the architecture of computing and storage separation, so that each data component can focus on one thing. This is a matter of different opinions. Perhaps the original intention of SQL Server big data cluster design is to focus on large customers based on local deployment, while attracting enterprise solution providers that are very sensitive to mobility and cross cloud adaptation. For these scenarios, SQL Server big data cluster is a very competitive choice. We believe that the market will give us the final answer.
The world is about to enter a new decade. SQL Server 2019 released at the end of 2019 shows Microsoft’s vision and ambition for the next decade in the new era. In particular, the launch of SQL Server big data cluster will promote the landing of a batch of new big data platforms, inspire the industry to think about the architecture mode of big data in the future, as well as the harmonious coexistence of business technology and open source world.
It is worth mentioning that SQL Server 2019, like SQL Server 2017, has a Linux oriented version and provides official support services with Linux manufacturers. In fact, the feature coverage of SQL server for Linux has been improving silently. The 2019 version brings important features such as data replication, active directory integration, and POLYBASE on Linux to Linux. If you still hold a wait-and-see attitude towards the first Linux version two years ago, SQL Server 2019 has improved its compatibility and function set for Linux. It’s a better SQL server for Linux. Maybe it’s time to “get on the bus”.
Microsoft, which takes the cloud as the first, will consider to gradually synchronize the new capabilities of the new generation version to the PAAS service of azure cloud, in addition to the original SQL Server 2019 of cloud. In fact, azure SQL database has started to support some new features of SQL Server 2019, such as “adopt” count “distinct. You only need to manually set the compatibility level of the database to 150 of the corresponding 2019 version. Another example is POLYBASE, which was only supported by SQL data warehouse on azure (mainly used to access blob storage). In the future, this feature is likely to be updated and enhanced in the cloud, and it is also expected to expand to more data services such as SQL databases or SQL managed instance.
Finally, we briefly summarize the development strategy of SQL Server 2019 as follows: first, we continue to consolidateThe native multi-mode kernel supports multiple data architecture paradigms,Secondly, continuous improvementData virtualization technology POLYBASEIn order to strengthen the external connection, and finally through embracing and integratingOpen source big data technology systemAchieve overall integration. This is a steadily developing and progressive product evolution idea. I don’t know if you, as a user, have been touched? Let’s wish SQL Server 2019 good luck.
“Picking up relics in the cloud” focuses on introducing cloud computing products and technologies from the perspective of users, insists on taking the practical experience as the core output content, and combines the product logic to conduct in-depth interpretation of application scenarios. Welcome to scan the bottom two dimensional code to pay attention to the WeChat public number.