Introduction to the author
Pan Juan, senior DBA of Jingdong digital science, is mainly responsible for the development of database middleware, database operation and maintenance automation platform, and production database operation and maintenance. Participated in the escort work of Jingdong’s 6.18 and 11.11 promotion activities for many times. He was responsible for the design and development of the database automation platform of Jingdong digital science, and now focuses on the development of Apache shardingsphere distributed database middleware. Willing to study and explore in database, automation, distributed, middleware and other related fields.
Security control has always been an important part of governance, and data desensitization belongs to the category of security control. For Internet companies and traditional industries, data security has always been a highly valued and sensitive topic.
Data desensitization refers to the deformation of some sensitive information through desensitization rules to achieve reliable protection of sensitive privacy data. Personal data, such as ID number, mobile phone number, card number, customer number, and other sensitive data, such as ID card number, mobile phone number, card number and customer number, need to be desensitized according to the relevant department regulations.
In the real business scenario, the relevant business development team often needs to implement and maintain a set of encryption and decryption system according to the needs of the company’s security department. When the desensitization scenario changes, the self maintained desensitization system often faces the risk of reconstruction or modification. In addition, for the online business, how to achieve seamless desensitization transformation transparently, safely and with low risk without modifying business logic and business SQL?
Apache shardingsphere provides a complete, safe, transparent and low-cost data desensitization integration solution according to the industry’s desensitization requirements and business transformation pain points.
Apache shardingsphere is an ecosystem composed of an open source distributed database middleware solution. It is composed of sharding JDBC, sharding proxy and sharding sidecar (under planning), which are independent of each other but can be deployed and used together. They can provide standardized data fragmentation, distributed transaction and distributed governance functions, and can be applied to various application scenarios such as Java isomorphism, heterogeneous language, container, cloud native, etc.
Data desensitization module is a sub function module of shardingsphere distributed governance. It parses the SQL input by the user, rewrites the SQL according to the desensitization configuration provided by the user, so as to encrypt the original data, and store the original data (optional) and ciphertext data in the underlying database at the same time. When the user queries the data, it takes out the encrypted data from the database, decrypts it, and finally returns the decrypted original data to the user.
Apache shardingsphere distributed database middleware automates and transparently the process of data desensitization, so that users don’t need to pay attention to the details of data desensitization, and use desensitized data like ordinary data. In addition, shardingsphere can provide a relatively complete set of solutions for desensitization transformation of existing online business or desensitization function of new online business.
Demand scenario analysis
The demand for data desensitization can be divided into two situations in real business scenarios
New business launchedThe security department stipulates that the user sensitive information, such as bank and mobile phone number, should be encrypted and stored in the database, and then decrypted when used. Because it is a new system, so there is no stock data cleaning problem, so the implementation is relatively simple.
Online businessBefore, plaintext has been stored in the database. Relevant departments suddenly need to desensitize and rectify the online business. This kind of scenario generally needs to deal with three problems:
- How to desensitize the historical data, that is, how to wash the data;
- How to desensitize the new data and store it in the database without changing the business SQL and logic;
- How to realize the migration of business system between plaintext and ciphertext data safely, seamlessly and transparently.
The encryption JDBC provided by shardingsphere is deployed together with the business code. The business side needs to carry out JDBC programming for encrypt JDBC. Because encrypt JDBC implements all JDBC standard interfaces, the business code can be compatible without additional modification. At this point, encryption JDBC is responsible for all the interaction between the business code and the database. Business only needs to provide desensitization rules. As a bridge between business code and underlying database, encrypt JDBC can intercept user behavior and interact with database after modifying behavior.
Encrypt JDBC intercepts the SQL initiated by the user, parses and understands the SQL behavior through the SQL syntax parser, and then finds out the fields that need to be desensitized and the encryption and decryption device used according to the desensitization rules passed in by the user, encrypts and decrypts the target fields, and then interacts with the underlying database. Shardingsphere will encrypt the plaintext requested by the user and store it in the underlying database; when the user queries, it will extract the ciphertext from the database and decrypt it and return it to the end user. Shardingsphere shields the desensitization of data, so that users do not need to perceive the process of SQL parsing, data encryption and data decryption, just like using ordinary data.
Before explaining the whole process in detail, we need to understand the desensitization rules and configuration, which is the basis of understanding the whole process. Desensitization configuration is mainly divided into four parts: data source configuration, encryptor configuration, desensitization table configuration and query attribute configuration
- Data source configuration: refers to the configuration of datasource.
- Encryptor configuration: refers to the encryption strategy used for encryption and decryption. At present, shardingsphere has built-in two encryption and decryption strategies: AES / MD5. Users can also implement a set of encryption and decryption algorithms by implementing the interface provided by shardingsphere.
- Desensitization table configuration: it is used to tell shardingsphere which column is used to store cipher column, which column is used to store plain column, and which column users want to use for SQL writing.
- Configuration of query attributes: when plaintext data and ciphertext data are stored in the underlying database table at the same time, the attribute switch is used to decide whether to query plaintext data in the database table directly for return or query ciphertext data for return after decryption by encrypt JDBC.
How to understand which column users want to use for SQL writing (logiccolumn)?
We can understand it from the meaning of the existence of encrypt JDBC. The ultimate goal of encrypt JDBC is to shield the underlying desensitization of data, that is to say, we do not want users to know how the data is encrypted and decrypted, how to store plaintext data in plaincolumn, and how to store ciphertext data in ciphercolumn. In other words, we don’t want users to know the existence and use of plancolumn and ciphercolumn. Therefore, we need to provide users with a conceptual column, which can be separated from the real column of the underlying database. It can be a real column or not in the database table, so that users can change the column names of plancolumn and ciphercolumn of the underlying database at will. Or delete the plaincolumn and choose never to store plaintext but only ciphertext. As long as the user’s SQL is written for this logical column, and the correct mapping relationship between logiccolumn, plancolumn and ciphercolumn is given in desensitization rules.
Why do you do this? The answer is at the end of the article, that is, in order to make the online business seamlessly, transparently and safely carry out data desensitization migration.
For example, if there is a table in the database called t_ User, there are actually two fields PWD in this table_ Plain, used to store plaintext data and PWD_ Cipher is used to store ciphertext data, and defines the logiccolumn as PWD. Then, users should write SQL for logiccolumn, that is, insert into t_ user SET pwd = ‘123’。 Shardingsphere receives the SQL, and through the desensitization configuration provided by the user, it finds that pwd is a logiccolumn, so it desensitizes the logical column and its corresponding plaintext data. It can be seen that shardingsphere transforms the column name and data desensitization mapping between the user oriented logical column and the plaintext column and ciphertext column of the underlying database. As shown in the figure below:
This is the core meaning of encrypt JDBC, that is, according to the desensitization rules provided by users, the user SQL is separated from the underlying data table structure, so that the user’s SQL writing is no longer dependent on the real database table structure. The connection, mapping and transformation between users and underlying database are handled by shardingsphere. Why are we doing this? Or that sentence: in order to make the online business seamless, transparent and secure data desensitization migration.
In order to make readers more clearly understand the core processing flow of encrypt JDBC, the following picture shows the processing flow and conversion logic when using encrypt JDBC to add, delete, modify and query, as shown in the following figure:
Apache shardingsphere provides a complete set of desensitization solutions for new business online and old business transformation. The implementation principle and design idea are explained above. So how to combine desensitization rules, desensitization process with real business scenarios? The following will combine the specific implementation with the actual scene, and provide a detailed introduction of the desensitization transformation of the online business and the desensitization use of the new online business.
After understanding the desensitization process of shardingsphere, the desensitization configuration and process can be combined with the actual scene. All the design and development are to solve the pain points of business scenarios. So how to use shardingsphere to meet the business requirements mentioned before?
New online business
Business scenario analysis: new online business is relatively simple because everything starts from scratch and there is no problem of historical data cleaning.
Solution Description: after selecting an appropriate encryptor, such as AES, you only need to configure the logical column (write SQL for users) and ciphertext column (store ciphertext data in data table). The logical column and ciphertext column can be the same or different. The recommended configuration is as follows (displayed in yaml format)
encryptRule: encryptors: aes_encryptor: type: aes props: aes.key.value: 123456abc tables: t_user: columns: pwd: cipherColumn: pwd encryptor: aes_encryptor
With this configuration, encrypt JDBC only needs to convert logiccolumn and ciphercolumn. The underlying data table does not store plaintext, but only ciphertext, which is also the requirement of security audit. If users want to store plaintext and ciphertext together in the database, they just need to add the plaincolumn configuration. The overall treatment process is shown in the figure below
Online business transformation
Business scenario analysis: as the business has been running online, there must be a large number of plaintext historical data in the database. Now the problem is how to encrypt and clean the historical data, how to encrypt the incremental data, and how to make the business seamless and transparent migration between the new and old data systems.
Solution description: before providing solutions, let’s brainstorm: first of all, since the old business needs to be desensitized, it must store very important and sensitive information. The information is valuable and the business foundation is relatively important. If there’s a mistake, we’ll see you again. Therefore, it is impossible to stop business as soon as it comes up, prohibit new data writing, find an encryptor to encrypt and clean all historical data, and then deploy the previously reconstructed code online, so that it can encrypt and decrypt the stock and incremental data online. It must be cool to talk about such a simple and crude way according to historical experience.
As like as two peas, the other way is to rebuild a set of pre production environment that is exactly the same as the production environment, and then store the original data of the production environment through the relevant migration and cleaning tools to store the original data, then add the new data to the database of the preprocessing environment through the MySQL master slave replica system and the business development tools. The code that can be encrypted and decrypted after construction is deployed to the pre delivery environment. In this way, the production environment is a query modification environment with plaintext as the core, and the pre sending environment is a query modification environment with ciphertext as the core. After comparison for a period of time, the production flow can be switched to the pre delivery environment by night operation.
This scheme is relatively safe and reliable, but the time, manpower, capital and cost are relatively high, mainly including: pre release environment construction, production code rectification, development of relevant auxiliary tools, etc. Unless there is no way to go, business developers usually go from getting started to giving up.
What business developers want most is to reduce the cost of capital, do not modify the business code, and migrate the system safely and smoothly. Therefore, the desensitization function module of shardingsphere was applied. It can be divided into three steps
1) Before system migration
Suppose that the system needs to_ The PWD field of the user is desensitized, and the business side uses encrypt JDBC to replace the standardized JDBC interface, which basically does not need additional modification (we also provide access methods such as springboot, springnamespace, yaml, etc., to meet the needs of different business sides). In addition, a set of desensitization configuration rules is provided as follows:
encryptRule: encryptors: aes_encryptor: type: aes props: aes.key.value: 123456abc tables: t_user: columns: pwd: plainColumn: pwd cipherColumn: pwd_cipher encryptor: aes_encryptor props: query.with.cipher.column: false
According to the above desensitization rules, the first step is to set the desensitization rules in the database table t_ A new field in user is called PWD_ Cipher, that is, ciphercolumn, is used to store ciphertext data. At the same time, we set plancolumn to PWD to store plaintext data, and logiccolumn to PWD.
Because the previous code SQL is written with PWD, that is, SQL is written for logical columns, so the business code does not need to be changed. Through encrypt JDBC, the plaintext will be written to PWD column for the new data, and the plaintext will be encrypted and stored in PWD at the same time_ Cipher column. At this time, due to query.with.cipher . column is set to false. For business applications, the plaintext column pwd is still used for query storage, but it is stored in the underlying database table PWD_ Ciphertext data of new data is additionally stored in cipher, and its processing flow is shown in the following figure:
When the new data is inserted, it is encrypted as ciphertext data through encrypt JDBC and stored in ciphercolumn. Now we need to deal with historical data. Because Apache shardingsphere does not provide relevant migration tools at present, the business party needs to encrypt the plaintext data in PWD and store it in PWD_ cipher。
2) System migration in progress
The newly added data has been stored in the ciphertext column and the plaintext in the plaintext column by encrypt JDBC; after the historical data is encrypted and cleaned by the business party, the ciphertext is also stored in the ciphertext column. That is to say, both plaintext and ciphertext are stored in the current database, only because of the query.with.cipher . column = false, so ciphertext has never been used.
Now, in order to enable the system to query ciphertext data, we need to change the query.with.cipher . column is set to true. After restarting the system, we found that the business of the system is normal, but encrypt JDBC has begun to extract the data of the ciphertext column from the database, decrypt it and return it to the user; for the user’s needs of adding, deleting and modifying, the original data will still be stored in the plaintext column, and the encrypted data will be stored in the ciphertext column.
Although the business system takes out the data in the ciphertext column and decrypts it and returns it, it still stores a copy of the original data in the plaintext column when storing it. Why? The answer is: to be able to roll back the system. As long as ciphertext and plaintext always exist at the same time, we can switch business query to ciphercolumn or plaincolumn freely through switch configuration. That is to say, if the system is switched to the ciphertext column for query, it is found that the system reports an error and needs to be rolled back. So just put query.with.cipher . column = false, encrypt JDBC will be restored, that is, plancolumn will be used to query again. The processing flow is shown in the figure below
3) After system migration
Due to the requirements of the security audit department, it is generally impossible for the business system to keep the plaintext column and ciphertext column of the database permanently synchronously. We need to delete the plaintext column data after the system is stable. That is, we need to delete the plancolumn, PWD, after the system is migrated. That’s the problem. Now the business code is written in SQL for PWD. The PWD in the underlying data table is deleted and replaced by PWD_ Cipher decrypts the original data, doesn’t it mean that the business side needs to rectify all SQL so as not to use the PWD column that will be deleted? Remember the core of our encryption JDBC?
This is also the core meaning of encrypt JDBC, that is, according to the desensitization rules provided by users, the user SQL is separated from the underlying database table structure, so that the user’s SQL writing is no longer dependent on the real database table structure. The connection, mapping and transformation between users and underlying database are handled by shardingsphere.
Yes, because of the existence of logiccolumn, users write SQL for this virtual column, and encrypt JDBC can map this logical column to the ciphertext column in the underlying data table. So the desensitization configuration after migration is as follows:
encryptRule: encryptors: aes_encryptor: type: aes props: aes.key.value: 123456abc tables: t_user: columns: PWD: # PWD and PWD_ Transformation mapping of cipher cipherColumn: pwd_cipher encryptor: aes_encryptor props: query.with.cipher.column: true
The treatment process is as follows:
So far, the online business desensitization rectification solutions are all described. We provide Java, yaml, springboot and springnamespace for users to choose to access, and strive to meet the different access needs of services. At present, the solution has been continuously launched in Jingdong Digital Technology Co., Ltd., providing internal basic service support.
Middleware desensitization service advantages
- Automatic & transparent data desensitization process, users do not need to pay attention to the details of desensitization;
- Provide a variety of built-in, third party (aks) desensitization strategies, users only need simple configuration to use;
- Provide desensitization strategy API interface, users can implement the interface, so as to use user-defined desensitization strategy for data desensitization;
- Support switching different desensitization strategies;
- For the online business, it can realize the synchronous storage of plaintext data and ciphertext data, and decide whether to use plaintext column or ciphertext column for query through configuration. Without changing the business query SQL, the online system can safely and transparently migrate the data before and after encryption.
Applicable scenario description
- User projects are programmed in Java language;
- The back-end databases are mysql, Oracle, PostgreSQL and SQL server;
- Users need to desensitize one or more columns in the database table (data encryption & decryption);
- Compatible with all common SQL.
- Users need to process the original stock data and wash quantity in the database by themselves;
- Use desensitization function + sub database and sub table function, some special SQL does not support, please refer to SQL usage specification;
- Desensitization fields cannot support comparison operations, such as greater than or less than, order by, between, like, etc;
- Desensitization fields cannot support calculation operations, such as AVG, sum, and calculation expressions.
This paper introduces how to use encrypt JDBC to access. The access form can also choose to use spring boot, spring namespace and so on. This form of access end is mainly oriented to Java isomorphism, and deployed in the production environment together with business code. For heterogeneous languages, shardingsphere also provides an encrypt proxy client. Encrypt proxy is a server-side product that implements the binary protocol of MySQL and PostgreSQL. Users can independently deploy the encrypt proxy service, and access the desensitized virtual database server by using the third-party database management tools such as Navicat, Java connection pool and command line, just like using ordinary MySQL and PostgreSQL databases.
Desensitization function belongs to the function category of Apache shardingsphere distributed governance. In fact, the Apache shardingsphere ecosystem has other capabilities, such as data fragmentation, read-write separation, distributed transactions, monitoring and governance, etc. You can even choose any variety of functional modules for superposition, such as data desensitization + data fragmentation, or data fragmentation + read-write separation, or monitoring governance + data fragmentation, etc. In addition to the overlay selection at the functional level, shardingsphere also provides various forms of access end, such as sharding JDBC or sharding proxy, to meet the needs of different scenarios.
Write at the end
Shardingsphere only supports database and table splitting at the beginning, and now it has formed an ecosystem with core functions such as data fragmentation, distributed governance and distributed transaction. This also indicates that it is not only a distributed database middleware, not only has the ability of database and table splitting, but also forms a comprehensive solution ecosystem with data fragmentation, distributed governance and distributed transactions as the core. You are welcome to learn more about it on the official website and pay attention to us in GitHub.
Official website: https://shardingsphere.apache…