Data desensitization: a real solution to data warehouse security and privacy protection

Time:2021-4-21

Abstract:How to strengthen the data security and privacy protection at the technical level, and put forward more functional requirements for the data warehouse products themselves, is also the most effective way of data security construction.

This article is shared from Huawei cloud community《Gaussdb (DWS) security: a real way of privacy protection — data desensitization》Author: Wo Hua dada.

introduction

The advent of the era of big data has subverted the operation mode of traditional formats and stimulated new production potential. Data has become an important factor of production and the carrier of information. The flow of data also hides higher-order value information. For data controllers and data processors, how to maximize the value of data flow is the original intention and significance of data mining. However, with the exposure of a series of information leakage incidents, data security has attracted more and more attention.

Countries and regions gradually establish and improve data security and privacy protection related laws and regulations to provide legal protection for user privacy protection. How to strengthen the data security and privacy protection at the technical level, and put forward more functional requirements for the data warehouse products themselves, is also the most effective way of data security construction.

What is data desensitization?

Mobile phone name Data Masking, as the name implies, is to shield sensitive data, and to transform some data to some sensitive information such as ID number, cell phone number, card number, customer name, customer address, e-mail address, salary, etc., through the desensitization rule, and realize the reliable protection of privacy data. Common desensitization rules in the industry include replacement, rearrangement, encryption, truncation and masking. Users can also customize desensitization rules according to the desired desensitization algorithm.
Data desensitization: a real solution to data warehouse security and privacy protection

Generally, a good implementation of data desensitization needs to follow the following two principles: first, keep the meaningful information before desensitization as much as possible for the application after desensitization; second, prevent hackers from cracking to the greatest extent.

Data desensitization is divided into static data desensitization and dynamic data desensitization. Desensitization of static data is the “moving and simulation replacement” of data. It is the process of desensitization of data extraction, which is sent to the downstream link for random access, reading and writing. After desensitization, the data is isolated from the production environment to meet the business needs and ensure the safety of the production database. Dynamic data desensitization, which can access sensitive data in real time, can implement different desensitization schemes for different roles, different permissions and different data types, so as to ensure that the returned data is available and safe.

The data desensitization function of gaussdb (DWS) abandons the high dependence and high cost of desensitization in business application layer, internalizes data desensitization into the security capability of database products, and provides a complete, safe, flexible, transparent and friendly data desensitization solution, which belongs to dynamic data desensitization. After the user identifies the sensitive field, the desensitization strategy can be created by binding the built-in desensitization function based on the target field. The redaction policy corresponds to the table object one by one. A desensitization strategy consists of three key elements: table object, effective condition and desensitization column desensitization function pair. It is the set of all desensitization columns on the table object. Different fields can use different desensitization functions according to data characteristics. When and only when the effective condition is true, the query statement will trigger the desensitization of sensitive data, and the desensitization process is built-in in the SQL Engine, which is transparent to the users of the generation environment.
Data desensitization: a real solution to data warehouse security and privacy protection

How to desensitize with data?

Dynamic data desensitization is to realize real-time desensitization processing according to whether the effective conditions are met in the process of query statement execution. The effective condition is usually the judgment of the current user role. The visible range of sensitive data is preset for different users. System administrator, with the highest authority, can see any field of any table at any time. Determining the role of restricted users is the first step in creating desensitization strategy.

Mobile phone name and ID number, mobile phone number, e-mail address and so on are sensitive to the sensitive information of the user’s individual, including the name, ID card number, mobile phone number and e-mail address. In the banking system, the customer may also involve bank card number, expiration date, payment password and so on. The medical system, as a patient, may also involve medical information and so on. Therefore, identifying and combing the sensitive fields of specific business scenarios is the second step of creating desensitization strategy.

A series of common desensitization function interfaces are built in the product, which can specify parameters according to different data types and data characteristics, so as to achieve different desensitization effects. Desensitization function can use the following three built-in interfaces, and support user-defined desensitization function. The three built-in desensitization functions can cover the desensitization effect of most scenes, and the user-defined desensitization function is not recommended.

  • MASK_ None: no desensitization treatment, only for internal test.
  • MASK_ Full: complete desensitization to a fixed value.
  • MASK_ Partial: use the specified desensitization character to partially desensitize the content within the desensitization range.

Different desensitization column can use different desensitization function. For example, the mobile phone number usually displays the last four digits of the tail number, and the front one uses the “*” Replacement; The amount is uniformly displayed as a fixed value of 0, and so on. The third step of creating desensitization strategy is to determine the desensitization function to be bound to desensitization column.

Taking EMP, Alice, matu and July as examples, this paper briefly introduces the process of data desensitization. Among them, EMP contains the employee’s name, mobile phone number, email address, paycard number, salary and other privacy data. Alice is the human resource manager, and matu and July are ordinary employees.

Assume that the table, the user, and the user’s permission to view the table emp are all in place.

  • (1) Create desensitization strategy mask_ EMP, only Alice is allowed to view all employee information, but matu and July are not visible to pay card number and salary. Field card_ No is a numeric type, using mask_ Full desensitization to a fixed value of 0; field card_ String is a character type, using mask_ Partial desensitizes the original data according to the specified input / output format; the field salary is a numeric type, and uses the digit 9 to desensitize all the digits before the penultimate digit.
postgres=# CREATE REDACTION POLICY mask_emp ON emp WHEN (current_user != 'alice')
ADD COLUMN card_no WITH mask_full(card_no),
ADD COLUMN card_string WITH mask_partial(card_string, 'VVVVFVVVVFVVVVFVVVV','VVVV-VVVV-VVVV-VVVV','#',1,12), 
ADD COLUMN salary WITH mask_partial(salary, '9', 1, length(salary) - 2);

Switch to matu and July to view EMP.

postgres=> SET ROLE matu PASSWORD '[email protected]';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday       
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 |       0 | ####-####-####-1234 | [email protected]      | 99999.9990 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 |       0 | ####-####-####-3456 | [email protected]    |  9999.9990 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |         |                     | [email protected] |            | 1992-11-06 00:00:00
(3 rows)
postgres=> SET ROLE july PASSWORD '[email protected]';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday       
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 |       0 | ####-####-####-1234 | [email protected]      | 99999.9990 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 |       0 | ####-####-####-3456 | [email protected]    |  9999.9990 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |         |                     | [email protected] |            | 1992-11-06 00:00:00
(3 rows)
  • (2) Due to the work adjustment, matu entered the human resources department to participate in the recruitment of the company, and all the information of the employees was visible, so the effective conditions of the strategy were modified.
postgres=> ALTER REDACTION POLICY mask_emp ON emp WHEN(current_user NOT IN ('alice', 'matu'));

Switch to users matu and July and re view the EMP.

postgres=> SET ROLE matu PASSWORD '[email protected]';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   |     card_no      |     card_string     |        email         |   salary   |      birthday       
----+------+-------------+------------------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 | 1234123412341234 | 1234-1234-1234-1234 | [email protected]      | 10000.0000 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 | 3456345634563456 | 3456-3456-3456-3456 | [email protected]    |  9999.9900 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |                  |                     | [email protected] |            | 1992-11-06 00:00:00
(3 rows)
postgres=> SET ROLE july PASSWORD '[email protected]';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday       
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 |       0 | ####-####-####-1234 | [email protected]      | 99999.9990 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 |       0 | ####-####-####-3456 | [email protected]    |  9999.9990 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |         |                     | [email protected] |            | 1992-11-06 00:00:00
(3 rows)
  • (3) Employee information phone_ No, email and birthday are also private data, update desensitization policy mask_ EMP, three desensitization columns are added.
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN phone_no WITH mask_partial(phone_no, '*', 4);
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN email WITH mask_partial(email, '*', 1, position('@' in email));
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN birthday WITH mask_full(birthday);

Switch to user July and view employee table emp.

postgres=> SET ROLE july PASSWORD '[email protected]';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday       
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 134******** |       0 | ####-####-####-1234 | ********163.com      | 99999.9990 | 1970-01-01 00:00:00
  2 | bob  | 182******** |       0 | ####-####-####-3456 | ***********qq.com    |  9999.9990 | 1970-01-01 00:00:00
  3 | cici | 155******** |         |                     | ************sina.com |            | 1970-01-01 00:00:00
(3 rows)
  • (4) Considering the friendliness of user interaction, gaussdb (DWS) provides system view redaction_ Policies and redaction_ Columns, convenient for users to view more desensitization information directly.
postgres=> SELECT * FROM redaction_policies;
 object_schema | object_owner | object_name | policy_name |            expression             | enable | policy_description 
---------------+--------------+-------------+-------------+-----------------------------------+--------+--------------------
 public        | alice        | emp         | mask_emp    | ("current_user"() = 'july'::name) | t      | 
(1 row)
postgres=> SELECT object_name, column_name, function_info FROM redaction_columns;
 object_name | column_name |                                             function_info                                             
-------------+-------------+-------------------------------------------------------------------------------------------------------
 emp         | card_no     | mask_full(card_no)
 emp         | card_string | mask_partial(card_string, 'VVVVFVVVVFVVVVFVVVV'::text, 'VVVV-VVVV-VVVV-VVVV'::text, '#'::text, 1, 12)
 emp         | email       | mask_partial(email, '*'::text, 1, "position"(email, '@'::text))
 emp         | salary      | mask_partial(salary, '9'::text, 1, (length((salary)::text) - 2))
 emp         | birthday    | mask_full(birthday)
 emp         | phone_no    | mask_partial(phone_no, '*'::text, 4)
(6 rows)
  • (5) Suddenly one day, when employee information can be shared within the company, the desensitization strategy mask of EMP will be deleted directly_ EMP.
postgres=> DROP REDACTION POLICY mask_emp ON emp;

For more usage details, please refer to gaussdb ( DWS ) eight . one . 1 product documentation.

The secret behind the realization of data desensitization

Data desensitization: a real solution to data warehouse security and privacy protection

Gaussdb (DWS) data desensitization function, based on the existing implementation framework of SQL Engine, realizes real-time desensitization processing without external perception in the process of restricted users executing query statements. The internal implementation is shown in the figure above. We regard redaction policy as a binding rule on the table object, and traverse query in the query rewriting stage of optimizer If each targetentry of the targetlist in the tree involves a desensitization column of the base table, and the current desensitization rule is in effect (that is, it meets the effective condition of the desensitization policy and is enabled), it is concluded that the targetentry involves the VaR object to be desensitized. At this time, traverse the desensitization column system table PG_ redaction_ Column, find the desensitization function bound to the corresponding desensitization column, and replace it with the corresponding funcexpr. After the above rewriting of query tree, the optimizer will automatically generate a new execution plan, the executor will execute according to the new plan, and the query result will desensitize the sensitive data.

Compared with the original statement, the statement execution with data desensitization increases the logic processing of data desensitization, which will inevitably bring extra cost to the query. This part of the cost is mainly affected by the data size of the table, the number of desensitization columns involved in the query target column, and the desensitization function used by the desensitization column.

For simple query statements, take TPCH table customer as an example to test the above factors, as shown in the figure below.
Data desensitization: a real solution to data warehouse security and privacy protection

chart ( a )、( b ) According to the type and characteristics of the fields, the Chinese base table customer adopts mask_ Full desensitization function, there are also using mask_ Partial desensitization function. MASK_ Full for any length and type of raw data, only desensitization into a fixed value, so the output results compared with the original data, great difference. chart ( a ) It shows the execution time of simple query in desensitization and non desensitization scenarios under different data scales. The solid icon is non desensitization scene, and the hollow icon is restricted user, which is desensitization scene.

It can be seen that the larger the data scale, the greater the difference between the query time with desensitization and the original statement. chart ( b ) It shows the effect of the number of desensitized columns on the performance of statement execution under 10x data scale. When one desensitization column is involved, the query with desensitization is slower than the original statement. It is found that this column uses mask_ Partial desensitization function, the query result only changes the format of the result, the length of the result content does not change, in line with the theoretical conjecture that “the execution of statements with desensitization will have corresponding performance degradation”. As the number of desensitized columns increases, we find a strange phenomenon that desensitization scenario is faster than the original statement. Further tracing the desensitization function associated with desensitization column in multi column scenario, it is found that it is precisely because of the use of mask_ The desensitization column of full desensitization function results in a lot of time saving in the output result set compared with the original data, so the simple query with data desensitization under multi column query can speed up a lot.

In order to support the above conjecture, we adjust the desensitization function and use mask for all desensitization columns_ Partial desensitizes the original data partially, so that the external readability of the original data can be retained in the desensitization results. Therefore, as shown in figure (c), when desensitization columns are all associated with partial desensitization function, the statement with data desensitization is about 10% worse than the original statement. Theoretically, this kind of deterioration is acceptable. The above tests are only for simple query statements. When the statements are complex enough to have aggregate functions or complex expression operations, the performance degradation may be more obvious.

summary

The data desensitization function of gaussdb (DWS) product is an important technical breakthrough in internalizing database products and consolidating data security capability, mainly covering the following three aspects:

  1. A simple and easy-to-use data desensitization strategy syntax;
  2. A series of built-in desensitization functions with flexible configuration can cover the desensitization effect of common privacy data;
  3. A complete and convenient desensitization strategy application scheme enables the original statement to realize desensitization in real time, transparently and efficiently.

In a word, this data desensitization function can fully meet the data desensitization demands of customer business scenarios, support the desensitization effect of common privacy data, and realize the reliable protection of sensitive data.

Click follow to learn about Huawei’s new cloud technology for the first time~

Recommended Today

Review of SQL Sever basic command

catalogue preface Installation of virtual machine Commands and operations Basic command syntax Case sensitive SQL keyword and function name Column and Index Names alias Too long to see? Space Database connection Connection of SSMS Connection of command line Database operation establish delete constraint integrity constraint Common constraints NOT NULL UNIQUE PRIMARY KEY FOREIGN KEY DEFAULT […]