Federated learning, part one

Time:2021-9-2

The concept of federated learning

`In short, federated learning means jointly training on data from different sources to get a better model`
`During machine learning, each participant can build a joint model with the help of the other parties' data`
`The parties do not need to share data resources. The data never leaves its local site while they train jointly and build a shared machine learning model`

The significance of federated learning

`Federated learning is a machine learning method that protects data and solves the data island problem`
`Besides data islands, there may also be data privacy and security issues`

Classification of federated learning

  • Enterprise data island

[Figures: two alternative classification diagrams]

Application scenarios of federated learning


Horizontal federated learning system for computer vision

`All parties share the same object detection task`
`The data sets labeled by each organization are different`
`Organizations collaborate without exchanging data`

Horizontal federated learning system for natural language processing: the out-of-vocabulary (OOV) word generation problem

`Each user device stores only a limited-size vocabulary (a data island)`
`OOV word generation involves sensitive user content (user data protection)`
`Mobile devices cooperate without exchanging data`

Federated learning system for auto insurance pricing

Business implementation process of federated learning products


Advance notice: this article introduces implementation methods commonly used in the federated learning industry and does not involve any company's confidential information

Note: most of the processes described below have already been implemented; a few are still at the technical research stage

The complete process of a federated learning product

Brief description of the concepts

  • Sample fusion
`Take the intersection of both parties' data`
`a. Initiator sample A, for example the users' financial information`
`b. Partner sample B, for example the users' logistics information`
`c. Samples A and B share a common field, uid (the user number)`
`That is, the user number can be looked up in sample A to obtain a user's financial information`
`and in sample B to obtain the same user's logistics information`
`d. Compute the intersection by uid`
`That is, find the users who have both financial information and logistics information`
`The initiator's and the partner's data stay independent, however: neither side sees the other's data`
`e. After sample fusion, the initiator holds the fused financial data`
`and the partner holds the fused logistics data`
`But the partner will not have the financial information`
`and the initiator will not have the logistics information`
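
The outcome of sample fusion can be illustrated with a minimal plaintext sketch. It only shows what each side ends up holding; it is not how the product does it, since computing the uid intersection in the clear would expose each side's user list (the encrypted alignment described later in the article addresses that). The function name `fuse_samples` and the dictionary layout are assumptions made for the example.

```python
def fuse_samples(sample_a, sample_b):
    """Keep only the rows whose uid appears in both samples.

    sample_a and sample_b map uid -> that party's own fields.
    Each party keeps only its own columns for the common uids.
    (Hypothetical helper, plaintext illustration only.)
    """
    common_uids = sorted(sample_a.keys() & sample_b.keys())
    fused_a = {uid: sample_a[uid] for uid in common_uids}
    fused_b = {uid: sample_b[uid] for uid in common_uids}
    return fused_a, fused_b


if __name__ == "__main__":
    finance = {"u1": {"income": 10}, "u2": {"income": 20}, "u3": {"income": 30}}
    logistics = {"u2": {"parcels": 5}, "u3": {"parcels": 7}, "u4": {"parcels": 1}}
    initiator_view, partner_view = fuse_samples(finance, logistics)
    print(initiator_view)  # financial fields only, for the common users
    print(partner_view)    # logistics fields only, for the common users
```

Note that the two returned views have the same uids in the same order, which is what later training steps rely on.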


  • Data preprocessing / exploratory analysis

Improve the quality of data and deal with outliers

  • Feature binning

Calculate the IV (information value) of every feature and select the features with a high IV; the higher the IV, the more useful the feature is to the model
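
As a rough illustration of the IV indicator, the sketch below bins one feature with caller-supplied bin edges and applies the usual formula IV = Σ (good_i/G − bad_i/B) · ln((good_i/G)/(bad_i/B)). The function name, the choice of fixed edges, and the `eps` smoothing for empty bins are assumptions for the example; the article does not say which binning algorithm or smoothing the product uses.

```python
import math
from bisect import bisect_right


def information_value(values, labels, edges, eps=0.5):
    """IV of one feature for binary labels (1 = bad, 0 = good).

    edges are sorted bin boundaries; a value v falls in bin
    bisect_right(edges, v). eps is additive smoothing so that an
    empty bin does not produce log(0) (an assumption of this sketch).
    """
    n_bins = len(edges) + 1
    good = [0] * n_bins
    bad = [0] * n_bins
    for v, y in zip(values, labels):
        b = bisect_right(edges, v)
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1

    total_good = sum(good) + eps * n_bins
    total_bad = sum(bad) + eps * n_bins
    iv = 0.0
    for g, b in zip(good, bad):
        p_good = (g + eps) / total_good
        p_bad = (b + eps) / total_bad
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    informative = [1, 2, 3, 4, 6, 7, 8, 9]   # separates the labels at 5
    noisy = [1, 6, 2, 7, 3, 8, 4, 9]         # same spread in both classes
    print(information_value(informative, labels, edges=[5]))
    print(information_value(noisy, labels, edges=[5]))
```

A feature whose bins have the same share of good and bad samples contributes nothing, so its IV is zero; a feature that separates the classes gets a large IV.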

  • PCA (principal component analysis)

Compute the principal component analysis indicators of the features

  • Model training

Train on the sample data of both parties

The sample data is split into a training set and a validation set

  • Train with the training set
  • Validate with the validation set

    Both sides adjust their training parameters according to the metrics of their respective models, until the evaluation metrics (KS, AUC and loss) of both sides' models are good
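
For readers unfamiliar with the KS and AUC metrics mentioned above, here is a small, unoptimized sketch of how they are commonly computed from model scores and binary labels. KS is taken as the largest gap between the true-positive rate and the false-positive rate over all thresholds, and AUC as the fraction of positive/negative pairs ranked correctly (ties count one half). The function names are made up for the example, and this is not claimed to be the product's implementation.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks(scores, labels):
    """Maximum |TPR - FPR| over all score thresholds."""
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = len(labels) - pos_total
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        best = max(best, abs(tp / pos_total - fp / neg_total))
    return best


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    labels = [1, 1, 0, 1, 0, 0]
    print(auc(scores, labels), ks(scores, labels))
```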

Specific implementation of each stage

Only the front-end part (web and Java) is covered here; the back-end part (Python) will be covered in later articles

  • Data preparation stage

  • Enter node information
  • Enter partner node
  • Node information includes:
  • Notes

    a. There can be only one current node, but there can be multiple partner nodes

    b. The node number is the MD5 hash of the node name, and it must be the same on both sides

    c. The public/private key pair is used to securely encrypt and decrypt the data exchanged between Java and Python

    d. The request URL paths for Java-to-Java, Python-to-Python and Java-to-Python interaction are taken from the corresponding node information

  • Data upload methods

    a. Upload data as a CSV file (small amount of data)

    There are many ways to implement this. Here are two

    a-1. Upload to FastDFS, a distributed disk storage system

    a-2. Upload to Alluxio, a distributed in-memory file system

    b. Upload via a database script (small amount of data)

    c. Import into ClickHouse from Hive or MySQL (large amount of data)

    d. Use database and table sharding (large amount of data)

    Here "large" means on the order of a billion records. For example, both sides have 1 billion records

    Methods c and d will be introduced in later articles

  • New data sets are divided into training sets and prediction sets
  • Create your own node and multiple partner nodes

    Initiator node name, initiator Java service root path, initiator Python service root path, initiator public/private key pair

    Partner name, partner public key, partner Java service root path, partner Python service root path

  • Create project information

    It includes: two nodes (the current node and the partner node), multiple training sample data sets and multiple test sample data sets

  • The initiator creates task information

    A project can have multiple tasks

    Task information includes: the task name and the initiator node's training sample data

  • The partner assists with the task
  • Data fusion

[Figure: product prototype of data fusion]

  • Data preprocessing

    The front-end operation flow of exploratory analysis is similar

  • The user triggers data preprocessing a second time

    The front-end operation flow of exploratory analysis is similar

[Figure: product prototype of data preprocessing]

  • Feature engineering

  • Feature binning

[Figure: parameter selection and definitions for the feature binning algorithm]

[Figure: IV values from feature binning]

  • PCA (principal component analysis)

    The front-end operation flow is similar to that of feature binning

  • Model training

[Figure: selecting the model training algorithm]

[Figure: model training process]

[Figure: model evaluation metrics]

  • Model prediction

Introduction to some technical points and principles

Two implementations of data alignment

Fundamentals of cryptography

Oblivious transfer (OT)

Concept
`Oblivious transfer (OT) is a cryptographic protocol`
`In this protocol, the sender transfers one of several candidate messages to the receiver`
`but afterwards still does not know which message was transferred`
`This protocol is also called the oblivious transmission protocol`
Examples
  • 1-out-of-2

  • 1-out-of-n

`1. Alice, the sender, generates two RSA key pairs and sends the two public keys, puk0 and puk1, to Bob, the receiver`
`2. Bob generates a random number and encrypts it with one of the two public keys he received. Which key he uses depends on which message he wants: puk0 if he wants M0, puk1 if he wants M1. He sends the ciphertext to Alice`
`3. Alice decrypts the received ciphertext with each of her two private keys, obtaining two results K0 and K1. She XORs them with the two messages to be sent (E0 = K0 XOR M0, E1 = K1 XOR M1) and sends E0 and E1 to Bob`
`4. Bob XORs his real random number with each of E0 and E1. Only one of the two results is the real message; the other is random-looking data`
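
The four steps above can be traced with a toy sketch. It uses textbook RSA without padding on small fixed primes, so it is for following the message flow only and is not a secure oblivious transfer (for instance, the two moduli differ in size, which can hint at Bob's choice). All function names are invented for the example; messages are plain integers.

```python
import secrets

E = 65537  # public exponent shared by both toy keys


def rsa_keypair(p, q):
    """Textbook RSA key from two primes (needs Python 3.8+ for pow(E, -1, m))."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))
    return (n, E), (n, d)


def alice_setup():
    # Step 1: Alice creates two key pairs (fixed Mersenne primes, toy sizes).
    puk0, prk0 = rsa_keypair(2**61 - 1, 2**89 - 1)
    puk1, prk1 = rsa_keypair(2**107 - 1, 2**127 - 1)
    return (puk0, puk1), (prk0, prk1)


def bob_request(public_keys, choice):
    # Step 2: Bob encrypts a random number with the key matching his choice.
    n, e = public_keys[choice]
    r = secrets.randbelow(n - 2) + 2
    return r, pow(r, e, n)


def alice_respond(private_keys, ciphertext, m0, m1):
    # Step 3: Alice decrypts with both private keys and masks both messages.
    (n0, d0), (n1, d1) = private_keys
    k0 = pow(ciphertext, d0, n0)
    k1 = pow(ciphertext, d1, n1)
    return k0 ^ m0, k1 ^ m1


def bob_recover(r, choice, e0, e1):
    # Step 4: only the chosen slot unmasks to a real message.
    return r ^ (e0, e1)[choice]


if __name__ == "__main__":
    m0 = int.from_bytes(b"message zero", "big")
    m1 = int.from_bytes(b"message one", "big")
    pubs, privs = alice_setup()
    r, c = bob_request(pubs, choice=1)
    e0, e1 = alice_respond(privs, c, m0, m1)
    got = bob_recover(r, 1, e0, e1)
    print(got.to_bytes((got.bit_length() + 7) // 8, "big"))
```

Decrypting with the key Bob did not use yields a value unrelated to his random number, which is why XORing the other slot gives him nothing useful.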

Combining the RSA algorithm with hashing

Used to solve the problem of encrypted sample alignment

`First, at a high level, each side must make sure its data cannot be obtained by the other party`
`Both A and B must apply to their data an operation that only they know, so that the other party cannot reverse-engineer the data`
`For A, the secret operation is a hash plus a randomly generated r_i`
`For B, the secret operation is a hash plus the self-generated d`
`Step 1: B generates N, e and d with the RSA algorithm and sends the public key (N, e) to A`
`Step 2: A blinds its own user data by hashing it and combining the hash with r_i, then sends the result Y_A to B`
`Step 3: B receives Y_A. Because of how hashing works and because r_i is unknown to B, it is hard for B to deduce A's user data. B raises Y_A to the d-th power to obtain Z_A. B then processes its own user data (hash, raise to the d-th power, hash again) to obtain Z_B, and sends Z_A and Z_B to A`
`Step 4: A receives Z_B and, likewise, cannot deduce B's user data. A then takes its own Z_A, divides out r_i, and hashes the result to obtain D_A`
`Step 5: D_A and Z_B are the results of the same operation applied to the underlying data, so identical source data yields identical values. From the intersection of D_A and Z_B, A can therefore tell which data A and B have in common. Finally A sends the result I to B, and sample alignment is complete`
  • Why it works
`First, to find the intersection, the two sides' data must be brought together`
`and before it is brought together, each side must encrypt its own data`
`The shared user data must also remain identical after encryption; call this requirement 1`
`Suppose both sides hash their user data only once`
`Requirement 1 is then satisfied, but security is weak: the other party may be able to recover the user data from the hashes`
`Therefore random variables are introduced`
`Next, consider whether it is enough for one party to send data and the other to receive it`
`Suppose B encrypts its user data using its secret d and a hash, and sends it to A`
`After receiving it, A has to transform its own data`
`so that the data it shares with B equals the corresponding values sent by B, as requirement 1 demands`
`However, since A does not know d, A cannot make the shared data equal, and sample alignment fails`
`Next, consider having the two parties send data to each other`
`A encrypts its data with the random variable r_i and a hash and sends it to B`
`B operates on the ciphertext sent by A, so that ciphertext now carries the factor d`
`B encrypts its own data and then sends both sets of data to A`
`After receiving the two ciphertexts, A transforms its own ciphertext`
`which at this point already contains the factor d`
`So, in principle, once A removes the "lock" it added to its own data`
`the result has the same form as B's data, and sample alignment can be completed`
`Throughout this process, neither party can deduce the other's data`
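
The five alignment steps can be sketched in a single process as follows. The sketch assumes the common RSA blind-signature construction, which matches the steps described here: A sends Y_A = r_i^e · H(u) mod N, B returns Z_A = Y_A^d = r_i · H(u)^d mod N, and A divides out r_i. The hash choice (SHA-256), the toy fixed primes and all function names are assumptions for the example; a real deployment would use properly generated keys and run A and B as separate services.

```python
import hashlib
import math
import secrets


def hash_to_int(uid, n):
    """First hash: map a uid string to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(uid.encode("utf-8")).digest(), "big") % n


def hash_tag(x):
    """Second hash: turn a group element into a comparable tag."""
    return hashlib.sha256(str(x).encode("ascii")).hexdigest()


def align_samples(ids_a, ids_b):
    # Step 1 (B): RSA key from fixed toy primes; only (N, e) would be sent to A.
    p, q = 2**107 - 1, 2**127 - 1
    n, e = p * q, 65537
    d = pow(e, -1, (p - 1) * (q - 1))

    # Step 2 (A): blind each hashed uid with a fresh random r_i.
    blinds, y_a = [], []
    for uid in ids_a:
        r = secrets.randbelow(n - 2) + 2
        while math.gcd(r, n) != 1:          # r must be invertible mod N
            r = secrets.randbelow(n - 2) + 2
        blinds.append(r)
        y_a.append(pow(r, e, n) * hash_to_int(uid, n) % n)

    # Step 3 (B): sign A's blinded values; hash-sign-hash its own uids.
    z_a = [pow(y, d, n) for y in y_a]
    z_b = {hash_tag(pow(hash_to_int(uid, n), d, n)) for uid in ids_b}

    # Step 4 (A): divide out r_i, then hash, giving D_A.
    d_a = [hash_tag(z * pow(r, -1, n) % n) for z, r in zip(z_a, blinds)]

    # Step 5 (A): intersect D_A with Z_B; the result I would be sent to B.
    return [uid for uid, tag in zip(ids_a, d_a) if tag in z_b]


if __name__ == "__main__":
    print(align_samples(["u1", "u2", "u3"], ["u2", "u3", "u4"]))
```

Because both D_A and Z_B end up as the hash of H(u)^d mod N, equal uids produce equal tags even though A never learns d and B never learns r_i.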

Model training encryption process

Homomorphic encryption

Vertical (longitudinal) model training uses additive homomorphic encryption, one form of homomorphic encryption
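
To make "additive homomorphic" concrete, here is a toy sketch using the Paillier scheme, one widely used additively homomorphic cryptosystem. The article does not say which scheme the product uses, so Paillier is an assumption, as are the function names and the small fixed primes (far too small for real use). It shows that ciphertexts can be added, and multiplied by a plaintext constant, without decrypting.

```python
import secrets


def keygen(p=2**61 - 1, q=2**89 - 1):
    """Toy Paillier key with generator g = n + 1 (Python 3.8+ for pow(x, -1, n))."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)
    return n, (phi, mu)


def encrypt(n, m):
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1      # random blinding factor
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    phi, mu = private_key
    x = pow(c, phi, n * n)
    return (x - 1) // n * mu % n


def add_encrypted(n, c1, c2):
    """Ciphertext of m1 + m2, computed from ciphertexts only."""
    return c1 * c2 % (n * n)


def mul_plain(n, c, k):
    """Ciphertext of k * m for a plaintext constant k."""
    return pow(c, k, n * n)


if __name__ == "__main__":
    n, priv = keygen()
    c_sum = add_encrypted(n, encrypt(n, 12), encrypt(n, 30))
    print(decrypt(n, priv, c_sum))
    print(decrypt(n, priv, mul_plain(n, encrypt(n, 7), 6)))
```

Only these two operations are available on ciphertexts, which is why expressions involving exponentials have to be approximated by polynomials before they can be evaluated under encryption.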

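[Figure: the loss function, the gradient, and the additive homomorphic encryption property (formulas), with the training workflow on the right]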

`The first expression is the loss function`
`The second expression is the gradient`
`The next two formulas state the property of additive homomorphic encryption: the ciphertext of a sum equals the sum of the ciphertexts`
`Because additive homomorphic encryption supports only addition (and hence subtraction) of ciphertexts and multiplication by plaintext values, and cannot evaluate exponentials, the loss needs a second-order Taylor expansion at zero`
`The figure on the right shows the actual workflow. UA sends the encrypted values of UA and UA^2 to UB, where UA = w * x, that is, the product of the weights and the feature values of its own data`
`UB computes the product of the weights and the feature values of its own data, then adds it to UA to obtain w * x over all features of the sample`
`Then UB computes d using the sample labels y that it holds`
`d is the gradient expression with the factor x removed`
`UB encrypts d and sends it to UA. UA and UB can then each compute their own gradient by multiplying their own feature values x by d`
`After computing their gradients, UA and UB encrypt them and upload them to the arbiter. The arbiter decrypts and updates the gradients, then distributes the updated gradients. Once UA and UB receive them, one update of the model is complete`
`This process is iterated until the loss falls below the expected value`
`When there are many data-provider hosts, the loss is not computed, in order to reduce communication cost. Instead, the size of the gradient update between two consecutive iterations decides whether training is over: if the change is very small, training ends`
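
The arithmetic of one training round can be followed in a plaintext simulation. The sketch assumes the logistic loss with labels y in {-1, +1}, whose second-order Taylor expansion at zero is log(1 + e^(-z)) ≈ log 2 − z/2 + z²/8; under that assumption the gradient is d · x with d = 0.25 · (u_A + u_B) − 0.5 · y. Encryption and the arbiter are deliberately left out, and the stopping rule is one reading of "gradient update amplitude" (the largest weight change in a round). Names, learning rate and tolerance are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def residuals(w_a, xs_a, w_b, xs_b, ys):
    """d per sample: the Taylor-approximated gradient with x removed."""
    ds = []
    for x_a, x_b, y in zip(xs_a, xs_b, ys):
        u = dot(w_a, x_a) + dot(w_b, x_b)   # u_A + u_B = w * x over all features
        ds.append(0.25 * u - 0.5 * y)
    return ds


def local_gradient(ds, xs):
    """Each party multiplies d by its own feature values and averages."""
    n = len(xs)
    return [sum(d * x[j] for d, x in zip(ds, xs)) / n for j in range(len(xs[0]))]


def train(xs_a, xs_b, ys, lr=0.5, tol=1e-6, max_iter=1000):
    w_a = [0.0] * len(xs_a[0])
    w_b = [0.0] * len(xs_b[0])
    for iteration in range(1, max_iter + 1):
        ds = residuals(w_a, xs_a, w_b, xs_b, ys)
        g_a = local_gradient(ds, xs_a)
        g_b = local_gradient(ds, xs_b)
        w_a = [w - lr * g for w, g in zip(w_a, g_a)]
        w_b = [w - lr * g for w, g in zip(w_b, g_b)]
        # Stop on a tiny update instead of computing the loss.
        if max(abs(lr * g) for g in g_a + g_b) < tol:
            break
    return w_a, w_b, iteration


if __name__ == "__main__":
    xs_a = [[1.0], [-1.0], [0.5], [-0.5]]   # features held by UA
    xs_b = [[0.5], [-0.5], [1.0], [-1.0]]   # features held by UB
    ys = [1, -1, 1, -1]                     # labels held by UB
    print(train(xs_a, xs_b, ys))
```

In the real protocol the values u_A, u_A² and d travel as ciphertexts and only the arbiter can decrypt the gradients; the numbers computed are the same as in this plaintext version.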

Epilogue

To be covered in later articles:

  • How to design and implement a data architecture for data at the hundred-million scale
  • How big data processing frameworks such as Spark and Flink are used for data processing
  • Python's strength is in algorithms: how the algorithms support model training on data at the hundred-million scale
  • Core: an introduction to each stage of the Python back end
  • Introduction to other federated learning technical points