Learn how to implement adversarial validation: build a classifier that tries to determine whether a row of data comes from the training set or the test set. If the classifier succeeds, there is a problem with your data, and the adversarial validation model can help you diagnose it.
If you look at some winning solutions on Kaggle, you might notice a reference to "adversarial validation" (like this one). What is it?
In short, we build a classifier to try to predict which rows are from the training set and which rows are from the test set. If the two datasets come from the same distribution, that should not be possible. However, if there are systematic differences between the feature values of your training and test sets, the classifier will be able to learn to distinguish them. The better a model can learn to tell them apart, the bigger the problem.
But the good news is that you can analyze the learned model to help you diagnose the problem. And once you understand the problem, you can fix it.
Learning an adversarial validation model
First, import some libraries:
For this tutorial, we'll use Kaggle's IEEE-CIS credit card fraud detection dataset. First, suppose you have loaded the training and test data into pandas DataFrames named df_train and df_test. Then we'll do some basic cleanup by replacing the missing values.
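The cleanup step might look something like the sketch below. The helper name `fill_missing`, the placeholder token `"<UNK>"`, and the numeric fill value `-999` are assumptions made for illustration, and the tiny DataFrame stands in for the real `df_train`.

```python
import numpy as np
import pandas as pd

def fill_missing(df, cat_token="<UNK>", num_value=-999):
    """Return a copy of df with missing values replaced by placeholders.

    Numeric columns get num_value, all other columns get cat_token.
    Both placeholders are illustrative choices, not values from the article.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(num_value)
        else:
            out[col] = out[col].fillna(cat_token)
    return out

# Toy stand-in for the real training data.
df_train = pd.DataFrame({
    "TransactionAmt": [10.0, np.nan, 30.0],
    "id_31": ["chrome 63.0", None, "safari 11.0"],
})
df_train = fill_missing(df_train)
print(df_train)
```

The same function would be applied to `df_test` so that both sets are cleaned identically.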
For adversarial validation, we want to learn a model that predicts which rows come from the training set and which come from the test set. Therefore, we create a new target column in which test samples are labeled 1 and training samples are labeled 0, as follows:
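A minimal sketch of the labeling step, assuming the new column is called `dataset_label` (the name is an assumption, and the two small DataFrames are stand-ins for the real data):

```python
import pandas as pd

# Toy stand-ins for the real training and test DataFrames.
df_train = pd.DataFrame({"TransactionDT": [100, 200, 300]})
df_test = pd.DataFrame({"TransactionDT": [400, 500]})

# 0 marks rows from the training set, 1 marks rows from the test set.
df_train["dataset_label"] = 0
df_test["dataset_label"] = 1

target = "dataset_label"
```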
This is the target our model will be trained to predict. At the moment the training and test sets are separate, and each contains only one value of the target. If we trained a model on the training set as it stands, it would only learn that everything is 0. Instead, we need to combine and shuffle the training and test data, then split the result into new datasets for fitting and evaluating the adversarial validation model. I defined a function that combines, shuffles, and re-splits the data:
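One way such a function could be written is sketched below. The signature, the `n_holdout` and `seed` parameters, and the `dataset_label` column name are assumptions; the author's own function may differ.

```python
import pandas as pd

def create_adversarial_data(df_train, df_test, cols, n_holdout, seed=0):
    """Combine, shuffle, and re-split the two datasets.

    cols must include the target column. ignore_index=True avoids
    duplicate index values after concatenation.
    """
    combined = pd.concat([df_train[cols], df_test[cols]], axis=0, ignore_index=True)
    shuffled = combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    adversarial_test = shuffled.iloc[:n_holdout]
    adversarial_train = shuffled.iloc[n_holdout:]
    return adversarial_train, adversarial_test

# Toy data: 6 "training" rows (label 0) and 4 "test" rows (label 1).
df_train = pd.DataFrame({"TransactionDT": range(6), "dataset_label": 0})
df_test = pd.DataFrame({"TransactionDT": range(6, 10), "dataset_label": 1})

cols = ["TransactionDT", "dataset_label"]
adversarial_train, adversarial_test = create_adversarial_data(
    df_train, df_test, cols, n_holdout=3
)
```

Shuffling before the split matters: without it, the held-out slice would contain rows from only one of the original datasets.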
The new datasets, adversarial_train and adversarial_test, each contain a mix of rows from the original training and test sets, and the target indicates which original dataset each row came from. Note: I have added TransactionDT to the feature list.
For modeling, I'll use CatBoost. I prepare the data by putting the DataFrames into CatBoost Pool objects.
This part is very simple: we just instantiate a CatBoost classifier and fit it to our data.
Let's move on and plot the ROC curve on the held-out dataset:
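A plotting helper along these lines could be used. The function name `plot_roc` is hypothetical, and the labels and scores below are randomly generated stand-ins for the real held-out labels and model predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc(y_true, y_score, label="adversarial model"):
    """Plot the ROC curve and return the figure together with the AUC."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return fig, auc

# Stand-in data: scores for label-1 rows are shifted upward on average.
rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 500)
y_score = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])

fig, auc = plot_roc(y_true, y_score)
```

An AUC near 0.5 would mean the classifier cannot tell the two datasets apart, which is the outcome we hope for; call `plt.show()` to display the figure.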
This is a perfect model, which means there is a clear way to tell whether any given record belongs to the training set or the test set. This violates the assumption that our training and test sets are identically distributed.
Diagnose the problem and iterate
To see how the model does this, let’s look at the most important features:
TransactionDT is by far the most important feature. This makes perfect sense, given that the original training and test sets come from different time periods (the test set lies in the future relative to the training set). The model has simply learned that if TransactionDT is larger than that of the last training sample, the row is in the test set.
I included TransactionDT only to illustrate this point; using a raw date as a model feature is generally not recommended. But the good news is that the technique exposed it so dramatically. This kind of analysis can clearly help you catch this kind of mistake.
Let's remove TransactionDT and run the analysis again.
Now, the ROC curve is as follows:
It is still a fairly strong model, with AUC > 0.91, but much weaker than before. Let's look at the feature importances of this model:
Now id_31 is the most important feature. Let's look at some of its values to see what it contains.
This column contains the software version number. Obviously, this is conceptually similar to including the original date, because the first appearance of a particular software version will correspond to its release date.
Let's address this by removing all non-alphabetic characters from the column:
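With pandas this can be done using a regular expression, as in the sketch below. The sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up examples of browser/version strings.
df = pd.DataFrame({
    "id_31": ["chrome 63.0", "mobile safari 11.0", "samsung browser 6.2", "<UNK>"]
})

# Drop every character that is not a letter; this also removes spaces,
# digits, dots, and the angle brackets of the missing-value placeholder.
df["id_31"] = df["id_31"].str.replace(r"[^a-zA-Z]", "", regex=True)
print(df["id_31"].tolist())
```

The same transformation has to be applied to both the training and test DataFrames before rebuilding the adversarial datasets.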
Now the values in this column look like this:
Let's train a new adversarial validation model using this cleaned column.
Now the ROC plot looks like this:
The performance has dropped from an AUC of 0.917 to 0.906. This means we have made it a little harder for the model to distinguish our training set from our test set, but it is still quite capable of doing so.
In summary, this method evaluates whether the training and test sets follow the same distribution, which helps prevent a model from breaking down when it encounters new test data.