An Overview of Adversarial Validation

Date: 2021-01-07


Learn how to implement adversarial validation: build a classifier that tries to determine whether a given row of data comes from your training set or your test set. If it can, there is a problem with your data, and an adversarial validation model can help you diagnose it.

If you look at winning solutions on Kaggle, you may notice references to "adversarial validation" (like this). What is it?

In short, we build a classifier that tries to predict which rows come from the training set and which come from the test set. If the two datasets come from the same distribution, this should not be possible. But if there are systematic differences between the feature values of your training and test sets, the classifier will learn to tell them apart. The better the model can distinguish them, the bigger the problem you have.

The good news is that you can analyze the learned model to help diagnose the problem. And once you understand the problem, you can fix it.

You can find the complete code for this article on GitHub.

Learning the adversarial validation model

First, import some libraries:

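The original post showed its code as screenshots. A minimal set of imports covering everything used below (assuming pandas, catboost, scikit-learn, and matplotlib are installed) might look like this:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
```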

Data preparation

For this tutorial, we'll use Kaggle's IEEE-CIS credit card fraud detection dataset. First, suppose you have loaded the training and test data into pandas DataFrames named df_train and df_test. Then we'll do some basic cleanup by replacing missing values.

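A minimal sketch of that cleanup, assuming df_train and df_test are already loaded (the exact fill values here are my assumption, not the author's originals):

```python
# Replace missing values: a sentinel for numeric columns,
# a placeholder string for categorical (object) columns.
for df in [df_train, df_test]:
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna('<UNK>')
        else:
            df[col] = df[col].fillna(-999)
```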

For adversarial validation, we want to learn a model that predicts which rows come from the training set and which come from the test set. We therefore create a new target column, in which test samples are labeled 1 and training samples are labeled 0, like this:

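For example:

```python
# Label each row with the dataset it came from:
# 0 = original training set, 1 = original test set.
df_train['dataset_label'] = 0
df_test['dataset_label'] = 1
target = 'dataset_label'
```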

This is the target our model will be trained to predict. Right now, the training and test datasets are separate, and each has only one value of the target label. If we trained a model on this training set, it would just learn that everything is 0. Instead, we want to shuffle the training and test datasets together and then create new datasets for fitting and evaluating the adversarial validation model. I define a function for combining, shuffling, and re-splitting:

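A sketch of such a function; the 50,000-row holdout size, the excluded columns, and the feature-list construction are illustrative assumptions (the IEEE-CIS target column is isFraud):

```python
def create_adversarial_data(df_train, df_test, cols, n_test=50_000):
    """Combine the original train and test sets, shuffle them,
    and re-split into new adversarial train/test sets."""
    df_master = pd.concat([df_train[cols], df_test[cols]],
                          axis=0, ignore_index=True)
    adversarial_test = df_master.sample(n_test, replace=False)
    adversarial_train = df_master.drop(adversarial_test.index)
    return adversarial_train, adversarial_test

# Use every column except the row ID and the original fraud target.
# Note that TransactionDT is deliberately kept in the feature list.
features = [c for c in df_train.columns
            if c not in ('TransactionID', 'isFraud', target)]
adversarial_train, adversarial_test = create_adversarial_data(
    df_train, df_test, features + [target])
```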

The new datasets, adversarial_train and adversarial_test, contain a mix of the original training and test sets, and the target indicates which original dataset each row came from. Note: I added TransactionDT to the feature list.

For modeling, I'll use CatBoost. I prepare the data by putting the DataFrames into CatBoost Pool objects.

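A sketch, with the categorical columns inferred from their dtypes (that inference is my assumption):

```python
# CatBoost needs to know which features are categorical.
cat_features = [c for c in features
                if adversarial_train[c].dtype == 'object']

train_pool = Pool(adversarial_train[features],
                  adversarial_train[target],
                  cat_features=cat_features)
holdout_pool = Pool(adversarial_test[features],
                    adversarial_test[target],
                    cat_features=cat_features)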

Modeling

This part is very simple: we just instantiate a CatBoostClassifier and fit it to our data.

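Something like:

```python
model = CatBoostClassifier(eval_metric='AUC', verbose=False)
model.fit(train_pool, eval_set=holdout_pool)
```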

Let's move on and plot the ROC curve on the holdout dataset:
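One way to produce that plot with scikit-learn and matplotlib (a sketch, not the author's original plotting code):

```python
# Score the holdout set and compute the ROC curve.
probs = model.predict_proba(holdout_pool)[:, 1]
y_true = adversarial_test[target]
fpr, tpr, _ = roc_curve(y_true, probs)
auc = roc_auc_score(y_true, probs)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance-level diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```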

[Figure: ROC curve on the holdout set; the classifier is essentially perfect]

This is a perfect model, which means there is a clear way to tell whether any given record is in the training set or the test set. This violates the assumption that our training and test sets are identically distributed.

Diagnose the problem and iterate

To see how the model does this, let’s look at the most important features:
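With CatBoost this is a single call; the importances come back in the same order as the training features (a sketch):

```python
# Default importance type is PredictionValuesChange.
importances = pd.Series(model.get_feature_importance(),
                        index=features).sort_values(ascending=False)
print(importances.head(10))
```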

[Figure: feature importances, with TransactionDT dominating]

By far the most important feature is TransactionDT. Given that the original training and test datasets come from different time periods (the test set occurs in the future relative to the training set), this makes perfect sense. The model has simply learned that if TransactionDT is larger than the last training sample, the row belongs to the test set.

I included TransactionDT only to illustrate this point; it is generally not recommended to use a raw date as a model feature. But the good news is that the technique exposed it in such dramatic fashion, and this kind of analysis can clearly help you catch that kind of mistake.

Let's remove TransactionDT and run the analysis again.

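A sketch of the re-run, reusing the helper from above:

```python
# Drop the leaky date feature and rebuild everything.
features_no_dt = [c for c in features if c != 'TransactionDT']
adversarial_train, adversarial_test = create_adversarial_data(
    df_train, df_test, features_no_dt + [target])

train_pool = Pool(adversarial_train[features_no_dt],
                  adversarial_train[target], cat_features=cat_features)
holdout_pool = Pool(adversarial_test[features_no_dt],
                    adversarial_test[target], cat_features=cat_features)

model = CatBoostClassifier(eval_metric='AUC', verbose=False)
model.fit(train_pool, eval_set=holdout_pool)
```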

Now, the ROC curve is as follows:

[Figure: ROC curve on the holdout set, AUC ≈ 0.917]

It is still a fairly strong model, with AUC > 0.91, but much weaker than before. Let's look at the feature importances for this model:

[Figure: feature importances, with id_31 now the top feature]

Now id_31 is the most important feature. Let's look at some of its values to see what it is.

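For instance:

```python
# Inspect the most common values of id_31.
print(df_train['id_31'].value_counts().head(10))
```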

This column contains software version numbers. Clearly, this is conceptually similar to including a raw date, since the first appearance of a particular software version will correspond to its release date.

Let's work around this problem by deleting all non-alphabetic characters from the column:

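A sketch of the scrubbing step:

```python
# Strip everything except letters, so a software-plus-version
# string collapses to just the software name.
for df in [df_train, df_test]:
    df['id_31'] = (df['id_31'].astype(str)
                   .str.replace(r'[^a-zA-Z]', '', regex=True))
```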

Now, the values of the column look like this:

[Output: id_31 value counts after scrubbing, showing software names without version numbers]

Let's train a new adversarial validation model using this cleaned column:

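Rebuilding the pools from the scrubbed data and refitting, exactly as before:

```python
adversarial_train, adversarial_test = create_adversarial_data(
    df_train, df_test, features_no_dt + [target])

train_pool = Pool(adversarial_train[features_no_dt],
                  adversarial_train[target], cat_features=cat_features)
holdout_pool = Pool(adversarial_test[features_no_dt],
                    adversarial_test[target], cat_features=cat_features)

model = CatBoostClassifier(eval_metric='AUC', verbose=False)
model.fit(train_pool, eval_set=holdout_pool)
```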

Now, the ROC plot looks like this:

[Figure: ROC curve on the holdout set, AUC ≈ 0.906]

Performance dropped from 0.917 AUC to 0.906. This means it has become harder for the model to distinguish our training dataset from our test dataset, though it is still quite strong.

Conclusion

Adversarial validation assesses whether the training and test sets come from the same distribution, so that a differently distributed test set does not catch you by surprise and cause your model's performance to collapse.

