[Paper Notes] CVPR 2020: Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering
This is a CVPR 2020 paper that uses a clustering approach for domain adaptation, with strong results on the commonly used benchmarks as of 2020. Recent work follows a similar idea, so let's take a look.
Transfer learning addresses the distribution mismatch between the source and target domains. The mainstream approach aligns features across domains, which may destroy the intrinsic structure of the target domain. To address this, the paper assumes the two domains share a similar cluster structure and imposes clustering constraints, training the network jointly by minimizing the KL divergence between the network's predicted probability distribution and an auxiliary distribution.
The paper opens by reiterating that the feature-alignment approach common in UDA can destroy the discriminative structure of the target data.
The article then puts forward two assumptions:

- Each domain contains clusters.
- Clusters of the same class in the two domains are close to each other.
The specific method replaces one-hot labels with an auxiliary distribution to realize structural regularization of the source domain (the phrasing here is rather awkward). In addition, clustering constraints are imposed on the network's intermediate features to enhance target-domain discrimination, and the structural constraints are strengthened by a soft (rather than discrete) selection of source-domain samples.
This passage is basically the abstract again. At this point I still don't know exactly what the method is, so let's keep reading.
The common approach is to align domain features under some distance measure such as MMD, or adversarially.
The author notes that very little UDA work performs clustering in the target domain; clustering is usually just an additional loss (an incremental trick).
It is worth mentioning that the author devotes the third section to the paper's motivation, under the heading "the strategies of transferring versus uncovering the intrinsic target discrimination".
First, notation: s denotes the source domain, t the target domain, x a sample, z an intermediate feature, y a label, and K the number of classes. φ is the feature extractor and f is the classifier, so z = φ(x) and y = f(z).
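A minimal sketch of this notation, with hypothetical dimensions and random linear maps standing in for the real networks (the actual φ and f are deep networks; everything here is just to fix the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 10-d input, 5-d embedding, K = 3 classes.
D_IN, D_Z, K = 10, 5, 3

# phi: feature extractor, z = phi(x) (a random linear map as a stand-in).
W_phi = rng.standard_normal((D_IN, D_Z))
phi = lambda x: x @ W_phi

# f: classifier producing logits over the K classes, y_hat = f(z).
W_f = rng.standard_normal((D_Z, K))
f = lambda z: z @ W_f

x = rng.standard_normal((4, D_IN))  # a batch of 4 samples
z = phi(x)                          # intermediate features
logits = f(z)
assert z.shape == (4, D_Z) and logits.shape == (4, K)
```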
The paper argues that aligning feature distributions greatly weakens the discriminative power of target-domain features, and a classifier is not very effective on such weakly discriminative features, because it deviates far from the oracle target classifier trained with ground-truth labels.
The author also uses a picture to explain:
It doesn't add much beyond the text.
Then there is the specific method.
Target domain loss
First, the author defines the target-domain prediction probability p, obtained from f (the network's logits passed through a softmax). Then an auxiliary probability q is defined. The loss being optimized is as follows:
The first term aligns p with q, and the second is the negative entropy of q summed over the K classes; minimizing it (i.e., maximizing the entropy) keeps q from collapsing into a one-hot distribution.
q is defined as:
This construction appears in many unsupervised deep clustering papers. It is essentially a sharpening of the probability distribution: entries of q that were already large become larger, and small ones become smaller, which produces the clustering effect.
q is held fixed, i.e., it does not participate in back-propagation, so the KL divergence degenerates into a cross entropy.
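As a sketch of this step (assuming the DEC-style squared-and-renormalized sharpening formula, which these notes do not spell out), the prediction p, the auxiliary q, and the resulting cross-entropy loss could look like:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def auxiliary_q(p):
    """DEC-style sharpened auxiliary distribution (an assumption: the notes
    do not give the exact formula, but this is the common choice).
    q_ik ∝ p_ik^2 / sum_i p_ik, renormalized per sample."""
    weight = p ** 2 / p.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def target_loss(p, q):
    """With q fixed (no gradient through q), KL(q || p) reduces, up to a
    constant, to the cross entropy -sum_k q_ik log p_ik."""
    return -(q * np.log(p + 1e-12)).sum(axis=1).mean()

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.8, 0.3]])
p = softmax(logits)
q = auxiliary_q(p)
# Sharpening: the dominant entry of each row grows, the others shrink.
assert (q.max(axis=1) >= p.max(axis=1)).all()
loss = target_loss(p, q)
assert loss > 0
```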
The above constraint is on the output of f; the author also constrains the intermediate feature z, which is presumably the "joint training" mentioned earlier.
Define a hatted probability distribution p̂:
This converts the distance between z and each class center into a probability using a Student's t-distribution kernel, i.e., a soft cluster assignment.
There are also:
How are the cluster centers μ_k in (4) computed? I assume by clustering; the paper does not say, so one has to check the code. The paper also states that μ_k is a trainable parameter: when optimizing (5) it participates in back-propagation, so it can move as the feature space changes. But the author also notes that μ_k needs to be re-initialized before each epoch, which clustering papers such as IDEC do not do. I have actually done similar work before, and the reason is that the z feature space changes too fast: updating μ_k by the loss gradient alone is inaccurate. Only those who have run the experiments know this.
Therefore, the losses on target-domain samples can be combined into:
Source domain loss
First, there is a supervised cross-entropy loss.
Then the same constraint is applied to z:
The author also weights these two source losses per sample. Generally speaking, this kind of weighting increases the weight of samples that transfer well; using a domain discriminator would also work. Here the author computes a centroid c_k from the target-domain features z^t of class k, which makes sense.
A cosine similarity is used.
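A sketch of this soft selection weight (the (1 + cos)/2 mapping to [0, 1] is my assumption; the notes only say cosine similarity is used):

```python
import numpy as np

def source_weight(z_s, c):
    """Soft selection weight for a source sample: cosine similarity between
    its feature z_s and the target-domain centroid c of its class,
    mapped from [-1, 1] to [0, 1]."""
    cos = z_s @ c / (np.linalg.norm(z_s) * np.linalg.norm(c) + 1e-12)
    return (1.0 + cos) / 2.0

c = np.array([1.0, 0.0])                           # target centroid of a class
w_near = source_weight(np.array([0.9, 0.1]), c)    # feature near the centroid
w_far = source_weight(np.array([-1.0, 0.2]), c)    # feature pointing away
# Samples that transfer well (close to the target centroid) get larger weight.
assert w_near > w_far
assert 0.0 <= w_far <= 1.0 and 0.0 <= w_near <= 1.0
```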
After this analysis the framework should be quite clear, but one question remains: what is the final classifier, f or the cluster assignment based on μ, and which works better?
The author runs experiments on Office-31, ImageCLEF-DA and Office-Home, with very good results. I won't reproduce the numbers here, and the analysis experiments follow the usual formulaic routine, so there's nothing to say. One doubt about the ablation study:
For the "(w/o structural source regularization)" row, I'm curious how the first item is done. Without the source-domain classification loss, the result is usually not directly usable; moreover, the paper's p must be chaotic, because f has not been trained.
A few more words after reading. I submitted to CVPR 2020, and in the last two months before the deadline I was also using a clustering method; close to submission I found this paper. The loss looked rather similar to mine, and the results were comparable. The timing was bad, and I guessed my submission would be cannon fodder. Now the UDA benchmarks are basically saturated and SOTA is very high. This time I have a new idea whose results are better than this paper's, evaluated on two additional datasets, and the novelty feels good too. I hope it makes it into ICCV.