[Paper Dissection] CVPR 2020: Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering
This is a CVPR 2020 paper that applies a clustering method to domain adaptation. Its results on the common benchmarks were outstanding as of 2020, and recent work follows a similar idea, so let's dig in.
Transfer learning addresses the distribution mismatch between the source and target domains. The mainstream approach aligns features across domains, which may destroy the internal structure of the target domain. To address this, the paper brings clustering constraints into domain adaptation under an assumption of structural similarity between the domains: the network is trained jointly by minimizing the KL divergence between the network's predicted probability distribution and an auxiliary distribution.
To start, the introduction reiterates that the feature-alignment approach of typical UDA methods can destroy the discriminative structure of the target data.
The article then puts forward a hypothesis:
- Clusters exist in each domain.
- Clusters of the same kind in the two domains will be close.
Concretely, structural regularization of the source domain is realized by replacing the auxiliary distribution with the ground-truth one-hot labels (admittedly a mouthful). In addition, clustering constraints are imposed on the network's intermediate features to enhance target-domain discrimination, and soft (non-discrete) selection of source samples is used to strengthen the structural constraint.
This paragraph is essentially the abstract again; I still don't know exactly what method the paper uses, so let's keep reading.
The usual approach is to align domain features under some distance, such as MMD, or via adversarial training.
The authors note that very little UDA work clusters in the target domain; where clustering does appear, it serves only as an auxiliary loss (an incremental trick).
It is worth mentioning that the authors devote the entire third section to motivation: the strategy of transferring features versus that of uncovering the intrinsic target discrimination.
First, notation: S is the source domain, T the target domain, x a sample, z an intermediate feature, y a label, and K the number of classes. φ is the feature extractor and f the classifier, so z = φ(x) and y = f(z).
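To make the notation concrete, here is a toy sketch in NumPy; the two linear maps are purely hypothetical stand-ins for φ and f, just to fix shapes:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical linear stand-ins for phi (feature extractor) and f (classifier).
rng = np.random.default_rng(0)
W_phi = rng.normal(size=(8, 4))   # phi: x (8-dim input) -> z (4-dim feature)
W_f = rng.normal(size=(4, 3))     # f:   z (4-dim)       -> K = 3 class logits

x = rng.normal(size=(5, 8))       # a batch of 5 samples
z = x @ W_phi                     # intermediate features z = phi(x)
p = softmax(z @ W_f)              # predicted probabilities p = softmax(f(z))
```

Each row of `p` is a probability distribution over the K classes, which is the quantity the clustering losses below operate on.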
The paper argues that aligning feature distributions greatly weakens the discrimination of the target domain, and a classifier is not very effective on weakly discriminative target features, because it deviates far from the oracle target classifier trained with ground-truth labels (why "oracle"? presumably because it denotes the ideal reference classifier one could train if target labels were available).
The authors also illustrate this with a figure, but it doesn't add much.
Then there is the specific method.
Target domain loss
First, the authors define the target-domain prediction probability p, obtained from f (the network's predictions after softmax). Then an auxiliary probability q is defined. The loss to optimize is:
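The equation image did not survive; based on the description below of its two terms, it is plausibly the DEC-style objective (my reconstruction, not verbatim from the paper):

```latex
\min_{\varphi, f}\ \frac{1}{n_t}\sum_{i=1}^{n_t}\sum_{k=1}^{K} q_{ik}\log\frac{q_{ik}}{p_{ik}}
\;+\; \sum_{k=1}^{K} f_k \log f_k,
\qquad f_k=\frac{1}{n_t}\sum_{i=1}^{n_t} q_{ik}
```

Here the second term is the negative entropy of the empirical cluster-frequency distribution; minimizing it keeps the cluster assignments balanced.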
The first term aligns p with q, and the second term sums over the K classes and takes a negative entropy; in effect it stops q from collapsing into a degenerate one-hot distribution.
Q is defined as:
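This equation is likewise missing; the sharpening behavior described next matches the standard DEC auxiliary distribution, so presumably:

```latex
q_{ik} \;=\; \frac{p_{ik}^{2} \,/\, \sum_{i'} p_{i'k}}
{\sum_{k'} \left( p_{ik'}^{2} \,/\, \sum_{i'} p_{i'k'} \right)}
```

Squaring the predictions and renormalizing emphasizes confident assignments while down-weighting uncertain ones.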
This constraint appears in many unsupervised clustering papers. It is essentially a sharpening distribution: the originally large q_ik get larger and the small ones get smaller, which is what drives the clustering.
q is held fixed, i.e. it does not participate in back-propagation; with q fixed, the KL divergence degenerates into a cross entropy.
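As a quick check of the sharpening behavior and the stop-gradient point, a small NumPy sketch (the auxiliary distribution here is my DEC-style reconstruction, not necessarily the paper's exact formula):

```python
import numpy as np

def aux_distribution(p):
    # DEC-style auxiliary distribution: square the predictions,
    # normalize by the soft cluster frequency, then renormalize per
    # sample. Confident assignments are sharpened.
    weight = p ** 2 / p.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

p = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1]])
q = aux_distribution(p)
# For sample 0, the confident entry grows and the weak one shrinks.
# With q treated as a constant (no gradient), KL(q || p) equals
# CE(q, p) minus the fixed entropy H(q), so only the cross-entropy
# part matters for optimization -- the degeneration noted above.
```

Note the column normalization also discounts clusters that already attract many samples, so the sharpening is frequency-aware, not a pure per-row rescaling.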
The constraint above applies to the output of f; the authors also constrain the intermediate feature z, which is presumably the joint training mentioned earlier.
Next they define a probability distribution "wearing a hat", p̂:
This p̂ is computed from the distance to the cluster centers; it is a Student's t-distribution, used as the cluster-assignment probability.
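The formula image is missing; given the description above (distance to class centers, Student's t), it is presumably the standard DEC kernel:

```latex
\hat{p}_{ik} \;=\; \frac{\left(1+\lVert z_i-\mu_k\rVert^{2}\right)^{-1}}
{\sum_{k'}\left(1+\lVert z_i-\mu_{k'}\rVert^{2}\right)^{-1}}
```

where μ_k is the center of cluster k in the feature space.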
There are also:
How is μ_k in (4) computed? I'd guess by clustering; the paper doesn't say clearly, so one has to check the code. It is also mentioned that μ_k is a trainable parameter, so it participates in back-propagation when optimizing (5) and can therefore move as the feature space changes. However, the authors also note that the μ_k should be re-initialized before each epoch, which clustering papers such as IDEC don't do. I have actually done similar work before, and the reason is that the features in z-space change too fast: if μ_k were updated only through the loss, it would become inaccurate. Only those who have run the experiments know this.
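A minimal sketch of the per-epoch re-initialization of μ_k described above, assuming the centers are re-set to the mean feature of each cluster's current members (the paper's exact procedure may differ, e.g. k-means):

```python
import numpy as np

def reinit_centers(z, assignments, K):
    # Re-initialize cluster centers mu_k at the start of an epoch as the
    # mean feature of the samples currently assigned to cluster k.
    # This resynchronizes the centers with the fast-moving feature space
    # instead of relying only on gradient updates.
    centers = np.zeros((K, z.shape[1]))
    for k in range(K):
        members = z[assignments == k]
        if len(members) > 0:
            centers[k] = members.mean(axis=0)
    return centers

# Two well-separated toy clusters in a 2-D feature space.
z = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
assignments = np.array([0, 0, 1, 1])
mu = reinit_centers(z, assignments, K=2)
```

After re-initialization, μ_k would again be treated as a trainable parameter for the rest of the epoch.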
Therefore, the loss of target domain samples can be integrated into:
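The combined equation is also missing; presumably it sums the classifier-level and feature-level clustering terms, with q̂ built from p̂ in the same way q is built from p (my reconstruction):

```latex
\mathcal{L}_t \;=\; \frac{1}{n_t}\sum_{i,k} q_{ik}\log\frac{q_{ik}}{p_{ik}}
\;+\; \frac{1}{n_t}\sum_{i,k} \hat{q}_{ik}\log\frac{\hat{q}_{ik}}{\hat{p}_{ik}}
```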
Source domain loss
The first is a supervised cross entropy. There is nothing to say:
Then the same constraint is applied to Z:
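Both source-side equations are missing from the extraction. If the auxiliary distribution is replaced by the ground-truth one-hot label y_i, each KL term collapses to a cross entropy, presumably:

```latex
\mathcal{L}_s \;=\; -\frac{1}{n_s}\sum_{i=1}^{n_s} \log p_{i,y_i},
\qquad
\mathcal{L}_s^{z} \;=\; -\frac{1}{n_s}\sum_{i=1}^{n_s} \log \hat{p}_{i,y_i}
```

The first is the ordinary supervised loss on the classifier output, the second the same supervision applied to the feature-space assignment p̂.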
The authors also weight these two losses per sample. Generally, this kind of weighting increases the weight of samples with high transferability, and a domain discriminator would also work here. Instead, the authors compute class centroids c_k from the target-domain features z, which makes sense.
A cosine similarity is used.
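A small NumPy sketch of the soft weighting described above; rescaling the cosine similarity from [-1, 1] to [0, 1] is my assumption, not necessarily the paper's exact form:

```python
import numpy as np

def source_weight(z_s, center):
    # Soft weight for a source sample: cosine similarity between its
    # feature and the centroid of its class computed from TARGET-domain
    # features, rescaled to [0, 1]. Source samples whose features lie
    # close to the target cluster get higher weight.
    cos = z_s @ center / (np.linalg.norm(z_s) * np.linalg.norm(center))
    return (cos + 1.0) / 2.0

c = np.array([1.0, 0.0])                          # hypothetical target-class centroid
w_near = source_weight(np.array([2.0, 0.1]), c)   # feature aligned with the centroid
w_far = source_weight(np.array([-1.0, 0.5]), c)   # feature pointing away from it
```

The weight then multiplies that sample's source losses, which is the "soft selection" of source samples mentioned earlier.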
After this analysis the framework should be quite clear, but one question remains: what is the final classifier at test time, f or the centers μ, and which works better?
The authors run experiments on Office-31, ImageCLEF-DA, and Office-Home, with very good results (I won't paste the numbers here). The analysis experiments follow the usual formulaic routine; nothing to say there. I do have a question about the ablation study:
For the (w/o structural source regularization) row, I'm curious how the first item is done. If the source-domain classification loss is removed entirely, training usually doesn't work straightforwardly, and the paper's p must be chaotic because f has never been trained.
A couple more words after finishing the paper. I had submitted to CVPR 2020 a couple of months earlier, also using a clustering method, and found this paper close to the deadline. Its loss looked quite similar to mine, which was deflating, and its results were comparable to mine. Since my submission was rushed rather than polished, it will probably end up as cannon fodder. By now the UDA benchmarks are saturated and chasing SOTA is brutal. This time I've come up with a new idea whose results are a bit better than this paper's, with two more datasets, and the novelty feels good too. I hope it can succeed at ICCV.