Abstract: By exploiting the hierarchical relationship constraints between object categories, the proposed method automatically learns, from data, rules for distinguishing different categories. On the one hand, this explains the model’s prediction process; on the other hand, it provides a feasible way to introduce human prior knowledge.
Benefiting from breakthroughs in deep learning, the accuracy of traditional computer vision tasks such as image classification and object detection has improved greatly. However, because deep learning models are complex and the underlying theory is still incomplete, two problems arise. First, the working mechanism of the model is not transparent to users: people cannot explain why the model’s predictions are right or wrong. It is therefore impossible to prove theoretically whether a model will perform well in practice, which hinders its application in safety-critical fields (such as medical image analysis and autonomous driving). Second, it is difficult to integrate long-accumulated human experience and knowledge into the model, and hence difficult to impose effective constraints on the learning process; under realistic conditions such as few or zero training samples, model accuracy falls far below human performance.
IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI, impact factor 17.861), a top academic journal in the field of artificial intelligence, recently accepted the paper “What is a Tabby? Interpretable Model Decisions by Learning Attribute-Based Classification Criteria”, in which Huawei Cloud, together with the Institute of Computing Technology, Chinese Academy of Sciences, proposes an exploratory solution to the two problems above. By using the hierarchical relationship constraints between object categories, the method automatically learns from data the rules that distinguish different categories. This both explains the model’s prediction process and provides a feasible way to introduce human prior knowledge.
First, let’s look at how taxonomists classify animals, through a few simple examples (from Wikipedia):
(1) “Tabby cat” is a kind of “domestic cat” with stripes, spots, lines and spiral patterns on its body surface.
(2) “Domestic cat” is a small, usually fur-coated, carnivorous and domesticated “feline”;
(3) “Felines” are carnivores with flexible claws, slim but muscular bodies and flexible forelimbs.
Figure 1. Schematic diagram of category hierarchy
It can be seen from these examples that taxonomists adopt a hierarchical approach when classifying animals. In the hierarchy, each category is represented as “parent class + some specific attributes”; for instance, stripes, spots, lines and spirals are the additional attributes that “tabby cat” has compared with its parent “domestic cat”.
In fact, if you compress the hierarchy, each category can be completely represented by a specific set of attributes. Take “tabby cat” as an example. After one level of compression: “tabby cat” is a small, carnivorous, domesticated “feline” with stripes, spots, lines and spiral patterns on its fur; that is, “tabby cat” is now represented as “grandparent class + more attributes”. After two levels of compression: “tabby cat” is a small, carnivorous, domesticated “carnivore” with flexible claws, a slim but muscular body, flexible forelimbs, and fur with stripes, spots, lines and spiral patterns; that is, “tabby cat” is now represented as “great-grandparent class + even more attributes”.
By analogy, if the compression is carried all the way up, “tabby cat” can be expressed as “‘animal’ + all the attributes of tabby cat”. The same holds for other animals: each can be expressed as “‘animal’ + all of its attributes”. Since every such representation shares the common component “animal”, each animal’s representation can be simplified to just “all of its attributes”. Similarly, all objects such as “plants” and “artifacts” can be completely represented by a set of attributes. Therefore, as long as the attribute set is defined well enough, all possible categories can be distinguished exactly by attributes; such a classification scheme is highly interpretable, and new human prior knowledge can easily be introduced.
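The “compression” idea above can be sketched in a few lines of code: a category’s full attribute set is simply the union of its own attributes and those of all its ancestors. The taxonomy and attribute names below are illustrative stand-ins, not the paper’s data.

```python
# Minimal sketch of hierarchy "compression": each category's full attribute
# set is the union of its own attributes and all ancestors' attributes.
# Categories and attribute strings here are illustrative.
TAXONOMY = {
    "carnivore": {"parent": None,
                  "attrs": {"flexible claws", "slim muscular body", "flexible forelimbs"}},
    "feline": {"parent": "carnivore", "attrs": {"retractable claws"}},
    "domestic cat": {"parent": "feline",
                     "attrs": {"small", "fur-coated", "domesticated"}},
    "tabby cat": {"parent": "domestic cat",
                  "attrs": {"striped/spotted coat"}},
}

def full_attributes(category):
    """Walk up to the root, accumulating attributes ('full compression')."""
    attrs = set()
    while category is not None:
        node = TAXONOMY[category]
        attrs |= node["attrs"]
        category = node["parent"]
    return attrs
```

Running `full_attributes("tabby cat")` collects every inherited attribute down to “flexible claws”, and each child’s set strictly contains its parent’s, matching the text above.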
In practice, however, given the huge number of categories and the difficulty of defining massive numbers of attributes, it is infeasible to define the attributes of each category manually. How, then, can a similar classification scheme be achieved without additional annotation?
In fact, the reasoning above yields two important insights: first, when attributes are sufficiently numerous and of sufficiently high quality, they can accurately distinguish different categories; second, each category must have strictly more attributes than its parent. Regarding the first insight’s demands on attribute quantity and quality, recent studies [1,2,3,4] show that deep learning models trained on image classification tasks spontaneously learn some semantic attributes, so attributes no longer need to be defined manually and can instead be obtained automatically by the learning algorithm. Regarding the second insight’s constraint between categories, the relationship between categories and attributes can be formalized to guide the attribute-learning process so that the learned attributes satisfy the constraint. This not only avoids the difficulty of defining and labeling attributes, but also retains the advantages of attribute-based classification: high interpretability and easy introduction of human prior knowledge.
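The second insight can be expressed as a simple check on binary attribute vectors: a child category must possess every attribute of its parent plus at least one more. This is one illustrative formalization, not the paper’s exact constraint (which is imposed softly during training).

```python
import numpy as np

def satisfies_hierarchy(child_vec, parent_vec):
    """Check the second insight on attribute vectors (>0 = attribute present):
    the child must inherit every parent attribute and have at least one more.
    Illustrative formalization; the paper encodes the constraint differently."""
    child_on = child_vec > 0
    parent_on = parent_vec > 0
    inherits_all = np.all(parent_on <= child_on)      # parent attrs subset of child attrs
    strictly_more = child_on.sum() > parent_on.sum()  # child has extra attrs
    return bool(inherits_all and strictly_more)
```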
Figure 2. Schematic diagram of method framework
Specifically, the authors design a model with two branches, as shown in Figure 2. The upper branch takes an image as input, and its main function is to learn attributes; the lower branch takes the category hierarchy as input, and its main function is to impose constraints on the attribute-learning process.
The output of the upper branch is a 1 × D “attribute vector”. Each dimension of the vector represents one attribute, and its value indicates whether the image sample possesses that attribute (0 means the sample lacks the attribute; a value greater than 0 means it has the attribute). Moreover, when the activation is greater than 0, its magnitude represents the strength of that attribute in the image sample.
During training, the loss function requires the outputs of both branches to correctly predict, from the D-dimensional features, both the most fine-grained category and the corresponding coarse-grained categories. In this way, the upper branch learns D attributes useful for the classification task, while the lower branch ensures that the D attributes satisfy the attribute-count constraints between categories, so as to give a human-understandable explanation of the model’s classification.
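The forward pass and joint loss just described can be sketched as follows. This is a hedged NumPy toy, not the paper’s architecture: the backbone is replaced by a given feature vector, the weights are random stand-ins for learned parameters, and a ReLU keeps the attribute vector non-negative as the text requires.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, N_FINE, N_COARSE = 16, 8, 5, 2  # illustrative sizes

# Random stand-ins for learned weights (the real model trains these).
W_attr = rng.normal(size=(F, D))
W_fine = rng.normal(size=(D, N_FINE))
W_coarse = rng.normal(size=(D, N_COARSE))

def forward(feat):
    """feat: (batch, F) backbone features. ReLU keeps the attribute vector
    non-negative: 0 = attribute absent, larger = stronger presence."""
    a = np.maximum(feat @ W_attr, 0.0)
    return a, a @ W_fine, a @ W_coarse  # attributes, fine logits, coarse logits

def cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def joint_loss(feat, y_fine, y_coarse):
    """Both heads must predict correctly: sum of two cross-entropies."""
    _, fine, coarse = forward(feat)
    return cross_entropy(fine, y_fine) + cross_entropy(coarse, y_coarse)
```

The key design point mirrored here is that both classification heads read off the *same* non-negative attribute vector, so the attributes are forced to support both granularities of prediction.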
The authors conducted experiments on CIFAR-100 and ILSVRC.
1. Classification accuracy
The experimental results show that, although the proposed method adds considerable design to improve interpretability and the convenience of introducing human priors, its classification accuracy still reaches the state-of-the-art level, indicating that the scheme has practical value in real business scenarios.
2. The effect of attribute learning
For qualitative results, the authors visualize the attributes learned by the model: each attribute is represented by the nine image patches in the dataset with the largest response to it, as shown in Figure 3. The figure shows that the model has learned many non-repetitive and meaningful attributes, including simple texture and shape attributes (dotted, round, etc.) as well as more semantic attributes such as wheels and mountains.
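The “nine patches with the largest response” visualization amounts to a top-k search over an attribute’s response maps. A minimal helper, assuming per-image response maps of shape (n_images, H, W) for one attribute channel:

```python
import numpy as np

def top_k_locations(responses, k=9):
    """responses: (n_images, H, W) response maps of one attribute across a
    dataset. Returns the k (image, row, col) positions with the largest
    activation -- the positions one would crop display patches from."""
    order = np.argsort(responses, axis=None)[::-1][:k]  # descending, flat indices
    return [np.unravel_index(i, responses.shape) for i in order]
```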
Figure 3. Attributes learned by the algorithm. (a) Attributes learned on the CIFAR-100 dataset; (b) attributes learned on the ILSVRC dataset.
For quantitative evaluation, on the 1000-category ILSVRC data the model learned more than 2600 attributes, far exceeding the 2000 attributes of the baseline (a standard ResNet-50 classification model); after removing repetitive attributes (which may include different appearances of the same attribute), the method retains close to 140 attributes, exceeding the baseline’s roughly 120 non-repetitive attributes.
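One simple way to “remove repetitive attributes”, offered here only as a hedged stand-in for whatever criterion the paper actually uses, is to treat two attribute channels as duplicates when their responses over the dataset are highly correlated, and greedily keep only one of each group:

```python
import numpy as np

def count_distinct_attributes(responses, thresh=0.9):
    """responses: (n_samples, D) activation of each attribute per sample.
    Greedily keep a channel only if its |correlation| with every
    already-kept channel stays below `thresh`. A simple stand-in for the
    paper's de-duplication step, not its exact criterion."""
    corr = np.corrcoef(responses.T)
    kept = []
    for i in range(responses.shape[1]):
        if all(abs(corr[i, j]) < thresh for j in kept):
            kept.append(i)
    return len(kept)
```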
Figure 4. Quantitative evaluation results of the number of attributes learned by the model
The visualization of attribute response regions (Figure 5) also shows that the attributes learned by the model are basically reliable: the most responsive region (shown in red) is indeed the region corresponding to the attribute.
Figure 5. Visualization of attribute response area
3. Rule learning results and introduction of artificial priors
In the experiments, the authors show the classification rules learned by the lower branch of the model, representing each category as “parent class + specific attribute combination”, as shown in Figure 6. The rules learned by the model include:
(1) “Clock” is a kind of round and radial “household electronic equipment”;
(2) “Cheetah” is a kind of “cat” with stripes and spots;
(3) “Football” is a kind of “ball” with black spots on a white background.
The explanation rules given by the model basically accord with human cognition, indicating that the model can learn classification rules similar in form to the taxonomists’ “parent class + specific attribute combination” and can give a human-understandable explanation of its classification.
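For linear classification heads over an attribute vector, one illustrative way to read off such a “parent class + specific attributes” rule is to pick the attributes whose weight for the fine class most exceeds the weight for its parent class. The attribute names and weight values below are hypothetical, chosen to echo the “clock” example above:

```python
import numpy as np

ATTR_NAMES = ["round", "radial", "striped", "spotted", "wheeled"]

def specific_attributes(w_fine, w_coarse, top=2):
    """Illustrative rule read-out from linear head weights: the attributes
    most specific to the fine class are those whose weight most exceeds
    the parent (coarse) class weight. Not the paper's exact procedure."""
    gap = np.asarray(w_fine) - np.asarray(w_coarse)
    idx = np.argsort(gap)[::-1][:top]
    return [ATTR_NAMES[i] for i in idx]
```

With hypothetical weights favoring “round” and “radial” for a clock relative to generic household devices, this would recover the rule “‘clock’ = household device + round + radial”.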
Figure 6. Interpretation rules learned by the model. (a) Rules learned on the CIFAR-100 dataset; (b) rules learned on the ILSVRC dataset.
By contrast, for existing methods to give interpretation results of the same form, the attribute representation of each category must be labeled manually, which is clearly unrealistic at large scale. The authors also show the corresponding comparison in the experiments (Table 1); the proposed method evidently applies to a much wider range of scenarios.
Table 1. Comparison with existing methods 
With interpretation rules that people can understand, the model can be customized: rules the model should not use can be removed, and rules the model has not learned can be added.
On the ILSVRC data, the authors remove erroneous rules learned for the two categories “ambulance” and “cheetah”; this improves the model’s recognition accuracy on these two categories without affecting the recognition of the other categories.
Across all categories of the same dataset, the authors additionally inject extra attributes, improving accuracy by about 2 percentage points.
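If the classification rules live in an attribute-to-class weight matrix, the customization described above reduces to editing individual entries. A sketch under that assumption (the paper’s actual editing procedure may differ):

```python
import numpy as np

def edit_rule(W, attr_idx, class_idx, value=0.0):
    """Sketch of manual rule editing on a (D, n_classes) attribute-to-class
    weight matrix: set an entry to 0 to remove a rule the model should not
    use, or to a positive value to inject a rule it has not learned.
    Returns an edited copy; the original weights are left untouched."""
    W = W.copy()
    W[attr_idx, class_idx] = value
    return W
```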
The above two experiments show that, although the proposed method is only a preliminary exploration of introducing human priors into deep models, it verifies the effectiveness of combining deep models with human prior knowledge and provides a basically feasible technical route.
Interpretable deep learning and the combination of deep models with human priors are frontier directions of current academic research, and are of great significance for improving the reliability and generalization ability of deep learning models. This paper takes a solid step in both directions. In interpretability, compared with existing methods it can give not only the key regions in an image but also rule-based explanations, which is more user-friendly and better matches people’s expectations of an explanation. In introducing human prior knowledge, it lays out a basically feasible path that we hope will inspire future researchers. The algorithm is available in the Huawei Cloud AI Gallery, where developers can learn more about Huawei Cloud’s algorithm capabilities and use the Huawei Cloud ModelArts platform for training and inference.
[1] C. Huang, C. C. Loy, and X. Tang, “Unsupervised learning of discriminative attributes and visual representations,” in Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5175–5184.
[2] V. Escorcia, J. C. Niebles, and B. Ghanem, “On the relationship between visual attributes and convolutional networks,” in Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1256–1264.
[3] S. Vittayakorn, T. Umeda, K. Murasaki, K. Sudo, T. Okatani, and K. Yamaguchi, “Automatic attribute discovery with neural activations,” in European Conference on Computer Vision (ECCV), 2016, pp. 252–268.
[4] S. J. Hwang and L. Sigal, “A unified semantic embedding: Relating taxonomies and attributes,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 271–279.
This article is shared from the Huawei Cloud community post “Interpretation of an IEEE TPAMI paper by Huawei Cloud: a regularized interpretable model helps knowledge + AI integration”; original author: hwcloudai.