Title: Neural Module Networks
Source: CVPR 2016, https://openaccess.thecvf.com/content_cvpr_2016/html/Andreas_Neural_Module_Networks_CVPR_2016_paper.html
1、 Problem addressed
Goal: improve the interpretability of VQA networks.
Classification of VQA models:
(1) Monolithic networks: traditional neural networks with a fixed architecture for the VQA task, typically built from CNNs and RNNs, e.g., a CNN + LSTM encoder followed by a fully connected classifier;
(2) Neural module networks (NMN): these methods treat a question as a composition of a small set of basic modules (such as find, relate, count), each of which can be fit by a sub-network. Answering different questions requires selecting different modules and assembling them into a larger network, so the network structure depends on the question and is dynamic. Compared with a monolithic network, such a dynamically assembled network is more intuitive and interpretable, and its intermediate computations are more transparent.
This paper presents neural module networks (NMN) for the first time. Unlike a traditional monolithic neural network, an NMN is a composition of multiple modular sub-networks: a network is customized for each question in the VQA dataset. In other words, the NMN architecture is generated dynamically from the linguistic structure of the question.
2、 Main ideas
2.1 Main steps:
Step1: Use a semantic parser to analyze each question and derive a module layout from the parse (the basic computation modules needed to answer the question and the relationships between them).
Step2: Assemble the modules into a task-specific network and answer the question. The modules themselves are designed by hand, and the information passed between modules may be raw image features, attentions, or classification decisions. All modules in an NMN are independent and composable, so the computation differs for each question instance, and some assembled structures may never have been observed during training.
In this figure, an attention over the dog is first generated (attend module), and its output is passed to a position classifier (classify module).
Step3: For the final answer, a recurrent network (LSTM) reads the question, and its output is combined with the output of the NMN to obtain the classification result.
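The three steps above can be illustrated with a deliberately tiny Python sketch. Everything here is a stand-in: the "parser" is a hard-coded rule (the paper uses the Stanford parser), the modules are plain functions rather than learned networks, and the layout/composition logic is fixed for one example question.

```python
def parse_to_layout(question: str) -> str:
    # Step 1: semantic parsing -> module layout. Toy hard-coded rule;
    # the real system derives the layout from a dependency parse.
    if question == "where is the dog?":
        return "classify[where](attend[dog])"
    raise ValueError("toy parser only knows one question")

def assemble(layout: str):
    # Step 2: instantiate the modules named in the layout and compose them.
    # Stand-in modules keyed by "type[instance]"; the composition is
    # hard-coded for this single two-module layout.
    modules = {
        "attend[dog]": lambda image: {"dog_found": "dog" in image},
        "classify[where]": lambda att: "on the sofa" if att["dog_found"] else "unknown",
    }
    return lambda image: modules["classify[where]"](modules["attend[dog]"](image))

# Step 3 in the paper additionally mixes in an LSTM question encoder,
# which is omitted in this sketch.
net = assemble(parse_to_layout("where is the dog?"))
print(net("a dog on a sofa"))  # -> "on the sofa"
```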
2.2 Problem definition:
Training data are triples (w, x, y):
w: a natural-language question
x: an image
y: an answer
The model is fully specified by a collection of modules m, each with associated parameters \(\theta_m\), and a network layout predictor P that maps a question string to a network. Given \(\left(w,x\right)\), the model instantiates a network from P(w), passes x (and possibly w) as input, and obtains a distribution over labels (for the VQA task, the output module is required to be a classifier). The model therefore encodes a predictive distribution \(p(y|w,x;\theta)\).
2.3 specific implementation:
Part1: module definition
The modules operate on three basic data types: images, unnormalized attentions, and labels.
TYPE[INSTANCE](ARG1, …)
Type: the high-level module type (such as attend, classify, etc.).
Instance: the specific instance of the model under consideration – for example, attend[red] locates red things and attend[dog] locates dogs. Weights can be shared at both the type level and the instance level.
Module types include:
Attend module attend[c]: convolves every position in the input image with a weight vector (distinct for each c) to produce a heatmap, i.e., an unnormalized attention.
Re-attend module re-attend[c]: a multilayer perceptron with rectified nonlinearities (ReLUs) that performs a fully connected mapping from one attention to another. For example, re-attend[above] should shift attention upward, while re-attend[not] should move attention away from the active regions.
Combine module combine[c]: merges two attentions into one. For example, combine[and] should be active only in regions where both inputs are active, while combine[except] should be active where the first input is active and the second is not.
Measure module measure[c]: takes an attention alone and maps it to a distribution over labels. Because the attentions passed between modules are unnormalized, measure is suited to evaluating whether a detected object exists or counting sets of objects.
Classify module classify[c]: takes an image and an attention as input and maps them to a distribution over labels.
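As a concrete (and deliberately simplified) illustration of the data types above, the following stand-ins replace each learned module with fixed arithmetic: images are grids of feature tuples, attentions are grids of scalars, and label distributions are dicts. The elementwise min for combine[and] and the row rotation for re-attend[above] are illustrative assumptions, not the paper's learned mappings; classify would follow the same pattern as measure but also consume the image.

```python
def attend(image, w):
    # attend[c]: dot each position's feature vector with a weight vector,
    # producing an unnormalized attention map (one scalar per position).
    return [[sum(f * wi for f, wi in zip(feat, w)) for feat in row] for row in image]

def re_attend(att, shift):
    # re-attend[c]: map one attention to another. Here a row rotation
    # stands in for re-attend[above]; the paper learns an MLP instead.
    return att[shift:] + att[:shift]

def combine_and(att1, att2):
    # combine[and]: active only where both inputs are active
    # (elementwise min as a stand-in for the learned combination).
    return [[min(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(att1, att2)]

def measure(att):
    # measure[c]: attention alone -> label distribution. Here a toy
    # yes/no existence decision from the map's peak, clamped to [0, 1].
    peak = max(v for row in att for v in row)
    p_yes = min(max(peak, 0.0), 1.0)
    return {"yes": p_yes, "no": 1.0 - p_yes}

# 2x2 "image" with 2-dim features; the weight vector picks out channel 0.
image = [[(1.0, 0.0), (0.0, 1.0)],
         [(0.0, 0.0), (1.0, 1.0)]]
att = attend(image, (1.0, 0.0))
print(measure(att))  # -> {'yes': 1.0, 'no': 0.0}
```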
Part2: from string to module architecture
Each question is analyzed with the Stanford parser to extract the grammatical relations between words in the sentence and produce an abstract sentence representation. In addition, basic lemmatization is applied (e.g., kites becomes kite and were becomes be), which reduces the sparsity of module instances.
For example: "what color is the truck" is converted to color(truck)
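The lemmatization step mentioned above can be mimicked with a small lookup table; the table entries here are the document's own examples plus a couple of assumed ones, not the paper's actual lexicon.

```python
# Toy lemmatizer: collapse inflected forms to a base form so that, e.g.,
# attend[kite] and attend[kites] become the same module instance.
LEMMA = {"kites": "kite", "were": "be", "is": "be", "are": "be", "trucks": "truck"}

def normalize(word: str) -> str:
    # Unknown words pass through unchanged.
    return LEMMA.get(word, word)

print(normalize("kites"), normalize("were"))  # -> kite be
```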
Depending on the task, this symbolic representation is then transformed into a module network structure:
Leaf nodes correspond to attend modules (producing attentions)
Internal nodes correspond (according to their arity) to re-attend modules or combine modules
The root node corresponds to a measure module (for yes/no questions) or a classify module (for all other questions)
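These layout rules can be sketched as a small recursive translation. The tuple-based tree input is an assumed format (not the paper's parser output), and yes/no detection is passed in as a flag rather than inferred from the question.

```python
def to_layout(node, is_root=True, yes_no=False):
    # Translate a symbolic parse tree into a module layout string:
    # leaves -> attend, unary internal nodes -> re-attend,
    # binary internal nodes -> combine, root -> measure (yes/no) or classify.
    if isinstance(node, str):                      # leaf
        return f"attend[{node}]"
    head, args = node[0], node[1:]
    inner = ", ".join(to_layout(a, is_root=False) for a in args)
    if is_root:
        root_type = "measure" if yes_no else "classify"
        return f"{root_type}[{head}]({inner})"
    if len(args) == 1:                             # unary internal node
        return f"re-attend[{head}]({inner})"
    return f"combine[{head}]({inner})"             # binary internal node

print(to_layout(("color", "truck")))
# -> classify[color](attend[truck])
print(to_layout(("is", ("and", "red", "circle")), yes_no=True))
# -> measure[is](combine[and](attend[red], attend[circle]))
```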
Each module instantiation is distinct:
For example, attend[cat] and attend[truck] have different parameters
Neural module network structure statistics:
NMN module visualization example:
NMN module generalization:
Besides deriving the module network from a sentence via the parser and layout rules, you can also directly provide SQL-like query strings that specify the requirement exactly:
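Supplying a layout directly just means skipping the parser and reading a structured string. The paper's exact query syntax isn't reproduced here; this sketch parses the TYPE[INSTANCE](ARG, …) notation introduced earlier into a nested tree, with a simplified grammar and no error recovery.

```python
import re

# Tokens are either a "type[instance]" atom or one of ( ) ,
TOKEN = re.compile(r"[\w-]+\[[\w-]+\]|[(),]")

def parse_layout(s):
    # Recursive-descent parse of "TYPE[INSTANCE](ARG, ...)" strings into
    # (head, children) tuples; children is empty for leaf modules.
    tokens = TOKEN.findall(s)
    pos = 0
    def node():
        nonlocal pos
        head = tokens[pos]; pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            while tokens[pos] != ")":
                children.append(node())
                if tokens[pos] == ",":
                    pos += 1
            pos += 1
        return (head, children)
    return node()

print(parse_layout("measure[is](combine[and](attend[red], attend[circle]))"))
```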
Prediction: combines the LSTM network and the NMN module network
The LSTM network serves two purposes: first, it models underlying syntactic regularities in the data; second, it captures semantic regularities.
For example, consider "what is flying" and "what are flying": both is and are are lemmatized to be, so both questions reduce to the same layout, what(fly); yet their answers should be kite and kites respectively. Reading the original question with the LSTM recovers this distinction.
Both the LSTM module and the NMN module output a distribution over the answer set. The model's final prediction is the geometric mean of the two probability distributions.
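The geometric-mean combination is simple to state concretely: take the square root of the product of the two probabilities for each answer and renormalize. The example distributions below are made up for illustration.

```python
import math

def geometric_mean(p_lstm, p_nmn):
    # p(y) proportional to sqrt(p_lstm(y) * p_nmn(y)), renormalized
    # over the shared answer set.
    raw = {y: math.sqrt(p_lstm[y] * p_nmn[y]) for y in p_lstm}
    z = sum(raw.values())
    return {y: v / z for y, v in raw.items()}

p_lstm = {"kite": 0.6, "ball": 0.4}   # hypothetical LSTM output
p_nmn = {"kite": 0.9, "ball": 0.1}    # hypothetical NMN output
final = geometric_mean(p_lstm, p_nmn)
print(max(final, key=final.get))  # -> kite
```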
Finally, the LSTM module and the NMN module are trained jointly.
Datasets: the VQA dataset and the SHAPES dataset
3、 Conclusions
1. Neural module networks (NMN) are proposed, providing a general framework for learning sets of neural modules that can be dynamically composed into networks of arbitrary depth.
2. NMN performs well at answering questions about objects and attributes.
3. The SHAPES dataset is introduced.