Paper reading: Compositional Attention Networks for Machine Reasoning


Title: Compositional Attention Networks for Machine Reasoning
Source: ICLR 2018
Reader's notes:

1. Questions raised

Current deep neural network models are very effective at learning a direct mapping between input and output. Their depth, scale, and statistical nature let them cope with noisy and diverse data, but these same properties limit their interpretability: they fail to exhibit a coherent, transparent reasoning process leading to their predictions.

Deep learning systems lack reasoning ability. In examples like the following, the problem must be solved step by step: traversing from one object to related objects, iteratively moving toward the final answer.


Building a coherent multi-step reasoning model is essential for such understanding tasks. The author reviews earlier approaches that combine symbolic structures with neural modules, such as Neural Module Networks. These have notable problems: they depend on structured representations and functional programs supplied externally, and they require relatively complex multi-stage reinforcement-learning training schemes. The rigidity of their model structures and their use of task-specific operation modules weaken their robustness and generalization ability.

To balance the universality and robustness of end-to-end neural networks with the need for more explicit, structured reasoning, the author proposes the MAC network, an end-to-end differentiable architecture that performs a sequence of explicit reasoning steps.

2. Main ideas

Given a knowledge base K (for VQA, an image) and a task description Q (for VQA, a question), the MAC network decomposes the problem into a series of reasoning steps, each performed by one MAC unit.

Its composition mainly includes three parts:

  • Input unit
  • Stacked MAC units (perform the reasoning steps)
  • Output unit


Part 1: Input unit

Processes the input image and question:

Image: use a pretrained ResNet to extract features, take the intermediate conv4 features, and apply a further CNN to obtain a feature for each image region; these combine into the knowledge base:

\[K^{H\times W\times d}=\{k_{h,w}^{d}\}_{h,w=1,1}^{H,W},\quad H=W=14\]

Text: convert the question string into a sequence of word embeddings and extract features with a d-dimensional BiLSTM:

A series of contextual hidden states: \(cw_1,\dots,cw_s\)

Question feature representation: the concatenation of the final hidden states \(\overleftarrow{cw_1},\overrightarrow{cw_s}\) gives \(q\); a step-specific linear transformation is applied before feeding MAC unit \(i\): \(q_i={W_i}^{d\times2d}q+b_i^d\)
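As a minimal NumPy sketch of this projection (the dimensions and random values below are illustrative, not from the paper), the per-step question vector is formed by concatenating the two final BiLSTM states and applying a step-specific linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                             # hypothetical hidden size

# final hidden states of the backward / forward LSTM passes (assumed given)
cw1_back = rng.standard_normal(d)
cws_fwd = rng.standard_normal(d)

q = np.concatenate([cw1_back, cws_fwd])   # question vector, shape (2d,)

# step-specific projection q_i = W_i q + b_i, with W_i of shape (d, 2d)
W_i = rng.standard_normal((d, 2 * d)) * 0.1
b_i = np.zeros(d)
q_i = W_i @ q + b_i                       # shape (d,)
```

The projection maps the 2d-dimensional question summary back to the d-dimensional space the MAC unit operates in, with a different \(W_i\) per reasoning step.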

Part 2: MAC unit

The MAC unit (Memory, Attention, Composition) is a recurrent cell, designed in the spirit of GRU or LSTM cells.

Design concept

The internal design of the MAC network borrows from computer architecture: it separates control from memory and operates by serially executing a series of instructions:

Step 1: the controller fetches an instruction and decodes it;

Step 2: information is read from memory according to the instruction;

Step 3: the instruction is executed, the result is selectively written back to memory, and the processed information feeds into the next cycle.


Accordingly, the MAC unit explicitly separates memory from control, internally maintaining two hidden states of dimension \(d\): a control state \(c_i\) and a memory state \(m_i\). It is composed of three operational units working in series to perform one reasoning step:

  • Control unit: at each step, selectively attends to parts of the question word sequence to compute the reasoning operation (attention yields a probability distribution over the words, indicating how much each word matters at this step), and updates the control state to represent the reasoning operation the unit should perform.
  • Read unit: guided by the control state, extracts relevant information from the knowledge base (selectively attending to regions of the image; the extracted information is again represented via an attention distribution).
  • Write unit: integrates the newly extracted information with the previous memory state, storing the intermediate result and updating the memory state, which then represents the result of the current reasoning step.


Initialization: \(c_0\) and \(m_0\) are learned parameters.

Control unit


Input: question word sequence \(cw_1,\dots,cw_s\), question feature \(q_i\), previous control state \(c_{i-1}\)

Step 1: concatenate the question feature \(q_i\) with the previous control state \(c_{i-1}\) and apply a linear transformation to obtain \(cq_i\) (injecting the question context relevant to this step);

Step 2: generate the reasoning operation \(c_i\) via attention: first compute the similarity between \(cq_i\) and each question-word feature; then obtain an attention distribution over the question words through a linear transformation and softmax; finally, take the attention-weighted sum of the word features to produce the new reasoning operation \(c_i\)

Note: this attention can be visualized to interpret the content of the control state, improving the model's transparency.
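The two control-unit steps can be sketched in NumPy as follows (dimensions, weights, and the exact similarity form are illustrative assumptions, not the paper's trained parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, s = 4, 5                       # hypothetical hidden size, question length

cw = rng.standard_normal((s, d))  # contextual word features cw_1..cw_s
q_i = rng.standard_normal(d)      # step-specific question feature
c_prev = rng.standard_normal(d)   # previous control state c_{i-1}

# Step 1: combine q_i with the previous control state via a linear layer
W1 = rng.standard_normal((d, 2 * d)) * 0.1
cq = W1 @ np.concatenate([q_i, c_prev])

# Step 2: attention over the question words, then weighted sum
w_a = rng.standard_normal(d) * 0.1   # maps elementwise similarities to scalars
logits = (cw * cq) @ w_a             # similarity of cq with each word
cv = softmax(logits)                 # attention distribution over the s words
c_i = cv @ cw                        # new control state, shape (d,)
```

`cv` is the distribution mentioned above: it is what gets visualized to show which question words the step attends to.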

Read unit


Input: knowledge base \(k_{h,w}\), previous memory state \(m_{i-1}\), current control state \(c_i\)

Step 1: extract the intermediate information the model obtained in the previous reasoning step by linearly transforming the knowledge-base elements and the previous memory state and multiplying them elementwise, giving \(I_{i,h,w}\);

Step 2: concatenate the knowledge-base elements with the intermediate results. Since some reasoning processes must combine independent facts to reach the answer, this lets the model also consider new information not directly related to the previous intermediate results;

Step 3: compute the similarity between the control state \(c_i\) and the intermediate information \(I'_{i,h,w}\), produce an attention distribution over the knowledge-base elements via softmax, and finally obtain the read unit's retrieved information \(r_i\) by weighted summation.
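A NumPy sketch of the three read-unit steps (again with made-up dimensions and random weights, and with the knowledge base flattened to one row per image region for simplicity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, H, W = 4, 3, 3                    # hypothetical feature size and grid
K = rng.standard_normal((H * W, d))  # knowledge base, one row per region
m_prev = rng.standard_normal(d)      # previous memory state m_{i-1}
c_i = rng.standard_normal(d)         # current control state

# Step 1: interact each region with the previous memory (elementwise product)
I = K * m_prev                       # shape (H*W, d)

# Step 2: concatenate regions with the interaction and project back to d dims
W2 = rng.standard_normal((d, 2 * d)) * 0.1
I2 = np.concatenate([I, K], axis=1) @ W2.T   # shape (H*W, d)

# Step 3: attend over regions under the guidance of c_i, then weighted sum
logits = (I2 * c_i).sum(axis=1)
rv = softmax(logits)                 # attention over the H*W regions
r_i = rv @ K                         # retrieved information, shape (d,)
```

`rv` is the visual attention map discussed next: it shows which image regions the current reasoning step reads from.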

Visual attention:


Example: for the question "what color is the matte thing to the right of the sphere in front of the tiny blue block", the model first finds the blue block and updates \(m_1\); then the control unit attends to "the sphere in front of", finds that sphere and updates \(m_2\); finally it attends to "the matte thing to the right of" and finds the answer: the purple cylinder.

Write unit:


Input: previous memory state \(m_{i-1}\), the read unit's retrieved information \(r_i\), current control state \(c_i\)

Its main role is to integrate the memory state accumulated by previous reasoning with the information retrieved in this step, guided by the reasoning instruction.

Step 1: concatenate \(r_i\) and \(m_{i-1}\) and apply a linear transformation to obtain the candidate memory state \(m_i^{info}\)

Optional actions:

Step 2: self-attention: to support non-sequential reasoning, the unit may update by integrating all previous memory states. Compute the similarity between the current instruction \(c_i\) and the previous instructions \(c_1,\dots,c_{i-1}\) and produce an attention distribution \(sa_{ij}\). Using this distribution, take the weighted sum of the earlier memory states and combine it with \(m_i^{info}\) to obtain the updated memory state \(m_i'\)

Step 3: memory gating: lets the model dynamically adjust the effective length of the reasoning process for a given question by optionally updating the memory state \(m_i\) based on the instruction.
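The write unit, including both optional steps, can be sketched like this (dimensions, weights, and the scalar sigmoid gate form are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d, i = 4, 3                               # hidden size; current step index
r_i = rng.standard_normal(d)              # retrieved info from the read unit
m_prev = rng.standard_normal(d)           # memory state m_{i-1}
c_i = rng.standard_normal(d)              # current control state
c_hist = rng.standard_normal((i - 1, d))  # previous controls c_1..c_{i-1}
m_hist = rng.standard_normal((i - 1, d))  # previous memories m_1..m_{i-1}

# Step 1: candidate memory from the new info and the previous memory
W1 = rng.standard_normal((d, 2 * d)) * 0.1
m_info = W1 @ np.concatenate([r_i, m_prev])

# Step 2 (optional): self-attention over earlier steps via control similarity
sa = softmax(c_hist @ c_i)                # distribution over earlier steps
m_sa = sa @ m_hist                        # attention-weighted earlier memories
W2 = rng.standard_normal((d, 2 * d)) * 0.1
m_cand = W2 @ np.concatenate([m_sa, m_info])

# Step 3 (optional): a control-conditioned gate decides how much of m_{i-1}
# to keep, letting the model effectively skip a reasoning step
w_g = rng.standard_normal(d) * 0.1
g = sigmoid(w_g @ c_i)                    # scalar gate in (0, 1)
m_i = g * m_prev + (1 - g) * m_cand
```

When `g` is close to 1 the previous memory passes through almost unchanged, which is how the gate shortens the effective reasoning chain for easy questions.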

Part 3: Output unit

Based on the question representation \(q\) and the final memory state \(m_p\), the final answer prediction is obtained with a two-layer fully connected softmax classifier.
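A minimal sketch of that classifier (the answer-vocabulary size, ReLU nonlinearity, and weights are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d, n_answers = 4, 6                 # hypothetical dims and answer count
q = rng.standard_normal(2 * d)      # question representation
m_p = rng.standard_normal(d)        # final memory state

# two-layer classifier over the concatenated [q; m_p]
W1 = rng.standard_normal((d, 3 * d)) * 0.1
W2 = rng.standard_normal((n_answers, d)) * 0.1
h = np.maximum(0, W1 @ np.concatenate([q, m_p]))   # hidden layer with ReLU
probs = softmax(W2 @ h)             # distribution over candidate answers
answer = int(np.argmax(probs))      # predicted answer index
```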


3. Experiments

Experiments: CLEVR dataset







4. Summary

1. Maintaining a strict separation between the representation spaces of the question and the image (they interact only through interpretable, discrete attention distributions) greatly enhances the network's generalization and improves its transparency.

2. Unlike module networks, MAC is an end-to-end, fully differentiable network that needs no additional supervision: reasoning is performed purely through a stacked sequence of MAC units, with no reliance on syntax trees or hand-designed module sets. Moreover, compared with other deep neural network methods, MAC generalizes better, is more computationally efficient, and performs relational reasoning more transparently.