DARTS: a classic gradient-descent-based network search method that opened up end-to-end network search | ICLR 2019

Time: 2021-06-11

DARTS is a classic NAS method. It breaks away from the earlier discrete network search paradigm and enables end-to-end network search. Because DARTS updates the architecture with gradients, the update direction is more accurate and the search time is greatly reduced compared with previous methods: searching on CIFAR-10 takes only about 4 GPU-days.

Source: Xiaofei's Algorithm Engineering Notes (WeChat official account)

Paper: DARTS: Differentiable Architecture Search


  • Address: https://arxiv.org/abs/1806.09055
  • Paper code: https://github.com/quark0/darts

Introduction


Most currently popular neural architecture search methods select among discrete candidate networks, whereas DARTS searches a continuous search space and uses gradient descent to optimize the architecture according to validation-set performance. Its main contributions are:

  • Based on bilevel optimization, an innovative gradient-based neural architecture search method, DARTS, is proposed, applicable to both convolutional and recurrent structures.
  • Experiments show that this gradient-based architecture search method is highly competitive on the CIFAR-10 and PTB datasets.
  • Search efficiency is very high, requiring only a small number of GPU-days, mainly thanks to the gradient-based optimization.
  • The cells that DARTS learns on CIFAR-10 and PTB can be transferred to ImageNet and WikiText-2.

Differentiable Architecture Search


Search Space

The overall search framework of DARTS is the same as NASNet's: it searches for a cell as the basic building block and then stacks cells into a convolutional or recurrent network. A cell is a directed acyclic graph containing an ordered sequence of $N$ nodes. Each node $x^{(i)}$ represents intermediate information in the network (e.g., a feature map in a convolutional network), and each edge $(i,j)$ represents an operation $o^{(i,j)}$ applied to $x^{(i)}$. Every cell has two inputs and one output: for a convolutional cell the inputs are the outputs of the previous two cells, while for a recurrent cell the inputs are the input of the current step and the state of the previous step; in both cases the output is the concatenation of all intermediate node outputs. Each intermediate node is computed from all of its predecessors.
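Concretely, this node computation is Formula 1 in the paper:

$$x^{(j)}=\sum_{i<j}o^{(i,j)}\big(x^{(i)}\big)$$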


A special zero operation is also included to indicate that there is no connection between two nodes. DARTS thus turns learning the cell into learning the operation on each edge. Since the overall search framework is the same as in NASNet and related methods, the rest of this article focuses on how DARTS performs gradient-based search.

Continuous Relaxation and Optimization


Let $\mathcal{O}$ be the set of candidate operations, where each operation is a function $o(\cdot)$ applied to $x^{(i)}$. To make the search space continuous, the original discrete choice of a single operation is relaxed into a softmax-weighted sum of all operations.
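In the paper's notation, this mixed operation on edge $(i,j)$ is Formula 2:

$$\bar{o}^{(i,j)}(x)=\sum_{o\in\mathcal{O}}\frac{\exp\big(\alpha^{(i,j)}_{o}\big)}{\sum_{o^{\prime}\in\mathcal{O}}\exp\big(\alpha^{(i,j)}_{o^{\prime}}\big)}\,o(x)$$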


The mixing weights for the operations on edge $(i,j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$, so the whole architecture search reduces to learning the continuous variables $\alpha=\{\alpha^{(i,j)}\}$, as shown in Figure 1. At the end of the search, each edge replaces the mixed operation $\bar{o}^{(i,j)}$ with the most likely operation $o^{(i,j)}=\arg\max_{o\in\mathcal{O}}\alpha^{(i,j)}_{o}$ to construct the final network.

After this relaxation, DARTS aims to learn the architecture $\alpha$ and the weights $w$ of all operations jointly. Compared with previous methods, DARTS can optimize the architecture by gradient descent on the validation loss. Let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the training and validation losses; both are determined by the architecture $\alpha$ and the network weights $w$. The goal of the search is to find $\alpha^{*}$ that minimizes the validation loss $\mathcal{L}_{val}(w^{*},\alpha^{*})$, where the weights $w^{*}$ are obtained by minimizing the training loss, $w^{*}=\arg\min_{w}\mathcal{L}_{train}(w,\alpha^{*})$. This makes DARTS a bilevel optimization problem in which the validation set optimizes the architecture and the training set optimizes the weights, with $\alpha$ as the upper-level variable and $w$ as the lower-level variable.
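Formally, this is the bilevel problem of Formulas 3 and 4 in the paper:

$$\min_{\alpha}\ \mathcal{L}_{val}\big(w^{*}(\alpha),\alpha\big)\qquad\text{s.t.}\qquad w^{*}(\alpha)=\arg\min_{w}\ \mathcal{L}_{train}(w,\alpha)$$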


Approximate Architecture Gradient

Computing the architecture gradient exactly via Formula 3 is very expensive, mainly because of the inner optimization in Formula 4: every time the architecture changes, the network weights would have to be retrained to optimality. To avoid this, a simple approximation is proposed.
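The approximation (Formulas 5 and 6 in the paper) replaces $w^{*}(\alpha)$ with the weights obtained after a single training step:

$$\nabla_{\alpha}\mathcal{L}_{val}\big(w^{*}(\alpha),\alpha\big)\ \approx\ \nabla_{\alpha}\mathcal{L}_{val}\big(w-\xi\nabla_{w}\mathcal{L}_{train}(w,\alpha),\ \alpha\big)$$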


Here $w$ denotes the current network weights and $\xi$ is the learning rate of a single inner-optimization update. The overall idea is that, after the architecture changes, a single training step is used to move $w$ toward $w^{*}(\alpha)$, instead of fully solving the inner optimization by training to convergence. When the current $w$ is already a local optimum of the inner optimization ($\nabla_{w}\mathcal{L}_{train}(w,\alpha)=0$), Formula 6 is equivalent to Formula 5 and reduces to $\nabla_{\alpha}\mathcal{L}_{val}(w,\alpha)$.


The iterative procedure, shown in Algorithm 1 of the paper, alternates between updating the architecture and updating the network weights, with each update using only a small batch of data. Applying the chain rule, Formula 6 expands as follows.
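This expansion is Formula 7 in the paper:

$$\nabla_{\alpha}\mathcal{L}_{val}(w^{\prime},\alpha)\ -\ \xi\,\nabla^{2}_{\alpha,w}\mathcal{L}_{train}(w,\alpha)\,\nabla_{w^{\prime}}\mathcal{L}_{val}(w^{\prime},\alpha)$$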


Here $w^{\prime}=w-\xi\nabla_{w}\mathcal{L}_{train}(w,\alpha)$. The second term above is very expensive to compute, so the paper approximates it with a finite difference, which is a key step of the method. Let $\epsilon$ be a small scalar and $w^{\pm}=w\pm\epsilon\nabla_{w^{\prime}}\mathcal{L}_{val}(w^{\prime},\alpha)$.
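This gives the finite-difference approximation (Formula 8 in the paper):

$$\nabla^{2}_{\alpha,w}\mathcal{L}_{train}(w,\alpha)\,\nabla_{w^{\prime}}\mathcal{L}_{val}(w^{\prime},\alpha)\ \approx\ \frac{\nabla_{\alpha}\mathcal{L}_{train}(w^{+},\alpha)-\nabla_{\alpha}\mathcal{L}_{train}(w^{-},\alpha)}{2\epsilon}$$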


Only two extra forward and backward passes are needed to evaluate this finite difference, so the computational complexity is reduced from $O(|\alpha|\,|w|)$ to $O(|\alpha|+|w|)$.

  • First-order Approximation

    When $\xi=0$, the second-order term in Formula 7 disappears and the gradient is determined solely by $\nabla_{\alpha}\mathcal{L}_{val}(w,\alpha)$; that is, the current weights are treated as already optimal, and the validation loss is optimized directly by modifying the architecture. Setting $\xi=0$ speeds up the search but may also lead to worse performance. The case $\xi=0$ is called the first-order approximation and $\xi>0$ the second-order approximation; a minimal sketch of the resulting alternating update is shown below.
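To make the alternating update of Algorithm 1 concrete, here is a minimal, runnable sketch of the first-order variant ($\xi=0$) in PyTorch. The toy `MixedOp`, the stand-in operations, the optimizers, and the random data are illustrative assumptions, not the official implementation (see the repository linked above for that).

```python
# Minimal first-order DARTS-style update loop (a sketch, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted mixture of candidate operations (Formula 2)."""
    def __init__(self, dim):
        super().__init__()
        # Toy stand-ins for the real candidate ops (conv, pooling, identity, ...).
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Identity(),
        ])
        # Architecture parameters alpha for this single edge.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

model = MixedOp(dim=16)
head = nn.Linear(16, 1)
weight_params = list(model.ops.parameters()) + list(head.parameters())  # w
arch_params = [model.alpha]                                             # alpha

w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
a_opt = torch.optim.Adam(arch_params, lr=3e-4)

def loss_on(x, y):
    return F.mse_loss(head(model(x)), y)

for step in range(100):
    x_tr, y_tr = torch.randn(32, 16), torch.randn(32, 1)  # stand-in training batch
    x_va, y_va = torch.randn(32, 16), torch.randn(32, 1)  # stand-in validation batch

    # 1) Update alpha on the validation loss. First-order: xi = 0, so the
    #    gradient is taken at the current weights w.
    a_opt.zero_grad()
    loss_on(x_va, y_va).backward()
    a_opt.step()

    # 2) Update the weights w on the training loss with alpha fixed.
    w_opt.zero_grad()
    loss_on(x_tr, y_tr).backward()
    w_opt.step()
```

The second-order variant ($\xi>0$) would instead evaluate the validation gradient at the virtually updated weights $w-\xi\nabla_{w}\mathcal{L}_{train}(w,\alpha)$ and correct $\nabla_{\alpha}$ using the finite difference of Formula 8.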

Deriving Discrete Architectures

When constructing the final architecture, each node keeps the top-$k$ strongest non-zero operations coming from distinct predecessor nodes, where the strength of an operation is $\frac{\exp(\alpha^{(i,j)}_{o})}{\sum_{o^{\prime}\in\mathcal{O}}\exp(\alpha^{(i,j)}_{o^{\prime}})}$. To make the searched network perform well, $k=2$ is used for convolutional cells and $k=1$ for recurrent cells. Excluding the zero operation mainly ensures that each node has enough inputs, allowing a fair comparison with current SOTA models.
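As an illustration of this discretization step, here is a small sketch; the `OPS` list, the `alphas` dictionary of per-edge logits, and the helper names are assumptions for the example, not the paper's code.

```python
# Sketch of deriving a discrete cell from learned alphas (illustrative only).
# Assumption: alphas[(i, j)] holds the logits over OPS for edge (i, j),
# and OPS[0] is the "zero" (no connection) operation, which is excluded.
import math

OPS = ["zero", "skip_connect", "sep_conv_3x3", "max_pool_3x3"]  # toy candidate set

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def derive_node(node_id, alphas, k=2):
    """Keep the top-k strongest non-zero ops, each from a distinct predecessor."""
    candidates = []
    for i in range(node_id):  # all predecessor nodes i < node_id
        probs = softmax(alphas[(i, node_id)])
        best = max((p, idx) for idx, p in enumerate(probs) if OPS[idx] != "zero")
        candidates.append((best[0], i, OPS[best[1]]))
    # Pick the k incoming edges whose best op has the highest probability.
    return [(i, op) for _, i, op in sorted(candidates, reverse=True)[:k]]

# Toy usage: intermediate node 2 with predecessors 0 and 1.
alphas = {(0, 2): [0.1, 1.2, 0.3, 0.5], (1, 2): [0.0, 0.2, 2.0, 0.1]}
print(derive_node(2, alphas, k=2))  # [(1, 'sep_conv_3x3'), (0, 'skip_connect')]
```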

Experiments and Results


Search-time comparison, where the reported run is the best result of multiple searches.


The cell structures found by the search.


Performance comparison on CIFAR-10.


Performance comparison on PTB.


Performance comparison when transferred to ImageNet.

Conclusion


DARTS is a classic NAS method. It breaks away from the earlier discrete network search paradigm and enables end-to-end network search. Because DARTS updates the architecture with gradients, the update direction is more accurate and the search time is greatly reduced compared with previous methods: searching on CIFAR-10 takes only about 4 GPU-days.



If this article is helpful to you, please give it a like or a look.
For more content, please follow the WeChat official account.
