DARTS is a classic NAS method that breaks away from the earlier discrete search paradigm and enables end-to-end architecture search. Because DARTS updates the architecture with gradients, the update direction is more accurate, and the search time is greatly reduced compared with earlier methods: CIFAR-10 requires only 4 GPU-days.
Source: Xiaofei's Algorithm Engineering Notes (WeChat official account)
Paper: DARTS: Differentiable Architecture Search
 Address: https://arxiv.org/abs/1806.09055
 Paper code: https://github.com/quark0/darts
Introduction
At present, most popular neural architecture search methods select among discrete candidate networks, whereas DARTS relaxes the search space to be continuous and optimizes the architecture by gradient descent according to its performance on the validation set.
 Based on bilevel optimization, an innovative gradient-based neural architecture search method, DARTS, is proposed; it applies to both convolutional and recurrent architectures.
 Experiments show that this gradient-based architecture search is highly competitive on the CIFAR-10 and PTB datasets.
 Search is very efficient, requiring only a few GPU-days, mainly thanks to the gradient-based optimization.
 Architectures learned by DARTS on CIFAR-10 and PTB transfer well to ImageNet and WikiText-2.
Differentiable Architecture Search
Search Space
The overall search framework of DARTS is the same as NASNet's: it searches for a cell as the basic building block of the network, then stacks cells into a convolutional or recurrent network. A cell is a directed acyclic graph containing an ordered sequence of $n$ nodes. Each node $x^{(i)}$ represents intermediate information in the network (e.g. a feature map in a convolutional network), and each directed edge $(i, j)$ is associated with an operation $o^{(i,j)}$ applied to $x^{(i)}$. Each cell has two inputs and one output: for convolutional cells, the inputs are the outputs of the previous two cells; for recurrent cells, the inputs are the input of the current step and the state of the previous step. In both cases, the cell output merges the outputs of all intermediate nodes. Each intermediate node is computed from all of its predecessors: $x^{(j)}=\sum_{i<j}o^{(i,j)}(x^{(i)})$.
A special zero operation is also included to indicate the absence of a connection between two nodes. DARTS thus turns learning the cell into learning the operations on its edges. Since the overall search framework is the same as NASNet and similar methods, the paper focuses on how DARTS performs gradient-based search.
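As a toy illustration of the cell-as-DAG computation described above, a minimal sketch might look like the following (the scalar lambda operations and the 2-input/2-intermediate-node sizing are invented stand-ins for the real convolution/pooling candidates):

```python
import numpy as np

# Made-up edge operations; node j sums o^(i,j)(x_i) over all predecessors i < j.
ops = {
    (0, 2): lambda x: 2 * x,   # edge 0 -> 2
    (1, 2): lambda x: x + 1,   # edge 1 -> 2
    (0, 3): lambda x: 0 * x,   # the special "zero" op: no connection 0 -> 3
    (1, 3): lambda x: -x,
    (2, 3): lambda x: x,
}

def cell_forward(inputs):
    """inputs: the two cell inputs; output merges the intermediate nodes."""
    nodes = list(inputs)                  # nodes 0 and 1 are the inputs
    for j in range(2, 4):                 # intermediate nodes 2 and 3
        nodes.append(sum(ops[(i, j)](nodes[i]) for i in range(j)))
    return np.concatenate(nodes[2:])      # cell output: concat of intermediates

out = cell_forward([np.array([1.0]), np.array([2.0])])
```

The zero op on edge (0, 3) contributes nothing, which is exactly how "no connection" is expressed inside a fixed DAG topology.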
Continuous Relaxation and Optimization
Let $\mathcal{O}$ be the set of candidate operations, where each operation $o(\cdot)$ is a function applied to $x^{(i)}$. To make the search space continuous, the discrete choice of one operation is relaxed into a softmax-weighted mixture of all operations:
$$\bar{o}^{(i,j)}(x)=\sum_{o\in\mathcal{O}}\frac{\exp(\alpha^{(i,j)}_{o})}{\sum_{o'\in\mathcal{O}}\exp(\alpha^{(i,j)}_{o'})}\,o(x)$$
The mixing weights for the operations on edge $(i, j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$, so the whole architecture search reduces to learning the continuous variables $\alpha=\{\alpha^{(i,j)}\}$, as shown in Figure 1. At the end of the search, each edge keeps its most likely operation, $o^{(i,j)}=\operatorname{argmax}_{o\in\mathcal{O}}\alpha^{(i,j)}_{o}$, in place of the mixture $\bar{o}^{(i,j)}$, to construct the final network.
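A minimal sketch of this continuous relaxation for a single edge, using made-up toy operations in place of the real convolution/pooling candidates:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical candidate set for one edge: identity, doubling, and "zero".
ops = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]

def mixed_op(x, alpha):
    """Softmax-weighted sum of all candidate ops (the relaxed edge)."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

alpha = np.array([0.1, 2.0, -1.0])   # architecture parameters for this edge
x = np.array([1.0, -2.0])
y = mixed_op(x, alpha)               # continuous mixture during search

# At the end of search, keep only the strongest operation on the edge.
best = int(np.argmax(alpha))
```

Because the mixture is differentiable in `alpha`, the architecture parameters can be trained by ordinary backpropagation, which is the core of DARTS.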
After this relaxation, DARTS aims to learn the architecture and all operation weights jointly. Unlike earlier methods, DARTS can optimize the architecture by gradient descent on the validation loss. Let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the training and validation losses; both are determined by the architecture $\alpha$ and the network weights $w$. The goal of the search is to find $\alpha^{*}$ that minimizes the validation loss $\mathcal{L}_{val}(w^{*},\alpha^{*})$, where the weights $w^{*}$ are obtained by minimizing the training loss, $w^{*}=\operatorname{argmin}_{w}\mathcal{L}_{train}(w,\alpha^{*})$. DARTS is therefore a bilevel optimization problem, in which the validation set optimizes the architecture and the training set optimizes the weights, with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
$$\min_{\alpha}\ \mathcal{L}_{val}(w^{*}(\alpha),\alpha)\quad\text{s.t.}\quad w^{*}(\alpha)=\operatorname{argmin}_{w}\ \mathcal{L}_{train}(w,\alpha)$$
Approximate Architecture Gradient
Computing the architecture gradient via formula 3 is very expensive, mainly because of the inner optimization in formula 4: every time the architecture is modified, the network weights would have to be retrained to optimality. To avoid this, a simple approximation is proposed:
$$\nabla_{\alpha}\mathcal{L}_{val}(w^{*}(\alpha),\alpha)\approx\nabla_{\alpha}\mathcal{L}_{val}(w-\xi\nabla_{w}\mathcal{L}_{train}(w,\alpha),\alpha)$$
Here $w$ denotes the current network weights and $\xi$ is the learning rate of a single step of the inner optimization. The idea is that, after the architecture changes, a single training step is used to move $w$ toward $w^{*}(\alpha)$, rather than the full training to convergence required by formula 3. When $w$ is already a local optimum of the inner optimization ($\nabla_{w}\mathcal{L}_{train}(w,\alpha)=0$), formula 6 is equivalent to formula 5, i.e. $\nabla_{\alpha}\mathcal{L}_{val}(w,\alpha)$.
The iterative procedure, shown in Algorithm 1, alternates between updating the architecture and updating the network weights, each update using only a small amount of data. By the chain rule, formula 6 expands to:
$$\nabla_{\alpha}\mathcal{L}_{val}(w',\alpha)-\xi\nabla^{2}_{\alpha,w}\mathcal{L}_{train}(w,\alpha)\,\nabla_{w'}\mathcal{L}_{val}(w',\alpha)$$
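The alternating scheme of Algorithm 1 can be illustrated on a toy bilevel problem (the quadratic losses below are invented for illustration, and the first-order variant, $\xi=0$, is used for brevity):

```python
# Toy bilevel problem (invented for illustration):
#   L_train(w, alpha) = (w - alpha)^2      -- inner loss, optimized over w
#   L_val(w, alpha)   = (w + alpha - 3)^2  -- outer loss, optimized over alpha

def grad_w_train(w, alpha):
    return 2 * (w - alpha)          # dL_train/dw

def grad_alpha_val(w, alpha):
    return 2 * (w + alpha - 3)      # dL_val/dalpha (first-order approximation)

w, alpha, eta = 0.0, 0.0, 0.1
for _ in range(200):
    alpha -= eta * grad_alpha_val(w, alpha)  # 1) architecture step on val loss
    w -= eta * grad_w_train(w, alpha)        # 2) weight step on train loss
# The alternation converges to the fixed point w = alpha = 1.5
```

Neither variable is ever trained to convergence before the other moves; interleaving cheap single steps is what makes the search tractable.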
where $w'=w-\xi\nabla_{w}\mathcal{L}_{train}(w,\alpha)$. The second term of the above formula contains an expensive matrix-vector product, which the paper approximates with the finite-difference method, a key step of this work:
$$\nabla^{2}_{\alpha,w}\mathcal{L}_{train}(w,\alpha)\,\nabla_{w'}\mathcal{L}_{val}(w',\alpha)\approx\frac{\nabla_{\alpha}\mathcal{L}_{train}(w^{+},\alpha)-\nabla_{\alpha}\mathcal{L}_{train}(w^{-},\alpha)}{2\epsilon}$$
where $\epsilon$ is a small scalar and $w^{\pm}=w\pm\epsilon\nabla_{w'}\mathcal{L}_{val}(w',\alpha)$.
Evaluating this finite difference requires only two extra forward-backward passes, reducing the computational complexity from $O(|\alpha||w|)$ to $O(|\alpha|+|w|)$.
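The finite-difference trick can be checked on a toy quadratic training loss, chosen so the exact mixed second derivative is known (all values below are invented for the check):

```python
# Toy loss: L_train(w, alpha) = alpha * w^2 (scalars), so
#   dL_train/dalpha = w^2, and the exact mixed second derivative
#   d/dalpha [dL_train/dw] = 2*w, hence the exact product with v is 2*w*v.

def grad_alpha_train(w, alpha):
    return w ** 2  # dL_train/dalpha

def hvp_finite_diff(w, alpha, v, eps=1e-3):
    """Approximate (d^2 L_train / dalpha dw) * v with two gradient evaluations,
    perturbing w by +/- eps*v as in the paper's finite-difference formula."""
    return (grad_alpha_train(w + eps * v, alpha)
            - grad_alpha_train(w - eps * v, alpha)) / (2 * eps)

w, alpha, v = 1.5, 0.7, 2.0
approx = hvp_finite_diff(w, alpha, v)
exact = 2 * w * v
```

For a quadratic loss the central difference is exact up to floating-point error, which is why only two extra gradient evaluations suffice in practice.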

First-order Approximation
When $\xi=0$, the second-order term in formula 7 vanishes and the gradient is determined by $\nabla_{\alpha}\mathcal{L}_{val}(w,\alpha)$ alone; that is, the current weights are assumed to be optimal, and the validation loss is optimized directly by modifying the architecture. Setting $\xi=0$ speeds up the search but may also hurt performance. The case $\xi=0$ is called the first-order approximation, and $\xi>0$ the second-order approximation.
Deriving Discrete Architectures
When constructing the final discrete architecture, each node keeps the top-$k$ strongest non-zero operations among the edges from its predecessor nodes, where the strength of an operation is computed as $\frac{\exp(\alpha^{(i,j)}_{o})}{\sum_{o'\in\mathcal{O}}\exp(\alpha^{(i,j)}_{o'})}$. For better search performance, $k=2$ for convolutional cells and $k=1$ for recurrent cells. The zero operation is excluded mainly so that each node has enough inputs, allowing a fair comparison with current SOTA models.
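A sketch of this discretization step, with invented operation names and architecture weights (the real candidate set and values come from the searched model):

```python
import numpy as np

OPS = ["sep_conv_3x3", "max_pool_3x3", "skip_connect", "zero"]  # "zero" last

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def derive_edges(alpha_rows, k=2):
    """For one node, keep the k incoming edges whose strongest NON-zero op
    has the highest softmax probability."""
    scored = []
    for j, alpha in enumerate(alpha_rows):
        probs = softmax(np.asarray(alpha))
        o = int(np.argmax(probs[:-1]))        # best op, zero excluded
        scored.append((probs[o], j, OPS[o]))
    scored.sort(reverse=True)                 # strongest edges first
    return [(j, op) for _, j, op in scored[:k]]

# Candidate incoming edges for one intermediate node (illustrative values).
rows = [[2.0, 0.1, 0.3, 0.0],   # from node 0
        [0.2, 1.5, 0.1, 0.0],   # from node 1
        [0.1, 0.2, 0.1, 3.0]]   # from node 2: "zero" dominates, so it loses
edges = derive_edges(rows, k=2)
```

Note that the edge from node 2 is dropped even though its raw maximum is large, because the dominant weight sits on the excluded zero operation.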
Experiments and Results
Comparison of search cost, where "runs" denotes the best result over multiple searches.
The discovered cell structures.
Performance comparison on CIFAR-10.
Performance comparison on PTB.
Performance comparison when transferred to ImageNet.
Conclusion
DARTS is a classic NAS method that breaks away from the earlier discrete search paradigm and enables end-to-end architecture search. Because DARTS updates the architecture with gradients, the update direction is more accurate, and the search time is greatly reduced compared with earlier methods: CIFAR-10 requires only 4 GPU-days.