Angus Galloway, Thomas Tanay, Graham W. Taylor. Adversarial Training Versus Weight Decay. CoRR abs/1804.03308 (2018).

Galloway et al. provide a theoretical and experimental discussion of adversarial training and weight decay with respect to robustness as well as generalization. In the following I want to try and highlight the most important findings based on their discussion of linear logistic regression. Considering the softplus loss $\mathcal{L}(z) = \log(1 + e^{-z})$, the learning problem takes the form:

$\min_w \mathbb{E}_{x,y \sim p_{data}} [\mathcal{L}(y(w^Tx + b))]$

where $y \in \{-1,1\}$. This optimization problem is also illustrated in Figure 1 (top). $L_2$ weight decay can then be seen as equivalent to scaling the softplus loss. In particular, Galloway et al. argue that $w^Tx + b = \|w\|_2 d(x)$ where $d(x)$ is the (signed) Euclidean distance to the decision boundary. (This follows directly from the fact that $d(x) = \frac{w^Tx + b}{\|w\|_2}$.) Then, the problem can be rewritten as

$\min_w \mathbb{E}_{x,y \sim p_{data}} [\mathcal{L}(yd(x) \|w\|_2)]$

This can be understood as a scaled version of the softplus loss; adding an $L_2$ weight decay term basically controls the level of scaling. This is illustrated in Figure 1 (middle) for different levels of scaling. Finally, adversarial training means training on the worst-case example for a given $\epsilon$. In practice, for the linear logistic regression model, this results in training on $x - \epsilon y \frac{w}{\|w\|_2}$, which is easy to see: for label $1$, the attacker causes the largest change in the output by moving the sample in the direction of $-w$. Then,

$y (w^T(x - \epsilon y \frac{w}{\|w\|_2}) + b) = y(w^Tx + b) - \epsilon \|w\|_2 = \|w\|_2 (yd(x) - \epsilon)$,

which results in a shift of the data by $\epsilon$, as illustrated in Figure 1 (bottom). Overall, Galloway et al. show that weight decay acts as scaling the objective and adversarial training acts as shifting the data (or equivalently the objective).
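The two identities for the linear case can be checked numerically. The following is only a minimal sketch on random data, not code from the paper; the helper names (`softplus_loss`, `signed_distance`, `adversarial_example`) and the concrete values for `w`, `b` and `eps` are assumptions made for illustration.

```python
import numpy as np

def softplus_loss(z):
    # L(z) = log(1 + exp(-z)), evaluated in a numerically stable way
    return np.logaddexp(0.0, -z)

def signed_distance(w, b, x):
    # d(x) = (w^T x + b) / ||w||_2
    return (x @ w + b) / np.linalg.norm(w)

def adversarial_example(w, x, y, eps):
    # Worst-case L2 perturbation of size eps for the linear model:
    # move each sample against its label along the unit vector w / ||w||_2
    return x - eps * y[:, None] * w / np.linalg.norm(w)

# Arbitrary illustrative values (assumptions, not taken from the paper)
rng = np.random.default_rng(0)
w = rng.normal(size=3)
b = 0.5
x = rng.normal(size=(5, 3))
y = rng.choice([-1.0, 1.0], size=5)
eps = 0.1

norm_w = np.linalg.norm(w)
d = signed_distance(w, b, x)

# Scaling view: y (w^T x + b) equals ||w||_2 * y d(x)
margin = y * (x @ w + b)
scaled_margin = norm_w * y * d

# Shifting view: the margin on adversarial examples equals ||w||_2 (y d(x) - eps)
adv_margin = y * (adversarial_example(w, x, y, eps) @ w + b)
shifted_margin = norm_w * (y * d - eps)

print(np.allclose(margin, scaled_margin))
print(np.allclose(adv_margin, shifted_margin))
```

Since the softplus loss is decreasing in its argument, `softplus_loss(adv_margin)` is never smaller than `softplus_loss(margin)` in this sketch, which matches the interpretation of adversarial training as a shift towards larger losses.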

In the non-linear case, decaying weights is argued to be equivalent to decaying the logits. Effectively, this results in a temperature parameter for the softmax function resulting in smoother probability distributions. Similarly, adversarial training (in a first-order approximation) can be understood as effectively reducing the probability attributed to the correct class. Here, again, weight decay results in a scaling effect and adversarial training in a shifting effect. In conclusion, adversarial training is argued to be only effective with small perturbation sizes (i.e., if the shift is not too large), while weight decay is also beneficial for generalization. However, from reading the paper, it is unclear what the actual recommendation on both methods is.
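The temperature interpretation can be illustrated with a small sketch. It assumes that weight decay acts as a uniform scaling of the logits by a factor below one; the logits and the factor `scale` are hypothetical values chosen for illustration, not results from the paper.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; higher means a smoother distribution
    return -np.sum(p * np.log(p))

# Hypothetical logits and decay factor (assumptions for illustration)
logits = np.array([4.0, 1.0, -2.0])
scale = 0.5  # scaling by 0.5 corresponds to a softmax temperature of 2

p_full = softmax(logits)
p_decayed = softmax(scale * logits)

print(p_full, entropy(p_full))
print(p_decayed, entropy(p_decayed))
```

In this sketch the predicted class is unchanged, but the decayed logits give a less confident, higher-entropy distribution.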

In the experimental section, the authors focus on two models, a wide residual network and a very constrained 4-layer convolutional neural network. Here, their discussion shifts slightly to the complexity of the employed model. While not stated very explicitly, one of the take-aways is that the simpler model might be more robust, especially for fooling images.

Figure 1: Illustration of the linear logistic regression argument. Top: illustration of linear logistic regression where $\xi$ is the loss $\mathcal{L}$, middle: illustration of the impact of weight decay/scaling, bottom: illustration of the impact of shift for adversarial training.
