
Kathrin Grosse, Thomas Alexander Trost, Marius Mosbach, Michael Backes, Dietrich Klakow. Adversarial Initialization - when your network performs the way I want. CoRR abs/1902.03020 (2019).

Grosse et al. propose an adversarial attack on a deep neural network’s weight initialization in order to damage accuracy or convergence. The attack assumes an attacker with access to the deep learning library used for training. The attacker has no knowledge of the training data or the addressed task; however, through the library’s API, the attacker knows the network architecture and its initialization. The attacker’s goal is to permute the initialized weights, without being detected, in order to hinder training. In particular, as illustrated in Figure 1 for two fully connected layers described by

$y(x) = \text{ReLU}(B \text{ReLU}(Ax + a) + b)$,

the attack tries to force a large fraction of neurons to have zero activation from the very beginning. The attack assumes non-negative input, e.g., images in $[0,1]$, as well as ReLU activations in order to zero out the selected neurons. In Figure 1, this is achieved by permuting the weights so that their negative values are concentrated in a specific part of each weight matrix. Applying both weight matrices consecutively then results in most activations being zero. This hinders training significantly, as no gradients are available, while keeping the statistics of the weights (e.g., mean and variance) unchanged. A similar strategy can be applied to consecutive convolutional layers, as discussed in detail in the paper. Additionally, slightly shifting the weights within each weight matrix allows the attacker to roughly control the number of neurons receiving zero activations; this provides control over the “degree” of damage, i.e., whether the network should diverge or merely reach lower accuracy. In experiments, the authors show that the proposed attacks on weight initialization can indeed force training to diverge or to reach lower accuracy. However, in the majority of cases, training diverges, which makes the attack less stealthy, i.e., easier to detect by the attacked user.
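To make the mechanism concrete, here is a minimal NumPy sketch of the core idea on a single fully connected layer. It is not the paper’s exact construction (which arranges blocks across two consecutive layers); instead, it simply permutes the entries of one weight matrix so that purely negative values fill the first rows. All sizes and variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Benign initialization of one fully connected layer, y = ReLU(A x + a),
# with non-negative inputs x in [0, 1] (e.g., images).
n_in, n_hidden = 256, 128
A = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_hidden, n_in))
a = np.zeros(n_hidden)

# Adversarial permutation (simplified sketch, not the paper's block
# construction): reorder the entries of A so that negative values fill
# the first k rows completely. Since this is only a permutation, the
# mean and variance of the weights are unchanged.
flat = np.sort(A, axis=None)               # ascending: most negative first
k = int((flat < 0).sum() // n_in)          # rows that can be filled with negatives
A_adv = np.concatenate([
    flat[: k * n_in],                      # negative-only block -> "dead" rows
    rng.permutation(flat[k * n_in :]),     # remaining weights, shuffled
]).reshape(n_hidden, n_in)

# Non-negative inputs plus purely negative rows give negative pre-activations,
# so ReLU outputs (and hence gradients) of those neurons are exactly zero.
x = rng.uniform(0.0, 1.0, size=(64, n_in))
dead = lambda W: (np.maximum(x @ W.T + a, 0.0) == 0).all(axis=0).mean()

print("mean/std preserved:", np.isclose(A.mean(), A_adv.mean()),
      np.isclose(A.std(), A_adv.std()))
print("fraction of dead neurons (benign):  ", dead(A))
print("fraction of dead neurons (attacked):", dead(A_adv))
```

Running this shows that the weight statistics are untouched while roughly half of the neurons are dead for the entire batch after the permutation; the paper’s two-layer construction pushes this effect through the network so that almost all activations, and thus gradients, vanish.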


Figure 1: Illustration of the idea of the proposed attacks on two fully connected layers as described in the text. The color coding illustrates large, usually positive, weight values in black and small, often negative, weight values in light gray.

Also find this summary on ShortScience.org.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.