Andrew Ilyas, Logan Engstrom, Anish Athalye, Jessy Lin. Black-box Adversarial Attacks with Limited Queries and Information. ICML, 2018.

Ilyas et al. propose three query-efficient black-box adversarial example attacks using distribution-based gradient estimation. In particular, their simplest attacks involves estimating the gradient locally using a search distribution:

$\nabla_x \mathbb{E}_{\pi(\theta|x)} [F(\theta)] = \mathbb{E}_{\pi(\theta|x)} [F(\theta) \nabla_x \log(\pi(\theta|x))]$

where $F(\cdot)$ is a loss function – e.g., using the cross-entropy loss which is maximized to obtain an adversarial example. The above equation, using a Gaussian noise search distribution leads to a simple approximator for the gradient:

$\nabla \mathbb{E}[F(\theta)] = \frac{1}{\sigma n} \sum_{i = 1}^n \delta_i F(\theta + \sigma \delta_i)$

where $\sigma$ is the search variance and $\delta_i$ are sampled from a unit Gaussian. This scheme can then be applied as part of the projected gradient descent white-box attacks to obtain adversarial examples.

The above attack assumes that the black-box network provides probability outputs in order to compute the loss $F$. In the remainder of the paper, the authors also generalize this approach to the label-only case, where the network only provides the top $k$ labels for each input. In experiments, the attacks is shown to be effective while rarely requiring more than $50$k queries on ImageNet.

Also find this summary on ShortScience.org.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: