# DAVIDSTUTZ

12thAUGUST2018

Aditi Raghunathan, Jacob Steinhardt, Percy Liang. Certified Defenses against Adversarial Examples. CoRR abs/1801.09344, 2018.

Raghunathan et al. provide an upper bound on the adversarial loss of two-layer networks and also derive a regularization method to minimize this upper bound. In particular, the authors consider the scoring functions $f^i(x) = V_i^T\sigma(Wx)$ with bounded derivative $\sigma'(z) \in [0,1]$ which holds for Sigmoid and ReLU activation functions. Still, the model is very constrained considering recent, well-performng deep (convolutional) neural networks. The upper bound is then derived by considering $f(A(x))$ where $A(x)$ is the optimal attacker $A(x) = \arg\max_{\tilde{x} \in B_\epsilon(x)} f(\tilde{x})$. For a linear model $f(x) = (W_1 – W_2)^Tx$, an upper bound can be derived as follows:

$f(\tilde{x}) = f(x) + (W_1 – W_2)^T(\tilde{x} – x) \leq f(x) + \epsilon\|W_1 – W_2\|_1$.

For two-layer networks a bound is derived by considering

$f(\tilde{x}) = f(x) + \int_0^1 \nabla f(t\tilde{x} + (1-t)x)^T (\tilde{x} – x) dt \leq f(x) + \max_{\tilde{x}\in B_\epsilon(x)} \epsilon\|\nabla f(\tilde{x})\|_1$.

In this case, Raghunathan rewrite the second term, i.e. $\max_{\tilde{x}\in B_\epsilon(x)} \epsilon\|\nabla f(\tilde{x})\|_1$ to derive an upper bound in the form of a semidefinite program, see the paper for details. For $v = V_1 – V_2$, this semidefinite program is based on the matrix

$M(v,W) = \left[\begin{array}0 & 0 & 1^T W^R \text{diag}(v)\\0 & 0 & W^T\text{diag}(v)\\ \text{diag}(v)^T W 1 & \text{diag}(v)^T W & 0\end{array}\right]$.

By deriving the dual objective, the upper bound can then be minimized by constraining the eigenvalues of $M(v, W)$ (specifically, the largest eigenvalue; note that the dual also involves dual variables – see the paper for details). Overall, the proposed regularize involves minimizing the largest eigenvalue of $M(v, W) – D$ where $D$ is a diagonal matrix based on the dual variables. In practice, this is implemented using SciPy's implementation of the Lanczos algorithm.

Also find this summary on ShortScience.org.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: