26^{th}MARCH2017

L.-C. Chen, A. G. Schwing, A. L. Yuille, R. Urtasun. *Learning Deep Structured Models*. ICML, 2015.

What is **your opinion** on this article? **Let me know** your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.

Chen and Schwing propose an algorithm to jointly learn deep features and Markov Random Field (MRF) parameters. The underlying motivation is that deep features are learned while considering the dependencies between the predicted variables. This is in contrast with earlier work (they explicitly name [1] and [2]) where deep features are learned separately in a first step, i.e. without considering the relationships between the variables.

They consider a scoring function $F(x, y;w)$ with $x$ being the input and $y \in Y=Y_1 \times \ldots \times Y_N$ the predicted output. Inference is given by the variables corresponding to the highest score:

$y^\ast = \arg \max_y F(x, y;w).$

Modeling only unary potentials using a deep network, this corresponds to a forward pass to evaluate the network. For general graphical models, in contrast, this problem is NP-hard due to the size of the space $Y$. Using the probability

$p_{(x,y)}(\hat{y}|w, \epsilon) = \frac{1}{Z_\epsilon (x, w)} \exp(F(x,\hat{y}; w))^{\frac{1}{\epsilon}}$(1).

For a configuration $\hat{y}$ with $Z_\epsilon$ being the partition function. Deriving the gradient of the negative log-likelihood given a training set $\mathcal{D}=\{(x,y)\}$ yields

$\sum_{(x,y) \in \mathcal{D}} \sum_{\hat{y} \in Y} \frac{\partial}{\partial w} F(x, \hat{y};w) (p_{(x,y)}(\hat{y} | w, \epsilon) - p_{(x,y), tg}(\hat{y})$

where the target distribution $p_{(x,y), tg}(y)$ is assumed to be the dirac delta. Chen and Schwing emphasize that other target distributions are also possible. Unfortunately, the derivation is not detailed. The problem with the above gradient is the computational complexity due to the size of the target space $Y$. In the course of the paper, Chen and Schwing use several approximations in order to make the learning task feasible.

First, they assume the scoring function $F(x,y;w)$ to decompose over regions $\mathcal{R}$ of variables:

$F(x,y;w) = \sum_{r \in \mathcal{R}} f_r(x,y_r;w)$

In order to derive the following approximation of the learning task (remember that the learning task involves minimizing the negative log likelihood of Equation (1)):

$\min_w \sum_{(x,y) \in \mathcal{D}} \left( \max_{b_{(x,y)} \in \mathcal{C}_{(x,y)}} \left[ \sum_{r, \hat{y}_r} b_{(x,y),r}(\hat{y}_r) f_r(x,\hat{y}_r;w) + \sum_r \epsilon c_r H(b_{(x,y),r})\right] - F(x,y;w)\right)$(2)

Here, the local beliefs $b_{(x,y),r}$ are used to approximate the true marginals $p_{(x,y),r}$. This approximation is the result of the following considerations. As stated by Wainwright and Jordan [3], the identity

$\epsilon \log Z_\epsilon (x,w) = \max_{p_{(x,y)}(\hat{y})\in \Delta} E[F(x,\hat{y};w)] + \epsilon H(p_{(x,y)})$

is used in order to simplify the computationally infeasible dependency on the partition function. In the discussed case, due to the decomposition in Equation (2), this amounts to

$\max_{p_{(x,y)}(\hat{y} \in \Delta} \sum_{r, \hat{y}_r} p_{(x,y),r}(\hat{y}_r) f_r(x,\hat{y}_r; w) + \epsilon H(p_{(x,y)})$

with $H$ denoting the entropy and $p_{(x,y),r}$ denoting the marginal corresponding to region $r$. There are obviously two problems with this formulation. First, the involved marginalization constraints are infeasible in size and, second, the entropy cannot be computed exactly for most graphical models. The former problem is tackled by using local beliefs $b_{(x,y),r}$ instead of the marginals $p_{(x,y),r}$. For the latter problem, Chen and Schwing propose to use the fractional entropy (as described in [4]):

$H(p_{(x,y)}) \approx \sum_{r} c_r H(b_{(x,y),r})$

Here, $c_r$ are counting numbers weighting the different marginal entropies. Putting both together, results in the form shown in Equation (2). However, due to the min-max form, this problem is still challenging.

Chen and Schwing follow a derivation from [5] to re-formulate this problem into a pure minimization problem by utilizing the dual problem of the inner maximization. As no derivation is given, and the corresponding section in [5] is hard to identify, details are omitted and can be found in the paper. The final algorithm, also with unclear (if not omitted) derivation, involves a forward pass (inference) to compute the scores $f_r (x,y_r;w)$, several iterations of belief propagation to obtain the local beliefs $b_{(x,y),r}$ and a backward pass in order to compute the gradient. For belief propagation, derivations can partly be found in [6].

For the presented experiments, including results on the Flickr dataset, Chen and Schwing discuss concrete models. For example, they use the architecture of [7] to model unary potentials of the $38$ tags to predict. Furthermore, pairwise potentials of the form

$f_r (x, y_i, y_j; w_p) = \sum_{mn} W_{mn} \delta(y_i = m, y_j = n)$

are used. Unfortunately, Chen and Schwing do not detail the inference procedure, which remains unclear after reading the paper.

Decision tree fields. ICCV, 2011.Tell me what you see and I will show you where it is. CVPR, 2014.Graphical Models, Exponential Families and Variational Inference. Foundations and Trends in Machine Learning, 2008.Fractional belief propagation. NIPS, 2003.Learning Structured Prediction Models: A Large Margin Approach. ICML, 2005.Primal-Dual Message Passing Algorithm for Approximated Large Scale Structured Prediction. NIPS, 2010.