Kamil Nar, Orhan Ocal, S. Shankar Sastry, Kannan Ramchandran. Cross-Entropy Loss Leads To Poor Margins. https://openreview.net/forum?id=ByfbnsA9Km.

Nar et al. show that minimizing the cross-entropy loss does not result in the optimal margin solution for linearly separable datasets. In fact, the margin obtained can be arbitrarily bad, assuming that the training points lie on an affine subspace of the input space. As the dimensionality of this subspace becomes small relative to the dimensionality of the input space, poor margins become more likely. As a solution, the authors propose differential training, i.e., minimizing

$\sum_i \sum_j \log(1 + e^{-w^T(x_i - y_j)})$

where $x_i$ are positive examples and $y_j$ are negative examples. Unfortunately, the authors do not provide a multi-class formulation of this loss.
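
To make the pairwise structure of this loss concrete, here is a minimal NumPy sketch. It is an illustration under my own assumptions, not the authors' implementation: the function names `differential_loss` and `differential_loss_grad`, the toy data, the step size and the number of iterations are all hypothetical. Note that a bias term would cancel in the difference $w^T x_i - w^T y_j$, so the sketch only fits $w$.

```python
import numpy as np

def differential_loss(w, X_pos, X_neg):
    """Sum over all (positive, negative) pairs of log(1 + exp(-w^T (x_i - y_j)))."""
    # Pairwise differences x_i - y_j, shape (n_pos, n_neg, d).
    diffs = X_pos[:, None, :] - X_neg[None, :, :]
    scores = diffs @ w  # shape (n_pos, n_neg)
    # logaddexp(0, -s) = log(1 + exp(-s)), evaluated in a numerically stable way.
    return np.logaddexp(0.0, -scores).sum()

def differential_loss_grad(w, X_pos, X_neg):
    """Gradient of differential_loss with respect to w."""
    diffs = X_pos[:, None, :] - X_neg[None, :, :]
    scores = diffs @ w
    # d/ds log(1 + exp(-s)) = -1 / (1 + exp(s)), written via logaddexp for stability.
    coeff = -np.exp(-np.logaddexp(0.0, scores))
    return np.einsum("ij,ijd->d", coeff, diffs)

# Hypothetical toy data: two positive and two negative points in 2D (linearly separable).
X_pos = np.array([[2.0, 1.0], [3.0, 2.0]])
X_neg = np.array([[-1.0, 0.0], [0.0, -1.0]])

w = np.zeros(2)
initial_loss = differential_loss(w, X_pos, X_neg)

# Plain gradient descent; step size and iteration count are arbitrary choices.
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * differential_loss_grad(w, X_pos, X_neg)

final_loss = differential_loss(w, X_pos, X_neg)
print("w:", w, "initial loss:", initial_loss, "final loss:", final_loss)
```

The number of terms grows with the product of the two class sizes, so for larger datasets one would presumably sample pairs rather than enumerate them as done here.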
