Xavier Glorot, Antoine Bordes, Yoshua Bengio. Deep Sparse Rectifier Neural Networks. AISTATS 2011: 315-323

Glorot et al. propose the use of rectified linear units for feed-forward networks. They present a motivation based on the following two facts relating models used in deep learning and neuroscience:

  • Studies show that neurons encode information in a highly distributed and sparse way.
  • A frequently used model for neurons looks as follows:

    $f(I) = \begin{cases}\left[\tau \log\left(\frac{E + RI - V_r}{E + RI - V_{th}}\right) + t_{ref}\right]^{-1} & \text{ if }E + RI > V_{th}\\0 & \text{ else}\end{cases}$

    where the individual parameters are unimportant in the context of artificial neural networks as long as the structure is understood.
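To make the structure of this neuron model concrete, here is a minimal sketch of the firing rate as a function of the input current $I$. The function name and all default parameter values are assumptions chosen for illustration only; they are not taken from the paper.

```python
import math

def lif_firing_rate(I, tau=0.02, E=-0.065, R=1e7,
                    V_r=-0.065, V_th=-0.050, t_ref=0.002):
    """Firing rate f(I) of the neuron model above.

    All default values are hypothetical (seconds, volts, ohms) and only
    serve to illustrate the shape of the function.
    """
    drive = E + R * I
    # Below the threshold the neuron stays silent: exact zero output.
    if drive <= V_th:
        return 0.0
    # Above the threshold the rate grows with I and saturates at 1 / t_ref.
    return 1.0 / (tau * math.log((drive - V_r) / (drive - V_th)) + t_ref)

for current in (1e-9, 2e-9, 5e-9, 1e-6):
    print(current, lif_firing_rate(current))
```

The important part is the one-sided behavior: the output is exactly zero below a threshold and increases monotonically above it, which is the property the rectifier shares with this model.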

Sparsity is further motivated by the following desirable properties:

  • Information disentangling (i.e. disentangling of "neighboring" features).
  • Variable-size representations as needed.
  • High likelihood for linear separability.

The rectified linear unit, i.e. $\max(0, x)$, is introduced together with a discussion of its advantages and potential problems. The advantages, briefly summarized, are sparsity, cheaper computation compared to the hyperbolic tangent and sigmoid, and better gradient flow due to the linear, non-saturating part. A potential problem is that the hard threshold at $0$ blocks gradient flow through inactive units. To address this, they also consider the softplus function, a smooth variant of the rectifier:

$\text{softplus}(x) = \log(1 + e^x)$

However, experiments show that the hard threshold of the rectifier does not hinder optimization.
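The difference between the two activations is easiest to see in their derivatives. The following is a minimal sketch using only the Python standard library; the function names are my own, and the subgradient $0$ at $x = 0$ for the rectifier is a common convention that I assume here.

```python
import math

def relu(x):
    # Rectifier: max(0, x), exactly zero for all non-positive inputs.
    return max(0.0, x)

def relu_grad(x):
    # Hard threshold: no gradient flows for x <= 0 (assumed convention at 0).
    return 1.0 if x > 0 else 0.0

def softplus(x):
    # log(1 + e^x), written as max(x, 0) + log1p(e^{-|x|}) to avoid overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_grad(x):
    # The derivative of softplus is the logistic sigmoid, which is never zero
    # for moderate x; evaluated in a way that avoids overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), relu_grad(x), softplus(x), softplus_grad(x))
```

For negative inputs the rectifier outputs exact zeros with zero gradient, which yields sparsity, while softplus keeps a small positive output and gradient and therefore does not produce exact zeros.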
