APRIL2020

Peter L. Bartlett. *For Valid Generalization the Size of the Weights is More Important than the Size of the Network*. NIPS 1996: 134-140.

Also find this summary on ShortScience.org.

What is **your opinion** on this article? **Let me know** your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.

Barlett shows that lower generalization bounds for multi-layer perceptrons with limited sizes of the weights can be found using the so-called fat-shattering dimension. Similar to the classical VC dimensions, the fat shattering dimensions quantifies the expressiveness of hypothesis classes in machine learning. Specifically, considering a sequence of points $x_1, \ldots, x_d$, a hypothesis class $H$ is said to shatter this sequence if, for any label assignment $b_1, \ldots, b_d \in \{-1,1\}$, a function $h \in H$ exists that correctly classifies the sequence, i.e. $\text{sign}(h(x_i)) = b_i$. The VC dimension is the largest $d$ for which this is possible. The VC dimension has been studied for a wide range of machine learning models (i.e., hypothesis classes). Thus, it is well known that multi-layer perceptrons with at least two layers have infinite VC dimension – which seems natural as two-layer perceptrons are universal approximators. As a result, most bounds on the generalization performance of multi-layer networks (and, thus, also of more general deep networks) do not apply as the VC dimension is infinite.

The fat-shattering dimension, in contrast, does not strictly require the sequence $x_1,\ldots, x_d$ to be correctly classified into the labels $b_1,\ldots, b_d$. Instead, the sequence is said to be $\gamma$-shattered if real values $r_1,\ldots,r_d$ exist such that for every labeling, $b_1,\ldots,b_d$, some some $h \in H$ satisfies $(h(x_i) – r_i)b_i \geq \gamma$. Note that the values $r_i$ are fixed across labelings, i.e., are chosen “before” knowing the labels. The fat-shattering dimension is the largest $d$ for which this is possible. As a result, the fat-shattering dimension relaxes the VC dimension in that the models in $H$ are allowed some “slack” (in lack of a better word). Note that $H$ contains real-valued functions.

Based on this definition, Barlett shows that multi-layer perceptrons in which all layers have weights $w$ constrained as $\|w\|_1 \leq A$ scales with $A^{l(l + 1)}$. More importantly, however, the fat-shattering dimension is finite. Thus, generalization bounds based on the fat-shattering dimensions apply and are discussed by Barlett; I refer to the paper for details on the bound.