Dinh et al. show that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, both based on the local curvature and based on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on the flatness; this means that functions leading to the same prediction function (i.e., being indistinguishable based on their test performance) might have largely varying flatness around the obtained minima, as illustrated in Figure 1. In conclusion, while networks that generalize well usually correspond to flat minima, it is not necessarily true that flat minima generalize better than sharp ones.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: