Ayinde et al. study the impact of network architecture and weight initialization on learning redundant features. To empirically estimate the number of redundant features, the authors use agglomerative clustering to group features based on their cosine similarity. Essentially, given a set of features, clusters are merged as long as their (average) cosine similarity stays above some threshold $\tau$. The resulting number of redundant features is then compared across network architectures. Figure 1, for example, shows the number of redundant features for different network depths and different activation functions on MNIST. As can be seen, ReLU activations avoid redundant features, while increasing the depth of the network usually encourages them.
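To make the clustering step more concrete, here is a minimal sketch of how such a count could be computed. This is not the authors' code: the function name `count_redundant_features`, the default `tau=0.8`, the use of SciPy's average-linkage clustering, and the definition of "redundant" as the number of features minus the number of clusters are all my own assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def count_redundant_features(weights, tau=0.8):
    """Estimate the number of redundant features in one layer.

    weights: array of shape (n_features, n_inputs), one row per feature.
    tau: cosine similarity threshold (hypothetical default).
    """
    # Average-linkage agglomerative clustering on cosine distance,
    # where cosine distance = 1 - cosine similarity.
    Z = linkage(weights, method="average", metric="cosine")
    # Stop merging once the average distance exceeds 1 - tau, i.e. once
    # the average cosine similarity between clusters drops below tau.
    labels = fcluster(Z, t=1.0 - tau, criterion="distance")
    n_clusters = len(np.unique(labels))
    # Assumption: every feature beyond one per cluster counts as redundant.
    return weights.shape[0] - n_clusters


# Toy example: three random features plus three slightly perturbed copies.
rng = np.random.default_rng(0)
base = rng.standard_normal((3, 50))
near_copies = base + 0.01 * rng.standard_normal((3, 50))
weights = np.vstack([base, near_copies])
n_redundant = count_redundant_features(weights, tau=0.8)
print(n_redundant)
```

In the toy example, each perturbed copy is almost parallel to its original, so the two end up in the same cluster, while the random base features are nearly orthogonal in 50 dimensions and stay separate. Note that the cosine metric is undefined for all-zero rows, so dead features would need to be filtered out first.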