
These findings seem to be at odds. The former says that deep linear nets are useful, non-linear, and trainable with gradient descent. The latter says that the non-linearity only exists due to quirks in floating point, and that evolutionary strategies must be used to find the extremely small activations that can exploit those floating-point non-linearities.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

https://arxiv.org/abs/1312.6120

"We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions."

Nonlinear Computation in Deep Linear Networks

https://blog.openai.com/nonlinear-computation-in-linear-netw...

"Neural networks consist of stacks of a linear layer followed by a nonlinearity like tanh or rectified linear unit. Without the nonlinearity, consecutive linear layers would be in theory mathematically equivalent to a single linear layer. So it’s a surprise that floating point arithmetic is nonlinear enough to yield trainable deep networks."



The arxiv paper here is analyzing the nonlinearities in a network's learning dynamics, exploring why training time / error rates do not vary linearly throughout the training process.

They note: "Here we provide an exact analytical theory of learning in deep linear neural networks that quantitatively answers these questions for this restricted setting. Because of its linearity, the input-output map of a deep linear network can always be rewritten as a shallow network."



