If you average weights often enough, it's basically the same as averaging gradients. If you instead average the weights of a bunch of independently trained models, you're going to have a rough time. Even if two of the models compute exactly the same function, their hidden units can sit in a different order, so the rows and columns of the intermediate matrices don't line up and naive averaging falls apart.
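
Here's a rough NumPy sketch of both points, in case it helps. Everything in it is made up for illustration: the tiny two-layer ReLU net `mlp`, the fake gradients `g1` and `g2`, the learning rate, and the fixed permutation `perm`. The first half assumes plain SGD (no momentum or adaptive optimizer state), both workers starting from the same weights, and averaging after every single step. The second half only shows the permutation problem on a toy network; it doesn't claim anything about how real independently trained models end up arranged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Point 1: with plain SGD and a shared starting point, averaging the
# weights after one step is the same as stepping on the averaged gradient.
w = rng.normal(size=5)    # shared starting weights (hypothetical)
g1 = rng.normal(size=5)   # stand-in gradient from worker 1
g2 = rng.normal(size=5)   # stand-in gradient from worker 2
lr = 0.1

avg_of_weights = ((w - lr * g1) + (w - lr * g2)) / 2
step_on_avg_grad = w - lr * (g1 + g2) / 2
same_step = np.allclose(avg_of_weights, step_on_avg_grad)


# Point 2: reordering hidden units leaves the function unchanged,
# but averaging the original and reordered weights does not.
def mlp(x, W1, W2):
    """Tiny two-layer ReLU network without biases."""
    return np.maximum(x @ W1, 0.0) @ W2


W1 = rng.normal(size=(3, 4))   # input -> hidden
W2 = rng.normal(size=(4, 2))   # hidden -> output

perm = np.array([1, 2, 3, 0])  # a fixed, non-identity reordering of hidden units
W1p = W1[:, perm]              # permute the columns of the first matrix
W2p = W2[perm, :]              # permute the rows of the second matrix to match

x = rng.normal(size=(8, 3))
same_function = np.allclose(mlp(x, W1, W2), mlp(x, W1p, W2p))

# Naive elementwise average of two networks that compute the same function.
W1_avg = (W1 + W1p) / 2
W2_avg = (W2 + W2p) / 2
avg_matches = np.allclose(mlp(x, W1_avg, W2_avg), mlp(x, W1, W2))

print("weight avg == gradient avg step:", same_step)
print("permuted net computes same function:", same_function)
print("averaged net still matches:", avg_matches)
```

The first two checks should come out true and the last one false for generic random weights: the averaged network mixes hidden units that have nothing to do with each other, even though both inputs to the average were the same function.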