Interestingly, there's a nice intuitive explanation for why SGD is better than GD.
If you compute the gradient step over all of the data, you're expending computational power on redundant data. You'll get to the minimum having seen less data if you take a step as soon as you have useful information.
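To make that concrete, here's a minimal sketch (my own, not from the post) comparing full-batch GD with SGD on a deliberately redundant toy dataset; the dataset and step sizes are arbitrary choices, and numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Redundant dataset: 100 informative points, each repeated 50 times.
X_base = rng.normal(size=(100, 1))
y_base = 3.0 * X_base[:, 0] + rng.normal(scale=0.1, size=100)
X = np.repeat(X_base, 50, axis=0)
y = np.repeat(y_base, 50)

def grad(w, Xb, yb):
    # Gradient of mean squared error for a one-parameter linear model.
    return 2.0 * Xb[:, 0] @ (w * Xb[:, 0] - yb) / len(yb)

# Full-batch GD: every step touches all 5000 points, most of them redundant.
w_gd = 0.0
for _ in range(10):
    w_gd -= 0.1 * grad(w_gd, X, y)

# SGD: each step uses one random point, so 10 steps touch only 10 points.
w_sgd = 0.0
for _ in range(10):
    i = rng.integers(len(y))
    w_sgd -= 0.1 * grad(w_sgd, X[i:i+1], y[i:i+1])

print(w_gd, w_sgd)  # both move toward w = 3, but SGD did ~500x less work per step
```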
It's good to see people write up their experiments; it's useful for the rest of us to test our understanding of neural nets.
I think there are a few mistakes in your maths though. You can learn a 1-1 discrete mapping through a single node when you're using a one-hot vector: you just assign a weight to each of the input nodes, and then use a delta function on the other side. If I understood correctly, this is what you are doing.
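Here's a minimal sketch of what I mean (my own illustration, assuming numpy): with a one-hot input, a single linear node just selects one weight, so setting each weight to the desired output value memorizes any 1-1 discrete mapping.

```python
import numpy as np

mapping = {0: 4, 1: 2, 2: 7, 3: 1}   # arbitrary 1-1 discrete mapping
n = len(mapping)

# One weight per input node, set directly to the target value.
w = np.array([mapping[k] for k in range(n)], dtype=float)

def forward(k):
    x = np.zeros(n)
    x[k] = 1.0          # one-hot encoding of input k
    return w @ x        # the dot product just picks out w[k]

# A delta/threshold on the output then recovers the discrete target exactly.
assert all(forward(k) == mapping[k] for k in mapping)
```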
Also, if you use a tanh in your input layer but keep a linear output layer (as you start off with), you are still doing a linear approximation, because you have a rank-H matrix (where H is the size of the hidden layer) trying to linearly approximate your input data. The optimal solution to that problem is given by PCA.
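For the PCA point, a minimal sketch (my own, assuming numpy): the best rank-H linear reconstruction of centered data in the least-squares sense is the projection onto the top H principal components (the Eckart-Young theorem), so that's the ceiling for any rank-H linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)                      # center the data

H = 3
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:H]                          # top-H principal directions

X_hat = (Xc @ components.T) @ components     # optimal rank-H reconstruction
print("reconstruction error:", np.linalg.norm(Xc - X_hat) ** 2)
```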
I'd second the advice to look into the Coursera courses, or the Nando de Freitas Oxford course on YouTube (which actually has a really nice derivation of backprop).