Collection of the most common neural network mistakes


1) you didn't try to overfit a single batch first. (i.e. if you can't overfit a small amount of data, you've got a simple bug somewhere; see the training-loop sketch after this list)
2) you forgot to toggle train/eval mode for the net.
3) you forgot to .zero_grad() (in pytorch) before .backward().
4) you passed softmaxed outputs to a loss that expects raw logits (see the logits-vs-softmax sketch after this list).
5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forgot to include it for the output layer. This one won't make you silently fail, but it adds spurious parameters (see item 10 below).
6) thinking view() and permute() are the same thing (& incorrectly using view; see the view/permute sketch after this list)
7) I like to start with the simplest possible sanity checks - e.g. first training on all-zero inputs to see what loss I get from the base output distribution alone, then gradually including more inputs and scaling up the net, making sure I beat the previous result each time (starting with a small model + a small amount of data and growing both together; I always find it really insightful).
(If I turn the real data back on and get the same loss, the net isn't learning anything from its inputs. Also, if the all-zero run produces a nicely decaying loss curve, that usually indicates a not-very-clever initialization; I sometimes like to tweak the final-layer biases to be close to the base distribution - see the sketch after this list.)
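A minimal PyTorch sketch covering items 1-3: overfit one small batch, with the train/eval toggles and zero_grad() in the right places. The model and the random batch below are throwaway stand-ins; swap in your own model and a single batch pulled from your DataLoader.

```python
import torch
import torch.nn as nn

# Throwaway stand-ins: replace with your real model and one batch from your
# DataLoader, e.g. x, y = next(iter(train_loader)).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
x = torch.randn(32, 20)           # one small batch of inputs
y = torch.randint(0, 3, (32,))    # integer class labels

criterion = nn.CrossEntropyLoss()  # expects raw logits, not softmax outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()                      # item 2: train mode while fitting
for step in range(500):
    optimizer.zero_grad()          # item 3: clear old gradients before backward
    logits = model(x)              # raw logits go straight into the loss (item 4)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()

# Item 1: if the loss doesn't drive toward ~0 on this single batch, suspect a
# bug (wrong labels, wrong loss, broken data pipeline) before tuning anything.
print(f"final single-batch loss: {loss.item():.4f}")

model.eval()                       # item 2 again: eval mode for inference
with torch.no_grad():
    preds = model(x).argmax(dim=1)
    print("single-batch accuracy:", (preds == y).float().mean().item())
```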
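For item 4: in PyTorch, nn.CrossEntropyLoss applies log_softmax internally, so passing already-softmaxed outputs silently applies softmax twice. A tiny illustration:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5]], requires_grad=True)
target = torch.tensor([0])

criterion = nn.CrossEntropyLoss()                  # applies log_softmax internally

loss_correct = criterion(logits, target)                 # pass raw logits: correct
loss_wrong = criterion(logits.softmax(dim=1), target)    # double softmax: wrong

print(loss_correct.item(), loss_wrong.item())
# The "wrong" version still runs and the loss can still go down during training,
# which is why this bug is silent -- but the loss scale and gradients are distorted.
```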
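For item 6, a quick way to convince yourself that view() and permute() do different things:

```python
import torch

x = torch.arange(6).reshape(2, 3)   # tensor([[0, 1, 2], [3, 4, 5]])

# permute() actually swaps the axes (a true transpose here)...
print(x.permute(1, 0))
# tensor([[0, 3],
#         [1, 4],
#         [2, 5]])

# ...while view() only reinterprets the same memory in a new shape,
# so the elements stay in their original order.
print(x.view(3, 2))
# tensor([[0, 1],
#         [2, 3],
#         [4, 5]])
```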
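For item 7, one possible way (not the only recipe) to nudge the final-layer biases toward the base output distribution of a classifier; the class priors and the tiny model below are made-up placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical class frequencies measured on the training set (e.g. 90/9/1%).
class_priors = torch.tensor([0.90, 0.09, 0.01])

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))

# Set the final layer's bias to the log-priors (and shrink its weights) so that,
# before any training, the network's output distribution roughly matches the
# label distribution. The initial cross-entropy should then sit near the entropy
# of the priors instead of log(num_classes).
with torch.no_grad():
    final = model[-1]
    final.bias.copy_(class_priors.log())
    final.weight.mul_(0.01)

# Sanity check on all-zero inputs: loss should be close to -sum(p * log p).
x = torch.zeros(128, 20)
y = torch.multinomial(class_priors, 128, replacement=True)
loss = nn.CrossEntropyLoss()(model(x), y)
print("initial loss:", loss.item(),
      "expected ~", -(class_priors * class_priors.log()).sum().item())
```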

8) Choosing the number of filters

There is currently no real, viable theory that explains why deep nets generalize well. We do know that their generalization ability is not based on limiting the number of parameters: [1611.03530] Understanding deep learning requires rethinking generalization

It’s common to use more parameters than data points, even on very small datasets. For example, MNIST has 60,000 training examples, but most MNIST classifier models have millions of parameters: a fully connected hidden layer with 1,000 inputs and 1,000 outputs has one million weights in that layer alone.
I usually choose the network architecture by trial and error, periodically making the model deeper and wider. Usually the problem is not that the model overfits when it gets too big, but that investing more computation stops giving much of a test-set improvement.
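As a quick sanity check of that arithmetic (the architecture below is just an illustrative MNIST-style MLP, not a recommendation):

```python
import torch.nn as nn

# A small MNIST-style classifier, only to illustrate the parameter count
# versus the 60,000 training examples.
model = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),   # this layer alone has 1,000,000 weights
    nn.Linear(1000, 10),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")   # ~1.8M, roughly 30x the number of MNIST examples
```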
9) Batch size influence
It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize ([1609.04836] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima).
10) A conv layer that is immediately followed by a BatchNorm layer can set its bias to False; BatchNorm's own shift parameter makes the conv bias redundant (same point as item 5; sketch below).
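A minimal sketch of items 5 and 10 in PyTorch:

```python
import torch.nn as nn

# BatchNorm subtracts the per-channel mean right after the convolution, so a
# conv bias would be cancelled out anyway; BatchNorm's beta parameter plays
# that role instead. Dropping the bias just removes spurious parameters.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # no bias needed
    nn.BatchNorm2d(64),                                       # has its own shift (beta)
    nn.ReLU(inplace=True),
)

# The output layer is a different story: with no normalization after it,
# you usually do want the bias.
head = nn.Linear(64, 10, bias=True)
```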
11) Which colorspace is better for a CNN?
It seems that the benchmark results linked below point to some variant of RGB being the best.

I've done similar experiments on CIFAR (although only between RGB, HSV, LAB, and YUV), and my results have also tended towards RGB working better than other colorspaces.
https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Colorspace.md
