For linear models, when the eigenvalues are very different, for example, LR, I have two characteristics, one is (0,1) and the other is (010000). When using gradient descent, the loss contour is elliptical, which requires multiple iterations to reach the best.

However, if the normalization is carried out, the contour is circular, which makes SGD iterate to the origin, resulting in less iterations.

So it is because the gradient descent algorithm needs to be normalized, which speeds up the speed of solving the optimal solution of gradient descent.

When looking for the best advantage of tree model (regression tree), it is done by looking for the optimal split point, which can be seen from the implementation of decision tree ID3 algorithm python. Because there’s no point in deriving, there’s no need for normalization.