Least mean squares regression (II)


See this article for general LMS algorithm applications.
General LMS practical application

This paper designs the mathematical theory behind LMS.

1. The Least Mean Squares algorithm (LMS)

SDThe steepest descent method studied is a recursive calculationSignal statistics are knownRecursive algorithm of time Wiener filter (knowledge about R och P).
The problem is that this information is usually unknown!
LMS is a method based on the same principle as the steepest descent method, but its statistics are continuously estimated.
Because the statistics are continuously estimated, LMS algorithm can adapt to the changes of signal statistics; Therefore, LMS algorithm is an adaptive filter.

We want to create an algorithm to minimize\(E\{|e(n)|^2\}\), just like SD, but based on unknown statistics.
One strategy is to use the estimation of autocorrelation matrix R and cross-correlation vector P. If instantaneous estimate is selected:

\[\hat{\pmb R}=\pmb u(n)\pmb u^{H}(n) \\
\hat{\pmb p}(n)=\pmb u(n)d^{*}(n)

The resulting method is the least mean square multiplication algorithm.

For SD, the filter weight is updated by:

\[\pmb w(n+1)=\pmb w(n)+\frac{1}{2}\mu[-\nabla J(n)]

where \(\nabla J(n)=-2\pmb p+2\pmb R \pmb w(n)\)

In LMS, we use the estimated\(\hat{\pmb R},\hat{\pmb p}\)To calculate\(\hat{∇}J(n)\)Therefore, the updated filter vector also becomes an estimate. Therefore, it is recorded as\(\hat{w}(n)\)

\[\hat{\pmb \nabla}J(n)=-2 \hat{\pmb p}+2 \hat{\pmb R} \hat{\pmb w}(n)

For LMS algorithm, the update equation of filter weight becomes:

\[\hat{\pmb w}(n+1)=\hat{\pmb w}(n)+\mu \pmb u(n)e^{*}(n)

where \(e^{*}(n)=d^{*}(n)-\pmb u^{H}(n)\hat{\pmb w(n)}\)

Comparison with SD correspondence

\[\pmb w(n+1)=\pmb w(n)+\mu E\{ \pmb u(n)e^{*}(n)\}

where \(e^{*}(n)=d^{*}(n)-\pmb u^{H}(n)\pmb w(n)\)
Therefore, the difference is that\(E\{\pmb u(n)e^{∗}(n)\}\)Estimated as\(\pmb u(n)e^{∗}(n)\)。 This will result in gradient noise.

2. Convergence analysis of the LMS

Consider updating the equation:

\[\hat{\pmb w}(n+1)=\hat{\pmb w}(n)+\mu \pmb u(n)e^{*}(n)

We want to know where\(J(n)\)and\(\hat{\pmb w(n)}\)On the other hand, the convergence speed and convergence speed of the algorithm.


  1. Introduce filter weight error vector\(\epsilon(n)=\hat{\pmb w}(n)−\pmb w_o\)
  2. use\(\epsilon (n)\)Represents the update equation.
  3. with\(\epsilon (n)\)express\(J(n)\)。 Which involves\(\pmb K(n)=E\{\pmb \epsilon (n)\pmb \epsilon^{H}(n)\}\)
  4. calculation\(\pmb K(n)\)It controls the convergence of the difference equation.
  5. from\(\pmb K(n)\)reach\(\pmb X(n)\)For variable change, its convergence speed is also fast.

2.1 The filter weight error vector

LMS contains feedback. Like SD, if the step size µ is not properly selected, the algorithm also has the risk of divergence. In order to study its stability, the filter weight error vector is introduced\(\epsilon (n)\)

\[\pmb \epsilon (n)=\hat{\pmb w}(n)-\pmb w_o

among\(\pmb w_o=\pmb R^{-1}\pmb p,Wiener-Hopf\)

be careful,\(\epsilon (n)\)Corresponding to in SD\(c(n)=\pmb w(n)w−\pmb w_o\), but because\(\hat{\pmb w}(n)\)\(\epsilon (n)\)It’s random.

2.2 Update equation in \(\epsilon (n)\)

N + 1 time\(\epsilon (n)\)The update equation of can be recursively expressed as:

\[\epsilon (n)=\hat{\pmb w}(n)-\pmb w_o

\epsilon (n+1)&=\hat{\pmb w}(n+1)-\pmb w_o \\
&=\hat{\pmb w}(n)+\mu \pmb u(n)e^{*}(n)-\pmb w_o \\
&=\pmb \epsilon (n)+\mu \pmb u(n)[d^{*}(n)-\pmb u^{H}(n)\hat{\pmb w(n)}] \\
&=\pmb \epsilon (n)+\mu \pmb u(n)[d^{*}(n)-\pmb u^{H}(n)\pmb w_o-\pmb u^{H}(n)\pmb \epsilon (n)] \\
&=[\pmb I-\mu \pmb u(n)\pmb u^{H}(n)]\pmb \epsilon(n)+\mu \pmb u(n)e^{*}_{o}(n)

where \(\pmb u(n)e^{*}_{o}(n)=d^{*}(n)-\pmb u^{H}(n)\pmb w_o\)

Compared with SD value:

\[\pmb c(n)=\pmb w(n)-\pmb w_o

\pmb c(n+1)&=\pmb w(n)+\mu[\pmb p-\pmb R\pmb w(n)]-\pmb w_o \\
&=\pmb c(n)+\mu[\pmb p-\pmb R\pmb w(n)] \\
&=\pmb c(n)+\mu [\pmb R\pmb w_o-\pmb R\pmb w(n)] \\
&=\pmb c(n)+\mu \pmb R[\pmb w_o-\pmb w(n)] \\
&=c(n)-\mu \pmb R c(n) \\
&=(\pmb I-\mu \pmb R)\pmb c(n)

\[\pmb c(n+1)=(\pmb I-\mu \pmb R)\pmb c(n)

2.3 Express J(n) in \(\epsilon(n)\)

The convergence analysis of LMS is much more complex than that of SD. Therefore, two assumptions (approximations) are required.

  • Independence theory
  • Direct-averaging method
2.3.1 Independence theory
  1. The input vector instances n, u (1), u (2),…, u (n) at different times are independent of each other (and therefore irrelevant).
  2. The input signal vector at time n is independent of the expected signal at all early times, and U (n) is independent of D (1), D (2),…, D (n − 1).
  3. The desired signal of time n, D (n) is independent of all early desired signals, D (1), D (2),…, D (n − 1), but depends on the input signal vector of time n, u (n).
  4. At the same time, the signals D (n) and U (n) of instant n are normally distributed with each other.
2.3.2 Direct-averaging

Direct averaging means that in the update equation\(\epsilon\)

\[\epsilon (n+1)=[\pmb I-\mu \pmb u(n)\pmb u^{H}(n)]\pmb \epsilon(n)+\mu \pmb u(n)e^{*}_{o}(n)

Instantaneous estimation\(\hat{\pmb R}(n)=\pmb u(n)\pmb u^{H}(n)\)Expected set\(\pmb R=E\{\pmb u(n)\pmb u^{H}(n)\}\)replace:

\[\epsilon (n+1)=[\pmb I-\mu \pmb R]\pmb \epsilon(n)+\mu \pmb u(n)e^{*}_{o}(n)

In SD, we have:

\[J(n)=J_{min}+(\pmb w-\pmb w_o)^{H}\pmb R(\pmb w-\pmb w_o)

The eigenvalue of R controls the convergence rate.

For LMS,\(\pmb \epsilon (n)=\hat{\pmb w}(n)-\pmb w_o\)

J(n)&=E\{|e(n)|^{2}\}=E\{|d(n)-\hat{\pmb w}^{H}u(n)|^2\} \\
&=E\{|e_o(n)-\pmb \epsilon^{H}u(n)|^2\} \\
&=^{i}E\{|e_o(n)|^{2}\}+E\{\pmb \epsilon^{H} (n)\pmb u(n)\pmb u^{H}(n)\pmb \epsilon (n)\}

Here, (I) indicates that the independence assumption is used. because\(\pmb \epsilon\)and\(\hat{\pmb R}\)It is an estimate and is not independent, so the expectation operator cannot be translated into it.

\(J_{ex}(n)=J(n)−J_{min}=E{\pmb \epsilon^{H}(n)\hat{\pmb R}(n)\pmb \epsilon(n)}\)The behavior of determines the convergence.
The following is a brief description of the properties to be used:

  1. \(tr(scalar)=scalar\)
  2. \(tr(\pmb A\pmb B)=tr(\pmb B\pmb A)\)
  3. \(E\{tr(\pmb A)\}=tr(E\{\pmb A\})\)

\(J_{ex}(n)\)Can be rewritten as:

J_{ex}(n)&=E\{tr[\hat{\pmb R}(n)\pmb \epsilon(n)\pmb \epsilon^{H}(n)]\} \\
&=tr[E\{\hat{\pmb R}(n)\pmb \epsilon(n)\pmb \epsilon^{H}(n)\}] \\
&=tr[E\{\hat{\pmb R}(n)\}E\{\pmb \epsilon(n)\pmb \epsilon^{H}(n)\}] \\
&=tr[\pmb R\pmb K(n)]

Convergence depends on:\(\pmb K(n)=E\{\pmb \epsilon(n)\pmb \epsilon^{H}(n)\}\)

2.4 Difference equation of \(\pmb K(n)\)

The update equation of filter weight error is as follows:

\[\epsilon (n+1)=[\pmb I-\mu \pmb u(n)\pmb u^{H}(n)]\pmb \epsilon(n)+\mu \pmb u(n)e^{*}_{o}(n)

The solution of this stochastic difference equation is small\(µ\)Solution close to pair (by direct averaging):

\[\epsilon (n+1)=[\pmb I-\mu \pmb R]\pmb \epsilon(n)+\mu \pmb u(n)e^{*}_{o}(n)

This equation is still difficult to solve, but now the behavior of K (n) can be described by the following difference equation:

\[\pmb K(n+1)=(\pmb I-\mu \pmb R)\pmb K(n)(\pmb I-\mu \pmb R)+\mu^2J_{min}\pmb R

2.5 Changing of variables from \(\pmb K(n)\) to \(\pmb X(n)\)

We use orthogonal decomposition:

\[\pmb Q^{H}\pmb R \pmb Q=\pmb \Lambda

At the same time, we assume that:

\[\pmb Q^{H}\pmb K(n) \pmb Q=\pmb X(n)

among\(\pmb Λ\)yes\(\pmb R\)Diagonal eigenvalue matrix, where\(\pmb X(n)\)Usually not diagonal, get:

\[J_{ex}(n)=tr(\pmb R\pmb K(n))=tr(\pmb \Lambda \pmb X(n))

from\(\pmb K(n)\)express\(\pmb X(n)\)

\[\pmb X(n+1)=(\pmb I-\mu \pmb \Lambda)\pmb X(n)(\pmb I-\mu \pmb \Lambda)+\mu^2J_{min}\pmb \Lambda

\(J_{ex}(n)\)Only dependent on\(X(n)\)Diagonal element of (because of tracking)\(tr(ΛX(n)]\))。 Therefore, we can write\(J_{ex}(n)=\sum \limits_{i=1}^{M}λ_ix_{ii}(n)\)
Diagonal elements are updated by:

\[x_i(n+1)=(1-\mu \lambda_i)^2x_i(n)+\mu^2J_{min} \lambda_i

This is a non-uniform difference equation. This means that there will be a transient part and a static part dependent on jmin and µ. Therefore, you can write a cost function:


among\(J_{min}+J_{ex}(\infty) och J_{trans}(n)\)They are stationary and transient, respectively.
At best, the LMS reaches\(J_{min}+J_{ex}(\infty)\)

2.5 some properties of LMS
  • The restrictions on learning rate are the same as SD:

  • \[J_{ex}(\infty)=J_{min}\sum \limits_{i=1}^{M}\frac{\mu \lambda_i}{2-\mu \lambda_i}


J_{ex}(n)&=\sum \limits_{i=1}^{M}λ_ix_{ii}(n) \\
J_{ex}(n+1)&=\sum \limits_{i=1}^{M}λ_ix_{ii}(n+1) \\
&=\sum \limits_{i=1}^{M}λ_i[(1-\mu \lambda_i)^2x_i(n)+\mu^2J_{min} \lambda_i] \\
&=\sum \limits_{i=1}^{M}λ_i[(1-\mu \lambda_i)^{n}x_i(1)+\mu^2J_{min} \lambda_i(1+(1-\mu \lambda_i)^2+(1-\mu \lambda_i)^4+…+(1-\mu \lambda_i)^{2(n-1)})]


J_{ex}(\infty)&=\lim \limits_{n->\infty}\sum \limits_{i=1}^{M}λ_i[(1-\mu \lambda_i)^{n}x_i(1)+\mu^2J_{min} \lambda_i(1+(1-\mu \lambda_i)^2+(1-\mu \lambda_i)^4+…+(1-\mu \lambda_i)^{2(n-1)})] \\
&=J_{min}\sum \limits_{i=1}^{M}\frac{\mu \lambda_i}{2-\mu \lambda_i}

  • \(M=J\frac{J_{ex}(\infty)}{J_{min}}\)It measures the distance between the optimal solution and LMS (in the mean square sense)

3. Rules of thumb LMS

\(\pmb R\)The eigenvalues of are rarely known, but the sum of eigenvalues can be described by average power. Therefore, a set of thumb rules is usually used, that is, based on the input power:

\[\sum \limits_{k=0}^{M-1}E\{|u(n-k)|^2\}=Mr(0)=tr(\pmb R)

  • Stepsize \(0
  • \(\pmb M \approx \frac{\mu}{2}\sum \limits_{k=0}^{M-1}E\{|u(n-k)|^2\}\) (\(\ mu \ labda_i ignore \))
  • Average time constant:\(\tau_{mse,av}\approx \frac{1}{2\mu \lambda_{av}}\)
\[\lambda_{av}=\frac{1}{M}\sum \limits_{k=1}^{M}\lambda_i

  • \(\pmb M\)And\(\tau_{mse,av}\)Relationship between:
\[\pmb M\approx \frac{M}{4\tau_{mse,av}}

here,\(\tau_{mse,av}\)Represents the time required for J (n) to decay the e − 1 factor.

4. Summary

  1. Set the starting value of the filter coefficient,\(\hat{\pmb w}(0)\)
  2. calculation error\(\pmb e^{*}(n)=d(n)-\pmb u^{H}(n)\hat{\pmb w_o}\)
  3. Update filter coefficients\(\hat{\pmb w}(n+1)=\hat{\pmb w}(n)+\mu \pmb u(n)[d(n)-\pmb u^{H}(n)\hat{\pmb w_o}]\)
  4. Repeat steps 2 and 3

Recommended Today

Hive Foundation

1、 Hive basic concepts Hive is a data warehouse tool based on Hadoop. It can map structured data files into a table and provide SQL like query functions. The essence is:Convert HQL into MapReduce program flow chart Architecture Principle Architecture diagram User interface (client):cli (hive shell), jdbc/odbc (Java access hive), webui (browser access hive) Metadata: […]