## Section 4.1 Least Squares

### Subsection 4.1.1 Theory of Least Squares

Solving a linear system of equations is a fundamental use of the tools of linear algebra. You know from introductory linear algebra that a linear system may have no solution. In an applied situation there could be many reasons for this, which raises the question: what do we do next?

We often construct mathematical models of practical situations, frequently in an effort to measure various parameters of the model. Suppose we think that interest rates \(R\text{,}\) as measured by the rate on one-year government bonds, are a linear function of construction activity, \(C\text{,}\) as measured by the number of permits issued for new construction in the last 30 days. The “hotter” the construction market, the greater the demand for loans, and so the greater the cost of money (the interest rate). With a good model, we might be able to predict interest rates by examining public records for changes in the number of construction permits issued.

So we have a mathematical model

\begin{equation*}
R = aC + b
\end{equation*}

where we do not know \(a\) or \(b\text{,}\) the parameters of the model. But we would like to *know* the values of these parameters. Or at least have good *estimates*, since we understand that our model is an extremely simple representation of a much more complicated situation. So we collect data by obtaining monthly records of the interest rate and construction permits for the past sixty years, where perhaps we report permits on a per capita basis to account for changes in population over many decades. Now we have \(720\) pairs of permits issued and interest rates. We can substitute each pair into our model and we get a linear equation in the parameters \(a\) and \(b\text{.}\) Two such equations would likely have a unique solution; however, if we consider all \(720\) equations at once, there is unlikely to be a solution. Why do we say that? Imagine the linear system we would obtain from all these equations. The \(720\times 2\) coefficient matrix will have the values of \(C\) in the first column, and the second column will be all \(1\)'s. The vector of constants will be the values of \(R\text{.}\) If you form the augmented matrix of this system and row-reduce, you will be virtually guaranteed to get a pivot column in the third column and thus have no solution (Theorem RCLS).

The lack of a solution for our model can be expressed more clearly by saying that the vector of constants (the values of \(R\)) is not in the column space of the coefficient matrix. Or, the vector of \(R\) values is not a linear combination of the vector of \(C\) values and the vector of all ones. If it were, then the scalars in the linear combination would be our solution for \(a\) and \(b\text{.}\) We will temporarily leave our economic forecasting example to be more general, but we will return to the example soon.

Consider the general problem of solving \(A\vect{x}=\vect{b}\) when there is no solution. We *desire* an \(\vect{x}\) so that \(A\vect{x}\) *equals* \(\vect{b}\text{,}\) but we will *settle* for an \(\vect{x}\) so that \(A\vect{x}\) is *very close* to \(\vect{b}\text{.}\) What do we mean by close? One measure would be that we want the difference \(A\vect{x}-\vect{b}\) to be a short vector, as short as possible. In other words, we want to minimize the norm, \(\norm{A\vect{x}-\vect{b}}\text{.}\) We will return to this minimization idea shortly, but first let us ask informally: what *direction* might this short vector have?

We have a subspace, the column space of \(A\text{,}\) represented by all possibilities for \(A\vect{x}\) as we range over all values of \(\vect{x}\text{.}\) We have a vector \(\vect{b}\) that is not in the subspace. Which element in the subspace is closest to \(\vect{b}\text{?}\) Consider the situation in two dimensions, where the subspace is a line through the origin, and the point lies off the line. From the point off the line, which of the infinitely many points on the line is closest, and what direction would you take to get there? In three dimensions, consider a subspace that is a plane containing the origin, and the point lies off the plane. From the point off the plane, which of the infinitely many points on the plane is closest, and what direction would you take to get there? You should have answered “in a perpendicular direction” and “in the direction of the normal vector to the plane”. Move orthogonally to the subspace. More exactly, in the direction of an element of the orthogonal complement of the subspace.

More carefully, let \(\hat{\vect{b}}\) be the “closest” vector to \(\vect{b}\) that is in the column space of \(A\text{,}\) and define \(\hat{\vect{x}}\) by the scalars in the linear combination of the columns of \(A\) that yields \(\hat{\vect{b}}\text{.}\) That is, \(A\hat{\vect{x}} = \hat{\vect{b}}\text{.}\) We desire

\begin{equation*}
\vect{b}-A\hat{\vect{x}}\in\left(\csp{A}\right)^\perp=\nsp{\adjoint{A}}
\end{equation*}

so

\begin{equation*}
\adjoint{A}\left(\vect{b}-A\hat{\vect{x}}\right)=\vect{0}
\qquad\text{or equivalently}\qquad
\adjoint{A}A\hat{\vect{x}}=\adjoint{A}\vect{b}\text{.}
\end{equation*}

This linear system is called the normal equations, due to the role of orthogonality in its derivation. Several good things have happened. Primarily, \(\adjoint{A}A\) is a square matrix, is positive semi-definite, and if \(A\) has full column rank, then \(\adjoint{A}A\) is nonsingular. With a nonsingular coefficient matrix, we have the unique solution

\begin{equation*}
\hat{\vect{x}}=\inverse{\left(\adjoint{A}A\right)}\adjoint{A}\vect{b}\text{.}
\end{equation*}

The difference between \(\vect{b}\) and \(\hat{\vect{b}}\text{,}\) \(\vect{r}=\vect{b}-\hat{\vect{b}}\text{,}\) is known as the residual, and its length (the distance in the orthogonal direction) can be used as a measure of the approximation that is inherent in taking \(\hat{\vect{x}}\) as a solution to the system.
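As a concrete sketch of this recipe, the following forms and solves the normal equations for a small system with no exact solution, then confirms the residual is orthogonal to the columns of \(A\) (the matrix and vector here are invented for illustration):

```python
# Sketch: solve the normal equations  A^T A x = A^T b  for a small
# inconsistent real system (A and b invented for illustration).
A = [[1.0, 1.0],
     [2.0, 1.0],
     [3.0, 1.0]]
b = [1.0, 2.0, 2.0]

# Form A^T A (2x2) and A^T b (2-vector).
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
Atb = [sum(A[k][i] * b[k] for k in range(3)) for i in range(2)]

# Solve the 2x2 system by Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x_hat = [(Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
         (AtA[0][0] * Atb[1] - Atb[0] * AtA[1][0]) / det]

# Residual r = b - A x_hat; it should be orthogonal to each column of A.
b_hat = [sum(A[k][j] * x_hat[j] for j in range(2)) for k in range(3)]
r = [b[k] - b_hat[k] for k in range(3)]
dots = [sum(A[k][i] * r[k] for k in range(3)) for i in range(2)]
print(x_hat, dots)
```

The two dot products in `dots` vanish (up to round-off), which is exactly the statement \(\adjoint{A}\left(\vect{b}-A\hat{\vect{x}}\right)=\vect{0}\text{.}\)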

We consider two very simple examples of using the normal equations, in cases where our geometric intuition is useful.

###### Example 4.1.1. Trivial Least Squares.

Consider the extremely trivial system of equations in one variable, \(A\vect{x}=\vect{b}\text{,}\) with

\begin{equation*}
A=\begin{bmatrix}1\\2\end{bmatrix}\qquad
\vect{b}=\begin{bmatrix}5\\15\end{bmatrix}\text{.}
\end{equation*}

Quite clearly this system has no solution. Forming the normal equations we get the silly system

\begin{equation*}
\begin{bmatrix}5\end{bmatrix}\begin{bmatrix}x_1\end{bmatrix}=\begin{bmatrix}35\end{bmatrix}
\end{equation*}

which has the unique solution \(x_1=7\text{.}\) What is more interesting is that

\begin{equation*}
A\hat{\vect{x}}=\hat{\vect{b}}=\begin{bmatrix}7\\14\end{bmatrix}\text{.}
\end{equation*}

This is a vector in \(\csp{A}\) and the difference

\begin{equation*}
\vect{r}=\vect{b}-\hat{\vect{b}}=\begin{bmatrix}5\\15\end{bmatrix}-\begin{bmatrix}7\\14\end{bmatrix}=\begin{bmatrix}-2\\1\end{bmatrix}
\end{equation*}

is orthogonal to the columns of \(A\text{,}\) in other words \(\vect{r}\) is in the orthogonal complement of the column space of \(A\text{.}\) Geometrically, the point \((5,15)\) is not on the line \(y=2x\text{,}\) but \((7,14)\) is. The line has slope \(2\text{,}\) while the line segment joining \((5,15)\) to \((7,14)\) has slope \(-\frac{1}{2}\text{,}\) making it perpendicular to the line. So \((7,14)\) is the point on the line closest to \((5,15)\) and leads to the solution \(x_1=7\) of the resulting system.
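The geometry here reduces to projecting the point \((5,15)\) onto the line \(y=2x\text{,}\) which has direction vector \((1,2)\text{.}\) A quick numeric check:

```python
# The column space is the line y = 2x, with direction vector (1, 2).
# Project b = (5, 15) onto this line:  x1 = (a . b) / (a . a).
a = (1, 2)
b = (5, 15)
x1 = (a[0] * b[0] + a[1] * b[1]) / (a[0] ** 2 + a[1] ** 2)  # 35 / 5 = 7
b_hat = (x1 * a[0], x1 * a[1])          # closest point on the line: (7, 14)
r = (b[0] - b_hat[0], b[1] - b_hat[1])  # residual: (-2, 1)
# The residual is orthogonal to the line's direction vector:
print(x1, b_hat, r, a[0] * r[0] + a[1] * r[1])
```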

Let's do it again, staying simple, but not trivially so.

###### Example 4.1.2. Simple Least Squares.

Consider the simple system of equations in two variables, \(A\vect{x}=\vect{b}\) with

If you try to solve this system, say by row-reducing an augmented matrix, you will discover that there is no solution. The vector \(\vect{b}\) is not in the column space of \(A\text{.}\) The normal equations give the system

which has the unique solution \(x_1=-\frac{4}{23}\text{,}\) \(x_2=\frac{171}{115}\text{.}\) What is more interesting is that

This is a vector in \(\csp{A}\) and the difference

is orthogonal to both of the columns of \(A\text{,}\) in other words \(\vect{r}\) is in the orthogonal complement of the column space of \(A\text{.}\) Geometrically, the point \((1,5,-2)\) is not on the plane spanned by the columns of \(A\text{,}\) which has equation \(9x+3y-5z=0\text{,}\) but \((-\frac{191}{115},\,\frac{473}{115},\,-\frac{12}{23})\) is on the plane. The plane has a normal vector \(9\vec{i}+3\vec{j}-5\vec{k}\text{,}\) while the vector joining \((1,5,-2)\) to \((-\frac{191}{115},\,\frac{473}{115},\,-\frac{12}{23})\) is \(\frac{306}{115}\vec{i}+\frac{102}{115}\vec{j}-\frac{34}{23}\vec{k}\text{,}\) which is a scalar multiple of the normal vector to the plane. So \((-\frac{191}{115},\,\frac{473}{115},\,-\frac{12}{23})\) is the point on the plane closest to \((1,5,-2)\) and leads to the solution \(x_1=-\frac{4}{23}\text{,}\) \(x_2=\frac{171}{115}\) of the resulting system.
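The fractions in this example invite arithmetic slips, so it is worth verifying the claims exactly: that \(\hat{\vect{b}}\) lies on the plane \(9x+3y-5z=0\text{,}\) and that the residual \(\vect{r}=\vect{b}-\hat{\vect{b}}\) is a scalar multiple of the plane's normal vector. A check with exact rational arithmetic:

```python
from fractions import Fraction as F

# Verify the example's points with exact rational arithmetic.
b     = (F(1), F(5), F(-2))
b_hat = (F(-191, 115), F(473, 115), F(-12, 23))
n     = (F(9), F(3), F(-5))   # normal vector to the plane 9x + 3y - 5z = 0

# b_hat lies on the plane (this sum should be exactly zero):
on_plane = sum(ni * bi for ni, bi in zip(n, b_hat))

# The residual r = b - b_hat is a scalar multiple of the normal vector:
r = tuple(bi - bhi for bi, bhi in zip(b, b_hat))
scales = tuple(ri / ni for ri, ni in zip(r, n))   # all equal if parallel
print(on_plane, r, scales)
```

All three component-wise ratios come out to \(\frac{34}{115}\text{,}\) confirming that the residual points along the normal direction.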

What does a solution to the normal equations look like for our economic forecasting model? As discussed above, \(A\) is a \(720\times 2\) matrix, where the first column has the numbers of construction permits, \(c_i\) for \(1\leq i\leq 720\text{,}\) and the second column is all ones. The vector \(\vect{b}\) contains the \(720\) values of the interest rate, \(r_i\text{,}\) \(1\leq i\leq 720\text{.}\) So

\begin{equation*}
A=\begin{bmatrix}c_1&1\\c_2&1\\\vdots&\vdots\\c_{720}&1\end{bmatrix}\qquad
\vect{b}=\begin{bmatrix}r_1\\r_2\\\vdots\\r_{720}\end{bmatrix}\text{.}
\end{equation*}

Then,

\begin{equation*}
\hat{\vect{x}}=\begin{bmatrix}a\\b\end{bmatrix}
=\inverse{\left(\adjoint{A}A\right)}\adjoint{A}\vect{b}
=\inverse{\begin{bmatrix}\sum c_i^2&\sum c_i\\\sum c_i&720\end{bmatrix}}\begin{bmatrix}\sum c_ir_i\\\sum r_i\end{bmatrix}
=\frac{1}{720\sum c_i^2-\left(\sum c_i\right)^2}
\begin{bmatrix}720\sum c_ir_i-\sum c_i\sum r_i\\\sum c_i^2\sum r_i-\sum c_i\sum c_ir_i\end{bmatrix}\text{.}
\end{equation*}

The expressions above can be cleaned up some, but the point is to see that we have an expression for each of \(a\) and \(b\) that depends solely on the \(720\) pairs of data points \((c_i, r_i)\text{,}\) \(1\leq i\leq 720\text{.}\) These expressions may look familiar as the most basic case of linear regression. Exercise Checkpoint 4.1.3 asks you to derive these expressions, and in the more typical order. With estimates in hand, you can now consult the number of construction permits issued and form a prediction of interest rates.
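The closed-form estimates above translate directly into code. A minimal sketch for the model \(R=aC+b\text{,}\) computed from \(n\) data pairs \((c_i, r_i)\) (the four data pairs are invented for illustration):

```python
# Closed-form least-squares estimates for the model R = a*C + b,
# computed from n data pairs (c_i, r_i).  Data invented for illustration.
pairs = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]
n = len(pairs)
Sc  = sum(c for c, r in pairs)          # sum of c_i
Sr  = sum(r for c, r in pairs)          # sum of r_i
Scc = sum(c * c for c, r in pairs)      # sum of c_i^2
Scr = sum(c * r for c, r in pairs)      # sum of c_i * r_i

det = n * Scc - Sc ** 2                 # det of A^T A
a = (n * Scr - Sc * Sr) / det           # slope
b = (Scc * Sr - Sc * Scr) / det         # intercept
print(a, b)
```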

###### Checkpoint 4.1.3.

Suppose we have \(n\) pairs of a dependent variable \(y\) that varies linearly according to the independent variable \(x\text{,}\) \(\left(x_i,y_i\right)\text{,}\) \(1\leq i\leq n\text{.}\) Model the data by the linear equation \(y=a + bx\text{.}\) Solve the normal equations to find expressions that estimate the parameters \(a\) and \(b\text{.}\)

###### Checkpoint 4.1.4.

Find the least-squares estimate obtained from data modeled by the linear equation \(y=bx\) (a situation where we know there is no \(y\)-intercept).

###### Checkpoint 4.1.5.

Suppose you have data modeled by a single quantity, \(z\text{,}\) that depends on two independent variables, \(x\) and \(y\text{,}\) linearly according to the model \(z = a+bx+cy\text{.}\) So your data points are triples \(\left(x_i,y_i,z_i\right)\text{,}\) \(1\leq i\leq n\text{.}\) Can you solve the normal equations to obtain expressions estimating \(a,b,c\text{?}\) This might not be a very instructive exercise, but perhaps determine \(\adjoint{A}A\) and \(\adjoint{A}\vect{b}\) before letting the matrix inverse dissuade you. What does the geometric picture of the data and your resulting estimates look like?

###### Checkpoint 4.1.6.

Suppose that \(A\) is a nonsingular matrix. What can you say about a solution, \(\hat{\vect{x}}\text{,}\) to the normal equations? Comment on this situation.

So far we have not stated any theorems, to say nothing of proving anything. Moving in an orthogonal direction feels like a good idea, but is it really best? Here is a theorem that suggests it is best according to one natural criterion.

###### Theorem 4.1.7.

Suppose that \(A\) is an \(m\times n\) matrix and \(\vect{b}\in\complex{m}\text{.}\) Then \(r\left(\vect{x}\right)=\norm{A\vect{x}-\vect{b}}\) is minimized by a solution, \(\hat{\vect{x}}\text{,}\) to the normal equations, \(\adjoint{A}A\vect{x}=\adjoint{A}\vect{b}\text{.}\)

###### Proof.

For any vector \(\vect{x}\) we can write

\begin{equation*}
A\vect{x}-\vect{b}=\left(A\vect{x}-A\hat{\vect{x}}\right)+\left(A\hat{\vect{x}}-\vect{b}\right)\text{.}
\end{equation*}

Clearly the first term is a vector in the column space of \(A\text{.}\) The second vector, by virtue of \(\hat{\vect{x}}\) being a solution to the normal equations, is an element of \(\nsp{\adjoint{A}}\text{,}\) the orthogonal complement of the column space (Theorem `[provisional cross-reference: result on orthogonal complement of column space]`). So this is the promised decomposition of \(A\vect{x}-\vect{b}\) into an element of a subspace (the column space of \(A\) here) and an element of the orthogonal complement of the subspace. Since these two vectors are orthogonal, we can apply the generalized version of the Pythagorean Theorem (`[provisional cross-reference: Pythagorean theorem, perhaps into FCLA]`) to obtain

\begin{equation*}
r\left(\vect{x}\right)^2
=\norm{A\vect{x}-\vect{b}}^2
=\norm{A\vect{x}-A\hat{\vect{x}}}^2+\norm{A\hat{\vect{x}}-\vect{b}}^2
\geq\norm{A\hat{\vect{x}}-\vect{b}}^2
=r\left(\hat{\vect{x}}\right)^2\text{.}
\end{equation*}

This inequality establishes that \(\hat{\vect{x}}\) minimizes \(r\left(\vect{x}\right)\text{.}\)
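The theorem can also be sanity-checked numerically: on a small example (matrix and data invented for illustration), no candidate vector near the normal-equations solution achieves a smaller residual norm.

```python
import itertools
import math

# Numerically confirm the minimization property on a small example
# (A and b invented for illustration): the normal-equations solution
# beats every vector on a grid of nearby candidates.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [0.0, 1.0, 3.0]

def resid_norm(x):
    """Return ||A x - b|| for a candidate vector x."""
    return math.sqrt(sum((sum(Ak[j] * x[j] for j in range(2)) - bk) ** 2
                         for Ak, bk in zip(A, b)))

# Solve the 2x2 normal equations directly (Cramer's rule).
AtA = [[3.0, 3.0], [3.0, 5.0]]   # A^T A for this A
Atb = [4.0, 7.0]                 # A^T b
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x_hat = [(Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
         (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det]

# Every perturbed candidate does no better than x_hat.
best = min(resid_norm([x_hat[0] + dx, x_hat[1] + dy])
           for dx, dy in itertools.product([-0.5, -0.1, 0.0, 0.1, 0.5],
                                           repeat=2))
print(x_hat, resid_norm(x_hat), best)
```

The grid includes the zero perturbation, so `best` equals \(r\left(\hat{\vect{x}}\right)\) exactly, and every other candidate is strictly worse.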

There are measures of distance other than this norm (we are using what would be called the \(2\)-norm, in order to differentiate it from other norms). But “Euclidean distance”, as we have used here, is often the most tractable and the results are often the most useful. In statistics, it is assumed that deviations in data, deviations from what the model would predict, are due to quantities that vary according to probability distributions. With this assumption, estimators, such as our least-squares estimates, become complicated functions of these random quantities. With the \(2\)-norm, and reasonable assumptions about the probability distributions for the data, the resulting least-squares estimators have probability distributions that are well-understood, thus making it possible to understand their properties and behavior.

### Subsection 4.1.2 Goodness of Fit

We can associate a single number as a measure of how “good” our approximation \(\hat{\vect{b}}\) really is. First, realize that the simplest possible model is to assume that \(\vect{b}\) should have constant entries, so the matrix \(A\) is a single column of all ones. In this case, the normal equations provide \(\hat{\vect{x}}\) as the vector with a single entry that is the average of the entries of the vector \(\vect{b}\text{.}\) If we write the entries of \(\vect{b}\) as \(b_i\) and write the average as \(\overline{b}\text{,}\) then the length of the residual vector, squared, is

\begin{equation*}
\mathrm{SS}_\mathrm{tot}=\norm{\vect{r}}^2=\sum_{i=1}^{m}\left(b_i-\overline{b}\right)^2\text{.}
\end{equation*}

When we use a more sophisticated model, the residual vector will be shorter, and if we denote the entries of \(\hat{\vect{b}}\) by \(\hat{b}_i\text{,}\) we have

\begin{equation*}
\mathrm{SS}_\mathrm{res}=\norm{\vect{r}}^2=\sum_{i=1}^{m}\left(b_i-\hat{b}_i\right)^2\text{.}
\end{equation*}

The notation for these expressions is read as a “sum of squares”, either “total” or “residual”. Then the coefficient of determination is defined as

\begin{equation*}
R^2=1-\frac{\mathrm{SS}_\mathrm{res}}{\mathrm{SS}_\mathrm{tot}}\text{.}
\end{equation*}

When we actually have a solution to the original system, then the residual vector for the model is the zero vector, so \(\mathrm{SS}_\mathrm{res}=0\) and \(R^2=1\text{.}\) When the model does not do better than the simplest possible model, then \(\mathrm{SS}_\mathrm{res} = \mathrm{SS}_\mathrm{tot}\) and \(R^2=0\text{.}\) So the hope is that a model will fit the data with an \(R^2\) as close to \(1\) as possible. What is considered “close” will vary from situation to situation. What is considered close for economics data may not be considered close for data from a physics experiment.
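The computation of \(R^2\) is a few lines of code. A minimal sketch, with observed values \(\vect{b}\) and fitted values \(\hat{\vect{b}}\) invented for illustration:

```python
# Sketch: coefficient of determination R^2 for a fitted vector b_hat
# versus an observed vector b (values invented for illustration).
b     = [2.0, 4.0, 5.0, 9.0]
b_hat = [2.5, 3.5, 5.5, 8.5]

b_bar  = sum(b) / len(b)                                   # average of b
ss_tot = sum((bi - b_bar) ** 2 for bi in b)                # total sum of squares
ss_res = sum((bi - bhi) ** 2 for bi, bhi in zip(b, b_hat)) # residual sum of squares
r2 = 1.0 - ss_res / ss_tot
print(ss_tot, ss_res, r2)
```

Here the fit is close to the data, so \(R^2=1-\frac{1}{26}\approx 0.96\text{,}\) near the ideal value of \(1\text{.}\)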

### Subsection 4.1.3 Computing Least-Squares Solutions

In Exercise Checkpoint 4.1.5 we suggest formulating general expressions for least-squares estimates of three parameters of a model. This begins to get a bit absurd when you invert a matrix larger than size \(2\text{.}\) Instead it makes sense to formulate from the data the matrix \(\adjoint{A}A\) and the vector \(\adjoint{A}\vect{b}\text{,}\) and then proceed to a numerical solution. As \(\adjoint{A}A\) is a Hermitian positive definite matrix (Theorem 1.8.2), it can be decomposed into a Cholesky factorization twice as fast as we can decompose an arbitrary matrix into an LU factorization `[provisional cross-reference: cost to form Cholesky]`. The Cholesky factorization allows a round of forward-solving followed by a round of back-solving to find a solution to the normal equations.
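The Cholesky route can be sketched concretely for a small real example (matrix and data invented for illustration): factor \(\adjoint{A}A=L\adjoint{L}\) with \(L\) lower triangular, forward-solve \(L\vect{y}=\adjoint{A}\vect{b}\text{,}\) then back-solve \(\adjoint{L}\vect{x}=\vect{y}\text{.}\)

```python
import math

# Sketch: solve the normal equations via a Cholesky factorization
# A^T A = L L^T (real case), then forward- and back-solve.
# A and b are invented for illustration.
A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
b = [1.0, 2.0, 2.0, 4.0]
m, n = len(A), len(A[0])

AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
       for i in range(n)]
Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]

# Cholesky: AtA = L L^T with L lower triangular, positive diagonal.
L = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1):
        s = AtA[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
        L[i][j] = math.sqrt(s) if i == j else s / L[j][j]

# Forward-solve L y = A^T b, then back-solve L^T x = y.
y = [0.0] * n
for i in range(n):
    y[i] = (Atb[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
x = [0.0] * n
for i in reversed(range(n)):
    x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
print(x)
```

Each solve is a single triangular sweep, which is why the factor-then-solve approach is cheap once the factorization is in hand.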

But it gets better. Consider solving the system \(A\vect{x}=\vect{b}\text{.}\) Suppose \(A\) has rank \(n\) and we have a thin QR decomposition of \(A\text{,}\) \(A=QR\text{,}\) where \(Q\) has orthonormal columns and \(R\) is an upper triangular square matrix with positive diagonal entries. Notice in particular that \(R\) is invertible. Then

\begin{equation*}
\adjoint{A}A\vect{x}=\adjoint{A}\vect{b}
\quad\Longrightarrow\quad
\adjoint{R}\adjoint{Q}QR\vect{x}=\adjoint{R}\adjoint{Q}\vect{b}
\quad\Longrightarrow\quad
\adjoint{R}R\vect{x}=\adjoint{R}\adjoint{Q}\vect{b}
\end{equation*}

since \(\adjoint{Q}Q=I_n\text{.}\)
Since \(R\) is invertible, so is \(\adjoint{R}\text{,}\) and we can multiply both sides of the above equation by \(\inverse{\left(\adjoint{R}\right)}\) to obtain the system

\begin{equation*}
R\vect{x}=\adjoint{Q}\vect{b}\text{.}
\end{equation*}

This new system has an upper triangular coefficient matrix, so one round of back-solving will provide a solution for \(\hat{\vect{x}}\text{.}\) And the product \(\adjoint{Q}\vect{b}\) will be reasonably well-behaved due to the orthogonal columns of \(Q\text{.}\) Notice that in the case of a square matrix \(A\) (of full rank), the matrix \(Q\) will be invertible, and the same system that in general leads to the least-squares solution will also provide the exact solution in this special case.
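The QR route can be sketched in a few lines (matrix and data invented for illustration). For simplicity this uses classical Gram-Schmidt to build the thin decomposition; production codes would prefer Householder reflections for numerical stability.

```python
import math

# Sketch: least squares via a thin QR decomposition A = QR with
# orthonormal columns in Q, followed by back-solving R x = Q^T b.
# Classical Gram-Schmidt on a small invented real example.
A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
b = [1.0, 2.0, 2.0, 4.0]
m, n = len(A), len(A[0])

Q = [[0.0] * n for _ in range(m)]
R = [[0.0] * n for _ in range(n)]
for j in range(n):
    v = [A[i][j] for i in range(m)]
    for k in range(j):
        # Subtract the projection onto the k-th orthonormal column.
        R[k][j] = sum(Q[i][k] * A[i][j] for i in range(m))
        v = [v[i] - R[k][j] * Q[i][k] for i in range(m)]
    R[j][j] = math.sqrt(sum(vi * vi for vi in v))
    for i in range(m):
        Q[i][j] = v[i] / R[j][j]

# Back-solve the upper triangular system R x = Q^T b.
Qtb = [sum(Q[i][k] * b[i] for i in range(m)) for k in range(n)]
x = [0.0] * n
for i in reversed(range(n)):
    x[i] = (Qtb[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
print(x)
```

Note that \(\adjoint{A}A\) is never formed, which avoids squaring the condition number of \(A\) and is the main numerical advantage of the QR approach.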

This is an excellent example of how knowledge of a specialized situation, and a matrix decomposition, can be parlayed into better accuracy and greater speed. But it is not free; there is still significant effort required to compute the QR decomposition in the first place.

###### Checkpoint 4.1.8.

Compute the least-squares solution to the system \(A\vect{x}=\vect{b}\text{.}\) Compute the residual vector associated with your solution.

We compute the pieces of the normal equations

So the solution is

The residual is the difference

Notice that the residual is indeed orthogonal to the column space of \(A\text{.}\)