The objective functions used in linear least -squares and
regularized least-squares are multidimensional
quadratics. We now analyze multidimensional quadratics further . We will see many more
uses of quadratics further in the course, particularly when dealing with Gaussian distributions .
The general form of a one-dimensional quadratic is given by:
This can also be written in a slightly different way (called standard form):
where . These two forms are equivalent, and it is
easy to go back and forth between them (e.g., given a, b, c, what are ?). In the latter
form, it is easy to visualize the shape of the curve: it is a bowl, with minimum (or maximum) at
b, and the “width” of the bowl is determined by the magnitude of a, the sign of a tells us which
direction the bowl points (a positive means a convex bowl, a negative means a concave bowl), and
c tells us how high or low the bowl goes (at x = b). We will now generalize these intuitions for
The general form for a 2D quadratic function is:
and, for an N-D quadratic, it is:
Note that there are three sets of terms: the quadratic terms , the linear terms
and the constant term .
Dealing with these summations is rather cumbersome. We can simplify things by using matrixvector
notation. Let . Then we can write a quadratic as:
You should verify for yourself that these different forms
are equivalent: by multiplying out all the
elements of f(x), either in the 2D case or, using summations, the general N − D case.
For many manipulations we will want to do later, it is helpful for A to be symmetric, i.e., to
have . In fact, it should be clear that these off-diagonal entries are redundant. So, if we
are a given a quadratic for which A is asymmetric, we can symmetrize it as:
and use instead. You should confirm for yourself that this is equivalent to the
As before, we can convert the quadratic to a form that leads to clearer interpretation:
where , assuming that A-1 exists. Note the similarity here to the
1-D case. As before, this function is a bowl-shape in N dimensions, with curvature specified by
the matrix A, and with a single stationary point μ. However, fully understanding the shape of
f(x) is a bit more subtle and interesting.
4.1 Optimizing a quadratic
Suppose we wish to find the stationary points (minima or maxima) of a quadratic
The stationary points occur where all values , that is, . Let us assume
that A is symmetric (if it is not, then we can symmetrize it as above). This is a very common form
of cost function (e.g. the log probability of a Gaussian as we will later see).
Due to the linearity of the differentiation operator , we can look at each of the three terms of
Eq.11 separately. The final (constant) term does not depend on x and so we can ignore it. Let us
examine the first term. If we write out the individual terms within the vectors/matrices, we get:
Note that the gradient is a vector of length N, where each
of the terms correspond to .
So in the expression above, for the terms in the gradient corresponding to each , we only need
to consider the terms involving , namely
The gradient then has a very simple form:
We can write a single expression for all of the using matrix/vector form:
You should multiply this out for yourself to see that this
corresponds to the individual terms above.
If we assume that A is symmetric, then we have
This is also very helpful rule that you should remember .
The next term in the cost function, bTx,
has an even simpler gradient. Note that this is simply a dot product , and the result is a scalar:
Only one term corresponds to each
. We can again express this in matrix/
This is another helpful rule that you will encounter again. If we use both of the expressions we
have just derived, and set the gradient of the cost function to zero , we get:
The optimum is given by the solution to this system of equations:
In the case of scalar x, this reduces to x = −b/2a. For linear regression: A = XXT and b =
−2XyT . As an exercise, convince yourself that this is true.