Linear Algebra

An overview of the chapter “Linear Algebra” from the famous book “Deep Learning” written by Ian Goodfellow, Yoshua Bengio and Aaron Courville. The authors give a brief introduction to the concepts of linear algebra that are most important for machine learning. All images and tables in this post are from their book.

Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:

  • Scalars: A single number. For example, $s \in \mathbb{R}$.

  • Vectors: An array of numbers. For example, $x = [x_1, x_2, \dots, x_n]^\top$.

To access elements $x_1$ and $x_3$ of a vector, we define the index set $S = \{1, 3\}$ and write $x_S$. To get all elements other than $x_1$ and $x_3$, we write $x_{-S}$.

  • Matrices: A 2-D array of numbers. For example, $A \in \mathbb{R}^{m \times n}$, with element $A_{i,j}$ in row $i$ and column $j$.

  • Tensors: An array with more than two axes. For example, the element of a 3-D tensor $\mathsf{A}$ at coordinates $(i, j, k)$ is written $\mathsf{A}_{i,j,k}$.

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner: $(A^\top)_{i,j} = A_{j,i}$.

In machine learning, we also allow the addition of a matrix and a vector, yielding another matrix: $C = A + b$, where $C_{i,j} = A_{i,j} + b_j$. The implicit copying of $b$ across each row is called broadcasting.
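To make these objects concrete, here is a small NumPy sketch of my own (not from the book); the array names are arbitrary, and note that NumPy indexing is 0-based while the book's notation is 1-based.

```python
import numpy as np

# Scalar, vector, matrix and (3-D) tensor.
s = 3.0                                   # scalar
x = np.array([1.0, 2.0, 3.0, 4.0])        # vector
A = np.array([[1.0, 2.0], [3.0, 4.0],
              [5.0, 6.0]])                # 3x2 matrix
T = np.zeros((2, 3, 4))                   # tensor with more than 2 axes

# Indexing a set of elements: S = {1, 3} in the book's 1-based notation
# corresponds to 0-based indices 0 and 2 here.
S = [0, 2]
print(x[S])                               # elements x_1 and x_3
mask = np.ones(len(x), dtype=bool)
mask[S] = False
print(x[mask])                            # all elements except x_1 and x_3

# Transpose mirrors the matrix across its main diagonal.
print(A.T.shape)                          # (2, 3)

# Broadcasting: b is implicitly copied across each row of A.
b = np.array([10.0, 20.0])
C = A + b                                 # C[i, j] = A[i, j] + b[j]
print(C)
```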

Multiplying Matrices and Vectors

  • Matrix Multiplication: $C = AB$,

where $C_{i,j} = \sum_k A_{i,k} B_{k,j}$.

  • Element-wise product or Hadamard product: $C = A \odot B$,

where $C_{i,j} = A_{i,j} B_{i,j}$.

  • Dot product: The product $x^\top y$ of two vectors of the same dimensionality.

Matrix multiplication is associative ($A(BC) = (AB)C$) and distributive ($A(B + C) = AB + AC$), but not commutative ($AB \neq BA$ in general). However, the commutative property does hold for the dot product of vectors ($x^\top y = y^\top x$).

Also, note that $(AB)^\top = B^\top A^\top$.
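As a quick sanity check (my own illustration, not from the book), these identities can be verified numerically with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
x, y = rng.standard_normal(3), rng.standard_normal(3)

print(np.allclose(A @ (B @ C), (A @ B) @ C))    # associative: True
print(np.allclose(A @ (B + C), A @ B + A @ C))  # distributive: True
print(np.allclose(A @ B, B @ A))                # not commutative: generally False
print(np.isclose(x @ y, y @ x))                 # dot product is commutative: True
print(np.allclose((A @ B).T, B.T @ A.T))        # (AB)^T = B^T A^T: True
print(np.allclose(A * B, B * A))                # Hadamard product is elementwise: True
```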

Identity & Inverse Matrices

Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve the equation $Ax = b$ for many values of $b$. We denote the identity matrix that preserves $n$-dimensional vectors as $I_n$, defined by $I_n x = x$ for every $x \in \mathbb{R}^n$.

The inverse of $A$, denoted $A^{-1}$, is the matrix such that $A^{-1} A = I_n$. Also note that $A A^{-1} = I_n$.

The equation $Ax = b$ can then be solved analytically:

$$Ax = b \;\Rightarrow\; A^{-1} A x = A^{-1} b \;\Rightarrow\; x = A^{-1} b.$$

Note that the solution $x = A^{-1} b$ might or might not exist, depending on the existence of the inverse of $A$.
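A minimal sketch of solving $Ax = b$ with NumPy (my own example, not from the book); in practice `np.linalg.solve` is preferred over explicitly forming $A^{-1}$:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])       # invertible 2x2 matrix
b = np.array([3.0, 5.0])

x_inv = np.linalg.inv(A) @ b     # x = A^{-1} b (analytic but numerically fragile)
x_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b without forming A^{-1}

print(np.allclose(A @ x_solve, b))   # True
print(np.allclose(x_inv, x_solve))   # True
```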

Linear Dependence and Span

Suppose $Ax = b$ has two solutions $x$ and $y$; then $z = \alpha x + (1 - \alpha) y$ is also a solution for any real $\alpha$.

Formally, a linear combination of some set of vectors $\{v^{(1)}, \dots, v^{(n)}\}$ is given by multiplying each vector by a corresponding scalar coefficient and adding the results: $\sum_i c_i v^{(i)}$.

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors. Determining whether $Ax = b$ has a solution thus amounts to testing whether $b$ is in the span of the columns of $A$, also known as the column space or range of $A$. In order for the system to have a solution for all values of $b \in \mathbb{R}^m$, we therefore require that the column space of $A$ be all of $\mathbb{R}^m$. If any point in $\mathbb{R}^m$ is excluded from the column space, that point is a potential value of $b$ for which the system has no solution. The requirement that the column space of $A$ be all of $\mathbb{R}^m$ implies immediately that $A$ must have at least $m$ columns, i.e., $n \geq m$. Having $n \geq m$ is only a necessary condition for every point to have a solution; it is not a sufficient condition, since some of the columns may be redundant.

A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. This means that for the column space of the matrix to encompass all of $\mathbb{R}^m$, the matrix must contain at least one set of $m$ linearly independent columns. This condition is both necessary and sufficient for $Ax = b$ to have a solution for every value of $b$. Note that the requirement is for a set to have exactly $m$ linearly independent columns, not at least $m$. In order for the matrix to have an inverse, we additionally need to ensure that $Ax = b$ has at most one solution for each value of $b$. To do so, we need to ensure that the matrix has at most $m$ columns; otherwise, there is more than one way of parametrizing each solution. Together, this means that the matrix must be square, that is, we require $m = n$, and that all of the columns be linearly independent. A square matrix with linearly dependent columns is known as singular. If $A$ is not square, or is square but singular, it can still be possible to solve the equation; however, we cannot use the method of matrix inversion to find the solution.
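To see singularity in code, here is a small sketch of my own (not from the book) using NumPy's `matrix_rank`; the example matrices are arbitrary:

```python
import numpy as np

# Third column is the sum of the first two, so the columns are linearly dependent.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

print(np.linalg.matrix_rank(A))   # 2: fewer than 3 independent columns -> singular

try:
    np.linalg.inv(A)              # inverting an exactly singular matrix fails
except np.linalg.LinAlgError as err:
    print("not invertible:", err)

# A full-rank square matrix, by contrast, is invertible.
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 3.0]])
print(np.linalg.matrix_rank(B))   # 3
```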

Norms

Norms are used to measure the size of a vector. Formally, the $L^p$ norm is given by

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$

for $p \in \mathbb{R}$, $p \geq 1$. A norm must satisfy three conditions:

  • $f(x) = 0 \Rightarrow x = 0$
  • $f(x + y) \leq f(x) + f(y)$ (the triangle inequality)
  • $\forall \alpha \in \mathbb{R},\ f(\alpha x) = |\alpha| f(x)$

The $L^2$ norm, with $p = 2$, is known as the Euclidean norm and is equal to $\sqrt{x^\top x}$. The squared $L^2$ norm, $x^\top x$, is more convenient to work with mathematically and computationally than the $L^2$ norm itself. When it is important to discriminate between elements that are exactly zero and elements that are small but nonzero, we use the $L^1$ norm. The $L^1$ norm simplifies to $\|x\|_1 = \sum_i |x_i|$. We sometimes measure the size of a vector by counting its number of nonzero elements. Some authors refer to this function as the “$L^0$ norm”, but this is incorrect terminology: the number of nonzero entries in a vector is not a norm, because scaling the vector by $\alpha$ does not change the number of nonzero entries. The $L^1$ norm is often used as a substitute for the number of nonzero entries. One other norm is the max norm, or $L^\infty$ norm, which simplifies to the absolute value of the element with the largest magnitude in the vector, $\|x\|_\infty = \max_i |x_i|$. Sometimes we also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2},$$

which is analogous to the $L^2$ norm of a vector, but for matrices.

Furthermore, the dot product of two vectors can be rewritten in terms of norms. Specifically, $x^\top y = \|x\|_2 \|y\|_2 \cos\theta$, where $\theta$ is the angle between $x$ and $y$.
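The norms above map directly onto `np.linalg.norm`; here is a small illustrative sketch (my own, not from the book):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
y = np.array([1.0, 2.0, 2.0])

print(np.linalg.norm(x, 1))        # L1 norm: sum of absolute values -> 7.0
print(np.linalg.norm(x, 2))        # L2 (Euclidean) norm -> 5.0
print(x @ x)                       # squared L2 norm -> 25.0
print(np.linalg.norm(x, np.inf))   # max norm -> 4.0
print(np.count_nonzero(x))         # number of nonzero entries (not a true norm)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))    # Frobenius norm

# Dot product in terms of norms: x^T y = ||x||_2 ||y||_2 cos(theta).
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # angle between x and y in degrees
```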

Special Kinds of Matrices and Vectors

  • Diagonal Matrix: A matrix $D$ is diagonal if and only if $D_{i,j} = 0$ for all $i \neq j$.

We write $\text{diag}(v)$ to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector $v$. To compute $\text{diag}(v)\,x$, we only need to scale each element $x_i$ by $v_i$, i.e., $\text{diag}(v)\,x = v \odot x$. Not all diagonal matrices need be square; it is possible to construct a rectangular diagonal matrix.

  • Symmetric Matrix: A matrix $A$ is said to be symmetric if $A = A^\top$, or equivalently $A_{i,j} = A_{j,i}$.
  • Unit Vector: A vector with unit norm, $\|x\|_2 = 1$.
  • Orthogonal Vectors: Vectors $x$ and $y$ are said to be orthogonal if $x^\top y = 0$. If both vectors have nonzero norm, this means they are at a 90 degree angle to each other. In $\mathbb{R}^n$, at most $n$ vectors with nonzero norm may be mutually orthogonal. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
  • Orthogonal Matrix: A square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal. This means that $A^\top A = A A^\top = I$, which also implies that $A^{-1} = A^\top$.
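A short NumPy sketch of my own checking these properties; the matrices are randomly generated, and `np.linalg.qr` is used only as a convenient way to obtain an orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonal matrix: diag(v) x is just elementwise scaling, v * x.
v = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
print(np.allclose(np.diag(v) @ x, v * x))        # True

# Symmetric matrix: A == A^T.
M = rng.standard_normal((3, 3))
A = M + M.T
print(np.allclose(A, A.T))                       # True

# Orthogonal matrix from a QR decomposition: Q^T Q = Q Q^T = I, so Q^{-1} = Q^T.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
print(np.allclose(Q.T @ Q, np.eye(3)))           # True
print(np.allclose(np.linalg.inv(Q), Q.T))        # True

# Unit vector: nonzero vector divided by its norm.
u = x / np.linalg.norm(x)
print(np.isclose(np.linalg.norm(u), 1.0))        # True
```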

Eigendecomposition

This is a matrix decomposition method in which we decompose a matrix into a set of eigenvectors and eigenvalues:

$$A v = \lambda v.$$

Here, $A$ is a square matrix, $v$ is the eigenvector, and $\lambda$ is the corresponding eigenvalue. If $v$ is an eigenvector, so is any rescaled vector $s v$ for $s \in \mathbb{R}$, $s \neq 0$; hence, we usually only consider unit eigenvectors. Suppose that a matrix $A$ has $n$ linearly independent eigenvectors $\{v^{(1)}, \dots, v^{(n)}\}$ with corresponding eigenvalues $\{\lambda_1, \dots, \lambda_n\}$. We may concatenate all of the eigenvectors to form a matrix $V$ with one eigenvector per column, $V = [v^{(1)}, \dots, v^{(n)}]$. Likewise, we concatenate the eigenvalues to form a vector $\lambda = [\lambda_1, \dots, \lambda_n]^\top$. The eigendecomposition of $A$ is then given by $A = V \,\text{diag}(\lambda)\, V^{-1}$. We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to go the other way and decompose matrices into their eigenvalues and eigenvectors. Not every matrix can be decomposed into eigenvalues and eigenvectors. However, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

$$A = Q \Lambda Q^\top,$$

where $Q$ is an orthogonal matrix composed of the eigenvectors of $A$, and $\Lambda$ is a diagonal matrix of the corresponding eigenvalues. While any real symmetric matrix $A$ is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue. By convention, we usually sort the entries of $\Lambda$ in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.

The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form $f(x) = x^\top A x$ subject to $\|x\|_2 = 1$. Whenever $x$ is equal to an eigenvector of $A$, $f$ takes on the value of the corresponding eigenvalue, since

$$f(x) = x^\top A x = x^\top (\lambda x) = \lambda\, x^\top x = \lambda,$$

where $x^\top x = 1$ because $x$ has unit norm. A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite and positive definite matrices are interesting for two reasons:

  • Positive semidefinite matrices guarantee that $\forall x,\ x^\top A x \geq 0$.
  • Positive definite matrices additionally guarantee that $x^\top A x = 0 \Rightarrow x = 0$.
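As an illustration (mine, not the book's), NumPy's `eigh` routine computes the eigendecomposition of a real symmetric matrix and lets us verify these claims numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                       # real symmetric matrix

# eigh is specialized for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors (columns of Q), eigenvalues in ascending order.
eigvals, Q = np.linalg.eigh(A)

# Reconstruct A = Q diag(lambda) Q^T.
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))     # True

# For a unit eigenvector q_i, the quadratic form x^T A x equals its eigenvalue.
q0 = Q[:, 0]
print(np.isclose(q0 @ A @ q0, eigvals[0]))            # True

# Positive definiteness can be read off the eigenvalues.
P = A @ A + np.eye(4)             # symmetric with strictly positive eigenvalues
print(np.all(np.linalg.eigvalsh(P) > 0))              # True
```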

Singular Value Decomposition

This is another method to factorize a matrix, this time into singular vectors and singular values. The SVD is more generally applicable: every real matrix has an SVD, but the same is not true of the eigendecomposition. Here, we write $A$ as a product of three matrices:

$$A = U D V^\top.$$

Here, if $A$ is an $m \times n$ matrix, then $U$ is defined to be an $m \times m$ matrix, $D$ an $m \times n$ matrix, and $V$ an $n \times n$ matrix. The matrices $U$ and $V$ are both defined to be orthogonal matrices. The matrix $D$ is defined to be a diagonal matrix (not necessarily square). The elements along the diagonal of $D$ are known as the singular values of the matrix $A$. The columns of $U$ are known as the left-singular vectors, and the columns of $V$ as the right-singular vectors.

Furthermore, the left-singular vectors of $A$ are the eigenvectors of $A A^\top$, whereas the right-singular vectors of $A$ are the eigenvectors of $A^\top A$.
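A small sketch of my own using `np.linalg.svd`; it also checks the relationship to the eigenvalues of $A A^\top$ stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # non-square: no eigendecomposition, but an SVD exists

U, s, Vt = np.linalg.svd(A)              # full SVD: U is 4x4, Vt is 3x3, s holds singular values

# Rebuild the m x n diagonal matrix D from the singular values.
D = np.zeros(A.shape)
np.fill_diagonal(D, s)
print(np.allclose(A, U @ D @ Vt))        # True: A = U D V^T

# Squared singular values are the nonzero eigenvalues of A A^T (and of A^T A).
eigvals = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
print(np.allclose(eigvals[:3], s**2))    # True (the remaining eigenvalue is ~0)
```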

Moore-Penrose Pseudoinverse

Matrix inversion is not defined for matrices that are not square. Suppose we want to solve $Ax = y$ for a matrix $A \in \mathbb{R}^{m \times n}$:

  • if $m > n$ (more rows than columns), it is possible that there are no solutions of $Ax = y$;
  • if $m < n$ (more columns than rows), there may be multiple solutions of $Ax = y$.

The Moore-Penrose pseudoinverse of $A$ is defined as the matrix $A^{+} = V D^{+} U^\top$, where $U$, $D$, and $V$ are the singular value decomposition of $A$. The pseudoinverse $D^{+}$ of the diagonal matrix $D$ is obtained by taking the reciprocal of its nonzero elements, then taking the transpose of the resulting matrix. When $A$ has more columns than rows, solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $x = A^{+} y$ with minimal Euclidean norm $\|x\|_2$ among all possible solutions. When $A$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $x$ for which $Ax$ is as close as possible to $y$ in terms of the Euclidean norm $\|Ax - y\|_2$.
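A hedged sketch of how the pseudoinverse behaves in the two regimes, using `np.linalg.pinv` on randomly generated systems (my own example, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide system (more columns than rows): infinitely many solutions.
A = rng.standard_normal((2, 4))
y = rng.standard_normal(2)
x = np.linalg.pinv(A) @ y
print(np.allclose(A @ x, y))                 # True: an exact, minimum-norm solution

# Tall system (more rows than columns): usually no exact solution.
B = rng.standard_normal((5, 2))
z = rng.standard_normal(5)
w = np.linalg.pinv(B) @ z                    # least-squares solution
w_ls, *_ = np.linalg.lstsq(B, z, rcond=None) # same answer via lstsq
print(np.allclose(w, w_ls))                  # True
```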

The Trace Operator

The trace operator gives the sum of all the diagonal entries of a matrix:

$$\text{Tr}(A) = \sum_i A_{i,i}.$$

The trace operator provides an alternative way of writing the Frobenius norm of a matrix: $\|A\|_F = \sqrt{\text{Tr}(A A^\top)}$. Furthermore, the trace operator is invariant to the transpose operator: $\text{Tr}(A) = \text{Tr}(A^\top)$. Also, the trace of a product of matrices is invariant to cyclic permutations of its factors, even if the resulting products have different shapes: $\text{Tr}(ABC) = \text{Tr}(CAB) = \text{Tr}(BCA)$.
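These identities are easy to verify numerically; the following sketch (mine, not from the book) uses randomly generated matrices of compatible but different shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

# Frobenius norm via the trace operator.
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T))))   # True

# Invariance to transposition (square matrix) and to cyclic permutation.
S = rng.standard_normal((3, 3))
print(np.isclose(np.trace(S), np.trace(S.T)))                  # True
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))    # True, shapes 3x3 vs 5x5
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))    # True, shapes 3x3 vs 4x4
```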

The Determinant

The determinant of a square matrix, denoted $\det(A)$, is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, the space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation preserves volume.
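A quick numerical check of these facts (my own sketch, not from the book); the orthogonal matrix is obtained from a QR decomposition purely for convenience:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

# The determinant equals the product of the eigenvalues.
print(np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvals(A)).real))   # True

# An orthogonal matrix has |det| = 1: multiplication by it preserves volume.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
print(np.isclose(abs(np.linalg.det(Q)), 1.0))                             # True

# A singular matrix collapses space along at least one dimension: det = 0.
S = np.array([[1.0, 2.0], [2.0, 4.0]])
print(np.isclose(np.linalg.det(S), 0.0))                                  # True
```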

Principal Component Analysis

Principal component analysis, or PCA, is a technique widely used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization. PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized. This lower-dimensional space is spanned by eigenvectors obtained from the eigendecomposition of the data covariance matrix. Equivalently, PCA can also be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections.
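As a minimal sketch of the procedure described above (my own illustration with made-up toy data, not from the book), PCA can be implemented in a few lines of NumPy via the covariance eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 3-D that mostly vary along two directions.
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 3)) \
    + 0.05 * rng.standard_normal((200, 3))

# Center the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)

# Eigendecomposition of the covariance; the eigenvectors with the largest
# eigenvalues span the principal subspace (maximum projected variance).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]            # keep the top-2 principal directions

Z = Xc @ W                           # projected (compressed) representation
X_hat = Z @ W.T + X.mean(axis=0)     # reconstruction back in the original space

# Mean squared reconstruction error (the quantity PCA minimizes) is small here.
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```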

Written on January 10, 2021