Reproducing Kernel Hilbert Spaces from a Feature Learning Perspective

We will call a function $\varphi: \mathcal{X} \to V_{feat}$ a feature function from the data domain $\mathcal{X}$ to a feature space $V_{feat}$, where $V_{feat}$ is a vector space, if $\varphi(x)$ encodes the semantics of $x$ in a mathematically meaningful way. Mostly, we'll deal with a finite-dimensional $V_{feat} \subseteq \mathbb{R}^d,\ d \in \mathbb{N}$, though in general we only need $V_{feat} \subseteq \ell^2$, where $\ell^2$ is the space of square-summable sequences.

With a good feature function, we can use the resulting features to build prediction models. For example, given two pictures of cats $c_1, c_2$, we'll want $\| \varphi(c_1) - \varphi(c_2) \|_{V_{feat}}$ to be small so that $c_1$ and $c_2$ are close together in feature space. Simultaneously, for photos of a cat and a dog $c, d$, we'll want $\| \varphi(c) - \varphi(d) \|_{V_{feat}}$ to be large. If we have a feature function that places cats close to cats and dogs close to dogs while keeping cats and dogs distant, we can build a linear classifier on top of these features.

It’s common to think of neural networks as the composition of features with a linear layer: a classification network $f$ can be written as a learned feature function $\varphi$ followed by a linear classifier. In the neural network case, $V_{feat} \subseteq \mathbb{R}^d$, and typically $V_{feat}$ is a proper subset of $\mathbb{R}^d$ due to the nature of the non-linearities used. Incorporating the bias into the weights, a network can be written as

\[f(x) = w^T\varphi(x), \quad w \in \mathbb{R}^d\]
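
As a concrete illustration, here is a minimal numpy sketch of this decomposition. The layer sizes and random weights are hypothetical stand-ins for a trained network; the point is only that $f$ is a linear readout of the features $\varphi(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and random weights standing in for a trained network
# on R^3 with d = 16 features.
W1 = rng.normal(size=(16, 3))   # first-layer weights
b1 = rng.normal(size=16)        # first-layer bias
w = rng.normal(size=16)         # the final linear layer

def phi(x):
    """The learned feature function: everything before the last linear layer."""
    return np.maximum(W1 @ x + b1, 0.0)   # ReLU features, so V_feat is a proper subset of R^16

def f(x):
    """The full network is just a linear functional of the features."""
    return w @ phi(x)

x = rng.normal(size=3)
print(f(x))   # f(x) = w^T phi(x)
```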

We then observe that kernel machines, of which SVMs are the most popular example [1], have the same functional form. The difference is that they place an additional constraint on the weights $w$, namely that $w$ lies in the completion of $\text{span}\{\varphi(x) : x \in \mathcal{X} \}$, a.k.a. $V_{feat}$, and that the feature function $\varphi$ is induced by a positive semi-definite (PSD) kernel rather than being learned. Noticing this similarity could turn out to be quite useful. PSD kernels have a long history [2] and some well-studied and powerful math behind them. So what I hope to do here is to introduce some of this theory from the perspective of feature learning and the feature spaces induced by PSD kernels.

Reproducing Kernel Hilbert Spaces

First off, reproducing kernel Hilbert spaces (RKHS) are Hilbert spaces. Hilbert spaces generalize the properties of Euclidean space which we use to model the world of everyday life. For this reason, Hilbert spaces have intuitive and simple properties while remaining useful and expressive.

A Hilbert space $\mathbb{H}$ is an inner product space where every Cauchy sequence $\{f_n\}_{n=1}^\infty$ converges to some element $f^* \in \mathbb{H}$. The space of square-integrable functions $L_2[0,1]$ is the typical example. This space consists of all functions $f: [0,1] \to \mathbb{R}$ such that

\[\int_0^1 f^2(x) dx < \infty.\]

The inner product associated with $L_2[0,1]$ is

\[\int_0^1 f(x)g(x) dx, \quad \text{for all $f,g \in L_2[0,1]$}.\]

We now need to introduce the notion of a positive semidefinite (PSD) function. To define PSD functions, we first define PSD matrices. Recall that a matrix $M \in \mathbb{R}^{n\times n}$ is positive semidefinite (PSD) if for all $\alpha \in \mathbb{R}^n$

\[\alpha^TM \alpha \geq 0.\]

Then a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is PSD if for all $n \in \mathbb{N}$ and all finite collections $\{x_i\}_{i=1}^n \subset \mathcal{X}$, the matrix given by $G_{ij} := K(x_i, x_j)$ is PSD. $G$ is known as the Gram matrix.

A kernel is defined as a PSD function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ mapping pairs of points in the data space to the real line. Images, word sequences, and probability distributions are all examples of valid data spaces.

Linear Kernel

The simplest example of a PSD kernel on $\mathbb{R}^d$ is the linear kernel. The linear kernel $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is simply

\[K(x,z) = \langle x , z \rangle.\]

We can verify positive semidefiniteness by letting $\{x_i\}_{i=1}^n$ be an arbitrary collection of points in $\mathbb{R}^d$ and considering the Gram matrix $G_{ij} := \langle x_i, x_j \rangle$. Letting $\alpha$ be any vector in $\mathbb{R}^n$, we see

\[\alpha^T G \alpha = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i , x_j \rangle = \| \sum_{i = 1}^n \alpha_i x_i \|_2^2 \geq 0,\]

and conclude that $K$ is PSD.
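
Here is a quick numerical sanity check of the same argument, with made-up points in $\mathbb{R}^5$: the quadratic form is non-negative and the Gram matrix has no negative eigenvalues (up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary collection of n = 50 points in R^5 and the linear-kernel
# Gram matrix G_ij = <x_i, x_j>.
X = rng.normal(size=(50, 5))
G = X @ X.T

# Check 1: the quadratic form is non-negative for a random alpha
# (it equals || sum_i alpha_i x_i ||^2).
alpha = rng.normal(size=50)
print(alpha @ G @ alpha >= 0)

# Check 2: all eigenvalues of the symmetric Gram matrix are non-negative,
# up to floating-point error.
print(np.linalg.eigvalsh(G).min() >= -1e-8)
```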

Feature Kernel

A much more exciting PSD kernel is what we’ll call the feature kernel. Given a feature function $\varphi: \mathcal{X} \to \mathbb{R}^d$, we can define a kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ by

\[K(x,z) = \langle \varphi(x) , \varphi(z) \rangle.\]

We can see that it is indeed PSD in the same manner as above. Let $\{x_i\}_{i=1}^n$ be an arbitrary collection of points in $\mathcal{X}$ and consider the Gram matrix $G_{ij} := K(x_i, x_j)$. Then for any $\alpha \in \mathbb{R}^n$,

\[\alpha^T G \alpha = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle \varphi(x_i) , \varphi(x_j) \rangle = \| \sum_{i = 1}^n \alpha_i \varphi(x_i) \|_2^2 \geq 0.\]

It’s worth noting that the kernel isn’t uniquely associated with the feature function $\varphi$. For example, if $U \in \mathbb{R}^{d\times d}$ is an orthogonal matrix then $K(x,z) = \langle U \varphi(x), U \varphi(z) \rangle = \langle \varphi(x), \varphi(z)\rangle$.
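
A quick numerical check of this non-uniqueness, using a made-up feature map on $\mathbb{R}^2$ and a random orthogonal $U$:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """A hypothetical feature function on R^2, mapping into R^4."""
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2])

# A random orthogonal matrix U, obtained from the QR decomposition of a Gaussian matrix.
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))

x, z = rng.normal(size=2), rng.normal(size=2)

K_phi = phi(x) @ phi(z)                # K(x, z) = <phi(x), phi(z)>
K_Uphi = (U @ phi(x)) @ (U @ phi(z))   # the same kernel built from the rotated features U phi

print(np.isclose(K_phi, K_Uphi))       # True: phi and U phi induce the same kernel
```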

Interpreting the Feature Kernel

An inner product measures the alignment of its two inputs — adjusting for scale. This leads to the equality

\[\langle x, z \rangle = \|x\| \|z\| \cos(\theta)\]

where $\theta$ is the angle between $x,z$. So we see that a kernel measures the similarity of its inputs with respect to the feature space(s) induced by the kernel. This is the main point I want to make.
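
To make the geometry explicit, here is a small sketch (with an arbitrary, made-up feature map) showing that normalizing the kernel by $\sqrt{K(x,x)K(z,z)}$ recovers the cosine of the angle between $\varphi(x)$ and $\varphi(z)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """A made-up feature map; any feature function would do here."""
    return np.concatenate([x, x ** 2])

def K(x, z):
    return phi(x) @ phi(z)

x, z = rng.normal(size=8), rng.normal(size=8)

# Normalizing the kernel by the feature norms recovers cos(theta) in feature space.
cos_theta = K(x, z) / np.sqrt(K(x, x) * K(z, z))
print(cos_theta)   # lies in [-1, 1]; values near 1 mean the features of x and z are aligned
```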

Reproducing Property + Other Remarks

Having introduced kernels, we can justify the "reproducing" part of the name. Given any PSD function $K$ on $\mathcal{X} \times \mathcal{X}$, we can define a Hilbert space of functions $\mathbb{H}$ such that

\[\langle K(\cdot,x), f \rangle_\mathbb{H} = f(x), \quad \text{for all } f \in \mathbb{H},\ x \in \mathcal{X}.\]

This identity is known as the reproducing property. You can show that a kernel is uniquely associated with a Hilbert space where the reproducing property is satisfied. These Hilbert spaces are called reproducing kernel Hilbert spaces (RKHS).

RKHS

We can define an RKHS in terms of its properties.

Let $\mathcal{X}$ be a non-empty set and $\mathbb{H}$ a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$. Then $\mathbb{H}$ is called a reproducing kernel Hilbert space with inner product $\langle\cdot,\cdot\rangle_\mathbb{H}$ and norm $\| \cdot \|_\mathbb{H} = \sqrt{\langle \cdot,\cdot \rangle_\mathbb{H}}$ if there exists a PSD kernel function $K:\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

  • $\langle K(x,\cdot),f \rangle_\mathbb{H} = f(x)$ for all $x \in \mathcal{X}$ and all $f \in \mathbb{H}$ (the reproducing property)
  • $\mathbb{H} = \overline{\text{span}\{ K(\cdot,x) : x \in \mathcal{X} \}}$, where $\overline{A}$ denotes the completion of a set $A$

This definition helps characterize which functions $f$ might belong to some RKHS $\mathbb{H}$. Roughly, any function $f \in \mathbb{H}$ must be at least as smooth as the kernel (where smoothness is measured by the RKHS norm), since it is either a linear combination of partially evaluated kernels or the limit of such combinations. The reproducing property also shows that any kernel may be written as a feature kernel where $\varphi(x) := K(x,\cdot)$ and $V_{feat} := \mathbb{H}$. We see

\[K(x,z) = \langle K(x,\cdot),K(z,\cdot) \rangle_\mathbb{H}.\]
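
To make these objects concrete, here is a small sketch using the Gaussian (RBF) kernel as a stand-in (any PSD kernel would do). It builds a function as a finite combination of partially evaluated kernels, evaluates it through the reproducing property, and computes its RKHS norm from the Gram matrix of the centers.

```python
import numpy as np

rng = np.random.default_rng(0)

def K(x, z, lengthscale=1.0):
    """Gaussian (RBF) kernel on R^d, a standard example of a PSD kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * lengthscale ** 2))

# A member of the RKHS: a finite linear combination of partially evaluated kernels.
centers = rng.normal(size=(5, 2))   # the points x_i
alpha = rng.normal(size=5)          # the coefficients alpha_i

def f(x):
    # By the reproducing property, f(x) = <K(x, .), f>_H = sum_i alpha_i K(x, x_i).
    return sum(a * K(x, c) for a, c in zip(alpha, centers))

# Expanding <f, f>_H with the reproducing property gives the RKHS norm:
# ||f||_H^2 = sum_ij alpha_i alpha_j K(x_i, x_j) = alpha^T G alpha.
G = np.array([[K(c1, c2) for c2 in centers] for c1 in centers])
print(np.sqrt(alpha @ G @ alpha))   # the RKHS norm of f, i.e. its "smoothness"
print(f(np.zeros(2)))               # evaluating f at a new point
```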

An alternative definition of an RKHS — and one that is particularly relevant to learning — is as a Hilbert space of functions where all evaluation functionals $\text{eval}_x[f] = f(x)$ are bounded (and therefore continuous). In that case, we may appeal to the Riesz Representation Theorem for a characterization of the kernel.

Riesz Representation Theorem

Let $L$ be a bounded, linear functional on a Hilbert space $\mathbb{H}$. Then there exists a unique $g \in \mathbb{H}$ such that $L[f] = \langle f, g \rangle_\mathbb{H}$ for all $f \in \mathbb{H}$. We call $g$ the representer of $L$.

Then, the partially evaluated kernel $K(\cdot,x)$ is the representer for the bounded and continuous evaluation functional at the point $x$. This holds for all $x \in \mathcal{X}$.

Why use an RKHS?

RKHS are very natural spaces to use when doing machine learning, that is, when fitting functions to data. While we must fit a function to approximate a finite dataset, what we truly care about is fitting the underlying distribution. So, if we measure the quality of a fit by a function norm, we had better have norm convergence imply pointwise convergence, or we could drive the norm error to zero yet still mispredict at individual points. Due to the boundedness and continuity of evaluation in an RKHS, it turns out that RKHS are exactly these spaces. This can be seen in a simple example.

Let $\{f_j\}$ be a sequence of functions converging to $f^*$ in the RKHS $\mathbb{H}$. Then, for any $x \in \mathcal{X}$, the reproducing property and Cauchy-Schwarz give

\[\begin{split} |f_j(x) - f^*(x)| & = | \langle K(x,\cdot),f_j \rangle_\mathbb{H} - \langle K(x,\cdot),f^* \rangle_\mathbb{H}| \\ & = |\langle K(x,\cdot),f_j - f^* \rangle_\mathbb{H}| \\ & \leq \|K(x,\cdot)\|_\mathbb{H} \ \|f_j - f^*\|_\mathbb{H}\to 0. \end{split}\]

Conversely, we can see the consequences of using a more permissive function space. It turns out that $L^2[0,1]$, the space of square-integrable functions, is not an RKHS. Take $L^2[0,1]$ to be our hypothesis space equipped with the usual norm, assume that $f^*(x) = 0$ for all $x$, and consider the iterates $f_n(x) = x^n,\ n \in \mathbb{N}$. Then

\[\|f_n \|_{L^2[0,1]}^2 = \int_0^1 x^{2n} dx = \frac{1}{2n+1}\]

so we see that $f_n \to f^*$ in $L^2[0,1]$. Yet $f_n(1) = 1$ for all $n \in \mathbb{N}$, so $\{f_n\}_{n \in \mathbb{N}}$ does not converge pointwise to $f^*$.
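
A few lines of Python make the mismatch explicit: the squared norm $1/(2n+1)$ shrinks to zero while the value at $x = 1$ stays fixed at one.

```python
# The squared L^2 norm of f_n(x) = x^n on [0, 1] is exactly 1 / (2n + 1), which
# vanishes as n grows, yet f_n(1) = 1 for every n, so f_n does not converge
# pointwise to f* = 0 at x = 1.
for n in [1, 10, 100, 1000]:
    norm_sq = 1.0 / (2 * n + 1)   # int_0^1 x^{2n} dx
    value_at_one = 1.0 ** n       # f_n(1)
    print(n, norm_sq, value_at_one)
```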

References

[1] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002. ISBN: 0262194759.

[2] James Stewart. Positive definite functions and generalizations, a historical survey. Rocky Mountain Journal of Mathematics, 6(3), 1976. DOI: 10.1216/rmj-1976-6-3-409.