Some recent work has focused on obtaining explicit representations $z(x) \in \mathbb{R}^D$ of points $x \in \mathbb{R}^d$ given a (specific class of) kernel function $K\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ [2,4]. The interesting results are of the following kind: $ | \langle z(x), z(y) \rangle - K(x,y) | < \epsilon $ with high probability. These results have focused on obtaining guarantees of this sort, and generally $D = \Omega \left( \frac{d}{\epsilon^2} \right)$. Both these papers assume that we have explicit access to the input space $\mathbb{R}^d$ (although it might be possible to extend their approach to the situation where there is no access to the actual $x \in \mathbb{R}^d$). What I am concerned with is the setting where there is no such explicit access; the questions are listed below, after a quick sketch of what such a random feature map looks like.
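To fix ideas, here is a minimal sketch of such a random feature map for the Gaussian kernel, in the style of Rahimi and Recht [2]. This is only my own illustration of the flavor of the construction, not code from the papers; the function name and parameters are mine.

```python
import numpy as np

def gaussian_rff(X, D, sigma=1.0, rng=None):
    """Random Fourier features z(x) with E[<z(x), z(y)>] = exp(-||x - y||^2 / (2 sigma^2)).

    X is an (n, d) array of input points; D is the dimension of the feature map.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Frequencies drawn from the Fourier transform of the Gaussian kernel, plus random phases.
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Sanity check: the maximum approximation error shrinks as D grows.
X = np.random.default_rng(0).normal(size=(5, 10))
Z = gaussian_rff(X, D=4000, rng=1)
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(Z @ Z.T - exact).max())
```

With that picture in mind, the questions: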
- Given an $n \times n$ similarity matrix for some set of objects, with no access to the actual points $x$, and assuming that the similarity matrix is positive (semi-)definite (effectively, the matrix is obtained by applying a positive (semi-)definite kernel function to pairs of objects):
    - Can we obtain an embedding of the points in some $\mathbb{R}^D$? (A sketch of the standard eigendecomposition construction is given after this list.)
    - If so, what is the lower bound on $D$ needed to get constant probability guarantees that the inner products of the points in this embedding are $\epsilon$-close to the values in the kernel matrix? This lower bound will probably depend on the properties of the kernel function.
    - What can be done about new points that were not part of the original set?
    - How would the results change if the embeddings are instead sought on the $D$-dimensional Hamming cube rather than $\mathbb{R}^D$ (albeit with appropriate scaling)? (See the sign random projection sketch below.)
- Similar to the previous setting, except that the similarity matrix is now effectively obtained from a good similarity function as defined by Balcan & Blum [1]. Again, we seek a constant probability, additive $\epsilon$-approximation of the similarity values, and the lower bounds on $D$ should now depend on the properties of the similarity function. This would be a more general result than the case of positive (semi-)definite kernel functions.
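On the first question, the existence part has a standard affirmative answer: since the matrix is positive semi-definite, an eigendecomposition $K = V \Lambda V^{\top}$ gives an exact embedding $Z = V \Lambda^{1/2}$ with $D = \mathrm{rank}(K) \le n$, and truncating to the top $D$ eigenpairs gives the best rank-$D$ approximation in Frobenius norm (though not immediately an entrywise $\epsilon$ guarantee). A minimal sketch, with a function name of my own choosing:

```python
import numpy as np

def embedding_from_gram(K, D=None, tol=1e-10):
    """Return Z whose rows are points in R^D with Z @ Z.T ~= K.

    With D = rank(K) (the default) the reconstruction is exact up to round-off;
    a smaller D keeps only the top-D eigenpairs (the best rank-D approximation
    of K in Frobenius norm).
    """
    evals, evecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    evals = np.clip(evals, 0.0, None)       # clip tiny negatives from round-off
    order = np.argsort(evals)[::-1]         # re-sort in descending order
    evals, evecs = evals[order], evecs[:, order]
    if D is None:
        D = int(np.sum(evals > tol))        # numerical rank of K
    return evecs[:, :D] * np.sqrt(evals[:D])
```

For the new-points question, the usual device is the Nyström extension: a new point $x$ with similarity vector $k_x$ to the $n$ base objects is mapped to $\Lambda_D^{-1/2} V_D^{\top} k_x$. Note, though, that this needs the similarities of $x$ to the base objects, so it is again a question of oracle access.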
These results would give a lower bound on the dimensionality of the space into which the points need to be embedded to preserve the pairwise similarities, i.e., $|\langle z(x), z(y) \rangle - s(x,y)| < \epsilon$.
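For the Hamming cube variant, one natural candidate (a sketch under my own assumptions, not a claim about the right answer) is sign random projections applied to the rows of a real-valued embedding such as the one above. The normalized Hamming distance between two codes concentrates around the angle between the embedded points divided by $\pi$, so this recovers the normalized similarity $K_{ij}/\sqrt{K_{ii} K_{jj}}$ rather than the raw inner product; the function names are mine.

```python
import numpy as np

def simhash_codes(Z, D, rng=None):
    """One D-bit code (entries in {0, 1}) per row of Z, via D random hyperplanes."""
    rng = np.random.default_rng(rng)
    W = rng.normal(size=(Z.shape[1], D))
    return (Z @ W >= 0).astype(np.int8)

def cosines_from_codes(codes):
    """Estimate pairwise cosine similarities from normalized Hamming distances."""
    ham = np.abs(codes[:, None, :] - codes[None, :, :]).mean(axis=2)
    return np.cos(np.pi * ham)   # Pr[a bit differs] equals angle / pi
```

By a Chernoff bound, $D = O(1/\epsilon^2)$ bits per point give an additive $\epsilon$ estimate of the angle for any fixed pair with constant probability, which at least suggests the Hamming cube question has a similar flavor to the real-valued one.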
Active learning-esque. More interesting is the following situation:
- Consider the situation where there is no initial access to the $n \times n$ similarity matrix, but its entries can be obtained from an oracle on a query-by-query basis. This gives the problem an active learning flavor: you can ask the oracle for the similarity of any pair of your choice from the set of $n$ objects.
- The first question is that of existence: is it possible to obtain a constant probability, additive $\epsilon$-approximation embedding while querying fewer than all $n^2$ entries of the matrix? (The Nyström sketch after this list is one partial answer, under a low-rank restriction.)
- If this is (somehow) possible, perhaps under restrictions, what is the relationship between the total number of queries required and the minimum number of dimensions needed to embed the objects while preserving the similarities as dot products?
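As a hedged illustration of why the existence answer should be "yes" at least for (approximately) low-rank similarity matrices, here is a Nyström-style sketch that uses roughly $nm$ oracle queries for $m$ randomly chosen landmark objects. The oracle interface `query(i, j)` and the function name are assumptions of mine, and the approximation guarantee needs conditions (e.g., low rank or incoherence) that are exactly the kind of restriction the question asks about.

```python
import numpy as np

def nystrom_embedding(query, n, m, rng=None, tol=1e-10):
    """Embed n objects using only their similarities to m landmark objects.

    `query(i, j)` is assumed to return the similarity of objects i and j;
    the total number of oracle calls is n * m instead of n^2.
    """
    rng = np.random.default_rng(rng)
    landmarks = rng.choice(n, size=m, replace=False)
    # Similarities of every object to the landmarks (the n x m block C).
    C = np.array([[query(i, j) for j in landmarks] for i in range(n)])
    W = C[landmarks, :]                      # the m x m landmark block
    evals, evecs = np.linalg.eigh(W)
    keep = evals > tol                       # pseudo-inverse: drop tiny eigenvalues
    Z = C @ evecs[:, keep] / np.sqrt(evals[keep])
    return Z                                 # Z @ Z.T ~= K when K is close to rank m
```

How far this can be pushed beyond the nearly low-rank case, and how the query count must scale with the target dimension $D$, is precisely the relationship the second question asks about.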
I am sure one of the authors of the following papers is already working on these problems, and it would be interesting to read the results when they come out.
[1] M. F. Balcan and A. Blum, On a Theory of Learning with Similarity Functions, ICML 2006
[2] A. Rahimi and B. Recht, Random Features for Large Scale Kernel Machines, NIPS 2007
[3] P. Kar and P. Jain, Similarity-based Learning via Data driven Embeddings, NIPS 2011
[4] P. Kar and H. Karnick, Random Feature Maps for Dot-Product Kernels, AISTATS 2012