Tuesday, October 30, 2012

Lower bounds for representations with general similarity functions

Some recent work has focused on obtaining explicit representations $z(x) \in \mathbb{R}^D$ of points $x \in \mathbb{R}^d$ given a (specific class of) kernel function $K\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ [2,4]. The interesting results are of the following kind: $|\langle z(x), z(y) \rangle - K(x,y)| < \epsilon$ with high probability. These results focus on obtaining guarantees of this sort, and generally $D = \Omega\left(\frac{d}{\epsilon^2}\right)$. Both of these papers assume explicit access to the input space $\mathbb{R}^d$ (although it might be possible to extend their approach to the situation where there is no access to the actual $x \in \mathbb{R}^d$). What I am concerned with are the following questions:

  1. Given an $n \times n$ similarity matrix for some set of objects, with no access to the actual points $x$, and assuming that the similarity matrix is positive (semi)definite (effectively, the matrix is obtained by applying a positive (semi)definite kernel function to the objects):
    • Can we obtain an embedding of the points in some $\mathbb{R}^D$? (A minimal construction is sketched after this list.)
    • If so, what is the lower bound on $D$ needed to guarantee, with constant probability, that the inner products of the points in this embedding are $\epsilon$-close to the values in the kernel matrix? This lower bound will probably depend on the properties of the kernel function.
    • What can be done about new points?
    • How would the results change if the embeddings were instead sought on the $D$-dimensional Hamming cube rather than in $\mathbb{R}^D$ (with appropriate scaling)?
  2. Similar to the previous question, except that the similarity matrix is now effectively obtained from a good similarity function as defined by Balcan & Blum [1]. Again, we seek a constant probability, additive $\epsilon$-approximation of the similarity values. The lower bound on $D$ should now depend on the properties of the similarity function. This would be more general than the result for positive (semi)definite kernel functions.
These results would give a lower bound on the dimensionality of the space into which the points need to be embedded to preserve the pairwise similarities $(|\langle z(x), z(y) \rangle - s(x,y)| < \epsilon)$.
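
A minimal sketch of the construction behind the first bullet above (essentially classical MDS / kernel PCA without centering, under the assumption that the matrix is PSD): eigendecompose the similarity matrix and keep the top $D$ components. By Eckart-Young this truncation is the best rank-$D$ approximation in Frobenius norm, so it gives the natural upper-bound construction against which any lower bound on $D$ would be compared. The helper name and the toy RBF kernel matrix below are mine, purely for illustration.

```python
import numpy as np

def embed_from_gram(S, D):
    """Embed n objects into R^D from an n x n PSD similarity matrix S,
    so that <z_i, z_j> approximates S[i, j] (classical MDS-style)."""
    eigvals, eigvecs = np.linalg.eigh(S)      # symmetric eigendecomposition (ascending order)
    idx = np.argsort(eigvals)[::-1][:D]       # indices of the top-D eigenvalues
    lam = np.clip(eigvals[idx], 0.0, None)    # clip tiny negatives caused by round-off
    return eigvecs[:, idx] * np.sqrt(lam)     # rows are the embeddings z_i in R^D

# Toy check: an RBF kernel matrix on random points (the kernel choice is arbitrary).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
S = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

Z = embed_from_gram(S, D=50)
print("max |<z_i, z_j> - S_ij| =", np.abs(Z @ Z.T - S).max())
```

How small $D$ can be made for a given $\epsilon$ is then a question about how quickly the spectrum of the similarity matrix decays, which is presumably where the kernel-dependent lower bounds would enter.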

Active learning-esque. More interesting is the following situation: 
  • Consider the situation where there is no initial access to the $n \times n$ similarity matrix, but its entries can be obtained from an oracle on a query-by-query basis. This brings an active learning flavor to the problem: you can ask the oracle for the similarity of any pair of your choice from the set of $n$ objects.
    • The first question is that of existence -- is it possible to obtain a constant probability $\epsilon$-approximate embedding while querying fewer than all $n^2$ entries of the matrix? (A Nyström-style sketch follows this list.)
    • If this is (somehow) possible (under restrictions), what is the relationship between the total number of queries required and the minimum number of dimensions needed to embed the objects while preserving the similarities as dot products?
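
On the existence question, one known partial answer (under restrictions such as fast spectral decay, and assuming the matrix is PSD) is the Nyström approximation: sample $m$ landmark objects, query only the similarities involving them (roughly $nm$ entries instead of $n^2$), and extend through the pseudoinverse of the $m \times m$ landmark block. A rough sketch, where `sim(i, j)` is a hypothetical query oracle:

```python
import numpy as np

def nystrom_embedding(sim, n, m, D, rng):
    """Build a D-dimensional embedding from only ~n*m oracle queries
    (Nystrom approximation of a PSD similarity matrix)."""
    landmarks = rng.choice(n, size=m, replace=False)
    # C holds the similarities between every object and the m landmarks (n*m queries).
    C = np.array([[sim(i, j) for j in landmarks] for i in range(n)])
    W = C[landmarks, :]                          # m x m landmark block
    eigvals, eigvecs = np.linalg.eigh(W)
    idx = np.argsort(eigvals)[::-1][:D]          # keep the top-D components of W
    lam = np.clip(eigvals[idx], 1e-12, None)
    # Nystrom extension: Z = C U_D diag(lam_D)^{-1/2}, so Z Z^T ~ C W_D^+ C^T.
    return C @ eigvecs[:, idx] / np.sqrt(lam)

# Example with an oracle backed by a hidden PSD matrix S (for testing only).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
S = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Z = nystrom_embedding(lambda i, j: S[i, j], n=300, m=60, D=40, rng=rng)
print("max |<z_i, z_j> - S_ij| =", np.abs(Z @ Z.T - S).max())
```

The quality of this approximation depends heavily on the spectrum of the matrix, so it only settles existence "under restrictions"; the trade-off between the number of queries and the minimum $D$ asked about above is exactly what remains open.
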
I am sure one of the authors of the following papers is already working on this problem, and it would be interesting to read the results when they come out.



[1] M. F. Balcan and A. Blum, On a Theory of Learning with Similarity Functions, ICML 2006
[2] A. Rahimi and B. Recht, Random Features for Large Scale Kernel Machines, NIPS 2007
[3] P. Kar and P. Jain, Similarity-based Learning via Data driven Embeddings, NIPS 2011
[4] P. Kar and H. Karnick, Random Feature Maps for Dot-Product Kernels, AISTATS 2012

Thursday, September 20, 2012

GMM estimation with regularization

I was discussing the selection of the number of Gaussians $k$ in a Gaussian Mixture Model (GMM) with a colleague. Formally, the GMM is represented as: $$ f(x) = \sum_{i = 1}^k w_i f_i(x) $$ where $f(x)$ is the density estimate at $x$, $f_i$ is the density of $\mathcal{N}(\mu_i, \sigma_i^2)$, a Gaussian with mean $\mu_i$ and variance $\sigma_i^2$, and $w_i$ is the weight of the $i^{\text{th}}$ Gaussian, with $\sum_i w_i = 1$.
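
Just to pin down the notation (this is nothing beyond the formula above), the mixture density can be evaluated as follows; the function name is only illustrative:

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, weights, means, sigmas):
    """Evaluate f(x) = sum_i w_i * N(x; mu_i, sigma_i^2); the weights must sum to 1."""
    x = np.atleast_1d(x)
    return sum(w * norm.pdf(x, loc=mu, scale=sigma)
               for w, mu, sigma in zip(weights, means, sigmas))

print(gmm_density([0.0, 1.0], weights=[0.3, 0.7], means=[0.0, 2.0], sigmas=[1.0, 0.5]))
```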

Usual parametric estimation involves picking a $k$ and then finding the parameters by maximizing the likelihood $$\max_\theta L(\theta; x_1, \ldots, x_n)$$ over the data, where $\theta = \{ w_i, \mu_i, \sigma_i \}_{i=1}^k$. To the best of my knowledge there is no single standard way to pick $k$; criteria such as AIC and BIC can be used to choose an appropriate value.
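
As a concrete baseline for the AIC/BIC route, scikit-learn's `GaussianMixture` exposes both criteria, so choosing $k$ reduces to a sweep; the toy data here is made up purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 1-D data drawn from a 3-component mixture.
x = np.concatenate([rng.normal(-4.0, 1.0, 300),
                    rng.normal(0.0, 0.5, 200),
                    rng.normal(3.0, 1.0, 500)]).reshape(-1, 1)

# Fit a GMM for each candidate k and keep the one with the lowest BIC.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
       for k in range(1, 11)}
print("k chosen by BIC:", min(bic, key=bic.get))
```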

Moreover, I am sure there are methods in Bayesian nonparametrics that allow for automatic selection of $k$; I just need to read up on them. The idea I was discussing with my colleague is in the spirit of Sparse Additive Models: since the GMM is itself an additive model, consider starting with a large $k$ (possibly very close to the total number of points $n$) and selecting the number of components automatically by solving the following:
$$ \max_\theta L(\theta; x_1,\ldots, x_n) - \lambda \|\mathbf{w}\|_0 $$ where $\| \cdot \|_0$ is the $L_0$ "norm", i.e., the number of nonzero entries of $\mathbf{w}$. Remember that the $w_i$'s still have to sum to $1$.

Given this formulation, I would like to know the following:

  1. Has somebody already worked along these lines and figured it out? There has been work on GMM estimation with sparsity assumptions, but those assumptions appear to be on the covariance matrices of the component Gaussians.
  2. Is it possible to say whether this formulation will actually recover the right $k$? Can there be any form of sparsistency result here?
  3. Is it possible to come up with a scalable algorithm to solve the $L_0$-regularized GMM? An exact solution might be hard unless we try all possible values of $k$ and see which one achieves the best penalized objective (a brute-force sketch of this sweep follows below). This assumes that the EM algorithm (say that is what we use) reaches the global maximum. Some relaxations might be applied to obtain useful algorithms.
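
For completeness, here is what the brute-force version from point 3 looks like: run EM for each candidate $k$ and keep the model with the best penalized objective. This is only a sketch of the obvious baseline (reusing scikit-learn's EM), not the scalable algorithm the question asks for, and the threshold used to count "nonzero" weights is my own choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def l0_penalized_gmm(x, k_max, lam, tol=1e-6, seed=0):
    """Brute-force search for the GMM maximizing  log-likelihood - lam * ||w||_0."""
    best_model, best_obj = None, -np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(x)
        loglik = gmm.score(x) * len(x)          # total log-likelihood of the data
        l0 = int(np.sum(gmm.weights_ > tol))    # number of "active" components
        obj = loglik - lam * l0
        if obj > best_obj:
            best_model, best_obj = gmm, obj
    return best_model

# Usage on the toy data from the previous snippet:
# model = l0_penalized_gmm(x, k_max=15, lam=10.0)
# print("selected k:", int(np.sum(model.weights_ > 1e-6)))
```
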
This kind of automatic selection might also arise in robust parametric estimation where the model (in this case $k$) is misspecified. However, I am not sure how this would work. Robust estimation has the capability of fitting the model to whatever portion of the data fits it, ignoring the rest.

Tuesday, September 11, 2012

Of phases and magnitudes

I attended a really interesting talk today by Emmanuel Candes about phase recovery for X-ray crystallography. While I will not discuss the problem itself, something he mentioned stuck with me. His (fairly believable) claim was that for any signal, the phase is far more important than the magnitude.
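
The standard demonstration of this claim (I believe it goes back to Oppenheim's phase/magnitude swapping experiments on images) is easy to reproduce numerically: keep only the Fourier phases (setting every magnitude to one), or only the magnitudes (setting every phase to zero), and see which reconstruction still resembles the original. A quick sketch with a made-up 1-D signal:

```python
import numpy as np

rng = np.random.default_rng(0)
# A structured 1-D signal: a couple of steps plus a little noise.
t = np.linspace(0.0, 1.0, 1024)
signal = (t > 0.3).astype(float) - 0.5 * (t > 0.7) + 0.1 * rng.standard_normal(t.size)

F = np.fft.fft(signal)
phase_only = np.fft.ifft(np.exp(1j * np.angle(F))).real   # unit magnitudes, true phases
magnitude_only = np.fft.ifft(np.abs(F)).real              # true magnitudes, zero phases

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print("correlation with original (phase only):    ", corr(signal, phase_only))
print("correlation with original (magnitude only):", corr(signal, magnitude_only))
```

The phase-only reconstruction keeps track of where the jumps are, while the magnitude-only one is an even, symmetric signal that has essentially forgotten where anything happens.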

If you consider a signal as a complex number, it can be represented as $r \cdot e^{i\varphi}$, where $r$ is the magnitude and $\varphi$ is the phase. Intuitively, the phase determines the direction of the signal in the complex plane. If you think of an analogy for high-dimensional (say $d$-dimensional) data, the norm of any point $p \in \mathbb{R}^d$ can be represented by a single value $\|p\|$ (one dimension), while the direction requires a $d$-dimensional unit vector $\frac{p}{\|p\|}$.

So the norm requires a single dimension while the direction requires all $d$ dimensions (in fact the unit vector lies on the sphere $S^{d-1}$, so $d-1$ coordinates suffice). Here I can think of two questions:

  1. Is there any formal analogy between the magnitude and phase of signals and the norm and direction of $d$-dimensional points? This question might be stupid in itself.
  2. Are there any useful applications (real problems) where the magnitude (or, equivalently, the norm) is as important as, if not more important than, the phase (or direction)?