I was discussing the selection of the number of Gaussians $k$ in a Gaussian Mixture Model (GMM) with a colleague. Formally, the GMM is represented as: $$ f(x) = \sum_{i = 1}^k w_i f_i(x) $$ where $f(x)$ is the density estimate at $x$, $f_i$ is the density of the Gaussian $\mathcal{N}(\mu_i, \sigma_i^2)$ with mean $\mu_i$ and variance $\sigma_i^2$, and $w_i$ is the weight of the $i^{th}$ Gaussian, with $\sum_i w_i = 1$.
Usual parametric estimation involves picking a $k$ and then finding the parameters by maximizing the likelihood $$\max_\theta L( \theta; x_1,\ldots, x_n)$$ over the data, where $\theta = \{ w_i, \mu_i, \sigma_i \}$. To the best of my knowledge there is no single standard way to pick $k$; information criteria such as AIC and BIC are commonly used to choose it.
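For concreteness, here is a minimal sketch of that information-criterion route using scikit-learn's `GaussianMixture`; the synthetic `X` is just a placeholder for real data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder data: samples from a two-component mixture in one dimension.
X = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

# Fit a GMM for each candidate k and keep the k minimizing BIC
# (use gmm.aic(X) instead of gmm.bic(X) for AIC).
candidate_ks = range(1, 8)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in candidate_ks]
best_k = candidate_ks[int(np.argmin(bics))]
print(f"BIC selects k = {best_k}")
```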
Moreover, I am sure there are methods in Bayesian nonparametrics (Dirichlet process mixtures, for example) that allow for automatic selection of $k$; I just need to read up on those. The idea I was discussing with my colleague borrows from Sparse Additive Models: the GMM is itself an additive model, so consider starting with $k$ large (possibly very close to the total number of points $n$) and letting a sparsity penalty on the weights prune the components, i.e., solving the following:
$$ \max_\theta L(\theta; x_1,\ldots, x_n) - \lambda \|\mathbf{w}\|_0 $$ where $\| \cdot \|_0$ is the $L_0$ "norm", i.e., the number of nonzero entries of $\mathbf{w}$. Remember that the $w_i$ still have to sum to $1$.
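To make the objective concrete, here is a minimal sketch of evaluating it for a one-dimensional mixture, with $L$ taken as the log-likelihood (as is usual in practice); the threshold `eps` deciding when a weight counts as zero is my own choice, not part of the formulation.

```python
import numpy as np
from scipy.stats import norm

def penalized_log_likelihood(X, w, mu, sigma, lam, eps=1e-8):
    """log L(theta; x_1, ..., x_n) - lam * ||w||_0 for a 1-D GMM.

    X            : 1-D array of samples.
    w, mu, sigma : arrays of component weights, means, standard deviations.
    eps          : weights below this threshold count as zero (a choice).
    """
    # Mixture density at each sample: sum_i w_i * N(x; mu_i, sigma_i^2).
    dens = (w[None, :] * norm.pdf(X[:, None], mu[None, :], sigma[None, :])).sum(axis=1)
    log_lik = np.log(dens).sum()
    l0 = int(np.count_nonzero(w > eps))  # ||w||_0: number of active components
    return log_lik - lam * l0
```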
Given this formulation, I would like to know the following:
- Has somebody already worked along these lines and figured it out? There has been work on GMM estimation with sparsity assumptions, but there the sparsity assumptions appear to be on the covariance matrices of the component Gaussians.
- Given this formulation, is it possible to say whether it will actually recover the right $k$? Can there be any form of sparsistency result here?
- Is it possible to come up with a scalable algorithm to solve the $L_0$-regularized GMM? An exact solution is likely hard unless we try every possible value of $k$ and keep the one with the best objective, and even that assumes the EM algorithm (say that is what we use) reaches the global maximum; a brute-force sketch along these lines appears after this list. Relaxations of the $L_0$ penalty might yield more practical algorithms.
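Here is a hedged sketch of that brute-force enumeration, using scikit-learn's EM implementation with random restarts as a stand-in for reaching the global maximum; since EM keeps every fitted weight strictly positive, $\|\mathbf{w}\|_0 = k$ for each fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_l0(X, lam, k_max=10, n_init=5, seed=0):
    """Brute-force L0 selection: fit EM for each k <= k_max and
    maximize log-likelihood - lam * k (every fitted weight is
    nonzero, so ||w||_0 equals k)."""
    best = (None, -np.inf)
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=n_init,
                              random_state=seed).fit(X)
        # score(X) returns the mean log-likelihood per sample.
        objective = gmm.score(X) * X.shape[0] - lam * k
        if objective > best[1]:
            best = (k, objective)
    return best  # (selected k, penalized objective value)
```

Note that with the log-likelihood in place of $L$, a penalty of $\lambda$ per component makes this closely related to AIC/BIC-style selection with a generic per-component penalty.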
This automatic selection might also arise in robust parametric estimation when the model (in this case $k$) is misspecified, although I am not sure how that would work. Robust estimation can fit the model to whatever portion of the data actually matches it while ignoring the rest.