I was discussing the selection of the number of Gaussians $k$ in a Gaussian Mixture Model (GMM) with a colleague. Formally, the GMM is represented as: $$ f(x) = \sum_{i = 1}^k w_i f_i(x) $$ where $f(x)$ is the density estimate at $x$, $f_i$ is the density of the Gaussian $\mathcal{N}(\mu_i, \sigma_i^2)$ with mean $\mu_i$ and variance $\sigma_i^2$, and $w_i$ is the weight of the $i^{th}$ Gaussian, with $\sum_i w_i = 1$.
Usual parametric estimation involves picking a $k$ and then finding the parameters by maximizing the likelihood $$\max_\theta L( \theta; x_1,\ldots, x_n)$$ over the data, where $\theta = \{ w_i, \mu_i, \sigma_i \}$. To the best of my knowledge there is no single standard way to pick $k$; information criteria such as AIC and BIC are commonly used to choose it.
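For concreteness, here is a minimal sketch of that information-criterion route using scikit-learn's `GaussianMixture`; the synthetic `X` is just a placeholder for real data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder data: samples from a two-component mixture in one dimension.
X = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

# Fit a GMM for each candidate k and keep the k minimizing BIC
# (use gmm.aic(X) instead of gmm.bic(X) for AIC).
candidate_ks = range(1, 8)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in candidate_ks]
best_k = candidate_ks[int(np.argmin(bics))]
print(f"BIC selects k = {best_k}")
```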
Moreover, I am sure there are methods in Bayesian nonparametrics (Dirichlet process mixtures, for example) that allow for automatic selection of $k$; I just need to read up on those. The idea I was discussing with my colleague borrows from Sparse Additive Models: the GMM is itself an additive model, so consider starting with $k$ large (possibly very close to the total number of points $n$) and letting a sparsity penalty on the weights prune the components, i.e., solving the following:
$$ \max_\theta L(\theta; x_1,\ldots, x_n) - \lambda \|\mathbf{w}\|_0 $$ where $\| \cdot \|_0$ is the $L_0$ "norm", i.e., the number of nonzero entries of $\mathbf{w}$. Remember that the $w_i$ still have to sum to $1$.
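To make the objective concrete, here is a minimal sketch of evaluating it for a one-dimensional mixture, with $L$ taken as the log-likelihood (as is usual in practice); the threshold `eps` deciding when a weight counts as zero is my own choice, not part of the formulation.

```python
import numpy as np
from scipy.stats import norm

def penalized_log_likelihood(X, w, mu, sigma, lam, eps=1e-8):
    """log L(theta; x_1, ..., x_n) - lam * ||w||_0 for a 1-D GMM.

    X            : 1-D array of samples.
    w, mu, sigma : arrays of component weights, means, standard deviations.
    eps          : weights below this threshold count as zero (a choice).
    """
    # Mixture density at each sample: sum_i w_i * N(x; mu_i, sigma_i^2).
    dens = (w[None, :] * norm.pdf(X[:, None], mu[None, :], sigma[None, :])).sum(axis=1)
    log_lik = np.log(dens).sum()
    l0 = int(np.count_nonzero(w > eps))  # ||w||_0: number of active components
    return log_lik - lam * l0
```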
Given this formulation, I would like to know the following:
- Has somebody already worked along these lines and figured it out? There has been work on GMM estimation with sparsity assumptions, but there the sparsity assumptions appear to be on the covariance matrices of the component Gaussians.
- Given this formulation, is it possible to say whether it will actually recover the right $k$? Can there be any form of sparsistency result here?
- Is it possible to come up with a scalable algorithm to solve the $L_0$-regularized GMM? An exact solution is likely hard unless we try every possible value of $k$ and keep the one with the best objective, and even that assumes the EM algorithm (say that is what we use) reaches the global maximum; a brute-force sketch along these lines appears after this list. Relaxations of the $L_0$ penalty might yield more practical algorithms.
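Here is a hedged sketch of that brute-force enumeration, using scikit-learn's EM implementation with random restarts as a stand-in for reaching the global maximum; since EM keeps every fitted weight strictly positive, $\|\mathbf{w}\|_0 = k$ for each fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_l0(X, lam, k_max=10, n_init=5, seed=0):
    """Brute-force L0 selection: fit EM for each k <= k_max and
    maximize log-likelihood - lam * k (every fitted weight is
    nonzero, so ||w||_0 equals k)."""
    best = (None, -np.inf)
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=n_init,
                              random_state=seed).fit(X)
        # score(X) returns the mean log-likelihood per sample.
        objective = gmm.score(X) * X.shape[0] - lam * k
        if objective > best[1]:
            best = (k, objective)
    return best  # (selected k, penalized objective value)
```

Note that with the log-likelihood in place of $L$, a penalty of $\lambda$ per component makes this closely related to AIC/BIC-style selection with a generic per-component penalty.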
This automatic selection might also arise in robust parametric estimation when the model (in this case $k$) is misspecified, although I am not sure how that would work. Robust estimation can fit the model to whatever portion of the data actually matches it while ignoring the rest.