Tuesday, February 17, 2015

Vanishing component analysis II: Applications and dreams

Continuing with my discussion here, I wish to understand (and think about) some applications of the vanishing ideal. The authors (Roi Livni et al., 2013 [1]) present a very useful application to classification (binary and, by extension, multi-class). Consider a labeled dataset $S = \{ (x_i, y_i), i = 1, \ldots, n \}$ such that $x_i \in \mathbb{R}^d$ and $y_i \in \{ 1, \ldots, L \}$. Then the authors suggest the following:


  • Split $S$ into $S_l = \{ x \colon (x, y) \in S , y = l \}, l = 1, \ldots, L$  (the set of points with label $l$).
  • Find the set of generators $F(S_l)$ of the vanishing ideal of $S_l$ for each label $l$.
  • For every point $x \in \mathbb{R}^d$ such that $(x, y) \in S$, create a new representation $\widetilde{x} = [ |f_1^1(x)|, |f_2^1(x)|, \ldots, |f_1^l(x)|, |f_2^l (x)|, \ldots ] $, where $f_i^l$  is the $i^{th}$ generator in the set $F(S_l)$ for label $l$.
Given this representation, the classes in the training points are linearly separable, and a simple perceptron or a linear support vector machine can be used to learn what is effectively a nonlinear classifier (a nonlinear transformation followed by a linear classifier). There is an assumption needed for the linear separability -- each set $S_l$ is a subset of an algebraic set $U_l$, and these algebraic sets $U_l, l = 1,\ldots,L$ do not intersect with each other $ \left( U_{l} \cap U_{l'} = \emptyset,\ l \not= l' \right)$. However, the authors note that "it is enough to choose those polynomials that generate algebraic sets $U_l, l = \{1, \ldots, L\}$ whose intersection is empty". 
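A minimal sketch of this pipeline, assuming some VCA routine is available (the `vca` function in the commented usage is a hypothetical stand-in that returns the generators of the vanishing ideal of a set of points as callables):

```python
import numpy as np
from sklearn.svm import LinearSVC

def vca_feature_map(X, generators_per_class):
    """Map each x to [ |f(x)| for every generator f of every class ].

    `generators_per_class` is a list (one entry per label) of lists of
    callables f: R^d -> R -- the generators F(S_l)."""
    return np.array([[abs(f(x)) for gens in generators_per_class for f in gens]
                     for x in X])

# hypothetical usage, assuming `vca(points)` returns [f_1, f_2, ...] for a set of points:
# generators_per_class = [vca(X_train[y_train == l]) for l in range(L)]
# X_tilde = vca_feature_map(X_train, generators_per_class)
# clf = LinearSVC().fit(X_tilde, y_train)                      # linear classifier on top
# y_pred = clf.predict(vca_feature_map(X_test, generators_per_class))
```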

This is, in my opinion, a very powerful method: it learns a nonlinear representation of the data and then uses a very simple (linear) classifier on top. The features themselves appear quite robust to me, although I have no mathematical justification for this -- if $x$ is slightly perturbed, $\widetilde{x}$ should not move too much; on the other hand, $x$ is also involved in learning the set of generators $F(S_l)$, which might change when $x$ changes. One thing to note: while the representation makes the classes linearly separable, the representation learning process (here, vanishing component analysis) does not explicitly try to separate the classes. Karampatziakis & Mineiro, 2014 [2] propose a way of learning representations that explicitly tries to discriminate between pairs of classes via a generalized eigenvalue decomposition. Maybe I'll tackle that paper in a future post.


To me, it is a little unfortunate that this idea/technique is currently only usable for supervised learning. As I rambled here, VCA could be used to quantify the complexity of the data somehow. Reading about the above application to classification, the following question popped into my head:

Given a dataset $S$ and an integer $k$, can we partition $S$ into $k$ subsets $S_1, \ldots, S_k$ (each lying on its own, possibly non-overlapping, algebraic set) such that the subsets collectively have low complexity (with respect to the sets of generators of the vanishing ideals of the $S_i$)?
If $\mathcal{C}(S_i)$ is the complexity cost of a subset $S_i \subseteq S$ and depends on the set of generators $F(S_i)$ of the vanishing ideal of $S_i$, we want something like $$ \arg \min_{S_1,\ldots, S_k \,\colon\, S_i \cap S_j = \emptyset \ \forall\, i \not= j, \ \cup_{i = 1}^k S_i = S} \ \sum_{i = 1}^k \mathcal{C}(S_i).$$
This is actually a very naive extension of the $k$-means problem, where $$\mathcal{C}(S_i) = \sum_{x \in S_i} \| x - \mbox{Mean}(S_i) \|_2^2 = \frac{1}{2 |S_i|} \sum_{x, x' \in S_i} \|x - x' \|_2^2,$$ but with a more complicated notion of complexity -- one where even computing the complexity (naively) requires a VCA.

To me, there does not appear to be any polynomial time algorithm for this problem. The naive approach of trying all $O(k^{|S|})$ $k$-partitions of the dataset is the only one I can think of. An iterative algorithm like Lloyd's algorithm for $k$-means takes advantage of the following facts:

  • the computational cost of the complexity evaluation (evaluating $\mathcal{C}(S_i)$) is very cheap in $k$-means (linear in $|S_i|$), while doing the same for the $k$-algebraic-sets problem would naively require a complete VCA for each subset $S_i$
  • the change in $\sum_i \mathcal{C}(S_i)$ when moving a point $x$ from set $S_i$ to set $S_{i'}$ is easily evaluated in the $k$-means problem, whereas in the $k$-algebraic-sets problem there is no obvious efficient way (to me at least) other than recomputing the VCA for each of the new subsets $S_i$ and $S_{i'}$
Maybe a deep look at the VCA algorithm [1] will reveal some ways to improve the computational cost of these two steps, which, in turn, would allow a Lloyd's-style iterative algorithm (a naive version of which is sketched below). I hope to take a deeper look at the algorithm in the next post on this.
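For concreteness, this is the naive Lloyd-style loop I have in mind, with the expensive part marked; `vca` and `complexity` are hypothetical stand-ins for a VCA routine and the cost $\mathcal{C}$ respectively:

```python
import numpy as np

def k_algebraic_sets(X, k, vca, complexity, n_iter=20, seed=0):
    """Naive Lloyd-style loop for the k-algebraic-sets problem.

    `vca(points)` is assumed to return the generators F(points), and
    `complexity(generators)` the cost C(.) -- both are hypothetical here.
    Every candidate reassignment needs a fresh VCA per subset, which is
    exactly the bottleneck discussed above."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(X))              # random initial partition
    for _ in range(n_iter):
        changed = False
        for i in range(len(X)):
            best_j, best_cost = assign[i], np.inf
            for j in range(k):                             # try moving point i to every subset
                trial = assign.copy()
                trial[i] = j
                # expensive: a complete VCA for every subset, for every trial move
                cost = sum(complexity(vca(X[trial == c]))
                           for c in range(k) if np.any(trial == c))
                if cost < best_cost:
                    best_j, best_cost = j, cost
            changed |= (best_j != assign[i])
            assign[i] = best_j
        if not changed:
            break
    return assign
```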

[1] R. Livni, D. Lehavi, S. Schein, H. Nachlieli, S. Shalev-Shwartz & A. Globerson, Vanishing Component Analysis, ICML 2013

[2] N. Karampatziakis & P. Mineiro, Discriminative features via generalized eigenvectors, ICML 2014

Thursday, October 30, 2014

Vanishing component analysis: Speculating about a vanishing ideal

Last year at ICML 2013, Roi Livni et al. [1] presented a paper called Vanishing Component Analysis. This introduced me to the task of finding the vanishing ideal of a set. Formally, given a set of points $S = \left\{ x_1, \ldots, x_m \right \} \subset \mathbb{R}^d$, the vanishing ideal $I(S)$ of $S$ is the set of $d$-variate polynomials that vanish on $S$ (that is, for any polynomial $f \in I(S)$, $f(x_i) = 0 \ \forall\ x_i \in S$). An ideal $I$ is a set of polynomials which is a sub-group with respect to addition (closed under addition and negation, and containing the zero polynomial) and which "absorbs multiplication" (that is, if $f \in I$ and $g$ is a $d$-variate finite-degree polynomial over the reals, then $fg \in I$, where $f \cdot g(x) = f(x) \cdot g(x)$). Another way of saying this is that $g$ belongs to the polynomial ring in $d$ variables over $\mathbb{R}$. Livni et al., 2013 [1] proposed a way of finding a set of generators for the vanishing ideal $I(S)$ of any set $S$. Given an ideal $I$, a set of generators is a set of polynomials $\left \{ f_1, \ldots, f_k \right \} \subset I $ such that, $\forall f \in I$, $\exists g_1, \ldots, g_k$ in the $d$-variate polynomial ring with $f = \sum_i g_i \cdot f_i$. Let us denote this set of generators of the vanishing ideal of $S$ by $F(S)$. So $F(S)$ is kind of the "lowest degree" subset of the vanishing ideal. 
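As a toy illustration of my own (not from the paper): take $S$ to be a few points on the unit circle in $\mathbb{R}^2$; the polynomial $x^2 + y^2 - 1$ vanishes on $S$, and so does any multiple of it, which is exactly the "absorbs multiplication" property:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**2 - 1                                    # vanishes on the unit circle

# a few points of S on the unit circle
S = [(1, 0), (0, 1), (sp.Rational(3, 5), sp.Rational(4, 5))]
print([f.subs({x: a, y: b}) for a, b in S])            # [0, 0, 0] -- f is in I(S)

# the ideal absorbs multiplication: f*g also vanishes on S for any polynomial g
g = x**3 - 7*y + 2
print([(f * g).subs({x: a, y: b}) for a, b in S])      # [0, 0, 0]
```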

This interests me for several reasons (just that some mathematics like this exists and can be done is exciting in itself), but I would like to note two specific things here:

  • the usability and the interpretability of the (set of generators of the) vanishing ideal of a set (which is what I talk about in the following blob of text)
  • the computational challenges for the proposed algorithms for large data sets (which I wish to get to in a future post)


Naively thinking about the set of generators of the vanishing ideal $I(S)$, it seems to me that the higher the degree of the polynomials in the set of generators, the more "non-homogeneous" the set $S$ is. I am sure there is some proper interpretation of the degrees of the generators in algebraic geometry, so anything I say after this is pure speculation (mostly originating from my lack of knowledge in algebraic geometry) ...

Think of the situation where $d = 1$, so $S \subset \mathbb{R}$. If we think of the points in $S$ as lying on the $x$-axis, the vanishing ideal $I(S)$ of $S$ is the set of polynomials whose graphs touch or cross the $x$-axis at every $x_i \in S$. The more irregular and weird (for lack of more mathematical terms) the points in $S$ are, the higher the degree of the polynomial(s) in $F(S)$ (and subsequently in $I(S)$), since higher degree polynomials can (almost) fit any arbitrary set of points on the $x$-axis. I am hoping that such a behaviour transfers (probably with the help of complicated mathematics I am yet to comprehend) to higher dimensional sets ($d > 1$). 
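In the $d = 1$ case this is quite concrete: the vanishing ideal of a finite set $\{a_1, \ldots, a_m\} \subset \mathbb{R}$ is generated by the single polynomial $\prod_i (x - a_i)$, so the degree of the generator is exactly $|S|$. A quick check with sympy (my own toy example):

```python
import sympy as sp

x = sp.symbols('x')
S = [-2, 0, 1, 3, 7]                              # points on the x-axis

# the (monic) generator of the vanishing ideal of S in R[x]
gen = sp.expand(sp.Mul(*[x - a for a in S]))
print(sp.degree(gen, x))                          # 5 == len(S)
print([gen.subs(x, a) for a in S])                # [0, 0, 0, 0, 0]
```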

Given the set of generators $F(S)$ of the vanishing ideal, we can define a notion of complexity of a data set $S$ which roughly corresponds to the degrees of the polynomials in $F(S)$. For example, the complexity $\mathcal{C}(S)$ of the set $S$ can be defined (pardon my lack of foresight here) as: 
$$ \mathcal{C}(S) = \sum_{f \in F(S)} \mbox{ max-degree of a monomial in } f.$$
This roughly captures the following: the more complex/non-homogeneous $S$ is, the higher the degrees of the polynomials in $F(S)$, and hence the higher $\mathcal{C}(S)$. I am sure there is a well defined notion of some such complexity already in the literature. Another interesting characteristic of $F(S)$ would be a dimension-wise statistic, such as:
$$\mathcal{C}(S, d) = \sum_{f \in F(S)} \mbox{ max degree of the } d^{\mbox{th}} \mbox{ variable over all monomials in } f.$$ 
The hope (and I emphasize hope) is that this statistic indicates the level of complexity added to the data set by this dimension. 

Note. In both the definitions of $\mathcal{C}(S)$ and $\mathcal{C}(S, d)$, the sum and the max operators can be interchanged to get other notions of complexity (if there is no established notion already). 
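Here is how I would compute the two statistics above for a given set of generators, assuming the generators are available as sympy polynomials (the two generators below are made up for illustration, not the output of VCA):

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
variables = [x1, x2, x3]

# made-up generators standing in for F(S)
F = [sp.Poly(x1**2 + x2 - 1, *variables),
     sp.Poly(x1 * x3**3 - x2**2, *variables)]

# C(S): sum over generators of the maximum total degree of a monomial
C_S = sum(f.total_degree() for f in F)

# C(S, d): sum over generators of the maximum degree of the d-th variable over all monomials
def C_S_d(F, var):
    return sum(sp.degree(f, gen=var) for f in F)

print(C_S)                                 # 2 + 4 = 6
print([C_S_d(F, v) for v in variables])    # per-variable: [2 + 1, 1 + 2, 0 + 3] = [3, 3, 3]
```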

If the above happens to be somewhat true, it is very neat because it gives an actual number which corresponds to the level of complexity/non-homogeneity infused into the whole data set by a particular dimension -- a variable importance of sorts, which could be a tool for feature selection (if and when required). Related interesting (to me) questions also arise:

  1. What kinds of polynomials $f \in F(S)$ and numbers $\mathcal{C}(S, d)$ should we expect if all the dimensions are independent? Do we see polynomials which depend only on a single dimension at a time?
  2. What happens when some of the dimensions are correlated? For example, if dimensions $i$ and $j$ are correlated, would we see a polynomial of the form $f = x[i] - x[j]^c + \mbox{constant}$, where $c$ is an integer?
I will stop at this for now, but this mathematical problem generated a lot of questions in my mind which I thought I should document lest I forget. I have a feeling that most of my questions/ideas would be very easily answered/refuted/validated if I open a book on algebraic geometry (any recommendations for a novice?) ... 


[1] R. Livni, D. Lehavi, S. Schein, H. Nachlieli, S. Shalev-Shwartz & A. Globerson, Vanishing Component Analysis, ICML 2013

Saturday, September 20, 2014

MLConf Atlanta, September 19, 2014 talk

At the cost of cheap self-promotion and in a desperate attempt to end a posting drought, here is something I spoke about at MLConf Atlanta (September 19, 2014; slides):

Given a set of objects S, a query q and a notion of similarity between the query and the objects, the task of *similarity search* (or *nearest-neighbour search*) is to find the object in S most similar (or nearest) to q. This problem is pervasive in everyday applications. While this in itself is not directly a learning problem, the ability to perform this task (efficiently) allows various forms of learning. For example, if you are trying to learn a prediction function from a set of training examples, similarity search presents a model-free way of learning --- for any test point p, find the points in the training set nearest (or most similar) to p and compute the prediction for p as the (weighted) aggregate of the function values at these nearest neighbours of p (in classification, it can be the weighted majority, while in regression, it can be the weighted average). One usual way of detecting outliers in a set is to identify the nearest-neighbours of each point and use the similarity to their corresponding nearest-neighbours to generate the outlier-ness score.
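A bare-bones version of this search-based prediction -- Euclidean distance as the (dis-)similarity and inverse-distance weighting, both chosen purely for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, p, k=5, eps=1e-12):
    """Model-free prediction for a query p: a weighted average of the function
    values at p's k nearest neighbours (a weighted majority vote would be the
    classification analogue)."""
    dists = np.linalg.norm(X_train - p, axis=1)      # distance of p to every training point
    nn = np.argsort(dists)[:k]                       # indices of the k nearest neighbours
    weights = 1.0 / (dists[nn] + eps)                # closer neighbours get larger weights
    return np.dot(weights, y_train[nn]) / weights.sum()

# toy usage: "learn" y = sum(x) without any model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X.sum(axis=1)
print(knn_predict(X, y, p=np.array([0.1, -0.2, 0.3])))   # should land near 0.2
```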

One of the things that makes search-based learning very powerful is the fact that we make no assumption on the dataset nor do we make any model or distributional assumption on the function we are trying to learn. The learning is completely *nonparametric* — whatever we learn, we learn from the data made available to us. There is one implicit assumption I would like to mention here — if two points are close, their predictions are usually close.

There are two very important knobs in search-based learning — the notion of similarity, and the threshold that separates the nearest-neighbours from the non-nearest-neighbours. While many standard notions of similarity exist, the notion of similarity should ideally be constructed with domain knowledge in mind. The threshold is a tuning parameter of the learning problem and is usually learned via cross-validation.

A similarity function can be defined between any pair of objects. Those defined in an intuitive and domain-specific way usually exhibit the best performance in terms of quality. This is probably one of the most important ingredients (if not the most important) in search-based learning. Another key ingredient is the efficiency of the nearest-neighbour search — the faster the search, the faster the prediction (and the faster the tuning phase). The efficiency of search is greatly intertwined with the type of similarity function. On a high level, similarity functions can be categorized in two different ways — one is based on whether the similarity function is symmetric or not (whether A's similarity to B is the same as B's similarity to A), and the other is based on whether self-similarity is maximum or not (whether A's similarity to A is maximum or there can exist a B whose similarity to A is greater than that).

While symmetric similarity functions with maximum self-similarity make sense, they are mostly useful when we are comparing apples to apples — consider the situation where the duplicate of the query in our set of points is our ideal match. Searching for similar images given a query image is a task where symmetric similarity functions with maximum self-similarity would be useful. Asymmetric similarity functions are useful in situations where there is good reason to have a sense of directionality in the similarity. For example, take the task of matching people on a dating website where the similarity of A to B is governed by A's estimated attraction to B (and vice versa). In this case, the similarity function is asymmetric for good reason (many could be attracted to the same person, but that person does not usually find all of them attractive). Similarity functions where self-similarity is not maximum are useful in situations where we are comparing apples to oranges — the query is an apple and the objects in your set are oranges. Think, for example, of a user as a query, movies or products as the objects in your set, and a similarity function defined as the estimated preference of the user for a movie or product. If we are searching for the most preferred movie/product for the user, we do not really care about the self-similarity — mostly because we would never compare the user to itself. Another example of this is document search, where the query is a single phrase and we are looking for the most relevant documents.

Most of the focus and progress of research on efficient search has been for symmetric similarity functions with maximum self-similarity. Metrics form a class of such (dis-)similarity functions. Almost all efficient search algorithms work solely for metrics. This has an unwanted side-effect — similarity functions between pairs of objects are coerced to be metrics, even though that might be inappropriate, only to allow the use of efficient search algorithms; and since everybody tries to use metrics, all research on efficient search continues to focus mostly on metrics.

Asymmetric similarity functions with maximum self-similarity are quite meaningful in many situations. When dealing with sets as individual objects, the size of the set-theoretic difference (the set-theoretic difference of A and B are the elements present in A but not in B; the set-theoretic difference of B and A are the elements present in B but not in A) is an asymmetric notion of (dis-)similarity where the self-similarity is maximum. When the individual objects are probability distributions (a document can be thought of as a multinomial probability distribution over the terms of the dictionary), the KL divergence (KL stands for Kullback-Leibler) is a widely used notion of asymmetric (dis-)similarity between probability distributions (again with self-similarity being the maximum). There has been some recent focus on efficient search with a class of asymmetric similarity measures with maximum self-similarity known as Bregman divergences. However, the research is still quite immature and most of the work has focused on understanding the hardness of the problem in a theoretical sense.
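A quick numerical reminder of the asymmetry of the KL divergence mentioned above, on two made-up discrete distributions (the divergence of a distribution to itself is 0, i.e. self-similarity is maximal in the dissimilarity sense):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.8, 0.1, 0.1]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))    # two different numbers: D(p||q) != D(q||p)
print(kl(p, p))              # 0.0 -- a distribution is "closest" to itself
```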


Symmetric similarity functions where the self-similarity may not be maximum are actually widely used in machine learning. When working with structured data in a table, the (unnormalized) inner-product between two of the objects (vectors in this case) is a symmetric similarity function where self-similarity may not be the maximum. Mercer kernels are essentially a generalization of the inner-product to any kind of data — they are symmetric, though self-similarity may not be the maximum. They are quite popular in machine learning, and Mercer kernels have been defined for text, graphs, time series, and images. Efficient search with general Mercer kernels has only recently received attention, opening avenues for search-based learning with similarity functions which are less restrictive and more domain-relevant. Even though the research on this topic is far from mature, it is a step in the right direction.
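As a trivial numerical illustration of the inner product not having maximal self-similarity:

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([3.0, 3.0])     # same direction as a, larger norm

print(np.dot(a, a))          # 2.0 -- self-similarity of a
print(np.dot(a, b))          # 6.0 -- b is "more similar" to a than a is to itself
```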

Monday, July 22, 2013

Non-convex optimization with random jitters

If you think of non-convex optimization in machine learning, everybody is usually content with a local minimum. This may be because either (1) the "quality" of a local minimum is not that much worse than that of the global minimum (it would be great if this were shown more often), or (2) nothing better can be done; your hands are tied. 

Obviously, there are notions of random restarts (or random mutation in genetic algorithms for optimization) which intuitively allow you to "escape local minima". 

I was wondering if there has ever been a theoretical study of the "robustness" of local minima. What I mean by this is: if I jitter the solution corresponding to a local minimum with some random noise (that is, if we are solving $\min_x f(x)$ and reach a local minimum $\tilde{x}$, we add some jitter/noise to $\tilde{x}$), or jitter the "problem" with some random noise (that is, if we are solving $\min_x Ax$, we add some noise to $A$), how much of a jitter can this local minimum handle? Specifically, if we start the greedy minimization procedure at the jittered solution, or at the current solution but with the jittered problem, how much jitter (for some formal notion of jitter/noise) will still allow the optimization procedure to end up in (or around) the same local minimum? 
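A tiny experiment along these lines -- a 1-D non-convex objective, plain gradient descent, and increasing amounts of jitter added to a local minimum; everything here is made up purely to illustrate the question:

```python
import numpy as np

def f(x):    return np.sin(3 * x) + 0.1 * x**2       # a non-convex 1-D objective
def grad(x): return 3 * np.cos(3 * x) + 0.2 * x

def descend(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_star = descend(2.0)                                 # some local minimum
rng = np.random.default_rng(0)
for sigma in [0.1, 0.5, 1.0, 2.0]:
    restarts = [descend(x_star + sigma * rng.normal()) for _ in range(50)]
    frac_same = np.mean(np.isclose(restarts, x_star, atol=1e-3))
    print(sigma, frac_same)   # fraction of jittered starts that fall back into the same minimum
```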

Moreover, how useful would this notion of robustness be in systematically looking for the global minimum? I have no idea about this. If a local minimum is very "robust", then it would seem likely that it is the only local minimum in the region around it. This could allow us to ignore any starting point in that region and effectively remove the region from the parameter search space. 

Most of this idea is very handwavy (call it my rambling) and would need to be formalized to be of any real meaning. I will update this post if and when I think a little more about this. 

Tuesday, October 30, 2012

Lower bounds for representations with general similarity functions

Some recent work has focused on obtaining explicit representations $z(x) \in \mathbb{R}^D$ of points $x \in \mathbb{R}^d$ given a (specific class of) kernel function $K\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ [2,4] (a sketch of the random features of [2] appears after the questions below). The interesting results are of the following kind: $ | \langle z(x), z(y) \rangle - K(x,y) | < \epsilon $ with high probability. These results have focused on obtaining guarantees of this sort, and generally $D = \Omega \left( \frac{d}{\epsilon^2} \right)$. Both these papers assume that we have explicit access to the input space $\mathbb{R}^d$ (although it might be possible to extend their approach to the situation where there is no access to the actual $x \in \mathbb{R}^d$). What I am concerned with are the following questions:

  1. Given a $n \times n$ similarity matrix for some set of objects with no access to actual points $x$, and assuming that similarity matrix is positive (semi) definite (effectively this matrix is obtained by a positive (semi) definite kernel function on the objects):
    •  can we obtain the embedding of the points in some $\mathbb{R}^D$? 
    • If so, what is the lower bound on $D$ to get constant probability guarantees that the inner-products of the points in this embedding are $\epsilon$ close to the values in the kernel matrix? This lower bound will probably depend on the properties of the kernel function. 
    • What can be done about newer points? 
    • How would the results change if the embeddings are now sought on the $D$-dimensional Hamming cube instead of $\mathbb{R}^D$ (albeit with appropriate scaling)?
  2. Similar to the previous question, except now the similarity matrix is effectively obtained by a good similarity function as defined by Balcan & Blum, 2006 [1]. Again, we seek a constant probability, additive $\epsilon$ approximation of the similarity values. Now the lower bounds on $D$ should depend on the properties of the similarity function. This would be a more general result than the one for positive (semi-)definite kernel functions.
These results would give a lower bound on the dimensionality of the space into which the points need to be embedded into to preserve the pairwise similarities $( |\langle z(x), z(y) \rangle - s(x,y)| < \epsilon )$. 
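For reference, the random (Fourier) features of Rahimi & Recht [2] for the Gaussian (RBF) kernel look roughly like this -- a minimal sketch where $D$ controls the approximation error:

```python
import numpy as np

def rff(X, D=2000, gamma=0.1, seed=0):
    """Random Fourier features z(x) such that <z(x), z(y)> approximates
    the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))   # samples from the kernel's spectral density
    b = rng.uniform(0, 2 * np.pi, size=D)                   # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 5))
Z = rff(X)
exact = np.exp(-0.1 * np.sum((X[0] - X[1]) ** 2))           # K(x, y) with gamma = 0.1
approx = Z[0] @ Z[1]
print(exact, approx)        # the gap should be on the order of 1/sqrt(D)
```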

Active learning-esque. More interesting is the following situation: 
  • Consider the situation where there is no initial access to the $n \times n$ similarity matrix, but its entries can be obtained from an oracle on a query-by-query basis. This brings in an active learning flavor to the problem, where you can ask the oracle for the similarity of any pair of your choice from the set of $n$ objects. 
    • The first question is that of existence -- is it possible to obtain a constant probability $\epsilon$-approximate embedding while querying fewer than all $n^2$ entries of the matrix? 
    • If this is (somehow) possible (under restrictions), what is the relationship between the total number of queries required and the minimum number of dimensions required to embed the objects while preserving the similarities as dot-products?
I am sure some of the authors of the following papers are already working on this problem, and it would be interesting to read the results when they come out.



[1] M. F. Balcan and A. Blum, On a Theory of Learning with Similarity Functions, ICML 2006
[2] A. Rahimi and B. Recht, Random Features for Large Scale Kernel Machines, NIPS 2007
[3] P. Kar and P. Jain, Similarity-based Learning via Data driven Embeddings, NIPS 2011
[4] P. Kar and H. Karnick, Random Feature Maps for Dot-Product Kernels, AISTATS 2012

Thursday, September 20, 2012

GMM estimation with regularization

I was discussing the selection of the number of Gaussians $k$ in a Gaussian Mixture Model (GMM) with a colleague. Formally, the GMM is represented as: $$ f(x) = \sum_{i = 1}^k w_i f_i(x) $$ where $f(x)$ is the density estimate at $x$, $f_i$ is the density of $\mathcal{N}(\mu_i, \sigma_i)$, a Gaussian with mean $\mu_i$ and variance $\sigma_i$, and $w_i$ is the weight of the $i^{th}$ Gaussian, with $\sum_i w_i = 1$. 

Usual parametric estimation involves picking a $k$ and then finding the parameters by maximizing the likelihood $$\max_\theta L( \theta; x_1,\ldots, x_n)$$ over the data ($\theta = \{ w_i, \mu_i, \sigma_i \} $). To the best of my knowledge, there is no single standard way to pick $k$; criteria such as AIC and BIC can be used to choose an appropriate value. 

Moreover, I am sure there are methods in Bayesian nonparametrics that allow for automatic selection of $k$; I just need to read about that. The idea that I was discussing with my colleague borrows from sparse additive models -- start with a large $k$ (possibly very close to the total number of points $n$); since the GMM is itself an additive model, it might be possible to automatically select $k$ by solving the following:
$$ \max_\theta L(\theta; x_1,\ldots, x_n) - \lambda \|\mathbf{w}\|_0 $$ where $\| \cdot \|_0$ is the $L_0$ norm. Remember that the $w_i$s still have to sum to $1$.
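A brute-force relaxation of this (fit a GMM with EM for every candidate $k$ and compare the penalized likelihoods directly, which is what trying all values of $k$ amounts to) is easy to write down with scikit-learn; the penalty weight $\lambda$ below is arbitrary:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from 3 well-separated Gaussians
X = np.concatenate([rng.normal(loc=m, scale=0.5, size=(100, 1)) for m in (-4, 0, 4)])

lam = 10.0                                             # arbitrary penalty weight (lambda)
best_k, best_obj = None, -np.inf
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    loglik = gmm.score(X) * len(X)                     # total log-likelihood of the data
    obj = loglik - lam * k                             # L0 penalty: each active component costs lam
    if obj > best_obj:
        best_k, best_obj = k, obj
print(best_k)                                          # k selected by the penalized likelihood (ideally 3 here)
```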

Given this formulation, I would like to know the following:

  1. Has somebody already worked along this line and figured it out? There has been work on GMM estimation with sparsity assumptions, but the sparsity assumptions appear to be on the covariance matrix of the normal distribution. 
  2. Given this formulation, is it possible to say whether it will actually recover the right $k$? Can there be any form of sparsistency results regarding this? 
  3. Is it possible to come up with a scalable algorithm to solve the $L_0$-regularized GMM? An exact solution might be hard unless we try all possible values of $k$ and see which one attains the highest penalized objective -- assuming that the EM algorithm (say that is what we use) reaches the global maximum for each $k$. Some relaxations might be applied to obtain useful algorithms.
This automatic selection might also occur in robust parametric estimation, where the model (in this case $k$) is misspecified. However, I am not sure how this would work. Robust estimation has the capability to fit the model to whatever data fits it, ignoring the rest of the data. 

Tuesday, September 11, 2012

Of phases and magnitudes

I attended a really interesting talk today by Emmanuel Candes about phase recovery for X-ray crystallography. While I will not discuss the problem, something he mentioned stuck with me. His (fairly believable) claim was that for any signal, the phase is way more important than the magnitude. 

If you consider a signal as a complex number, it can be represented as: $r \cdot e^{i\varphi}$ where $r$ is the magnitude and $\varphi$ is the phase. For a complex number, intuitively, the phase determines the direction of the signal in the complex plane. If you think of an analogy for high dimensional (say $d$) data, the norm of any point $p \in \mathbb{R}^d$ can be represented by a single value $\left\| p \right\|$  (one dimension) while the direction requires a $d$-dimensional unit vector $\frac{p}{\| p \|}$.
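Both decompositions, numerically (just numpy arithmetic):

```python
import numpy as np

# a complex "signal": magnitude and phase
z = 3.0 + 4.0j
r, phi = np.abs(z), np.angle(z)
print(r, phi, r * np.exp(1j * phi))          # 5.0, ~0.927, recovers 3+4j

# the analogy for a point in R^d: one number for the norm, a unit vector for the direction
p = np.array([3.0, 4.0, 12.0])
norm = np.linalg.norm(p)
direction = p / norm
print(norm, direction, norm * direction)     # 13.0, a unit vector, recovers p
```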

So the norm requires a single dimension while the direction requires all $d$ dimensions (I think it might be possible to represent the direction with one less dimension). So here I can think of two questions:

  1. Is there any formal analogy between the magnitude and phase of signals and the norm and direction of $d$-dimensional points? This question might be stupid in itself.
  2. Are there any useful applications (real problems) where the magnitude (or, equivalently, the norm) is as important as, if not more important than, the phase (or direction)?