Thursday, October 30, 2014

Vanishing component analysis: Speculating about a vanishing ideal

Last year at ICML 2013, Roi Livni, et al. [1] presented a paper called Vanishing Component Analysis, which introduced me to the task of finding the vanishing ideal of a set. Formally, given a set of points $S = \left\{ x_1, \ldots, x_m \right \} \subset \mathbb{R}^d$, the vanishing ideal $I(S)$ of $S$ is the set of $d$-variate polynomials that vanish on $S$ (that is, for any polynomial $f \in I(S)$, $f(x_i) = 0 \ \forall x_i \in S$). An ideal $I$ is a set of polynomials which is a subgroup with respect to addition (closed under addition and negation, and containing the zero polynomial) and which "absorbs multiplication" (that is, if $f \in I$ and $g$ is a $d$-variate finite-degree polynomial over the reals, then $fg \in I$, where $(f \cdot g)(x) = f(x) \cdot g(x)$). Another way of talking about $g$ is that $g$ belongs to the polynomial ring in $d$ variables over $\mathbb{R}$. Livni, et al., 2013 [1] proposed a way of finding a set of generators for the vanishing ideal $I(S)$ of any set $S$. Given an ideal $I$, a set of generators is a set of polynomials $\left \{ f_1, \ldots, f_k \right \} \subset I$ such that, $\forall f \in I$, $\exists g_1, \ldots, g_k$ in the $d$-variate polynomial ring with $f = \sum_i g_i \cdot f_i$. Let us call this set of generators of the vanishing ideal of $S$ $F(S)$. So $F(S)$ is, in a sense, the "lowest degree" subset of the vanishing ideal.
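
To make these definitions concrete, here is a minimal sketch (with a made-up point set and made-up polynomials, SymPy used only for symbolic substitution) checking that a polynomial vanishing on $S$ still vanishes after multiplication by an arbitrary polynomial:

```python
# A small concrete check of the definitions above; the points and polynomials
# here are made up for illustration.
import sympy as sp

x, y = sp.symbols('x y')

# S: a finite set of points in R^2, all lying on the parabola y = x^2.
S = [(-2, 4), (-1, 1), (0, 0), (1, 1), (3, 9)]

f = y - x**2        # a polynomial that vanishes on S, so f is in I(S)
g = 3*x*y + 7       # an arbitrary polynomial in the ring R[x, y]

# f vanishes at every point of S ...
print([f.subs({x: a, y: b}) for (a, b) in S])                  # [0, 0, 0, 0, 0]

# ... and so does g*f, illustrating that I(S) "absorbs multiplication".
print([sp.expand(g*f).subs({x: a, y: b}) for (a, b) in S])     # [0, 0, 0, 0, 0]
```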

This interests me for several reasons (just that some mathematics like this exists and can be done is exciting in itself), but I would like to note two specific things here:

  • the usability and the interpretability of the (set of generators of the) vanishing ideal of a set (which is what I talk about in the following blob of text)
  • the computational challenges for the proposed algorithms for large data sets (which I wish to get to in a future post)


Naively thinking about the set of generators of the vanishing ideal $I(S)$, it seems to me that the higher the degree of the polynomials in the set of generators, the more "non-homogeneous" the set $S$ is. I am sure there is some proper interpretation of the degrees of the generators in algebraic geometry, so anything I say after this is pure speculation (mostly originating from my lack of knowledge of algebraic geometry) ...

Think of the situation where $d = 1$, so $S \subset \mathbb{R}$. If we think of the points in $S$ as lying on the $x$-axis, the vanishing ideal $I(S)$ of $S$ is the set of polynomials which intersect the $x$-axis at every $x_i \in S$. The more irregular and weird (for lack of more mathematical terms) the points in $S$, the higher the degree of the polynomial(s) in $F(S)$ (or subsequently in $I(S)$), since higher degree polynomials can (almost) fit any arbitrary set of points on the $x$-axis. I am hoping that such behaviour transfers (probably with the help of complicated mathematics I am yet to comprehend) to higher dimensional sets ($d > 1$).
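
For $d = 1$ and a finite $S$, one generator of the vanishing ideal is $\prod_{x_i \in S} (x - x_i)$, whose degree equals $|S|$. A quick numerical sketch of such a generator vanishing on $S$ (the points below are made up):

```python
# d = 1 sketch: prod_i (x - x_i) vanishes on a finite set S on the x-axis.
import numpy as np

S = np.array([-2.0, 0.5, 1.0, 3.0, 7.0])   # made-up points on the x-axis

coeffs = np.poly(S)                  # coefficients of prod_i (x - x_i), degree |S|
print(np.polyval(coeffs, S))         # ~[0, 0, 0, 0, 0] up to floating-point error
```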

Given the set of generators of the vanishing ideal $F(S)$, we can define a notion of complexity of a data set $S$ which roughly corresponds to the degrees of the polynomials in $F(S)$. For example, the complexity $\mathcal{C}(S)$ of the set $S$ can be defined (pardon my lack of foresight here) as:
$$ \mathcal{C}(S) = \sum_{f \in F(S)} \mbox{ max-degree of a monomial in } f.$$
Then this roughly captures the following fact: the more complex/non-homogeneous $S$ is, the higher the degrees of the polynomials in $F(S)$, and hence the higher $\mathcal{C}(S)$. I am sure there is a well-defined notion of some such complexity already in the literature. Another interesting characteristic of $F(S)$ would be a dimension-wise statistic, such as:
$$\mathcal{C}(S, d) = \sum_{f \in F(S)} \mbox{ max degree of the } d^{\mbox{th}} \mbox{ variable over all monomials in } f.$$ 
The hope (and I emphasize hope) is that this statistic indicates the level of complexity added to the data set by that dimension.

Note. In both the definitions of $\mathcal{C}(S)$ and $\mathcal{C}(S, d)$, the sum and the max operator can be interchanged to get some other notion of complexity (if there is no established notion already).
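
A direct transcription of $\mathcal{C}(S)$ and $\mathcal{C}(S, d)$, assuming the generators are already available as SymPy polynomials (the generator set below is made up purely for illustration):

```python
# Computing the speculative complexity measures C(S) and C(S, d) from a given
# set of generators F(S); the generators below are made up for illustration.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
variables = [x1, x2]
F = [sp.Poly(x2 - x1**2, *variables), sp.Poly(x1*x2 + x1 - 1, *variables)]

# C(S): sum over generators of the maximum total degree of a monomial.
C_S = sum(f.total_degree() for f in F)

# C(S, d): sum over generators of the maximum degree of the d-th variable.
def C_S_dim(F, d):
    return sum(f.degree(variables[d]) for f in F)

print(C_S)                              # 2 + 2 = 4
print(C_S_dim(F, 0), C_S_dim(F, 1))     # x1: 2 + 1 = 3,  x2: 1 + 1 = 2
```

Swapping the sum for a max, as in the note above, is a one-word change in either expression.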

If the above happens to be somewhat true, it is very neat because it gives an actual number which corresponds to the level of complexity/non-homogeneity infused into the whole data set by a particular dimension -- a variable importance of sorts, which could be a tool for feature selection (if and when required). Related questions that are interesting (to me) also arise:

  1. What kinds of polynomials $f \in F(S)$ and numbers $\mathcal{C}(S, d)$ should we expect if all the dimensions are independent? Do we see polynomials which depend only on a single dimension at a time?
  2. What happens when some of the dimensions are correlated? For example, if dimensions $i$ and $j$ are correlated, would we see a polynomial of the form $f = x[i] - x[j]^c + \mbox{constant}$, where $c$ is an integer? (A quick check of this is sketched below.)
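
Here is a hedged check of question 2 in the special case where the correlation is an exact functional dependence: if dimension $j$ is exactly the square of dimension $i$, then $x[j] - x[i]^2$ (the form above with the roles of $i$ and $j$ swapped, $c = 2$ and a zero constant) does vanish on the set, so one would expect something equivalent to it among the generators. The data are made up for illustration.

```python
# Hedged check of question 2 when the "correlation" is an exact dependence:
# dimension 2 is the square of dimension 1, so x2 - x1**2 lies in I(S).
# (The data are made up for illustration.)
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

vals = [-2, -1, 0, 1, 3]
S = [(v, v**2) for v in vals]      # second coordinate is the square of the first

f = x2 - x1**2
print([f.subs({x1: a, x2: b}) for (a, b) in S])   # [0, 0, 0, 0, 0]
```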
I will stop here for now, but this mathematical problem generated a lot of questions in my mind which I thought I should document lest I forget. I have a feeling that most of my questions/ideas would be very easily answered/refuted/validated if I opened a book on algebraic geometry (any recommendations for a novice?) ...


[1] R. Livni, D. Lehavi, S. Schein, H. Nachlieli, S. Shalev-Shwartz & A. Globerson, Vanishing Component Analysis, ICML 2013.

Saturday, September 20, 2014

MLConf Atlanta, September 19, 2014 talk

At the cost of cheap self-promotion, and in a desperate attempt to end a posting drought, here is something I spoke about at MLConf Atlanta (September 19, 2014 slides):

Given a set of objects S, a query q and a notion of similarity between the query and the objects, the task of *similarity search* (or *nearest-neighbour search*) is to find the object in S most similar (or nearest) to q. This problem is pervasive in everyday applications. While this in itself is not directly a learning problem, the ability to perform this task (efficiently) allows various forms of learning. For example, if you are trying to learn a prediction function from a set of training examples, similarity search presents a model-free way of learning --- for any test point p, find the points in the training set nearest (or most similar) to p and compute the prediction for p as the (weighted) aggregate of the function values at these nearest neighbours of p (in classification, it can be the weighted majority, while in regression, it can be the weighted average). Similarly, one usual way of detecting outliers in a set is to identify the nearest-neighbours of each point and use each point's similarity to its nearest-neighbours to generate an outlier-ness score.
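
As a concrete illustration of the prediction scheme just described, here is a minimal sketch of distance-weighted nearest-neighbour regression; the brute-force Euclidean search, the weighting scheme and the toy data are all assumptions made for illustration:

```python
# Minimal sketch of search-based prediction: distance-weighted k-NN regression.
# The Euclidean distance, the weighting and the toy data are illustrative assumptions.
import numpy as np

def knn_predict(X_train, y_train, p, k=3, eps=1e-12):
    """Predict at p as the distance-weighted average over its k nearest training points."""
    dists = np.linalg.norm(X_train - p, axis=1)   # brute-force nearest-neighbour search
    nn = np.argsort(dists)[:k]                    # indices of the k nearest neighbours
    weights = 1.0 / (dists[nn] + eps)             # closer neighbours get larger weights
    return np.dot(weights, y_train[nn]) / weights.sum()

# Toy usage: "learn" f(x) = x_1 + x_2 with no model at all, purely from neighbours.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = X_train.sum(axis=1)
print(knn_predict(X_train, y_train, np.array([0.2, -0.4]), k=5))   # roughly -0.2
```

For classification, the same routine would return a weighted majority vote over the neighbours' labels instead of a weighted average.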

One of the things that makes search-based learning very powerful is the fact that we make no assumption on the dataset nor do we make any model or distributional assumption on the function we are trying to learn. The learning is completely *nonparametric* — whatever we learn, we learn from the data made available to us. There is one implicit assumption I would like to mention here — if two points are close, their predictions are usually close.

There are two very important knobs in search-based learning — the notion of similarity, and the threshold that separates the nearest-neighbours from the non-nearest-neighbours. While many standard notions of similarity exist, the notion of similarity should ideally be constructed with domain knowledge in mind. The threshold is a tuning parameter of the learning problem and is usually learned via cross-validation.
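
Tuning that second knob can be as simple as a cross-validated grid search over candidate values. A hedged sketch, reusing the hypothetical knn_predict function and toy data from the sketch above and treating the number of neighbours k as the thresholding knob:

```python
# Hedged sketch: choose the neighbourhood size k by simple cross-validation,
# reusing knn_predict from the previous sketch (k stands in for the threshold knob).
import numpy as np

def cv_select_k(X, y, candidate_ks=(1, 3, 5, 10), n_folds=5):
    idx = np.arange(len(X))
    folds = np.array_split(idx, n_folds)
    cv_error = {}
    for k in candidate_ks:
        fold_errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            preds = [knn_predict(X[train], y[train], X[i], k=k) for i in fold]
            fold_errs.append(np.mean((np.array(preds) - y[fold]) ** 2))
        cv_error[k] = np.mean(fold_errs)
    return min(cv_error, key=cv_error.get), cv_error

best_k, errors = cv_select_k(X_train, y_train)
print(best_k, errors)
```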

A similarity function can be defined between any pair of objects. Those defined in an intuitive and domain-specific way usually exhibit the best performance in terms of quality. This is probably one of the most (if not the most) important ingredients in search-based learning. Another key ingredient is the efficiency of the nearest-neighbour search — the faster the search, the faster the prediction (and the faster the tuning phase). The efficiency of search is greatly intertwined with the type of similarity function. At a high level, similarity functions can be categorized in two different ways — one is based on whether the similarity function is symmetric or not (whether A's similarity to B is the same as B's similarity to A), and the other is based on whether self-similarity is maximum or not (whether A's similarity to A is maximum, or there can exist a B whose similarity to A is greater than that).
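
Both axes of this categorization can be probed empirically for any concrete similarity function. Here is a hedged sketch with cosine similarity as the example (which is symmetric and has maximum self-similarity); the checker functions and sample points are made up for illustration:

```python
# Hedged sketch: empirically checking the two axes of categorization for a
# similarity function on a handful of sample points (made up for illustration).
import numpy as np

def is_symmetric(sim, points, tol=1e-9):
    return all(abs(sim(a, b) - sim(b, a)) < tol for a in points for b in points)

def self_similarity_is_max(sim, points, tol=1e-9):
    return all(sim(a, a) >= sim(a, b) - tol for a in points for b in points)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

points = [np.array(p, dtype=float) for p in [(1, 0), (3, 0), (1, 2), (-1, 1)]]
print(is_symmetric(cosine, points), self_similarity_is_max(cosine, points))   # True True
```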

While symmetric similarity functions with maximum self-similarity make sense, they are mostly useful when we are comparing apples to apples — consider the situation where the duplicate of the query in our set of points is our ideal match. Searching for similar images given a query image is a task where symmetric similarity functions with maximum self-similarity would be useful. Asymmetric similarity functions are useful in situations where there is good reason to have a sense of directionality in the similarity. For example, take the task of matching people on a dating website, where the similarity of A to B is governed by A's estimated attraction to B (and vice versa). In this case, the similarity function is asymmetric for good reason (many could be attracted to the same person, but that person does not usually find all of them attractive). Similarity functions where self-similarity is not maximum are useful in situations where we are comparing apples to oranges — the query is an apple and the objects in your set are oranges. Think, for example, of a user as a query, movies or products as the objects in your set, and a similarity function defined as the estimated preference of the user for a movie or product. If we are searching for the most preferred movie/product for the user, we do not really care about self-similarity — mostly because we would never compare the user to itself. Another example of this is document search, where the query is a single phrase and we are looking for the most relevant documents.

Most of the focus and progress of research on efficient search has been for symmetric similarity functions with maximum self-similarity. Metrics form a class of such (dis-)similarity functions. Almost all efficient search algorithms work solely for metrics. This has an unwanted side-effect — similarity functions between pairs of objects are coerced to be metrics, even though that might be inappropriate, only to allow the use of efficient search algorithms; and since everybody tries to use metrics, all research on efficient search continues to focus mostly on metrics.

Asymmetric similarity functions with maximum self-similarity are quite meaningful in many situations. When dealing with sets as individual objects, the size of the set-theoretic difference (the set-theoretic difference of A and B is the set of elements present in A but not in B; the set-theoretic difference of B and A is the set of elements present in B but not in A) is an asymmetric notion of (dis-)similarity where self-similarity is maximum (the set-theoretic difference of A with itself is empty). When individual objects are probability distributions (a document can be thought of as a multinomial probability distribution over the terms of the dictionary), the KL divergence (KL stands for Kullback-Leibler) is a widely used notion of asymmetric (dis-)similarity between probability distributions (again with self-similarity being the maximum, since the divergence of a distribution from itself is zero). There has been some recent focus on efficient search with a class of asymmetric (dis-)similarity measures with maximum self-similarity known as Bregman divergences. However, the research is still quite immature and most of the work has focused on understanding the hardness of the problem in a theoretical sense.
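
A small sketch of the two asymmetric examples mentioned above, with made-up sets and distributions, showing that both notions are direction-dependent while self-(dis)similarity is extremal:

```python
# Hedged sketch of the two asymmetric examples: set-theoretic difference and
# KL divergence (the sets and distributions are made up for illustration).
import numpy as np

A, B = {1, 2, 3, 4}, {3, 4, 5}
print(len(A - B), len(B - A))      # 2 vs 1: the dissimilarity depends on direction

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.8, 0.1, 0.1]
Q = [0.4, 0.4, 0.2]
print(kl(P, Q), kl(Q, P))          # ~0.35 vs ~0.42: KL(P||Q) != KL(Q||P)
print(kl(P, P))                    # 0.0: a distribution has zero divergence from itself
```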


Symmetric similarity functions where the self-similarity may not be maximum are actually widely used in machine learning. When working with structured data in a table, the (unnormalized) inner product between two of the objects (vectors in this case) is a symmetric similarity function where self-similarity may not be the maximum. Mercer kernels are essentially a generalization of the inner product to any kind of data — they are symmetric, though self-similarity may not be the maximum. They are quite popular in machine learning, and Mercer kernels have been defined for text, graphs, time series and images. Efficient search with general Mercer kernels has only received attention recently, opening avenues for search-based learning with similarity functions which are less restrictive and more domain-relevant. Even though the research on this topic is far from mature, it is a step in the right direction.
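
A two-line illustration of the inner-product example above, with made-up vectors: the plain inner product (equivalently, the linear kernel) is symmetric, yet a point need not be its own best match under it:

```python
# Hedged sketch: the raw inner product is symmetric, but self-similarity
# need not be the maximum (vectors are made up for illustration).
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])
print(np.dot(a, a), np.dot(a, b))   # 1.0 vs 3.0: a is "more similar" to b than to itself
```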