Continuing with my discussion here, I wish to understand (and think about) some applications of the vanishing ideal. The authors (Roi Livni et al., 2013 [1]) present a very useful application to classification (binary and, by extension, multi-class). Consider a labeled dataset $S = \{ (x_i, y_i), i = 1, \ldots, n \}$ such that $x_i \in \mathbb{R}^d$ and $y_i \in \{ 1, \ldots, L \}$. Then the authors suggest the following:
- Split $S$ into $S_l = \{ x \colon (x, y) \in S, y = l \}$, $l = 1, \ldots, L$ (the set of points with label $l$).
- Find the set of generators $F(S_l)$ of the vanishing ideal of $S_l$ for each of the labels $l$.
- For every point $x \in \mathbb{R}^d$ such that $(x, y) \in S$, create a new representation $\widetilde{x} = [ |f_1^1(x)|, |f_2^1(x)|, \ldots, |f_1^l(x)|, |f_2^l (x)|, \ldots ] $, where $f_i^l$ is the $i^{\text{th}}$ generator in the set $F(S_l)$ for label $l$ and the concatenation runs over all generators of all labels $l = 1, \ldots, L$.
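Here is a minimal sketch of this construction, assuming a hypothetical `vanishing_generators(points)` routine that stands in for the VCA procedure of [1] and returns the generator polynomials of a point set as plain Python callables:

```python
import numpy as np

def build_features(X, y, labels, vanishing_generators):
    """Build the |f_i^l(x)| representation from per-class generators.

    `vanishing_generators` is assumed to take an (n_l, d) array of points
    and return a list of callables f with f(x) ~ 0 on those points.
    """
    # Steps 1-2: split the data by label and run VCA on each class separately.
    generators = {l: vanishing_generators(X[y == l]) for l in labels}
    # Flatten the generators of all labels into one ordered list.
    all_gens = [f for l in labels for f in generators[l]]
    # Step 3: the new representation is |f(x)| for every generator f.
    X_tilde = np.array([[abs(f(x)) for f in all_gens] for x in X])
    return X_tilde, generators
```

A linear classifier (a perceptron or a linear SVM) would then be trained on `X_tilde` rather than on `X`.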
Given this representation, the training points of the different classes are linearly separable, and a simple perceptron or a linear support vector machine can be used to effectively learn a nonlinear classifier (a nonlinear transformation followed by a linear classifier). There is an assumption needed to get this linear separability: each set $S_l$ is a subset of an algebraic set $U_l$, and these algebraic sets $U_l, l = 1,\ldots,L$ do not intersect with each other $ \left( U_{l} \cap U_{l'} = \emptyset, l \not= l' \right)$. However, the authors claim that "it is enough to choose those polynomials that generate algebraic sets $U_l, l = \{1, \ldots, L\}$ whose intersection is empty".
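To make the separability concrete, here is a small worked example of my own (it is not from [1]). Suppose $d = 2$, $L = 2$, class 1 lies on the unit circle, and class 2 lies on the circle of radius 2, so that $F(S_1) = \{ f^1 \}$ with $f^1(x) = x_1^2 + x_2^2 - 1$ and $F(S_2) = \{ f^2 \}$ with $f^2(x) = x_1^2 + x_2^2 - 4$. For any training point $x$ of class 1, $$\widetilde{x} = \left[ \, |f^1(x)|, |f^2(x)| \, \right] = [0, 3],$$ while for any training point $x$ of class 2, $\widetilde{x} = [3, 0]$, so a hyperplane such as $\widetilde{x}_1 - \widetilde{x}_2 = 0$ separates the two classes.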
This is a very powerful method (in my opinion): it learns nonlinear representations of the data and then eventually just uses a very simple classifier (linear in this case). To me the features themselves appear very robust, even though I do not have any mathematical justification for that -- if $x$ is slightly perturbed, $\widetilde{x}$ should not move too much; however, $x$ is involved in the learning of the set of generators $F(S_l)$, which might change when $x$ changes. A thing to note -- while the representations linearly separate the classes, the representation learning process (in this case, the vanishing component analysis) does not explicitly try to separate the two classes. Karampatziakis & Mineiro, 2014 [2] propose a way of learning representations of the data that explicitly tries to minimize the correlation between the two classes via a generalized eigenvalue decomposition. Maybe I'll tackle that paper in a future post.
To me, it is a little unfortunate that this idea/technique is currently only usable for supervised learning. As I rambled here, VCA could be used to quantify the complexity of the data somehow. Reading about the above application to classification, the following question popped into my head:
Given a dataset $S$ and an integer $k$, can we find $k$ non-overlapping subsets $S_1, \ldots, S_k$ of $S$ such that their corresponding algebraic sets collectively have low complexity (with respect to the sets of generators of the vanishing ideals of the $S_i$)?
If $\mathcal{C}(S_i)$ is the complexity cost of a subset $S_i \subseteq S$ and depends on the set of generators $F(S_i)$ of the vanishing ideal of $S_i$, we want something like $$ \arg \min_{S_1, \ldots, S_k \,\colon\, S_i \cap S_j = \emptyset \ \forall\, i \not= j, \ \bigcup_{i = 1}^k S_i = S} \; \sum_{i = 1}^k \mathcal{C}(S_i).$$
This is actually a very naive extension of the $k$-means problem, where $$\mathcal{C}(S_i) = \sum_{x \in S_i} \| x - \mbox{Mean}(S_i) \|_2^2 = \frac{1}{2 |S_i|} \sum_{x, x' \in S_i} \|x - x' \|_2^2,$$ but with a more complicated notion of complexity, where even computing the complexity (naively) requires a VCA.
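For completeness, here is the expansion behind that identity, summing over ordered pairs $x, x' \in S_i$ (including $x = x'$) and using $\sum_{x \in S_i} x = |S_i| \, \mbox{Mean}(S_i)$: $$ \sum_{x, x' \in S_i} \|x - x'\|_2^2 = 2 |S_i| \sum_{x \in S_i} \|x\|_2^2 - 2 \Big\| \sum_{x \in S_i} x \Big\|_2^2 = 2 |S_i| \sum_{x \in S_i} \| x - \mbox{Mean}(S_i) \|_2^2. $$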
To me, there does not appear to be any polynomial time algorithm for this problem. The naive way of trying all possible $O(k^{|S|})$ $k$-partitions of the dataset is the only approach that comes to mind. An iterative algorithm like Lloyd's algorithm for $k$-means takes advantage of the following facts:
- the computational cost of the complexity evaluation (evaluating $\mathcal{C}(S_i)$) is very cheap in $k$-means (linear in $|S_i|$), while doing the same for the $k$-algebraic-sets problem would naively require a complete VCA for each subset $S_i$
- the change in $\sum_i \mathcal{C}(S_i)$ when moving a point $x$ from set $S_i$ to set $S_{i'}$ is easily evaluated in the $k$-means problem; however, in the $k$-algebraic-sets problem, there is no obvious efficient way (to me at least) other than just recomputing the VCA for each of the new subsets $S_i$ and $S_{i'}$ -- see the sketch after this list
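To make that second point concrete, here is a minimal sketch of what a Lloyd-style heuristic for this problem might look like. The `vanishing_generators(points)` routine is the same hypothetical stand-in for VCA as before; every iteration reruns it for every cluster, which is exactly the expensive step described above (empty clusters and other practical issues are ignored):

```python
import numpy as np

def lloyd_style_algebraic_clustering(X, k, vanishing_generators,
                                     n_iters=20, seed=0):
    """Naive alternating heuristic: assign each point to the cluster whose
    generators come closest to vanishing on it, then rerun VCA per cluster."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(X))  # random initial partition
    gens = None
    for _ in range(n_iters):
        # Expensive step: a full VCA on every cluster, at every iteration.
        gens = [vanishing_generators(X[assign == i]) for i in range(k)]
        # Residual of a point w.r.t. a cluster: how far that cluster's
        # generators are from vanishing at the point.
        def residual(x, i):
            return sum(abs(f(x)) for f in gens[i])
        new_assign = np.array([min(range(k), key=lambda i: residual(x, i))
                               for x in X])
        if np.array_equal(new_assign, assign):  # assignments stabilized
            break
        assign = new_assign
    return assign, gens
```

Even this crude loop makes the cost asymmetry obvious: the assignment step is cheap once the generators are known, but refreshing the generators means a full VCA per cluster per iteration.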
[1] R. Livni, D. Lehavi, S. Schein, H. Nachlieli, S. Shalev-Shwartz & A. Globerson, Vanishing component analysis, ICML 2013
[2] N. Karampatziakis & P. Mineiro, Discriminative features via generalized eigenvectors, ICML 2014