Saturday, September 20, 2014

MLConf Atlanta, September 19, 2014 talk

At the risk of cheap self-promotion, and in a desperate attempt to end a posting drought, here is something I spoke about at MLConf Atlanta on September 19, 2014 (slides):

Given a set of objects S, a query q, and a notion of similarity between the query and the objects, the task of *similarity search* (or *nearest-neighbour search*) is to find the object in S most similar (or nearest) to q. This problem is pervasive in everyday applications. While it is not directly a learning problem in itself, the ability to perform this task efficiently enables various forms of learning. For example, if you are trying to learn a prediction function from a set of training examples, similarity search presents a model-free way of learning — for any test point p, find the points in the training set nearest (or most similar) to p and compute the prediction for p as the (weighted) aggregate of the function values at these nearest neighbours (in classification, this can be the weighted majority; in regression, the weighted average). A common way of detecting outliers in a set is to find the nearest neighbours of each point and use each point's similarity to its nearest neighbours to generate an outlier score.
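For concreteness, here is a minimal sketch of this prediction scheme in Python. The Euclidean distance, the inverse-distance weighting, and the function name are illustrative assumptions of mine, not anything from the talk:

```python
import numpy as np

def knn_predict(X_train, y_train, p, k=5):
    """Predict at test point p by aggregating its k nearest training points."""
    dists = np.linalg.norm(X_train - p, axis=1)  # distance from p to every training point
    nn = np.argsort(dists)[:k]                   # indices of the k nearest neighbours
    w = 1.0 / (dists[nn] + 1e-12)                # closer neighbours get larger weights
    # Regression: weighted average of the neighbours' function values.
    # (For classification, take the weighted majority vote instead.)
    return np.dot(w, y_train[nn]) / w.sum()
```

Note that the distance computation scans all of S; the efficient search algorithms discussed below exist precisely to avoid that linear scan.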

One of the things that makes search-based learning very powerful is that we make no assumptions about the dataset, nor any model or distributional assumptions about the function we are trying to learn. The learning is completely *nonparametric* — whatever we learn, we learn from the data made available to us. There is one implicit assumption I would like to mention here — if two points are close, their predictions are usually close.

There are two very important knobs in search-based learning — the notion of similarity and the threshold that separates the nearest-neighbours from the non-nearest-neighbours. While many standard notions of similarity exist, the notion of similarity should ideally be constructed with domain knowledge in mind. The threshold is a tuning parameter of the learning problem and is usually learned via cross-validation.
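As one illustration of the tuning step, the threshold knob (here, the number of neighbours k) might be set by cross-validation using scikit-learn's stock k-NN classifier; the toy data and the candidate grid below are assumptions of mine:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy binary labels

# Score each candidate neighbourhood size by 5-fold cross-validation
# and keep the best one.
search = GridSearchCV(
    KNeighborsClassifier(weights="distance"),
    param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                 # the cross-validated choice of k
```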

A similarity function can be defined between any pair of objects, and those defined in an intuitive, domain-specific way usually perform best in terms of quality. This is probably one of the most important ingredients (if not the most important) in search-based learning. Another key ingredient is the efficiency of the nearest-neighbour search — the faster the search, the faster the prediction (and the faster the tuning phase). The efficiency of search is greatly intertwined with the type of similarity function. At a high level, similarity functions can be categorized along two axes — one is whether the similarity function is symmetric (whether A's similarity to B is the same as B's similarity to A), and the other is whether self-similarity is maximum (whether A's similarity to A is maximal, or there can exist a B whose similarity to A is greater).
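To make the two axes concrete, here is a small empirical probe in Python; cosine similarity and the raw inner product are my illustrative picks because they land in different cells of this categorization:

```python
import numpy as np

def check_axes(sim, points):
    """Empirically test symmetry and maximal self-similarity on sample points."""
    symmetric = all(np.isclose(sim(a, b), sim(b, a))
                    for a in points for b in points)
    self_max = all(sim(a, a) >= sim(a, b) - 1e-12
                   for a in points for b in points)
    return symmetric, self_max

pts = [np.array(v) for v in ([1.0, 0.0], [3.0, 0.1], [0.5, 2.0])]
cosine = lambda a, b: (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(check_axes(cosine, pts))   # (True, True): symmetric, self-similarity maximal
print(check_axes(np.dot, pts))   # (True, False): symmetric, self-similarity not maximal
```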

Symmetric similarity functions with maximum self-similarity make sense, but they are mostly useful when we are comparing apples to apples — consider the situation where a duplicate of the query in our set of points would be our ideal match. Searching for similar images given a query image is a task where symmetric similarity functions with maximum self-similarity are useful. Asymmetric similarity functions are useful in situations where there is good reason for a sense of directionality in the similarity. For example, take the task of matching people on a dating website, where the similarity of A to B is governed by A's estimated attraction to B (and vice versa). In this case, the similarity function is asymmetric for good reason (many could be attracted to the same person, but that person does not usually find all of them attractive). Similarity functions where self-similarity is not maximum are useful in situations where we are comparing apples to oranges — the query is an apple and the objects in our set are oranges. Think, for example, of a user as the query, movies or products as the objects in our set, and a similarity function defined as the estimated preference of the user for a movie or product. If we are searching for the most preferred movie/product for the user, we do not really care about self-similarity — mostly because we would never compare the user to itself. Another example of this is document search, where the query is a single phrase and we are looking for the most relevant documents.

Most of the focus and progress of research on efficient search has been for symmetric similarity functions with maximum self-similarity. Metrics form a class of such (dis-)similarity functions, and almost all efficient search algorithms work solely for metrics. This has an unwanted side-effect — similarity functions between pairs of objects are coerced to be metrics, even when that is inappropriate, only to allow the use of efficient search algorithms; and since everybody tries to use metrics, research on efficient search continues to focus mostly on metrics.

Asymmetric similarity functions with maximum self-similarity are quite meaningful in many situations. When dealing with sets as individual objects, the size of the set-theoretic difference (the set-theoretic difference of A and B is the set of elements present in A but not in B; the set-theoretic difference of B and A is the set of elements present in B but not in A) is an asymmetric notion of (dis-)similarity where self-similarity is maximum. When the individual objects are probability distributions (a document can be thought of as a multinomial probability distribution over the terms of the dictionary), the KL-divergence (KL stands for Kullback-Leibler) is a widely used asymmetric notion of similarity between probability distributions (again with self-similarity being maximum). There has been some recent focus on efficient search with a class of asymmetric similarity measures with maximum self-similarity known as Bregman divergences. However, this research is still quite immature, and most of the work has focused on understanding the hardness of the problem in a theoretical sense.
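Both examples are easy to verify numerically; here is a quick sketch (the particular sets and distributions are made up for illustration):

```python
import numpy as np
from scipy.stats import entropy

# Set-theoretic difference: |A \ B| as an asymmetric dissimilarity.
A, B = {1, 2, 3, 4}, {3, 4, 5}
print(len(A - B), len(B - A))  # 2 1: asymmetric
print(len(A - A))              # 0: self-dissimilarity is minimal, i.e. self-similarity is maximal

# KL-divergence: entropy(p, q) computes sum(p * log(p / q)).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
print(entropy(p, q), entropy(q, p))  # two different values: asymmetric
print(entropy(p, p))                 # 0.0: a distribution is closest to itself
```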


Symmetric similarity functions where self-similarity may not be maximum are actually widely used in machine learning. When working with structured data in a table, the (unnormalized) inner-product between two of the objects (vectors in this case) is a symmetric similarity function where self-similarity may not be maximum. Mercer kernels are essentially a generalization of the inner-product to any kind of data — they are symmetric, though self-similarity may not be maximum. They are quite popular in machine learning, and Mercer kernels have been defined for text, graphs, time series, and images. Efficient search with general Mercer kernels has only recently received attention, opening avenues for search-based learning with similarity functions that are less restrictive and more domain-relevant. Even though the research on this topic is far from mature, it is a step in the right direction.
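A tiny example of the self-similarity point, using a polynomial kernel as one arbitrary Mercer kernel (my choice, purely for illustration):

```python
import numpy as np

# A polynomial Mercer kernel.
poly = lambda a, b: (a @ b + 1.0) ** 2

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.1])

print(np.isclose(poly(a, b), poly(b, a)))  # True: Mercer kernels are symmetric
print(poly(a, a), poly(a, b))              # 4.0 16.0: self-similarity is not maximal here
```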
