Efficient random projection trees for nearest neighbor search and related problems
Nearest neighbor search (NNS) is one of the most well-known problems in the field of computer science. It has been widely used in many different areas such as recommender systems, classification, clustering etc. Given a database S of n objects, a query q, and a measure of similarity, the naive way to solve a NNS problem is to perform a linear search over the objects in database S and return an object from S, which based on the similarity measure, is most similar to q. However, due to the growth of data in recent years, a solution better than linear time complexity is desirable. Locality sensitivity hashing (LSH) and random projection trees (RPT) are two popular methods to solve NNS problem in sublinear time. Earlier works have demonstrated that RPT has superior performance compared to LSH. However, RPT has two major drawbacks, namely, i) its high space complexity ii) if it makes a mistake at any internal node of a single tree, it cannot recover from this mistake and the rest of the search for that tree becomes useless. One of the main contributions of this thesis is to propose new methods to address these two drawbacks. To address the first issue, we design a sparse version of RPT which reduces the space complexity overhead without significantly affecting nearest neighbor search performance. To address the second issue, we develop various strategies that uses auxiliary information and priority function to improve nearest neighbor search performance of original RPT. We support our claims both theoretically and experimentally on many real-world datasets. A second contribution of the thesis is to use the RPT data structure to solve related search problems such as, maximum inner product search (MIPS) and nearest neighbor to query hyperplane (NNQH) search. Both these problems can be reduced to an equivalent NNS problem by applying appropriate transformations. In case of MIPS problem, we establish among many different transformations that reduce a MIPS problem to an equivalent NNS problem, which one is more preferable to be used in conjunction with RPT. In case of NNQH problem, the transformation that reduces NNQH problem to an equivalent NNS problem increases the data dimensionality tremendously and hence space complexity requirement of original RPT. In the latter case, we show that our sparse RPT version comes to rescue. Our NNQH solution which uses space efficient versions of RPT is used to solve active learning problem. We perform extensive empirical evaluations for both these applications on many real world datasets to show superior performance of our proposed methods compare to the state of the art algorithms.