I am trying to build a KD-tree (one per node, independently) for image features. I have extracted the image features; each feature vector contains, say, 1000 float values.
I am using map-reduce to distribute the images among the nodes of the cluster according to classification (e.g., cat, dog, gun), i.e., each node will contain a bunch of similar images and will then build a KD-tree of its images. I am confused about how the tree can be built.
So how can I build the KD-tree using map-reduce? Each node will contain its own tree, right? What could be the logic to distribute the images? While building the KD-tree, on what basis should I add image-feature vectors to the tree (i.e., to the left or right child)?
Any help is appreciated. Thanks in advance.
I don't think that a k-d tree is the right thing for your data. Here's what Wikipedia says about it:
k-d trees are not suitable for efficiently finding the nearest neighbour in high dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should be N >> 2^k. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search, and approximate nearest-neighbour methods should be used instead.
Your feature vectors have dimensionality k = 1000, so the rule asks for N >> 2^1000, on the order of 10^301 images, which is quite unlikely.
I suggest that you look at Locality-sensitive hashing, which is one of the mentioned approximate nearest-neighbor searches for high-dimensional data.
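To make that concrete, here is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity; the class and parameter names are made up, not from any particular library:

```java
import java.util.Random;

// A minimal sketch of random-hyperplane LSH for cosine similarity.
// Each feature vector is hashed to a short bit signature; vectors with a
// small angle between them flip few bits, so near neighbors tend to collide.
class RandomHyperplaneLsh {
    private final double[][] hyperplanes; // one random direction per signature bit

    RandomHyperplaneLsh(int numBits, int dim, long seed) { // numBits <= 32 for an int signature
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dim];
        for (double[] h : hyperplanes)
            for (int i = 0; i < dim; i++)
                h[i] = rnd.nextGaussian();
    }

    // Bit b is 1 iff the vector lies on the positive side of hyperplane b.
    int signature(float[] featureVector) {
        int sig = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0;
            for (int i = 0; i < featureVector.length; i++)
                dot += hyperplanes[b][i] * featureVector[i];
            if (dot >= 0)
                sig |= (1 << b);
        }
        return sig;
    }
}
```

You would bucket images by signature and compare a query only against images in the same bucket (and possibly in buckets within a small Hamming distance).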
Since Wikipedia is not always the best place to learn something complicated, I suggest you take a look at the respective lecture slides of the Data Mining course of ETH Zurich instead. It just so happens that I am taking this course in the current semester.
Is user inside volume? (OpenGL ES, Java, Android)
I have an OpenGL renderer that shows airspaces.
I need to determine whether my location, already converted to a float[3], is inside any of many volumes.
I also want to compute the distance to the nearest volume.
The volumes are arbitrary shapes extruded along the z axis.
What is the most efficient algorithm to do that?
I don't want to use an external library.
What you have here is a Nearest Neighbor Search problem. Since your meshes are constant and won't change, you should probably use a space partitioning algorithm. It's a big topic, but in short, you generally need to use a tree structure and sort all the objects into the various tree nodes. You'll need to pre-calculate the tree itself. There are plenty of books and tutorials on the net about space partitioning, and you could also look at the source code of, for example, id Software products like Doom, Quake, etc. to see how these algorithms (BSP, at least) are used. The efficiency of each algorithm depends on what you have and what you need. Using BSP trees, for example, you'll have the objects sorted from nearest to farthest, so you can quickly get the one you need.
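Whatever partitioning you pick, at the leaves you still need an exact per-volume test. Here is a minimal sketch, assuming each volume is stored as a 2D polygon plus a z range (the Volume class and its fields are made up for illustration); it uses the classic even-odd ray-casting test:

```java
// Hypothetical representation: a volume is a 2D polygon extruded along z.
class Volume {
    float[] xs, ys;   // polygon vertices in the xy-plane
    float zMin, zMax; // extrusion range along z

    // Is (x, y, z) inside the extruded polygon? Even-odd ray casting:
    // count how many polygon edges a horizontal ray from (x, y) crosses.
    boolean contains(float x, float y, float z) {
        if (z < zMin || z > zMax) return false;
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            // Does the edge (j -> i) cross the ray, and is the crossing to the right of x?
            if ((ys[i] > y) != (ys[j] > y)
                    && x < (xs[j] - xs[i]) * (y - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

The tree only narrows down which volumes are worth testing; this handles the "am I inside?" part for each candidate.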
First of all, I am new to programming, so I would appreciate simple and well-explained answers. Secondly, this is a very specific question, and I don't want moderators and other users to close it as off-topic or too broad.
Anyway, I want to implement Huffman coding in Java using some kind of data structure. However, I was thinking of using a splay tree, since it's something that will not be covered in my course's syllabus and I want to learn a new data structure. Now the main question: does the Huffman coding algorithm even require a splay tree data structure in the first place?
What could I use a splay tree for in my Huffman-based data compression project? Or would you rather suggest a better data structure for this project (better in terms of efficiency, and maybe creativity, in the sense that it's unique and not so well known)?
Thanks
Any Huffman code can be represented by the structure of a binary tree, whose leaves are the symbols to be encoded. When following a path from the root to the symbol to be encoded, left and right branches can be represented as 0 or 1 bits; the result is a correct prefix code, with code lengths specified by the depth of the symbols.
Ideally, you would use the structure of the splay tree directly, to determine the Huffman code for each symbol. However, splay trees maintain their data in the nodes, not the leaves. You will either need to find some way to use a splay tree based on data in the leaves, or come up with a transformation that computes a valid (and efficient) set of prefix codes from node locations instead.
One possibility is to maintain the leftmost and rightmost leaf of each subtree in its root node (to be updated as the tree is splayed, of course). This should allow you to search for leaves, even though you don't actually care about your node data as such. Conventional splaying operations should then naturally generate a dynamic Huffman code biased towards recently occurring symbols.
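To make the tree-to-code correspondence concrete, here is a minimal sketch that derives each symbol's prefix code from its root-to-leaf path; it assumes an ordinary binary tree (not the splay-tree variant discussed above), and the class names are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Leaves carry symbols; internal nodes only have children.
class HuffmanNode {
    HuffmanNode left, right;
    Character symbol; // non-null only at leaves
}

class CodeExtractor {
    // Walk the tree: left appends '0', right appends '1'; the accumulated
    // path at each leaf is that symbol's prefix code.
    static Map<Character, String> codes(HuffmanNode root) {
        Map<Character, String> table = new HashMap<>();
        walk(root, "", table);
        return table;
    }

    private static void walk(HuffmanNode node, String prefix, Map<Character, String> table) {
        if (node.symbol != null) {          // leaf: record the code
            table.put(node.symbol, prefix);
            return;
        }
        walk(node.left, prefix + "0", table);
        walk(node.right, prefix + "1", table);
    }
}
```

Because symbols sit only at the leaves, no code is a prefix of another, which is exactly the prefix-code property described above.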
I have a large list of regions with 2D coordinates. None of the regions overlap. The regions are not immediately adjacent to one another and do not follow a placement pattern.
Is there an efficient lookup algorithm that can be used to let me know what region a specific point will fall into? This seems like it would be the exact inverse of what a QuadTree is.
The data structure you need is called an R-tree. Most R-tree implementations permit a "within" or "intersection" query, which will return any geographic area containing or overlapping a given region; see, e.g., Wikipedia.
There is no reason that you cannot build your own R-tree; it's just a variant of a balanced B-tree which can hold extended structures and allows some overlap. This implementation is lightweight, and you could use it here by wrapping your regions in rectangles. Each query might return more than one result, but you could then check the underlying region. It's probably an easier solution than trying to build a polyline-supporting R-tree variant.
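To illustrate the wrapping idea, here is a minimal filter-then-refine sketch; Region, containsPoint(), and the other names are assumptions, and a plain scan stands in for the R-tree's within-query:

```java
import java.util.List;

// The exact (possibly irregular) region geometry; an assumption here.
interface Region {
    boolean containsPoint(double x, double y);
}

// What the R-tree would actually store: a bounding rectangle per region.
class BoundedRegion {
    double minX, minY, maxX, maxY;
    Region region;

    boolean boxContains(double x, double y) {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }
}

class PointLookup {
    // Cheap rectangle test first, exact test only on the survivors.
    static Region find(List<BoundedRegion> candidates, double x, double y) {
        for (BoundedRegion b : candidates) {
            if (b.boxContains(x, y) && b.region.containsPoint(x, y)) {
                return b.region; // regions don't overlap, so the first hit wins
            }
        }
        return null; // the point falls outside every region
    }
}
```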
What you need, if I understand correctly, is a point location data structure: that is, as you put it, somehow the opposite of a quad- or R-tree. In a point location data structure you store a set of regions, and the queries are of the form: given a point p, return the region that contains it.
Several point location data structures exist. The most famous, and the one that achieves the best performance, is Kirkpatrick's, also known as triangulation refinement; it achieves O(n) space and O(log n) query time, but it is also famously hard to implement. On the other hand, there are several simpler data structures that achieve O(n) or O(n log n) space but O(log^2 n) query time, which is not that bad and far easier to implement; for some of them the query time can be reduced to O(log n) using a method called fractional cascading.
I recommend you take a look at chapter 6 of de Berg, Overmars, et al., Computational Geometry: Algorithms and Applications, which explains the subject in a way that is very easy to grasp, though it doesn't include Kirkpatrick's method; you can find that in Preparata's book or read it directly from Kirkpatrick's paper.
BTW, several of these structures assume that your regions do not overlap but are adjacent (regions share edges), that the edges form a connected graph, and sometimes that the regions are triangular. In all cases you can extend your set of regions by adding new edges, and don't worry about doing so: the extra space needed will still be linear, since the final set of regions will induce a planar graph. So you can extend your set of regions without worrying about too much growth in space.
I am looking for a library which would give me the exact coordinates of each node in a tree (any tree, not just binary trees).
Let's say I define the tree in the following notation
(() (() (() () ())))
And some library gives me the coordinates like this:
[500 0]([200 50]() [600 50]([500 100]() [750 100]([600 150]() [700 150]() [800 150]()))
or any other notation which uniquely represents a tree.
This kind of library would enable space-efficient drawing of trees and would also solve the problem of overlapping nodes and links. For example, if a tree is in fact a list, I would like the library to take that into account and arrange the nodes in a single column or row, to save space.
If nothing similar exists, an algorithm would also come in handy, provided that it can be implemented relatively easily.
I think the Nested Set model may help you.
The algorithm is fairly simple, and very efficient for reads, although updates to the tree are a little more expensive, because the boundaries of the nodes have to be updated in a cascading fashion. Here is the algorithm implemented in SQL.
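A minimal sketch of the numbering pass, assuming a simple hypothetical Node class: each node gets a [left, right] interval in one depth-first walk, and a parent's interval strictly contains its descendants' intervals, so e.g. left can serve as an x-coordinate and depth as a y-coordinate:

```java
import java.util.ArrayList;
import java.util.List;

class Node {
    List<Node> children = new ArrayList<>();
    int left, right, depth; // nested-set interval plus depth for layout
}

class NestedSet {
    private int counter = 1;

    // One depth-first traversal assigns all intervals.
    void number(Node node, int depth) {
        node.depth = depth;
        node.left = counter++;          // assigned on the way down
        for (Node child : node.children) {
            number(child, depth + 1);
        }
        node.right = counter++;         // assigned on the way back up
    }
}
```

Since siblings get disjoint intervals, nodes laid out by (left, depth) can never overlap, and a degenerate list-shaped tree collapses into a narrow band.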
I am sorry if my question sounds stupid :)
Can you please recommend me any pseudocode or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about LSI (latent semantic indexing), but they were full of math.
I know LSI is full of math. But if I see some source code or an algorithm, I understand things more easily. That's why I asked here: so many gurus are here!
Thanks in advance
The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
Same for documents: the more common/similar words two documents share, the more similar the documents themselves are. So, you can express the similarity of documents by the frequencies of their words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where columns represent documents, rows represent words, and each cell [i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How do you compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
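For illustration, a minimal sketch of that comparison (the class name is made up):

```java
class Cosine {
    // Cosine similarity of two equal-length column vectors: 1.0 means the
    // documents use words in identical proportions; values near 0 mean
    // they have (almost) no words in common.
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```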
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep the most "important" vectors. After reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this (a minimal sketch of the first two steps follows the list):
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
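For instance, here is a minimal sketch of steps 1 and 2, using raw term counts instead of tf-idf for brevity; the class and method names are made up:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CooccurrenceMatrix {
    // rows = words from the vocabulary, columns = documents
    static double[][] build(List<String> documents, List<String> vocabulary) {
        Map<String, Integer> rowOf = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++)
            rowOf.put(vocabulary.get(i), i);

        double[][] m = new double[vocabulary.size()][documents.size()];
        for (int j = 0; j < documents.size(); j++) {
            // crude whitespace tokenization; real code would normalize harder
            for (String word : documents.get(j).toLowerCase().split("\\s+")) {
                Integer i = rowOf.get(word);
                if (i != null)
                    m[i][j] += 1; // raw frequency of word i in document j
            }
        }
        return m;
    }
}
```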
If you're going to write an LSA library by yourself, a good point to start is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, like Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't build the full matrix and is completely incremental, which makes it much more computationally efficient.
This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site, if you are interested.
The process is far less complicated than it is often written up to be. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short take-away bits:
1) You create a matrix/dataset/etc. with the word counts of various documents - the documents will be your columns and the distinct words your rows.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have the word, document, and sigma vectors, you shrink them equally to some rank k by copying smaller parts of each vector/matrix and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI. A sketch using Jama follows.
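Here is a minimal sketch of steps 2 and 3 with Jama; LsiSketch and reduce are made-up names, and note that Jama's SVD expects at least as many rows as columns (here, at least as many words as documents):

```java
import Jama.Matrix;
import Jama.SingularValueDecomposition;

class LsiSketch {
    // Reduce a word-by-document count matrix to rank k and reassemble it.
    static Matrix reduce(double[][] counts, int k) {
        Matrix a = new Matrix(counts);
        SingularValueDecomposition svd = a.svd();

        int m = a.getRowDimension();
        int n = a.getColumnDimension();

        // Keep the first k columns of U and V, and the top-left k-by-k block
        // of the diagonal matrix S (Jama sorts singular values descending).
        Matrix uK = svd.getU().getMatrix(0, m - 1, 0, k - 1);
        Matrix sK = svd.getS().getMatrix(0, k - 1, 0, k - 1);
        Matrix vK = svd.getV().getMatrix(0, n - 1, 0, k - 1);

        // The rank-k approximation: U_k * S_k * V_k^T
        return uK.times(sK).times(vK.transpose());
    }
}
```

In practice you would often keep the three truncated factors rather than the reassembled matrix, and compare documents by cosine similarity in the reduced space.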
Here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric