Plotting high-dimensional data in Java

I have thousands of data points, and each data point has 50 dimensions. I would like to visualize the sparseness of the data using Java. Are there any Java packages or methods to plot such high-dimensional data?

What you need to look for is multidimensional scaling (MDS). It reduces the dimensionality of the data space while trying to preserve the pairwise distances.
So you take an MDS package, reduce your data to 2D (or 3D), and plot the result using a 2D/3D graphics library (Swing, JOGL).
It may or may not work well, depending on the number of data points and the space they live in. With 50 dimensions you might be out of luck, but it really depends.
A quick Google search for a Java implementation turned up this.
There's an MDS package in R too, so you can use that.
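For a concrete starting point, here is a minimal sketch of classical MDS using the Jama matrix library (the same library mentioned in an answer further down). The class name is made up, and the dense n-by-n eigendecomposition only scales to a few thousand points, so treat it as an experiment rather than a finished tool:

import Jama.EigenvalueDecomposition;
import Jama.Matrix;

public class ClassicalMDS {
    // Reduce n points to 2D, given their n-by-n pairwise distance matrix.
    // Classical MDS: double-center the squared distances, then take the top
    // two eigenvectors scaled by the square roots of their eigenvalues.
    public static double[][] reduceTo2D(double[][] dist) {
        int n = dist.length;

        // Squared distances.
        Matrix d2 = new Matrix(n, n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                d2.set(i, j, dist[i][j] * dist[i][j]);

        // B = -1/2 * J * D2 * J, with centering matrix J = I - (1/n) * ones.
        Matrix jc = Matrix.identity(n, n).minus(new Matrix(n, n, 1.0 / n));
        Matrix b = jc.times(d2).times(jc).times(-0.5);

        // For a symmetric matrix, Jama returns eigenvalues in ascending
        // order, so the two largest sit in the last two columns of V.
        EigenvalueDecomposition eig = b.eig();
        double[] vals = eig.getRealEigenvalues();
        Matrix v = eig.getV();

        double[][] coords = new double[n][2];
        for (int k = 0; k < 2; k++) {
            int col = n - 1 - k;
            double scale = Math.sqrt(Math.max(vals[col], 0.0));
            for (int i = 0; i < n; i++)
                coords[i][k] = v.get(i, col) * scale;
        }
        return coords; // plot these with Graphics2D, JOGL, etc.
    }
}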

Related

Collision / distance to volumes from point in OpenGL ES on Android

Is the user inside a volume? (OpenGL ES, Java, Android)
I have an OpenGL renderer that shows airspaces.
I need to determine whether my location, already converted to a float[3], is inside any of many volumes.
I also want to compute the distance to the nearest volume.
The volumes are arbitrary shapes extruded along the z axis.
What is the most efficient algorithm to do that?
I don't want to use an external library.
What you have here is a nearest-neighbor search problem. Since your meshes are constant and won't change, you should probably use a space-partitioning algorithm. It's a big topic, but in short, you generally need to use a tree structure and sort all the objects into the various tree nodes. You'll need to pre-calculate the tree itself. There are plenty of books and tutorials on the net about space partitioning, and you could also look at the source code of, for example, id Software products like Doom, Quake, etc. to see how these algorithms (BSP, at least) are used. The efficiency of each algorithm depends on what you have and what you need. Using BSP trees, for example, you'll have the objects sorted from nearest to farthest, so you can quickly get the one you need.
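Since the question says the volumes are arbitrary shapes extruded along the z axis, the containment test itself reduces to a z-range check plus a 2D point-in-polygon test; the space-partitioning tree only needs to narrow down which volumes to test. A minimal sketch of the per-volume test (the class and field names are illustrative, not from the question's code):

public class ExtrudedVolume {
    float[] px, py;   // 2D outline of the volume's cross-section
    float zMin, zMax; // extrusion range along the z axis

    // z-range check plus a standard ray-casting point-in-polygon test.
    boolean contains(float x, float y, float z) {
        if (z < zMin || z > zMax) return false;
        boolean inside = false;
        for (int i = 0, j = px.length - 1; i < px.length; j = i++) {
            if ((py[i] > y) != (py[j] > y)
                    && x < (px[j] - px[i]) * (y - py[i]) / (py[j] - py[i]) + px[i]) {
                inside = !inside;
            }
        }
        return inside;
    }
}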

Java library to find all polygons containing a point

I have to store a set of 2D polygons in memory (fewer than 1000) in a structure that lets me efficiently find the ones containing a given point. The polygons never change and have about 10 vertices each.
I have to run the query about 10,000 times per second.
I guess a structure using quadtrees or similar, together with the polygons' bounding boxes, would do what I need.
Does anybody know a free Java library offering this?
I don't think there's such a library, but as a structure you can use https://docs.oracle.com/javase/8/docs/api/java/awt/Polygon.html. It even has a method to check for point inclusion.
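As a sketch of how that might look with a cheap bounding-box prefilter in front of the exact test (the class is made up for illustration):

import java.awt.Polygon;
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

public class PolygonIndex {
    private final List<Polygon> polygons;
    private final List<Rectangle> bounds = new ArrayList<>();

    public PolygonIndex(List<Polygon> polygons) {
        this.polygons = polygons;
        for (Polygon p : polygons) bounds.add(p.getBounds()); // precomputed once
    }

    // Cheap bounding-box rejection first; exact test only for candidates.
    public List<Polygon> containing(int x, int y) {
        List<Polygon> hits = new ArrayList<>();
        for (int i = 0; i < polygons.size(); i++) {
            if (bounds.get(i).contains(x, y) && polygons.get(i).contains(x, y)) {
                hits.add(polygons.get(i));
            }
        }
        return hits;
    }
}

With fewer than 1000 polygons of about 10 vertices each, this linear scan may well sustain 10,000 queries per second; a quadtree over the bounding boxes is the next step if it doesn't.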

How can I optimize the rendering of a graph while preserving the shape?

I've been writing a program to graph data points in Java. I need a lot of flexibility and speed, so I want to avoid existing libraries as much as possible. Right now, it essentially uses Graphics2D to draw lines and dots representing the points in a data file.
My problem is that some of my datasets have upwards of 100,000 points. When rendered with full drag/zoom functionality, it gets quite slow.
My question is: how can I reduce this dataset, or make a simplification of it, so that I can display the graph without changing its overall shape?
I could draw only every third point, for instance, but what if that skipped over an important outlier and failed to display it? I could try averaging groups of points, but that could have the same problem.
And for services like Google Finance, which probably have millions of points to display, how do they deal with this?
You may want to check the range difference between neighboring points before rendering them: give them a threshold they need to stay within in order not to be re-rendered.
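One common way to make that concrete without losing outliers (a sketch of the idea, not a drop-in for your renderer) is to collapse all points falling in the same pixel column into that column's minimum and maximum:

import java.awt.Graphics2D;

public class Decimator {
    // Draw a dataset already projected to screen coordinates (sorted by x)
    // as at most one vertical line per pixel column, spanning the column's
    // minimum and maximum y. Spikes survive because extremes are kept
    // rather than averaged away.
    static void drawDecimated(Graphics2D g, int[] xPix, int[] yPix) {
        int i = 0;
        while (i < xPix.length) {
            int col = xPix[i];
            int minY = yPix[i], maxY = yPix[i];
            while (i < xPix.length && xPix[i] == col) {
                minY = Math.min(minY, yPix[i]);
                maxY = Math.max(maxY, yPix[i]);
                i++;
            }
            g.drawLine(col, minY, col, maxY);
        }
    }
}

This caps the drawing work at roughly twice the screen width per frame, no matter how many points the dataset has.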

PageRank implementation for Research

After reading the theory of the PageRank algorithm on this site, I would like to play with it.
I am trying to implement it in Java. I mean, I would like to play with PageRank in detail (giving different weights and so on). For this I need to build the hyperlink matrix. If I have 1 million nodes, then my hyperlink matrix will be 1 million × 1 million in size, which causes this exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at WebGraph.main(WebGraph.java:6)
How can I implement PageRank in Java? Is there any way of storing the hyperlink matrix?
That is a great article for learning about PageRank. I implemented a Perl version from it here, to use with TextRank. However, if you just want to learn about PageRank and how the various aspects discussed in the article affect the results (damping factor, directed or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it from scratch, as you are doing, is best.
Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure used to represent a sparse matrix is a hash map, in which the zero values are not stored. For example, if the matrix were
1, 0, 0
0, 0, 2
0, 3, 0
a two-dimensional hash map would store only the values hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 × 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash map will use about 5*(8+8+8)*10^6 bytes ≈ 115 MB to store it (8 bytes for the row index, 8 for the column index, and 8 for the double value). The full square matrix would need 8*10^6*10^6 bytes ≈ 7 terabytes.
Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are existing implementations if you don't want to devote time to that aspect of the algorithm (a minimal sketch follows below). The sparse-matrix multiply is the most difficult part of the PageRank algorithm to implement; after that it gets easier (and more interesting).
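Here is a minimal sketch of the hash-map representation described above, with the sparse matrix-vector multiply that PageRank's iteration needs (the nested-map layout with boxed keys is chosen for clarity, not efficiency; a serious version would use primitive arrays):

import java.util.HashMap;
import java.util.Map;

public class SparseMatrix {
    // Row index -> (column index -> value); zero entries are simply absent.
    private final Map<Integer, Map<Integer, Double>> rows = new HashMap<>();

    public void set(int i, int j, double v) {
        rows.computeIfAbsent(i, k -> new HashMap<>()).put(j, v);
    }

    // y = M * x, touching only the stored non-zero entries.
    public double[] multiply(double[] x) {
        double[] y = new double[x.length];
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : row.getValue().entrySet()) {
                sum += e.getValue() * x[e.getKey()];
            }
            y[row.getKey()] = sum;
        }
        return y;
    }
}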
The Python networkx module has a nice implementation of PageRank. It uses scipy/numpy for the matrix implementation. The two questions below on Stack Overflow should be enough to get you started.
How do weighted edges affect PageRank in networkx?
Networkx: Differences between pagerank, pagerank_numpy, and pagerank_scipy?
A few suggestions:
Use Python, not Java: Python is an excellent prototyping language and has sparse matrices available (in scipy), as well as many other goodies. As others have noted, it also has a PageRank implementation.
Don't keep all your data in memory: any kind of lightweight database would be fine, for instance sqlite, hibernate, ...
Work on tiles of the data: if there is a big N×N matrix, break it up into small M×M tiles, where M is a fraction of N, that fit in memory (see the sketch below). Combined with sparse matrices, this lets you work with really big N (hundreds of millions to billions, depending on how sparse the data is).
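A hedged sketch of that tiling idea, in Java for consistency with the question (loadTile is a hypothetical stub standing in for real disk or database I/O):

public class TiledMultiply {
    // y = A * x computed one tile at a time, so only a single M-by-M tile
    // of the N-by-N matrix needs to be in memory at once.
    static double[] multiply(int n, int m, double[] x) {
        double[] y = new double[n];
        for (int bi = 0; bi < n; bi += m) {
            for (int bj = 0; bj < n; bj += m) {
                double[][] tile = loadTile(bi, bj, m, n);
                for (int i = 0; i < tile.length; i++)
                    for (int j = 0; j < tile[i].length; j++)
                        y[bi + i] += tile[i][j] * x[bj + j];
            }
        }
        return y;
    }

    // Stub: returns the tile whose top-left corner is (bi, bj), clipped to
    // the matrix bounds. Replace with code that reads the tile from disk.
    static double[][] loadTile(int bi, int bj, int m, int n) {
        return new double[Math.min(m, n - bi)][Math.min(m, n - bj)];
    }
}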
As Dan W suggested, try increasing the heap size. If you run your Java application from the command line, just add the -Xmx switch with the desired heap size. Assuming you compiled your Java code into a runnable JAR file called pagerank.jar and want to set your heap size to 512 MB, you would issue the following command:
java -Xmx512m -jar pagerank.jar
EDIT:
But that only works if you don't have that many "pages" ... a 1 million × 1 million array is too big to fit into your RAM (10^12 entries × 8 bytes per double ≈ 7.28 terabytes). You should change your algorithm to load chunks of data from disk, manipulate them, and store them back to disk.
You could use a graph database like Neo4j for that purpose.
You don't have to store the whole 1,000,000 × 1,000,000 matrix, because most of its entries will be zero. Instead, you can (for example) store a list of the non-zero entries for each row, and write your matrix functions to use that directly, without expanding it into a full matrix.
This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.
One disadvantage of sparse matrices is that multiplying two of them yields a result that is much less sparse. However, the PageRank algorithm is designed so that you never need to do that: the hyperlink matrix is constant, and only the score vector is updated, as the sketch below illustrates.
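To make that concrete, here is a hedged sketch of the power iteration that never materializes the matrix at all and just walks out-link adjacency lists (0.85 is the conventional damping factor; dangling nodes are skipped for brevity):

import java.util.Arrays;

public class PageRank {
    // Each iteration redistributes rank along the constant link structure;
    // only the score vector changes between iterations.
    static double[] compute(int[][] outLinks, int iterations) {
        int n = outLinks.length;
        double damping = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) continue; // dangling node
                double share = damping * rank[i] / outLinks[i].length;
                for (int j : outLinks[i]) next[j] += share;
            }
            rank = next;
        }
        return rank;
    }
}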
PageRank is computed at Google with the 'Pregel' BSP framework (those are really just keywords to search for).
I remembered Apache Giraph (another Pregel implementation), which includes a version of PageRank in its benchmark package.
Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.
If that doesn't work:
In Java there is an implementation of Pregel called GoldenOrb.
Pseudocode for the PageRank algorithm is here (in a different implementation of Pregel).
You'll have to read up on BSP and PageRank to handle the size of data you have.
Because the matrix is sparse, you can apply dimensionality reduction such as SVD, PCA, MDS, or LSI (which builds on SVD). There is a library for implementing these kinds of computations called Jama. You can find it here.
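As an illustration, a small truncated-SVD helper with Jama might look like the following (the class and method names are made up; note that Jama stores matrices densely and its SVD expects at least as many rows as columns, so this is for small-scale experiments, not a million-page graph):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class SvdReduce {
    // Project the rows of a dense matrix onto its first k left singular
    // directions (Jama returns singular values in descending order).
    static double[][] reduce(double[][] data, int k) {
        SingularValueDecomposition svd = new Matrix(data).svd();
        Matrix u = svd.getU();
        double[] s = svd.getSingularValues();
        double[][] out = new double[data.length][k];
        for (int i = 0; i < data.length; i++)
            for (int j = 0; j < k; j++)
                out[i][j] = u.get(i, j) * s[j];
        return out;
    }
}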

Applet A* pathfinding using large byte[] arrays - heap space error

I've written a basic Java applet which works as a map viewer (like Google Maps) for a game fansite.
In it, I've implemented an A* pathfinding algorithm on a 2D map with 16 different floors, "connected" at certain points. The floors are stored in PNG images which are downloaded when needed and converted to byte arrays. The node cost is derived from the pixel RGB values and stored in the byte arrays.
The map contains about 2 million tiles spread over 16 floors. The images are 1475 × 2000 pixels (15–140 KB as PNGs), so some of the floors contain a lot of empty tiles.
The byte arrays will be huge in memory, resulting in a “java.lang.OutOfMemoryError: Java heap space” error for most JVM configurations.
So my questions are:
Is there any way to reduce the size of these byte arrays and still have the pathfinder work properly?
Should I take a different approach to finding the optimal path, one that doesn't keep the tiles in memory?
I would think finding a path on the web server would be too CPU intensive.
You've just run into the biggest problem with A*: its memory requirement is proportional to the size of the state space.
You have a few options here.
The first would be to change your search algorithm from A* to IDA* (a skeleton is sketched after the reading list below), and add search enhancements like a memory cache to remember as many previously searched node costs as possible.
Another alternative is to keep A* but move to hierarchical search. This will likely require you to do some preprocessing on your image files, however.
You'll find several good resources (downloadable papers) on this subject here: http://webdocs.cs.ualberta.ca/~holte/CMPUT651/readinglist.html
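A hedged skeleton of the IDA* loop from the first suggestion (the Node interface and the h/cost placeholders are illustrative stand-ins for your tile graph; a production version should also avoid re-expanding the node it just came from):

import java.util.List;

public class IdaStar {
    interface Node {
        List<Node> neighbors();
        boolean isGoal();
    }

    static final int FOUND = -1;

    // Iterative-deepening A*: repeated cost-bounded depth-first searches.
    // Memory use is proportional to path length rather than to the open
    // set, which is the point of switching away from plain A*.
    static int search(Node start) {
        int bound = h(start);
        while (true) {
            int t = dfs(start, 0, bound);
            if (t == FOUND) return bound;          // bound is the path cost
            if (t == Integer.MAX_VALUE) return -1; // no path exists
            bound = t;                             // deepen and retry
        }
    }

    static int dfs(Node node, int g, int bound) {
        int f = g + h(node);
        if (f > bound) return f; // prune; report the smallest overshoot
        if (node.isGoal()) return FOUND;
        int min = Integer.MAX_VALUE;
        for (Node next : node.neighbors()) {
            int t = dfs(next, g + cost(node, next), bound);
            if (t == FOUND) return FOUND;
            if (t < min) min = t;
        }
        return min;
    }

    // Placeholders: plug in your tile heuristic (e.g. straight-line or
    // Manhattan distance) and a per-tile move cost from the RGB values.
    static int h(Node n) { return 0; }
    static int cost(Node a, Node b) { return 1; }
}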
