Problems with SVD in Java

I have gone through JAMA and Colt (I code in Java). Both of them expect arrays in which the number of rows is at least the number of columns.
But in the case of Latent Semantic Analysis (LSA), I have 5 books and a total of roughly 1000 words. When I build a term-document matrix I get a 5*1000 matrix.
Since this does not work, I am forced to transpose the matrix, which gives me a 1000*5. When I perform an SVD on the 1000*5 matrix, I get an S matrix that is 5*5. This 5*5 matrix looks too small to perform dimensionality reduction on.
What can be done?

The text segment size you are using is way too large. A document (column) should represent a page or a few pages of text, perhaps a chapter at the largest. I have seen paragraph-sized documents used as well.
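For what it's worth, here is a minimal sketch of the transpose workaround using JAMA (the library named in the question); the toy matrix values are made up:

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class LsaSketch {
    public static void main(String[] args) {
        // Toy 3-document x 6-term matrix; the real input would be 5 x ~1000.
        double[][] docsByTerms = {
            {1, 0, 2, 0, 1, 3},
            {0, 1, 0, 2, 1, 0},
            {2, 0, 1, 0, 0, 1}
        };
        // JAMA's svd() expects rows >= columns, so decompose the
        // transpose (terms x documents) instead.
        Matrix termsByDocs = new Matrix(docsByTerms).transpose();
        SingularValueDecomposition svd = termsByDocs.svd();

        // S is k x k with k = the number of documents, so with 5 books
        // there are never more than 5 singular values to truncate.
        double[] sv = svd.getSingularValues();
        System.out.println("singular values: " + sv.length);
    }
}

The small S matrix is not a bug: the rank is bounded by min(rows, columns), so 5 documents can never yield more than 5 singular values. Splitting the books into smaller segments, as suggested above, is what gives you more columns to work with.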

Related

Java: Reduce array to specific number of averages

The main issue which needs to be solved is:
Let's say I have an array with 8 numbers, e.g. [2,4,8,3,5,4,9,2], and I use them as values for my x axis in a coordinate system to draw a line. But I can only display 3 of these points.
What I need to do now is reduce the number of points (8) to 3 without distorting the line too much, so using an average should be an option.
I am NOT looking for the average of the array as a whole; I still need 3 points out of the total of 8.
For an array like [2,4,2,4,2,4,2,4] and 4 numbers out of that array, I could simply use the average "3" of each pair, but that's not possible when the counts don't divide evenly.
But how would I do that? Do you know what this process is called mathematically?
To give you some more realistic details about this issue: I have an x axis which is 720px long, and let's say I get 1000 points. Now I have to reduce these 1000 points (2 arrays, one for x and one for y values) to a maximum of 720 points.
I thought about interpolation and the like, but I'm still not quite sure that this is what I am looking for.
Interpolation is a good idea. You input your points and get a polynomial function as output. Then you can use it to draw your line. Read more here: Interpolation over an array (or two)
I would recommend that you fit all the points you have in some fashion and then evaluate at the particular points you need for the display.
There are a myriad of choices for fitting:
Least squares
Piecewise using polynomials or splines
You should consult a text or find a library to help you - something like Apache Commons Math (see the sketch below).
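A minimal least-squares sketch with Apache Commons Math 3, assuming commons-math3 is on the classpath (the degree and sample values here are arbitrary choices of mine):

import java.util.List;
import org.apache.commons.math3.analysis.polynomials.PolynomialFunction;
import org.apache.commons.math3.fitting.PolynomialCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoint;
import org.apache.commons.math3.fitting.WeightedObservedPoints;

public class FitSketch {
    public static void main(String[] args) {
        double[] y = {2, 4, 8, 3, 5, 4, 9, 2}; // the sample data from the question

        // Collect (x, y) observations with equal weights.
        WeightedObservedPoints obs = new WeightedObservedPoints();
        for (int x = 0; x < y.length; x++) {
            obs.add(x, y[x]);
        }

        // Least-squares fit of a cubic polynomial (degree 3 is arbitrary).
        List<WeightedObservedPoint> points = obs.toList();
        double[] coeffs = PolynomialCurveFitter.create(3).fit(points);
        PolynomialFunction poly = new PolynomialFunction(coeffs);

        // Evaluate the fit at the 3 display positions.
        for (double x : new double[] {0, 3.5, 7}) {
            System.out.printf("f(%.1f) = %.3f%n", x, poly.value(x));
        }
    }
}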
It sounds like you are looking for a more advanced mathematical function than a simple average.
I would suggest trying to identify potential algorithms via Mathematica Stack Exchange and then trying to find a Java library that implements any of the potential choices (maybe a new question here).
Since it's for an x axis, why not use MIN, MAX and (MIN+MAX)/2 for your three points?
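If a fit is overkill, here is a minimal sketch of plain bucket averaging (the helper is my own, not from any library): split the source points into roughly equal buckets, one per display point, and average each bucket.

// Downsample src to 'target' points by averaging roughly equal buckets.
// Assumes src.length >= target.
static double[] downsample(double[] src, int target) {
    double[] out = new double[target];
    for (int b = 0; b < target; b++) {
        int start = (int) ((long) b * src.length / target);
        int end = (int) ((long) (b + 1) * src.length / target);
        double sum = 0;
        for (int i = start; i < end; i++) {
            sum += src[i];
        }
        out[b] = sum / (end - start);
    }
    return out;
}

For [2,4,8,3,5,4,9,2] and target 3 this yields the bucket averages of (2,4), (8,3,5) and (4,9,2), so uneven splits like 1000 points onto 720 pixels need no special casing.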

Graph representation with minimum cost (time and space)

I have to represent a graph in Java, but neither as an adjacency list nor as an adjacency matrix.
The basic idea is that if deg[i] is the exit degree of vertex i, then its neighbors can be stored in edges[i][j] where 1 <= j <= deg[i]. But given that edges[][] must be initialized with some values, I don't know how to make it differ from an adjacency matrix.
Any suggestions?
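For reference, a minimal sketch of the structure the question describes, built as a ragged array sized per vertex (the helper is my own; unlike an adjacency matrix, no n*n block has to be initialized):

// Row i holds exactly deg[i] neighbor ids, so total space is O(V + E).
static int[][] buildEdges(int n, int[] deg) {
    int[][] edges = new int[n][];
    for (int i = 0; i < n; i++) {
        edges[i] = new int[deg[i]]; // fill in neighbor ids afterwards
    }
    return edges;
}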
To my knowledge there are only two matrix ways to represent a graph:
Either use an adjacency matrix
Or use an incidence matrix
In an incidence matrix the rows are vertices and the columns are edges; an entry is 1 if the vertex is an endpoint of that edge (2 for a self-loop) and 0 otherwise. For example, for the 4-cycle V1-V2-V3-V4:
    E1 E2 E3 E4
V1   1  0  0  1
V2   1  1  0  0
V3   0  1  1  0
V4   0  0  1  1
You are fighting against lower bounds on this question. The two main representations of a graph are already very good for their respective uses.
An adjacency list minimizes space: you will be hard pressed to use less memory than one pointer per edge. Space: O(V+E). Search: O(V).
An adjacency matrix is very fast, at the cost of quadratic space. Space: O(V^2). Search: O(1).
So, to make something that is better in both space and time, you will have to combine the ideas of both. Also realize that you will only get better practical performance; theoretically you will not improve on O(1) search or O(V+E) size.
My idea would be to store all the graph nodes in one array. Then for each node have an adjacency list represented as a bit vector. This would essentially be a matrix-like representation, but only for those nodes that exist in the graph, giving you a smaller size than a full matrix. Search would be slightly improved over an adjacency list, as the queried node can be tested directly against the bit vector.
Also check out sparse matrix representations.
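A minimal sketch of that bit-vector idea using java.util.BitSet (the class and naming here are my own):

import java.util.BitSet;

// Each node stores its neighbors as a BitSet indexed by node id.
// Assumes node ids are dense ints 0..n-1.
class BitSetGraph {
    private final BitSet[] adj;

    BitSetGraph(int n) {
        adj = new BitSet[n];
        for (int i = 0; i < n; i++) {
            adj[i] = new BitSet(n);
        }
    }

    void addEdge(int u, int v) {
        adj[u].set(v);
        adj[v].set(u); // undirected; drop this line for a digraph
    }

    boolean hasEdge(int u, int v) {
        return adj[u].get(v); // O(1) query, roughly n/8 bytes per node
    }
}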

Plot Experimental Data Distribution

I'm working on a clustering task and I've built the dataset similarity matrix M by repeating the clustering algorithm N times and choosing as element m_{ij} the number of times the elements i and j have been in the same cluster, divided by N.
Now I'd like a graphical way to check my results, so I was wondering if there is any library that, given an array of doubles (i.e. the values in the upper triangular part of my matrix), plots the data distribution and the histogram? All the doubles are in the [0,1] interval, and most of them will be around 0 and 1.
For plotting, the best tool that I know is Matlab. If you have the chance, I would definitely suggest Matlab.
Write your results into a txt file and just import it into Matlab; then you can play with the numbers and do the plotting with almost no effort.
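If you would rather stay in Java, here is a minimal sketch of the binning step (the helper is my own; you would still need a plotting library such as JFreeChart to actually draw the result):

// Count values from [0,1] into equal-width bins.
static int[] histogram(double[] values, int bins) {
    int[] counts = new int[bins];
    for (double v : values) {
        int bin = (int) (v * bins);
        if (bin == bins) bin = bins - 1; // v == 1.0 falls into the last bin
        counts[bin]++;
    }
    return counts;
}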

How should I implement a Mahalanobis distance function in Java?

I am working on a project in Java and have two 2-D int arrays, both 10x15. I want to compute the Mahalanobis distance between them. They are grouped in categories along the x axis of the array (size 10). I understand that you must find the mean value in these groups and redistribute the data so that it is centered. My problem now is generating the covariance matrix necessary for the calculation. If anyone knows a good way to do this, or can point to a useful guide that steps me through the process in 3D, it would be a great help. Thanks.
A covariance matrix contains the expected relationship between any two variables. Given a statistical distribution on a vector x, with statistical mean avg:
covariance(i,j) = expected value of [ (x[i] - avg[i])(x[j] - avg[j]) ]
Given a statistical set of N vectors v_1 ... v_N, with mean vector avg, you can estimate the covariance of the distribution they were taken from as follows:
sample_covariance(i,j) = sum[for k=1..N]( (v_k[i] - avg[i])*(v_k[j] - avg[j]) ) / (N-1)
This last is the covariance matrix you're looking for. I recommend you also read the Wikipedia article on covariance matrices.
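A minimal sketch of that estimate in plain Java (the helper name is my own; samples holds the N vectors v_1 ... v_N, each of dimension d):

// Unbiased sample covariance of N vectors of dimension d.
static double[][] sampleCovariance(double[][] samples) {
    int n = samples.length;
    int d = samples[0].length;

    // Mean vector avg.
    double[] avg = new double[d];
    for (double[] v : samples)
        for (int i = 0; i < d; i++)
            avg[i] += v[i] / n;

    // sum over k of (v_k[i] - avg[i]) * (v_k[j] - avg[j]), divided by n - 1.
    double[][] cov = new double[d][d];
    for (double[] v : samples)
        for (int i = 0; i < d; i++)
            for (int j = 0; j < d; j++)
                cov[i][j] += (v[i] - avg[i]) * (v[j] - avg[j]) / (n - 1);
    return cov;
}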

Best data structure to store and manipulate my data?

I am writing a simple Java program that will read a text file containing numbers that represent an (n x n) matrix, where the numbers are separated by spaces. For example:
1 2 3 4
5 6 7 8
9 1 2 3
4 5 6 7
I then want to store these numbers in a data structure that I will use to manipulate the data (which will include comparing adjacent numbers and also deleting certain numbers based on specific rules).
If a number is deleted, all the other numbers above it fall down that many spaces.
For the example above, if, say, I delete 8 and 9, then the result would be:
() 2 3 ()
1 6 7 4
5 1 2 3
4 5 6 7
so the numbers fall down in their columns.
And lastly, the matrix given will always be square (always n x n, where n will always be given and will always be positive), so the data structure has to be flexible enough to accept virtually any value of n.
I was originally implementing it in a 2-D array, but I was wondering if someone had an idea of a better data structure I could use to improve efficiency (something that would let me more quickly access all the adjacent numbers in the matrix, in rows and columns).
Ultimately, my program will automatically check adjacent numbers against the rules, delete numbers, re-format the matrix, and keep going; in the end I want to be able to create an AI that will remove as many numbers from the matrix as possible in as few moves as possible, for any n x n matrix.
In my opinion, if you know the length of your array when you start, you are better off using an array. A simple data type will be easier to navigate (direct access). Then again, using LinkedLists, you will be able to remove a middle value without having to re-arrange the data inside your matrix. This will leave your "top" value as null. In your example:
null 2 3 null
1 6 7 4
5 1 2 3
4 5 6 7
Hope this helps.
You could use a one-dimensional array of size n*n.
int[] myMatrix = new int[n * n];
To access the element with coordinates (i, j), use myMatrix[i + j * n]. To make elements fall, shift the entries of the affected column down by one (System.arraycopy can move whole rows, but in this layout a column is not contiguous and has to be shifted element by element).
Use a special value (e.g. Integer.MIN_VALUE) as a marker for the () holes.
I expect this would be the fastest and most memory-efficient solution.
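A minimal sketch of the delete-and-fall step under that layout (the helper names are my own):

static final int HOLE = Integer.MIN_VALUE;

// Delete the value at column i, row j, and let everything above it
// fall one row; the hole surfaces at the top of the column (row 0).
static void deleteAndFall(int[] m, int n, int i, int j) {
    for (int row = j; row > 0; row--) {
        m[i + row * n] = m[i + (row - 1) * n];
    }
    m[i] = HOLE;
}

For the example above, deleting 8 at (i=3, j=1) moves the 4 from row 0 down into row 1 and leaves a hole at the top of the column, matching the expected output.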
Array access is pretty fast. Accessing adjacent elements is easy, as you just increment the relevant index(es) (being cognizant of boundaries). You could write well-tested methods to encapsulate those operations. Having elements 'fall down', though, might get complicated, but it shouldn't be too bad if you modularize it with well-tested methods.
All that said, if you don't need the absolute best speed, there are other options.
You also might want to consider a modified circularly linked list. When implementing a sudoku solver, I used the structure outlined here. Looking at the image, you will see that this will allow you to modify your 2d array as you want, since all you need to do is move pointers around.
I'll post a screen shot of relevant picture describing the datastructure here, although I would appreciate it if someone will warn me if I am violating some sort of copy right or other rights of the author, in which case I'll take it down...
Try an array of LinkedLists.
If you want the numbers to auto-fall, I suggest you use a list for the columns.
