I am trying to develop a feature selection algorithm in Java, using the Weka libraries. Is there any way to calculate a p-value with Weka,
or is there another Java machine learning library that can calculate a p-value?
I was able to calculate the chi-square value using Weka. Is there any way to calculate the p-value from this chi-square value?
You can use SignificanceAttributeEval instead of chi-square. You can install it from the package manager (in the Weka GUI Chooser, go to Tools > Package manager).
This algorithm works only with a nominal class, but if you have a numeric class you can discretize it into bins. It evaluates the worth of an attribute in relation to the class by computing its probabilistic significance as a two-way function. Like the p-value, it measures the probability that two features occur together not by chance, but unlike the p-value, the best features here have a higher score. The score is similar to a correlation but takes only non-negative values; non-significant features have a score of 0.
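For the narrower question of turning a chi-square statistic into a p-value, here is a minimal sketch in plain Java that computes the upper-tail probability of the chi-square distribution, so it does not depend on any Weka API. The class name ChiSquarePValue and its methods are made up for illustration, and the degrees of freedom are assumed to be known (for a contingency table, (rows - 1) * (columns - 1)). In practice a library routine such as ChiSquaredDistribution in Apache Commons Math is probably the safer choice; Weka's weka.core.Statistics class also appears to provide chiSquaredProbability, but check the javadoc of your Weka version.

```java
final class ChiSquarePValue {
    private static final double EPS = 1e-14;
    private static final int MAX_ITER = 1000;

    /** Upper-tail probability P(X >= chiSquare) for X ~ chi-square(df). */
    static double pValue(double chiSquare, double df) {
        if (df <= 0) throw new IllegalArgumentException("df must be positive");
        if (chiSquare <= 0) return 1.0;
        return upperGamma(df / 2.0, chiSquare / 2.0);
    }

    /** Regularized upper incomplete gamma function Q(a, x) for a > 0, x > 0. */
    static double upperGamma(double a, double x) {
        double logPrefix = -x + a * Math.log(x) - logGamma(a);
        if (x < a + 1.0) {
            // Series for the lower function P(a, x); converges quickly here.
            double ap = a, term = 1.0 / a, sum = term;
            for (int n = 0; n < MAX_ITER; n++) {
                ap += 1.0;
                term *= x / ap;
                sum += term;
                if (Math.abs(term) < Math.abs(sum) * EPS) break;
            }
            return 1.0 - sum * Math.exp(logPrefix);
        }
        // Continued fraction for Q(a, x), evaluated with the modified Lentz method.
        double tiny = 1e-300;
        double b = x + 1.0 - a, c = 1.0 / tiny, d = 1.0 / b, h = d;
        for (int i = 1; i <= MAX_ITER; i++) {
            double an = -i * (i - a);
            b += 2.0;
            d = an * d + b;
            if (Math.abs(d) < tiny) d = tiny;
            c = b + an / c;
            if (Math.abs(c) < tiny) c = tiny;
            d = 1.0 / d;
            double delta = d * c;
            h *= delta;
            if (Math.abs(delta - 1.0) < EPS) break;
        }
        return h * Math.exp(logPrefix);
    }

    /** Lanczos approximation of ln(Gamma(x)) for x > 0. */
    static double logGamma(double x) {
        double[] cof = {76.18009172947146, -86.50532032941677, 24.01409824083091,
                        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5};
        double y = x;
        double tmp = x + 5.5;
        tmp -= (x + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (double c : cof) ser += c / ++y;
        return -tmp + Math.log(2.5066282746310005 * ser / x);
    }
}
```

Usage would look like ChiSquarePValue.pValue(chiSquareValue, degreesOfFreedom); a small result means the attribute and the class are unlikely to be independent.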
Related
I have a large dataset (>500.000 elements) that contains the stress values (σ_xx, σ_yy, σ_zz, τ_xy, τ_yz, τ_xz) of FEM-Elements. These stress values are given in the global xyz-coordinate space of the model. I want to calculate the main axis stress values and directions from those. If you're not that familiar with the physics behind it, this means taking the symmetric matrix
| σ_xx τ_xy τ_xz |
| τ_xy σ_yy τ_yz |
| τ_xz τ_yz σ_zz |
and calculating its eigenvalues and eigenvectors. Calculating each set of eigenvalues and -vectors on its own is too slow. I'm looking for a library, an algorithm or something in Java that would allow me to do this as array calculations. As an example, in python/numpy I could just take all my 3x3-matrices, stack them along a third dimension to get a nx3x3-array, and pass that to np.linalg.eig(arr), and it automatically gives me an nx3-array for the three eigenvalues and an nx3x3-array for the three eigenvectors.
Things I tried:
nd4j has an Eigen-module for calculating eigenvalues and -vectors, but only supports a single square array at a time.
Calculate the characteristic polynomial and use Cardano's formula to get the roots/eigenvalues. This can be done for the whole array at once, but I'm now stuck on how to get the corresponding eigenvectors. Is there a simple general algorithm to get from the eigenvalues to the eigenvectors?
Looking for an analytical form of the eigenvalues and -vectors that can be calculated directly: It does exist, but just no.
You'll need to write a little code.
I'd create or use a Matrix class as a dependency and find methods to give you eigenvalues and eigenvectors. The ones you found in nd4j sound like great candidates. You might also consider the Linear Algebra For Java (LA4J) dependency.
Load the dataset into a List<Matrix>.
Use functional Java methods to apply a map to give you a List of eigenvalues as a vector per stress matrix and a List of eigenvectors as a matrix per stress matrix.
You can speed up this calculation by applying the map function to a parallel stream (parallelStream(), or stream().parallel()). Java then splits the work across the available cores under the covers; a plain sequential stream is not parallelized automatically.
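A minimal sketch of the map-over-a-stream idea from the steps above, assuming the matrices are held as plain double[][] values. The per-matrix function here is only a stand-in (it returns the trace, which equals the sum of the eigenvalues); you would swap in whatever eigen-decomposition call your matrix library provides. The class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ParallelMapSketch {
    /** Stand-in for the real per-matrix work: trace of a 3x3 matrix. */
    static double trace(double[][] m) {
        return m[0][0] + m[1][1] + m[2][2];
    }

    /** Applies the per-matrix function on a parallel stream; result order matches input order. */
    static List<Double> mapAll(List<double[][]> matrices) {
        return matrices.parallelStream()
                       .map(ParallelMapSketch::trace)
                       .collect(Collectors.toList());
    }
}
```

Whether this beats a plain loop depends on how expensive the per-matrix function is; for very cheap functions the overhead of splitting the work can dominate.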
Follow-up: This is the way that worked best for me, as I can do all operations without iterating over every element. As stated above, I'm using Nd4j, which seems to be limited in its possibilities compared to numpy (or maybe I just didn't read the documentation thoroughly enough). The following method uses only basic array operations:
From the given stress values, calculate the eigenvalues using Cardano's formula. Only element-wise operations are needed for that (add, sub, mul, div, pow). The result should be three vectors of size n, each containing one eigenvalue for all elements.
Use the formula given here to calculate the matrix S for each eigenvalue. Like step 1, this can be done using only element-wise operations on the stress-value and eigenvalue vectors, which avoids specifying complicated instructions about which array to multiply along which axis while keeping the other axes.
Take one column from S and normalize it to get a normalized eigenvector for the given eigenvalue.
Note that this method only works if you have a real symmetric matrix. You also should make sure to properly deal with cases where the same eigenvalue appears multiple times.
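The three steps above can be sketched for a single element as follows. This is scalar code for one stress tensor; the same arithmetic is what would be applied element-wise across the n-sized vectors. Two assumptions: the eigenvalues are computed with the trigonometric form of the cubic solution (equivalent to Cardano's formula when all three roots are real, which holds for a real symmetric matrix), and the linked formula for S, which is not reproduced in the post, is taken to be the Cayley-Hamilton product S = (A - λb I)(A - λc I), whose non-zero columns are eigenvectors for the remaining eigenvalue. The class name PrincipalStress is illustrative, and the sketch picks the largest column of S rather than an arbitrary one.

```java
final class PrincipalStress {
    /** Builds the symmetric 3x3 stress tensor from its six components. */
    static double[][] tensor(double sxx, double syy, double szz,
                             double txy, double tyz, double txz) {
        return new double[][] {{sxx, txy, txz}, {txy, syy, tyz}, {txz, tyz, szz}};
    }

    /** Eigenvalues in descending order (trigonometric solution of the cubic). */
    static double[] eigenvalues(double sxx, double syy, double szz,
                                double txy, double tyz, double txz) {
        double q = (sxx + syy + szz) / 3.0;                      // mean stress
        double p1 = txy * txy + tyz * tyz + txz * txz;
        double p2 = (sxx - q) * (sxx - q) + (syy - q) * (syy - q)
                  + (szz - q) * (szz - q) + 2.0 * p1;
        if (p2 == 0.0) return new double[] {q, q, q};            // tensor equals q * I
        double p = Math.sqrt(p2 / 6.0);
        // Entries of B = (A - q I) / p, then r = det(B) / 2.
        double a = (sxx - q) / p, b = (syy - q) / p, c = (szz - q) / p;
        double d = txy / p, e = tyz / p, f = txz / p;
        double detB = a * (b * c - e * e) - d * (d * c - e * f) + f * (d * e - b * f);
        double r = Math.max(-1.0, Math.min(1.0, detB / 2.0));    // guard against rounding
        double phi = Math.acos(r) / 3.0;
        double l1 = q + 2.0 * p * Math.cos(phi);
        double l3 = q + 2.0 * p * Math.cos(phi + 2.0 * Math.PI / 3.0);
        double l2 = 3.0 * q - l1 - l3;                           // trace is preserved
        return new double[] {l1, l2, l3};
    }

    /** Unit eigenvector for the eigenvalue that is NOT passed in (must be distinct). */
    static double[] eigenvector(double[][] a, double otherA, double otherB) {
        double[][] s = new double[3][3];
        double frob2 = 0.0;
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                frob2 += a[i][j] * a[i][j];
                for (int k = 0; k < 3; k++) {
                    double left = a[i][k] - (i == k ? otherA : 0.0);
                    double right = a[k][j] - (k == j ? otherB : 0.0);
                    s[i][j] += left * right;                     // S = (A - otherA I)(A - otherB I)
                }
            }
        }
        int best = 0;
        double bestSq = -1.0;
        for (int j = 0; j < 3; j++) {                            // pick the largest column of S
            double sq = s[0][j] * s[0][j] + s[1][j] * s[1][j] + s[2][j] * s[2][j];
            if (sq > bestSq) { bestSq = sq; best = j; }
        }
        if (bestSq <= 1e-24 * frob2 * frob2) {
            throw new IllegalArgumentException("repeated eigenvalue: not handled in this sketch");
        }
        double norm = Math.sqrt(bestSq);
        return new double[] {s[0][best] / norm, s[1][best] / norm, s[2][best] / norm};
    }
}
```

For the eigenvector belonging to the largest eigenvalue you would call eigenvector(a, ev[1], ev[2]), and so on for the other two. The exception marks the repeated-eigenvalue case mentioned in the note, where S collapses to zero and a separate treatment is needed.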
I need to enumerate all bases corresponding to all extreme points of an LP with the CPLEX API in Java. Unfortunately, I did not find any way to do this with CPLEX. Is there a solution?
If not, I will do this myself, but I will need to work with bases directly. Is there any simple way with CPLEX to enumerate all bases and check whether a basis is feasible?
The short answer: no.
There is no easy way to do this. One possible approach, but somewhat cumbersome, is to encode the basis using binary variables. E.g.:
xb[i] = 1 for basic variables
0 for non-basic variables
We need to add constraints on non-basic variables: they will be at bound. I.e. for a non-negative variable x[i] we have
xb[i]=0 => x[i]=0
(this is an indicator constraint). Furthermore we know that
sum(i,xb[i]) = m
(the number of basic variables is equal to the number of rows in the model).
Then use Cplex's solution pool to enumerate all possible feasible bases. An illustration for this approach is shown in this link. (This particular example enumerates all optimal bases, but it is not difficult to tell Cplex to enumerate all feasible bases).
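If you do end up doing it yourself, as the question suggests, a brute-force version is easy to sketch for small problems. This is not the CPLEX solution-pool approach described above and uses no CPLEX API: it assumes the LP is already in standard form A x = b, x >= 0 with m rows and n >= m columns (slack variables included), tries every choice of m columns, solves the resulting square system, and keeps the bases whose solution is non-negative. The class name and tolerance are illustrative. The number of candidates is C(n, m), so this is only practical for small models.

```java
import java.util.ArrayList;
import java.util.List;

final class BasisEnumerator {
    private static final double EPS = 1e-9;

    /** Returns the column-index sets of all feasible bases of A x = b, x >= 0. */
    static List<int[]> feasibleBases(double[][] a, double[] b) {
        int m = a.length, n = a[0].length;
        List<int[]> result = new ArrayList<>();
        int[] idx = new int[m];
        for (int i = 0; i < m; i++) idx[i] = i;          // first combination: 0..m-1
        while (true) {
            double[] xb = solve(a, b, idx);
            if (xb != null && nonNegative(xb)) result.add(idx.clone());
            // Advance to the next m-subset of columns in lexicographic order.
            int i = m - 1;
            while (i >= 0 && idx[i] == n - m + i) i--;
            if (i < 0) break;
            idx[i]++;
            for (int j = i + 1; j < m; j++) idx[j] = idx[j - 1] + 1;
        }
        return result;
    }

    private static boolean nonNegative(double[] x) {
        for (double v : x) if (v < -EPS) return false;
        return true;
    }

    /** Solves B x = b for the chosen columns by Gauss-Jordan; null if B is singular. */
    static double[] solve(double[][] a, double[] b, int[] cols) {
        int m = cols.length;
        double[][] aug = new double[m][m + 1];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < m; c++) aug[r][c] = a[r][cols[c]];
            aug[r][m] = b[r];
        }
        for (int col = 0; col < m; col++) {
            int piv = col;                               // partial pivoting
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(aug[r][col]) > Math.abs(aug[piv][col])) piv = r;
            }
            if (Math.abs(aug[piv][col]) < EPS) return null;
            double[] tmp = aug[piv]; aug[piv] = aug[col]; aug[col] = tmp;
            for (int r = 0; r < m; r++) {
                if (r == col) continue;
                double f = aug[r][col] / aug[col][col];
                for (int c = col; c <= m; c++) aug[r][c] -= f * aug[col][c];
            }
        }
        double[] x = new double[m];
        for (int r = 0; r < m; r++) x[r] = aug[r][m] / aug[r][r];
        return x;
    }
}
```

Note that at a degenerate vertex several bases describe the same extreme point, so the number of feasible bases can exceed the number of extreme points.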
I have a 2x2 matrix and am running prop.test in R, then taking the p-value from the result.
Now I want to do the same thing in Java, i.e., compute the p-value.
One way is to use https://commons.apache.org/proper/commons-math/javadocs/api-3.1/org/apache/commons/math3/stat/inference/ChiSquareTest.html and compute the p-value.
However, ChiSquareTest does not apply Yates' continuity correction, which can lead to differences between R's result and the ChiSquareTest result.
I would like to know of any other way to get the p-value with Yates' continuity correction.
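One option is to compute the corrected statistic directly, since for a 2x2 table it has a closed form. The sketch below is plain Java with illustrative names (YatesChiSquare is not a Commons Math class). It uses the textbook Yates formula N (|ad - bc| - N/2)^2 / (row and column totals multiplied together), clamps the corrected difference at zero on the assumption that this matches R never correcting past zero, and converts the statistic to a p-value for 1 degree of freedom via erfc(sqrt(x / 2)). The erfc here is a standard Chebyshev-style approximation with a relative error around 1e-7, so the last digits may differ from R; verify against prop.test on your own tables.

```java
final class YatesChiSquare {
    /** Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]. */
    static double statistic(double a, double b, double c, double d) {
        double n = a + b + c + d;
        // Continuity correction, clamped so it never pushes the difference below zero.
        double corrected = Math.max(0.0, Math.abs(a * d - b * c) - n / 2.0);
        // A zero row or column total makes the denominator zero (result is NaN).
        return n * corrected * corrected / ((a + b) * (c + d) * (a + c) * (b + d));
    }

    /** Upper-tail p-value for a chi-square statistic with 1 degree of freedom. */
    static double pValue(double chiSquare) {
        return erfc(Math.sqrt(chiSquare / 2.0));
    }

    /** Complementary error function, approximation with relative error near 1e-7. */
    static double erfc(double x) {
        double z = Math.abs(x);
        double t = 1.0 / (1.0 + 0.5 * z);
        double poly = -1.26551223 + t * (1.00002368 + t * (0.37409196 + t * (0.09678418
                    + t * (-0.18628806 + t * (0.27886807 + t * (-1.13520398
                    + t * (1.48851587 + t * (-0.82215223 + t * 0.17087277))))))));
        double ans = t * Math.exp(-z * z + poly);
        return x >= 0.0 ? ans : 2.0 - ans;
    }
}
```

Usage: double p = YatesChiSquare.pValue(YatesChiSquare.statistic(a, b, c, d)); where a, b, c, d are the four cell counts.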
Given an arbitrary string s, I would like a method to quickly retrieve the subset S ⊆ M of a large set of strings M (where |M| > 1 million), such that every string in S has edit distance < t (some maximum threshold) from s.
At worst, S may be empty if no strings in M match this criterion, and at best, S = {s} (an exact match). For any case in between, I fully expect that S may be quite large.
In general, I expect to have the maximum edit distance threshold fixed (e.g., 2), and need to perform this operation very many times over arbitrary strings s, thus the need for an efficient method, as naively iterating and testing all strings would be too expensive.
While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.
Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?
UPDATE #1
I have since learned that metric trees are precisely the kind of structure I am after: they exploit the distance metric to organise subsets of the strings in M based on their distance from each other. Vantage-point trees, BK-trees and other similar metric tree data structures and algorithms seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...
UPDATE #2
Using a combination of this bk-tree and this Levenshtein distance implementation, I'm successfully able to retrieve subsets against arbitrary strings from a set (M) of one million strings with retrieval times of around 10ms.
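BK-trees are designed for exactly this case. They work with a discrete metric such as Levenshtein distance; for a real-valued metric such as Jaccard distance (1 minus the Jaccard index), a vantage-point tree is the closer fit.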
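To make the idea concrete, here is a minimal BK-tree sketch in Java with Levenshtein distance as the metric. It is not the implementation linked in the question; the class name BkTree and its methods are illustrative. Each node stores its children keyed by their distance to the node's word, and the triangle inequality lets a search skip every child whose key lies outside [d - maxDist, d + maxDist], where d is the distance from the query to the current node. The search returns words with distance <= maxDist, so pass t - 1 for a strict threshold t.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BkTree {
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    /** Inserts a word; duplicates (distance 0) are ignored. */
    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node cur = root;
        while (true) {
            int d = levenshtein(word, cur.word);
            if (d == 0) return;
            Node child = cur.children.get(d);
            if (child == null) { cur.children.put(d, new Node(word)); return; }
            cur = child;
        }
    }

    /** Returns all stored words within maxDist (inclusive) of the query. */
    List<String> search(String query, int maxDist) {
        List<String> out = new ArrayList<>();
        if (root == null) return out;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            int d = levenshtein(query, n.word);
            if (d <= maxDist) out.add(n.word);
            // Only subtrees whose edge distance is within maxDist of d can hold matches.
            for (int k = Math.max(1, d - maxDist); k <= d + maxDist; k++) {
                Node c = n.children.get(k);
                if (c != null) stack.push(c);
            }
        }
        return out;
    }

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}
```

With a small fixed threshold such as 2, most subtrees are pruned, which is why lookups stay fast even for large dictionaries.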
Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:
Damn Cool Algorithms: Levenshtein Automata
As already mentioned by H W, you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up calculating the distance. Combine this with an efficient data structure for your dictionary (e.g. a trie, as mentioned in the Wikipedia article), and you might be able to accelerate your current approach.
I want Lucene Scoring function to have no bias based on the length of the document. This is really a follow up question to Calculate the score only based on the documents have more occurance of term in lucene
I was wondering how Field.setOmitNorms(true) works? I see that there are two factors that make short documents get a high score:
"boost" for shorter-length posts, obtained via doc.getBoost()
"lengthNorm" in the definition of norm(t,d)
Here is the documentation
I was wondering - if I wanted no bias towards shorter documents, is Field.setOmitNorms(true) enough?
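With BM25Similarity you can set either of these parameters to 0f:
@param b Controls to what degree document length normalizes tf values
or
@param k1 Controls non-linear term frequency normalization (saturation).
Both parameters affect SimWeight. Note that setting k1 to 0f also removes the influence of term frequency altogether, whereas setting b to 0f only turns off document-length normalization: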
indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
More explanation can be found here : http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
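To see why b = 0f removes the length bias, here is a small sketch of the textbook BM25 term-frequency component in plain Java. It is not Lucene's code (Lucene encodes document lengths approximately and the exact formula has varied between versions); the class and parameter names are illustrative.

```java
final class Bm25Sketch {
    /**
     * Textbook BM25 term-frequency factor:
     * freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLen / avgDocLen)).
     */
    static double tfNorm(double freq, double docLen, double avgDocLen, double k1, double b) {
        // With b = 0 this factor is exactly 1, so document length drops out.
        double lengthFactor = 1.0 - b + b * docLen / avgDocLen;
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }
}
```

With b = 0 a 10-term document and a 1000-term document get the same factor for the same term frequency, while with the common default b = 0.75 the shorter document scores higher. With k1 = 0 the factor is 1 regardless of frequency or length.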
Shorter docs are meant to be more relevant when you use TF-IDF scoring.
You can use your own custom scoring function in Lucene. It's easy to customize the scoring algorithm: subclass DefaultSimilarity and override the method you want to customize.
There's a code sample here that will help you implement it