Is there a high-performance graph library for working with primitives, without the generics/autoboxing overhead? For lists of doubles you can use Trove; for linear algebra you can use netlib-java (examples to make clear what I'm after).
As for graphs/networks: all the libraries I've found use generics, so I doubt they perform well. I may run some tests, but I believe that heap-managed link weights would be inferior to a double[] with some bit offsets to compute the index from i and j. The usage scenario: there are hundreds of such networks (most of them sparse) of size 4k*4k, and a genetic optimization runs over that set of networks, doing flow/min-route estimations for each specimen.
So, there are: JGraphT, JUNG, ANNAS, JDSL (the links lead to the APIs/code samples which expose the miserable Java Generics/Object wrappers in all of them). Are there any Trove-ish alternatives? I've already written a simplistic implementation, but decided to look around first to avoid reinventing the wheel...
Any opinions, suggestions?
Thanks,
Anton
PS: Please don't start on performance of generics-laden Java code, at least without linking to some decent benchmark, ok? ;)
You may use some sparse matrix with row compression. Not best and not specialized, but you may build upon it.
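To make the row-compression idea concrete, here is a minimal sketch of a directed weighted graph stored in compressed sparse row (CSR) form using only primitive arrays. The class name CsrGraph and its methods are hypothetical (they are not taken from any of the libraries mentioned), and returning positive infinity for a missing edge is an assumption made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: a directed weighted graph in compressed sparse row (CSR) form. */
final class CsrGraph {
    final int[] rowStart;   // out-edges of node u occupy slots rowStart[u] .. rowStart[u + 1] - 1
    final int[] target;     // target node of each edge
    final double[] weight;  // weight of each edge, parallel to target

    /** Builds the graph from parallel edge arrays: from[k] -> to[k] with weight w[k]. */
    CsrGraph(int nodeCount, int[] from, int[] to, double[] w) {
        int m = from.length;
        rowStart = new int[nodeCount + 1];
        for (int k = 0; k < m; k++) {
            rowStart[from[k] + 1]++;              // count the out-degree of each node
        }
        for (int u = 0; u < nodeCount; u++) {
            rowStart[u + 1] += rowStart[u];       // prefix sums give the row offsets
        }
        target = new int[m];
        weight = new double[m];
        int[] next = Arrays.copyOf(rowStart, nodeCount); // next free slot in each row
        for (int k = 0; k < m; k++) {
            int slot = next[from[k]]++;
            target[slot] = to[k];
            weight[slot] = w[k];
        }
    }

    int outDegree(int u) {
        return rowStart[u + 1] - rowStart[u];
    }

    /** Linear scan of row u; returns positive infinity when there is no edge u -> v (an assumption). */
    double edgeWeight(int u, int v) {
        for (int e = rowStart[u]; e < rowStart[u + 1]; e++) {
            if (target[e] == v) {
                return weight[e];
            }
        }
        return Double.POSITIVE_INFINITY;
    }
}
```

Iterating over the neighbours of a node touches one contiguous slice of target and weight, with no boxing and no per-edge objects, which is the property the question is after. The layout is read-mostly: adding or removing an edge means rebuilding the arrays.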
Well, there are some general-purpose sparse matrix implementations that avoid generics, and one rather solid performance benchmark:
java-matrix-benchmark on google code
ujmp related overview
The most convincing is MTJ's sparse matrix.
Please add answers to the question if you have any suggestions or updates. I'll accept any better ideas. Thanks.
If you need performant data structures, you should check the fastutil project, an implementation of the Java Collections Framework that is efficient in both time and memory. Part of that performance comes from avoiding the boxing and unboxing of primitive types.
fastutil's data structures are very efficient. If you need a graph ADT implementation, you could check this, which is an efficient in-memory graph implementation based on fastutil.
The project was part of my MS thesis, which was about community detection in big graphs.
Hope it helps!
There are many languages to choose from for building a software product, and each has its own pros and cons. While I was thinking about how to choose one, a friend suggested looking for the language with the lowest cost, as judged by the lines-of-code estimation method. How many lines of Java code and how many lines of Python code does each function point (FP) take?
You should assume a probability model. This might be helpful: https://www.cs.uoregon.edu/Classes/13W/cis472/slides/estimation-2pp.pdf
In terms of implementation, you can use scikit-learn and scikit-stats libraries in Python, where you are able to implement most statistical methods in a few lines of code.
I found the answer to my question in this pdf. http://namcookanalytics.com/wp-content/uploads/2013/07/Function-Points-as-a-Universal-Software-Metric2013.pdf
I'm looking for an in-memory map with java-friendly APIs (not necessarily java) that supports range queries. Our design doesn't yet call for it to be distributed.
Any suggestions? Thanks!
Use a TreeMap. A range query can be done using the methods lowerEntry and higherEntry, higherKey and lowerKey. Find the first key smaller than the left end of the range, the first key bigger than the right one and return everything between them.
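As a small illustration, TreeMap also implements NavigableMap, whose subMap method returns the entries between two bounds directly, so the boundary keys do not have to be located by hand. The class and method names below (RangeQueryDemo, range) are made up for the example.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

final class RangeQueryDemo {
    /** Returns a live view of all entries whose keys lie in the closed range [lo, hi]. */
    static NavigableMap<Integer, String> range(TreeMap<Integer, String> map, int lo, int hi) {
        return map.subMap(lo, true, hi, true);
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(10, "a");
        map.put(20, "b");
        map.put(30, "c");
        map.put(40, "d");
        System.out.println(range(map, 15, 35)); // prints {20=b, 30=c}
    }
}
```

The returned map is a view backed by the original TreeMap, so copy it (for example with new TreeMap<>(view)) if a snapshot is needed.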
Depending on how flexible and extensible you need things to be, you could consider using an in-memory database. That would give you far more capability than you've mentioned here, and is probably only interesting if you think you might have a use for a lot more one day. You would be taking on a lot of complexity, and possibly space, in exchange for something extremely flexible. But you should be aware that several (free) Java databases offer in-memory configurations, including Derby (released with Java).
Is an interval tree maybe what you're looking for?
I want to implement an OCR system. I need my program to not make any mistakes on the letters it does choose to recognize. It doesn't matter if it cannot recognize a lot of them (i.e., high precision even with low recall is okay).
Can someone help me choose a suitable ML algorithm for this? I've been looking around and found some confusing things. For example, I found contradictory statements about SVMs. The scikit-learn docs mention that we cannot get probability estimates for an SVM, whereas I found another post that says it is possible to do this in WEKA.
Anyway, I am looking for a machine learning algorithm that best suits this purpose. It would be great if you could suggest a library for the algorithm as well. I prefer Python-based solutions, but I am OK with working in Java as well.
It is possible to get probability estimates from SVMs in scikit-learn by simply setting probability=True when constructing the SVC object. The docs only warn that the probability estimates might not be very good.
The quintessential probabilistic classifier is logistic regression, so you might give that a try. Note that LR is a linear model though, unlike SVMs which can learn complicated non-linear decision boundaries by using kernels.
I've seen people using neural networks with good results, but that was already a few years ago. I asked an expert colleague and he said that nowadays people use things like nearest-neighbor classifiers.
I don't know scikit or WEKA, but any half-decent classification package should have at least k-nearest neighbors implemented. Or you can implement it yourself, it's ridiculously easy. Give that one a try: it will probably have lower precision than you want, however you can make a slight modification where instead of taking a simple majority vote (i.e. the most frequent class among the neighbors wins) you require larger consensus among the neighbors to assign a class (for example, at least 50% of neighbors must be of the same class). The larger the consensus you require, the larger your precision will be, at the expense of recall.
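The consensus modification described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name ConsensusKnn, the REJECT sentinel, squared Euclidean distance, and non-negative integer class labels are all choices made for the example, not something prescribed by the answer.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class ConsensusKnn {
    static final int REJECT = -1; // returned when the neighbours do not agree strongly enough

    /**
     * Classifies query by its k nearest training points, but only if at least
     * minConsensus (a fraction in (0, 1]) of those neighbours share one label.
     * Labels are assumed to be non-negative integers.
     */
    static int classify(double[][] points, int[] labels, double[] query, int k, double minConsensus) {
        int n = points.length;
        double[] dist = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            double sum = 0.0;
            for (int d = 0; d < query.length; d++) {
                double diff = points[i][d] - query[d];
                sum += diff * diff;                 // squared Euclidean distance
            }
            dist[i] = sum;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i]));

        int used = Math.min(k, n);
        Map<Integer, Integer> votes = new HashMap<>();
        int bestLabel = REJECT;
        int bestCount = 0;
        for (int r = 0; r < used; r++) {
            int label = labels[order[r]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                bestLabel = label;
            }
        }
        // Assign a class only when the winning label reaches the required share of votes.
        return bestCount >= minConsensus * used ? bestLabel : REJECT;
    }
}
```

Raising minConsensus towards 1.0 makes the classifier abstain more often, which is the precision-for-recall trade described in the answer.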
I'm still seeking an ideal solution to this question. To summarize, I am modeling a power subsystem in Java and I need a Directed-Acyclic-Graph (DAG)-type container for my data.
I found exactly what I need in C++'s Standard Template Library (STL). It is the multiset, which supports storing multiple data values for the same key. I can clearly see how storing power nodes as keys, and their upstream/downstream connections as values, could be pulled off with this data structure.
My customer has a hard requirement that I write the power subsystem model in Java, so I need a data structure identical to the STL multiset. I could potentially roll my own, but it's late in the game and I can't afford the risk of making a mistake.
I'm supremely disappointed that Java is so light on Tree / Graph collections.
Has anyone found a multiset-type structure in Java?
Check out Guava's Multiset. In particular the HashMultiset and the TreeMultiset.
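Guava is an external dependency, so as a rough illustration of what a sorted multiset does (it keeps track of how many times each element occurs), here is a JDK-only sketch built on TreeMap. The class CountingMultiset is hypothetical and is not Guava's API; for real use the Guava classes above are the safer choice.

```java
import java.util.TreeMap;

/** Hypothetical JDK-only sketch of a sorted multiset: each element maps to its occurrence count. */
final class CountingMultiset<E extends Comparable<E>> {
    private final TreeMap<E, Integer> counts = new TreeMap<>();
    private int size; // total number of occurrences across all elements

    void add(E element) {
        counts.merge(element, 1, Integer::sum);
        size++;
    }

    /** Removes one occurrence; returns false if the element is not present. */
    boolean remove(E element) {
        Integer current = counts.get(element);
        if (current == null) {
            return false;
        }
        if (current == 1) {
            counts.remove(element);
        } else {
            counts.put(element, current - 1);
        }
        size--;
        return true;
    }

    int count(E element) {
        return counts.getOrDefault(element, 0);
    }

    int size() {
        return size;
    }

    /** Smallest element currently stored; throws NoSuchElementException when empty. */
    E first() {
        return counts.firstKey();
    }
}
```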
Have you looked at Google's version: http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Multiset.html
I've started solving UVa problems again as a way to pass the time (I'm going into the army in 6 weeks). I love writing Java, but I end up using C/C++. It's not because I/O is faster, there's no need to box data, there's more memory, or unsigned types are available; it's algorithmic efficiency that counts.
In short, I am slowly building a how-to/article/code base covering different categories of efficient algorithms, and DP is next.
Quoting Mark Twain: It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.
I need assistance in building a priority list of must-have efficient algorithms.
This MIT lecture is a good introduction to dynamic programming if you're already familiar with algorithms.
The Wikipedia article on Dynamic Programming has a section entitled "Algorithms that use dynamic programming" with many examples.
Here is another good list of practice problems in dynamic programming.
Since you referenced the UVa problem list, you should definitely take a look at Problem 103 - Stacking Boxes. The problem lends itself well to a solution using a Longest Increasing Subsequence algorithm.
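For reference, a minimal sketch of the classic O(n^2) dynamic-programming form of Longest Increasing Subsequence on a plain int array is shown below. The class name Lis is made up for the example. The box problem itself would need the a[j] < a[i] comparison replaced by a "box j fits inside box i" test, which is not shown here.

```java
final class Lis {
    /** Returns the length of the longest strictly increasing subsequence of a. */
    static int length(int[] a) {
        int[] best = new int[a.length]; // best[i] = length of the longest such subsequence ending at i
        int answer = 0;
        for (int i = 0; i < a.length; i++) {
            best[i] = 1;                // the element on its own
            for (int j = 0; j < i; j++) {
                // Extend any subsequence ending at a smaller earlier element.
                if (a[j] < a[i] && best[j] + 1 > best[i]) {
                    best[i] = best[j] + 1;
                }
            }
            answer = Math.max(answer, best[i]);
        }
        return answer;
    }
}
```

For example, Lis.length(new int[]{10, 9, 2, 5, 3, 7, 101, 18}) returns 4 (one such subsequence is 2, 3, 7, 18).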