Is there an open-source Java implementation of the VCDIFF binary diff format (decoder and encoder)?
There are xdelta and open-vcdiff, but those are both C libraries.
Alternatively, are there other formats/algorithms one could use to generate diffs for binary files from Java?
You can generate binary diffs using badiff; the website is
http://badiff.org/
and it is available on Maven Central. It's BSD licensed, so it's friendly for both OSS and commercial use. The algorithm used is a chunked version of the O(ND) diff described in this paper:
http://www.xmailserver.org/diff2.pdf
The diff format isn't particularly compatible with anything else, but it produces very compact diffs.
The library is pretty fast; on my desktop machine it can generate a diff for two random 50MB input streams in 54 seconds. Hopefully that's fast enough; I think it's reasonably impressive since that's a comparison of two token streams of 50 million tokens each. badiff will take advantage of multiple CPU cores when computing diffs.
Disclaimer: I'm the author of badiff, so of course I think it's cool. I'm always open to suggestions; things like being able to read/write "standard" binary diff formats sound like cool new features to add in upcoming releases.
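For anyone curious what the O(ND) approach in that paper looks like in code, here is a minimal, unoptimized Java sketch of the forward pass of Myers' algorithm over two byte arrays. It computes only the length of the shortest edit script, not the script itself; badiff's actual chunked, streaming implementation is considerably more involved.

    // Minimal sketch of the forward pass of Myers' O(ND) algorithm (the paper above).
    // It returns only the length of the shortest edit script between two byte arrays;
    // a real diff tool also recovers the edit script and works on chunks/streams.
    static int shortestEditScriptLength(byte[] a, byte[] b) {
        int n = a.length, m = b.length;
        int max = n + m;
        int offset = max + 1;
        int[] v = new int[2 * max + 3]; // v[k + offset] = furthest x reached on diagonal k
        for (int d = 0; d <= max; d++) {
            for (int k = -d; k <= d; k += 2) {
                int x;
                if (k == -d || (k != d && v[k - 1 + offset] < v[k + 1 + offset])) {
                    x = v[k + 1 + offset];     // step down: insert b[y]
                } else {
                    x = v[k - 1 + offset] + 1; // step right: delete a[x]
                }
                int y = x - k;
                while (x < n && y < m && a[x] == b[y]) { x++; y++; } // follow the diagonal "snake"
                v[k + offset] = x;
                if (x >= n && y >= m) {
                    return d; // d insertions/deletions suffice
                }
            }
        }
        return max; // not reached in practice
    }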
I have a decoder for VCDIFF written in C#, which would probably be fairly straightforward to port to Java, if that's any help. It's part of MiscUtil but I don't think it relies on any other bits of MiscUtil (or only minimally, anyway).
Unfortunately I never got round to writing an encoder, which is obviously rather harder - and wasn't necessary in our case (where we needed to apply patches in .NET on a mobile device, but could create them however we wanted at the server).
I have ported MiscUtil's VCDIFF decoder to Java.
https://github.com/xiaxiaocao/jvcdiff
Update: it now also has a VCDIFF encoder.
There is a Java port of xdelta:
http://sourceforge.net/projects/javaxdelta/
But I can't say anything about its quality - I haven't tried it yet.
I have a Java port of open-vcdiff on Github. It's tested against open-vcdiff, but it's not used in production anywhere.
Which other frameworks exist besides Mahout for implementing machine learning algorithms in Java, such that the underlying framework takes the Java code and runs it on Hadoop?
I am looking for alternatives to Mahout because I need an SVM and an agglomerative clustering implementation on Hadoop, and only SVM is supported in Mahout.
I recommend this Apache Hadoop-based machine learning / data mining library, similar to Apache Mahout:
http://www.openankus.org/pages/viewpage.action?pageId=2195722
It makes MapReduce job processing simple and easy. If you are interested, see the wiki for more (http://www.openankus.org).
Well, if SVM already runs on Hadoop, the rest is easy to implement!
Note that the naive agglomerative clustering algorithm is not efficient for large data (O(n^2) complexity). That complexity makes it impractical to run on a large dataset, even on a big cluster, unless you try one of its extensions, like this one: ftp://193.167.42.127/franti/papers/GraphPnn-TPAMI.pdf
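To make the quadratic cost concrete, here is a toy, purely illustrative Java sketch of naive single-linkage agglomerative clustering on 1-D points. The repeated scan over all cluster pairs (on top of the pairwise distances themselves) is exactly what stops the naive algorithm from scaling, no matter how large the cluster of machines is.

    import java.util.*;

    // Toy sketch of naive single-linkage agglomerative clustering on 1-D points.
    // Every merge step rescans all cluster pairs, so the cost is at least quadratic
    // in the number of points - which is why the naive algorithm does not scale.
    public class NaiveAgglomerative {
        public static List<List<Double>> cluster(double[] points, int targetClusters) {
            List<List<Double>> clusters = new ArrayList<>();
            for (double p : points) clusters.add(new ArrayList<>(List.of(p)));

            while (clusters.size() > targetClusters) {
                int bestI = -1, bestJ = -1;
                double bestDist = Double.MAX_VALUE;
                // O(k^2) scan over all remaining cluster pairs at every merge step
                for (int i = 0; i < clusters.size(); i++) {
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = minDistance(clusters.get(i), clusters.get(j));
                        if (d < bestDist) { bestDist = d; bestI = i; bestJ = j; }
                    }
                }
                clusters.get(bestI).addAll(clusters.remove(bestJ)); // merge the closest pair
            }
            return clusters;
        }

        // Single linkage: cluster distance = smallest pairwise point distance
        private static double minDistance(List<Double> a, List<Double> b) {
            double min = Double.MAX_VALUE;
            for (double x : a) for (double y : b) min = Math.min(min, Math.abs(x - y));
            return min;
        }

        public static void main(String[] args) {
            System.out.println(cluster(new double[]{1, 2, 10, 11, 50}, 3));
            // prints [[1.0, 2.0], [10.0, 11.0], [50.0]]
        }
    }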
Pattern. It has a Java API and you can use R too.
http://www.cascading.org/pattern/
A quick Googling turned up the following:
http://java-ml.sourceforge.net/ - After close to 3 years, there was a release. I'm not sure how well it is supported or which algorithms are implemented.
http://sourceforge.net/projects/weka/ - Some recent recommendations by others look good.
Also, see this thread.
I haven't tried either of them.
I am new to NLP and have been researching which language toolkit I should use for the following. I would like to do one of two things, both of which accomplish the same goal:
I basically would like to classify a text, usually a single sentence of about 15 words, and determine whether the sentence is talking about a specific subject.
Is there a tool that, given a sentence, finds out the subject of that sentence?
I am using PHP and Java, but the tool can be anything that runs on the Linux command line.
Thank you very much.
The most basic way of doing this is to create a set of labeled training data and use it to train a classifier. How the classifier works is a more complicated issue; for spam filtering and many other things, just looking at word frequency works pretty well.
Here is a basic example: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html
It is trivial to write a Naive Bayes classifier; a package like MALLET will also have this, plus better machine learning methods. LingPipe will also have this sort of thing.
What you really should care about is the quality of your data and what your features are. By quality of data I mean lots of data without too many borderline cases, and by features I mean anything from plain words to combinations of words (word n-grams), dependency features, or something more complex. You need a way to create the feature data as well as actually do the learning. In this sense LingPipe is good, as you can do tokenization and all of that first, as opposed to writing your own functions or having to cobble other tools together into your own feature-generation code.
A guide to MALLET can be found here: http://courses.washington.edu/ling570/fei_fall10/11_15_Mallet.pdf
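To show how simple the word-frequency approach really is, here is a minimal bag-of-words Naive Bayes sketch in Java with add-one smoothing. The training sentences and labels are made up for illustration; a real system would use MALLET or LingPipe for tokenization, feature extraction and better models, as described above.

    import java.util.*;

    // Minimal bag-of-words Naive Bayes with add-one (Laplace) smoothing.
    // The training data below is purely illustrative.
    public class NaiveBayes {
        Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
        Map<String, Integer> docCounts = new HashMap<>();               // label -> #documents
        Map<String, Integer> totalWords = new HashMap<>();              // label -> total words
        Set<String> vocab = new HashSet<>();
        int totalDocs = 0;

        void train(String label, String text) {
            totalDocs++;
            docCounts.merge(label, 1, Integer::sum);
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                vocab.add(w);
                wordCounts.computeIfAbsent(label, k -> new HashMap<>()).merge(w, 1, Integer::sum);
                totalWords.merge(label, 1, Integer::sum);
            }
        }

        String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : docCounts.keySet()) {
                // log prior + sum of log likelihoods with add-one smoothing
                double score = Math.log(docCounts.get(label) / (double) totalDocs);
                for (String w : text.toLowerCase().split("\\W+")) {
                    if (w.isEmpty()) continue;
                    int c = wordCounts.get(label).getOrDefault(w, 0);
                    score += Math.log((c + 1.0) / (totalWords.get(label) + vocab.size()));
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }

        public static void main(String[] args) {
            NaiveBayes nb = new NaiveBayes();
            nb.train("sports", "the team won the game last night");
            nb.train("politics", "the senate passed the new bill");
            System.out.println(nb.classify("a close game for the team")); // prints sports
        }
    }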
NLTK may solve this problem.
I found the web service API below handy and usable off the shelf:
http://text-processing.com/demo/
About two months ago I found an open-source project from Google that can store key-value pairs with high performance, but I forget the name. Could anybody tell me, or do you have some other suggestions for me? I have been using BerkeleyDB, but I found that BerkeleyDB is not fast enough for my program. However, BerkeleyDB is convenient to use, as it comes as a Java library jar that can be integrated with my program seamlessly. My program is also written in Java.
Two strong competitors in the DHT (Distributed Hash Table) 'market':
Cassandra (created by Facebook, in use by Digg and Twitter)
HBase
Here is a presentation about Cassandra. On slide 20 you'll see some speed benchmarks: 0.12 ms / write.
(You can search around for the whole presentation, including Eric Evans talking)
Nobody has mentioned LevelDB, yet this post is at the top when searching for "good key value store". LevelDB in my experience is simply awesome. It's so fast I couldn't believe it.
I've been trying quite a few databases for a task I was doing. I tried:
Windows Azure Table Storage (expensive; value size max 1 MB and each property max 64 KB)
Redis (awesome if you have as much RAM as you please)
MongoDB (awesome as long as there is enough RAM; breaks down after that point)
SQL Server (expensive, needs maintenance such as rebuilding indexes, and eventually still not fast enough)
SQLite (free, but not as simple to use as LevelDB and not fast)
LevelDB. If you can model your job as reading large consecutive chunks of data through an iterator, you'll get great speed. Writing is also pretty fast. Combine it with an SSD and you'll love it.
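As a rough illustration of that iterator-based access pattern from Java, here is a sketch using the iq80 pure-Java LevelDB port (org.iq80.leveldb). The factory, put and iterator calls follow that binding's interfaces as I remember them, but treat the exact names as an assumption and check the library's documentation before relying on this.

    import static org.iq80.leveldb.impl.Iq80DBFactory.*;

    import java.io.File;
    import java.io.IOException;
    import java.util.Map;
    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.DBIterator;
    import org.iq80.leveldb.Options;

    // Assumed API of the iq80 LevelDB port; verify against its docs before use.
    public class LevelDbScan {
        public static void main(String[] args) throws IOException {
            Options options = new Options().createIfMissing(true);
            try (DB db = factory.open(new File("example-db"), options)) {
                db.put(bytes("key:1"), bytes("value one"));
                db.put(bytes("key:2"), bytes("value two"));

                // Sequential scan through an iterator - the access pattern the
                // answer above says LevelDB handles especially well.
                try (DBIterator it = db.iterator()) {
                    for (it.seekToFirst(); it.hasNext(); it.next()) {
                        Map.Entry<byte[], byte[]> e = it.peekNext();
                        System.out.println(asString(e.getKey()) + " = " + asString(e.getValue()));
                    }
                }
            }
        }
    }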
Bigtable?
Redis
http://code.google.com/p/redis/
Maybe you should describe what features you need. If it doesn't need to be distributed (does it?) then I would try using the H2 Database. For those who think "it can't be fast because it's using SQL", please note that when using prepared statements, SQL parsing is only done once. Disclaimer: I'm the main author of H2.
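To make the prepared-statement point concrete, here is a minimal sketch of using embedded H2 as a local key-value table over JDBC, with the put and get statements prepared once and reused. The file path and table name are just illustrative choices; the MERGE-based upsert uses H2's SQL dialect.

    import java.sql.*;

    // Minimal sketch: embedded H2 used as a local key-value table via JDBC.
    // The database path, table name and MERGE-based upsert are illustrative.
    public class H2KeyValue {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./kvstore")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS kv(k VARCHAR PRIMARY KEY, v VARCHAR)");
                }
                // Prepared statements are parsed once and reused for every call.
                try (PreparedStatement put = conn.prepareStatement(
                             "MERGE INTO kv KEY(k) VALUES(?, ?)");
                     PreparedStatement get = conn.prepareStatement(
                             "SELECT v FROM kv WHERE k = ?")) {
                    put.setString(1, "user:42");
                    put.setString(2, "alice");
                    put.executeUpdate();

                    get.setString(1, "user:42");
                    try (ResultSet rs = get.executeQuery()) {
                        if (rs.next()) System.out.println(rs.getString(1)); // prints alice
                    }
                }
            }
        }
    }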
Many answers seem to automatically assume a need for distribution, but that seems odd if the question refers to BDB.
With that in mind, beyond Redis and H2 (which are both good), there is also Tokyo Cabinet to consider, which seems to offer benefits over BDB. One more recent possibility is Krati.
I think you saw Guava or Google collections.
As a practical developer I would like to build a good algorithm for my specific task out of blocks like 'boundary extraction' or 'gamma correction', but I don't want to reinvent the wheel implementing all of that, so I wonder: is there a powerful CV library for Java, like C++'s OpenCV?
By "the best", I mean a library with the following properties:
Lot of different algorithms implemented
Extensibility - I can create new stuff in terms of the library
High performance
Thread safety
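As a concrete example of the kind of building block mentioned in the question, here is a minimal plain-Java gamma correction sketch using a BufferedImage lookup table; any of the libraries recommended below would provide this (and far more) out of the box.

    import java.awt.image.BufferedImage;

    // Minimal gamma correction using a per-channel lookup table.
    public class GammaCorrection {
        public static BufferedImage apply(BufferedImage src, double gamma) {
            int[] lut = new int[256];
            for (int i = 0; i < 256; i++) {
                lut[i] = (int) Math.round(255.0 * Math.pow(i / 255.0, 1.0 / gamma));
            }
            BufferedImage dst = new BufferedImage(src.getWidth(), src.getHeight(),
                                                  BufferedImage.TYPE_INT_RGB);
            for (int y = 0; y < src.getHeight(); y++) {
                for (int x = 0; x < src.getWidth(); x++) {
                    int rgb = src.getRGB(x, y);
                    int r = lut[(rgb >> 16) & 0xFF];
                    int g = lut[(rgb >> 8) & 0xFF];
                    int b = lut[rgb & 0xFF];
                    dst.setRGB(x, y, (r << 16) | (g << 8) | b);
                }
            }
            return dst;
        }
    }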
You might be interested in a pure Java open-source computer vision library I have been developing, BoofCV. BoofCV supports many common image processing operations, advanced feature detection, wavelet denoising, camera calibration, stereo vision and structure from motion. It's also very fast: it currently has the fastest SURF implementation of any open-source library, including C/C++ ones. Speed-wise, it is very competitive with OpenCV for mid- to high-level vision algorithms; OpenCV is of course faster for low-level image processing.
Website: http://boofcv.org
OpenCV vs BoofCV: http://boofcv.org/index.php?title=Performance:OpenCV:BoofCV
SURF Performance Study: http://boofcv.org/index.php?title=Performance:SURF
OK, enough marketing. Hope you like it!
Much of the code is already in place, just missing a couple of components.
Shaman,
I have been looking for a long time for an image processing library comparable to OpenCV in Java. For the number of automated tasks OpenCV performs, there is nothing that comes close to it for advanced machine vision applications.
In terms of image processing, though, ImageJ has a large number of pre-implemented algorithms and plugins. I use this library all the time to preprocess things I need to send into OpenCV's machine vision utilities. It is also open source, with easy ways of adding features through plugins or direct manipulation, so I think it could meet most of your requirements.
OpenCV has Java wrappers:
OpenCV Java and Processing library
JavaCV
This is along similar lines as these recent questions:
Best Java Obfuscation Application For Size Reduction
Creating non-reverse-engineerable Java programs
However, one ends up recommending yGuard and the other ProGuard, but neither mentions both. I wonder if we could get a comparison of the two and hear people's experiences from both sides of the fence. Looking at the comparison chart on the ProGuard website, it's clearly angled towards ProGuard. But what about real-world experience with each: which one produces smaller output? Which one is harder to decompile? Which Java versions are supported by each?
Personally I'm particularly interested from a J2ME point of view but please don't limit the discussion to that.
Results for my project.
Obfuscation - both fine.
Optimisation - ProGuard produced 20% faster code (for the measured app bottleneck).
Compactness - ProGuard about 5% smaller.
Configuration / Ant - YGuard is much easier to configure.
So, I'd advise ProGuard - but configuration and Ant integration could definitely be improved.
ProGuard is the better product, especially if you take the time to go through the settings for J2ME.
Specifically for J2ME there is a far better (commercial) product called mBooster.
I've been getting around a 25% improvement in size on my application after it's been through ProGuard. This is mainly due to the better ZIP compression of the JAR file and comprehensive support for class merging and preverification.
My opinion is that ProGuard is better. The output is a bit smaller, and optimization is better and much faster.
Decompiling is simple in both cases. I mean, if you know Java well and really understand the business logic of what you're decompiling, there is no problem getting back to the sources from the obfuscated classes.
So, my opinion is ProGuard is better.