Java Open Source Text Mining Frameworks [closed] - java

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I want to know which open source Java framework is best for text mining, using both machine learning and dictionary methods.
I'm using Mallet, but there is not much documentation and I do not know whether it will fit all my requirements.

I honestly think that the several answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods and I do not have any licensing problems. Plus, I can write wrappers for other ML methodologies, and I get the advantage of the UIMA framework, which is very well organized and fast.
Thank you all for your interesting answers.
Best Regards,
ukrania

Although not a specialized text mining framework, Weka has a number of classifiers usually employed in text mining tasks such as: SVM, kNN, multinomial NaiveBayes, among others.
It also has a few filters for working with textual data, such as the StringToWordVector filter, which can perform a TF/IDF transformation.
Check out the Weka wiki website for more information.
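To make the TF/IDF idea concrete, here is a minimal plain-Java sketch of the weighting. It is not Weka code and does not use the StringToWordVector filter; the class name, the tokenizer, and the exact formula (raw term count times ln(N / document frequency)) are assumptions chosen for illustration, and Weka's own options may weight terms differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of TF-IDF weighting; this is not Weka code.
final class TfIdfSketch {

    // Lowercase the text and split on anything that is not an ASCII letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one term -> weight map per document, where
    // weight = (raw term count in the document) * ln(N / documentFrequency).
    static List<Map<String, Double>> tfIdf(List<String> documents) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : documents) {
            Map<String, Integer> tf = new HashMap<>();
            for (String token : tokenize(doc)) tf.merge(token, 1, Integer::sum);
            counts.add(tf);
            // Each distinct term counts once per document it appears in.
            for (String term : tf.keySet()) docFreq.merge(term, 1, Integer::sum);
        }
        int n = documents.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the dog barked", "the cat purred");
        tfIdf(docs).forEach(System.out::println);
    }
}
```

A term that occurs in every document (like "the" above) gets weight zero, which is the point of the IDF factor: it suppresses words that do not help tell documents apart.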

Maybe have a look at Java Open Source NLP and Text Mining tools.

I've used LingPipe -- a suite of Java libraries for the linguistic analysis of human language -- for text mining (and other related) tasks.
It is a very well documented software package, and the site contains several tutorials which thoroughly explain how to do a certain task with LingPipe, such as named entity recognition. There is also a newsgroup, wherein you can post any question you have about the software (or NLP related tasks), and have a prompt reply from the authors of the package themselves; and of course, a blog.
The source code is also very easy to follow and well documented which, for me, is always a big plus.
As for machine learning algorithms, there are plenty, from Naïve Bayes to Conditional Random Fields. For dictionary matching, they have an ExactDictionaryChunker, which is an implementation of the Aho-Corasick algorithm (a very, very fast algorithm for this task).
In sum, I think it is one of the best NLP software packages for Java (I haven't used every single package that is out there, so I can't say it's the best), and I definitely recommend it for the task that you have at hand.
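For readers who have not met Aho-Corasick before, the following is a rough plain-Java sketch of the algorithm. It is not LingPipe's code and does not use its API; the class name, the method names, and the "start:entry" output format are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative Aho-Corasick matcher; names are hypothetical, not LingPipe's API.
final class DictionaryMatcherSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // node for the longest proper suffix that is also a trie path
        final List<String> outputs = new ArrayList<>();  // dictionary entries that end at this node
    }

    private final Node root = new Node();

    DictionaryMatcherSketch(List<String> dictionary) {
        // 1. Build a trie of all (non-empty) dictionary entries.
        for (String word : dictionary) {
            if (word.isEmpty()) continue;
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(word);
        }
        // 2. Breadth-first pass: set failure links and inherit outputs from them.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                Node child = e.getValue();
                Node f = current.fail;
                while (f != null && !f.next.containsKey(e.getKey())) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every match as "startIndex:entry".
    List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some node has a transition on c.
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            node = node.next.getOrDefault(c, root);
            for (String word : node.outputs) {
                matches.add((i - word.length() + 1) + ":" + word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        DictionaryMatcherSketch m = new DictionaryMatcherSketch(List.of("he", "she", "his", "hers"));
        System.out.println(m.findAll("ushers"));
    }
}
```

The speed comes from the single pass: the text is never rescanned, no matter how many dictionary entries there are, because the failure links encode where to resume after a mismatch.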

You may already know about GATE: http://gate.ac.uk/
...but that's what we've used (at my day job) for lots of different text mining problems. It's pretty flexible and open.

I built a maximum entropy named entity recognizer for CoNLL data using OpenNLP MaxEnt http://sourceforge.net/projects/maxent/ for a course once.
It required a lot of data preprocessing with custom Perl scripts to get all the features extracted into nice neat numerical vectors, though.

We use Lucene to process live streams from the internet. It has a native Java API.
http://lucene.apache.org/java/docs/
You can then use Mahout, which is a collection of machine learning algorithms that operate on top of Lucene.
http://lucene.apache.org/mahout/

Related

Math Intensive, Calculation Based Website - Which Language Should I Use? [closed]

Closed 11 years ago.
I am very new to programming. I am familiar with HTML, C++ and learning PHP to start a database.
I want to make a website which tracks a stock price. I have written various algorithms in MATLAB; however, MATLAB only has a to-Java conversion.
I was wondering which language would be best for doing a lot of calculations. I want my calculations to be done in real time and plotted. Would Java be the best language for this?
I can do the calculations in C++ but I don't know how to put the plots on the website. Likewise, I believe I can do everything in MATLAB, but the conversion looks a little sketchy.
I would be very thankful if someone with experience with Java, or with Python, which I have also heard about, would comment on my post.
My advice: write the website code in Python with Django and do the calculations in NumPy/SciPy. Those two libraries provide a very MATLAB-like API for heavy computations. Their performance is excellent. Matplotlib is the associated plotting library.
It's not so much the language that matters, it's making sure that you have a good mathematics library for it. MATLAB is neat because it does all that matrix math super fast for you, but of course you need to link it with another language like you said.
Your goal should be to either find a good math library for the language you like, or find a language with a good math library you like.
For What It's Worth: I know Python has NumPy (scientific computing package) and Sage Math (a libre Mathematica clone).
I think you can use PHP or Java Web.
I would do the calculations in C++ and write the results to a database; PHP can then read them from the same database and show them online. Alternatively, Java can do all of that. Either way, make sure the calculations aren't all done on the fly, since that will kill your server, especially with stocks, which can turn into a lot of data.
If you want to plot data, you may be able to pass off some of the work to the Google Chart API:
http://code.google.com/apis/chart/

Tools to be used for Image processing [closed]

Closed 10 years ago.
I'm developing a CBIR (content-based image retrieval) system as part of my BE project.
Which of the tools below would be better for image processing?
1-> Matlab
2-> Mathematica
I'm planning to develop this system using Java as the front end. Which of the above would be better? Or should I go for one of the third-party image processing APIs available for Java?
I used Mathematica for years and still found it easier to learn Matlab from scratch in order to do an image processing project. The thing that makes Matlab better here is that many state-of-the-art image algorithms have code available. For instance, for content-based image retrieval you need to extract content features, and vl_sift library does that. Also, you can bundle your Matlab library to run as a stand-alone executable, and I don't know if that's possible with Mathematica.
I previously suggested ImageJ, and others mentioned ImageMagick, since a Java environment was mentioned. However, I would like to change my suggestion. I came across Intel's OpenCV (Open Source Computer Vision) libraries. This is a great set of libraries for use with C, C++ and Python. It is cross-platform too, so porting the code shouldn't be too difficult.
I think OpenCV is great because even novices in image processing (like me) can use it. For example, smoothing an image is as easy as calling one function, cvSmooth(), with a few parameters for the type of smoothing (blur, Gaussian, etc.). It also supports much more advanced functions such as optical flow and blob tracking. And the great thing is, it's quick to test out or build simple image transforms.
For more info, please go to http://opencv.willowgarage.com/wiki/ . There you'll find cheat sheets, reference manuals, examples and some tips. It is a great help and starting point.
Thanks
What are your criteria for measuring the relative superiority of programs for image processing ? For example, if you are a Mathematica expert then you will find it easier to use Mathematica for the task. On the other hand, if you are a penniless student then you will find Java and some of its libraries more to your taste.
EDIT: in answer to OP's comments ...
'ease of image processing' is entirely subjective -- if you don't know Mathematica then it will be difficult to use it for image processing -- so this one is your call.
'processing time' is entirely objective -- but do you have the time to try out all 3 of your suggested options and compare them ? For a BE project you'll be far better using the tool you are most comfortable with and spending as little time as you can wrestling with an unfamiliar tool for the sake of a bit of extra speed.
'cellular automata' for image processing -- don't know how relevant it is, but Mathematica has inbuilt functionality for cellular automata.
I would look into the ImageMagick/GraphicsMagick family (SO discussion), which has several Java wrappers (e.g., JMagick).
You could use ImageMagick, or why not look into the JMF (Java Media Framework)?
Matlab is the better of the two. It has huge built-in libraries and implementations of thousands of algorithms. It's fast, easy and well documented.

Large scale Machine Learning [closed]

Closed 10 years ago.
I need to run various machine learning techniques on a big dataset (10-100 billion records).
The problems are mostly around text mining/information extraction and include various kernel techniques, but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees: many different problems and ways to solve them).
What would be the best implementation? I'm experienced in ML but do not have much experience applying it to huge datasets.
Are there any extensible and customizable machine learning libraries that use MapReduce infrastructure?
Strong preference for C++, but Java and Python are OK.
Amazon, Azure, or our own datacenter (we can afford it)?
Unless the classification state space you are attempting to learn is extremely large, I would expect that there is significant redundancy in a text-mining-focused dataset with 10-100 billion records or training samples. As a rough guess, I would doubt that one would need much more than a 1-2% random sample subset to learn reliable classifiers that would hold up well under cross-validation testing.
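If you go the sampling route and the records arrive as a stream too large to hold in memory, one standard option is reservoir sampling, which draws a fixed-size uniform sample in a single pass. The sketch below is only an illustration under that assumption; the class and method names are made up, and a simple per-record coin flip at a 1-2% rate would work as well if an approximate sample size is acceptable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

// Reservoir sampling (Algorithm R); class and method names are hypothetical.
final class ReservoirSampleSketch {

    // Keeps a uniform random sample of up to k items from a stream of unknown length.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k items fill the reservoir directly.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the item is kept with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) reservoir.set((int) j, item);
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Draw a 1% sample (1,000 of 100,000) from a stand-in record stream.
        Iterator<Integer> records = IntStream.range(0, 100_000).boxed().iterator();
        List<Integer> subset = sample(records, 1_000, new Random(42));
        System.out.println(subset.size());
    }
}
```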
A quick literature search came up with the following relevant papers. The Tsang paper claims O(n) time complexity for n training samples, and there is software related to it available as the LibCVM toolkit. The Wolfe paper describes a distributed EM approach based on MapReduce.
Lastly, there was a Large-Scale Machine Learning workshop at the NIPS 2009 conference that looks to have had lots of interesting and relevant presentations.
References
Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung (2005). "Core Vector Machines: Fast SVM Training on Very Large Data Sets", Journal of Machine Learning Research, vol 6, pp 363–392.
J Wolfe, A Haghighi, D Klein (2008). "Fully Distributed EM for Very Large Datasets", Proceedings of the 25th International Conference on Machine Learning, pp 1184-1191.
Olivier Camp, Joaquim B. L. Filipe, Slimane Hammoudi and Mario Piattini (2005). "Mining Very Large Datasets with Support Vector Machine Algorithms ", Enterprise Information Systems V, Springer Netherlands, pp 177-184.
Apache Mahout is what you are looking for.
Late answer, but here is a good link for large scale data mining and machine learning:
The GraphLab project consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. In addition, we are actively developing new interfaces to allow users to leverage the GraphLab API from other languages and technologies.
I'm not aware of any ML library that uses map/reduce. Maybe you can use an ML library and a map/reduce library together? You might want to look into Hadoop's MapReduce:
http://hadoop.apache.org/mapreduce/
You would have to implement the map and reduce methods. The fact that you use so many techniques might complicate this.
You can run it on your own cluster or, if you are doing research, maybe you could look into BOINC (http://boinc.berkeley.edu/).
On the other hand, maybe you can reduce your data set. I have no idea what you are training on, but there must be some redundancy in 10 billion records...
I don't know of any ML libraries that can support 10 to 100 billion records. That's a bit of an extreme, so I wouldn't expect to find anything off the shelf. What I would recommend is that you take a look at the Netflix Prize winners: http://www.netflixprize.com//community/viewtopic.php?id=1537
The Netflix Prize had over 100 million entries, so while it's not quite as big as your data set, you may still find their solutions applicable. What the BellKor team did was to combine multiple algorithms (something similar to ensemble learning) and weight the "prediction" or output of each algorithm.
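As a very rough picture of what "weighting the output of each algorithm" means, here is a plain-Java sketch of a weighted average over per-model predictions. This is not the BellKor method, whose blending was considerably more elaborate; the class name and the example weights are invented for illustration, and in practice the weights would be fitted on held-out data.

```java
// Weighted blending of model outputs; a simplified illustration, not BellKor's actual method.
final class BlendSketch {

    // Returns the weighted average of the predictions made by several models for one example.
    static double blend(double[] predictions, double[] weights) {
        if (predictions.length != weights.length) {
            throw new IllegalArgumentException("need exactly one weight per prediction");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            weightedSum += predictions[i] * weights[i];
            totalWeight += weights[i];
        }
        if (totalWeight == 0.0) {
            throw new IllegalArgumentException("weights must not sum to zero");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Three hypothetical models predict a rating; the first is trusted most.
        double[] predictions = {3.0, 4.0, 5.0};
        double[] weights = {0.5, 0.25, 0.25};
        System.out.println(blend(predictions, weights)); // prints 3.75
    }
}
```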
Take a look at http://hunch.net/?p=1068 for info on Vowpal Wabbit; it's a stochastic gradient descent library for large-scale applications.
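In case stochastic gradient descent is unfamiliar, the idea is to update the model after every single example instead of after a full pass over the data, which is why it scales to very large datasets. Below is a minimal plain-Java sketch for logistic regression. It is not Vowpal Wabbit code and does not reflect its internals; the class name, learning rate, and toy data are assumptions for illustration.

```java
// Plain SGD on logistic loss; an illustration only, not Vowpal Wabbit code.
final class SgdLogisticSketch {
    private final double[] weights;
    private final double learningRate;

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Predicted probability that the label is 1.
    double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) z += weights[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One stochastic gradient step on a single example (label is 0 or 1).
    void update(double[] x, int label) {
        double error = predict(x) - label;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        // Toy data: the first feature is a constant bias term;
        // the label is 1 when the second feature is positive.
        double[][] xs = {{1, -2}, {1, -1}, {1, 1}, {1, 2}};
        int[] ys = {0, 0, 1, 1};
        SgdLogisticSketch model = new SgdLogisticSketch(2, 0.1);
        for (int epoch = 0; epoch < 100; epoch++) {
            for (int i = 0; i < xs.length; i++) model.update(xs[i], ys[i]);
        }
        System.out.println(model.predict(new double[] {1, 2}));
        System.out.println(model.predict(new double[] {1, -2}));
    }
}
```

Because each update touches only one example, the data never has to fit in memory; it can be streamed from disk or over the network.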
A friend of mine has worked on a similar project. He used Perl for text mining and MATLAB for techniques such as Bayesian methods, latent semantic analysis and Gaussian mixtures...
See this list of large-scale machine learning resources (courses, papers etc): http://www.quora.com/Machine-Learning/What-are-some-introductory-resources-for-learning-about-large-scale-machine-learning

which is the best java gui testing tool? [closed]

Closed 10 years ago.
I'm looking for a Java GUI testing tool in which tests can be created by recording my GUI actions (buttons pressed, windows closed, etc.).
A scripting mechanism for writing tests is not required.
It could be free or commercial, but cheap and great is better than expensive and great.
My application is a rich-client app written in Java SE 6.
Yoav
If it's a Swing app you could take a look at Marathon.
I concur with Kettelerij, Marathon's the way to go.
It's easy to integrate with external systems like Subversion & CruiseControl, because all the scripts are human-readable (Jython) and not locked into some proprietary format that requires an export (like most of the commercial tools).
It is able to record scripts in your choice of Jython or JRuby, which are essentially Python and Ruby implementations that give you access to the Java API. Very easy to understand.
Advanced testers can identify the GUI component they want to select not just by its name, but by a unique subset of its properties, for example:
click('{Text: OK Enabled: true}')
... finds a component whose getText() is "OK" and isEnabled() is "true". This makes the scripts highly dynamic and easier to maintain.
I used Jemmy some years ago. Now I'm mostly doing webapps, so my experience in this field may be somewhat old. :-)
"A scripting mechanism for writing tests is not required."
Yes, it is. Pure capture/replay simply does not work in practice, you always have to edit the resulting scripts. And you often end up spending so much time doing that in an inadequate environment that you save no time over a pure scripting solution tailored for efficient script writing.
I have been impressed with Quick Test Pro. It is pay software from HP, but it has been able to get at some software that most tools can't work with. It has some data features so that tests can be run multiple times with varying data inputs. It is scriptable through VB so most Tester/Developer people will be able to work with it. I have been using it lately to execute tests on many machines for use in performance testing.
Try QEngine; it does record and playback, and it has scripting options too.
Jameleon is very useful for testing web-based applications. It combines a number of frameworks in a single launch framework, which gives great flexibility in your approach.
There is no capture for Jameleon; I think you may be confusing this with Selenium's capture and record. Jameleon is a pure scripting framework.
You also have IBM's Rational Functional Tester:
http://www-01.ibm.com/software/awdtools/tester/functional/
I used an older version to test .NET forms applications (it also works with Java apps, Windows native apps, and web pages). It failed a lot of times, and the integration with .NET was not so great. I don't really recommend it for that purpose.
However, it is known to work better with Java apps (RFT itself is written in Java, and Java apps were the original target, I think), especially in its most recent versions.
It's a very expensive application though. Personally I wouldn't use it again, unless I didn't have another choice.

Are there any decent free Java data plotting libraries out there? [closed]

Closed 10 years ago.
On a recent Java project, we needed a free Java based real-time data plotting utility. After much searching, we found this tool called the Scientific Graphics Toolkit or SGT from NOAA. It seemed pretty robust, but we found out that it wasn't terribly configurable. Or at least not configurable enough to meet our needs. We ended up digging very deeply into the Java code and reverse engineering the code and changing it all around to make the plot tool look and act the way we wanted it to look and act. Of course, this killed any chance for future upgrades from NOAA.
So what free or cheap Java based data plotting tools or libraries do you use?
Followup: Thanks for the JFreeChart suggestions. I checked out their website and it looks like a very nice data charting and plotting utility. I should have made it clear in my original question that I was looking specifically to plot real-time data. I corrected my question above to make that point clear. It appears that JFreeChart support for live data is marginal at best, though. Any other suggestions out there?
I've had success using JFreeChart on multiple projects. It is very configurable. JFreeChart is open source, but they charge for the developer guide. If you're doing something simple, the sample code is probably good enough. Otherwise, $50 for the developer guide is a pretty good bargain.
With respect to "real-time" data, I've also used JFreeChart for these sorts of applications. Unfortunately, I had to create some custom data models with appropriate synchronization mechanisms to avoid race conditions. However, it wasn't terribly difficult and JFreeChart would still be my first choice. However, as the FAQ suggests, JFreeChart might not give you the best performance if that is a big concern.
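To give an idea of what such a synchronized data model can look like, here is a small plain-Java sketch of a fixed-size sliding window that one thread appends to while another takes snapshots for drawing. It does not use or extend any JFreeChart class, and it is not the model from my projects; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Thread-safe sliding window of (x, y) points; a hypothetical helper, not a JFreeChart class.
final class SlidingWindowSeries {
    private final int capacity;
    private final Deque<double[]> points = new ArrayDeque<>();

    SlidingWindowSeries(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    // Called from the data-acquisition thread; drops the oldest point once the window is full.
    synchronized void add(double x, double y) {
        if (points.size() == capacity) points.removeFirst();
        points.addLast(new double[] {x, y});
    }

    // Called from the rendering thread. Returns a copy of the window so the caller
    // can iterate without holding the lock; the stored points are never mutated.
    synchronized double[][] snapshot() {
        return points.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        SlidingWindowSeries series = new SlidingWindowSeries(3);
        for (int i = 0; i < 5; i++) series.add(i, i * 10.0);
        for (double[] p : series.snapshot()) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

The renderer works on the snapshot rather than the live collection, so a point arriving mid-repaint cannot corrupt the iteration; the cost is one array copy per frame.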
I just ran into a similar issue (displaying fast-updating data for engineering purposes), and I'm using JChart2D. It's pretty minimalist and has a few quirks but it seems fairly fast: I'm running a benchmark speed test where it's adding 2331 points per second (333x7 traces) to a strip chart and uses 1% of the CPU on my 3GHz Pentium 4.
Live Graph supports real-time rendering.
I'm using GRAL for real-time plotting. It's an LGPL Java library. Although it's not as powerful as JFreeChart it has a nicer API. I got a plot up and running in very short time. They also ship a real-time plotting example.
I found this question when I was googling for open source plotting libraries for Java. I wasn't quite happy with the answers posted here, so I did some further research on the issue.
Although this question was posted back in 2008, this might still be interesting to someone.
Here is a list of Open Source Charting & Reporting Tools in Java
http://autoplot.org/ allows for real-time updates and can be used to create many types of scientific plots.
To update the plot, specify the URL to a data file and then append &filePollUpdates=1&tail=100. See the example at http://autoplot.org/cookbook#Loading_Data
Waterloo Scientific Graphics is a new LGPL project. Data objects are observable and could be updated in a real time plotting scenario.
For details see http://waterloo.sourceforge.net/
The library I wrote, Plot4j, also supports real-time plotting.
I used JFreeChart (http://www.jfree.org/jfreechart/) on a previous project. It has some very good built-in capabilities, and the design was WAY extensible so you could always roll your own extension later if you needed some custom chart annotation or wanted an axis to render differently, or whatever. It's definitely worth checking out.
Check ILOG's JViews - they have a lot of stuff and something might fit your needs. All of them are extremely configurable and quite fast. Not free though.
I've used JFreeChart in a rather complex application that needed to visualize data streams and calculations based on the data. We implemented the ability to visually edit the data plots by mouse and had a very large set of data points. JFreeChart handled it very well.
Unfortunately I was stuck with v0.7, but the newest releases are much better when it comes to API clarity. The community is very helpful and the developers respond to mails too.
If you're doing a web application and don't want to bother with libraries, you can check the Google Chart API. Didn't use it myself, but I started some tests which were very promising.
For real-time plotting you can use QN Plot, JOpenChart or its fork Openchart2.
JHandles is an alternative graphics package for Octave (a math package). It is probably worth looking into, but being Octave specific may not have what you need.
-Adam
PtPlot may be a good choice. Formerly called Ptolemy.
jcckit can handle real-time plotting. It's a bear to use though.
I forked it, and made a very simple wrapper around it for non-realtime plotting. The underlying complicated interface can be used directly too.
https://bitbucket.org/hughperkins/easyjcckit
You might want to check out JMathPlot
