Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Which other frameworks exist besides Mahout for implementing machine learning algorithms in Java, such that the underlying framework takes the Java code and runs it on Hadoop?
I am looking for alternatives to Mahout because I need an SVM and an agglomerative clustering implementation on Hadoop, and only SVM is supported in Mahout.
I recommend the library linked below. It is an Apache Hadoop-based machine learning / data mining library, like Apache Mahout.
http://www.openankus.org/pages/viewpage.action?pageId=2195722
It makes MapReduce job processing simple and easy. If you are interested, see the wiki for more (http://www.openankus.org).
Well, if SVM is on Hadoop, the rest is easy to implement!
Note that the naive agglomerative clustering algorithm is not efficient for large data (at least O(n^2) complexity, since every pair of points must be compared). Such complexity makes it impractical to run the algorithm on a large dataset, even on a big cluster, unless you try one of its extensions like this one: ftp://193.167.42.127/franti/papers/GraphPnn-TPAMI.pdf
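To make the quadratic cost concrete, here is a minimal sketch of naive agglomerative clustering in plain Python (not Hadoop code). The function name `naive_single_linkage` and the choice of single linkage are assumptions made for illustration; the linked paper describes a different, faster method.

```python
from itertools import combinations
from math import dist


def naive_single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    # Precompute all pairwise distances: n*(n-1)/2 entries, so O(n^2) memory.
    d = {(i, j): dist(points[i], points[j])
         for i, j in combinations(range(len(points)), 2)}

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(d[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > k:
        # Rescan every pair of clusters on each merge; this repeated scan
        # is what makes the naive version impractical for large n.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    return [sorted(c) for c in clusters]
```

Even before the merge loop starts, the distance table alone holds about n^2/2 entries, which is why the naive approach does not scale just by adding machines.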
Pattern. It has a Java API and you can use R too.
http://www.cascading.org/pattern/
A quick Google search turned up the following:
http://java-ml.sourceforge.net/ - After close to 3 years, there was a release. Not sure how well it is supported and what algorithms are implemented.
http://sourceforge.net/projects/weka/ - Some recent recommendations by others look good.
Also, see this thread.
I haven't tried either of them.
Closed 8 years ago.
I have a few text files in CSV format. Some of them are over 500 MB but less than 1 GB. I need to load each of them into a SQL Server 2008 R2 database as a table.
I considered using Python. Is Python a good option (performance-wise) for this kind of task? Is there a Python package I should use? I am more of a Java man. How does it compare to Java?
Does anyone have experience with this? Thanks!
Cheers, Alex
In general, no scripting language performs as well as native utilities for loading bulk data.
Unless your CSV is malformed and requires pre-scrubbing and transformation, there is no need to limit your choices to programming languages. Use a tool instead. SSIS, BCP, DTS all come to mind for CSV.
If you need customized load logic or a client-based load, then by all means Python, Perl, Java, or C# can do it. But none of them will load as fast as a tool already built for the job (and speed seems to be what you are concerned with).
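If you do go the client-side route, the usual approach is to stream the file and insert in batches rather than row by row. Below is a minimal sketch under stated assumptions: `load_csv` is a hypothetical helper, the first CSV row is assumed to hold column names, and an in-memory SQLite database stands in for SQL Server so the example runs anywhere. With SQL Server you would swap in a connection from a driver of your choice.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_csv(conn, table, csv_file, batch_size=10_000):
    """Stream rows from csv_file into table in fixed-size batches."""
    reader = csv.reader(csv_file)
    header = next(reader)  # assumption: first row holds the column names
    # Table and column names are interpolated, so they must come from a
    # trusted source; only the values go through parameter placeholders.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    sql = f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})'

    total = 0
    while True:
        # Pull at most batch_size rows so the whole file is never in memory.
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total


# Usage with an in-memory SQLite database standing in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE people ("id" INTEGER, "name" TEXT)')
sample = io.StringIO("id,name\n1,Ann\n2,Bob\n3,Cy\n")
loaded = load_csv(conn, "people", sample, batch_size=2)
```

Batching keeps memory flat for a 500 MB file, but as said above, expect a native bulk-load tool to beat this on raw speed.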
Closed 8 years ago.
Is there a way to find an online Hadoop database and practice on it using Java?
I found that you can practice on www.gethue.com, but I don't think you can do it using Java.
Thank you
You can try Cloudera Live.
It's in beta, but seems to work pretty well.
I made a small list of free offers that let you manage your own Hadoop cluster. It's not technically an available database, but you can fill these clusters with the data you want.
Here is the list:
Microsoft Azure HDInsight: they offer you 150€ to spend on their products. You can rent a Hadoop cluster and work on it.
Qubole: they give you preconfigured Hadoop clusters, with 75 computing hours for free.
Joyent: you can have one VM for free for a year.
You may also try Amazon's Elastic MapReduce, although I'm not sure this specific offer is included in their free trial. An advantage of using it is that you can access free datasets more easily (for instance, this one).
Please also note that all these services (except Qubole) require a credit card for registration.
Closed 8 years ago.
I am new to NLP and have been researching which language toolkit I should use. I would like to do one of the following two things, either of which would accomplish the same goal:
I would like to classify a text, usually one sentence of about 15 words, according to whether it is talking about a specific subject.
Is there a tool that, given a sentence, finds out the subject of that sentence?
I am using PHP and Java, but the tool can be anything that runs on the Linux command line.
Thank you very much.
The most basic way of doing this is to create a set of labeled training data and use it to train a classifier. How the classifier works is a more complicated issue. For spam filtering and many other things, just looking at the word frequency works pretty well.
Here is a basic example: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html
It is trivial to write a Naive Bayes classifier; a package like MALLET will also have this plus better machine learning methods. Lingpipe will also have this sort of stuff.
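To show what "trivial" means here, below is a minimal sketch of a word-frequency Naive Bayes classifier in plain Python. It is not how MALLET or LingPipe implement it; the class name `NaiveBayes`, the whitespace tokenization, and the add-one smoothing are all assumptions chosen for brevity, and the training sentences are made-up toy data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over lowercase whitespace-separated words."""

    def fit(self, sentences, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of sentences
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        words = sentence.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihood of each word, add-one smoothed
            # so that unseen words do not zero out the whole score.
            score = math.log(n / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration.
model = NaiveBayes().fit(
    ["the match ended with a late goal",
     "the team won the cup final",
     "the bank raised interest rates",
     "stocks fell as rates rose"],
    ["sports", "sports", "finance", "finance"],
)
```

Calling `model.predict("interest rates rose")` picks the label whose training sentences make those words most likely. With real data you would want far more than four examples per label.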
What you really should care about is the quality of data and what your features are. By quality of data I mean lots of data without that many borderline cases, and by features I mean are you choosing just words, or combinations of words (word ngrams), or dependency features, or something more complex. You need a way to create the feature data as well as actually do the learning! In this sense Lingpipe is good as you can do tokenization and all that first as opposed to writing your own functions to do this or having to cobble other tools together into your own feature generation code.
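As a small illustration of the "words or combinations of words" choice, here is a sketch of a feature extractor that emits word n-grams. The function name `word_ngrams` and the plain whitespace tokenization are assumptions; a toolkit tokenizer would handle punctuation and casing more carefully.

```python
def word_ngrams(sentence, n_max=2):
    """Return all word n-grams of length 1..n_max, shortest first."""
    tokens = sentence.lower().split()
    features = []
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features
```

For "Not very good" this yields the single words plus "not very" and "very good", so a classifier can see that "good" was negated, which single-word features would miss.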
A guide to MALLET can be found here: http://courses.washington.edu/ling570/fei_fall10/11_15_Mallet.pdf
NLTK may solve your problem.
I found the web service API below handy and ready to use off the shelf:
http://text-processing.com/demo/
Closed 9 years ago.
I've never done anything in Java before but I'd like to use Lucene for the search on a site.
I'm having trouble finding a good step-by-step tutorial for a complete beginner.
Can anyone recommend a good tutorial?
Thanks
Along with user428747's answer, you can also read this article.
As well as this one (which is kind of old compared to the first one).
On a side note, if you want to use Lucene, did you consider using Solr?
It uses the Lucene search library and extends it, as you can read here.
The classics: Lucene in Action
This website might help you a bit:
http://www.lucenetutorial.com/lucene-in-5-minutes.html
This is not a direct reply to your question on Lucene tutorials (for that, my answer is the same as some of the other posters': Bob Carpenter's Lucene in 60 seconds tutorial on the Lingpipe blog).
If you don't want to learn Java just for Lucene, any database with full-text search (Postgres, MySQL, etc.) should serve your purpose. In particular, Sphinx is recommended.
This decision is particularly relevant if you need your search app to have high performance and scalability, since with Lucene you will be learning two things at once: Java and Lucene. Unless you have an in-house Java expert, it is better to fight one war than two at the same time.
Maybe Apache Solr is a better fit for you: http://lucene.apache.org/solr/
If you're using Zend, why aren't you using Zend's PHP port of Lucene? See here for a tutorial on it.
Closed 8 years ago.
What is the best open source Java workflow framework (e.g. OSWorkflow, jBPM, XFlow etc.)?
Here's an article that compares jBPM, OpenWFE, and Enhydra Shark that looks like it has some good, thorough info.
It depends what kind of initial investment you want to make. jBPM is the best in terms of features and flexibility, but OSWorkflow is more lightweight, easier to get up and running, and has a gentler learning curve.
Drools Flow is the best workflow solution that I came across recently. It has the luxury of improving on other solutions, since it was designed and built recently, based on lessons learned from older, somewhat over-engineered frameworks.
Drools Flow comes as a community project along with the official Drools 5 release, which besides Flow includes Guvnor, Expert and Fusion.
Unfortunately Drools Flow does not have an official Red Hat support contract yet, and that stops some big corporations from considering it. One might think the support is missing for political reasons, since the jBPM project lives under the same support roof.
I'll cast a vote for jBPM. We used it on a larg-ish ETL platform in-house and it seemed to work quite well. I don't have anything to compare it to, however.
YAWL - Yet Another Workflow Language
http://en.wikipedia.org/wiki/YAWL