I have heard about Lucene a lot, that it's one of the best search engine libraries in Java. Is there any similar (as powerful) library for Ruby?
Well, there's Ferret, which is a port of Lucene to Ruby. Also, Lucene is very easy to use from JRuby, if that's an option for you.
Depending on your needs, you might also want to take a look at Solr, which is a higher-level front-end built on Lucene. There is a Ruby interface, solr-ruby, that interacts with Solr via HTTP.
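Because Solr is addressed over plain HTTP, a client in any language comes down to building a query URL and parsing the response. Here is a minimal sketch in Python, not the solr-ruby API. It assumes a Solr instance at http://localhost:8983/solr with a core named "articles" (both hypothetical) and uses the standard /select handler with the q, rows and wt=json parameters:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL and core name; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "articles"


def build_select_url(query, rows=10, base=SOLR_BASE, core=CORE):
    """Return the URL for a query against Solr's /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


def search(query, rows=10):
    """Run the query and return the list of matching documents."""
    with urlopen(build_select_url(query, rows)) as resp:
        payload = json.load(resp)
    return payload["response"]["docs"]
```

A Ruby client such as solr-ruby wraps the same kind of request behind a friendlier interface.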
Ferret is what you're looking for:
"Ferret is a high-performance, full-featured text search engine library written for Ruby. It is inspired by Apache Lucene Java project."
I would try one of them in combination with sphinx.
Thinking Sphinx
http://freelancing-god.github.com/ts/en/rails3.html
Riddle
http://riddle.freelancing-gods.com/
http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/files/README.html
CLucene is a cross-platform C++ port of Lucene. It can also be wrapped and used from any high-level language (there are a few legacy SWIG projects you could start with). See:
http://sourceforge.net/projects/clucene
http://clucene.git.sourceforge.net/git/gitweb.cgi?p=clucene/clucene;a=summary
Unfortunately, in most cases Ferret is not what you're looking for: it has recurring issues with re-indexing speed, index corruption, and segfaults on the server. I think most people are moving to Solr, Sphinx, and Xapian. I also recall seeing some Tsearch/PostgreSQL apps mentioned; Tsearch seems to be an industrial-strength solution.
Take a look here
Full Text Searching with Rails
I want to write a Java application (for university) that uses Latent Dirichlet Allocation (LDA). The only framework I found that offers LDA is Mahout.
I have quite a bit of experience in Java programming, even though I would not consider myself a Java pro (I am coming from PHP).
The application will not be used in a distributed computing context, so the Mahout/Hadoop route might be way over the top, but if I am right it should at least work.
My Problem:
The Mahout wiki and similar resources do not really help me; in fact, I do not understand a single word. I don't want to use Mahout in that "terminal way". I just want to load the classes into my application and do something like this:
documents = obj.load(Documents);
mahout.doLDA(documents);
(I know it will not be that easy, but i am sure you know what i mean).
thanks
Mahout's libraries can be used in local mode, without a full Hadoop cluster. You can look at the examples from the book "Mahout in Action" to see how this can be done.
I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM based methods are useful but is there a simple and documented library for using such algorithms?
I don't know much about SVM, but LingPipe might be really helpful for you. The link is a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search products Lucene (a search library), Solr (search server app), and Carrot2 (for 'clustering' search results). There should be some interesting work in that space for you.
Mallet is another awesome library to look into. It has good command-line tools to help you get started, and a Java API for when you start integrating it with the rest of your system.
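To make the idea of comparing new files against an already categorized set concrete, here is a minimal sketch in plain Python. It is not an SVM and does not use LingPipe or Mallet. It is a much simpler nearest-profile classifier that builds one bag-of-words profile per category and picks the category with the highest cosine similarity. All function names are illustrative.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_texts):
    """Sum word counts per category from (category, text) pairs."""
    profiles = defaultdict(Counter)
    for category, text in labeled_texts:
        profiles[category].update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(profiles, text):
    """Return the category whose word profile is most similar to the text."""
    words = Counter(tokenize(text))
    return max(profiles, key=lambda cat: cosine(words, profiles[cat]))
```

In practice you would read each file's contents into the (category, text) pairs. A baseline like this is mainly useful for checking whether the categories are separable at all before moving to one of the libraries above.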
Is it possible to use OLE Automation in Java? If not, why is it not possible in Java?
I'm looking to automate exporting Excel spreadsheets to different formats (e.g., .csv).
Thanks for the answers in advance :)
Recently (March 2013), an independent contributor added support for generic COM Automation to JNA, which is the last man standing in terms of native platform API integration from Java. JNA is still very actively maintained, unlike Jawin/JACOB/etc.
See here for an example of how it is used. The pre-cooked bindings to the Office APIs are very simple so far, but looking at the code, it seems very easy to use the COM Automation APIs (IDispatch, Variant, etc) to do late binding to almost any COM interface.
I would like to see, however, a more complete binding of the Office COM APIs, since they are by far the most often used COM API in the world. Maybe there could also be an "MSExcel2007.java", "MSExcel2010.java", etc. to cover the different API versions. So it's very much a work in progress, but JNA is now as generally useful for COM Automation as JACOB/Jawin, with the bonus that it's extremely actively maintained (as of April 2013).
You can use JACOB, but there will be some pain involved, as it's not documented very well and the performance is not the best. It can also be hard to get it running correctly in your environment, depending on which version of Windows you are targeting. I would definitely not use it if you are building a scalable web application. Another option is Apache POI, which has really come a long way from its early roots and is used in a lot of production-ready tools like JBoss Drools. If you decide to go with JACOB, I recommend you read this SO thread:
Is there a good reference for using OLE Automation (from Java)?
There is a library called JACOB that allows precisely what you're looking for. What do you mean by "from the Java API"? Do you mean from the official J2SE packages? I'm not sure how to answer that other than to say that J2SE doesn't include libraries for every conceivable need under the sun, especially those that only work on a single operating system. That's why third-party packages exist.
Commercial, but they seem to have a free Open-Source and Academic license...
JExcel
JExcel Developer Documents
I have no affiliation.
I hope you all have a nice day.
I want to write a web service that would check some web page HTML code every 20 minutes and e-mail it to my mail box. Here I was given a suggestion to use Google App Engine for this task. Having briefly read through that site I learned that two languages could be used there: Java and Python.
Which one do you think would fit best for my task and, therefore, I would have to start learning? (I don't know either language).
Both the languages and their App Engine implementations are pretty solid and mature. As a language, Python is faster to learn, but Java comes with richer tools such as Eclipse that may partly compensate. A lot depends on what other languages you have a background in -- for example, coming from C#, Java would be simpler than for somebody coming from, say, C. For such a simple task, the issues of power of the two languages and additional libraries &c don't really come into play.
I've tried both languages with GAE and here's my general feeling about the choice of language for it:
Python is generally simpler. If you're using the bare GAE API, the Python one is simpler to learn and simpler to write a web app in.
Java is more compatible. Python's API is generally GAE-specific, while Java API resembles some standard Java technologies (servlets, JDO, deployment etc.)
So, Java is a good choice if you either have experience with web development in Java or are going to use third-party libraries extensively. Otherwise, Python is better.
For your particular task, I'd suggest Python, mostly because of the existence of Beautiful Soup, an excellent HTML parser that handles poorly formed documents.
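As a rough illustration of how small the task is in Python, here is a minimal sketch using only the standard library of a modern Python 3. It is not App Engine specific: on GAE you would use the platform's own scheduling and mail services, and Beautiful Soup would replace the hand-written title parser. The SMTP host, the addresses and all function names are assumptions for the example.

```python
import smtplib
from email.message import EmailMessage
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def build_report(url, html, sender, recipient):
    """Build an email whose body is the page's HTML source."""
    parser = TitleParser()
    parser.feed(html)
    title = " ".join(parser.title.split())  # collapse whitespace and newlines
    msg = EmailMessage()
    msg["Subject"] = f"Page check: {title or url}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(html)
    return msg


def check_and_send(url, sender, recipient, smtp_host="localhost"):
    """Fetch the page and mail its source; smtp_host is an assumed relay."""
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_report(url, html, sender, recipient))
```

Running check_and_send every 20 minutes is then left to whatever scheduler the hosting platform provides.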
Hi I've been working with Django for a few months and find it really helpful. Is there a similar framework for other programming languages such as Java or C#?
The problem I have with Django is finding a server to host the project because supporting servers are more expensive and harder to find.
In Django I find the following items useful: the object-relational mapper, admin interface and url management.
Thanks!
If you're only looking for an alternative because of the hosting aspect of it, I suggest you simply find suitable hosting as opposed to throwing away the framework you like.
If you are looking for a good Django host, I HIGHLY recommend Webfaction.
If they're not your cup of tea, check out djangofriendly.com, which has a huge list of good Django hosts.
If you're looking for the cheapest hosting then PHP is probably your choice. The downside is that PHP is a horrible cobbled together language, and a lot of the PHP code out there is equally terrible (par for the course, I suppose).
Actually, since Django can run on FastCGI, it's theoretically possible to run it on any shared host. Here are some instructions for Site5: http://www.codekoala.com/blog/2008/installing-django-shared-hosting-site5/
Getting hosting for Django should be much easier and cheaper than for Java or ASP.NET.
Consider deploying on GAE, which is free for small sites.
http://code.google.com/intl/da/appengine/articles/app-engine-patch.html
If you would like to develop with the help of the vast number of modules in CPAN, then Catalyst - Web Framework is a good choice.
You can create a Dynamic Data site in Visual Studio 2010, which does the same thing as the Django admin site. It requires Entity Framework.