Semantic search system in Java

I want to implement a semantic search system in Java. Sesame will be embedded into my system to store and manipulate RDF data directly, and I want to use Tomcat, JSP and Servlets. But I also need to do natural language processing, which I know Python is really good at. So is there any way I can integrate Python code into my Java web application? Or are there any good tools for NLP in Java?
I think I'm a little confused, since I know little about the NLP area.
Thanks in advance!
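Edit: for reference, here is a rough, untested sketch of the embedded Sesame part I have in mind (in-memory store; the data file and SPARQL query are just placeholders):

    import java.io.File;
    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.rio.RDFFormat;
    import org.openrdf.sail.memory.MemoryStore;

    public class SesameSketch {
        public static void main(String[] args) throws Exception {
            // Embedded, in-memory Sesame repository (no external server needed)
            Repository repo = new SailRepository(new MemoryStore());
            repo.initialize();

            RepositoryConnection con = repo.getConnection();
            try {
                // Load some RDF/XML data (file name is a placeholder)
                con.add(new File("data.rdf"), "http://example.org/", RDFFormat.RDFXML);

                // Run a simple SPARQL query against the repository
                TupleQueryResult result = con.prepareTupleQuery(QueryLanguage.SPARQL,
                        "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10").evaluate();
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
                result.close();
            } finally {
                con.close();
            }
        }
    }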

Apache Lucene is the best: http://lucene.apache.org/core/ - also read about its extension, Solr.
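To give an idea of what that looks like in code, here is a minimal index-and-search sketch (untested, assuming the Lucene 3.x API; the field name and query are only examples):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            Directory index = new RAMDirectory();   // in-memory index, for illustration only

            // Index one document with a single "title" field
            IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(Version.LUCENE_36, analyzer));
            Document doc = new Document();
            doc.add(new Field("title", "Semantic search with Lucene", Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Search the index for the word "semantic"
            IndexSearcher searcher = new IndexSearcher(IndexReader.open(index));
            TopDocs hits = searcher.search(new QueryParser(Version.LUCENE_36, "title", analyzer).parse("semantic"), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("title"));
            }
            searcher.close();
        }
    }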

NLP can be done with Java too; here is the link (a small usage sketch follows below):
http://opennlp.apache.org/
If you would rather use Python for NLP, you can use NLTK.
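For example, sentence splitting and tokenization with the OpenNLP Java API look roughly like this (assuming the OpenNLP 1.5 API; en-sent.bin is the pre-trained English sentence model downloaded from the OpenNLP site):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.SimpleTokenizer;

    public class OpenNlpSketch {
        public static void main(String[] args) throws Exception {
            // Load the pre-trained English sentence detection model
            InputStream modelIn = new FileInputStream("en-sent.bin");
            SentenceModel model = new SentenceModel(modelIn);
            modelIn.close();

            SentenceDetectorME detector = new SentenceDetectorME(model);
            String[] sentences = detector.sentDetect(
                    "Semantic search needs NLP. OpenNLP can split and tokenize text.");

            for (String sentence : sentences) {
                // SimpleTokenizer needs no model; it splits on character classes
                String[] tokens = SimpleTokenizer.INSTANCE.tokenize(sentence);
                System.out.println(java.util.Arrays.toString(tokens));
            }
        }
    }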

Related

Machine Learning Library Specialized for documents

I am doing a project and I need to find a machine learning library written in Java that is specialized for document classification. Can anyone please give me some examples?
Here are two famous Java libraries
Stanford core NLP - http://nlp.stanford.edu/software/classifier.shtml
GATE - http://osdir.com/ml/ai.gate.general/2007-05/msg00003.html, https://gate.ac.uk/sale/tao/splitch19.html#chap:ml
Depends on the kind of ML you are looking for.
There is the linguistic part of the problem (parsing documents, extracting entities, etc.), which can significantly improve the results, and there is the ML-algorithms part.
For the latter, look at Apache Mahout, for example - it also ships with examples of document classification, and is especially worth a look if you plan to deal with a lot of data. The Stanford classifier is also a good choice to start with.
Both of the machine learning frameworks MALLET (http://mallet.cs.umass.edu/classification.php) and Weka (http://www.cs.waikato.ac.nz/ml/weka/) can do document classification. They are both easy to get started with compared to, say, Mahout or Spark.
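To illustrate the Weka route: if you put your documents in one folder per category, TextDirectoryLoader can read them directly. A minimal, untested sketch (the directory name and the choice of classifier are just examples):

    import java.io.File;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.TextDirectoryLoader;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class WekaTextSketch {
        public static void main(String[] args) throws Exception {
            // Expects a layout like corpus/<category-name>/<document>.txt
            TextDirectoryLoader loader = new TextDirectoryLoader();
            loader.setDirectory(new File("corpus"));
            Instances raw = loader.getDataSet();
            if (raw.classIndex() == -1) {
                raw.setClassIndex(raw.numAttributes() - 1);
            }

            // Turn the raw document strings into a bag-of-words representation
            StringToWordVector filter = new StringToWordVector();
            filter.setInputFormat(raw);
            Instances vectorized = Filter.useFilter(raw, filter);

            // 10-fold cross-validation with a multinomial naive Bayes classifier
            NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
            Evaluation eval = new Evaluation(vectorized);
            eval.crossValidateModel(nb, vectorized, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }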

HTML to Textile Java library

I need to convert a String from HTML to Textile.
I've been looking at Textile4J, Textile-J, JTextile and PLextile.
But so far, none of them provides the functionality I'm looking for.
They only provide the reverse conversion (Textile to HTML).
Worst case scenario, I can use another programming language, but I have not really looked into that.
For now, I don't believe the functionality I want is available in any Java Textile library.
I'll try to update this post if and when that changes.
Based on the libraries mentioned above, I have created my own (limited) functionality; a rough sketch of the approach is below.
There are also several solutions available in Python / Ruby.
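I can't post the real code, but a stripped-down sketch of this kind of limited conversion could look like the following. Note that it uses Jsoup for the HTML parsing (my own assumption for illustration, not one of the libraries above) and only handles a few block-level tags:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class HtmlToTextileSketch {
        public static String convert(String html) {
            Document doc = Jsoup.parse(html);
            StringBuilder out = new StringBuilder();
            for (Element el : doc.body().children()) {
                String tag = el.tagName();
                if (tag.matches("h[1-6]")) {
                    out.append(tag).append(". ").append(el.text()).append("\n\n");   // h1. Heading
                } else if (tag.equals("blockquote")) {
                    out.append("bq. ").append(el.text()).append("\n\n");             // bq. Quote
                } else {
                    out.append(el.text()).append("\n\n");                            // everything else as plain text
                }
            }
            return out.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(convert("<h1>Title</h1><p>Hello <b>world</b></p>"));
        }
    }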

Library for text classification in Java

I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM-based methods are useful, but is there a simple, well-documented library for using such algorithms?
I don't know much about SVM, but LingPipe might be really helpful for you. The link is a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search products Lucene (a search library), Solr (search server app), and Carrot2 (for 'clustering' search results). There should be some interesting work in that space for you.
Mallet is another awesome library to look into. It has good commandline tools to help you get started and a Java API once you start getting into integrating it with the rest of your system.
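To give a feel for Mallet's Java API, here is a rough, untested sketch of importing labeled text and training a classifier (the file name and the one-line-per-document format are made up for illustration):

    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.regex.Pattern;
    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.MaxEntTrainer;
    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.InstanceList;

    public class MalletSketch {
        public static void main(String[] args) throws Exception {
            // Pipe that turns a raw text string into a feature vector
            ArrayList<Pipe> pipes = new ArrayList<Pipe>();
            pipes.add(new Target2Label());                                      // label string -> Label object
            pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipes.add(new TokenSequenceLowercase());
            pipes.add(new TokenSequence2FeatureSequence());
            pipes.add(new FeatureSequence2FeatureVector());

            InstanceList instances = new InstanceList(new SerialPipes(pipes));
            // Each line of train.tsv (hypothetical): <name> <label> <document text...>
            instances.addThruPipe(new CsvIterator(new FileReader("train.tsv"),
                    Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(.*)$"), 3, 2, 1));

            Classifier classifier = new MaxEntTrainer().train(instances);
            System.out.println("Accuracy on training data: " + classifier.getAccuracy(instances));
        }
    }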

I need to make an SVM in Weka to filter documents using Java

I am an absolute beginner: I have never built a classifier or anything else in Weka using Java, I have only used the interface before. Basically I am kind of lost. I've looked at the filter class for Weka and played around with it a little. My documents are text documents, and I need to separate them into 2 categories.
I'm not sure how to define the categories or how to load the documents into an IDE to be classified.
:-(
Any help/tutorials or pointers would be greatly appreciated.
I found this Java tutorial very helpful, although there are very few resources available online (that I have found):
http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Hope this helps.
Using Weka for the first time is a pain, but you will need to go through it.
Also, I tried out Weka myself but had to dump it due to JVM out-of-memory exceptions; I wrote my own small clustering algorithm in Ruby, and its performance was way better.
Anyway, here is how to use SVM in Weka:
You can follow this tutorial on how to use SVM in Weka: www.stat.nctu.edu.tw/~misg/WekaInC.ppt
You will also need your data in ARFF format (I recommend this; in my experience it helps, and the data looks more structured from Weka's perspective). You can do that with the XML2ARFF converter I wrote for myself, which you can modify to read your text files and convert them to ARFF.
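Once the data is in ARFF (a string attribute holding the document text plus a nominal class attribute), a minimal, untested sketch of training Weka's SVM (SMO) on two categories looks roughly like this (the file name and category names are placeholders):

    import weka.classifiers.functions.SMO;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class WekaSvmSketch {
        public static void main(String[] args) throws Exception {
            // train.arff (sketch of the expected format):
            //   @relation docs
            //   @attribute text string
            //   @attribute category {catA, catB}
            //   @data
            //   'first document text ...', catA
            //   'second document text ...', catB
            Instances data = new DataSource("train.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // FilteredClassifier applies the bag-of-words filter and trains the SVM in one step
            FilteredClassifier classifier = new FilteredClassifier();
            classifier.setFilter(new StringToWordVector());
            classifier.setClassifier(new SMO());   // SMO is Weka's SVM implementation
            classifier.buildClassifier(data);

            // Classify the first training instance, just to show how prediction works
            double predicted = classifier.classifyInstance(data.instance(0));
            System.out.println("Predicted category: " + data.classAttribute().value((int) predicted));
        }
    }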

Ruby alternative for Lucene

I have heard a lot about Lucene, and that it's one of the best search engine libraries for Java. Is there any similar (equally powerful) library for Ruby?
Well, there's Ferret, which is a port of Lucene to Ruby. Also, Lucene is very easy to use from JRuby, if that's an option for you.
Depending on your needs, you might also want to take a look at Solr, which is a higher-level front-end built on Lucene. There is a Ruby interface, solr-ruby, that interacts with Solr via HTTP.
Ferret is what you're looking for:
"Ferret is a high-performance, full-featured text search engine library written for Ruby. It is inspired by Apache Lucene Java project."
I would try one of these in combination with Sphinx:
Thinking Sphinx - http://freelancing-god.github.com/ts/en/rails3.html
Riddle - http://riddle.freelancing-gods.com/
Ultrasphinx - http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/files/README.html
CLucene is a cross-platform C++ port of Lucene. It can be wrapped and used from virtually any high-level language (there are also a few legacy SWIG wrapper projects you could start with). See:
http://sourceforge.net/projects/clucene
http://clucene.git.sourceforge.net/git/gitweb.cgi?p=clucene/clucene;a=summary
Unfortunately, in most cases, Ferret is not what you're looking for: it has recurring issues with re-indexing speed, index corruption and segfaults on the server. I think most people are moving to Solr, Sphinx and Xapian. I recall seeing some Tsearch / Postgres apps mentioned; Tsearch seems to be an industrial-strength solution.
Take a look here
Full Text Searching with Rails
