I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM-based methods would be useful, but is there a simple, well-documented library implementing such algorithms?
I don't know much about SVMs, but LingPipe might be really helpful for you. It has a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search projects Lucene (a search library), Solr (a search server application), and Carrot2 (for clustering search results). There should be some interesting work in that space for you.
MALLET is another excellent library to look into. It has good command-line tools to help you get started, and a Java API for when you start integrating it with the rest of your system.
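For a flavor of those command-line tools, a minimal session might look like the sketch below (directory and file names are made up; one input directory per class, as described in MALLET's importing and classification documentation):

    # Import a directory tree: each input directory becomes one class.
    bin/mallet import-dir --input data/spam data/ham --output docs.mallet

    # Train and save a classifier on the imported instances.
    bin/mallet train-classifier --input docs.mallet \
        --trainer NaiveBayes --output-classifier docs.classifier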
I am doing a project, and I need to find a machine learning library written in Java that is specialized for document classification. Can anyone please give me some examples?
Here are two famous Java libraries:
Stanford Classifier (from the Stanford NLP group) - http://nlp.stanford.edu/software/classifier.shtml
GATE - http://osdir.com/ml/ai.gate.general/2007-05/msg00003.html, https://gate.ac.uk/sale/tao/splitch19.html#chap:ml
It depends on the kind of ML you are looking for.
There is the linguistic part of the problem (parsing documents, extracting entities, etc.), which can significantly improve the results, and there is the ML-algorithms part.
For the latter, look at Apache Mahout, for example - it ships with document-classification examples, and it is especially worth considering if you plan to deal with a lot of data. The Stanford Classifier is also a good choice to start with.
Both of the machine learning frameworks MALLET (http://mallet.cs.umass.edu/classification.php) and Weka (http://www.cs.waikato.ac.nz/ml/weka/) can do document classification, and both are easy to get started with compared to, say, Mahout or Spark.
I want to implement a semantic search system in Java. Sesame will be embedded in my system to store and manipulate RDF data directly, and I want to use Tomcat, JSP, and servlets. But I also need to do natural language processing, which I know Python is really good at. So is there any way I can merge Python code into my Java web code? Or are there any good tools for doing NLP in Java?
I think I'm a little confused, since I know little about the NLP area.
Thanks in advance!
Apache Lucene is the best: http://lucene.apache.org/core/ - also read about its extension, Solr.
NLP can be done with Java too; here is the link:
http://opennlp.apache.org/
And if you want to use Python for NLP, you can use NLTK.
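To give a flavor of the Java side, here is a minimal OpenNLP sketch. It only tokenizes; SimpleTokenizer is rule-based, so unlike most other OpenNLP components it needs no trained model file:

    import opennlp.tools.tokenize.SimpleTokenizer;

    public class TokenizeDemo {
        public static void main(String[] args) {
            // Rule-based tokenizer: no model download required.
            SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
            String[] tokens = tokenizer.tokenize("Semantic search needs NLP preprocessing.");
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }

If you do want to keep the NLP in Python, the crudest workable bridge is to run your NLTK script as an external process (ProcessBuilder) and exchange results over stdout or a file.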
I am an absolute beginner. I have never made a classifier or anything else in Weka using Java, though I have used the GUI before. Basically, I am kind of lost: I've looked at Weka's filter classes and played around with them a little. My documents are text documents, and I need to separate them into two categories.
I'm not sure how to define the categories or how to load the documents into my IDE to be classified.
:-(
Any help/tutorials or pointers would be greatly appreciated.
I found this Java tutorial page very helpful, although there are very few resources available online (that I have found):
http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Hope this helps.
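To make the loading question concrete: if you put each category's files in its own subdirectory, Weka's TextDirectoryLoader derives the class labels from the directory names. A minimal sketch (the directory name is made up; error handling omitted; SMO is used because it is Weka's SVM implementation, but any classifier slots in the same way):

    import java.io.File;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.TextDirectoryLoader;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TwoClassDemo {
        public static void main(String[] args) throws Exception {
            // Each subdirectory of training/ (e.g. training/catA, training/catB)
            // becomes one class label.
            TextDirectoryLoader loader = new TextDirectoryLoader();
            loader.setDirectory(new File("training"));
            Instances raw = loader.getDataSet();

            // Turn the raw document strings into a bag-of-words representation.
            StringToWordVector filter = new StringToWordVector();
            filter.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, filter);

            // Train Weka's SVM and print the resulting model.
            SMO svm = new SMO();
            svm.buildClassifier(data);
            System.out.println(svm);
        }
    }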
Using Weka for the first time is a pain, but you will need to go through it.
Also, I tried out Weka but had to dump it due to JVM out-of-memory exceptions; I wrote my own small clustering algorithm in Ruby, and its performance was way better.
Anyway, here is how to use SVM in Weka.
You can follow this tutorial on how to use SVM in Weka: www.stat.nctu.edu.tw/~misg/WekaInC.ppt
Now, you will need data in ARFF format (and I recommend you use it; in my experience it helps, and the data looks more structured from Weka's perspective). You can do that with XML2ARFF-Converter, which I wrote for myself; you can modify it to read text files and convert them to ARFF.
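For reference, a minimal two-class text ARFF file looks roughly like this (the relation, attribute, and class names are placeholders, not anything Weka requires):

    @relation documents

    @attribute text string
    @attribute class {category1, category2}

    @data
    'first document contents here', category1
    'second document contents here', category2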
I am planning to build a simple document management system, preferably built around the Java platform. Are there any best practices around this? The requirements are:
Ability to upload documents
Ability to tag documents
Version the documents
Comment on documents
There are a couple of options that I am currently considering. The first would be a simple API on top of SVN or CVS, with a DB backend to track tags, uploaders, comments, etc.
Another option is to use the filesystem: version the documents as copies in a versions folder and work with filenames.
Or, if there is an open, non-GPL'ed document management system, we could customize it to our needs and package it in our application. Does anybody have any experience building something like this?
You may want to take a look at the Content Repository API for Java (JCR) and its several implementations (some of them free).
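To give a feel for the JCR API, here is a rough sketch using Apache Jackrabbit's TransientRepository (credentials, node names, and file content are made up; error handling omitted):

    import java.io.ByteArrayInputStream;
    import javax.jcr.Node;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;
    import org.apache.jackrabbit.core.TransientRepository;

    public class JcrDemo {
        public static void main(String[] args) throws Exception {
            TransientRepository repository = new TransientRepository();
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                // Store a document as a standard nt:file node.
                Node file = session.getRootNode().addNode("report.txt", "nt:file");
                Node content = file.addNode("jcr:content", "nt:resource");
                content.setProperty("jcr:data", session.getValueFactory()
                        .createBinary(new ByteArrayInputStream("hello".getBytes())));
                session.save();
            } finally {
                session.logout();
            }
        }
    }

Versioning is covered by the spec as well: add the mix:versionable mixin to a node and use the VersionManager's checkin/checkout, which maps nicely onto your requirements.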
Take a look at the many document-oriented database systems out there. I can't speak for MongoDB or any of the others, but my experience with CouchDB has been fantastic.
http://couchdb.apache.org/
The best part is that you communicate with it via a REST API.
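For example, creating a document is a single HTTP PUT. A sketch using Java 11's built-in HttpClient, assuming a local CouchDB with no authentication and an already-created database named docs:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CouchDemo {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // PUT /{db}/{docid} with a JSON body creates (or updates) a document.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:5984/docs/doc1"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(
                            "{\"title\":\"hello\",\"tags\":[\"demo\"]}"))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }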
The best way is to reuse the efforts of others. This particular wheel has been reinvented quite a few times.
Who will use this and for what purpose?
I'm looking for a library capable of drawing dendrograms of data in Java (not calculating them - I can do that myself). Do you have any clues? I've already searched Google but haven't found anything that isn't stand-alone, while I need to embed the chart generation inside my program.
Thanks!
Check out the JUNG graph library. It won't perform the actual clustering for you but is a really good library for visualising your results.
Take a look at Archaeopteryx. It has quite a few features, it's open source, and it is available as a pre-packaged jar file.
BTW, I use JUNG and really like it. It can perform various clusterings, but AFAIK it has no inherent dendrogram capabilities. Because it has graphing capabilities, you could roll your own dendrogram, but it would take some work.
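To sketch what rolling your own could start from in JUNG 2 (the tree below is a made-up merge hierarchy; a true dendrogram would additionally need right-angle edge routing and a merge-height axis):

    import javax.swing.JFrame;
    import edu.uci.ics.jung.algorithms.layout.TreeLayout;
    import edu.uci.ics.jung.graph.DelegateTree;
    import edu.uci.ics.jung.visualization.VisualizationViewer;

    public class TreeDemo {
        public static void main(String[] args) {
            // Encode the clustering result as a rooted tree.
            DelegateTree<String, Integer> tree = new DelegateTree<String, Integer>();
            tree.setRoot("root");
            tree.addChild(1, "root", "cluster A");
            tree.addChild(2, "root", "cluster B");
            tree.addChild(3, "cluster A", "doc 1");
            tree.addChild(4, "cluster A", "doc 2");
            tree.addChild(5, "cluster B", "doc 3");

            // TreeLayout draws the hierarchy top-down in a Swing component.
            VisualizationViewer<String, Integer> viewer =
                    new VisualizationViewer<String, Integer>(
                            new TreeLayout<String, Integer>(tree));
            JFrame frame = new JFrame("Dendrogram sketch");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.getContentPane().add(viewer);
            frame.pack();
            frame.setVisible(true);
        }
    }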