Create a dataset: extract features from text documents (TF-IDF) - java

I have to create a dataset from some text files, representing each file as a vector of features.
Something like this:
doc1: 1,0.45 6,0.001 94,0.1 ...
doc2: 3,0.5 98,0.2 ...
...
Each position of the vector represents a word, and the score is given by something like TF-IDF.
Do you know of a library or tool for this? (Java preferred.)

After a few days I found the "perfect tool" for this: Word Vector Tool.
http://sourceforge.net/projects/wvtool/

MALLET. It includes TF-IDF, POS tagging, and classification.

Sure, there are many, e.g. Lucene: http://en.wikipedia.org/wiki/Lucene
However, I recommend that you write a basic IR system from scratch. Looking under the hood is always a great learning experience.
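For anyone taking the from-scratch route, here is a minimal sketch of the vectorization step in Java. It is only one of several common TF-IDF variants: it assumes tf = count / document length and idf = ln(N / document frequency), and it tokenizes on ASCII letters only. The names TfIdfSketch, tokenize, vectorize and format are made up for this illustration and do not come from any library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TfIdfSketch {

    /** Lowercases the text and splits it on runs of non-letter characters (ASCII only). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /**
     * Returns one sparse vector per document, mapping word index -> tf * idf.
     * Word indices are assigned in order of first appearance in the corpus.
     */
    static List<Map<Integer, Double>> vectorize(List<String> docs) {
        Map<String, Integer> vocab = new HashMap<>();     // word -> index
        Map<Integer, Integer> docFreq = new HashMap<>();  // index -> number of docs containing the word
        List<Map<Integer, Integer>> counts = new ArrayList<>();
        List<Integer> lengths = new ArrayList<>();

        // First pass: count terms per document and documents per term.
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            Map<Integer, Integer> c = new TreeMap<>();
            for (String t : tokens) {
                Integer idx = vocab.get(t);
                if (idx == null) {
                    idx = vocab.size();
                    vocab.put(t, idx);
                }
                c.merge(idx, 1, Integer::sum);
            }
            for (Integer idx : c.keySet()) docFreq.merge(idx, 1, Integer::sum);
            counts.add(c);
            lengths.add(tokens.size());
        }

        // Second pass: turn the raw counts into tf * idf scores.
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        int n = docs.size();
        for (int d = 0; d < n; d++) {
            int len = lengths.get(d);
            Map<Integer, Double> v = new TreeMap<>();
            for (Map.Entry<Integer, Integer> e : counts.get(d).entrySet()) {
                int df = docFreq.get(e.getKey());
                double tf = e.getValue() / (double) len;
                double idf = Math.log(n / (double) df);
                v.put(e.getKey(), tf * idf);
            }
            vectors.add(v);
        }
        return vectors;
    }

    /** Formats a vector in the "index,score index,score ..." style from the question. */
    static String format(Map<Integer, Double> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Double> e : vector.entrySet()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(e.getKey()).append(',').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Note that with this idf definition a word that occurs in every document scores 0, so you may want to drop such entries before writing the vectors out.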

Related

Match words like recruit, recruiter and recruitment in Java

I want to write code that matches certain words regardless of their form. A noun can become a verb by adding -ing, e.g. add = adding, recruit = recruiting. Likewise, recruit = recruitment = recruiter.
In simple words, all forms of a word should count as equal. Is there a Java library that I can use to achieve this?
I am somewhat familiar with Apache OpenNLP; could that help in any way?
Thanks!!
It sounds like you want a stemmer or lemmatizer. You might want to check out Stanford CoreNLP which includes a lemmatizer. You might also want to try the Porter Stemmer.
My guess is that these will cover some of the cases but not all of them. For example "recruitment" won't be lemmatized to "recruit." For that, you'd need a more complex morphological analyzer but I don't know of a good existing system.
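To make the idea of stemming concrete, here is a deliberately crude suffix-stripping sketch in Java. It is not the Porter algorithm and not what CoreNLP does; the suffix list, the minimum stem length and the names CrudeStemmer, stem and sameStem are all assumptions made up for this illustration.

```java
class CrudeStemmer {

    // Illustrative suffix list, tried in this order; a real stemmer has far more rules.
    private static final String[] SUFFIXES = {"ment", "ing", "er", "s"};

    // Never strip a suffix if fewer than this many characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    /** Lowercases the word and strips at most one known suffix. */
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Two words match if they reduce to the same stem. */
    static boolean sameStem(String a, String b) {
        return stem(a).equals(stem(b));
    }
}
```

With these rules, sameStem("recruit", "recruitment") and sameStem("add", "adding") both hold, but the approach breaks quickly on real text: "running" becomes "runn" and "water" becomes "wat". That is why an established stemmer or lemmatizer is the better choice in practice.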

Java implementation for LDPC codes

Is there an open source Java implementation of LDPC (Low Density Parity Check) codes? I have found only MATLAB code.
My scenario: I will take a text file, divide it into blocks, and delete some of the data. Then I need to use LDPC codes to recover the deleted data.
Thanks.
I haven't tried this but the code here should get you started
http://www.cs.utoronto.ca/~radford/ftp/LDPC-2006-02-08/install.html
http://www.cs.utoronto.ca/~radford/ftp/LDPC-2006-02-08/examples.html
It's in C though. Might be easy to port. Or not.
I'd suggest looking into ways of calling MATLAB functions from Java; I know there are a couple. Also, why LDPC? While it's one of the best FEC schemes, it involves lots of matrix manipulation, if I recall correctly. This is stuff much better suited to mat[rix]lab. The right tool for the right job...
There are also these two pure Java implementations:
https://github.com/a4a881d4/ldpc-java
https://github.com/pierroweb/LDPC-correcting-codes
I haven't tested them and would appreciate feedback from anyone else that has.
There's also a Java wrapper around a C++ library: http://cpham.perso.univ-pau.fr/MULTICAST/Java_wrapper_for_LDPC.html
Not the most promising results, but something to start from, at the very least.

Displaying sample text from the Lucene Search Results

Currently, I am using Lucene version 3.0.2 to create a search application that is similar to a dictionary. One of the things I want to display is a sort of "example": Lucene would look for a word in a book and then display the sentences in which the word is used.
I've been reading the Lucene in Action book and it mentions something like this, but looking through it I can't find other mentions. Is this something you can do with Lucene? If it is, how can you do it?
I believe what you are looking for is a Highlighter.
One possibility is to use the org.apache.lucene.search.highlight package, specifically the Highlighter.
Another option is to use the org.apache.lucene.search.vectorhighlight package, specifically the FastVectorHighlighter.
Both classes search a text document, choose relevant snippets and display them with the matching terms highlighted. I have only used the first one, which worked fine for my use-case. If you can pre-divide the book into shorter parts, it would make highlighting faster.

What are good methods to perform spreadsheet-like calculations in a programming language?

What's the best way to do spreadsheet-like calculations in a programming language? Example: a multi-user application available over the web needs to crunch columns and cells of numbers like a spreadsheet, based on user submissions. What are the best data structures, database models, or patterns for this type of work, so that handling the different columns is efficient and easy in PHP, Java, or even .NET? Is it better to use data structures within the language, or a database? If a database is the way to go, how does one go about it?
To do the actual calculation, look at graph theory. Basically you want to represent each cell as a node in a graph and each dependency as a directed edge. Next, do a topological sort to calculate the value of each cell in the right order.
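A minimal Java sketch of that approach, using Kahn's algorithm for the topological sort. The Cell class, the lambda-based formulas and the evaluate method are hypothetical names chosen for this illustration; a real spreadsheet would parse formulas from text instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class SheetSketch {

    /** A cell is a graph node: it lists the cells it depends on and how to compute its value. */
    static final class Cell {
        final List<String> deps;
        final Function<Map<String, Double>, Double> formula;

        Cell(List<String> deps, Function<Map<String, Double>, Double> formula) {
            this.deps = deps;
            this.formula = formula;
        }
    }

    /** Evaluates every cell in dependency order; throws on circular references. */
    static Map<String, Double> evaluate(Map<String, Cell> cells) {
        Map<String, Integer> pending = new HashMap<>();          // unresolved dependencies per cell
        Map<String, List<String>> dependents = new HashMap<>();  // directed edges: dependency -> dependents

        for (Map.Entry<String, Cell> e : cells.entrySet()) {
            pending.put(e.getKey(), e.getValue().deps.size());
            for (String dep : e.getValue().deps) {
                if (!cells.containsKey(dep)) {
                    throw new IllegalArgumentException("Unknown cell: " + dep);
                }
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Start with the cells that depend on nothing (plain values).
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }

        Map<String, Double> values = new HashMap<>();
        while (!ready.isEmpty()) {
            String name = ready.remove();
            values.put(name, cells.get(name).formula.apply(values));
            // A dependent becomes ready once its last dependency has a value.
            for (String d : dependents.getOrDefault(name, Collections.emptyList())) {
                if (pending.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
        }

        if (values.size() != cells.size()) {
            throw new IllegalStateException("Circular reference detected");
        }
        return values;
    }
}
```

For example, with A1 = 2, A2 = 3, B1 = A1 + A2 and C1 = B1 * 10, the sort guarantees that B1 is computed before C1 no matter what order the cells were added in. In a live sheet you would typically recompute only the cells reachable from the one that changed rather than the whole graph.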
Aspose.Cells (formerly Aspose.Excel.Web) is a good way to get the functionality you are looking for.
Unless you are asking more for a "How is it done?" than "I need to do it." Then I would look at the other answers given.
Along the lines of "I need to do it":
Microsoft has Excel Services, which does just what you want: spreadsheet operations on the server. It is available via a web services interface, so you can connect and drive calculations from Java, PHP, .NET, whatever.
Excel Services is part of SharePoint 2007.
Resolver One is a spreadsheet app made in IronPython.
There is an explanation [pythonology.org] of the overall mechanism it uses to calculate user-generated equations.
The relevant image showing Resolver One's overall algorithm.
I should note that users can write Python code to be interpreted both in the cells and in a special 'outside of sheet' place.
Look at another question here in SO, from where I reused my answer.
I can't tell you how to do it, but I would recommend that you look at the code of PHPExcel. PHPExcel is a library that allows you to create Excel files within PHP.
Simplified, the workflow of PHPExcel looks like this:
1. Create an empty Excel file object
2. Add cells (with either data or formulas) to the "Excel file"
3. Call the create function, which generates the file itself
In your case you would have to replace 3. with something like "Create web interface".
Therefore I would recommend that you look at the code of this open source project and see how it is structured in general. This should help you solve your problem.
I once used a binary tree to store the output of parsing a string using BODMAS. Each node was an operation between two other nodes, which could be a number, a variable or another operation.
So y = x * x + 2
became:
      +
     / \
    *   2
   / \
  x   x
Sadly this was at school in Pascal and is stored on a 5 1/4" disk, so you don't want it :)
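In Java, the same kind of tree could be sketched roughly as follows. This only covers building and evaluating the tree by hand; parsing the input string with the right operator precedence is left out. The names ExprTreeSketch, Node, num, var and op are made up for this illustration.

```java
import java.util.Map;

class ExprTreeSketch {

    /** Every node can evaluate itself given the current variable bindings. */
    interface Node {
        double eval(Map<String, Double> vars);
    }

    /** Leaf node holding a constant. */
    static Node num(double value) {
        return vars -> value;
    }

    /** Leaf node that looks up a variable by name. */
    static Node var(String name) {
        return vars -> {
            Double v = vars.get(name);
            if (v == null) throw new IllegalArgumentException("Unbound variable: " + name);
            return v;
        };
    }

    /** Inner node: an operation between two other nodes. */
    static Node op(char symbol, Node left, Node right) {
        return vars -> {
            double l = left.eval(vars);
            double r = right.eval(vars);
            switch (symbol) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return l / r;
                default: throw new IllegalArgumentException("Unknown operator: " + symbol);
            }
        };
    }
}
```

The example y = x * x + 2 is then op('+', op('*', var("x"), var("x")), num(2)), and calling eval on it with x bound to 3 gives 11.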
SpreadsheetGear for .NET will let you load Excel workbooks, plug in values, calculate and then get the results.
You can see a few simple ASP.NET calculation samples here, other ASP.NET samples here and download a free trial here.
Disclaimer: I own SpreadsheetGear LLC
I must point out that Google Spreadsheets already does this kind of stuff.

Generate Images for formulas in Java

I'd like to generate an image file showing some mathematical expression, taking a String like "(x+a)^n=∑_(k=0)^n" as input and getting a more (human) readable image file as output. AFAIK stuff like that is used in Wikipedia, for example. Are there any Java libraries that do that?
Or maybe I'm using the wrong approach. What would you do if the requirement was to enable pasting of formulas from MS Word into an HTML document? I'd ask the user to just make a screenshot himself, but that would be the lazy way^^
Edit: Thanks for the answers so far, but I really do not control the input. What I get is some messy Word-style formula, not a clean LaTeX-formatted one.
Edit2: http://www.panschk.de/text.tex
Looks a bit like LaTeX, doesn't it? That's what I get when I do
clipboard.getContents(RTFTransfer.getInstance()) after having pasted a formula from Word 2007.
First and foremost you should familiarize yourself with TeX (and LaTeX) - a famous typesetting system created by Donald Knuth. Typesetting mathematical formulae is an advanced topic with many opinions and much attention to detail - therefore use something that builds upon TeX. That way you are sure to get it right ;-)
Edit: Take a look at texvc
It can output to PNG, HTML, MathML. Check out the README
Edit #2 Convert that messy Word-stuff to TeX or MathML?
My colleague found a surprisingly simple solution for this very specific problem: when you copy formulas from Word 2007, they are also stored as "HTML" in the clipboard. As representing formulas in HTML isn't easy either, Word just creates a temporary image file on the fly and embeds it into the HTML code. You can then simply take the temporary formula image and copy it somewhere else. Problem solved ;)
What you're looking for is LaTeX.
MiKTeX is a nice little application for churning out images using LaTeX.
I'd like to look into creating them on the fly, though...
Steer clear of LaTeX. Seriously.
Check out JEuclid. It can convert MathML expressions into images.
