how to implement idf in java? [duplicate] - java

This question already has answers here:
Possible Duplicate: Any tutorial or code for Tf Idf in java
Closed 12 years ago.
IDF is inverse document frequency.
IDF = log(number of documents / number of documents containing the term)
How do I do it in Java?
Any advice?

How about:
static double idf(int docsContainingTerm, int totalNumDocuments)
{ return Math.log((double) totalNumDocuments / (double) docsContainingTerm); }
(This is basically a humorous way of saying: tell us more about your circumstances, and maybe we can help. What is a document? What is its representation?)
Just Use Lucene
Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
It provides IDF here.
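If you just want the number, here is a minimal hedged sketch that computes an IDF from index statistics yourself (it assumes a recent Lucene version, an existing index in a hypothetical "index" directory, and a hypothetical field named "contents"; adjust the names to your setup):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class LuceneIdf {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("index")))) {
            // Number of documents containing the term, and total document count
            int df = reader.docFreq(new Term("contents", "lucene"));
            int n = reader.numDocs();
            double idf = Math.log((double) n / (1 + df)); // +1 guards against df == 0
            System.out.println("idf = " + idf);
        }
    }
}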
If you don't use Lucene
OK, I can sketch a solution based on your comment below, and we'll see if it helps (a code sketch putting the steps together follows the list):
You'll need to identify the files to consider. Maybe you have an explicit list, or maybe you have a pattern of filenames?
Once you have the files identified, you'll need to iterate over them, probably as File objects in java.
With each file, you'll need to open it (say, by using a BufferedReader wrapped around an InputStreamReader wrapped around a FileInputStream).
You'll need to know how to tokenize the file contents; perhaps just using whitespace and a Scanner object or similar.
You'll need a data structure (maybe a Map<String,Record>) to map terms found in the file to a Record containing the term counts and locations.
You might consider use of a tool that can do some or all of this for you. I imagine that Lucene would probably have some infrastructure to use, for instance.
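Putting those steps together, here is a minimal sketch; it assumes a hypothetical corpus directory of *.txt files, whitespace tokenization via Scanner, and the idf(t) = log(N / df(t)) formula from above:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class IdfCalculator {
    public static void main(String[] args) throws IOException {
        // 1. Identify the files to consider (hypothetical "corpus" directory).
        List<Path> docs = new ArrayList<>();
        try (DirectoryStream<Path> dir =
                Files.newDirectoryStream(Paths.get("corpus"), "*.txt")) {
            for (Path p : dir) docs.add(p);
        }

        // 2.-4. Iterate over the files, tokenize on whitespace, and record
        // in how many documents each term appears.
        Map<String, Integer> docFreq = new HashMap<>();
        for (Path doc : docs) {
            Set<String> seen = new HashSet<>(); // count each term once per document
            try (Scanner scanner = new Scanner(doc)) {
                while (scanner.hasNext()) {
                    seen.add(scanner.next().toLowerCase());
                }
            }
            for (String term : seen) docFreq.merge(term, 1, Integer::sum);
        }

        // 5. idf(t) = log(N / df(t))
        int n = docs.size();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            double idf = Math.log((double) n / e.getValue());
            System.out.printf("%s\t%.4f%n", e.getKey(), idf);
        }
    }
}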

Related

Solution for COBOL viewcall response string to java bean

Currently I have a job to rewrite a library that communicates with a COBOL mainframe via ViewCall (not sure if this is a universal term or not). The response will be a fixed-length string laid out according to the copybook. Are there any solutions or approaches in Java for mapping this fixed-length string to a model class using the copybook as a mapper, instead of manually cutting the string and setting the pieces on the model class?
Some possible solutions:
For small text copybooks, hand coding is feasible; see the sketch after this list.
The next step up is to use cb2xml and generate the code. Cb2xml will calculate the position and length of each field for you. This answer shows what can be done with cb2xml.
Use JRecord ~ CodeGen to generate Java classes. See Generating Java Code for details on generating Java code. JRecord is oriented to files but should be usable here.
Remember there is a fair overhead in parsing the COBOL copybook.
For a one-off request with a small record, hand coding is a viable option, but you run the risk of subsequent requests coming along.
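As a hedged illustration of the hand-coding option, suppose (hypothetically) the response maps to 01 CUSTOMER-REC with 05 CUST-ID PIC 9(6), 05 CUST-NAME PIC X(20) and 05 BALANCE PIC 9(7)V99; cutting the fixed-length string at the copybook offsets then looks like this:

import java.math.BigDecimal;

public class CustomerRecord {
    public long custId;
    public String custName;
    public BigDecimal balance;

    // Cut the fixed-length ViewCall response at the copybook offsets.
    public static CustomerRecord parse(String raw) {
        CustomerRecord rec = new CustomerRecord();
        rec.custId = Long.parseLong(raw.substring(0, 6).trim());   // PIC 9(6)
        rec.custName = raw.substring(6, 26).trim();                // PIC X(20)
        // PIC 9(7)V99: nine digits with an implied decimal point
        rec.balance = new BigDecimal(raw.substring(26, 35).trim()).movePointLeft(2);
        return rec;
    }
}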

Using a C++ Struct in Android app (Java and XML)?

I'm a decent C++ programmer, good enough to do what I want. But I'm working on my first Android App (obviously not C++ related), and I'm having an issue where I'd like to translate what I know from C++ over to the XML/Java used in Android Studio.
Basically I have (in C++) an array of structures. And maybe I didn't do the perfect search, but I sure as heck tried to look around for the answer, but I didn't come up with anything.
How would I go about placing an array of structures inside the XML file and utilizing it in Java?
As a bit of a buffer, let me say that I'm not really looking for code, just verification that this is possible, and a method on how to go about it. I don't mind researching to learn what I want, but I haven't come up with anything. Like I said, I probably haven't googled it properly because I'm unsure of exactly how to ask it.
EDIT: So it appears that XML doesn't have a structure (or anything similar? not sure). But I can utilize a Java class with public variables. Now my question is more or less: What would be the best way to go about inserting all the information into the array/class/variables?
In C++ terms, I could neatly place all the info into a text file and then read from it, using a for loop to place all the info in the structures. Or, if I don't want to use an outside source/file, I could hardcode the information into each variable. Tedious, but it'd work. I'm not sure, in Android terms, whether I could use the same method: pack a text file in with the app and read from it with a for loop to insert the information into the array/class/variables.
class answerStruct
{
    public String a;
    public boolean status;
}

class questionStruct
{
    public String q;
    answerStruct[] answer = new answerStruct[4];
}
I'm not placing this here to brag about my super high-tech program, but to give a visual, and frankly it's less I have to write out. This is the method I plan on going with. But, being Java, I'm open to possibly better options. My question still stands as far as inputting information into the variables: hard code? Or does Android/Java allow me to place a text file with my app and read from it into the variables?
XML is just a markup language for tree-structured data, and imposes no restrictions on how you name or structure your tree nodes.
What I think you're looking for is an XML object serialiser: a way to serialise your in-memory structure into XML for more permanent storage, and then, on a later run, deserialise it back into memory. There are many XML serialisers for Java, each with its own proprietary XML format.
I've used Simple XML in the past, and found it easy and flexible.
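As a hedged sketch of that approach with Simple XML, reusing the asker's classes (the file name questions.xml and the field names are illustrative):

import java.io.File;
import org.simpleframework.xml.Element;
import org.simpleframework.xml.ElementArray;
import org.simpleframework.xml.Root;
import org.simpleframework.xml.Serializer;
import org.simpleframework.xml.core.Persister;

@Root
class Answer {
    @Element public String a;
    @Element public boolean status;
}

@Root
class Question {
    @Element public String q;
    @ElementArray public Answer[] answers;
}

public class QuizStorage {
    public static void main(String[] args) throws Exception {
        Serializer serializer = new Persister();
        // Read the packaged XML file back into objects...
        Question question = serializer.read(Question.class, new File("questions.xml"));
        // ...and write it out again after changes.
        serializer.write(question, new File("questions.xml"));
    }
}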

Mahout: converting one large text file to SequenceFile format

I have done a lot of searching on the web for this, but I've found nothing, even though I feel like it has to be somewhat common. I have used Mahout's seqdirectory command to convert a folder containing text files (each file is a separate document) in the past. But in this case there are so many documents (in the 100,000s) that I have one very large text file in which each line is a document. How can I convert this large file to SequenceFile format so that Mahout understands that each line should be considered a separate document? Thank you very much for any help.
Yeah, it is not quite apparent or very intuitive how to do this, although (lucky for you :P) I have answered that exact question several times here on Stack Overflow, for instance here. Have a look ;)
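For completeness, a minimal hedged sketch of the usual approach (it assumes Hadoop's SequenceFile API on the classpath and hypothetical file names; Mahout's seq2sparse step expects Text keys used as document ids and Text values holding the document body):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("documents.seq"), Text.class, Text.class);
        try (BufferedReader reader =
                new BufferedReader(new FileReader("big-file.txt"))) {
            String line;
            int id = 0;
            while ((line = reader.readLine()) != null) {
                // One line of the big file becomes one document in the SequenceFile.
                writer.append(new Text("/doc-" + id++), new Text(line));
            }
        } finally {
            writer.close();
        }
    }
}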

Are there APIs for text analysis/mining in Java? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can inform if a word found is a number, date, year, name, currency, etc.
I'm starting the text analysis now, so I only need an API to kick off. I made a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words in a page, similar words, data types, and other resources related to the text.
Are there APIs for text analysis in Java?
EDIT: Text mining; I want to mine the text. An API for Java that provides this.
It looks like you're looking for a Named Entity Recogniser.
You have got a couple of choices.
CRFClassifier from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.
GATE (General Architecture for Text Engineering), an open source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. It should give you a brief idea of what this can do. The video tutorial gives you a better overview of what this software has to offer.
You may need to customise one of them to fit your needs.
You also have other options:
simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
part-of-speech (POS) tagging: extracting part-of-speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging?.
In terms of training for CRFClassifier, you could find a brief explanation at their FAQ:
...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
You can also find a code snippet at the javadoc of CRFClassifier:
Typical command-line usage
For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
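If you'd rather call it from Java than from the command line, here is a hedged sketch (the model path below is the serialized classifier that ships with Stanford NER; adjust it to wherever yours lives):

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerDemo {
    public static void main(String[] args) {
        String model = "classifiers/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifierNoExceptions(model);
        // Tags each token inline with its entity class, e.g. "Smith/PERSON"
        System.out.println(classifier.classifyToString(
                "John Smith works at Google in London."));
    }
}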
For example, you might use some classes from the standard library java.text, or use StreamTokenizer (which you can customize to your requirements). But as you know, text data from internet sources usually has many orthographic mistakes, and for better performance you need something like a fuzzy tokenizer; java.text and other standard utilities have too limited capabilities in that context.
So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
P.S.
Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. A simple state-machine recognizer handles simple templates, and you can construct a more advanced parser that recognizes much more complex ones.
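To make the regex suggestion concrete, here is a small hedged sketch of a hand-rolled tokenizer that classifies tokens as dates, numbers, or words (the patterns are deliberately simplistic; real text will need more cases):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // Order matters: try the more specific alternatives first.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<date>\\d{4}-\\d{2}-\\d{2})"        // ISO dates like 2011-07-23
            + "|(?<number>\\d+(?:\\.\\d+)?)"       // integers and decimals
            + "|(?<word>\\p{L}+)");                // runs of letters

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("Paid 42.50 on 2011-07-23 for books");
        while (m.find()) {
            String type = m.group("date") != null ? "DATE"
                    : m.group("number") != null ? "NUMBER" : "WORD";
            System.out.println(type + "\t" + m.group());
        }
    }
}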
If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of different APIs.
I'd rather adapt Lucene's Analysis and Stemmer classes than reinvent the wheel. They have the vast majority of cases covered. See also the additional and contrib classes.

What are good methods to perform spreadsheet-like calculations in a programming language?

What's the best way to do spreadsheet-like calculations in a programming language? Example: a multi-user application needs to be available over the web that crunches columns and cells of numbers like a spreadsheet, based on user submissions. What are the best data structures/database models/patterns for handling this type of work, so that the different columns are handled efficiently and easily in PHP, Java, or even .NET? Is it better to use data structures within the language, or is it better to use a database? If a database is the way, how does one go about doing this?
To do the actual calculation, look at graph theory. Basically you want to represent each cell as a node in a graph and each dependency as a directed edge. Next, do a topological sort to calculate the value of each cell in the right order.
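A minimal hedged sketch of that idea (the cell names and the dependency API are invented for illustration; Kahn's algorithm produces the evaluation order and detects circular references):

import java.util.*;

class SheetGraph {
    // dependents.get("A1") = cells whose formulas use A1 (edge A1 -> dependent)
    private final Map<String, List<String>> dependents = new HashMap<>();
    private final Map<String, Integer> inDegree = new HashMap<>();

    void addCell(String cell) {
        dependents.putIfAbsent(cell, new ArrayList<>());
        inDegree.putIfAbsent(cell, 0);
    }

    // 'dependent' uses the value of 'dependency' in its formula
    void addDependency(String dependent, String dependency) {
        addCell(dependent);
        addCell(dependency);
        dependents.get(dependency).add(dependent);
        inDegree.merge(dependent, 1, Integer::sum);
    }

    // Kahn's algorithm: a topological order in which to evaluate the cells
    List<String> evaluationOrder() {
        Map<String, Integer> degree = new HashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : degree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String cell = ready.poll();
            order.add(cell);
            for (String next : dependents.get(cell))
                if (degree.merge(next, -1, Integer::sum) == 0) ready.add(next);
        }
        if (order.size() != dependents.size())
            throw new IllegalStateException("circular reference between cells");
        return order;
    }
}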
Aspose.Cells (formerly Aspose.Excel.Web) is a good way to get the functionality you are looking for.
Unless you are asking more "How is it done?" than "I need to do it", in which case I would look at the other answers given.
Along the lines of "I need to do it"
Microsoft has Excel Services which does just what you want.
Spreadsheet operations on the server. It is available via a web services interface, so you can connect and drive calculations from Java, PHP, .NET, whatever.
Excel Services is part of Sharepoint 2007.
Resolver One is a Spreadsheet app made in IronPython.
There is an explanation at pythonology.org of the overall mechanism it uses for user-generated equations, including an image showing Resolver One's overall algorithm.
It should be noted that users can write Python code to be interpreted both in the cells and in a special 'outside of sheet' place.
Take a look at another question here on SO, from which I reused my answer.
I can't tell you how to do it, but I would recommend you look at the code of PHPExcel. PHPExcel is a library that allows you to create Excel files from within PHP.
The workflow of PHPExcel, simplified, looks like this:
1. Create an empty Excel file object.
2. Add cells (with either data or formulas) to the "Excel file".
3. Call the create function, which generates the file itself.
In your case you would have to replace step 3 with something like "create web interface".
Therefore I would recommend you look at the code of this open source project and see how it is structured in general. This should help you solve your problem.
I once used a binary tree to store the output of parsing a string using BODMAS. Each node was an operation between two other nodes, which could be a number, a variable or another operation.
So y = x * x + 2
became:
    +
   / \
  *   2
 / \
x   x
Sadly this was at school in Pascal and is stored on a 5 1/4" disk, so you don't want it :)
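The same idea translates straightforwardly to Java; here is a hedged sketch (the class and method names are invented for illustration):

interface Expr {
    double eval(double x);
}

class Num implements Expr {
    final double value;
    Num(double value) { this.value = value; }
    public double eval(double x) { return value; }
}

class Var implements Expr {
    public double eval(double x) { return x; }
}

class BinOp implements Expr {
    final char op;
    final Expr left, right;
    BinOp(char op, Expr left, Expr right) {
        this.op = op; this.left = left; this.right = right;
    }
    public double eval(double x) {
        double l = left.eval(x), r = right.eval(x);
        switch (op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
            default: throw new IllegalArgumentException("unknown op: " + op);
        }
    }
}

public class ExprDemo {
    public static void main(String[] args) {
        // y = x * x + 2, as in the tree above
        Expr y = new BinOp('+', new BinOp('*', new Var(), new Var()), new Num(2));
        System.out.println(y.eval(3)); // prints 11.0
    }
}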
SpreadsheetGear for .NET will let you load Excel workbooks, plug in values, calculate and then get the results.
You can see a few simple ASP.NET calculation samples here, other ASP.NET samples here and download a free trial here.
Disclaimer: I own SpreadsheetGear LLC
I must point out that Google Spreadsheets already does this kind of thing.
