Are there APIs for text analysis/mining in Java? [closed]

I want to know if there is an API to do text analysis in Java: something that can extract all the words in a text, separate words and expressions, etc., and that can tell whether a word it found is a number, date, year, name, currency, and so on.
I'm starting text analysis now, so I only need an API to kick things off. I made a web crawler; now I need something to analyze the downloaded data. I need methods to count the number of words on a page, find similar words, detect data types, and other resources related to the text.
Are there APIs for text analysis in Java?
EDIT: Text mining; I want to mine the text. An API for Java that provides this.

It looks like you're looking for a Named Entity Recogniser.
You have a couple of choices.
CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.
GATE (General Architecture for Text Engineering), an open-source suite for language processing. Take a look at the screenshots on the page for developers: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do, and the video tutorial gives you a better overview of what this software has to offer.
You may need to customise one of them to fit your needs.
You also have other options:
simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
part-of-speech (POS) tagging: extracting part-of-speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging?.
In terms of training for CRFClassifier, you could find a brief explanation at their FAQ:
...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
You can also find a code snippet in the Javadoc of CRFClassifier:
Typical command-line usage.
For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
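
If you'd rather call it from Java than from the command line, a minimal sketch might look like the following (the model file name is one of the serialized classifiers shipped with the Stanford NER distribution; adjust the path to your download):

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerDemo {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained serialized model; adjust the path to where
        // your Stanford NER download keeps its classifier files
        AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("english.all.3class.distsim.crf.ser.gz");

        String text = "Jim bought 300 shares of Acme Corp. in 2006.";

        // Entities come back marked up inline, e.g. <PERSON>Jim</PERSON>
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}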

For example, you might use some classes from the standard library java.text, or use StreamTokenizer (which you can customize to your requirements). But as you know, text data from internet sources usually contains many spelling mistakes, so for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities have too limited capabilities in that context.
So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
P.S.
Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. A simple state-machine recognizer can be extended into a more advanced parser that recognizes much more complex templates.
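
For instance, a rough regex-based tokenizer along these lines classifies tokens as it finds them (the token classes and patterns are just illustrative; tune them to your data):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // One alternative per token class; order matters (dates before plain numbers)
    private static final Pattern TOKEN = Pattern.compile(
            "(?<DATE>\\d{4}-\\d{2}-\\d{2})"         // e.g. 2011-07-23
          + "|(?<CURRENCY>[$€£]\\d+(?:\\.\\d{2})?)" // e.g. $19.99
          + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"          // plain integers/decimals
          + "|(?<WORD>\\p{L}+)");                   // runs of letters

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("Paid $19.99 on 2011-07-23 for 3 books");
        while (m.find()) {
            String type = m.group("DATE") != null ? "DATE"
                        : m.group("CURRENCY") != null ? "CURRENCY"
                        : m.group("NUMBER") != null ? "NUMBER" : "WORD";
            System.out.println(type + ": " + m.group());
        }
    }
}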

If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
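
As a sketch of that idea, a first pass might try to parse an all-digit token as a strict yyyyMMdd date before falling back to treating it as a number (the classify method and its rules are just an illustration):

import java.text.ParseException;
import java.text.SimpleDateFormat;

public class TokenGuesser {
    // Decide whether an all-digit token like "20110723" is a date or a number
    static String classify(String token) {
        if (token.matches("\\d{8}")) {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
            fmt.setLenient(false); // reject impossible dates like 20119999
            try {
                fmt.parse(token);
                return "DATE";     // plausible yyyyMMdd date
            } catch (ParseException e) {
                // not a valid date; fall through and treat as a number
            }
        }
        return token.matches("\\d+") ? "NUMBER" : "WORD";
    }

    public static void main(String[] args) {
        System.out.println(classify("20110723")); // DATE
        System.out.println(classify("20119999")); // NUMBER
        System.out.println(classify("hello"));    // WORD
    }
}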

I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of different APIs.

I'd rather adapt Lucene's analysis and stemmer classes than reinvent the wheel; they have the vast majority of cases covered. See also the additional and contrib classes.
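
For illustration, a minimal sketch of running text through a Lucene analyzer (this uses EnglishAnalyzer from lucene-analyzers-common in Lucene 5+ style; earlier versions need a Version argument):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokens {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer lower-cases, drops stop words and applies stemming
        try (EnglishAnalyzer analyzer = new EnglishAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("body",
                    new StringReader("The crawler was crawling pages"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // roughly: crawler, crawl, page
            }
            ts.end();
            ts.close();
        }
    }
}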

Related

Solution for COBOL viewcall response string to java bean

I currently have a job to rewrite a library that communicates with a COBOL mainframe via ViewCall (not sure if this is a universal term or not). The response will be a fixed-length string laid out according to the copybook. Are there any solutions or approaches in Java to map this fixed-length string to a model class, using the copybook as the mapper instead of manually cutting the strings and setting them on the model class?
Some possible solutions:
For small text copybooks, hand coding is feasible.
The next step up is to use cb2xml and generate the code. Cb2xml will calculate positions and lengths for you. This answer shows what can be done with cb2xml.
Use JRecord ~ CodeGen to generate Java classes. See Generating Java Code for details. JRecord is oriented toward files but should be usable.
Remember there is a fair overhead in parsing the COBOL copybook.
For a one-off request with a small record, hand coding is a viable option, but you run the risk of subsequent requests coming along.
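
For the hand-coded route, here is a sketch of what the mapping looks like, using a made-up three-field copybook (the field names, offsets and PIC clauses are purely illustrative):

public class CustomerRecord {
    // Hypothetical copybook for illustration; real offsets come from yours:
    //   05 CUST-NAME  PIC X(20).    -> columns 0-19
    //   05 CUST-ID    PIC 9(6).     -> columns 20-25
    //   05 BALANCE    PIC 9(7)V99.  -> columns 26-34 (implied decimal point)
    private final String name;
    private final int id;
    private final long balanceCents;

    public CustomerRecord(String raw) {
        this.name = raw.substring(0, 20).trim();
        this.id = Integer.parseInt(raw.substring(20, 26));
        // V99 means the decimal point is implied, not stored: keep cents as a long
        this.balanceCents = Long.parseLong(raw.substring(26, 35));
    }

    @Override
    public String toString() {
        return name + " #" + id + " balance=" + (balanceCents / 100.0);
    }

    public static void main(String[] args) {
        // Build a sample fixed-length record matching the layout above
        String record = String.format("%-20s%06d%09d", "JOHN SMITH", 42, 1050);
        System.out.println(new CustomerRecord(record)); // JOHN SMITH #42 balance=10.5
    }
}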

How to implement IDF in Java? [duplicate]

Possible Duplicate:
Any tutorial or code for Tf Idf in java
IDF is inverse document frequency:
IDF = log(total number of documents / number of documents containing the term)
How do I do this in Java? Any advice?
How about:

static double idf(int docsContainingTerm, int totalNumDocuments) {
    return Math.log((double) totalNumDocuments / (double) docsContainingTerm);
}

(This is basically a humorous way of saying: tell us more about your circumstances, and maybe we can help. What is a document? What is its representation?)
Just Use Lucene
Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
It provides IDF here.
If you don't use Lucene
OK, I can sketch a solution based on your comment below, and we'll see if it helps:
You'll need to identify the files to consider. Maybe you have an explicit list, or maybe you have a pattern of filenames?
Once you have the files identified, you'll need to iterate over them, probably as File objects in Java.
For each file, you'll need to open it (say, by wrapping a BufferedReader around an InputStreamReader around a FileInputStream).
You'll need to know how to tokenize the file contents; perhaps just using whitespace and a Scanner object or similar.
You'll need a data structure (maybe a Map<String,Record>) to map terms found in the file to a Record containing the term counts and locations.
You might consider use of a tool that can do some or all of this for you. I imagine that Lucene would probably have some infrastructure to use, for instance.
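
Putting those steps together, here is a minimal sketch (it assumes a directory named corpus full of plain-text .txt files, and tokenizes naively on whitespace):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class IdfCalculator {
    public static void main(String[] args) throws IOException {
        // Step 1: identify the files to consider
        List<Path> docs = new ArrayList<>();
        try (DirectoryStream<Path> dir =
                Files.newDirectoryStream(Paths.get("corpus"), "*.txt")) {
            for (Path p : dir) docs.add(p);
        }

        // Steps 2-5: document frequency, i.e. in how many documents each term appears
        Map<String, Integer> df = new HashMap<>();
        for (Path doc : docs) {
            Set<String> seen = new HashSet<>(); // count each term once per document
            try (Scanner sc = new Scanner(doc)) {
                while (sc.hasNext()) seen.add(sc.next().toLowerCase());
            }
            for (String term : seen) df.merge(term, 1, Integer::sum);
        }

        // IDF = log(total documents / documents containing the term)
        int n = docs.size();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double idf = Math.log((double) n / e.getValue());
            System.out.printf("%s\t%.4f%n", e.getKey(), idf);
        }
    }
}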

Tips for writing a file parser in Java? [closed]

EDIT: I'm mostly parsing "comma-separated values"; fuzzy brought that term to my attention.
Interpreting the blocks of CSV is the main question here.
I know how to read the file into something like a String[] and some of the basic features of String, but I don't think methods like contains() and analyzing everything character by character will work.
What are some ways I can do this in a smarter way?
Example of a line:
-barfoob: boobs, foob, "foo bar"
There's a reason that everyone assumes you're talking about XML: inventing a proprietary text-based file format requires very strong justification in the face of the maturity and easy availability of XML parsers.
And your question indicates that you have very little prior knowledge about parsers (otherwise you'd be writing an ANTLR or JavaCC grammar instead of asking this question) - which is another strong argument against rolling your own, except as a learning experience.
Since the input is "formatted similarly to HTML", then it is likely that your data is best represented using a tree-like structure, and also, it is likely that it is XML or similar to XML.
If this is the case, I propose the smartest way to parse your file is to use an XML parser.
Here are some resources you may find helpful:
A chapter on XML parsing from Sun: http://java.sun.com/developer/Books/xmljava/ch03.pdf
An article that might help you get started quickly: http://onjava.com/pub/a/onjava/2002/06/26/xml.html
HTH
If the document is valid XML, then any of the other answers will work. If it's not, you'll have to lex.
You should look at ANTLR. Even if you want to write the parser yourself, ANTLR is a great alternative. Or at least look at YAML.
This, plus digging through Wikipedia for related articles, will probably suffice.
I think java.util.Scanner will help you. Have a look at http://java.sun.com/javase/6/docs/api/java/util/Scanner.html
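
For instance, a quick Scanner-based sketch of the example line (note the naive comma split; values containing commas inside quotes need the regex approach shown further down):

import java.util.Scanner;

public class ScannerSketch {
    public static void main(String[] args) {
        String line = "-barfoob: boobs, foob, \"foo bar\"";

        Scanner sc = new Scanner(line).useDelimiter(":\\s*");
        String key = sc.next();      // -barfoob
        String values = sc.next();   // boobs, foob, "foo bar"

        // Naive comma split; fine here because no value contains a comma
        for (String v : values.split(",\\s*")) {
            System.out.println(v);
        }
    }
}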
Depending on how complicated your "schema" is, a regular expression might be what you want. If there is a lot of nesting then it might be easiest to convert to XML or JSON and use a prebuilt parser.
People are right about standard formats being best practice, but let's set that aside.
Assuming that the example you give is representative, the task is pretty trivial.
You show a line with an initial token, demarcated by a colon-space, then a list of comma-separated values. Separate at that first colon-space, and then use split() on the part to the right. Handling the quotes is straightforward, too.
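
A sketch of that approach, with a small regex to keep quoted values intact (the pattern is illustrative, not a full CSV grammar):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineParser {
    // Matches either a double-quoted string or a run of non-comma characters
    private static final Pattern VALUE =
            Pattern.compile("\"([^\"]*)\"|([^,\\s][^,]*)");

    public static void main(String[] args) {
        String line = "-barfoob: boobs, foob, \"foo bar\"";

        int sep = line.indexOf(": ");
        String key = line.substring(0, sep);    // -barfoob
        String rest = line.substring(sep + 2);  // boobs, foob, "foo bar"

        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(rest);
        while (m.find()) {
            // group(1) is the quoted form (without quotes), group(2) the bare form
            values.add(m.group(1) != null ? m.group(1) : m.group(2).trim());
        }
        System.out.println(key + " -> " + values); // -barfoob -> [boobs, foob, foo bar]
    }
}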
After looking at your sample input, I fail to see any resemblance to HTML or XML:
-barfoob: boobs, foob, "foo bar"
If this is what you want to parse, I have an alternative suggestion: use the Java properties parser (it comes with standard Java), and then parse the remainder of each line using your own custom code. You will need to refactor your format somewhat for this to work, so it's up to you.
barfoob=boobs, foob, "foo bar"
Java properties will be able to return you barfoob as the property name, and boobs, foob, "foo bar" as the property value. That's where your custom code can split the property value into boobs, foob and foo bar.
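
A minimal sketch of that split of responsibilities (loading from a StringReader just to keep it self-contained; in practice you'd load from your file):

import java.io.StringReader;
import java.util.Properties;

public class PropertiesDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader("barfoob=boobs, foob, \"foo bar\""));

        // The properties parser handles the key/value split...
        String value = props.getProperty("barfoob"); // boobs, foob, "foo bar"

        // ...and your custom code takes over from here to split the value,
        // e.g. with the quote-aware regex from the earlier sketch
        System.out.println(value);
    }
}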
I'd strongly advise not reinventing the wheel; use an existing solution like Flatworm, Fixedformat4j or jFFP, which can all parse positional or comma-separated-value files (personally, I recommend Flatworm).
You may be able to use the Neko HTML parser to some degree. It depends on how it handles the non-standard HTML.
If the XML is valid, I personally prefer using http://www.xom.nu simply because it features a nice DOM model. As pointed out, though, there are parsers in J2SE.

Static Analysis tool to detect Internationalization issues

Are there any tools (free/commercial) that can audit an application for internationalization? (or localization-readiness, if you prefer)
Primarily interested in:
Multilingual Implementation tests
Examples:
* [javascript] alert('Oops wrong choice!');
* [java] String msg = resourcebundle.getString("key.x").concat("4");
* [jdbc] String query=".. order by abc"; //should be NLS_SORT or equiv.
Date Implementation tests
Examples:
* SimpleDateFormat used without Locale
* Apache's DateFormatUtils used
Numeric Implementation tests
Examples:
* NumberFormat used without Locale
javascript-validation tests
Examples:
* [javascript] checkIsDecimal { //decimal point checked against "." }
* [javascript] hardcoded character range [A-z]
Cheers.
Have a look at Globalyzer - http://lingoport.com/globalyzer - as it is just that, a tool for performing static analysis on code specifically for internationalization. It works with a variety of programming languages too. Supports detection and correction for embedded strings (string externalization capabilities too), potential locale-limiting methods/functions/classes depending upon the programming language and requirements, as well as other issues like programming patterns and embedded images. There are default "rule sets" which get you a good start, and then you can customize your rules for both detection and filtering of issues. Plus there's an underlying database that helps you tag or keep track of i18n issues as you work with them. There's a server component, where you create and share your rule sets with your team members, then desktop and command line clients which run locally on your machine to analyze your source, so you're not sending any code or reporting off your local machine.
Based on your examples, you mostly want to diagnose functions that produce output whose input isn't somehow internationalized.
So for the alert case, you want to find any print call that acquires a string not produced by one of possibly several well-known translation routines.
For the JDBC case, you want to identify ordering constraints that are not locale-specific.
For the various date cases, you want date routines that are known to produce locale-specific answers.
The JavaScript validation is harder to guess at intent; presumably you want to diagnose functions that are known to be wired to a particular locale, which seems a lot like the date case. For range checks, you want to capture anything that compares a character to another for less-than or greater-than.
For the wired-locale functions, it seems just knowing their name would be enough (although perhaps there has to be some overload resolution, e.g., by number of arguments), so NumberFormat(?,?) is bad and NumberFormat(?,?,?) is OK.
Why can't you write a regular expression to look (heuristically) for the bad cases? For the range case, you just need to recognize expressions of the form [exp] < [literal-char] or [exp] < [literal-string]. A regexp that looks for just "< '.+" would seem adequate. Are there common cases these would miss?
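
As a sketch of that heuristic approach, something like the following could scan a source file for the suspicious patterns discussed above (the patterns are illustrative and will produce false positives; treat it as a triage aid, not a real analyzer):

import java.io.IOException;
import java.nio.file.*;
import java.util.regex.Pattern;

public class I18nLint {
    // Heuristic patterns; expect false positives
    private static final Pattern[] SUSPECTS = {
        Pattern.compile("alert\\(\\s*['\"]"),           // hardcoded alert text
        Pattern.compile("[<>]=?\\s*'[A-Za-z]'"),        // char-range comparison
        Pattern.compile("new\\s+SimpleDateFormat\\([^,)]*\\)") // ctor with no Locale (no comma)
    };

    public static void main(String[] args) throws IOException {
        // usage: java I18nLint SomeFile.java
        Path file = Paths.get(args[0]);
        int lineNo = 0;
        for (String line : Files.readAllLines(file)) {
            lineNo++;
            for (Pattern p : SUSPECTS) {
                if (p.matcher(line).find()) {
                    System.out.println(file + ":" + lineNo + ": " + line.trim());
                }
            }
        }
    }
}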
EDIT (from a comment below: "I've been using regexp but..."):
If you want a tool that goes deeper than regexps, you pretty much have to go to language parsing and name/type resolution, and having data-flow analysis would be helpful. Since you want to process multiple (computer) languages, the tool has to be multilingual-capable. And it appears you want to be able to customize it to check for the specific cases relevant to your application.
The DMS Software Reengineering Toolkit has all these properties, including parsers for Java, JavaScript and SQL. It is designed to be customized, so you have to do that in advance of using it.
I have studied IntelliJ IDEA's code analyzers, and it does have the inspections you requested. It's a commercial IDE, specialized in Java, but it knows other languages as well.
http://www.jetbrains.com/idea/

How to webscrape scholar.google.com in Java?

I want to write a Java function grabTopResults(String f) such that grabTopResults("automata theory") returns a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C.
First thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
If the Google police come knocking, you've never heard of me, OK?
I use http://jericho.htmlparser.net/docs/index.html. Google Scholar doesn't have an API (http://code.google.com/p/google-ajax-apis/issues/detail?id=109). And of course it is not allowed by Google (read the terms of use: automated requests are forbidden).
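
For what it's worth, a minimal Jericho sketch of fetching a results page and pulling out the titles might look like this (the h3 selector reflects Google Scholar's markup at the time and is an assumption; the markup changes, and expect to be blocked quickly, per the terms of use):

import java.net.URL;
import java.util.List;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public class ScholarTitles {
    public static void main(String[] args) throws Exception {
        // Jericho can fetch and parse in one step
        Source source = new Source(
                new URL("http://scholar.google.com/scholar?q=automata+theory"));

        // Result titles were rendered as <h3> elements at the time of writing
        List<Element> headings = source.getAllElements("h3");
        for (Element h : headings) {
            System.out.println(h.getTextExtractor().toString());
        }
    }
}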
Below is a bit of example code which gets the titles on the first page using the open-source product TestPlan. It is a standalone product, but if you really need it, I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
    %Params:q% automate theory
end
set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
    set %Title% as selectIn %Item% h3
    Notice %Title%
end
This produces output like the following (my IP is in Germany, hence the German response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.
