What language can be recommended for text mining/parsing? [closed] - java

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I'm doing some text mining in web pages. Currently I'm working with Java, but maybe there are more appropriate languages for what I want to do.
Examples of things I want to do:
Determine the character type of a word based on its parts (letters, digits, symbols, etc.), such as Alphabetic, Number, Alphanumeric, Symbol, etc. (there are more types).
Discover stop words based on statistics.
Discover some grammatical classes (verb, noun, preposition, conjunction) based on statistics and some logic.
I was thinking about using Prolog and R (I don't know much about these languages), but I don't know whether they are good for this or whether another language would be more appropriate.
Which should I use? Recommendations for good Java libraries are welcome too.
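For the first item in the list above, plain Java may already be enough. Below is a minimal sketch, not a definitive implementation: the class name, the enum values, and the MIXED and EMPTY categories are assumptions made for illustration, since the question only says that more types exist.

```java
final class CharTypeClassifier {

    // Hypothetical categories; the question names the first four and says there are more.
    enum CharType { ALPHABETIC, NUMBER, ALPHANUMERIC, SYMBOL, MIXED, EMPTY }

    static CharType classify(String token) {
        if (token == null || token.isEmpty()) {
            return CharType.EMPTY;
        }
        boolean hasLetter = false;
        boolean hasDigit = false;
        boolean hasOther = false;
        // Walk the token by code point so characters outside the BMP are handled correctly.
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                hasLetter = true;
            } else if (Character.isDigit(cp)) {
                hasDigit = true;
            } else {
                hasOther = true;
            }
            i += Character.charCount(cp);
        }
        if (hasOther) {
            // Symbols only, or symbols mixed with letters and/or digits.
            return (hasLetter || hasDigit) ? CharType.MIXED : CharType.SYMBOL;
        }
        if (hasLetter && hasDigit) {
            return CharType.ALPHANUMERIC;
        }
        return hasLetter ? CharType.ALPHABETIC : CharType.NUMBER;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"hello", "2024", "R2D2", "#!", "e-mail"}) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

Character.isLetter and Character.isDigit follow Unicode categories, so accented letters and non-Latin digits are classified as letters and digits too.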

Python!
It has a huge number of libraries in this area.
I have no knowledge of Prolog or R, but Python is definitely a lot better than Java for text mining and AI work.

I highly recommend Perl. It has many features for text processing, web searching, and parsing, among much else. Take a look at the modules available on CPAN (more than 23,000 and growing).

I think Apache Solr and Nutch provide the framework for that, and you can extend them to fit your requirements.
Java has some basic support, but nothing like these two products; they are awesome!

HtmlUnit might give you some good APIs for fetching web pages and traversing DOM elements by XPath. I have used it for some time to perform operations ranging from simple to more complex.

Related

Is velocity templates good for evaluating conditions? [closed]

Closed 10 years ago.
I am thinking about externalizing some conditions instead of implementing them in Java, so that I can easily change them later as needed.
For example, I need to check whether certain keys exist in a given map and whether the values of certain keys equal a given value.
I was thinking about using Spring's expression language, but since we are already using Velocity templates, I thought Velocity might be a good candidate.
Any ideas? Thanks.
You can easily use #if/#else, #foreach, and the other conditional directives of Velocity to do business logic as part of template rendering.
However, I usually try to separate business logic from rendering in Velocity, for a number of reasons:
Complexity: A Velocity template can become hard to read, especially if the target output itself requires a complex layout. If you add business logic to the mix, it quickly becomes impossible to read for anybody else (or for yourself after a few months of not looking at it constantly).
Testability: It's harder to test Velocity templates; there's better support for unit and integration testing of Java code.
Functionality: Velocity is not a full programming language by design, so you will miss some things sooner or later, and a macro simply is not a function (e.g. variables have global scope by default). You are bound to run into some of these limits if you make your templates big and complex.
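To illustrate the testability point, here is a minimal sketch of the map checks from the question kept in plain Java as small, composable predicates. The class and method names are hypothetical, and this is only one way to structure it, not a replacement for an expression language if the rules truly must change without a redeploy.

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Predicate;

final class MapConditions {

    // True when the map contains the key, whatever its value.
    static Predicate<Map<String, String>> hasKey(String key) {
        return map -> map.containsKey(key);
    }

    // True when the key is present and its value equals the expected one.
    static Predicate<Map<String, String>> valueEquals(String key, String expected) {
        return map -> map.containsKey(key) && Objects.equals(map.get(key), expected);
    }

    public static void main(String[] args) {
        // Rules are composed with and()/or()/negate() from java.util.function.Predicate.
        Predicate<Map<String, String>> rule =
                hasKey("user").and(valueEquals("status", "active"));

        System.out.println(rule.test(Map.of("user", "ann", "status", "active")));
        System.out.println(rule.test(Map.of("status", "active")));
    }
}
```

Each predicate can be unit-tested on its own, and the template then only needs to render a boolean that Java has already computed.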

Best java API for distance functions [closed]

Closed 11 years ago.
I am new to this kind of computing. I don't know which existing distance functions are helpful for calculating the distance between two sets (arrays) of doubles. Can someone suggest at least 10 distance functions, so that I can select the few that best suit my problem domain? I just want to calculate the distance between two sets for my scientific approach to the problem domain. I also want to know whether I have to implement them manually or whether there is a Java API that covers most distance functions. Suggestions can help me minimize my effort and save time.
Providing you with code is not really going to help. What you need to do is read up on the mathematics of the various measures of distance, and figure out which is most appropriate based on that knowledge.
You could start by reading the Wikipedia page on Distance and the linked pages and resources.
Only when you've decided on an appropriate measure do you need to go looking for code. In a lot of cases, it is probably simplest to implement the measure yourself.
Alternatively, if you want us to provide sensible suggestions of measures that are appropriate to your problem domain, tell us what the problem domain is.
Are we talking about statistical distance between two samples? If so, there is an abundance of methods, each one suiting a different problem.
If your problem domain is simple, subtracting the sample means (averages) could suffice. For more complex data, the Earth Mover's Distance is common, though newer and more robust methods (such as kernel functions) are available.
Coding is the least of your problems. You must provide a more accurate definition of your problem before I can further assist you.
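As both answers suggest, the simpler measures take only a few lines to implement once one has been chosen. The sketch below is an illustration under assumptions: the Euclidean, Manhattan, and Chebyshev distances are picked here as common examples and are not named in the answers, while meanDifference corresponds to subtracting the sample means. The class and method names are hypothetical.

```java
final class Distances {

    // Euclidean (L2) distance between two equal-length vectors.
    static double euclidean(double[] a, double[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan (L1) distance: sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Chebyshev (L-infinity) distance: largest absolute coordinate difference.
    static double chebyshev(double[] a, double[] b) {
        requireSameLength(a, b);
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    // Absolute difference of sample means; the samples may differ in size.
    static double meanDifference(double[] a, double[] b) {
        return Math.abs(mean(a) - mean(b));
    }

    private static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println("euclidean = " + euclidean(p, q));
        System.out.println("manhattan = " + manhattan(p, q));
        System.out.println("chebyshev = " + chebyshev(p, q));
    }
}
```

Which of these is meaningful still depends on the problem domain: the first three treat the arrays as points with matching coordinates, while the mean difference treats them as unordered samples.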

Which algorithms are worth to learn or recall on preparation to Java developer interview? [closed]

Closed 12 years ago.
I know there is a Collections.sort() method in Java, but I think quicksort is worth remembering and trying.
My target work is general Java (web, database access, integration), not game development, scientific applications, or anything else that depends on advanced algorithms.
Which algorithms should I learn to pass a Java developer interview without stress?
Fizz Buzz
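For reference, a minimal sketch of the usual Fizz Buzz exercise, assuming the common rules (multiples of 3 print "Fizz", multiples of 5 print "Buzz", multiples of both print "FizzBuzz", everything else prints the number). Interviewers vary the details, so treat the range and the wording as assumptions.

```java
final class FizzBuzz {

    static String fizzBuzz(int n) {
        // Check the combined case first, otherwise 15 would be reported as "Fizz".
        if (n % 15 == 0) {
            return "FizzBuzz";
        }
        if (n % 3 == 0) {
            return "Fizz";
        }
        if (n % 5 == 0) {
            return "Buzz";
        }
        return Integer.toString(n);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 100; i++) {
            System.out.println(fizzBuzz(i));
        }
    }
}
```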
I usually don't care whether a developer knows the basic algorithms by heart. I do care whether he is capable of understanding requirements and translating them into correct, testable, and understandable pieces of code.
Ah, and I do care whether he knows how to implement the most common design patterns. And he should know when and how to use collections, threads, and String#split: it's amazing how many "developers" don't know how to read and process a simple CSV file.
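A minimal sketch of the String#split approach for a simple CSV file follows. It assumes plain comma-separated fields with no quoting or embedded commas, which real CSV data often violates; the class name is hypothetical, and the lines could come from Files.readAllLines(path) when reading from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SimpleCsv {

    // Splits each non-empty line on commas. Quoted fields are NOT supported.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // A negative limit keeps trailing empty fields, so "bob," yields two fields.
            rows.add(line.split(",", -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("name,age", "ann,31", "", "bob,");
        for (String[] row : parse(lines)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Without the negative limit, split silently drops trailing empty strings, which is a common source of rows with too few columns.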
Although I fully agree with Joachim's comment, I would go for collection selection. This is not an algorithm per se, but rather a good view of which collection suits which purpose:
sorted content with logarithmic lookup time? TreeSet!
mapped data that remembers insertion order? LinkedHashMap!
Using that, and some knowledge of the design patterns behind collections, you will far too often answer algorithm questions with the Knuth answer (or the subtle variation: as long as the Sun developers implemented it correctly, I only have to choose wisely).
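The two collection choices above can be sketched in a few lines. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class CollectionChoice {

    // TreeSet keeps its elements sorted and drops duplicates.
    static List<String> sortedUnique(Collection<String> words) {
        return new ArrayList<>(new TreeSet<>(words));
    }

    // LinkedHashMap iterates its keys in the order they were first inserted.
    static List<String> keysInInsertionOrder(String... keys) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        for (String key : keys) {
            lengths.put(key, key.length());
        }
        return new ArrayList<>(lengths.keySet());
    }

    public static void main(String[] args) {
        System.out.println(sortedUnique(Arrays.asList("pear", "apple", "pear", "fig")));
        System.out.println(keysInInsertionOrder("pear", "apple", "fig"));
    }
}
```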

What is a good Java library for Parts-Of-Speech tagging? [closed]

Closed 9 years ago.
I'm looking for a good open source POS Tagger in Java. Here's what I have come up with so far.
LingPipe
Stanford
LBJ
FastTag
Anybody got any recommendations?
Are you looking to tag POS in a specific domain? Most general-purpose taggers are trained on newswire text. Typically they don't perform well when you use them in specific domains (such as biomedical text). There are other taggers trained specifically for such domains, such as dTagger (Java) for biomedical text.
For newswire text, Adwait Ratnaparkhi's MXPOST is very good and is the one I would recommend.
Other Java implementations include:
MontyLingua
Berkeley Parser (Not really a POS tagger but all full blown parsers will typically include POS taggers. Google for Java syntactic parsers and you will find many.)
QTag
LBJ
OpenNLP and LingPipe, as mentioned by the other posters, are also pretty decent.
Info on the state-of-the-art on POS tagging can be found here. As you can see LTAG-Spinal (also mentioned by another poster) ranks best as of now, but the variation across the various taggers is not much. I have not used LTAG myself.
Also note that the baseline performance for POS tagging is about 90%. Baseline means - (a) tag every word by most frequent POS tag from a lexicon, and (b) tag every unknown word as a noun.
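The baseline just described is simple enough to sketch directly. The code below is an illustration under assumptions: the lexicon is built from a small list of (word, tag) pairs, lookups are lowercased, and "NN" (the Penn Treebank noun tag) is assumed as the noun label; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class BaselineTagger {

    private static final String NOUN = "NN"; // assumed noun tag (Penn Treebank style)

    private final Map<String, String> mostFrequentTag = new HashMap<>();

    // Each training item is a two-element array: {word, tag}.
    BaselineTagger(List<String[]> taggedWords) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] pair : taggedWords) {
            String word = pair[0].toLowerCase(Locale.ROOT);
            counts.computeIfAbsent(word, k -> new HashMap<>()).merge(pair[1], 1, Integer::sum);
        }
        // (a) For every known word, remember its most frequent tag.
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            String best = Collections.max(entry.getValue().entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            mostFrequentTag.put(entry.getKey(), best);
        }
    }

    // (b) Unknown words are tagged as nouns.
    String tag(String word) {
        return mostFrequentTag.getOrDefault(word.toLowerCase(Locale.ROOT), NOUN);
    }

    public static void main(String[] args) {
        BaselineTagger tagger = new BaselineTagger(Arrays.asList(
                new String[] {"the", "DT"},
                new String[] {"run", "VB"},
                new String[] {"run", "NN"},
                new String[] {"run", "VB"},
                new String[] {"dog", "NN"}));
        for (String word : new String[] {"The", "run", "zyzzyva"}) {
            System.out.println(word + "/" + tagger.tag(word));
        }
    }
}
```

When two tags tie for a word, this sketch picks one arbitrarily; a real baseline would need a defined tie-breaking rule.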
I have used OpenNLP with good results. You can also check out MorphAdorner.
I've used both LingPipe and Stanford's POS Tagger. The latter is a state-of-the-art POS tagger but, in my experience, it is too slow (although they do provide less accurate models, which are reasonably fast). Of course, it always depends on what you are trying to achieve, and there will always be a trade-off between speed and accuracy.
I also once used LBJ-based NER software and, although it was pretty accurate, the source code was a complete mess. The source code of both LingPipe and Stanford's tagger is very clean and well documented.
You can also take a look at LTAG-spinal. I haven't used it yet, but from the algorithm description, and from the listed accuracy, it sure seems better than the alternatives you have so far.
Hope it helps.

Which NLP toolkit to use in JAVA? [closed]

Closed 10 years ago.
I'm working on a project that consists of a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results.
I'm using Java for the text mining and AJAX with ICEfaces for the development of the website.
What do I have :
A list of articles returned from a search.
Each article has an ID and an abstract.
The idea is to get keywords from each abstract text.
Then the idea is to compare the keywords from all abstracts, find the ones that are repeated most often, and show them on the website as words related to the search.
Any ideas ?
I have searched the web a lot, and I know there is Named Entity Recognition, Part-Of-Speech tagging, and the GENIA thesaurus for NER on genes and proteins. I have already tried stemming, stop word lists, etc.
I just need to know the best approach to solve this problem.
Thanks a lot.
I would recommend using a combination of POS tagging and string tokenizing to extract all the nouns from each abstract. Then use some sort of dictionary/hash to count the frequency of each noun and output the N most frequent nouns. Combining that with some other intelligent filtering mechanisms should do reasonably well at giving you the important keywords from each abstract.
For POS tagging, check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
However, if you are expecting a lot of multi-word terms in your corpus, then instead of extracting just nouns you could take the most frequent n-grams for n = 2 to 4.
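The n-gram variant can be sketched with the standard library alone. This is a minimal illustration under assumptions: tokens are lowercased runs of letters and digits, no stop word filtering or POS tagging is applied, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NGramCounter {

    // Counts every word n-gram with minN <= n <= maxN across all abstracts.
    static Map<String, Integer> count(List<String> abstracts, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : abstracts) {
            List<String> words = new ArrayList<>();
            // Split on anything that is not a letter or a decimal digit.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= words.size(); i++) {
                    counts.merge(String.join(" ", words.subList(i, i + n)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the most frequent n-grams; ties are broken alphabetically.
    static List<String> top(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> abstracts = Arrays.asList(
                "Gene expression in yeast",
                "gene expression profiling");
        System.out.println(top(count(abstracts, 2, 4), 3));
    }
}
```

In practice the raw counts are dominated by n-grams made of function words, so a stop word list or the noun filter described above would still be needed before showing results to users.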
There's an Apache project for that: OpenNLP, an open source Apache project. I haven't used it, and it's in the incubator, so it may be a bit raw.
This post from jeff's search engine cafe has a number of other suggestions.
This might be relevant as well:
https://github.com/jdf/cue.language
It has stop words, word and ngram frequencies, ...
It's part of the software behind Wordle.
I ended up using Alias-i LingPipe.
