As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
i'm working on a project that consists of a website that connects to the NCBI(National Center for Biotechnology Information) and searches for articles there. Thing is that I have to do some text mining on all the results.
I'm using the JAVA language for textmining and AJAX with ICEFACES for the development of the website.
What do I have :
A list of articles returned from a search.
Each article has an ID and an abstract.
The idea is to get keywords from each abstract text.
And then compare all the keywords from all abstracts and find the ones that are the most repeated. So then show in the website the related words for the search.
Any ideas ?
I searched a lot in the web, and I know there is Named Entity Recognition,Part Of Speech tagging, there is teh GENIA thesaurus for NER on genes and proteins, I already tried stemming ... Stop words lists, etc...
I just need to know the best aproahc to resolve this problem.
Thanks a lot.
i would recommend you use a combination of POS tagging and then string tokenizing to extract all the nouns out of each abstract.. then use some sort of dictionary/hash to count the frequency of each of these nouns and then outputting the N most prolific nouns.. combining that with some other intelligent filtering mechanisms should do reasonably well in giving you the important keywords from the abstract
for POS tagging check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
However, if you are expecting a lot of multi-word terms in your corpus.. instead of extracting just nouns, you could take the most prolific n-grams for n=2 to 4
There's an Apache project for that... I haven't used it but, OpenNLP an open source Apache project. It's in the incubator so it maybe a bit raw.
This post from jeff's search engine cafe has a number of other suggestions.
This might be relevant as well:
https://github.com/jdf/cue.language
It has stop words, word and ngram frequencies, ...
It's part of the software behind Wordle.
I ended up using the Alias`i Ling Pipe
Related
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
There are a lot books/online resources about using patterns. But I didn't find any tasks for using it. But for good understanding of patterns it's need practice. Maybe someone faced with some resources where there are tasks for using patterns.
For example. Mediator pattern:1)write chat application where...
Thanks in advance.
UPDATE:
I found:
http://www.cs.sjsu.edu/~pearce/modules/labs/patterns/
How to study design patterns?
I'll give you five, with easy and/or moderate difficulty:
Singleton
easy: single database access class for the entire application.
Factory
easy: English-to-another-language translator. I need to be able to add and then access a new language translator with minimal code changes.
Observer
easy: Central data structure that has several copies within the application that need to be updated automatically when a change to the main DS occurs.
moderate: Make this work over a network with cooperating processes updating a central data structure.
Memento
easy: A simple game with the ability to save/load.
Decorator
easy: A simple persistence class with read/write ability. I want to be able to dynamically switch between XML or database persistence.
I know only of one such resource, and it is not formulated as you have specified, but maybe it'll help a bit: In the last chapters of the Head First Design Patterns book, the MVC pattern is explained as a compound pattern, involving several others : Composite, Strategy, Adapter etc.
It is explained with the help of a small application. You could look up the chapter and build the described to practice.
Ever use an iterator? Pattern. My guess is you use a lot of patterns without even really realizing you're using them. Created a buffered reader out of a file reader? Decorator; pattern. Don't set out trying to use patterns--let the problem discover them. They're everywhere, that's why they're patterns.
Things like facades, decorators, iterators, factories, etc. crop up in every single domain. Pick anything you're interested in writing, and discover the patterns already present. Refactor mercilessly--patterns.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I'm doing some text mining in web pages. Currently I'm working with Java, but maybe there is more appropriate languages to do what I want.
Example of some things I want to do:
Determine the char type of a word based on it parts (letter, digit, symbols, etc.) as Alphabetic, Number, Alphanumeric, Symbol, etc.(there is more types).
Discover stop words based on statistics.
Discover some gramatical class (verb, noun, preposition, conjuntion) based on statistics and some logics.
I was thinking about using Prolog and R (I don't know much about these languages), but I don't know if they are good for this or maybe, another language more appropriate.
Which can I use? Good libs for Java are welcome too.
python.!
They have a HELL-LOTTA libraries in this area.
but, i've got no knowledge about prologue and R.. but definitely py is LOT better than java in text mining, and AI stuff...
I highly recommend Perl. It has a lot of text-processing features, web search and parsing, and a large etc. Take a look at the available modules (>23.000 and growing) at CPAN.
I think Apache Solr and Nutch provides you the framework for that and on top of that you can extend it for your requirements.
Java has some basic support, but nothing like the above two products, they are awesome!
HTML Unit might give you some good APIs for fetching web pages, and traversing over elements in DOM by XPath. I have used it for sometime to perform simple to more complex operations.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 12 years ago.
I know that is Collections.sort() method in Java but I think quicksort is worth to remember and try.
My work target is general Java: web, database access, integration, not game developer, scientific application or another one that depends on advanced algorithms.
Which algorithms should I learn to pass without stress Java developer interview?
Fizz Buzz
I usually don't care, if a developer knows the basic algorithms by heart. I do care, if he is capabale of understanding requirements and translating them in correct, testable and understandable pieces of code.
Ah, and I do care if he knows how to implement the most common design patterns. And he should know when and how to use collections, threads and - String#split - it's amazing how many "developers" don't know how to read and process a simple csv file.
Although I fully agree with Joachim comment, I would go for : collection selection. This is not an algorithm per se, but rather a good view of which collection is good for which purpose :
sorted content with constant lookup time ? TreeSet !
mapped data with memorization of insertion order ? LinkedHashMap !
using that, and some knowledge of design patterns behind collections, you will far too often reply to algorithms questions using the knuth answer (or the subtle variation : as long as Sun developpers implemented it correctly, I only have to choose wisely).
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I'm looking for a good open source POS Tagger in Java. Here's what I have come up with so far.
LingPipe
Stanford
LBJ
FastTag
Anybody got any recommendations?
Are you looking to tag POS in a specific domain? Most of the general purpose taggers are trained on newswire text. Typically they don't perform well when you are using them in specific domains (such and biomedical text). There are other taggers specifically trained for such domains such as dTagger (java) for biomedical text.
For newswire text, Adwait Ratnaparkhi's MXPOST is very good and is the one I would recommend.
Other Java implementations include:
MontyLingua
Berkeley Parser (Not really a POS tagger but all full blown parsers will typically include POS taggers. Google for Java syntactic parsers and you will find many.)
QTag
LBJ
OpenNLP and Lingpipe as posted by the other posters are also pretty decent.
Info on the state-of-the-art on POS tagging can be found here. As you can see LTAG-Spinal (also mentioned by another poster) ranks best as of now, but the variation across the various taggers is not much. I have not used LTAG myself.
Also note that the baseline performance for POS tagging is about 90%. Baseline means - (a) tag every word by most frequent POS tag from a lexicon, and (b) tag every unknown word as a noun.
I have used OpenNLP with good results. You can also check out MorphAdorner.
I've used both LingPipe and Stanford's POS Tagger. The later is a state-of-the-art POS Tagger but, from my experience, it is too slow (although they do provide less accurate models, which are reasonably fast). Of course, it always depends on what you are trying to achieve, and there will always be a trade-off between speed and accuracy.
I've also once used an LBJ-based NER software and, although it was pretty accurate, the source code was a complete mess. Both LingPipe and Stanford's source is very clean and well documented.
You can also take a look at LTAG-spinal. I haven't used it yet, but from the algorithm description, and from the listed accuracy, it sure seems better than the alternatives you have so far.
Hope it helps.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm wondering what's a good online judge for just practicing algorithms. I'm currently not very good at writing algorithms, so probably something easy (and least frustrating) would be good.
I've tried the UVA online judge, but it took me about 20 tries to get the first example question right; There was absolutely no documentation on how to read input, etc. I've read about Topcoder, but I'm not really looking to compete, merely to practice.
Take a better look at topcoder. Yes, they have competitions, but you can still easily just "play" by yourself. You are given a goal and a time limit and you choose your language, and then you code it. You can view the source code of the best coders to improve yourself.
I have used topcoder for awhile and have never been in any competition. Check it out.
You may also want to check out Project Euler. Not a judge, but there are mathematical problems and solutions available for many languages.
Have a look at SPOJ
CodingBat might give you some good practice. It responds instantly with test results.
This is a year old by now, so my answer is for future stumblers.
The ACM-ICPC Live Archive has a lot of great problems, and in a lot of different areas. (Project Euler is also great, but the problems are all number-theoretic.) And hoop-jumping is normal with these things... last I checked, Facebook Puzzles requires you to email a zip file containing the code and an Ant buildfile, and they take a long time to get back to you.
I've only sent Java code to UVa, so I'll elaborate a little on the Java particulars for anyone else who's struggling. Your class must be called Main, and its entry point must be the main method. You read from System.in. If you're on a Unix-y platform, after compiling you can use
Java Main < input.txt
to test your program.
The presentation has to be exact. For example, if they say "outputs should be separated by a blank line," that does not mean, "follow each output with a blank line." Finally, don't be afraid to check out their forums.
Reference: http://online-judge.uva.es/board/viewtopic.php?t=7429
(In their sample code, they read the input byte-by-byte. Don't do that; use Scanner instead. It's also not necessary to have the main method create an instance of the class. You can go 100% static, and often the problems are small enough that OOP doesn't buy you anything.)