Java keyword extraction

Is there a simple-to-use Java library that can take a String and return a set of Strings which are the keywords/keyphrases?
It doesn't have to be particularly clever; just using stop words and stemming to match keywords would be fine.
I am looking at the KEA package http://code.google.com/p/kea-algorithm/ but I can't figure out how to use their code.
Ideally I'd like something simple that comes with a little example documentation. In the meantime I will set about writing this myself!
EDIT: When I say I can't figure out how to use their code, I mean I can't see a simple way. The individual classes by themselves have useful methods that will do much of the work.

This is a fairly old question and the OP has probably solved his problem already, but I'm putting this here for others who may stumble upon the question looking for how to use KEA.
For KEA, you will need a training set: some of your documents need to have keywords already assigned. The training data consists of a directory of documents (.txt files) and corresponding keyword files (.key files), with one keyword per line. You train KEA on this set, then use the model to extract keywords from the rest of your documents, which live in another directory of .txt files. KEA will write corresponding .key files into this directory.
For more information, take a look at one or more of the following:
1) The KEA source distribution has a TestKEA.java class which shows how to extract keywords from a small test corpus. The README has details on the directory format required.
2) This blog post has (somewhat terse, IMO) instructions on how to use KEA.
http://kea-pranay.blogspot.com/2010/02/kea-key-extraction-algorithm.html
3) My blog post, which I wrote up last weekend while trying to learn how to generate keywords from a corpus I had (one already manually annotated with keywords). It has Python code to pre-process the data into the format KEA expects, Scala code (KEA provides a Java API) to train and run the extractor, and Python code to analyze and visualize the generated keywords.
http://sujitpal.blogspot.com/2014/08/keyword-extraction-with-kea.html
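To make the flow above concrete, here is a minimal sketch of the train-then-extract cycle, modeled on the TestKEA example from the source distribution (see 1) above). The class and setter names are the ones I recall from KEA 5.0 and may differ in other versions; the directory paths are placeholders.

import kea.main.KEAKeyphraseExtractor;
import kea.main.KEAModelBuilder;

public class KeaSketch {
    public static void main(String[] args) throws Exception {
        // Train on a directory of .txt documents with matching .key files.
        KEAModelBuilder km = new KEAModelBuilder();
        km.setDirName("corpus/train");   // placeholder: training docs + keys
        km.setModelName("corpus/model"); // placeholder: where the model is saved
        km.setVocabulary("none");        // free keyphrase extraction, no controlled vocabulary
        km.buildModel(km.collectStems());
        km.saveModel();

        // Extract: KEA writes a .key file next to each .txt file in this directory.
        KEAKeyphraseExtractor ke = new KEAKeyphraseExtractor();
        ke.setDirName("corpus/unlabeled"); // placeholder: docs without keys
        ke.setModelName("corpus/model");
        ke.setVocabulary("none");
        ke.setNumPhrases(10); // keyphrases to emit per document
        ke.loadModel();
        ke.extractKeyphrases(ke.collectStems());
    }
}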

You might try the Porter stemming algorithm: the Java version is at http://tartarus.org/~martin/PorterStemmer/java.txt and the main page is at http://tartarus.org/~martin/PorterStemmer/. It's old, but it doesn't do a bad job.
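If you go this route, the class in that java.txt file is called Stemmer, and the usage pattern is to feed characters, call stem(), and read the result. A quick sketch (the word list is just an illustration):

// Uses the Stemmer class from http://tartarus.org/~martin/PorterStemmer/java.txt
public class StemDemo {
    public static void main(String[] args) {
        for (String word : new String[] {"running", "flies", "easily"}) {
            Stemmer s = new Stemmer();
            for (char ch : word.toCharArray()) {
                s.add(ch); // feed the word one character at a time
            }
            s.stem(); // apply the Porter algorithm to the buffered word
            System.out.println(word + " -> " + s.toString());
        }
    }
}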

Related

Retrieving word definitions from Google in Java

I have a list of words (1K+) in a file, and I would like to get their definitions and save them. I was thinking about getting the definitions from Google, as that's the first thing it shows. The way I thought about doing this is quite rudimentary: create a URL instance pointing to the Google search for the given word, and read the content using streams. Then "filter" out the definition, which always sits between "data-dobid="dfn"><.span>" and "<./span>"
For example:
[...]data-dobid="dfn"><.span>. unwilling or refusing to change one's views or
to agree about something<./span>.[...]
which is the definition of "intransigent".
However, I would like to know if there is a more "efficient" way of doing this, for example without retrieving all the other results of the search. Also, is it possible to load multiple results in a background thread, so that when I want to "decode" a definition and save it I don't always have to wait for the search to complete?
The more efficient approach is to download a dictionary which you can then load locally. This gives you a local file or database that is readily searchable.
This approach is not only computationally efficient, but it will also ensure you are using the information correctly under its license. What you are proposing is commonly called "scraping" and may go against various licenses and terms of service.
This blog post lists several freely available and freely licensed dictionaries.
This AskUbuntu.SE question describes some more of the technical work required to acquire a free dictionary and reference it from the command line. You would want to replicate these reading patterns to load the data in Java.
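As a sketch of those reading patterns, and assuming you have converted the dictionary to a simple word<TAB>definition text file (the format here is my assumption; real dictionary dumps vary):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class LocalDictionary {
    private final Map<String, String> definitions = new HashMap<>();

    // Assumes one "word<TAB>definition" entry per line (a hypothetical layout).
    public LocalDictionary(Path file) throws IOException {
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                definitions.put(line.substring(0, tab).toLowerCase(), line.substring(tab + 1));
            }
        }
    }

    public String lookup(String word) {
        return definitions.get(word.toLowerCase()); // null if the word is unknown
    }
}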
Yet another approach would be to use a freely available and appropriately licensed API such as https://dictionaryapi.com/. This would still use HTTP calls, but it is clearly licensed and is an explicit API for looking up human-language word definitions. This is an advantage over scraping Google because you won't have to parse HTML, and it is appropriately licensed for you to use.
Finally, there are some similar, if not duplicate, questions on Stack Overflow and Stack Exchange, such as this one: How to implement an English dictionary in Java?
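For example, here is a minimal sketch using the JDK 11+ HTTP client. The endpoint shape and key parameter follow dictionaryapi.com's documented pattern as best I recall it, so treat them as assumptions and check the current docs; sendAsync also addresses the background-thread part of the question.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class ApiLookup {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Endpoint shape assumed from dictionaryapi.com; YOUR_KEY is a placeholder.
    public static CompletableFuture<String> define(String word) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "https://www.dictionaryapi.com/api/v3/references/collegiate/json/"
                + word + "?key=YOUR_KEY")).build();
        // sendAsync runs on a background executor, so several lookups can be in flight at once.
        return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body); // raw JSON; parse it with a JSON library
    }
}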

How to format i18n files?

I have three files for internationalization: messages_es.properties, messages_en.properties and messages_pt.properties. These files follow the pattern:
message1=value
message2=value2
and the values change according to the file. For example:
messages_en.properties:
hello=welcome
messages_pt.properties:
hello=bem vindo
The problem is that over the course of the project these files become inconsistent: lines that exist in one file don't exist in the others, and the lines are not kept in the same order. I want to know if there is an easy way to rearrange and format these i18n files so that lines existing in one file but missing from the others get copied over, and the lines end up in the same order in every file.
Interesting question. You are dealing with text files, so there are a lot of possible options to manage this situation, but it depends on your scenario (source control, IDE, etc.).
If you are using Eclipse, check: http://marketplace.eclipse.org/content/eclipse-resourcebundle-editor
And for IntelliJ: https://www.jetbrains.com/idea/features/i18n_support.html
Yes, the messages should usually appear in each file, unless there's a default message for some key that doesn't need translating (perhaps technical terms). Different IDEs have different support for managing message files.
As far as ordering the messages, there's no technical need to do so, but it can help the human maintainers. Any text-editor's sort routine will work just fine.
The NetBeans IDE has a properties editor that works across languages, displaying them side by side in a matrix. Similarly, there are stand-alone editors that let you do this. One would expect such an editor to keep the source text synchronized and in one consistent layout.
First, go looking for a translator's editor that can maintain a fixed layout. A format like gettext (.po/.pot), which is similar to .properties, might be a better choice, depending on the tool.
For more than three languages it would make sense to use a source format aimed more at translators, like the XML format XLIFF (though .properties files are well known), and to generate the several .properties files, or even ListResourceBundles, from that single source (via XSLT, perhaps).
The effort for i18n should not stop at providing a list of phrases to translate; add some information where needed (a disambiguating note), and maybe even a glossary for consistent use of the same terms. The text presented to the user is a very significant part of the product's quality and appeal. Using different synonyms may make the user interface fuzzy, needlessly unclear, tangled.
The problem you are facing is a broken localization process. It has nothing to do with properties files, and it is likely that you shouldn't even compare these files now (that is, not until you fix the process).
To compare properties files you can use a very simple trick: sort each of them and use a standard diff tool to show the differences. Sure, you'll miss the comments and the logical arrangement of the English file, but at least you can see what's going on. This can be done, but it is a lot of manual work.
Instead of manually fixing the files, you should fix the broken process. A successful localization process looks basically like this:
1) Once the English file is modified, send the English file for translation. By that I mean all the translations should be based on the English file, and the localization files should be recreated (stay tuned).
2) Use Translation Memory to fill in the translations you already have. This could be done by your translation service provider, or by yourself if you really know how to do it (guess what? it is difficult).
3) Have the translators translate the strings that are missing.
4) Put the localized files back.
5) Before releasing the software to the public, have a linguistic reviewer walk through the UI and correct any mistranslations.
I intentionally skipped a few steps (localization testing, using pseudo-translations, searching for i18n defects, etc.), but if you use this kind of process, your properties files should always be in sync.
And now your question could be reduced to the one that was already asked (and answered):
Managing the localization of Java properties files.
Look at java.util.PropertyResourceBundle. It is a convenience class for reading a property file, and you can obtain a Set<String> of the keys. This should help with comparing the contents of several resource files.
But I think a better approach is to maintain the n languages in a single file, e.g. using XML, and to generate the resource files from that single source:
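For example, a minimal sketch that reports keys present in the English file but missing from the Portuguese one (file names follow the question; plain java.util.Properties works just as well here):

import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

public class MissingKeys {
    public static void main(String[] args) throws IOException {
        Properties en = load(Path.of("messages_en.properties"));
        Properties pt = load(Path.of("messages_pt.properties"));

        // Keys in the English file that the Portuguese file lacks, sorted for readability.
        Set<String> missing = new TreeSet<>(en.stringPropertyNames());
        missing.removeAll(pt.stringPropertyNames());
        missing.forEach(key -> System.out.println("missing in pt: " + key));
    }

    private static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        // .properties files are ISO-8859-1 by default; adjust if yours are UTF-8.
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            props.load(reader);
        }
        return props;
    }
}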
<entry>
  <key>somekey</key>
  <value lang="en">good bye</value>
  <value lang="es">hasta luego</value>
</entry>
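A sketch of the generation step for that layout follows. The element and attribute names (entry, key, value, lang) come from the snippet above; the source file name messages.xml and a wrapping root element are my assumptions, and the output naming matches the question's messages_<lang>.properties convention.

import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GenerateBundles {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(Path.of("messages.xml").toFile());

        // Collect one Properties object per language found in the XML source.
        Map<String, Properties> bundles = new HashMap<>();
        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String key = entry.getElementsByTagName("key").item(0).getTextContent();
            NodeList values = entry.getElementsByTagName("value");
            for (int j = 0; j < values.getLength(); j++) {
                Element value = (Element) values.item(j);
                bundles.computeIfAbsent(value.getAttribute("lang"), lang -> new Properties())
                        .setProperty(key, value.getTextContent());
            }
        }

        // Write messages_<lang>.properties for every language. Note that
        // Properties.store writes keys in arbitrary order; post-sort if you want stable diffs.
        for (Map.Entry<String, Properties> bundle : bundles.entrySet()) {
            Path out = Path.of("messages_" + bundle.getKey() + ".properties");
            try (Writer writer = Files.newBufferedWriter(out)) {
                bundle.getValue().store(writer, "generated from messages.xml - do not edit");
            }
        }
    }
}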

"Mechanically generated" java source files in the Java source code

As I was looking through the Java source code, I found some unusual files, mostly related to ByteBuffers in the java.nio package, which had very messy source code and were labelled This file was mechanically generated: Do not edit!
These files also contained large numbers of blank lines (some even in the middle of Javadoc comments!), presumably to prevent the line numbers from changing. I have also seen a few Java decompilers, such as procyon-decompiler, which have an option to keep line numbers, but I doubt that's the case here, because putting blank lines before the final closing brace would change nothing.
Here are a few of these files (I couldn't find any links to them online and didn't pastebin them because I don't want to break any copyright, but you can find them in the src.zip folder at the root of your JDK installation folder):
java.nio.ByteBuffer
java.nio.DirectByteBufferR
java.nio.Bits
java.nio.BufferOverflowException
I'd be curious to know:
Which tool generated these files?
Why does the tool keep the line numbers the same? Is it to make debugging (stacktraces) easier?
Why would a tool be used to generate them, while all other classes are programmed by humans?
Why would the tool put blank lines randomly inside parentheses, before the final closing brace, or even in Javadocs?
I can probably not answer all of the questions, but here is some background:
In the Makefile at http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/make/java/nio/Makefile, different Java source files are generated from the same template file through a preprocessor:
...
$(BUF_GEN)/CharBuffer.java: $(X_BUF_TEMPLATE) $(GEN_BUFFER_SH)
	$(prep-target)
	@$(RM) $@.temp
	TYPE=char SRC=$< DST=$@.temp $(GEN_BUFFER_CMD)
	$(MV) $@.temp $@
$(BUF_GEN)/ShortBuffer.java: $(X_BUF_TEMPLATE) $(GEN_BUFFER_SH)
	$(prep-target)
	@$(RM) $@.temp
	TYPE=short SRC=$< DST=$@.temp $(GEN_BUFFER_CMD)
	$(MV) $@.temp $@
...
$(X_BUF_TEMPLATE) refers to X-Buffer.java.template, which is the source for typed buffers like CharBuffer, ShortBuffer and some more.
Note: the URLs might change in the future. Also, sorry for referring to Java 7; the build system was modified in Java 8, and I have not found the corresponding Makefiles so far.
Which tool generated these files?
GEN_BUFFER_SH / GEN_BUFFER_CMD finally refers to genBuffer.sh, so the script which creates these files is http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/make/java/nio/genBuffer.sh.
Why would a tool be used to generate them, while all other classes are programmed by humans?
I don't have an authoritative answer for this specific case, but usually you use code generation tools:
1) if you need to create a lot of similar classes/methods which differ only in some detail, but one subtle enough that you cannot use established mechanisms like generics or method parameters (probably the case here, since the buffers are generated for primitive types, which cannot be used with generics; see the toy sketch after this list); or
2) if you need to create complex algorithms from a much simpler representation (like generating parsers from a grammar).
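To illustrate the first case, here is a toy sketch of template expansion (my illustration only, assuming Java 15+ for the text block; the real JDK build runs X-Buffer.java.template through the genBuffer.sh preprocessor discussed below, not code like this):

import java.nio.file.Files;
import java.nio.file.Path;

public class ToyBufferGenerator {
    // A drastically simplified stand-in for X-Buffer.java.template.
    private static final String TEMPLATE = """
            public class $Type$Buffer {
                private final $type$[] data;
                public $Type$Buffer(int capacity) { this.data = new $type$[capacity]; }
                public $type$ get(int i) { return data[i]; }
            }
            """;

    public static void main(String[] args) throws Exception {
        // One generated class per primitive type: generics cannot express this,
        // because type parameters cannot be primitives.
        for (String type : new String[] {"char", "short", "int", "long"}) {
            String typeName = Character.toUpperCase(type.charAt(0)) + type.substring(1);
            String source = TEMPLATE.replace("$Type$", typeName).replace("$type$", type);
            Files.writeString(Path.of(typeName + "Buffer.java"), source);
        }
    }
}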
Why does the tool keep the line numbers the same? Is it to make debugging (stacktraces) easier?
I am guessing: yes, it's to retain the line numbers in stack traces so that they match the template files. Other tools, like the C preprocessor, work similarly.

Count number of Classes and Methods with Comments

I'm trying to find a way to count the number of Java classes and methods that either have comments or need comments. We are trying to document all our code, but it's going to take a while, and we would like to post metrics on how far along we are. We are using Doxygen to convert Javadoc to web pages. I haven't found a way to do this with Doxygen yet, but that doesn't mean it's not there.
Turn on all the warnings in Doxygen (it can warn about undocumented classes and members), send them to a log file, and then get a line count of the log file.
I've used QDox in the past for source code analysis. It parses Java sources into a nice model for gathering statistics, automated code generation, etc.
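A sketch of gathering those statistics with QDox is below. The names here (JavaProjectBuilder, JavaClass, JavaMethod, getComment) are from the QDox 2.x API as I recall it; the older 1.x releases used a JavaDocBuilder entry point instead, so adjust for your version.

import java.io.File;
import com.thoughtworks.qdox.JavaProjectBuilder;
import com.thoughtworks.qdox.model.JavaClass;
import com.thoughtworks.qdox.model.JavaMethod;

public class JavadocCoverage {
    public static void main(String[] args) {
        JavaProjectBuilder builder = new JavaProjectBuilder();
        builder.addSourceTree(new File("src/main/java")); // placeholder source root

        int classes = 0, documentedClasses = 0, methods = 0, documentedMethods = 0;
        for (JavaClass cls : builder.getClasses()) {
            classes++;
            if (cls.getComment() != null) documentedClasses++; // null means no Javadoc
            for (JavaMethod method : cls.getMethods()) {
                methods++;
                if (method.getComment() != null) documentedMethods++;
            }
        }
        System.out.printf("classes: %d/%d documented, methods: %d/%d documented%n",
                documentedClasses, classes, documentedMethods, methods);
    }
}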

Generate a collection of changed lines between two revisions of a file using Java

I am writing an eclipse plugin which needs to be able to determine which lines of a file have changed compared to a different version of the same file.
Is there an existing class or library which I can use for this task?
The closest I have found is org.eclipse.compare.internal.merge.DocumentMerger. This can be used to find the information I need, but it is in an internal package, so it is not suitable for me to use. I could copy/paste the source of this class and adapt it to my requirements; however, I am hoping there is an existing library for handling textual comparisons.
For textual comparisons, try the google-diff-match-patch library. (I don't know whether Eclipse already has something similar built-in.)
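A minimal sketch with the Java port of that library (its package is name.fraser.neil.plaintext, as distributed) is below; note that diff_main works on characters, so mapping the changed chunks back to line numbers in the editor is still up to you.

import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;

public class DiffDemo {
    public static void main(String[] args) {
        diff_match_patch dmp = new diff_match_patch();

        // The two revisions of the file, as plain text.
        LinkedList<diff_match_patch.Diff> diffs =
                dmp.diff_main("line one\nline two\n", "line one\nline 2\n");
        dmp.diff_cleanupSemantic(diffs); // merge trivial fragments into readable chunks

        for (diff_match_patch.Diff diff : diffs) {
            if (diff.operation != diff_match_patch.Operation.EQUAL) {
                System.out.println(diff.operation + ": " + diff.text);
            }
        }
    }
}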
