Hyphenation for different languages with Java

The problem: given a string (which can be in any of several languages), we have to hyphenate it.
What I tried: hyphenator-j, but it seems to work only for English. I'm not sure how to hyphenate other languages, and I couldn't find free .tex files for other languages.
What options do we have for hyphenating different languages in Java?

The implementation of hyphenator-j, or of a forked variant, is able to use the original .tex hyphenation tables.
These tables can be found either:
On your local machine, if you have already installed a TeX environment such as MiKTeX. In this case, the .tex hyphenation tables can be found in \tex\generic\hyphen
On the website of the TeX Users Group and in the corresponding Git repository.
Once you have obtained the .tex files you are interested in, you can load them using the API provided by hyphenator-j.
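To give an idea of what such a table contains, here is a minimal sketch that pulls the pattern list out of the text of a .tex hyphenation file. It is not the hyphenator-j API; the class name TexPatternReader and the method readPatterns are made up for illustration, and it assumes the simple layout where the patterns sit in a single \patterns{...} group with % comments. Real files may use escaped characters or macros that this does not handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of hyphenator-j.
final class TexPatternReader {

    /** Returns the whitespace-separated entries of the first \patterns{...} group. */
    static List<String> readPatterns(String texSource) {
        // Drop TeX comments: everything from '%' to the end of the line.
        // (Assumption: the file contains no escaped "\%".)
        StringBuilder stripped = new StringBuilder();
        for (String line : texSource.split("\n")) {
            int comment = line.indexOf('%');
            stripped.append(comment >= 0 ? line.substring(0, comment) : line).append('\n');
        }
        String text = stripped.toString();

        List<String> patterns = new ArrayList<>();
        String marker = "\\patterns{";
        int start = text.indexOf(marker);
        if (start < 0) {
            return patterns; // no pattern group found
        }
        int open = start + marker.length();
        int close = text.indexOf('}', open);
        if (close < 0) {
            return patterns; // unterminated group, give up
        }
        for (String token : text.substring(open, close).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                patterns.add(token);
            }
        }
        return patterns;
    }
}
```

Each returned entry is a pattern such as hy3ph, where the digits mark how desirable or undesirable a break is at that position.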

Given enough time and willpower, you could implement hyphenation yourself, based for example on this thesis: http://www.tug.org/docs/liang/.
Implementing hyphenation yourself is not an easy task, though, so you might want to opt for an alternative solution.
Hyphenator.js
Yes, this is a JavaScript project. However, it is possible to call JavaScript functions from Java. You can find more information on this here: http://docs.oracle.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html.
Hyphenator.js supports a wide variety of languages.
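As a rough sketch of the Java side, the javax.script API can load a script and call one of its functions. The class JsBridge and the method callJs are names invented for this example, and nothing here is specific to Hyphenator.js. Note the assumption that a JavaScript engine is registered with the JVM: newer JDKs no longer bundle one, in which case this sketch simply returns an empty result.

```java
import java.util.Optional;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical helper for calling a JavaScript function from Java.
final class JsBridge {

    /** Evaluates the script, then calls the named top-level function with the given arguments. */
    static Optional<String> callJs(String script, String function, Object... args) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (!(engine instanceof Invocable)) {
            return Optional.empty(); // no usable JavaScript engine on this JVM
        }
        try {
            engine.eval(script);
            Object result = ((Invocable) engine).invokeFunction(function, args);
            return Optional.ofNullable(result).map(Object::toString);
        } catch (ScriptException | NoSuchMethodException e) {
            throw new IllegalStateException("JavaScript call failed: " + function, e);
        }
    }
}
```

A library written for the browser may expect objects such as window or document, so some adaptation is likely needed before it runs inside a plain script engine.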
Scrape dictionaries
Many dictionaries offer hyphenation rules. You can find these online, though it will involve some searching. You can then scrape them for the hyphenation rules, but this might be an uglier workaround than calling JavaScript from Java.
Either way, hyphenation is not an easy problem, and implementing it yourself seems like quite an annoying task, so the JavaScript project may be your best bet. Alternatively, you could write your own Java implementation based on Hyphenator.js. At least you would not be starting from scratch.

Related

hyphenation preprocessing

I need some leads for tools in PHP and/or Java (Spring + Hibernate currently) to use for hyphenation of content. I have some text content in included files and some in a database. All text is UTF-8 encoded, and I need soft hyphens because they are supported by most browsers.
So this stored original:
<p> These words need hyphenation</p>
would turn up something like this
<p> The­se wor­ds need hyp­he­na­tion</p>
in the source of the finally loaded web page.
Any ideas how to achieve this?
Suggestions for text-editing tools that include hyphenation within HTML markup would also be welcome for situations where there isn't any server-side code in use, only plain HTML source files.
Also, I have yet to find a good source for hyphenation word lists.
CSS3 defines client-side hyphenation.
This means that in supporting browsers¹, you only need to specify the language of your text and your desire for automatic hyphenation and it will be hyphenated automatically without any work on your part. Obviously this means that hyphenation points are controlled by the browser's linguistic resources.
For manual control, you can place discretionary hyphens at every hyphenation point that you wish to use and direct the browser to use only those.
In practice, to find hyphenation points and insert discretionary hyphens, the best course would probably be to use the venerable TeX-style hyphenation method where subword patterns specifying hierarchical hyphenation or no-hyphenation points are matched against the word to hyphenate. These patterns are now widely used (including by OpenOffice, LibreOffice and Adobe InDesign) and are available for most languages.
Implementing the algorithm only takes a few lines of code. What's more, there are ready-made implementations in numerous languages: PHP implementations like phpHyphenator, Java implementations like TeXHyphenator-J or Hyphenation, and Java bindings for the C++ implementation of libhyphen, like jhyphen.
¹ Currently, Firefox, Safari and IE have autohyphenation support, Chrome and Opera don't.
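The TeX-style pattern matching described above can be sketched roughly as follows. This is a minimal illustration rather than any of the libraries named here: the class name PatternHyphenator is hypothetical, the minimum fragment lengths (2 letters before a break, 3 after) are assumed defaults, and hyphenation exceptions are not handled. Each pattern assigns a level to the gaps between letters, the highest level at each gap wins, and odd levels allow a break.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of pattern-based hyphenation.
final class PatternHyphenator {
    private static final int LEFT_MIN = 2;  // assumed minimum letters before a break
    private static final int RIGHT_MIN = 3; // assumed minimum letters after a break

    private final Map<String, int[]> patterns = new HashMap<>();
    private int maxLength = 0;

    PatternHyphenator(String... rawPatterns) {
        for (String raw : rawPatterns) {
            // Split "hy3ph" into letters "hyph" and gap levels [0, 0, 3, 0, 0].
            StringBuilder letters = new StringBuilder();
            int[] levels = new int[raw.length() + 1];
            for (char c : raw.toCharArray()) {
                if (Character.isDigit(c)) {
                    levels[letters.length()] = c - '0';
                } else {
                    letters.append(c);
                }
            }
            patterns.put(letters.toString(), Arrays.copyOf(levels, letters.length() + 1));
            maxLength = Math.max(maxLength, letters.length());
        }
    }

    /** Inserts the given hyphen string (for example "\u00AD") at every allowed break. */
    String hyphenate(String word, String hyphen) {
        // Dots mark the word boundaries so patterns can anchor to them.
        // Assumption: lower-casing does not change the length of the word.
        String dotted = "." + word.toLowerCase(Locale.ROOT) + ".";
        int[] points = new int[dotted.length() + 1];
        for (int start = 0; start < dotted.length(); start++) {
            int limit = Math.min(dotted.length(), start + maxLength);
            for (int end = start + 1; end <= limit; end++) {
                int[] levels = patterns.get(dotted.substring(start, end));
                if (levels == null) {
                    continue;
                }
                for (int j = 0; j < levels.length; j++) {
                    points[start + j] = Math.max(points[start + j], levels[j]);
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            // points[i + 1] is the level of the gap just before word.charAt(i).
            boolean allowed = i >= LEFT_MIN && word.length() - i >= RIGHT_MIN;
            if (allowed && points[i + 1] % 2 == 1) {
                out.append(hyphen);
            }
            out.append(word.charAt(i));
        }
        return out.toString();
    }
}
```

With the nine patterns hy3ph, he2n, hena4, hen5at, 1na, n2at, 1tio, 2io and o2n, calling hyphenate("hyphenation", "-") gives hy-phen-ation. Passing "\u00AD" instead inserts soft hyphens suitable for HTML output.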
Hyphenation is actually extremely difficult. There aren't really any word lists out there. If you're using PHP, you may be able to make use of the Perl library TeX::Hyphen. I don't know of any Java solutions.
For more information, read this Wikipedia article.

library for text classification in java

I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM based methods are useful but is there a simple and documented library for using such algorithms?
I don't know much about SVM, but LingPipe might be really helpful for you. The link is a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search products Lucene (a search library), Solr (search server app), and Carrot2 (for 'clustering' search results). There should be some interesting work in that space for you.
Mallet is another awesome library to look into. It has good command-line tools to help you get started, and a Java API for when you start integrating it with the rest of your system.

Are there any sandboxable scripting engines which can be integrated with PHP/Python/other?

I'm performing a thought-experiment which, judging by other questions, isn't so novel after all, but I think I'll proceed anyway. I want to sandbox a user-supplied server-side script by, among other things, confining it to a virtual filesystem and setting the root directory, and further mapping certain virtual directories to specific physical ones, inconsistent with the actual directory structure. For example (using PHP string parsing), my preconception is "~$user/..." but the less-semantic "/$user/..." would work fine too; either might map to "users/$user/$script_name/data/...". Yes, under certain circumstances multiple users can be affected by the script.
Since this is a thought-experiment and I therefore don't consider the implementation language an issue, I'm expecting to do it on my localhost and would rather use PHP than install something else. (I also have Python 2 available, and could get mod_wsgi to use it instead. I'd install Python 3 if I had to.) Ideally, I wish a PEAR module would do this - but from what I can see none does.
I tried and failed to find a server module, such as SSJS, that could accomplish this. The closest things to answers that I found were << Looking for a locked down script interpreter >> and << Allowing users to script inline, what inline scripting engines are there for either .net or java? >>. I'll move to Java or, less likely, Mono if I absolutely have to, but I'm not enthusiastic about the idea. (I'm extremely rusty on Java, and have hardly used it server-side at all. Mono is totally alien to me.)
Since they're the most promising options so far, I also wonder how extensive the sandboxing facilities are in Java and Mono. For example, can they do virtual filesystems? Can APIs from Java user code be exposed to the engine? Are any standard APIs offered to the script, and if so, can they be removed?
Clarification
I don't really care which way this goes, but I was actually expecting Java/Mono to be the implementation platform rather than the sandboxed one, based on the questions & answers I linked. I'm a little surprised to see that flipped in the answers, but either way works.
The Java sandbox (in the way implemented for browser applets) does not offer file access at all.
In general, the Java security model has only "allow or not allow" decisions for the security manager in most cases.
Of course, you could design another API to replace the normal file I/O API (and similar ones), have your sandboxed script access files through it, and forbid the normal route with a security manager. (I suppose some of this is already implemented in the Java application engines on the market, but I know nothing about this.)
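As a rough sketch of what the core of such a replacement API might look like, the class below maps a virtual path onto a physical root directory and rejects anything that would escape it. The name VirtualRoot and its methods are hypothetical. It does not involve a security manager, and it assumes there are no symbolic links under the root, since normalizing a path does not resolve them.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: confines virtual paths to one physical directory.
final class VirtualRoot {
    private final Path physicalRoot;

    VirtualRoot(Path physicalRoot) {
        this.physicalRoot = physicalRoot.toAbsolutePath().normalize();
    }

    /** Maps a virtual path such as "/notes/todo.txt" to a path under the physical root. */
    Path resolve(String virtualPath) {
        // Treat the virtual path as relative to the sandbox root.
        String relative = virtualPath.replaceFirst("^/+", "");
        Path candidate = physicalRoot.resolve(relative).normalize();
        // After normalization, ".." segments can no longer hide an escape.
        if (!candidate.startsWith(physicalRoot)) {
            throw new SecurityException("Path escapes the sandbox: " + virtualPath);
        }
        return candidate;
    }

    public static void main(String[] args) {
        VirtualRoot root = new VirtualRoot(Paths.get("users", "alice", "demo", "data"));
        System.out.println(root.resolve("/notes/todo.txt"));
    }
}
```

A per-user mapping like the one in the question could be built by creating one such root per user and script, and handing the script only file operations that go through resolve.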
I have never tried to truly sandbox Mono but this might give you a starting point:
http://www.mono-project.com/MonoSandbox
File system access in the sandbox is touched on in that link.
Popular choices for Mono scripting seem to be Boo and Python. Both ship with the latest version of Mono (2.10). Visual Basic, Ruby and F# (OCaml-ish) do as well.
The Mono C# compiler can be easily embedded as a service for scripting. Here is a nice article about it.
If you are partial to PHP, you should check out Phalanger.
There are many other choices. There are new .NET based scripting languages all the time. I came across this one earlier today.

Java Library with subgraph isomorphism problem support?

I'm trying to analyze the usage of "#include" in C files (what is included first, dependencies...).
To do so, I extract from a C file the "#include" and I'm building a graph. I would like to identify common patterns in this graph...
So far, I'm using JGraphT as the graph engine (not sure this is the correct expression) and JGraph for the rendering (however, using JGraph is a bit problematic since the layouts are no longer included in the free release).
I've been unable to find any isomorphism support in JGraphT. Do you know of any solution providing this kind of support (something like igraph, but for Java)?
I'm using Java 1.5, and the proposed solution must be free.
I'm not sure whether any of them can do isomorphism, but I've collected a couple of links to graph layout engines in my blog: http://blog.pdark.de/2009/02/11/graph-layout-in-java/
You might want to look at graphviz, too. It's not Java but has a very powerful layout engine.
As for isomorphism: You probably only need to check for patterns at level 0 (i.e. the direct includes) because anything below that must be isomorphic by definition (all files included by some include file will always be the same unless someone used a lot of #if magic in the includes section).
Have you looked at Parsemis?
It's a Java graph mining library, and (sub)graph isomorphism is fundamental to this process, so my guess is that they're solving this issue somehow.
Not sure about the license though, but I believe it's open source as it was developed for academic reasons.
I've been pondering this problem myself lately (looking for common markup structures to factor out of JSPs into tags, in my case).
A library for this would be great. I haven't found one yet. In the meantime, here are a couple of problems that may be related to yours (isomorphically?).
I was planning to research the technique mathematical software uses to analytically evaluate integrals in calculus problems. In this case, there are a bunch of known structural patterns, and the problem in question has to be matched to one of the known patterns. The best way to do this is not always obvious because it depends on what terms are grouped together, etc.
Algorithms used in biology to find corresponding structures in two complex molecules might also be adapted to this problem.
Looks like there was a mention of isomorphism in the "experimental" package of JGraphT a few months back, but apparently no documentation.
Isomorphism comparison is a fundamental requirement in cheminformatics software (technically it's monomorphism that's used). Atoms are "nodes" and bonds are "edges". Molecular graphs are undirected and can be cyclic. A few open source cheminformatics libraries written in Java are available. You might be able to find some clues for solving your problem by looking at these libraries.
For example, I've written a BSD-licensed cheminformatics library called MX that implements a monomorphism algorithm based on VF. I wrote a high-level overview of how the algorithm was implemented, and you can browse the source for the mapping package in my GitHub repository. Most of the work is done in the DefaultState class.
MX also includes a fast exhaustive ring detector and other graph manipulations that might be applicable to your problem.
I don't know of a particular graph library with subgraph isomorphism code. Since the problem is known to be NP-complete, you can't do much other than search anyway. It shows up a lot in graph rewriting schemes, so AGG might help.
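For small graphs, the search itself can be written as a plain backtracking routine. The sketch below is one simple way to do it, not the VF algorithm and not code from any of the libraries mentioned; the class name SubgraphSearch and the adjacency-matrix representation are choices made for the example. It looks for a monomorphism: every edge of the pattern must be present between the mapped vertices of the target, while extra target edges are allowed.

```java
import java.util.Arrays;

// Hypothetical sketch: brute-force subgraph monomorphism on directed graphs.
final class SubgraphSearch {

    /**
     * Returns an array mapping each pattern vertex to a distinct target vertex
     * such that every pattern edge is also a target edge, or null if none exists.
     * graph[a][b] == true means there is an edge from a to b.
     */
    static int[] findEmbedding(boolean[][] pattern, boolean[][] target) {
        int[] mapping = new int[pattern.length];
        Arrays.fill(mapping, -1);
        boolean[] used = new boolean[target.length];
        return extend(0, mapping, used, pattern, target) ? mapping : null;
    }

    private static boolean extend(int v, int[] mapping, boolean[] used,
                                  boolean[][] pattern, boolean[][] target) {
        if (v == pattern.length) {
            return true; // every pattern vertex has been placed
        }
        for (int candidate = 0; candidate < target.length; candidate++) {
            if (used[candidate] || !consistent(v, candidate, mapping, pattern, target)) {
                continue;
            }
            mapping[v] = candidate;
            used[candidate] = true;
            if (extend(v + 1, mapping, used, pattern, target)) {
                return true;
            }
            mapping[v] = -1; // undo and try the next candidate
            used[candidate] = false;
        }
        return false;
    }

    // Checks the edges between v and all previously placed pattern vertices.
    private static boolean consistent(int v, int candidate, int[] mapping,
                                      boolean[][] pattern, boolean[][] target) {
        if (pattern[v][v] && !target[candidate][candidate]) {
            return false;
        }
        for (int u = 0; u < v; u++) {
            int mapped = mapping[u];
            if (pattern[u][v] && !target[mapped][candidate]) {
                return false;
            }
            if (pattern[v][u] && !target[candidate][mapped]) {
                return false;
            }
        }
        return true;
    }
}
```

The running time is exponential in the worst case, so this is only practical when the pattern is small, as an include pattern of a handful of files would be.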

What's the best/easiest way to manipulate ActiveX objects in Java?

I want to open and manipulate Excel files with ActiveX. I've had success with Python's Win32 Extensions and Groovy's Scriptom libraries on other projects, but I need to do this in pure Java this time if possible.
I've tried the Jacob Java COM Bridge but that doesn't seem as straightforward or simple to use, and I couldn't get it to retrieve cell values (even though this is the library underlying Scriptom). Are there alternatives?
Jacob is really the tool for the job here. I recommend that you take the time to learn a bit about how COM and ActiveX work, and I think you'll find that it's easier to use. COM is quite an accomplishment, but it's hard. Wrappers like VB make it seem easy (for the limited uses they cover), but it is not at all easy. I have a great book on learning COM, but don't have the name handy right now...
You want to learn about the IDispatch interface (this is what most of Excel's COM interface is developed around). It's a nasty, nasty interface (one of those viral things that you can do so much with it that it becomes impossible to tell what is actually happening) - but learning it is key.
If you are having issues in just one area (i.e. getting a value from a cell), you could grab the source for Scriptom and see what they do (open source, after all!).
Another suggestion is to try to implement some test cases of your code in VBA and make sure that you are correctly thinking through all the return values. When we were doing Excel automation in one of our Java apps, we implemented the general algorithm from Word's VBA, worked through the problem cases, etc... After that, transferring over to Jacob was pretty straightforward.
How about http://www.nevaobject.com/_docs/_java2com/java2com.htm? It is commercial but works better.
Have you looked at JExcelAPI? Instead of using ActiveX this is a Java library which directly reads and writes Excel files.
Not an exact answer to your questions but it might solve the problem just as well, especially if you're looking for a pure Java solution.
There's also JIntegra, which does a similar thing. Also commercial.
And there's JNIWrapper, which does a similar thing. Again, also commercial.
