hyphenation preprocessing

I need some leads for tools in PHP and/or Java (Spring + Hibernate currently) to use for hyphenation of content. I have some text content in included files and some in a database. All text is UTF-8 encoded, and I need soft hyphens, since support for them is common in most browsers.
So this stored original:
<p> These words need hyphenation</p>
would turn up something like this
<p> These words need hy&shy;phen&shy;ation</p>
in the source of the finally loaded web page.
Any ideas how to achieve this?
Suggestions for text editing tools that include hyphenation within HTML markup would also be welcome for situations where there isn't any server-side code in use, only plain HTML source files.
Also, I have yet to find a good source for hyphenation word lists.

CSS3 defines client-side hyphenation.
This means that in supporting browsers¹, you only need to specify the language of your text and your desire for automatic hyphenation and it will be hyphenated automatically without any work on your part. Obviously this means that hyphenation points are controlled by the browser's linguistic resources.
For manual control, you can place discretionary hyphens at every hyphenation point that you wish to use and direct the browser to use only those.
In practice, to find hyphenation points and insert discretionary hyphens, the best course would probably be to use the venerable TeX-style hyphenation method where subword patterns specifying hierarchical hyphenation or no-hyphenation points are matched against the word to hyphenate. These patterns are now widely used (including by OpenOffice, LibreOffice and Adobe InDesign) and are available for most languages.
Implementing the algorithm only takes a few lines of code. What's more, there are ready-made implementations in numerous languages: PHP implementations like phpHyphenator, Java implementations like TeXHyphenator-J or Hyphenation and Java bindings for the C++ implementation of libhyphen like jhyphen.
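As a rough illustration of the pattern-matching method described above, here is a minimal Java sketch. The class name LiangSketch, the minimum fragment lengths (2 letters on the left, 3 on the right) and the tiny pattern list are assumptions made for the example. The eight patterns are only the ones commonly quoted for the word "hyphenation"; a real setup would load a complete pattern file for the target language.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Minimal sketch of TeX-style pattern hyphenation; not a complete implementation. */
final class LiangSketch {
    // Letters of each pattern -> priority at every inter-letter position (length = letters + 1).
    private final Map<String, int[]> patterns = new HashMap<>();
    private int longest = 0;

    LiangSketch(List<String> rawPatterns) {
        for (String raw : rawPatterns) {
            StringBuilder letters = new StringBuilder();
            List<Integer> levels = new ArrayList<>();
            levels.add(0);
            for (char c : raw.toCharArray()) {
                if (Character.isDigit(c)) {
                    levels.set(levels.size() - 1, c - '0'); // digit applies before the next letter
                } else {
                    letters.append(c);
                    levels.add(0);
                }
            }
            int[] arr = new int[levels.size()];
            for (int i = 0; i < arr.length; i++) arr[i] = levels.get(i);
            patterns.put(letters.toString(), arr);
            longest = Math.max(longest, letters.length());
        }
    }

    /** Inserts the given hyphen string at every position whose winning priority is odd. */
    String hyphenate(String word, String hyphen) {
        String w = "." + word.toLowerCase(Locale.ROOT) + "."; // dots mark the word boundaries
        int[] score = new int[w.length() + 1];                // score[i] = priority before w.charAt(i)
        for (int i = 0; i < w.length(); i++) {
            int end = Math.min(w.length(), i + longest);
            for (int j = i + 1; j <= end; j++) {
                int[] levels = patterns.get(w.substring(i, j));
                if (levels == null) continue;
                for (int k = 0; k < levels.length; k++) {
                    score[i + k] = Math.max(score[i + k], levels[k]); // highest priority wins
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (int m = 0; m < word.length(); m++) {
            // Assumed limits: at least 2 letters before and 3 letters after a break.
            boolean allowed = m >= 2 && word.length() - m >= 3;
            if (allowed && score[m + 1] % 2 == 1) out.append(hyphen);
            out.append(word.charAt(m));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LiangSketch h = new LiangSketch(
                List.of("hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"));
        System.out.println(h.hyphenate("hyphenation", "-")); // prints hy-phen-ation
    }
}
```

For HTML output, passing "\u00AD" (the soft hyphen character) or "&shy;" as the hyphen argument would produce discretionary hyphens instead of visible ones.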
¹ Currently, Firefox, Safari and IE have autohyphenation support, Chrome and Opera don't.

Hyphenation is actually extremely difficult. There aren't really any word lists out there. If you're using PHP, you may be able to make use of the Perl library TeX::Hyphen. I don't know of any Java solutions.
For more information, read this Wikipedia article.

Related

Hyphenation for different languages with java

The problem: given a string (which can be in different languages), we have to hyphenate it.
What I tried: hyphenator-j, but this seems to work only for English. I'm not sure how to hyphenate other languages, and I couldn't find free TeX files for different languages.
What options do we have for solving hyphenation for different languages in Java?

The implementation of hyphenator-j, or of a forked variant, is able to use the original .tex hyphenation tables.
These tables can be found either:
On your local machine, if you have already installed a TeX environment such as MiKTeX. In this case, the .tex hyphenation tables can be found in \tex\generic\hyphen
On the web page of the TeX Users Group and the corresponding Git repository: here
Once you have obtained the .tex files you are interested in, you can load them using the API provided by hyphenator-j.
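To give an idea of what such a .tex table contains, here is a small sketch that extracts the entries of a \patterns{...} block with plain Java. This is not the hyphenator-j API; the class and method names (TexPatternReader, readPatterns) are made up for illustration, and the sketch assumes a simple file with % comments and no TeX escapes or nested braces inside the block.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: pull the raw pattern strings out of a TeX hyphenation file's \patterns{...} block. */
final class TexPatternReader {
    static List<String> readPatterns(String texSource) {
        // Drop TeX comments: everything from '%' to the end of the line.
        String noComments = texSource.replaceAll("%[^\\n]*", "");
        String marker = "\\patterns{";
        int start = noComments.indexOf(marker);
        if (start < 0) return List.of();
        int open = start + marker.length();
        int close = noComments.indexOf('}', open); // assumes no nested braces in the block
        if (close < 0) return List.of();
        List<String> result = new ArrayList<>();
        for (String token : noComments.substring(open, close).trim().split("\\s+")) {
            if (!token.isEmpty()) result.add(token);
        }
        return result;
    }
}
```

The file content could be read with java.nio.file.Files.readString before being passed in. Files for some languages use TeX escapes for accented letters, which this sketch does not decode.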

Given enough time and willpower you could implement hyphenation yourself based on this thesis for example http://www.tug.org/docs/liang/.
Implementing hyphenation yourself is not an easy task though, so you might want to opt for alternate solutions.
Hyphenator.js
Yes, this is a JavaScript project. However, it is possible to call JavaScript functions from Java. You can find more information on this here: http://docs.oracle.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html.
This offers support for a wide variety of languages.
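A minimal sketch of calling a JavaScript function from Java through the javax.script API is shown here. The script and the function name shout are placeholders, not part of Hyphenator.js, and the class name JsBridgeSketch is made up. Note that recent JDKs no longer bundle a JavaScript engine, so the sketch returns an empty result when none is installed.

```java
import java.util.Optional;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

/** Sketch: evaluate a script and call one of its functions with a single string argument. */
final class JsBridgeSketch {
    static Optional<String> callJs(String script, String function, String arg) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        if (!(engine instanceof Invocable)) {
            return Optional.empty(); // no usable JavaScript engine on this JVM
        }
        try {
            engine.eval(script); // defines the functions contained in the script
            Object result = ((Invocable) engine).invokeFunction(function, arg);
            return Optional.ofNullable(result).map(String::valueOf);
        } catch (ScriptException | NoSuchMethodException e) {
            throw new IllegalStateException("JavaScript call failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real hyphenation function.
        String script = "function shout(s) { return s.toUpperCase(); }";
        System.out.println(callJs(script, "shout", "hyphen").orElse("(no JavaScript engine found)"));
    }
}
```

Whether a browser-oriented library runs unchanged in such an engine is a separate question, since there is no DOM available; that would need to be checked against the library itself.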
Scrape dictionaries
Many dictionaries offer hyphenation rules. You can find these online though it will involve some searching. Next you can scrape these for the hyphenation rules, but this might be an uglier workaround than calling javascript from Java.
Either way, hyphenation is not an easy problem, and implementing it yourself seems like quite an annoying task, so maybe the JavaScript project is your best bet. Or you could write your own Java implementation based on hyphenator.js. At least you would not be starting from scratch then.

PDF document types explanation (such as PDF/A-1)

I am working on software to store legal documents and I was thinking that PDF might be an ideal format to work in. However I am a little confused as to what would best suit my needs in this regard in the actual format of the PDF file.
I have the following requirements for the documents:
will be stored for a minimum of 7 years if not longer
not editable
contain both images and text (images will be in .jpg format ideally)
I was originally looking at using PDF/A-1 however I have discovered that this format does not seem to like using JPEG images, or at least it doesn't when using JODConverter.
Any suggestions/explanations as to which format would best meet these needs would be much appreciated!

For the requirements you described, PDF/A-1b (yes, b at the end!) is the ideal format. The b is for basic: it has less strict requirements to meet than PDF/A-1a (a at the end), which is for accessible (or advanced, as I remember it).
If you have no difficulty implementing PDF/A-1a, you may as well go for it. However, depending on your source documents, PDF/A-1a may be extremely difficult and nearly impossible to generate (as it requires the additional tagging of the file's content for the accessibility features).
As for JPEG: of course PDF/A-1b supports JPEGs. It does not allow JPEG2000 compression to be used, because that algorithm was patent encumbered at the time of defining the PDF/A-1b standard. PDF/A-1b generating software therefore must re-compress objects using this type of compression with one of the other methods (which does not pose a big practical problem, though).
You may also want to look at the PDF/A Competence Center (PDFA) website. (Disclosure: I'm a member of the PDFA.)

PDF/A-1 is a good format for long-term storage (as that is its intention), and so it tries to remove external dependencies. This includes things like embedding fonts and DISABLING external hyperlinks (which also makes sense, but can be a gotcha). Some useful info is on the Adobe site (look at the key-specifications tab). PDF sounds like the right answer to your requirements.
The images being embedded should not be a problem. JODReports perhaps is doing something wrong (or the version of OpenOffice/LibreOffice you are using underneath). You could try switching parts of that underlying infrastructure (OO/LO), try experimenting directly from OpenOffice/LibreOffice GUI - export PDF/A-1 and see what the results are or try some other tools in the chain (eg Docmosis though that is based on similar technology).

Is there a Standard Java SE HTML Parser? If so, why use non-standard ones?

I need to parse a simple HTML page with a simple form in it. The answers to similar questions on StackOverflow suggest using one of a large variety of non-standard Java libraries such as TagSoup, JSoup, HTMLParser and many others.
However, a web search revealed that there exists some standard functionality in Java SE via this class: http://docs.oracle.com/javase/7/docs/api/javax/swing/text/html/parser/ParserDelegator.html
My sub-questions are:
Is it really true that the standard ParserDelegator class can parse a use case like mine?
What are the limitations of the standard library that create the need for so many non-standard libraries?
Does the fact that ParserDelegator is within swing preclude using it in a regular EC2 cloud server for a web application? Would I have to jump through a lot of hoops to get around the headless aspect or would it be just a small tweak to the configuration?
If the standard one is not recommended, which non-standard one should I use, given: (a) my desire to not stray far from the standard; (b) my simple use case; (c) desire for a mature reliable implementation; and (d) no size or weight limitations since this is a server application as opposed to an embedded client. API is a far lower priority so while I do appreciate JSoup's CSS selector like API, the other concerns (a) through (d) override it.
Thank you.

The JDK has a built-in HTML parser that supports roughly HTML 3.2. It should handle parsing of basic text formatting tags and forms.
The reason to use other, third-party parsers is the need to support "real" HTML pages with DHTML, JavaScript, etc.
JSoup is one of the popular parsers that can do the job. For more information about other implementations, please take a look at the following discussion:
Pure Java HTML viewer/renderer for use in a Scrollable pane
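For the simple-form use case from the question, a minimal sketch with the JDK's ParserDelegator could look like the following. The class name FormFieldLister and the sample markup are made up for the example, and it only collects the name attribute of input elements; it is an illustration under those assumptions, not a recommendation over the third-party parsers.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Sketch: list the names of the input fields found in an HTML page. */
final class FormFieldLister {
    static List<String> inputNames(String html) {
        List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <input> has no closing tag, so it is reported as a "simple" tag.
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    if (name != null) names.add(name.toString());
                }
            }
        };
        try {
            // true = ignore any charset declared in the document; the Reader already decodes it.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String page = "<html><body><form action=\"/login\">"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\">"
                + "</form></body></html>";
        System.out.println(inputNames(page));
    }
}
```

The parser itself does not open any windows, so it is expected to run on a server without a display, though this is worth confirming on the target machine.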

Word document creation API in Java

I would like to create a word document using a template, replace some variables (fields) and save it as a new word document.
I was thinking of using Apache POI (http://poi.apache.org/). Is it the best choice for this purpose?
Can you share your impressions of it?

I've worked with POI before and it's certainly able to generate Word documents. But the devil is in the details.
Word has thousands of features: You can put numbered lists starting at #13 with negative indents into two joined cells of a table included in another table that is itself part of a bullet list... you get the idea. When the POI documentation says they are a work in progress, that reflects what will probably be an eternal state of trying to catch up to the (to us, undocumented) specification of Word.
Documents with a reasonably "normal" set of used features are well supported by POI, whose interfaces and methods are reasonable and consistent but sometimes require a bit of work. But as Pascal says, documents with a not too exorbitant set of features are also supported by RTF.
I have almost no experience "doing" RTF but it's probably a bit simpler than working with POI.
If you're working in an environment or for a customer who insists that your produced documents be .DOC rather than .RTF, then POI is pretty much your only choice, unless you can introduce a step where you use a bit of Office automation to convert RTF into DOC.
Update: I've had a couple more ideas in the meantime.
Using POI or creating RTF documents is something that you could do on practically any platform. At my job, all servers doing processing like this happen to be running Linux, for example.
However, in the likely case that your programs will run under Windows, there is another alternative: Jacob http://www.land-of-kain.de/docs/jacob/
Jacob is a COM interface for Java; it essentially allows you to "remote control" Windows programs such as Word and Excel. The document I linked to above is not to Jacob's own site but to a single page with "cookie cutter" recipes for using Jacob. The project itself is on SourceForge: http://sourceforge.net/projects/jacob-project/ But people claim, and rightly so, that the documentation is a bit lacking.
Jacob has the advantage over all other solutions that you're dealing with the "real" Word and therefore all capabilities of Word are available to you. This would be an alternative if there are detail aspects of your document that just can't be handled with POI or via the RTF format.

This is obviously way too late, but since 2013 there has been a much better, more flexible solution for Word document creation.
http://www.docx4java.org/trac/docx4j
I have had much more luck with docx4j than I ever did with POI.

I'm not sure of the exact status of Word document support in POI but, according to the POI website, work is still in progress (I can't say what this means exactly). So, at this time, I would not use POI but would rather try to generate an RTF document. For this, you could:
Use RTFTemplate, which is an RTF-to-RTF engine that can generate an RTF document by merging an RTF model with data.
Use iText, which is primarily a PDF generator but can also generate RTF.
Build your own custom solution (but I wouldn't do that).
I'd go for iText.

If you use a template and do not want to create the Word document from scratch, then as far as I know POI is a pretty good solution. You open the template and select the zones you want to replace.
They say POI is still in development, but I've been using it in a production environment and it works pretty well at the moment.

I know this question is a bit old, but I think many people still find this with search engines, so I post another possibility to do what you want right here:
If the one and only goal is to have a Word Template and to replace some values in it, you might consider saving a Word Template as single xml (not docx) and then processing it with simple Java and without any Framework. If you want to do more (e.g. create lists or tables) you might also consider understanding the xml format and writing your own helpers before loading a Framework like POI.
Here is an example on how to do that:
http://dev-notes.com/code.php?q=10
This is the fast version; if you want a nicer version, you could try using an XML processor.
PS: users might notice that the file extension is not doc but xml and they may blame you for that, but that's ok... just rename it to doc, word will recognize the format and everyone is happy again ;)
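The "simple Java without any framework" approach described above can be sketched roughly as follows. The ${name} placeholder syntax and the class name WordXmlTemplateSketch are assumptions made for illustration; the sketch also assumes each placeholder sits inside a single text run of the saved XML, which Word does not guarantee.

```java
import java.util.Map;

/** Sketch: fill ${key} placeholders in a Word template that was saved as a single XML file. */
final class WordXmlTemplateSketch {
    /** Escapes characters that would otherwise break the XML document. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Replaces every ${key} with the escaped value; unknown placeholders are left untouched. */
    static String fill(String templateXml, Map<String, String> values) {
        String result = templateXml;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", escapeXml(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "<w:t>Dear ${name}</w:t>";
        System.out.println(fill(template, Map.of("name", "Smith & Co")));
        // prints <w:t>Dear Smith &amp; Co</w:t>
    }
}
```

The template could be read and the result written with java.nio.file.Files.readString and Files.writeString. Escaping the values matters here because an unescaped & or < would leave a file that Word refuses to open.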

You should look into the Aspose.Words components. They have recently begun providing a Java version of the component.
See the following link: Aspose.Word for Java
This supports Word automation, creation and advanced features such as mail merging without the need for an instance of Microsoft Word on the machine. The real benefits are that you are able to work within the context of an actual word document and not having to compromise by creating RTFs etc.
The Java version is not currently as fully featured as the .Net version but the main core functionality is there and they are pushing very hard to have a feature equivalent version soon.
Also, if you purchase the Java version, you get a year's free upgrades and support as new releases are created.

If you are working with docx documents, docx4j is an option. Like POI, it's open source.

I created and use this: http://code.google.com/p/java2word

File format conversion library

Are there any well known solutions that meet/exceed below requirements?
conversion from multiple non-graphical document formats to and from HTML (e.g. doc<->HTML, pdf<->html, odt<->html, etc.)
command line or API (Java API is preferable)
cross-platform
commercial or open source

OpenOffice has a rich API that supports conversion between the various supported formats. Check out this question. It recommends using JODConverter.
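As an alternative to the API route, LibreOffice (the OpenOffice descendant) can also convert files from the command line, which can be driven from Java with ProcessBuilder. The sketch below is not JODConverter; it assumes a soffice executable is installed and reachable, and the class name and file names are placeholders.

```java
import java.io.IOException;
import java.util.List;

/** Sketch: drive LibreOffice's command-line converter from Java. */
final class OfficeConvertSketch {
    /** Builds the command line, e.g. soffice --headless --convert-to html --outdir out report.doc */
    static List<String> command(String sofficePath, String targetFormat, String outDir, String inputFile) {
        return List.of(sofficePath, "--headless", "--convert-to", targetFormat,
                "--outdir", outDir, inputFile);
    }

    /** Runs the command and returns the process exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder paths; adjust to the local installation and files.
        List<String> cmd = command("soffice", "html", "out", "report.doc");
        System.out.println(String.join(" ", cmd));
        // int exit = run(cmd); // uncomment on a machine with LibreOffice installed
    }
}
```

Which input and output formats work depends on the filters shipped with the installed office suite.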

With DocBook you can export to various output formats, but reverting is always hard. For pdf you can try iText

Having written an all-in-one TeX/LaTeX -> HTML, ASCII text and RTF converter, I would say this would be quite an undertaking.
The problem is that these various 'document' formats are intended for rather different purposes. While there are indeed conversion tools between some of these formats, there is often a conceptual disparity in the structure, meaning and implementation of 'document', and it is very often necessary to trade off features supported by one format to hack together an acceptable output in another.
For example, PDF is very strong in presentation, precise positioning and support for fonts, whereas HTML is more concerned with structure, with practically no consideration for these things (without CSS).
I am curious how you envision such an API being used, when usually someone simply wants a conversion program?
