HTML to ODT – XSLT? - java

I'm trying to convert single pieces of HTML code to the XML format used by *.odt files (OpenOffice). For example, <p>This is some text</p> should be translated to <text:p>This is some text</text:p>. Of course, this should also work with lists etc.
I'm not sure whether the best way to go would be to use an XSLT processor (and if so, which one for Java?) and create the stylesheet myself. Isn't there a Java library out there that can already do this?
I'm using jodconverter to go from ODT->PDF, but even though OpenOffice Writer can handle copy&pasting the content and display it in the desired way, jodconverter doesn't seem to be able to "translate" single pieces of HTML (or am I wrong about that?).
Any ideas and suggestions would be very welcome. I should add that I'm absolutely new to Java. Thanks in advance
Ingo

XSLT is the best way to do it. The OpenDocument group is working on an HTML-to-ODT XSL template. Sadly, it is not ready yet.
You can check their website to stay in touch (and maybe get beta work).
Otherwise, there are unofficial projects, also based on XSLT, like this one.
It would be easy to apply a small transformation to your HTML to get valid XHTML before processing it to ODT.
Or just check this other example.
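To give an idea of what the XSLT route looks like from Java, here is a minimal sketch using the XSLT 1.0 processor that ships with the JDK (javax.xml.transform). The stylesheet is a hypothetical one written for this example and only covers <p>, <ul>/<ol> and <li>; the class and method names are made up as well. It assumes the input fragment is already well-formed XHTML with a single root element and no XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class HtmlFragmentToOdt {

    // Hypothetical stylesheet: maps paragraphs and lists to their ODF text
    // equivalents. Other elements fall through to XSLT's built-in rules,
    // which drop the tag and keep the text.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:text='urn:oasis:names:tc:opendocument:xmlns:text:1.0'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='p'>"
      + "<text:p><xsl:apply-templates/></text:p>"
      + "</xsl:template>"
      + "<xsl:template match='ul|ol'>"
      + "<text:list><xsl:apply-templates/></text:list>"
      + "</xsl:template>"
      + "<xsl:template match='li'>"
      + "<text:list-item><text:p><xsl:apply-templates/></text:p></text:list-item>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to one well-formed XHTML fragment.
    static String convert(String xhtmlFragment) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtmlFragment)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not transform fragment", e);
        }
    }

    public static void main(String[] args) {
        // Prints a text:p element carrying the ODF text namespace declaration.
        System.out.println(convert("<p>This is some text</p>"));
    }
}
```

The result is only the content fragment; it would still have to be placed inside the content.xml of an ODT package, which this sketch does not attempt.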

Related

Use flexmark-java to clean markdown

Within a Java application, I need to convert Markdown text into simple plain text instead of HTML (for example, dropping all link addresses and bold and italic markers).
What is the best way to do this? I was thinking of using a Markdown library like flexmark, but I can't find this feature at first sight. Is it there? Are there better alternatives?
Edit
Commonmark supports rendering to text, by using org.commonmark.renderer.text.TextContentRenderer instead of the default HTML renderer. Not sure what it does with newlines, but worth a try.
Original answer, using flexmark HTML + JSoup
The ideal solution would be to implement a custom Renderer for flexmark, but this would force you to write a model-to-string for all language features in markdown. Unless it supports this out of the box, but I'm not aware of this feature...
A simpler solution may be to use flexmark (or any other lightweight markdown renderer) and let it create the HTML. After that, just run the generated HTML through https://jsoup.org/ and let it extract the text:
Jsoup.parse(html).text(); // html: the String produced by the markdown renderer
String org.jsoup.nodes.Element.text()
Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.
For example, given HTML <p>Hello <b>there</b> now! </p>, p.text() returns Hello there now!
We use this approach to get a "preview" of the text entered in a rich content editor (summernote), after being sanitized with org.owasp.html.HtmlSanitizer.
flexmark also has a Markdown-to-text feature.
Check this out.

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse through may end up being quite large (a 15-20 page document).
Thanks in advance
Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task, and it makes it easy to obtain anything in an HTML page using its selector syntax. Check out the usage examples in their cookbook.
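One caveat: the [%Name%] tokens from the question are plain text, not HTML markup, so finding them does not by itself require parsing HTML. Below is a minimal sketch that scans a string for such tokens with java.util.regex; the class and method names are made up for this example, and it assumes the editor has not split a token across inline tags (if it has, extract the text first, for instance with jsoup, and scan that instead).

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderScanner {

    // Matches tokens such as [%Name%]; the token syntax comes from the question.
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\[%\\s*([A-Za-z0-9_]+)\\s*%\\]");

    // Returns the distinct field names in order of first appearance.
    static Set<String> findFields(String text) {
        Set<String> fields = new LinkedHashSet<>();
        Matcher matcher = PLACEHOLDER.matcher(text);
        while (matcher.find()) {
            fields.add(matcher.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<p>Hello [%Name%], you are [%age%] years old. Bye [%Name%].</p>";
        System.out.println(findFields(html)); // prints [Name, age]
    }
}
```

The scan is a single linear pass over the input, so a 15-20 page document should not be a problem.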

Parse javascript generated content using Java

http://support.xbox.com/en-us/contact-us uses javascript to create some lists. I want to be able to parse these lists for their text. So for the above page I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I was trying to use JSoup for a while before noticing it was generated using javascript. I have no idea how to go about parsing a page for its javascript generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
You could always read in the whole page, split it into lines (on line breaks, etc.), look for the line containing the information, and then pull the pieces you want out of that string. That is the dirty way of doing it; I'm not sure if there is a clean way.
I don't think that text is generated by javascript... If I disable javascript those options can be found inside the html at this location (a jquery selector just because it was easier to hand-write than figuring out the xpath without javascript enabled :))
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless, in direct answer to your query: you'd have to evaluate the javascript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the javascript file generating the relevant content and just parsing that directly.

Generate HTML from plain text using Java

I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println every single line of the HTML file to the output file. For example:
p.println("<html>");
p.println("<script>");
etc. There has to be a simpler way, right?
How about using a JSP scriptlet and JSTL? You could create a custom object which holds all the important information and display it formatted using the Expression Language.
Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.
There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.
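To illustrate the idea without pulling in a particular engine, here is a minimal sketch that fills ${name} variables in a template string and HTML-escapes the values on the way in. The ${...} syntax, the class name and the method names are assumptions made for this example and do not belong to any specific templating library; the template is inlined here, but it could just as well be read from a text file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyHtmlTemplate {

    // Hypothetical variable syntax: ${name}
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{(\\w+)\\}");

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Replaces each ${name} with the escaped value; unknown names become "".
    static String render(String template, Map<String, String> values) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(out, Matcher.quoteReplacement(escape(value)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("level", "ERROR");
        row.put("message", "x < 1 & y");
        // prints <tr><td>ERROR</td><td>x &lt; 1 &amp; y</td></tr>
        System.out.println(render("<tr><td>${level}</td><td>${message}</td></tr>", row));
    }
}
```

A real templating engine adds loops, conditionals and includes on top of this, which is what you would want for generating whole tables from a log file.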

Java HTML normalizer?

Is there a library which can transform any given HTML page, with JS and CSS all over it, into a minimalistic uniform format?
For instance, if we render stackoverflow homepage, I want it to be shown in a minimal format. I want all other sites to be rendered down.
Sort of like Lynx web browser but with minimal graphics.
The best tool for HTML to Lynx style text I have come across is Jericho's Renderer.
It's easy to use:
Source source=new Source(new URL(sourceUrlString)); // or new Source("<html>pass in raw html string</html>");
String renderedText=source.getRenderer().toString();
System.out.println("\nSimple rendering of the HTML document:\n");
System.out.println(renderedText);
(from here)
and handles HTML in the wild (badly formatted) very well.
Here's the first few lines of this page formatted this way using Jericho:
Stack Exchange log in | careers | chat
| meta | about | faq
Stack Overflow
* Questions
* Tags
* Users
* Badges
* Unanswered
* Ask Question
Java HTML normalizer?
**
IS there a library which can transform
any given HTML page with JS, CSS all
over it, into a minimalistic uniform
format?
For instance, if we render
stackoverflow homepage, I want it to
be shown in a minimal format. I want
all other sites to be rendered down.
Sort of like Lynx web browser but with
minimal graphics.
java lynx link|edit|flag asked 2 days
ago Kim Jong Woo 593112 89% accept
rate Do you want to transform your
HTML code to simpler HTML code, or do
your want to show this "minimalistic
uniform format" to your user? Or do
you want to create a image? – Paŭlo
Ebermann yesterday simpler html code
without sacrificing the relative
positioning of the elements. – Kim
Jong Woo 16 hours ago
2 Answers
To answer your firtst question: No. I
don'nt think there is a library for
that purpose. (At least this is what
my "googeling" resulted in).
And i think the reason for this is,
that what you want is a very special
need.
So as a solution for your problem you
can parse the html and display it the
way you want to in a JEditorpane or
whatever you are using for display.
I can only suggest a way i would do it
(this is because i am familiar with
xml and everything around it).
*
Use a library to ensure that your html conforms to xhtml:
http://htmlcleaner.sourceforge.net/release.php
*
then either parse the xml with DOM or SAX parsers and display it the
way you want.
or
* use xslt to transform the document into some other html document
which results in a view that fits your
needs.
or
* use one of the available html parser librarys. (The most of which i
found where kind of outdated (2006))
but they could be an option for you.
This is just one suggestion how you
could do it. I'm sure there are
thousands of other ways which will do
the same thing.
To answer your first question: No. I don't think there is a library for that purpose. (At least this is what my googling turned up.)
And I think the reason for this is that what you want is a very special need.
So as a solution for your problem, you can parse the HTML and display it the way you want in a JEditorPane or whatever you are using for display.
I can only suggest the way I would do it (this is because I am familiar with XML and everything around it).
Use a library to ensure that your HTML conforms to XHTML: http://htmlcleaner.sourceforge.net/release.php
Then either parse the XML with DOM or SAX parsers and display it the way you want,
or
use XSLT to transform the document into some other HTML document which results in a view that fits your needs,
or
use one of the available HTML parser libraries. (Most of the ones I found were kind of outdated (2006), but they could be an option for you.)
This is just one suggestion for how you could do it. I'm sure there are thousands of other ways which will do the same thing.
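As a rough illustration of the "clean to XHTML, then walk it with DOM" route, here is a minimal sketch using only the JDK's DOM parser. It assumes the input has already been cleaned into well-formed XHTML without a DOCTYPE; the class name and the choice of what to keep (text only, script and style dropped, list items prefixed with "* ") are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XhtmlTextOutline {

    // Parses well-formed XHTML and returns a plain-text outline of it.
    static String render(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Depth-first walk: each non-empty text node goes on its own line,
    // script and style subtrees are skipped, list items get a "* " prefix.
    private static void walk(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(text).append('\n');
            }
            return;
        }
        String name = node.getNodeName();
        if (name.equals("script") || name.equals("style")) {
            return;
        }
        if (name.equals("li")) {
            out.append("* ");
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p{}</style></head><body>"
            + "<h1>Title</h1><ul><li>One</li><li>Two</li></ul>"
            + "<script>var x;</script></body></html>";
        System.out.print(render(page));
    }
}
```

For the sample page this prints "Title", "* One" and "* Two" on separate lines. Emitting simplified HTML instead of text would follow the same walk, writing tags rather than dropping them.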
