Is there a decent, customisable, HTML to Markdown Java API? - java

I want to save text I scrape from various sources without the HTML tags that are on it, but also keeping as much of the structure as I reasonably can.
Markdown seems to be the solution to this (or possibly MultiMarkdown).
There is a question which offers a suggestion on converting from HTML to markdown, but I want to specify some specific things:
ALL links (including images) are referenced at the END only (i.e. no inline urls)
NO embeded HTML (I'm not even 100% sure yet how I'd like to deal with difficult HTML... but it won't be embeded!)
So my question is as stated in the title: Is there a decent, customisable, HTML to Markdown Java API?

You could try adapting HtmlCleaner which provides a workable interface onto the DOM:
TagNode root = htmlCleaner.clean( stream );
Object[] found = root.evaluateXPath( "//div[id='something']" );
if( found.length > 0 && found instanceof TagNode ) {
((TagNode)found[0]).removeFromTree();
}
This would allow you to structure your output stream in any format that you want using a fairly simple API.

There is a great library for JS called Turndown, you can try it online here. It can be partially customized. For example, links can be referenced at the end. And as far as I know there is no embedded html, everything is transformed.
I needed it for Java (as the linked question), so I ported it. The library for Java is called CopyDown, it has the same test suite as Turndown.
To install with gradle:
dependencies {
compile 'io.github.furstenheim:copy_down:1.0'
}
Then to use it:
CopyDown converter = new CopyDown();
String myHtml = "<h1>Some title</h1><div>Some html<p>Another paragraph</p></div>";
String markdown = converter.convert(myHtml);
System.out.println(markdown);
> Some title\n==========\n\nSome html\n\nAnother paragraph\n

Related

Reading the spss file java

SPSSReader reader = new SPSSReader(args[0], null);
Iterator it = reader.getVariables().iterator();
while (it.hasNext())
{
System.out.println(it.next());
}
I am using this SPSSReader to read the spss file. Here,every string is printed with some junk characters appended with it.
Obtained Result :
StringVariable: nameogr(nulltpc{)(10)
NumericVariable: weightppuo(nullf{nd)
DateVariable: datexsgzj(nulllanck)
DateVariable: timeppzb(null|wt{l)
DateVariable: datetimegulj{(null|ns)
NumericVariable: commissionyrqh(nullohzx)
NumericVariable: priceeub{av(nullvlpl)
Expected Result :
StringVariable: name (10)
NumericVariable: weight
DateVariable: date
DateVariable: time
DateVariable: datetime
NumericVariable: commission
NumericVariable: price
Thanks in advance :)
I tried recreating the issue and found the same thing.
Considering that there is a licensing for that library (see here), I would assume that this might be a way of the developers to ensure that a license is bought as the regular download only contains a demo version as evaluation (see licensing before the download).
As that library is rather old (copyright of the website is 2003-2008, requirement for the library is Java 1.2, no generics, Vectors are used, etc), I would recommend a different library as long as you are not limited to the one used in your question.
After a quick search, it turned out that there is an open source spss reader here which is also available through Maven here.
Using the example on the github page, I put this together:
import com.bedatadriven.spss.SpssDataFileReader;
import com.bedatadriven.spss.SpssVariable;
public class SPSSDemo {
public static void main(String[] args) {
try {
SpssDataFileReader reader = new SpssDataFileReader(args[0]);
for (SpssVariable var : reader.getVariables()) {
System.out.println(var.getVariableName());
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
I wasn't able to find stuff that would print NumericVariable or similar things but as those were the classnames of the library you were using in the question, I will assume that those are not SPSS standardized. If they are, you will either find something like that in the library or you can open an issue on the github page.
Using the employees.sav file from here I got this output from the code above using the open source library:
resp_id
gender
first_name
last_name
date_of_birth
education_type
education_years
job_type
experience_years
monthly_income
job_satisfaction
No additional characters no more!
Edit regarding the comment:
That is correct. I read through some SPSS stuff though and from my understanding there are only string and numeric variables which are then formatted in different ways. The version published in maven only gives you access to the typecode of a variable (to be honest, no idea what that is) but the github version (that does not appear to be published on maven as 1.3-SNAPSHOT unfortunately) does after write- and printformat have been introduced.
You can clone or download the library and run mvn clean package (assuming you have maven installed) and use the generated library (found under target\spss-reader-1.3-SNAPSHOT.jar) in your project to have the methods SpssVariable#getPrintFormat and SpssVariable#getWriteFormat available.
Those return an SpssVariableFormat which you can get more information from. As I have no clue what all that is about, the best I can do is to link you to the source here where references to the stuff that was implemented there should help you further (I assume that this link referenced to in the documentation of SpssVariableFormat#getType is probably the most helpful to determine what kind of format you have there.
If absolutely NOTHING works with that, I guess you could use the demo version of the library in the question to determine the stuff through it.next().getClass().getSimpleName() as well but I would resort to that only if there is no other way to determining the format.
I am not sure, but looking at your code, it.next() is returning a Variable object.
There has to be some method to be chained to the Variable object, something like it.next().getLabel() or it.next().getVariableName(). toString() on an Object is not always meaningful. Check toString() method of Variable class in SPSSReader library.

How to convert html with svg tags and css in to image in java

We can successfully convert an SVG into an image with Batik, however, I need to convert a whole HTML div, with SVG implemented within, along with its CSS presentation code, into an image.
Is there any modules / support within Batik or some other Java API for achieving this?
Selenium library for Java may help you. It can run a browser (ie, chrome, firefox, etc.) in background mode, and you can load an HTML and take a snapshot of the content.
Although it's designed for testing and automation, it's the only way I can offer to you.
Hope it helps.
http://www.seleniumhq.org/
We had the same problem, and we solved it by spawning an PhantomJS process.
Phantom takes an JavaScript file that will instruct its headless browser what to do.
You can wait until the page is fully loaded and then you can print the output into the console as a data URI.
Below is a very simple example from my PhantomJS scripts:
var page = require( "webpage" ).create();
var options = JSON.parse( phantom.args[ 0 ] );
page.open( options.url, "POST", decodeURIComponent( options.payload || "" ), function( status ) {
if ( status === "fail" ) {
phantom.exit( 1 );
}
var contents = page.renderBase64( "png" );
require( "system" ).stderr.write( contents );
});
This is not an easy task as what you are asking is the process called "html rendering" and is basically what browsers try to implement correctly for over 2 decades.
If the CSS you need rendered is fairly simple (no CSS3, no fancy stuff, etc.), then there is a high chance that one of the open-source renderers would be able to handle that (PhantomJS as an example). See #gustavohenke answer for more details.
If the CSS is moderately complex and if you are able to modify it if needed - then there are some fast but non-free renderers, like PrinceXML and DocRaptor.
If the CSS could be very complex and you are not able to make it simpler - then the only option would be to render it in a real browser. You can use Selenium for that as it has a way of running the browser, rendering your HTML in it and "screenshotting" the result all in automated fashion. See #Jorgeblom answer for more details.

How to sanitize HTML code in Java to prevent XSS attacks?

I'm looking for class/util etc. to sanitize HTML code i.e. remove dangerous tags, attributes and values to avoid XSS and similar attacks.
I get html code from rich text editor (e.g. TinyMCE) but it can be send malicious way around, ommiting TinyMCE validation ("Data submitted form off-site").
Is there anything as simple to use as InputFilter in PHP? Perfect solution I can imagine works like that (assume sanitizer is encapsulated in HtmlSanitizer class):
String unsanitized = "...<...>..."; // some potentially
// dangerous html here on input
HtmlSanitizer sat = new HtmlSanitizer(); // sanitizer util class created
String sanitized = sat.sanitize(unsanitized); // voila - sanitized is safe...
Update - the simpler solution, the better! Small util class with as little external dependencies on other libraries/frameworks as possible - would be best for me.
How about that?
You can try OWASP Java HTML Sanitizer. It is very simple to use.
PolicyFactory policy = new HtmlPolicyBuilder()
.allowElements("a")
.allowUrlProtocols("https")
.allowAttributes("href").onElements("a")
.requireRelNofollowOnLinks()
.build();
String safeHTML = policy.sanitize(untrustedHTML);
Thanks to #Saljack's answer. Just to elaborate more to OWASP Java HTML Sanitizer. It worked out really well (quick) for me. I just added the following to the pom.xml in my Maven project:
<dependency>
<groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
<artifactId>owasp-java-html-sanitizer</artifactId>
<version>20150501.1</version>
</dependency>
Check here for latest release.
Then I added this function for sanitization:
private String sanitizeHTML(String untrustedHTML){
PolicyFactory policy = new HtmlPolicyBuilder()
.allowAttributes("src").onElements("img")
.allowAttributes("href").onElements("a")
.allowStandardUrlProtocols()
.allowElements(
"a", "img"
).toFactory();
return policy.sanitize(untrustedHTML);
}
More tags can be added by extending the comma delimited parameter in allowElements method.
Just add this line prior passing the bean off to save the data:
bean.setHtml(sanitizeHTML(bean.getHtml()));
That's it!
For more complex logic, this library is very flexible and it can handle more sophisticated sanitizing implementation.
You could use OWASP ESAPI for Java, which is a security library that is built to do such operations.
Not only does it have encoders for HTML, it also has encoders to perform JavaScript, CSS and URL encoding. Sample uses of ESAPI can be found in the XSS prevention cheatsheet published by OWASP.
You could use the OWASP AntiSamy project to define a site policy that states what is allowed in user-submitted content. The site policy can be later used to obtain "clean" HTML that is displayed back. You can find a sample TinyMCE policy file on the AntiSamy downloads page.
HTML escaping inputs works very well. But in some cases business rules might require you NOT to escape the HTML. Using REGEX is not fit for the task and it is too hard to come up with a good solution using it.
The best solution I found was to use: http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
It builds a DOM tree with the provided input and filters any element not previosly allowed by a Whitelist. The API also has other functions for cleaning up html.
And it can also be used with javax.validation #SafeHtml(whitelistType=, additionalTags=)
Regarding Antisamy, you may want to check this regarding the dependencies:
http://code.google.com/p/owaspantisamy/issues/detail?id=95&can=1&q=redyetidave

best way to externalize HTML in GWT apps?

What's the best way to externalize large quantities of HTML in a GWT app? We have a rather complicated GWT app of about 30 "pages"; each page has a sort of guide at the bottom that is several paragraphs of HTML markup. I'd like to externalize the HTML so that it can remain as "unescaped" as possible.
I know and understand how to use property files in GWT; that's certainly better than embedding the content in Java classes, but still kind of ugly for HTML (you need to backslashify everything, as well as escape quotes, etc.)
Normally this is the kind of thing you would put in a JSP, but I don't see any equivalent to that in GWT. I'm considering just writing a widget that will simply fetch the content from html files on the server and then add the text to an HTML widget. But it seems there ought to be a simpler way.
I've used ClientBundle in a similar setting. I've created a package my.resources and put my HTML document and the following class there:
package my.resources;
import com.google.gwt.core.client.GWT;
import com.google.gwt.resources.client.ClientBundle;
import com.google.gwt.resources.client.TextResource;
public interface MyHtmlResources extends ClientBundle {
public static final MyHtmlResources INSTANCE = GWT.create(MyHtmlResources.class);
#Source("intro.html")
public TextResource getIntroHtml();
}
Then I get the content of that file by calling the following from my GWT client code:
HTML htmlPanel = new HTML();
String html = MyHtmlResources.INSTANCE.getIntroHtml().getText();
htmlPanel.setHTML(html);
See http://code.google.com/webtoolkit/doc/latest/DevGuideClientBundle.html for further information.
You can use some templating mechanism. Try FreeMarker or Velocity templates. You'll be having your HTML in files that will be retrieved by templating libraries. These files can be named with proper extensions, e.g. .html, .css, .js obsearvable on their own.
I'd say you load the external html through a Frame.
Frame frame = new Frame();
frame.setUrl(GWT.getModuleBase() + getCurrentPageHelp());
add(frame);
You can arrange some convention or lookup for the getCurrentPageHelp() to return the appropriate path (eg: /manuals/myPage/help.html)
Here's an example of frame in action.
In GWT 2.0, you can do this using the UiBinder.
<ui:UiBinder xmlns:ui='urn:ui:com.google.gwt.uibinder'>
<div>
Hello, <span ui:field='nameSpan’/>, this is just good ‘ol HTML.
</div>
</ui:UiBinder>
These files are kept separate from your Java code and can be edited as HTML. They are also provide integration with GWT widgets, so that you can easily access elements within the HTML from your GWT code.
GWT 2.0, when released, should have a ClientBundle, which probably tackles this need.
You could try implementing a Generator to load external HTML from a file at compile time and build a class that emits it. There doesn't seem to be too much help online for creating generators but here's a post to the GWT group that might get you started: GWT group on groups.google.com.
I was doing similar research and, so far, I see that the best way to approach this problem is via the DeclarativeUI or UriBind. Unfortunately it still in incubator, so we need to work around the problem.
I solve it in couple of different ways:
Active overlay, i.e.: you create your standard HTML/CSS and inject the GET code via <script> tag. Everywhere you need to access an element from GWT code you write something like this:
RootPanel.get("element-name").setVisible(false);
You write your code 100% GWT and then, if a big HTML chunk is needed, you bring it to the client either via IFRAME or via AJAX and then inject it via HTML panel like this:
String html = "<div id='one' "
+ "style='border:3px dotted blue;'>"
+ "</div><div id='two' "
+ "style='border:3px dotted green;'"
+ "></div>";
HTMLPanel panel = new HTMLPanel(html);
panel.setSize("200px", "120px");
panel.addStyleName("demo-panel");
panel.add(new Button("Do Nothing"), "one");
panel.add(new TextBox(), "two");
RootPanel.get("demo").add(panel);
Why not to use good-old IFRAME? Just create an iFrame where you wish to put a hint and change its location when GWT 'page' changes.
Advantages:
Hits are stored in separate maintainable HTML files of any structure
AJAX-style loading with no coding at all on server side
If needed, application could still interact with loaded info
Disadvantages:
Each hint file should have link to shared CSS for common look-and-feel
Hard to internationalize
To make this approach a bit better, you might handle loading errors and redirect to default language/topic on 404 errors. So, search priority will be like that:
Current topic for current language
Current topic for default language
Default topic for current language
Default error page
I think it's quite easy to create such GWT component to incorporate iFrame interactions
The GWT Portlets framework (http://code.google.com/p/gwtportlets/) includes a WebAppContentPortlet. This serves up any content from your web app (static HTML, JSPs etc.). You can put it on a page with additional functionality in other Portlets and everything is fetched with a single async call when the page loads.
Have a look at the source for WebAppContentPortlet and WebAppContentDataProvider to see how it is done or try using the framework itself. Here are the relevant bits of source:
WebAppContentPortlet (client side)
((HasHTML)getWidget()).setHTML(html == null ? "<i>Web App Content</i>" : html);
WebAppContentDataProvider (server side):
HttpServletRequest servletRequest = req.getServletRequest();
String path = f.path.startsWith("/") ? f.path : "/" + f.path;
RequestDispatcher rd = servletRequest.getRequestDispatcher(path);
BufferedResponse res = new BufferedResponse(req.getServletResponse());
try {
rd.include(servletRequest, res);
res.getWriter().flush();
f.html = new String(res.toByteArray(), res.getCharacterEncoding());
} catch (Exception e) {
log.error("Error including '" + path + "': " + e, e);
f.html = "Error including '" + path +
"'<br>(see server log for details)";
}
You can use servlets with jsps for the html parts of the page and still include the javascript needed to run the gwt app on the page.
I'm not sure I understand your question, but I'm going to assume you've factored out this common summary into it's own widget. If so, the problem is that you don't like the ugly way of embedding HTML into the Java code.
GWT 2.0 has UiBinder, which allows you to define the GUI in raw HTMLish template, and you can inject values into the template from the Java world. Read through the dev guide and it gives a pretty good outline.
Take a look at
http://code.google.com/intl/es-ES/webtoolkit/doc/latest/DevGuideClientBundle.html
You can try GWT App with html templates generated and binded on run-time, no compiling-time.
Not knowing GWT, but can't you define and anchor div tag in your app html then perform a get against the HTML files that you need, and append to the div? How different would this be from a micro-template?
UPDATE:
I just found this nice jQuery plugin in an answer to another StackOverflow question.

HTML to Markdown with Java

is there an easy way to transform HTML into markdown with JAVA?
I am currently using the Java MarkdownJ library to transform markdown to html.
import com.petebevin.markdown.MarkdownProcessor;
...
public static String getHTML(String markdown) {
MarkdownProcessor markdown_processor = new MarkdownProcessor();
return markdown_processor.markdown(markdown);
}
public static String getMarkdown(String html) {
/* TODO Ask stackoverflow */
}
There is a great library for JS called Turndown, you can try it online here. It works for htmls that the accepted answer errors out.
I needed it for Java (as the question), so I ported it. The library for Java is called CopyDown, it has the same test suite as Turndown and I've tried it with real examples that the accepted answer was throwing errors.
To install with gradle:
dependencies {
compile 'io.github.furstenheim:copy_down:1.0'
}
Then to use it:
CopyDown converter = new CopyDown();
String myHtml = "<h1>Some title</h1><div>Some html<p>Another paragraph</p></div>";
String markdown = converter.convert(myHtml);
System.out.println(markdown);
> Some title\n==========\n\nSome html\n\nAnother paragraph\n
PS. It has MIT license
I am working on the same issue, and experimenting with a couple different techniques.
The answer above could work. You could use the jTidy library to do the initial cleanup work and convert from HTML to XHTML. You use the XSLT stylesheet linked above.
Unfortunately there is no library that has a one-stop function to do this in Java. You could try using the Python script html2text with Jython, but I haven't yet tried this!
There is a Java Library called flexmark which has such a feature.
Maven Dependency:
<dependency>
<groupId>com.vladsch.flexmark</groupId>
<artifactId>flexmark-html2md-converter</artifactId>
<version>0.64.0</version>
</dependency>
Using the class com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter you can convert an HTML String to a Markdown String in one line like this:
String md = FlexmarkHtmlConverter.builder().build().convert(html);
if you are using WMD editor and want to get the markdown code on the server side, just use these options before loading the wmd.js script:
wmd_options = {
// format sent to the server. can also be "HTML"
output: "Markdown",
// line wrapping length for lists, blockquotes, etc.
lineLength: 40,
// toolbar buttons. Undo and redo get appended automatically.
buttons: "bold italic | link blockquote code image | ol ul heading hr",
// option to automatically add WMD to the first textarea found.
autostart: true
};
There is a Haskell library called pandoc that can convert between most markup formats.
Although it is not a Java library, it can be used through its CLI in Java.
You can get and install the latest version from here. Read the getting started guides here.
var command = "pandoc --to=markdown_strict --output=result.md input.html";
var pandoc = new ProcessBuilder()
.command(command.split(" "))
.directory(new File(".")) // Working directory
.start();
pandoc.waitFor();
// The output result.md will be created in the working directory
This tool can also be used in GitHub Actions workflows.

Categories

Resources