HTML to Markdown with Java

HTML to Markdown with Java - java

is there an easy way to transform HTML into markdown with JAVA?
I am currently using the Java MarkdownJ library to transform markdown to html.
import com.petebevin.markdown.MarkdownProcessor;
...
public static String getHTML(String markdown) {
MarkdownProcessor markdown_processor = new MarkdownProcessor();
return markdown_processor.markdown(markdown);
}
public static String getMarkdown(String html) {
/* TODO Ask stackoverflow */
}

There is a great library for JS called Turndown, you can try it online here. It works for htmls that the accepted answer errors out.
I needed it for Java (as the question), so I ported it. The library for Java is called CopyDown, it has the same test suite as Turndown and I've tried it with real examples that the accepted answer was throwing errors.
To install with gradle:
dependencies {
compile 'io.github.furstenheim:copy_down:1.0'
}
Then to use it:
CopyDown converter = new CopyDown();
String myHtml = "<h1>Some title</h1><div>Some html<p>Another paragraph</p></div>";
String markdown = converter.convert(myHtml);
System.out.println(markdown);
> Some title\n==========\n\nSome html\n\nAnother paragraph\n
PS. It has MIT license

I am working on the same issue, and experimenting with a couple different techniques.
The answer above could work. You could use the jTidy library to do the initial cleanup work and convert from HTML to XHTML. You use the XSLT stylesheet linked above.
Unfortunately there is no library that has a one-stop function to do this in Java. You could try using the Python script html2text with Jython, but I haven't yet tried this!

There is a Java Library called flexmark which has such a feature.
Maven Dependency:
<dependency>
<groupId>com.vladsch.flexmark</groupId>
<artifactId>flexmark-html2md-converter</artifactId>
<version>0.64.0</version>
</dependency>
Using the class com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter you can convert an HTML String to a Markdown String in one line like this:
String md = FlexmarkHtmlConverter.builder().build().convert(html);

if you are using WMD editor and want to get the markdown code on the server side, just use these options before loading the wmd.js script:
wmd_options = {
// format sent to the server. can also be "HTML"
output: "Markdown",
// line wrapping length for lists, blockquotes, etc.
lineLength: 40,
// toolbar buttons. Undo and redo get appended automatically.
buttons: "bold italic | link blockquote code image | ol ul heading hr",
// option to automatically add WMD to the first textarea found.
autostart: true
};

There is a Haskell library called pandoc that can convert between most markup formats.
Although it is not a Java library, it can be used through its CLI in Java.
You can get and install the latest version from here. Read the getting started guides here.
var command = "pandoc --to=markdown_strict --output=result.md input.html";
var pandoc = new ProcessBuilder()
.command(command.split(" "))
.directory(new File(".")) // Working directory
.start();
pandoc.waitFor();
// The output result.md will be created in the working directory
This tool can also be used in GitHub Actions workflows.

Related

Reading the spss file java

SPSSReader reader = new SPSSReader(args[0], null);
Iterator it = reader.getVariables().iterator();
while (it.hasNext())
{
System.out.println(it.next());
}
I am using this SPSSReader to read the spss file. Here,every string is printed with some junk characters appended with it.
Obtained Result :
StringVariable: nameogr(nulltpc{)(10)
NumericVariable: weightppuo(nullf{nd)
DateVariable: datexsgzj(nulllanck)
DateVariable: timeppzb(null|wt{l)
DateVariable: datetimegulj{(null|ns)
NumericVariable: commissionyrqh(nullohzx)
NumericVariable: priceeub{av(nullvlpl)
Expected Result :
StringVariable: name (10)
NumericVariable: weight
DateVariable: date
DateVariable: time
DateVariable: datetime
NumericVariable: commission
NumericVariable: price
Thanks in advance :)

I tried recreating the issue and found the same thing.
Considering that there is a licensing for that library (see here), I would assume that this might be a way of the developers to ensure that a license is bought as the regular download only contains a demo version as evaluation (see licensing before the download).
As that library is rather old (copyright of the website is 2003-2008, requirement for the library is Java 1.2, no generics, Vectors are used, etc), I would recommend a different library as long as you are not limited to the one used in your question.
After a quick search, it turned out that there is an open source spss reader here which is also available through Maven here.
Using the example on the github page, I put this together:
import com.bedatadriven.spss.SpssDataFileReader;
import com.bedatadriven.spss.SpssVariable;
public class SPSSDemo {
public static void main(String[] args) {
try {
SpssDataFileReader reader = new SpssDataFileReader(args[0]);
for (SpssVariable var : reader.getVariables()) {
System.out.println(var.getVariableName());
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
I wasn't able to find stuff that would print NumericVariable or similar things but as those were the classnames of the library you were using in the question, I will assume that those are not SPSS standardized. If they are, you will either find something like that in the library or you can open an issue on the github page.
Using the employees.sav file from here I got this output from the code above using the open source library:
resp_id
gender
first_name
last_name
date_of_birth
education_type
education_years
job_type
experience_years
monthly_income
job_satisfaction
No additional characters no more!
Edit regarding the comment:
That is correct. I read through some SPSS stuff though and from my understanding there are only string and numeric variables which are then formatted in different ways. The version published in maven only gives you access to the typecode of a variable (to be honest, no idea what that is) but the github version (that does not appear to be published on maven as 1.3-SNAPSHOT unfortunately) does after write- and printformat have been introduced.
You can clone or download the library and run mvn clean package (assuming you have maven installed) and use the generated library (found under target\spss-reader-1.3-SNAPSHOT.jar) in your project to have the methods SpssVariable#getPrintFormat and SpssVariable#getWriteFormat available.
Those return an SpssVariableFormat which you can get more information from. As I have no clue what all that is about, the best I can do is to link you to the source here where references to the stuff that was implemented there should help you further (I assume that this link referenced to in the documentation of SpssVariableFormat#getType is probably the most helpful to determine what kind of format you have there.
If absolutely NOTHING works with that, I guess you could use the demo version of the library in the question to determine the stuff through it.next().getClass().getSimpleName() as well but I would resort to that only if there is no other way to determining the format.

I am not sure, but looking at your code, it.next() is returning a Variable object.
There has to be some method to be chained to the Variable object, something like it.next().getLabel() or it.next().getVariableName(). toString() on an Object is not always meaningful. Check toString() method of Variable class in SPSSReader library.

vaadin javascript

I am trying to integrate some js project into my vaadin app. And I tried the methods to invoke external javascript code, only successed on some easy/javascript.
I tried the NativeJs lib
and the code is:
NativeJS nativeJScomponent = new NativeJS();
nativeJScomponent.setJavascript("alert('foo');");
nativeJScomponent.execute();
addComponent(nativeJScomponent);
However, when I used some other code like:
String jsCode = "<div /><script type=\"text/javascript\">var d = new Date();"
+ "var time = d.getHours();"
+ "if (time < 10) {document.write(\"<b>Good morning</b>\");}</script>";
NativeJS nativeJScomponent = new NativeJS();
nativeJScomponent.setJavascript(jsCode);
nativeJScomponent.execute();
addComponent(nativeJScomponent);
It failed to display content.
And I also used the customelayout and label to run javascript. The same results came.
Is there any method to integrate vaadin with js?

Window window = new Window("");
window.executeJavascript("javascript:testFunctionCall();");

The correct way to do this is to implement AbstractJavaScriptComponent. This allows you to include a plain Javascript file in your project and control that library from the server side through RPC calls and shared state.
An easy tutorial can be found here and the Book of Vaadin goes more in-depth here. You can also take a look at the source code of my Vaadin addon which integrates a Javascript media library with Vaadin.

have you looked and tried to run the source in http://uilder.virtuallypreinstalled.com/run/Calling_External_JavaScript/

Is there a decent, customisable, HTML to Markdown Java API?

I want to save text I scrape from various sources without the HTML tags that are on it, but also keeping as much of the structure as I reasonably can.
Markdown seems to be the solution to this (or possibly MultiMarkdown).
There is a question which offers a suggestion on converting from HTML to markdown, but I want to specify some specific things:
ALL links (including images) are referenced at the END only (i.e. no inline urls)
NO embeded HTML (I'm not even 100% sure yet how I'd like to deal with difficult HTML... but it won't be embeded!)
So my question is as stated in the title: Is there a decent, customisable, HTML to Markdown Java API?

You could try adapting HtmlCleaner which provides a workable interface onto the DOM:
TagNode root = htmlCleaner.clean( stream );
Object[] found = root.evaluateXPath( "//div[id='something']" );
if( found.length > 0 && found instanceof TagNode ) {
((TagNode)found[0]).removeFromTree();
}
This would allow you to structure your output stream in any format that you want using a fairly simple API.

There is a great library for JS called Turndown, you can try it online here. It can be partially customized. For example, links can be referenced at the end. And as far as I know there is no embedded html, everything is transformed.
I needed it for Java (as the linked question), so I ported it. The library for Java is called CopyDown, it has the same test suite as Turndown.
To install with gradle:
dependencies {
compile 'io.github.furstenheim:copy_down:1.0'
}
Then to use it:
CopyDown converter = new CopyDown();
String myHtml = "<h1>Some title</h1><div>Some html<p>Another paragraph</p></div>";
String markdown = converter.convert(myHtml);
System.out.println(markdown);
> Some title\n==========\n\nSome html\n\nAnother paragraph\n

How to sanitize HTML code in Java to prevent XSS attacks?

I'm looking for class/util etc. to sanitize HTML code i.e. remove dangerous tags, attributes and values to avoid XSS and similar attacks.
I get html code from rich text editor (e.g. TinyMCE) but it can be send malicious way around, ommiting TinyMCE validation ("Data submitted form off-site").
Is there anything as simple to use as InputFilter in PHP? Perfect solution I can imagine works like that (assume sanitizer is encapsulated in HtmlSanitizer class):
String unsanitized = "...<...>..."; // some potentially
// dangerous html here on input
HtmlSanitizer sat = new HtmlSanitizer(); // sanitizer util class created
String sanitized = sat.sanitize(unsanitized); // voila - sanitized is safe...
Update - the simpler solution, the better! Small util class with as little external dependencies on other libraries/frameworks as possible - would be best for me.
How about that?

You can try OWASP Java HTML Sanitizer. It is very simple to use.
PolicyFactory policy = new HtmlPolicyBuilder()
.allowElements("a")
.allowUrlProtocols("https")
.allowAttributes("href").onElements("a")
.requireRelNofollowOnLinks()
.build();
String safeHTML = policy.sanitize(untrustedHTML);

Thanks to #Saljack's answer. Just to elaborate more to OWASP Java HTML Sanitizer. It worked out really well (quick) for me. I just added the following to the pom.xml in my Maven project:
<dependency>
<groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
<artifactId>owasp-java-html-sanitizer</artifactId>
<version>20150501.1</version>
</dependency>
Check here for latest release.
Then I added this function for sanitization:
private String sanitizeHTML(String untrustedHTML){
PolicyFactory policy = new HtmlPolicyBuilder()
.allowAttributes("src").onElements("img")
.allowAttributes("href").onElements("a")
.allowStandardUrlProtocols()
.allowElements(
"a", "img"
).toFactory();
return policy.sanitize(untrustedHTML);
}
More tags can be added by extending the comma delimited parameter in allowElements method.
Just add this line prior passing the bean off to save the data:
bean.setHtml(sanitizeHTML(bean.getHtml()));
That's it!
For more complex logic, this library is very flexible and it can handle more sophisticated sanitizing implementation.

You could use OWASP ESAPI for Java, which is a security library that is built to do such operations.
Not only does it have encoders for HTML, it also has encoders to perform JavaScript, CSS and URL encoding. Sample uses of ESAPI can be found in the XSS prevention cheatsheet published by OWASP.
You could use the OWASP AntiSamy project to define a site policy that states what is allowed in user-submitted content. The site policy can be later used to obtain "clean" HTML that is displayed back. You can find a sample TinyMCE policy file on the AntiSamy downloads page.

HTML escaping inputs works very well. But in some cases business rules might require you NOT to escape the HTML. Using REGEX is not fit for the task and it is too hard to come up with a good solution using it.
The best solution I found was to use: http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
It builds a DOM tree with the provided input and filters any element not previosly allowed by a Whitelist. The API also has other functions for cleaning up html.
And it can also be used with javax.validation #SafeHtml(whitelistType=, additionalTags=)

Regarding Antisamy, you may want to check this regarding the dependencies:
http://code.google.com/p/owaspantisamy/issues/detail?id=95&can=1&q=redyetidave

Unable to incorporate Eclispe JDT codeAssist facilities outside a Plug-in

Using Eclipse jdt facilities, you can traverse the AST of java code snippets as follows:
ASTParser ASTparser = ASTParser.newParser(AST.JLS3);
ASTparser.setSource("package x;class X{}".toCharArray());
ASTparser.createAST(null).accept(...);
But when trying to perform code complete & code selection it seems that I have to do it in a plug-in application since I have to write codes like
IFile file = ResourcesPlugin.getWorkspace().getRoot().getFile(new Path(somePath));
ICodeAssist i = JavaCore.createCompilationUnitFrom(f);
i.codeComplete/codeSelect(...)
Is there anyway that I can finally get a stand-alone java application which incorporates the jdt code complete/select facilities?
thx a lot!
shi kui
I have noticed it that using org.eclipse.jdt.internal.codeassist.complete.CompletionParser
I can parse a code snippet as well.
CompletionParser parser =new CompletionParser(new ProblemReporter(
DefaultErrorHandlingPolicies.proceedWithAllProblems(),
new CompilerOptions(null),
new DefaultProblemFactory(Locale.getDefault())),
false);
org.eclipse.jdt.internal.compiler.batch.CompilationUnit sourceUnit =
new org.eclipse.jdt.internal.compiler.batch.CompilationUnit(
"class T{f(){new T().=1;} \nint j;}".toCharArray(), "testName", null);
CompilationResult compilationResult = new CompilationResult(sourceUnit, 0, 0, 0);
CompilationUnitDeclaration unit = parser.dietParse(sourceUnit, compilationResult, 25);
But I have 2 questions:
1. How to retrive the assist information?
2. How can I specify class path or source path for the compiler to look up type/method/field information?

I don't think so, unless you provide your own implementation of ICodeAssist.
As the Performing code assist on Java code mentions, Elements that allow this manipulation should implement ICodeAssist.
There are two kinds of manipulation:
Code completion - compute the completion of a Java token.
Code selection - answer the Java element indicated by the selected text of a given offset and length.
In the Java model there are two elements that implement this interface: IClassFile and ICompilationUnit.
Code completion and code selection only answer results for a class file if it has attached source.
You could try opening a File outside of any workspace (like this FAQ), but the result wouldn't implement ICodeAssist.
So the IFile most of the time comes from a workspace location.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

HTML to Markdown with Java - java

Related

Reading the spss file java

vaadin javascript

Is there a decent, customisable, HTML to Markdown Java API?

How to sanitize HTML code in Java to prevent XSS attacks?

Unable to incorporate Eclispe JDT codeAssist facilities outside a Plug-in

Categories

Resources