Manipulating HTML nodes with java javascript scripting API

Manipulating HTML nodes with java javascript scripting API - java

I'm using the Java Scripting API which is working quite well. Now I have a function where I want to get all <a> tags from a String and then add/remove attributes before returning the manipulated String. The problem of course is, that I can't just use document.getElementsByTagName. Is there any easy option that comes to your mind without going through regex-hell?
Please note that I'm currently running on Java 7 (with Rhino), planning to update to Java 8 (with Nashorn), so I don't want to use any Rhino specific APIs.

In the book "Learning JavaScript Design Patterns" by Addi Osmani, author mentions 3 alternatives to a similar problem, obviously being getElementById() the fastest.
Excerpt from book:
Imagine that we have a script where for each DOM element found on page
with class "foo," we wish to increment a counter. What's the most
efficient way to query for this collection of elements? Well, there
are a few different ways this problem could be tackled:
Select all of the elements in the page and then store references to them. Next, filter this collection and use regular expressions (or
another means) to store only those with the class "foo."
Use a modern native browser feature such as querySelectorForAll() to select all of the elements with the class "foo."
Use a netive feature such as getElementsByClassName() to similarly...
Another way is, since you're using Nashorn/Rhino, you could use the Java implementation of the Xerces library to manipulate the DOM.
Hope this helps you find out the solution.

Related

How to create a simple Italian Model for a Named Entity Extraction of Persons using OpenNLP?

I have to do a project with OpenNLP, strictly in italian language. Since it's almost impossible to find some existing structures in this language, my idea is to create a simple model myself. Reading some posts on this platform, my idea is try to do this using model-builder addon.
First of all, it's possible to obtain my goal with this addon?
If so, referring to this other post, what kind of file is meant by "modelOutFile"? In my case I don't have an existing model.
N.B.: the addon uses some deprecated functions (such as nameFinderME.train()).
Naively, I tried to pass as a "modelOutFile" a simple empty file "model.bin", but, of course I bumped into an error:
Cannot invoke "java.util.Properties.getProperty(String)" because "manifest" is null
Furthermore, I used a few names and sentences for the test (I only wanted to know if this worked), not the large amount requested (15000 sentences at least).
I'm open to other suggestions instead of the use of modelbuilder addons.
Hope someone can help me.

Parsing non-fixed format binary payload with a custom javascript conversion in Vorto

We are using Vorto now mainly as a normalized format and are starting to look into using the mapping engine for mapping different payload formats to Vorto model as well. I more or less understand how to map functionblock properties from JSON or binary payload using xpath and the conversion functions. However, I'm not clear how to support parsing of non-fixed format binary payload using this method.
For instance we have an off the shelf LoRaWAN sensor which transmits in the following format:
<length><frame type>[<sensor-id><sensor-value>] where length is the total frame length and sensor-id (for eg temperature, humidity, battery, ...) describes how to parse the sensor-value (ie length, datatype). In one frame multiple of these readings may be present in random order.
Parsing this can be done easily in for instance loraserver.io using a small javascript function which iterates over all the bytes en returns the parsed properties. The same way will work in the Ditto payload mapping engine afaik.
However, currently I don't see how to do something similar in Vorto mapping. This is just one specific sensor example of course, but more examples exist on the market using similar dynamic payload format. I know there is already an open issue (#1535) to improve the documentation, but it would already be helpful to know if such flexible parsing would be possible using the mapping DSL.
I tried passing the raw payload as bytearray to the javascript function. In order to test this I duplicated the org.eclipse.vorto.mapping.engine.converter.binary.BinaryMappingTest#testMappingBinaryContaining2DataPoints and adapted the model to use a custom javascript function like this
evaluator.addScriptFunction(new ScriptClassFunction("extractTemperature",
"function extractTemperature(value) { " +
" print(\"parameter of type \" + typeof value + \", value = \" + value);" +
" print(value[1]);" +
"}"));
The output of this function is
parameter of type number, value = 1
undefined
Where the value 1 is the first element of the bytearray used.
So the function does not seem to receive the parameter as bytarray.
The model is configured with .withXPathStereotype("custom:extractTemperature(data)", "demo") so the payload is passed (as BinaryData) in the same way as in the testMappingBinaryContaining2DataPoints test (.withXPathStereotype("custom:convert(vorto_conversion1:byteArrayToInt(data,0,0,0,2))", "demo")). The only difference I see now is that in the testMappingBinaryContaining2DataPoints test is that the byetarray parameter is passed to a Java function instead of a javascript function. Or am I missing something?
Also, I noticed that loop keywords like for and while are not allowed in the javascript code. So even if I can access the bytearray parameter in the javascript function I see no way for now how to iterate over this.
On gitter I received following reply (together with the suggestion to move discussion to SO)
You are right. We restricted the Javascript function usage to very rudimentary set of language keywords excluding for loops as nasty stuff can be implemented there. What you could do Instead is to register a java function In your own namespace to the mapping engine. That function can hold a byte array. Later this function can be contributed to the mapping engine as a standard function to extract a certain value out for other developers to reuse.
I don't think this is solution to the problem however. As mentioned above this is just one example of an off the shelf sensor payload format, and I don't see how this can be generalized enough to include as a generic function in the mapping engine. And I don't think it should be required to implement a sensor specific conversion in Java, since (as an end-user of an IoT platform wanting to deploy a new sensor type) this is more complex to develop and deploy than a little javascript function which can be altered at runtime in the mapping spec. I see a lot of value in being able to do simple mappings in javascript, just like this can be done in for example loraserver.io and Eclipse Ditto.
I think being able to pass a byte array to javascript is a first step. Also I wonder where exactly the risk is in allowing loops in the javascript? For example Ditto also has some restrictions in the javascript sandbox (see here) but this allows loops and only prevents endless looping and recursion.
They state the following:
Using Rhino instead of Nashorn, the newer JavaScript engine shipped with Java, has the benefit that sandboxing can be applied in a better way.
Sandboxing of different payload scripts is required as Ditto is intended to be run as cloud service where multiple connections to different endpoints are managed for different tenants at the same time. This requires the isolation of each single script to avoid interference with other scripts and to protect the JVM executing the script against harmful code execution.
Would using Rhino in Vorto as well allow to control the risks you see and allow loop construct in Vorto mapping?
PS: can someone with enough SO reputation points add the tag eclipse-vorto please?

I created an issue for you request to support this in the Javascript converters: https://github.com/eclipse/vorto/issues/2029
As stated in the issue, as a current workaround, you can register your own custom converter function with Java and re-use this function across your mappings. In these java converter functions, you have all the power of the java language to convert to extract the right property from the arbitrary list.
In order to find out how to implement your own custom converter function with Java, take a look here: https://github.com/eclipse/vorto/tree/master/mapping-engine#Advanced-Usage

Since Eclipse Vorto 0.12.3 release, a fix for your request is available. With this it is possible to pass array object to javascript Converter as well as use for loops inside javascript functions. You might wanna give it a try.
See release notes https://github.com/eclipse/vorto/blob/master/docs/release-notes.md

Use XPath to create DOM-Nodes

I'm using javax.xml.xpath to read nodes out of a XML-DOM and to create Java-Objects from the read data.
After changing the data of these objects and perhaps creating new objects, I would like to write them back to the XML-DOM.
So I was wondering if it is possible to use xpath to also create nodes at specific positions in the XML-DOM. I am not sure if xpath is designed to write to DOM because its a "Query"-Language. But on the other hand SQL is also a query-language and is able to write data to databases.
So my general question is: Is it possible to create DOM-Nodes with XPaths?

No, you will need to use low-level DOM methods to create new nodes.
Are you sure you are using the right approach? Could the whole application be written more easily in XSLT? Even if you want to use a Java tree-based API, why DOM, which is so slow and unwieldy compared with subsequent tree models such as JDOM and XOM?

No that is not possible as far as I know. But the elements returned are 'live' which means any change made on them is directly reflected in the dom.

Java: Best way to remove Javascript from HTML

What's the best library/approach for removing Javascript from HTML that will be displayed?
For example, take:
<html><body><span onmousemove='doBadXss()'>test</span></body></html>
and leave:
<html><body><span>test</span></body></html>
I see the DeXSS project. But is that the best way to go?

JSoup has a simple method for sanitizing HTML based on a whitelist.
Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
It uses a whitelist, which is safer then the blacklist approach DeXSS uses. From the DeXSS page:
There are still a number of known XSS attacks that DeXSS does not yet detect.
A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So unknown, possibly unsafe constructions will only be protected against with a whitelist.

The easiest way would be to not have those in the first place... It probably would make sense to allow only very simple tags to be used in free-form fields and to disallow any kind of attributes.
Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.
Similarly, another even easier approach would be to provide a text-based syntax, like Markdown, for editing. (not that many ways you can exploit the SO edit area, for instance. Markdown syntax + limited tag list without attributes).

You could try dom4j http://dom4j.sourceforge.net/dom4j-1.6.1/ This is a DOM parser (as opposed to SAX) and allows you to easily traverse and manipulate the DOM, removing node attributes like onmouseover for example (or entire elements like <script>), before writing back out or streaming somewhere. Depending on how wild your html is, you may need to clean it up first - jtidy http://jtidy.sourceforge.net/ is good.
But obviously doing all this involves some overhead if you're doing this at page render time.

java: using a variable's value as an object name (not the eval() way)

Ok, so coming from a background of mostly perl, and mostly writing dirty little apps to automate my tasks, I've read the pages about the evils of eval(), and I always use a hash (in perl). I'm currently working on a little project (mostly for me and a couple of other technical people at work), for creating "canned response" e-mails. To allow for additions, subtractions, edits, etc., I'd like to essentially describe the response form(s) in XML, and have my app parse the XML and create the response forms at runtime. I want to use Java (to integrate it into an existing Java tool that I created), and boiled down to a trivial example, what I'm trying to do is take some XML like:
<Form Name="first" Title="Title!">
<Label Name="before">Your Request:</Label>
<Textbox Name="input"/>
<Label Name="after">has been completed.</Label>
<Output>%before%%input%%after%</Output>
</Form>
<Form Name="second">
...
and from parsing that, I want to create a JFrame named first, which contains a JLabel named before with the obvious text, then a textbox, then another JLabel... you get the idea (I eventually want to use the output tag to control exactly how the response is formatted).
I can parse the XML, and get the element name and such, but I don't know how to instantiate the Objects with a name that is the value of a variable, effectively:
JFrame $(thisNode.getAttributes().getNamedItem("Name").getNodeValue()) = new JFrame(thisNode.getAttributes().getNamedItem("Title").getNodeValue());
I've read basically the whole first page of google results on java reflection, but I haven't come across anyone doing quite what I'm looking for (at least not that I could tell). Having basically zero experience with reflection, I'm curious if this is something that can be accomplished using it, or if I should take the same approach as I would in Perl, and create a HashMap or HashTable of Objects, and tie them to a entry in a Hash of JFrames. Or, I'm open to ideas that don't fall into those two categories. The Hash is sort of my stand-by answer, because I've done it in Perl plenty of times, and I'm sure I can make it work in Java, but if there's a feature (like reflection) that's made to do this task, then why not do it the way it was intended to be done?

What you're asking isn't possible in Java. It doesn't work that way and these sorts of tricks, which are common in dynamic languages, aren't the Java way. You can certainly do:
JFrame frame = JFrameBuilder.buildFromTemplate("frame.xml");
where you create a JFrameBuilder class that reads the XML and creates an object from it but the variable name can't be dynamic. You have to remember that there are two steps in Java.
Java source files are compiled into bytecode;
The bytecode is read by a Java interpreter (JVM) and executed.
What you want is essentially asking to execute code in step (1). Now annotations can do things in a compile step (like adding interfaces, implementing methods and so on) but local variable naming is not one of those things.

You could (not necessarily that you should) generate Java source based on your XML, compile the generated code, and finally, execute the compiled code. This could be more efficient if you saved the generated .class files and reused them instead of parsing the XML every time the program is run (it can check the timestamp on the XML and only generate and compile if it's been modified since the last code generation).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Manipulating HTML nodes with java javascript scripting API - java

Related

How to create a simple Italian Model for a Named Entity Extraction of Persons using OpenNLP?

Parsing non-fixed format binary payload with a custom javascript conversion in Vorto

Use XPath to create DOM-Nodes

Java: Best way to remove Javascript from HTML

java: using a variable's value as an object name (not the eval() way)

Categories

Resources