How to upload Java object arrays to AWS CloudSearch from Grails? - java

Grails (which provides a complete set of plugins) also provides the searchable plugin, which is sufficient for indexing and searching. Now that we are moving to AWS CloudSearch, we have worked out how to search (based on Lucene) for any object, but I am getting an error while uploading the documents (which are actually arrays of objects).
What is the best way to upload an array of (Java) objects that maps to a CloudSearch domain? The goal is to achieve bulk upload from Grails.

CloudSearch does not support objects. It supports the following types, as well as arrays of each type (except latlon):
date
double
int
latlon
literal
text
Reference: the supported index field types in the AWS CloudSearch developer guide.
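To bulk-upload, flatten each object into those supported field types and post a JSON document batch (one "add" entry per object) to the domain's document endpoint. Below is a minimal sketch using the AWS SDK for Java; MyDomainObject, its fields, and the endpoint value are illustrative assumptions, and a real implementation should build the JSON with a proper library so string values get escaped.
// A minimal sketch, assuming the AWS SDK for Java (v1) cloudsearchdomain
// module. MyDomainObject and the endpoint URL are hypothetical; build the
// JSON with a real library in production so values get escaped.
import com.amazonaws.services.cloudsearchdomain.AmazonCloudSearchDomainClient;
import com.amazonaws.services.cloudsearchdomain.model.UploadDocumentsRequest;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class CloudSearchUploader {

    // Hypothetical stand-in for a Grails domain class.
    public static class MyDomainObject {
        String id;
        String title;  // maps to a CloudSearch "text" field
        double price;  // maps to a CloudSearch "double" field
    }

    public void bulkUpload(List<MyDomainObject> objects) {
        // Build a batch in CloudSearch's JSON document format:
        // [{"type":"add","id":"...","fields":{...}}, ...]
        StringBuilder batch = new StringBuilder("[");
        for (int i = 0; i < objects.size(); i++) {
            MyDomainObject o = objects.get(i);
            if (i > 0) batch.append(',');
            batch.append("{\"type\":\"add\",\"id\":\"").append(o.id)
                 .append("\",\"fields\":{\"title\":\"").append(o.title)
                 .append("\",\"price\":").append(o.price).append("}}");
        }
        batch.append(']');
        byte[] payload = batch.toString().getBytes(StandardCharsets.UTF_8);

        AmazonCloudSearchDomainClient client = new AmazonCloudSearchDomainClient();
        // Hypothetical document endpoint of your search domain.
        client.setEndpoint("doc-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com");
        client.uploadDocuments(new UploadDocumentsRequest()
                .withContentType("application/json")
                .withDocuments(new ByteArrayInputStream(payload))
                .withContentLength((long) payload.length));
    }
}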
My team is switching to the AWS Elasticsearch service for this and other reasons. CloudSearch hasn't seen an update since January 2013, and there's no indication that they plan to make any updates or improvements to it.

Related

Sandboxed java scripting replacement for Nashorn

I've been using Nashorn for awk-like bulk data processing. The idea is that there's a lot of incoming data arriving row by row, one after another, and each row consists of named fields. These data are processed by user-defined scripts stored externally and editable by users. The scripts are simple, like if (c > 10) a = b + 3, where a, b and c are fields in the incoming data rows. The amount of data is really huge. The code looks like this (an example to show the use case):
import java.util.Map;
import javax.script.Bindings;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.Invocable;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import jdk.nashorn.api.scripting.NashornScriptEngineFactory;

// Engine with a class filter that denies access to all Java classes.
ScriptEngine engine = new NashornScriptEngineFactory().getScriptEngine(
        new String[]{"-strict", "--no-java", "--no-syntax-extensions", "--optimistic-types=true"},
        null,
        scr -> false);
Invocable inv = (Invocable) engine;

// Remove the dangerous globals from the engine scope.
Bindings bd = engine.getBindings(ScriptContext.ENGINE_SCOPE);
bd.remove("load");
bd.remove("loadWithNewGlobal");
bd.remove("exit");
bd.remove("eval");
bd.remove("quit");

// Wrap the user script in a function and compile it once.
String scriptText = readScriptText();
CompiledScript cs = ((Compilable) engine).compile("function foo() {\n" + scriptText + "\n}");
cs.eval(); // defines foo() in the engine scope

// Feed the rows through the compiled script one by one.
Map params = readIncomingData();
while (params != null) {
    Map<String, Object> res = (Map) inv.invokeFunction("foo", params);
    writeProcessedData(res);
    params = readIncomingData();
}
Now Nashorn is obsolete and I'm looking for alternatives. I've been googling for a few days but didn't find an exact match for my needs. The requirements are:
Speed. There's a lot of data, so it has to be really fast; I assume precompilation is a must.
It must work under Linux/OpenJDK.
It must support sandboxing, at least for data access/code execution.
Nice to have:
Simple, C-like syntax (not Lua ;)
Sandboxing of CPU usage.
So far I have found that Rhino is still alive (last release dated 13 Jan 2020), but I'm not sure whether it is still supported or how fast it is; as I remember, one of the reasons Java switched to Nashorn was speed, and speed is very important in my case. I also found J2V8, but Linux is not supported. GraalVM looks like a bit of overkill, and I haven't yet figured out how to use it for such a task; maybe I need to explore further whether it is suitable, but it looks like a complete JVM replacement that cannot be used as a library.
It doesn't necessarily have to be JavaScript; maybe there are other alternatives.
Thank you.
GraalVM's JavaScript engine can be used as a library, with the dependencies obtained like any other Maven artifact. While the recommended way to run it is on a GraalVM distribution, there are explanations of how to run it on OpenJDK.
You can restrict what a script has access to: Java classes, thread creation, native APIs, and so on.
From the documentation:
The following access parameters may be configured:
* Allow access to other languages using allowPolyglotAccess.
* Allow and customize access to host objects using allowHostAccess.
* Allow and customize host lookup to host types using allowHostLookup.
* Allow host class loading using allowHostClassLoading.
* Allow the creation of threads using allowCreateThread.
* Allow access to native APIs using allowNativeAccess.
* Allow access to IO using allowIO and proxy file accesses using fileSystem.
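Put together for the use case above, a minimal sketch might look like this (assuming the org.graalvm.polyglot artifacts are on the classpath; the foo wrapper mirrors the Nashorn code from the question, and ProxyObject exposes each row to the script without granting host access):
// A minimal sketch, assuming the org.graalvm.polyglot artifacts are on the
// classpath. Sandbox switches are disabled explicitly; the row map is
// exposed via a proxy so the script can read and write fields even though
// host access is denied.
import java.util.HashMap;
import java.util.Map;
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.HostAccess;
import org.graalvm.polyglot.Value;
import org.graalvm.polyglot.proxy.ProxyObject;

public class GraalRowProcessor {
    public static void main(String[] args) {
        try (Context ctx = Context.newBuilder("js")
                .allowHostAccess(HostAccess.NONE)          // no Java objects
                .allowHostClassLookup(className -> false)  // no Java.type(...)
                .allowIO(false)
                .allowCreateThread(false)
                .allowNativeAccess(false)
                .build()) {
            // Wrap the user script in a function, evaluated once.
            ctx.eval("js", "function foo(row) { if (row.c > 10) row.a = row.b + 3; return row; }");
            Value foo = ctx.getBindings("js").getMember("foo");

            // One incoming data row; the proxy writes changes back to the map.
            Map<String, Object> row = new HashMap<>(Map.of("a", 0, "b", 5, "c", 42));
            foo.execute(ProxyObject.fromMap(row));
            System.out.println(row); // {a=8, b=5, c=42}
        }
    }
}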
And it is several times faster than Nashorn. Some measurements can be found, for example, in this article:
GraalVM CE provides performance comparable or superior to Nashorn with
the composite score being 4 times higher. GraalVM EE is even faster.

Parsing non-fixed format binary payload with a custom javascript conversion in Vorto

We are using Vorto now mainly as a normalized format and are starting to look into using the mapping engine for mapping different payload formats to the Vorto model as well. I more or less understand how to map functionblock properties from JSON or binary payload using xpath and the conversion functions. However, I'm not clear on how to support parsing of non-fixed-format binary payloads using this method.
For instance, we have an off-the-shelf LoRaWAN sensor which transmits in the following format:
<length><frame type>[<sensor-id><sensor-value>]
where length is the total frame length and sensor-id (e.g. temperature, humidity, battery, ...) describes how to parse the sensor-value (i.e. its length and datatype). In one frame, multiple of these readings may be present, in random order.
Parsing this can be done easily in, for instance, loraserver.io using a small JavaScript function which iterates over all the bytes and returns the parsed properties. The same approach works in the Ditto payload mapping engine, afaik.
However, I currently don't see how to do something similar in a Vorto mapping. This is just one specific sensor example of course, but more examples exist on the market using similarly dynamic payload formats. I know there is already an open issue (#1535) to improve the documentation, but it would already be helpful to know whether such flexible parsing is possible using the mapping DSL.
I tried passing the raw payload as a byte array to the JavaScript function. To test this, I duplicated org.eclipse.vorto.mapping.engine.converter.binary.BinaryMappingTest#testMappingBinaryContaining2DataPoints and adapted the model to use a custom JavaScript function like this:
evaluator.addScriptFunction(new ScriptClassFunction("extractTemperature",
        "function extractTemperature(value) { " +
        "  print(\"parameter of type \" + typeof value + \", value = \" + value);" +
        "  print(value[1]);" +
        "}"));
The output of this function is
parameter of type number, value = 1
undefined
where the value 1 is the first element of the byte array used.
So the function does not seem to receive the parameter as a byte array.
The model is configured with .withXPathStereotype("custom:extractTemperature(data)", "demo"), so the payload is passed (as BinaryData) in the same way as in the testMappingBinaryContaining2DataPoints test (.withXPathStereotype("custom:convert(vorto_conversion1:byteArrayToInt(data,0,0,0,2))", "demo")). The only difference I see is that in the testMappingBinaryContaining2DataPoints test the byte-array parameter is passed to a Java function instead of a JavaScript function. Or am I missing something?
Also, I noticed that loop keywords like for and while are not allowed in the JavaScript code. So even if I could access the byte-array parameter in the JavaScript function, I currently see no way to iterate over it.
On Gitter I received the following reply (together with the suggestion to move the discussion to SO):
You are right. We restricted the JavaScript function usage to a very rudimentary set of language keywords, excluding for loops, as nasty stuff can be implemented there. What you could do instead is register a Java function in your own namespace with the mapping engine. That function can hold a byte array. Later this function can be contributed to the mapping engine as a standard function to extract a certain value, for other developers to reuse.
I don't think this solves the problem, however. As mentioned above, this is just one example of an off-the-shelf sensor payload format, and I don't see how it can be generalized enough to be included as a generic function in the mapping engine. Nor do I think it should be required to implement a sensor-specific conversion in Java, since (as an end user of an IoT platform wanting to deploy a new sensor type) that is more complex to develop and deploy than a little JavaScript function which can be altered at runtime in the mapping spec. I see a lot of value in being able to do simple mappings in JavaScript, just as this can be done in, for example, loraserver.io and Eclipse Ditto.
I think being able to pass a byte array to JavaScript is a first step. I also wonder where exactly the risk is in allowing loops in the JavaScript. For example, Ditto also has some restrictions in its JavaScript sandbox (see here), but it allows loops and only prevents endless looping and recursion.
They state the following:
Using Rhino instead of Nashorn, the newer JavaScript engine shipped with Java, has the benefit that sandboxing can be applied in a better way.
Sandboxing of different payload scripts is required as Ditto is intended to be run as cloud service where multiple connections to different endpoints are managed for different tenants at the same time. This requires the isolation of each single script to avoid interference with other scripts and to protect the JVM executing the script against harmful code execution.
Would using Rhino in Vorto as well allow to control the risks you see and allow loop construct in Vorto mapping?
PS: can someone with enough SO reputation points add the tag eclipse-vorto please?
I created an issue for your request to support this in the JavaScript converters: https://github.com/eclipse/vorto/issues/2029
As stated in the issue, as a current workaround you can register your own custom converter function in Java and reuse it across your mappings. In these Java converter functions, you have the full power of the Java language to extract the right property from the arbitrary list.
To find out how to implement your own custom converter function in Java, take a look here: https://github.com/eclipse/vorto/tree/master/mapping-engine#Advanced-Usage
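As a rough illustration of what such a Java converter could look like for the sensor frame from the question (the class name, method name, and the assumption of one-byte sensor ids with two-byte big-endian values are all made up; how the class is registered with the mapping engine is covered in the Advanced-Usage documentation linked above):
// Illustrative sketch of a sensor-specific Java converter function.
// Assumes the <length><frame type>[<sensor-id><sensor-value>] layout with
// one-byte sensor ids and two-byte big-endian values; adjust to the real
// sensor spec.
public class SensorFrameConverter {

    // Returns the reading for the given sensor id, or null if absent.
    public static Integer extractReading(byte[] frame, int sensorId) {
        int i = 2; // skip <length> and <frame type>
        while (i + 2 < frame.length) {
            int id = frame[i] & 0xFF;
            if (id == sensorId) {
                // two-byte big-endian value
                return ((frame[i + 1] & 0xFF) << 8) | (frame[i + 2] & 0xFF);
            }
            i += 3; // one id byte + two value bytes per reading
        }
        return null;
    }
}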
Since the Eclipse Vorto 0.12.3 release, a fix for your request is available. With it, it is possible to pass array objects to JavaScript converters as well as to use for loops inside JavaScript functions. You might want to give it a try.
See release notes https://github.com/eclipse/vorto/blob/master/docs/release-notes.md

What is the best way to store data in a Page Object Model Framework

I'm building an automation framework in Selenium using the Page Object design pattern.
Following are some of the kinds of data I'm using and where I have stored them:
PageObjects (xpath, id etc) - In the Page Classes itself
Configuration Data (wait-times, browser type , the URL etc) - In a properties file.
Other data - In a class as static variables.
Once the framework starts growing, it becomes hard to organize all the data. I did some research on how others store data in their frameworks. Here is what I found:
Storing data (mostly page objects) in classes itself
Storing data in JSON
And some even suggested storing data in a database to reduce reading times.
Since there are a lot of options out there, I thought of getting some feedback on the best way to store data and how everyone else has stored their data.
JSON or any temporary data storage is the best option, since a framework is meant to be reused across different projects.
I don't see any problem with the way you have stored your data.
Locators (by POM definition) should be stored in the page objects themselves.
Config data can be stored in some sort of config file... whatever you find convenient. You can use plain text, JSON, XML, etc. We use XML but that really comes down to personal preference.
I think this is fine also.
The framework doesn't really grow, the automation suite does. As long as you keep the data stored in the 3 places above consistently, I think you should be fine. The only issue I've run into with this approach is that sometimes certain pages have a LOT of functionality on them so the page objects grow quite large. In those cases, we found a way to divide the page into smaller chunks, e.g. one page had 22 tabs, each consisting of a different panel. In that case, we broke the page object into 22 different class files to keep the size more manageable and then hooked them all back into the main page as properties, e.g. mainPage.Panel1.someMethodOnPanel1();
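As a rough sketch of hooking the panel chunks back into the main page as properties (class and method names here are made up for the example):
// Illustrative sketch of the panel decomposition described above.
import org.openqa.selenium.WebDriver;

class Panel1 {
    private final WebDriver driver;

    Panel1(WebDriver driver) { this.driver = driver; }

    public void someMethodOnPanel1() {
        // interact with locators that belong to this panel only
    }
}

class MainPage {
    public final Panel1 panel1; // each tab/panel is its own page-object chunk

    MainPage(WebDriver driver) {
        this.panel1 = new Panel1(driver);
    }
}

// Usage: new MainPage(driver).panel1.someMethodOnPanel1();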
I advise using interfaces for each device type to store selectors of multiple types, for example:
import org.openqa.selenium.By;
import static org.openqa.selenium.By.cssSelector;
import static org.openqa.selenium.By.id;
import static org.openqa.selenium.By.xpath;

public interface DesktopMainPageSelector {
    By FIRST_ELEMENT = cssSelector("selector_here");
    By SECOND_ELEMENT = xpath("selector_here");
    By THIRD_ELEMENT = id("selector_here");
}
Then just use these selectors wherever you need them.
You can also use enums for a more complex structure, as sketched below.
I found this to be the best solution, because it makes it easy to manage large numbers of selectors.
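A hedged sketch of the enum variant (selector strings are placeholders):
// Enum-based selectors: each constant carries its locator.
import org.openqa.selenium.By;

public enum DesktopMainPage {
    FIRST_ELEMENT(By.cssSelector("selector_here")),
    SECOND_ELEMENT(By.xpath("selector_here")),
    THIRD_ELEMENT(By.id("selector_here"));

    private final By selector;

    DesktopMainPage(By selector) {
        this.selector = selector;
    }

    public By getSelector() {
        return selector;
    }
}

// Usage: driver.findElement(DesktopMainPage.FIRST_ELEMENT.getSelector());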

How to manage a crawler URL frontier?

Guys
I have the following code to add visited links in my crawler.
After extracting links, I have a for loop which loops through each individual href tag.
After I have visited a link and opened it, I add the URL to the visited-links collection variable defined above:
private final Collection<String> urlFrontier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Assume I have visited 100,000 URLs; if I don't terminate the crawler, the set will grow day by day and will create memory issues. What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself; just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
The most common approach in modern crawling systems is to use a NoSQL database.
This solution is notably slower than a HashSet, which is why you can leverage a different caching strategy like Redis, or even Bloom filters.
But given the specific nature of URLs, I'd recommend a Trie data structure, which gives you lots of options to manipulate and search by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
As per the question, I would recommend using Redis to replace the use of Collection. It's an in-memory data-structure store that is super fast at inserting and retrieving data, and it supports all the standard data structures. In your case that is a Set, and you can check the existence of a key in a set with the SISMEMBER command.
Apache Nutch is also good to explore.
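To make the Redis suggestion above concrete, here is a minimal sketch using the Jedis client (the key name and host are illustrative; a single Jedis instance is not thread-safe, so a real multithreaded crawler should take connections from a JedisPool):
// Redis-backed visited set; replaces the in-memory synchronized HashSet.
import redis.clients.jedis.Jedis;

public class RedisVisitedSet {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // SADD is atomic: it returns 1 only for the thread that first adds the
    // URL, so concurrent workers never process the same URL twice.
    public boolean markVisited(String url) {
        return jedis.sadd("visited-urls", url) == 1;
    }

    public boolean isVisited(String url) {
        return jedis.sismember("visited-urls", url);
    }
}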

Retrieving dates from Google Custom Search

I am currently developing a Java application based on Google Custom Search API, using their Java libraries.
According to Google's documentation, they associate a date to each indexed Web page:
Page Dates: Google estimates the date for a page based on the URL, title, byline date and other features. This date can be used with the sort operator using the special structured data type date, as in &sort=date.
I want to retrieve the date associated with all the results returned for a given request. However, I didn't find anything related to this task in Google's documentation: there are parameters one can use to sort the results by date, or to focus on a certain period of time, but nothing about retrieving the precise dates themselves. And I couldn't find any reference to this problem on the Web either.
So, I am turning to SO to ask these questions:
Is it even possible to do that through Google's API? How?
Otherwise, is there a workaround?
