I am trying to use CoRB to search for and update a node in a large number of documents:
Sample input:
<hcmt xmlns="http://horn.thoery">
<susceptible>X</susceptible>
<reponsible>foresee–intervention</reponsible>
<intend>Benefit Protagonist</intend>
<justified>Goal Outwiegen</justified>
</hcmt>
XQuery:
(: let $resp := "foresee–intervention" :)
let $docs :=
  cts:search(doc(),
    cts:and-query((
      cts:collection-query("hcmt"),
      cts:path-range-query("/horn:hcmt/horn:responsible", "=", $resp)
    ))
  )
return
  for $doc in $docs
  return
    xdmp:node-replace($doc/horn:hcmt/horn:responsible, "Foresee Intervention")
Expected output:
<hcmt xmlns="http://horn.thoery">
<susceptible>X</susceptible>
<reponsible>Foresee Intervention</reponsible>
<intend>Benefit Protagonist</intend>
<justified>Goal Outwiegen</justified>
</hcmt>
But the node-replace does not happen in CoRB and no error is returned. Other queries work fine in CoRB. How can I make the node-replace work correctly in CoRB?
Thanks in advance for any help.
I created functions to reconcile the encoding issues. This not only mitigates potential API transaction failures, but is also required to validate and encode parameter, element, property, and URI names.
That said, a sample MarkLogic Java API implementation is:
Create a dynamic query construct on the filesystem, in my case product-query-option.xml (using the query value directly: Chooser–Option):
<search xmlns="http://marklogic.com/appservices/search">
<query>
<and-query>
<collection-constraint-query>
<constraint-name>Collection</constraint-name>
<uri>proto</uri>
</collection-constraint-query>
<range-constraint-query>
<constraint-name>ProductType</constraint-name>
<value>Chooser–Option</value>
</range-constraint-query>
</and-query>
</query>
</search>
Deploy the persistent query options to the modules database, in my case as search-lexis.xml; the options file looks like:
<options xmlns="http://marklogic.com/appservices/search">
<constraint name="Collection">
<collection prefix=""/>
</constraint>
<constraint name="ProductType">
<range type="xs:string" collation="http://marklogic.com/collation/en/S1">
<path-index xmlns:prod="schema://fc.fasset/product">/prod:requestProduct/prod:_metaData/prod:productType</path-index>
</range>
</constraint>
</options>
Following on, the dynamic Java search is:
QueryManager queryMgr = dbClient.newQueryManager();
File file = new File("src/main/resources/queryoption/product-query-option.xml");
FileHandle fileHandle = new FileHandle(file);
// queryOption is the name under which the persistent query options above were deployed
RawCombinedQueryDefinition rcqDef = queryMgr.newRawCombinedQueryDefinition(fileHandle, queryOption);
You can, of course, combine the query and the options into one handle in the QueryDefinition.
Your original node-replace translates to a Java partial update; make sure the DocumentPatchBuilder is given the correct NamespaceContext via setNamespaces.
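A minimal sketch of such a patch, assuming the dbClient used above, a document URI held in a variable uri, and the namespace from the sample document (adjust the path to match your data):
XMLDocumentManager docMgr = dbClient.newXMLDocumentManager();
DocumentPatchBuilder patchBuilder = docMgr.newPatchBuilder();
// Bind the prefix used in the XPath below to the document namespace.
EditableNamespaceContext namespaces = new EditableNamespaceContext();
namespaces.put("horn", "http://horn.thoery");
patchBuilder.setNamespaces(namespaces);
// Replace the text value of the target element.
patchBuilder.replaceValue("/horn:hcmt/horn:responsible", "Foresee Intervention");
DocumentPatchHandle patchHandle = patchBuilder.build();
docMgr.patch(uri, patchHandle);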
For batch data operations, the performant approach is the MarkLogic Data Movement SDK: instantiate a QueryBatcher with the searched URIs, supply the replacement value or data fragment via PatchBuilder.replaceValue, and complete each batch with
dbClient.newXMLDocumentManager().patch(uri, patchHandle);
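A rough Data Movement sketch, assuming the rcqDef query definition and the patchHandle built above (batch size and thread count are arbitrary):
DataMovementManager dmm = dbClient.newDataMovementManager();
QueryBatcher batcher = dmm.newQueryBatcher(rcqDef)
        .withBatchSize(100)
        .withThreadCount(4)
        .onUrisReady(batch -> {
            // Patch every matching document in this batch of URIs.
            XMLDocumentManager mgr = batch.getClient().newXMLDocumentManager();
            for (String uri : batch.getItems()) {
                mgr.patch(uri, patchHandle);
            }
        })
        .onQueryFailure(Throwable::printStackTrace);
dmm.startJob(batcher);
batcher.awaitCompletion();
dmm.stopJob(batcher);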
MarkLogic Data Services: if the above works for you and you then want a more robust and scalable enterprise SOA approach, please review Data Services.
The implementation with Gradle looks like this. (Note that all of the transformation inputs should be parameters, including path/element/property names, namespaces, values, etc.; nothing is hardcoded.) One proxy service declared in service.json can serve multiple endpoints (under /root/df-ds/fxd) with different types of modules, which gives you free rein to develop in pure Java or to extend the development platform to handle complex data operations.
If these operations are persistent node updates, you should consider an in-memory node transform before ingestion. Besides the MarkLogic data transformation tools, you can harness the power of XSLT 2.0+.
Saxon's XPathFactory could be a serviceable vehicle for querying and transforming nodes. I am not sure whether the relationship is reciprocal, but the MarkLogic Java API implements XPath compilation to split large paths and stream transactions. XSLT/Saxon is not my forte, so I cannot comment on how it compares with respect to this encode/decode particularity, or how it handles streaming of transactions (insert, update, etc.).
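As an illustration only (using Saxon's s9api rather than the JAXP XPathFactory, with a hypothetical local file hcmt.xml holding the sample document from the question):
import net.sf.saxon.s9api.*;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class SaxonXPathDemo {
    public static void main(String[] args) throws SaxonApiException {
        Processor proc = new Processor(false);                  // Saxon-HE
        XdmNode doc = proc.newDocumentBuilder()
                .build(new StreamSource(new File("hcmt.xml"))); // hypothetical input file
        XPathCompiler xpath = proc.newXPathCompiler();
        xpath.declareNamespace("horn", "http://horn.thoery");   // namespace from the sample
        XdmValue value = xpath.evaluate("/horn:hcmt/horn:responsible/string()", doc);
        System.out.println(value);
    }
}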
What would be the best way to serialize information about SAP RFC Function Module parameters (i.e. parameter names and parameter values) with which the function was called in SAP (with the 'DESTINATION' parameter referring to an SAP JCo server) and then captured in JCo?
The point is that the serialization should be done in JCo (using Java), and this data would then be sent back to SAP and saved in a Z-table, so that later, using these table entries, it will be possible to "reserialize" the data in ABAP and call the given function again with exactly the same parameters and values.
To make it easier to understand, I will give an example:
Step 1. We call RFC FM in ABAP:
CALL FUNCTION 'remotefunction'
DESTINATION jco_server
EXPORTING e1 = exp1
* IMPORTING i1 =
TABLES t1 = tab1.
Step 2. We catch this function call in JCo and need to serialize the information about the parameters with which the function was called and save it in an SAP table, for example:
String ImportParList = function.getImportParameterList().toXML().toString(); //serialization of import parameters
String ExportParList = function.getExportParameterList().toXML().toString(); //serialization of export parameters
String TableParList = function.getTableParameterList().toXML().toString(); //serialization of table parameters
String ParList = ImportParList + ExportParList + TableParList;
//Call function to save content of variable "ParList" in a SAP table
Step 3. Using ABAP, we need to select the data from the SAP table and "reserialize" it (e.g. using CALL TRANSFORMATION in ABAP) to be able to call the FM "remotefunction" again with the same parameters and values as before.
To sum it up:
A. Is there maybe a standard Java method in JCo for such serialization (better than manually converting this to XML/JSON and saving it as a String)?
B. How should one deal with deep ABAP structures, e.g. tables with further tables nested inside them? Also convert them to XML/JSON just like the rest?
C. Do you have any other ideas on how to perform this process better than what I presented?
Thanks in advance!
What I describe here may not be appropriate for you due to the complexity of installation and setup, but I am posting it for the sake of other community members and to acknowledge that such functionality exists.
SAP has a special framework that enables exactly the case you want, called LOGWIN (LOGCOM 200). Instructions for installation are in SAP Note 1870371.
Details of the feature set:
The logging of RFCs allows you to establish which users had access to which data at what point in time. You can log data on RFC Function Module (FM) level, for example:
Type of parameters
Name and corresponding values of parameters
In order to minimize the amount of logged data, you can do the following:
Restrict logging to certain users
Filter the parameters that need to be logged before they are included in the log records
Enable logging on client level only for the RFC Function Modules that you want to log
You can fine-tune which RFC calls (modules) will be logged, including successful or failed ones, via the BAdI /LOGWIN/BADI_RFC_LOG_FILTER.
Initially the log is stored temporarily in SAP and can be viewed via transaction /LOGWIN/SHOW_LOG; after that you can transfer the necessary log records to an external repository (which you should set up in advance) via transaction /LOGWIN/TSF_TO_EXT.
Architecture overview (diagram not reproduced here).
So you can set up the external repository in Java-accessible storage, or leave it as is and read the parameter values from SAP; after that you can re-run failed modules from either the SAP or the Java side.
Also, there are a bunch of other settings around data archiving, user permissions, exclusions, mappings, etc., which are too extensive to describe in this answer.
More documentation is here:
Configuration guide
Application guide (PDF)
Installation is done from https://support.sap.com/swdc > Software Downloads > Installations and Upgrades > A–Z Index > L > LOGGING OF RFC AND WEB SVCS 2.0.
Also check SAP Notes 1870371 and 1878916.
I am trying to get a forest's data directory in MarkLogic. I used the following method, via the Server Evaluation Call interface, running the query as admin. If this is not the right approach, please let me know how else I can get the forest data directory.
ServerEvaluationCall forestDataDirCall = client.newServerEval()
    .xquery("import module namespace admin = \"http://marklogic.com/xdmp/admin\" at \"/MarkLogic/admin.xqy\"; " +
            "admin:forest-get-data-directory(admin:get-configuration(), " +
            "admin:forest-get-id(admin:get-configuration(), \"" + forestName + "\"))");
for (EvalResult forestDataDirResult : forestDataDirCall.eval()) {
    String forestDataDir = forestDataDirResult.getString();
    System.out.println("forestDataDir is " + forestDataDir);
}
I see no reason to hit the server evaluation endpoint to ask the server this question. MarkLogic comes with a robust REST-based Management API, including getters for almost all items of interest.
Knowing that, you can use what is documented here:
http://yourserver:8002/manage/v2/forests
Results can be in JSON, XML or HTML
It is the getter for forest configurations. Which forests you care about can be found by iterating over all forests, or by going through the database configuration and then to its forests. It all depends on what you already know from the outside.
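For example, here is a rough Java sketch of reading one forest's properties (which include the data directory) over the Management API, assuming the Manage app server on port 8002 accepts basic authentication and using placeholder host, credentials, and forest name:
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ForestDataDir {
    public static void main(String[] args) throws Exception {
        String host = "yourserver";      // placeholder
        String forestName = "Documents"; // placeholder

        HttpClient http = HttpClient.newBuilder()
                .authenticator(new Authenticator() {
                    @Override
                    protected PasswordAuthentication getPasswordAuthentication() {
                        // placeholder credentials
                        return new PasswordAuthentication("admin", "admin".toCharArray());
                    }
                })
                .build();

        // Forest properties include "data-directory"; ask for JSON output.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":8002/manage/v2/forests/" + forestName + "/properties?format=json"))
                .header("Accept", "application/json")
                .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}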
References:
Management API
Scripting Administrative Tasks
First of all, I want to clarify that my experience working with Wikidata is very limited, so feel free to correct me if any of my terminology is wrong.
I've been playing with Wikidata Toolkit, more specifically its wdtk-wikibaseapi module. This allows you to get entity information and its different properties like so:
WikibaseDataFetcher wbdf = WikibaseDataFetcher.getWikidataDataFetcher();
EntityDocument q42 = wbdf.getEntityDocument("Q42");
List<StatementGroup> groups = ((ItemDocument) q42).getStatementGroups();
for (StatementGroup g : groups) {
    List<Statement> statements = g.getStatements();
    for (Statement s : statements) {
        System.out.println(s.getMainSnak().getPropertyId().getId());
        System.out.println(s.getValue());
    }
}
The above gets me the entity Douglas Adams and all of the statements on his item page: https://www.wikidata.org/wiki/Q42
Now Wikidata Toolkit also has the ability to load and process dump files, meaning you can download a dump locally and process it using the DumpProcessingController class from the wdtk-dumpfiles library. I'm just not sure what is meant by processing.
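For context, registering a processor with DumpProcessingController looks roughly like this (a sketch only, based on the toolkit's documented usage, assuming a locally downloaded Wikidata JSON dump and the usual Wikidata Toolkit imports):
DumpProcessingController controller = new DumpProcessingController("wikidatawiki");
controller.setOfflineMode(true); // only use dumps that are already downloaded locally
controller.registerEntityDocumentProcessor(new EntityDocumentProcessor() {
    @Override
    public void processItemDocument(ItemDocument itemDocument) {
        // every item entity in the dump is streamed through here,
        // e.g. inspect itemDocument.getStatementGroups() as in the API example above
    }
    @Override
    public void processPropertyDocument(PropertyDocument propertyDocument) {
        // property entities arrive here
    }
}, null, true); // no site filter, current revisions only
controller.processMostRecentJsonDump();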
Can anyone explain what processing means in this context?
Can you do something similar to what was done with wdtk-wikibaseapi in the example above, but using a local dump file and wdtk-dumpfiles, i.e. get an entity and its respective properties? I don't want to get the information from an online source, only from the dump (offline).
If this is not possible using Wikidata Toolkit, could you point me to something that can get me started on getting entities and their properties from a Wikidata dump file, please? I am using Java.
This is most similar to this question.
I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and have all the relevant values attached to it (based on a key) before being written to a database.
The trouble is that the mapping dataset from BigQuery is very large - any attempt to use it as a side input fails with the Dataflow runners throwing the error "java.lang.IllegalArgumentException: ByteString would be too long". I have attempted the following strategies:
1) Side input
As stated, the mapping data is (apparently) too large to do this. If I'm wrong here or there is a workaround for this, please let me know, because this would be the simplest solution.
2) Key-Value pair mapping
In this strategy, I read the BigQuery data and Pubsub message data in the first part of the pipeline, then run each through ParDo transformations that change every value in the PCollections to KeyValue pairs. Then, I run a Merge.Flatten transform and a GroupByKey transform to attach the relevant mapping data to each message.
The trouble here is that streaming data requires windowing to be merged with other data, so I have to apply windowing to the large, bounded BigQuery data as well. It also requires that the windowing strategies are the same on both datasets. But no windowing strategy for the bounded data makes sense, and the few windowing attempts I've made simply send all the BQ data in a single window and then never send it again. It needs to be joined with every incoming pubsub message.
3) Calling BQ directly in a ParDo (DoFn)
This seemed like a good idea - have each worker declare a static instance of the map data. If it's not there, then call BigQuery directly to get it. Unfortunately this throws internal errors from BigQuery every time (as in the entire message just says "Internal error"). Filing a support ticket with Google resulted in them telling me that, essentially, "you can't do that".
It seems this task doesn't really fit the "embarrassingly parallelizable" model, so am I barking up the wrong tree here?
EDIT:
Even when using a high-memory machine in Dataflow and attempting to make the side input into a map view, I get the error java.lang.IllegalArgumentException: ByteString would be too long.
Here is an example (pseudocode) of the code I'm using:
Pipeline pipeline = Pipeline.create(options);
PCollectionView<Map<String, TableRow>> mapData = pipeline
.apply("ReadMapData", BigQueryIO.read().fromQuery("SELECT whatever FROM ...").usingStandardSql())
.apply("BQToKeyValPairs", ParDo.of(new BQToKeyValueDoFn()))
.apply(View.asMap());
PCollection<PubsubMessage> messages = pipeline.apply(PubsubIO.readMessages()
.fromSubscription(String.format("projects/%1$s/subscriptions/%2$s", projectId, pubsubSubscription)));
messages.apply(ParDo.of(new DoFn<PubsubMessage, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        JSONObject data = new JSONObject(new String(c.element().getPayload()));
        String key = getKeyFromData(data);
        TableRow sideInputData = c.sideInput(mapData).get(key);
        if (sideInputData != null) {
            LOG.info("holyWowItWOrked");
            c.output(new TableRow());
        } else {
            LOG.info("noSideInputDataHere");
        }
    }
}).withSideInputs(mapData));
The pipeline throws the exception and fails before logging anything from within the ParDo.
Stack trace:
java.lang.IllegalArgumentException: ByteString would be too long: 644959474+1551393497
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.concat(ByteString.java:524)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:576)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.balancedConcat(ByteString.java:575)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString.copyFrom(ByteString.java:559)
com.google.cloud.dataflow.worker.repackaged.com.google.protobuf.ByteString$Output.toByteString(ByteString.java:1006)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:951)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1000)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Check out the section called "Pattern: Streaming mode large lookup tables" in Guide to common Cloud Dataflow use-case patterns, Part 2. It might be the only viable solution since your side input doesn't fit into memory.
Description:
A large (in GBs) lookup table must be accurate, and changes often or does not fit in memory.
Example:
You have point of sale information from a retailer and need to associate the name of the product item with the data record which contains the productID. There are hundreds of thousands of items stored in an external database that can change constantly. Also, all elements must be processed using the correct value.
Solution:
Use the "Calling external services for data enrichment" pattern
but rather than calling a micro service, call a read-optimized NoSQL
database (such as Cloud Datastore or Cloud Bigtable) directly.
For each value to be looked up, create a Key Value pair using the KV
utility class. Do a GroupByKey to create batches of the same key type
to make the call against the database. In the DoFn, make a call out to
the database for that key and then apply the value to all values by
walking through the iterable. Follow best practices with client
instantiation as described in "Calling external services for data
enrichment".
Other relevant patterns are described in Guide to common Cloud Dataflow use-case patterns, Part 1:
Pattern: Slowly-changing lookup cache
Pattern: Calling external services for data enrichment
I have a problem with DbUnit in my test cases. When I create my data in DbUnit, I currently explicitly specify IDs. It looks something like this:
<users user_id="35" corpid="CORP\35" last_login="2014-10-27 00:00:00.0" login_count="1" is_manager="false"/>
<plans plan_id="18332" state="1" owned_by_user="35" revision="4"/>
<plan_history plan_history_id="12307" date_created="2014-08-29 14:40:08.356" state="0" plan_id="18332"/>
<plan_history plan_history_id="12308" date_created="2014-08-29 16:40:08.356" state="1" plan_id="18332"/>
<goals goal_id="12331" goal_name="Dansa" description="Dans"/>
<personal_goals plan_id="18332" personal_goal_id="18338" date_finished="2014-10-28 00:00:00.192" goal_id="12331" state="0"/>
<personal_goal_history personal_goal_id="18338" personal_goal_history_id="18005" date_created="2014-08-29 14:40:08.356" state="1" />
<activities activity_id="13001"/>
<custom_activities activity_name="customActivity" description="Replace" activity_id="13001"/>
<personal_activities personal_activity_id="17338" personal_goal_id="18338" date_finished="2014-10-28 00:00:00.192" state="0" activity_id="13000"/>
<personal_activity_history personal_activity_id="17338" personal_activity_history_id="18338" date_created="2014-08-29 14:40:29.073" state="1" />
Since the ID of the user is specified literally, we often get merge problems between tests, and they are really cumbersome to solve. This is because we may be working on different branches and several people may have allocated the same IDs. The solution then becomes updating all IDs in the seed data and all relational IDs, as well as updating the test files. This work is really cumbersome.
I'm therefore looking for some way to auto-generate IDs. For instance, functions like getNextId("User") and getLatestId("User") would be of great help. Is there something like this in DbUnit, or could I somehow create such functions myself?
If there are other suggestions for how this problem can be avoided, I'd gladly hear them as well.
It sounds like you are using the same test data file for all tests. It is better practice to have multiple test files - one per test for its test-specific data, plus common files for "master list" data. The "master list" data does not change per test, so you do not encounter the mentioned problem when merging test data files.
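A minimal sketch of combining a shared master file with a per-test file via a composite dataset (the file names and the connection variable are placeholders):
IDataSet masterData = new FlatXmlDataSetBuilder()
        .build(new File("src/test/resources/dbunit/master-data.xml"));  // shared reference data
IDataSet testData = new FlatXmlDataSetBuilder()
        .build(new File("src/test/resources/dbunit/my-test-data.xml")); // data only this test needs
IDataSet combined = new CompositeDataSet(masterData, testData);
DatabaseOperation.CLEAN_INSERT.execute(databaseConnection, combined);   // databaseConnection is an IDatabaseConnection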