How to pass document URI & database name to MarkLogic Spark connector? - java

I am trying this MarkLogic Spark connector tutorial:
https://developer.marklogic.com/blog/marklogic-spark-example
I was able to execute it. What I found is that it picks the Documents database by default.
My question is: given that the code looks like this:
JavaPairRDD<DocumentURI, MarkLogicNode> mlRDD = context.newAPIHadoopRDD(
    hdConf,                     // Configuration
    DocumentInputFormat.class,  // InputFormat
    DocumentURI.class,          // Key Class
    MarkLogicNode.class         // Value Class
);
I was wondering how I can pass a specific document URI and database in order to fetch just that document from a given database.
For example:
The Documents database contains XML files created by importing a CSV file, as described in: MarkLogic: Multiple XML files created on importing a CSV. How to get root Document URI path?
Can someone share sample code showing how to pass the document URI and database name as parameters?

If you refer to the documentation for the MarkLogic Connector for Hadoop, specifically
Input Configuration Properties, you will find the property mapreduce.marklogic.input.documentselector, which takes an XQuery path expression that allows you to select specific documents from the database.
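For example, a minimal sketch (the connection settings and the document URI are placeholders, not values from the question; hdConf is the same Hadoop Configuration passed to newAPIHadoopRDD above):

import org.apache.hadoop.conf.Configuration;

Configuration hdConf = new Configuration();
// Placeholder connection settings -- substitute your own host/port/credentials.
hdConf.set("mapreduce.marklogic.input.username", "admin");
hdConf.set("mapreduce.marklogic.input.password", "admin");
hdConf.set("mapreduce.marklogic.input.host", "localhost");
hdConf.set("mapreduce.marklogic.input.port", "8000");
// XQuery path expression that narrows the input to a single document (assumed URI).
hdConf.set("mapreduce.marklogic.input.documentselector",
        "fn:doc(\"/example/doc1.xml\")");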

The sample uses the Hadoop Connector.
Using MarkLogic 8, I believe you can set the database with mapreduce.marklogic.output.databasename in the job configuration.
http://docs.marklogic.com/guide/mapreduce/quickstart#id_38329
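A hedged sketch of what that would look like (the database name is a placeholder, and since the question is about reading, the input-side counterpart is likely the one you want; check that your connector version supports these properties):

// Assumed property names from the MarkLogic 8 connector documentation.
hdConf.set("mapreduce.marklogic.input.databasename", "my-database");   // when reading
hdConf.set("mapreduce.marklogic.output.databasename", "my-database");  // when writing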

Related

Create XML based on XSD from data in DB

This is the first time I have worked with XML, so maybe this is a very easy problem, but I would like to ask: what is the best way to create XML filled with data from a DB when I know the schema?
Of course there is the possibility of doing it manually, but I would like to do something like this:
create a configuration file which specifies column name, XPath, and default value (if the DB column is not populated), and based on this configuration file create XML conforming to the known schema.
Is there some tool in Java which would allow something like this?
MagicXMLTool tool = new MagicXMLTool("mySchema.xsd");
tool.set("some/xpath", value);
tool.set("another/xpath", anotherValue);
String xml = tool.generateXML();
Thanks a lot!
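No single standard tool offers exactly that API, but here is a minimal sketch of the idea using only the JDK's DOM classes (MagicXMLTool is the hypothetical name from the question; this version takes a root element name instead of the XSD and skips schema validation, which a real implementation would layer on top, e.g. via javax.xml.validation):

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MagicXMLTool {
    private final Document doc;
    private final Element root;

    public MagicXMLTool(String rootElementName) throws Exception {
        doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        root = doc.createElement(rootElementName);
        doc.appendChild(root);
    }

    // Walks the slash-separated path, creating elements that do not exist yet,
    // then sets the text content of the final element.
    public void set(String path, String value) {
        Element current = root;
        for (String name : path.split("/")) {
            Element child = directChild(current, name);
            if (child == null) {
                child = doc.createElement(name);
                current.appendChild(child);
            }
            current = child;
        }
        current.setTextContent(value);
    }

    private Element directChild(Element parent, String name) {
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element && kid.getNodeName().equals(name)) {
                return (Element) kid;
            }
        }
        return null;
    }

    public String generateXML() throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}

Usage then matches the pseudocode above, and the column-to-XPath mapping from the configuration file can simply drive a loop of set() calls.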

How can I write gerrit.config format?

File gerrit.config
The audit configuration can be defined in the main gerrit.config
in a specific section dedicated to the audit-sl4j plugin.
gerrit.audit-sl4j.format
: Output format of the audit record. Can be set to either JSON or CSV. By default, CSV.
gerrit.audit-sl4j.logName
: Write audit to a separate log name under the Gerrit logs directory. By default, audit records are put into the error_log.
How can I write the section gerrit.audit-sl4j.logName?
I have tried this:
But it doesn't work.
You forgot to paste the example that doesn't work for you. While you update it, I can share a working example in case it is of any help.
This is the audit-sl4j configuration part of a working gerrit.config:
[plugin "audit-sl4j"]
format = JSON
logName = audit_log
In this example, we are writing the audit logs to a file called audit_log in JSON format.
I hope this helps.

How to access SparkContext on executors to save DataFrame to Cassandra?

How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath -> parse(dumpFilePath))
In parse(), every XML file is validated, parsed, and inserted into several tables using Spark SQL. Only valid XML files produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, a SparkContext is needed inside the parse function to call sparkContext.sql().
If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process that is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your million XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program where certain sections are automatically converted to a physical execution plan, ultimately a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to Spark: DB connection per Spark RDD partition and do mapPartition for further details. Pay attention to how the dbConnection object is enclosed in the function body.
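For example, a minimal sketch using the Java API (the JDBC URL, table, and statement are assumptions; the point is only that the connection is opened inside the function passed to foreachPartition, so it is created on the executor and never serialized from the driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionedInsert {
    public static void insertAll(JavaSparkContext sc, List<String> parsedRows) {
        JavaRDD<String> rdd = sc.parallelize(parsedRows, 8);
        rdd.foreachPartition(partition -> {
            // Opened on the executor, once per partition -- never on the driver.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://db-host/dump");          // assumed URL
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO docs(xml) VALUES (?)")) {      // assumed table
                while (partition.hasNext()) {
                    ps.setString(1, partition.next());
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        });
    }
}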
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
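A small sketch of that (the replacement-key map mirrors the question's mention of replacing portions of the data with other keys, and is an assumption):

import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class KeyReplacement {
    // Ships the read-only map to each executor once instead of per task.
    public static JavaRDD<String> replaceKeys(JavaSparkContext sc,
                                              JavaRDD<String> keys,
                                              Map<String, String> replacements) {
        Broadcast<Map<String, String>> bc = sc.broadcast(replacements);
        return keys.map(k -> bc.value().getOrDefault(k, k));
    }
}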

Reading N-Quads in Jena

I'm trying to read an N-Quads file with Jena, but all I get is an empty model. The file I'm trying to read is taken from the example in N-Quads documentation:
<http://example.org/#spiderman> <http://www.perceive.net/schemas/relationship/enemyOf> <http://example.org/#green-goblin> <http://example.org/graphs/spiderman> .
(I saved it as a file named file.nq).
The way I'm loading the model is with RDFDataMgr, but it didn't work with Model.read either.
RDFDataMgr.loadModel("file.nq", Lang.NQUADS)
yields an empty model.
What am I missing? Doesn't Jena support N-Quads out-of-the-box?
Yes, Jena supports N-Quads. Try loadDataset.
N-Quads is for multiple graphs and you have read it into one graph. What you get is just the default graph triples, in this case, none.
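A minimal sketch (assuming Jena 3's org.apache.jena packages; the file and graph names come from the question):

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class ReadNQuads {
    public static void main(String[] args) {
        // Load the quads into a Dataset instead of a single Model.
        Dataset dataset = RDFDataMgr.loadDataset("file.nq");
        // The example triple lives in this named graph, not the default graph.
        Model spiderman =
                dataset.getNamedModel("http://example.org/graphs/spiderman");
        spiderman.write(System.out, "N-TRIPLES");  // prints one triple
    }
}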
There is a warning emitted:
WARN riot :: Only triples or default graph data expected : named graph data ignored
If you didn't get that, then (1) you are running an old copy, (2) you have turned logging off, or (3) the file is empty.

Creating XML Schema from URL works but from Local File fails?

I need to validate XML Schema (XSD) documents which are programmatically generated, so I'm using the following Java snippet, which works fine:
SchemaFactory factory = SchemaFactory.newInstance(
        XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema xsdSchema = factory.newSchema(  // reads the URL every time
        new URL("http://www.w3.org/2001/XMLSchema.xsd"));
Validator xsdValidator = xsdSchema.newValidator();
xsdValidator.validate(new StreamSource(schemaInstanceStream));
However, when I save the XML Schema definition file locally and refer to it this way:
Schema schema = factory.newSchema(
        new File("test/xsd/XMLSchema.xsd"));
It fails with the following exception:
org.xml.sax.SAXParseException: schema_reference.4: Failed to read schema document 'file:/Users/foo/bar/test/xsd/XMLSchema.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
I've ensured that the file exists and is readable by doing exists() and canRead() assertions on the File object. I've also downloaded the file with a couple different utilities (web browser, wget) to ensure that there is no corruption.
Any idea why I can validate XSD instance documents when I generate the schema from the HTTP URL but I get the above exception when trying to generate from a local file with the same contents?
[Edit]
To elaborate, I've tried multiple forms of factory.newSchema(...) using Readers and InputStreams (instead of the File directly) and still get exactly the same error. Moreover, I've dumped the file contents before using it or the various input streams to ensure it's the right one. Quite vexing.
Full Answer
It turns out that there are three additional files referenced by XML Schema which must also be stored locally, and XMLSchema.xsd contains an import statement whose schemaLocation attribute must be changed. Here are the files that must be saved in the same directory:
XMLSchema.xsd - change schemaLocation to "xml.xsd" in the "import" element for XML Namespace.
XMLSchema.dtd - as is.
datatypes.dtd - as is.
xml.xsd - as is.
Thanks to @Blaise Doughan and @Tomasz Nurkiewicz for their hints.
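For reference, the relevant import element in the local copy of XMLSchema.xsd ends up looking like this (a sketch; the original schemaLocation points at http://www.w3.org/2001/xml.xsd):

<xs:import namespace="http://www.w3.org/XML/1998/namespace"
           schemaLocation="xml.xsd"/>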
I assume you are trying to load XMLSchema.xsd. Please also download XMLSchema.dtd and datatypes.dtd and put them in the same directory. This should push you a little bit further.
UPDATE
Is XMLSchema.xsd importing any other schemas by relative paths that are not on the local file system?
Your relative path may not be correct wrt your working directory. Try entering a fully qualified path to eliminate the possibility that the file can not be found.
org.xml.sax.SAXParseException: schema_reference.4: Failed to read
schema document 'file:/Users/foo/bar/test/xsd/XMLSchema.xsd', because
1) could not find the document; 2) the document could not be read; 3)
the root element of the document is not <xsd:schema>.
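As a quick check of what "could not find the document" actually resolved to, printing the absolute path can help (a trivial sketch):

import java.io.File;

System.out.println(new File("test/xsd/XMLSchema.xsd").getAbsolutePath());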
