I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (greater than 1 GB), and I have to use Apache Spark as a requirement.
After some googling, I found out how an XML file can be read by a Spark process, using partitioning in a proper way. As described here, the hadoop-streaming library can be used, like this:
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
"org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")
// Load documents, splitting wrt <page> tag.
val documents = sparkContext.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])
Each chunk of information can then be processed into a Scala / Java object using dom4j or JAXB (more complex).
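For example, a single record (the text between the <page and </page> delimiters that StreamXmlRecordReader emits) could be parsed with dom4j roughly like this; the title element is only a hypothetical child of page here:

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

public class PageParser {
    // pageXml is the string form of one <page>...</page> record.
    public static String extractTitle(String pageXml) throws DocumentException {
        Document doc = DocumentHelper.parseText(pageXml);
        Element page = doc.getRootElement();
        // "title" is only an example; adapt it to the real schema of the file.
        return page.elementText("title");
    }
}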
Now, the problem is the following: the XML file should be validated before processing it. How can I do this in a way that fits Spark? As far as I know, the StreamXmlRecordReader used to split the file does not perform any validation.
I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.
I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be passing it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.
Any solution to this dilemma?
Just in case anyone is interested... why am I doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct-based schemas.
Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema. Maybe I am missing something there.
I'd really like to get the ParquetFileReader class to work, as it seems clean.
Appreciate any ideas.
Hadoop uses its own filesystem abstraction layer, which has an implementation for s3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).
The setup should look something like the following (Java, but the same should work with Scala):
import java.net.URI;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Constants;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
    DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider
URI uri = URI.create("s3a://bucketname/path");
Path path = new Path(uri);
ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));
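From there, the footer schema and row groups can be read with the same low-level API. A rough sketch, using the example Group classes that ship with parquet-hadoop (you may want to swap in your own record materializer):

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

// pfr is the ParquetFileReader opened above.
MessageType schema = pfr.getFileMetaData().getSchema();
PageReadStore rowGroup;
while ((rowGroup = pfr.readNextRowGroup()) != null) {
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    RecordReader<Group> recordReader =
        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
    for (long i = 0; i < rowGroup.getRowCount(); i++) {
        Group record = recordReader.read();
        // convert the record to JSON and write it out here
    }
}
pfr.close();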
I am new to Java and I am struggling with one program; I don't know how to write it.
I need some code that will read in the tasks and apply the appropriate task-findings markup into the val.xml file.
For example:
A task in val.xml:
<task name="12-19" additionalIntervalInformationNeeded="No">
Converter (Cleaning)
</task>
The matching task-findings markup in the findings.xml:
<tf taskid="olive-12-19">
<task-findings val="28">
<task-finding>
<title>Left Converter</title>
</task-finding>
</task-findings>
</tf>
So the goal is to use the name attribute value from the <task> element to locate the correct task-findings markup.
Then incorporate the <tf> element and all of its child elements into the task markup (just inside the closing </task> tag).
The result for the above example would be as follows:
<task name="12-19" additionalIntervalInformationNeeded="No">
Converter (Cleaning)
<tf taskid="olive-12-19">
<task-findings val="28">
<task-finding>
<title>Left Converter</title>
</task-finding>
</task-findings>
</tf>
</task>
Please suggest how I should write this code.
From your use case, it appears that you can write a program that reads in the two XML files and then edits and writes them as an output file. XML files can be read and written just like TXT files in Java; you just need to use the right file extension while reading and saving the files. This will require you to write your own parser or use regex and similar methods.
Another way to go is JAXP, the Java API for XML Processing, provided by Oracle. This will help you read, process and edit XML files via Java.
There are other parser APIs, the DOM parser API and SAX (the Simple API for XML), that can be used to read and alter XML files. These date back to older Java versions and are useful for small XML files. Currently, StAX, the Streaming API for XML, is used instead.
The tutorial blog here will give you an idea of parsing XML files via Java with the StAX library.
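To illustrate the DOM approach on your example, here is a rough sketch. It assumes the taskid in findings.xml is always the task's name attribute with an "olive-" prefix, as in your snippet, and uses hypothetical file names (val.xml, findings.xml, val-merged.xml):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MergeFindings {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document val = builder.parse(new File("val.xml"));
        Document findings = builder.parse(new File("findings.xml"));

        // Index every <tf> element in findings.xml by its taskid attribute.
        Map<String, Element> tfByTaskId = new HashMap<>();
        NodeList tfs = findings.getElementsByTagName("tf");
        for (int i = 0; i < tfs.getLength(); i++) {
            Element tf = (Element) tfs.item(i);
            tfByTaskId.put(tf.getAttribute("taskid"), tf);
        }

        // For each <task>, look up the matching <tf> and append a copy of it
        // just inside the closing </task> tag.
        NodeList tasks = val.getElementsByTagName("task");
        for (int i = 0; i < tasks.getLength(); i++) {
            Element task = (Element) tasks.item(i);
            String name = task.getAttribute("name");
            Element tf = tfByTaskId.get("olive-" + name);  // assumed taskid convention
            if (tf != null) {
                Node imported = val.importNode(tf, true);
                task.appendChild(imported);
            }
        }

        // Write the merged document back out.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(val), new StreamResult(new File("val-merged.xml")));
    }
}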
How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
In parse(), every XML file is validated, parsed and inserted into several tables using Spark SQL. Only valid XML files produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, a SparkContext is needed inside the parse function, to call sparkContext.sql().
If I were to rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process that is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml: A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your million XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted into a physical execution plan, ultimately a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to Spark: DB connection per Spark RDD partition and do mapPartition for further details. Pay attention to how the dbConnection object is enclosed in the function body.
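As a rough illustration (a sketch only, with a hypothetical JDBC URL and insert statement), the connection is opened inside the partition function, so it is created on the executor rather than on the driver:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.api.java.JavaRDD;

public class PartitionedInsert {
    // dumpFilesRDD would be the RDD of XML file paths from the question.
    public static void saveParsed(JavaRDD<String> dumpFilesRDD) {
        dumpFilesRDD.foreachPartition(paths -> {
            // one connection per partition, created on the executor
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/db", "user", "password");
            PreparedStatement stmt = conn.prepareStatement("INSERT INTO table1 VALUES (?)");
            while (paths.hasNext()) {
                String path = paths.next();
                // validate/parse the XML here and bind the real column values
                stmt.setString(1, path);
                stmt.executeUpdate();
            }
            stmt.close();
            conn.close();
        });
    }
}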
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
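A broadcast variable is created once on the driver and read from the closure on the executors; a minimal sketch (the replacement-key map and the RDD are hypothetical placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastExample {
    public static void replaceKeys(JavaSparkContext jsc, JavaRDD<String> dumpFilesRDD) {
        // Built once on the driver; stands in for the real replacement-key lookup table.
        Map<String, String> replacementKeys = new HashMap<>();
        Broadcast<Map<String, String>> broadcastKeys = jsc.broadcast(replacementKeys);

        dumpFilesRDD.foreachPartition(paths -> {
            // Read the broadcast value on the executor.
            Map<String, String> keys = broadcastKeys.value();
            while (paths.hasNext()) {
                String path = paths.next();
                // parse the XML and substitute keys using the broadcast map
            }
        });
    }
}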
I have a scenario where I need to convert messages, present as JSON objects, to Apache Parquet format using Java. Any sample code or examples would be helpful. From what I have found, Hive, Pig or Spark are used to convert messages to Parquet. I need to convert to Parquet without involving these, using only Java.
To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.
To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.
Conversion to Avro objects is already done for you, see Kite's JsonUtil, and is ready to use as a file reader. The conversion method needs an Avro schema, but you can use that same library to infer an Avro schema from JSON data.
To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
fs.open(source), jsonSchema, Record.class)) {
reader.initialize();
try (ParquetWriter<Record> writer = AvroParquetWriter
.<Record>builder(outputPath)
.withConf(new Configuration())
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withSchema(jsonSchema)
.build()) {
for (Record record : reader) {
writer.write(record);
}
}
}
I had the same problem, and what I understood is that there are not many samples available for writing Parquet without using Avro or other frameworks. Finally I went with Avro. :)
Have a look at this; it may help you.
I have some ebooks in XML format. The books' pages are marked using processing instructions (e.g. <?pg 01?>). I need to extract the content of the book as plain text, one page at a time, and save each page as a text file. What's the best way of doing this?
The easiest way, assuming you need to integrate this into a Java program (as the tag implies), is probably to use a SAX parser such as XMLReader provides. You write a ContentHandler with callbacks for text and processing instructions.
When your p-i handler is called, you open a new output file.
When your text handler is called, you copy the character data to the currently open output file.
This tutorial has some helpful example code.
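A bare-bones sketch of such a handler (the pg target and the page-N.txt naming are assumptions based on your example):

import java.io.FileWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageSplitter extends DefaultHandler {
    private Writer out;
    private int page = 0;

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
        // Each <?pg ...?> PI starts a new page: close the current file, open the next one.
        if ("pg".equals(target)) {
            try {
                if (out != null) out.close();
                out = new FileWriter("page-" + (++page) + ".txt");
            } catch (Exception e) {
                throw new SAXException(e);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            if (out != null) out.write(ch, start, length);
        } catch (Exception e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            if (out != null) out.close();
        } catch (Exception e) {
            throw new SAXException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new java.io.File("book.xml"), new PageSplitter());
    }
}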
However if you don't need to integrate this into a Java program, I might use XSLT 2.0 (Saxon is free). XSLT 1.0 will not allow multiple output documents, but XSLT 2.0 will, and it will also make grouping by "milestone markup" (your "pg" processing instructions) easier. If you're interested in this approach, just ask... and give more info about the structure of the input document.
P.S. Even if you do need to integrate this into a Java program, you can call XSLT from Java - Saxon for example is written in Java. However I think if you're just processing PI's and text, it would be less effort to use a SAX parser.
I would probably use Castor to do this. It's a Java tool that allows you to specify bindings to Java objects, which you can then output as text to a file.
You need an ebook renderer for the format your books are in (and I highly doubt that it's XML if they use backslashes as processing instructions). Also, XPath works wonders if all you want to do is get the actual text: simply use //text() to select all the text nodes.
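For instance, a small sketch of that with the standard javax.xml.xpath API (book.xml is a hypothetical file name):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TextExtractor {
    // Collects the value of every text node in the document into one string.
    public static String allText(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile);
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList textNodes = (NodeList) xpath.evaluate("//text()", doc, XPathConstants.NODESET);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < textNodes.getLength(); i++) {
            sb.append(textNodes.item(i).getNodeValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(allText(new File("book.xml")));
    }
}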
You could try converting it to YAML and editing it in a word processor--then a simple macro should fix it right up.
I just browsed for this XML to YAML conversion utility--it's small but I didn't test it or anything.
http://svn.pyyaml.org/pyyaml-legacy/trunk/experimental/XmlYaml/convertyaml_map.py
Use an XSL stylesheet with <xsl:output method="text"/>.
You can even debug stylesheets in eclipse nowadays.
You can do this with Apache Tika like this:
byte[] value = ...; // your XML content as a byte array
Parser parser = new XMLParser();
org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(new ByteArrayInputStream(value), textHandler, metadata, context);
return textHandler.toString();
If you're using Maven, you'd probably want both of the dependencies below:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.13</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.13</version>
</dependency>