How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to foreach or foreachPartition, it will have a null value. Should I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
In parse(), every XML file is validated, parsed and inserted into several tables using Spark SQL. Only valid XML files produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the function parse to use sparkContext.sql().
If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process that is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml: A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your millions of XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted into a physical execution plan, and ultimately into a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to "Spark: DB connection per Spark RDD partition and do mapPartition" for further details. Pay attention to how the dbConnection object is enclosed in the function body.
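For illustration, here is a rough Java sketch of that pattern applied to the question's RDD of file paths (foreachPartition works the same way for side-effecting work such as database inserts). The JDBC URL and the parsed_files table are placeholders, not anything from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.JavaRDD;

public class PartitionedInsert {

    // Placeholder connection string; replace with your real database.
    private static final String JDBC_URL = "jdbc:postgresql://db-host:5432/mydb";

    public static void insertPerPartition(JavaRDD<String> dumpFilesRDD) {
        dumpFilesRDD.foreachPartition(paths -> {
            // The connection is created inside the closure, so it is opened on the
            // executor that processes this partition; it is never serialized from the driver.
            try (Connection conn = DriverManager.getConnection(JDBC_URL);
                 PreparedStatement stmt = conn.prepareStatement(
                         "INSERT INTO parsed_files (path) VALUES (?)")) {
                while (paths.hasNext()) {
                    // In practice you would parse the XML file here and insert the parsed result.
                    stmt.setString(1, paths.next());
                    stmt.executeUpdate();
                }
            }
        });
    }
}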
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
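For example, a rough sketch of a broadcast variable in the question's Java API; the lookup map used for the key replacement mentioned in the question is hypothetical:

import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastExample {

    // Ships the read-only lookup data to each executor once, instead of
    // serializing it into every task closure.
    public static void replaceKeys(JavaSparkContext sc,
                                   JavaRDD<String> dumpFilesRDD,
                                   Map<String, String> keyMapping) {
        Broadcast<Map<String, String>> broadcastMapping = sc.broadcast(keyMapping);
        dumpFilesRDD.foreach(path -> {
            Map<String, String> mapping = broadcastMapping.value();
            // ... parse the file at `path` and use mapping.get(...) to swap keys ...
        });
    }
}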
I want to read data from a single source, write it back to the same source, and then write it to another source.
The source can be different: HDFS, Mongo, Kafka ...
In local testing, I see some weird behaviour.
This is my first test: I persist the data on disk so it is not recomputed from the original source:
Dataset<Row> rootDataframe = sparkSession
    .read()
    .option("header", true)
    .csv("folder1");

Dataset<Row> cachedDataFrame = rootDataframe.persist(StorageLevel.DISK_ONLY());

cachedDataFrame
    .write()
    .option("header", true)
    .mode(SaveMode.Append)
    .csv("folder1");

cachedDataFrame
    .write()
    .option("header", true)
    .mode(SaveMode.Overwrite)
    .csv("folder2");

cachedDataFrame.unpersist();
This doesn't work as I intended, because of the persist function. The cached data is invalidated by my first write, which writes it back to the original source. In folder2, I get duplicated data (the original data and the data written by my first operation). This JIRA ticket looks like my problem: https://issues.apache.org/jira/browse/SPARK-24596.
But if I don't use persist, this works as I want: my second write operation isn't affected by my first one. I think this is because the cache isn't invalidated and the metadata stays unchanged.
If you take a look at the physical plan, I think the InMemoryFileIndex isn't recalculated even if I clear the cache manually.
== Physical Plan ==
*(1) FileScan csv [COUNTRY#10,CITY#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/D:/dev/Intellij-project/folder1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<COUNTRY:string,CITY:string>
Dataset<Row> rootDataframe = sparkSession
    .read()
    .option("header", true)
    .csv("folder1");

Dataset<Row> cachedDataFrame = rootDataframe;
cachedDataFrame.explain(true);

cachedDataFrame
    .write()
    .option("header", true)
    .mode(SaveMode.Append)
    .csv("folder1");

sparkSession.sharedState().cacheManager().clearCache();

cachedDataFrame
    .write()
    .option("header", true)
    .mode(SaveMode.Overwrite)
    .csv("folder2");
But this behaviour is different with other data sources. For example, with MongoDB, if I don't persist the data, the data is, as expected, duplicated, because the second write operation reads back the data from the first write.
Is there a way to insert the same data into multiple data sources, potentially including the root data source among the targets, using only Spark?
This seems to be impossible using DataFrame persisting. Maybe with DataFrame checkpointing? Or should I use external storage as an intermediate data store? Dirty coding to break the lineage? Of course, I can't simply reverse the write order; this is a simplified example, and I can have multiple sources that write to the same multiple targets.
Spark: 2.4.4
Java 8
Couldn't find it in the docs, but AFAIK persist() is a lazy operation, so just like where or select etc. it won't execute/do anything by itself; it just adds onto the execution plan. It executes only when you trigger some action like count or write.
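To see the laziness concretely: nothing is actually cached until an action runs. A rough sketch using the question's folders and imports; whether forcing materialization with count() before the first write is enough to sidestep the SPARK-24596 invalidation is something you would have to test:

Dataset<Row> cachedDataFrame = sparkSession
    .read()
    .option("header", true)
    .csv("folder1")
    .persist(StorageLevel.DISK_ONLY());   // lazy: only registers the intent to cache

long rows = cachedDataFrame.count();      // action: the data is actually read and spilled to disk here

cachedDataFrame.write().option("header", true).mode(SaveMode.Append).csv("folder1");
cachedDataFrame.write().option("header", true).mode(SaveMode.Overwrite).csv("folder2");
cachedDataFrame.unpersist();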
Assuming you have a valid use case for duplicating data in folder1 (as baffling as it is) by reading from it and writing back to it, I would have more confidence in the following options:
Change the order. Write back to folder1 last.
df = spark.read.csv('folder1')
# changed order: folder2 then folder1
df.write.mode('overwrite').csv('folder2')
df.write.mode('append').csv('folder1')
Make a copy of folder1 (instead of persist()) and use that as the source for all subsequent reads.
shutil.copytree('folder1', 'folder1_copy')
# OR spark.read.csv('folder1').write.csv('folder1_copy')
df = spark.read.csv('folder1_copy')
# original order: folder1 then folder2
df.write.mode('append').csv('folder1')
df.write.mode('overwrite').csv('folder2')
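The question also mentions DataFrame checkpointing. A rough sketch of what that would look like in the question's Java API; checkpoint() is eager by default, materializes the data under the checkpoint directory, and truncates the lineage, so the later writes should no longer re-read folder1. The checkpoint directory is a placeholder, and I have not verified this end-to-end against the SPARK-24596 behaviour:

// Placeholder directory; it should point at reliable storage such as HDFS.
sparkSession.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

Dataset<Row> checkpointed = sparkSession
    .read()
    .option("header", true)
    .csv("folder1")
    .checkpoint();   // runs a job now and writes the data to the checkpoint dir

checkpointed.write().option("header", true).mode(SaveMode.Append).csv("folder1");
checkpointed.write().option("header", true).mode(SaveMode.Overwrite).csv("folder2");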
I am building an application that handles batch processing using Spring Batch. Since the ItemReaders can handle dynamic schemas (e.g. reading JSON files (JsonItemReader), XML files (StaxEventItemReader), getting data from MongoDB (MongoItemReader) and so on), I am wondering how I can leverage Spring Batch to use a FlatFileItemWriter dynamically as the last stage in the step and produce a CSV file.
Normally, a fixed schema is required when I initialize the writer (before I even start writing objects). As the schema can differ between the JSON objects, each product in each chunk can potentially have different headers. Is there any workaround that lets me use a FlatFileItemWriter as the output when the domain objects have varying schemas that are unknown until runtime?
This is the current code for initializing the FlatFileItemWriter, but it uses a static schema that needs to be provided before I create the writer.
FlatFileItemWriter<Row> flatFileItemWriter = new FlatFileItemWriter<>();
Resource resource = new FileSystemResource(path);
flatFileItemWriter.setResource(resource);

CSVLineAggregator lineAggregator = CSVLineAggregator.builder()
    .schema(schema)
    .delimiter(delimiter)
    .quoteCharacter(quoteCharacter)
    .escapeCharacter(escapeCharacter)
    .build();

flatFileItemWriter.setLineAggregator(lineAggregator);
flatFileItemWriter.setEncoding(encoding);
flatFileItemWriter.setLineSeparator(lineSeparator);
flatFileItemWriter.setShouldDeleteIfEmpty(shouldDeleteIfEmpty);
flatFileItemWriter.setHeaderCallback(new HeaderCallback(schema.getColumnNames(), flatFileItemWriter, lineSeparator));
** Row is my domain object: a Map-based structure that stores the data in cells and columns, along with a schema that can differ between rows.
Thanks in advance for any tips!
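For what it's worth, one workaround sometimes used is a pre-scan (an earlier step or tasklet) that computes the union of all column names, which then drives both the header callback and a Map-based LineAggregator. Below is a minimal sketch of the aggregator side, assuming such a column list is already available; it is written against a plain Map rather than the Row type above, so it is not a drop-in implementation:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.batch.item.file.transform.LineAggregator;

// Writes Map-based rows against a fixed, pre-computed list of column names;
// columns that a particular row does not have become empty fields.
public class UnionSchemaLineAggregator implements LineAggregator<Map<String, Object>> {

    private final List<String> columnNames;
    private final String delimiter;

    public UnionSchemaLineAggregator(List<String> columnNames, String delimiter) {
        this.columnNames = columnNames;
        this.delimiter = delimiter;
    }

    @Override
    public String aggregate(Map<String, Object> row) {
        return columnNames.stream()
                .map(name -> {
                    Object value = row.get(name);
                    return value == null ? "" : value.toString();
                })
                .collect(Collectors.joining(delimiter));
    }
}

The same column list can then feed the header callback so the header always matches the data rows.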
I am trying to read data stored in HDFS that was acquired through Kafka and Spark Streaming.
I am using a Java app which saves some arbitrary data to Hadoop HDFS using the JavaRDD.saveAsTextFile method. Basically like this:
kafkaStreams.get(i).foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> consumerRecordJavaRDD) throws Exception {
        consumerRecordJavaRDD.saveAsTextFile("/tmp/abcd_" + System.currentTimeMillis());
    }
});
Text file lines are pushed through Kafka. The data is saved, and I can see it in the default Hadoop browser at localhost:50070.
Then, in a pyspark app I am trying to read the data using sparkContext.textFile.
The problem is that the data I read (either with Python or "by hand" at localhost:50070) also contains metadata. So every line is as follows (one long string):
"ConsumerRecord(topic = abcdef, partition = 0, offset = 3, CreateTime = 123456789, checksum = 987654321, serialized key size = -1, serialized value size = 28, key = null, value = aaaa, bbbb, cccc, dddd, eeee)"
I guess there is no sense in reading the data as it is, and splitting and parsing the long string just to get the "value" contents is not the best idea.
How should I address this problem, then? Is it possible to read the "value" field only? Or is the problem in the saving itself?
IMO you are doing this in the wrong order. I would strongly recommend that you consume data from Kafka directly in your pyspark application.
You can write the Kafka topic to HDFS as well if you want to (remember, Kafka persists data, so reading it in pyspark will not change what gets written to HDFS from the same topic).
Coupling your PySpark to HDFS when the data is already in Kafka doesn't make sense.
Here's a simple example of consuming data from Kafka in pyspark directly.
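For reference, a rough batch-read sketch of the same idea in the question's Java API. It assumes Spark 2.x with the spark-sql-kafka-0-10 package on the classpath; the broker address is a placeholder and the topic name is taken from the sample record above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaDirectRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("kafka-direct-read").getOrCreate();

        Dataset<Row> records = spark.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
                .option("subscribe", "abcdef")
                .option("startingOffsets", "earliest")
                .load();

        // The Kafka source exposes key/value as binary columns; cast value to get the payload only.
        Dataset<Row> values = records.selectExpr("CAST(value AS STRING) AS value");
        values.show(10);
    }
}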
I have solved the issue.
As mentioned in comments under the original post, I saved the data in parquet file format which is column oriented and easy to use.
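For reference, a rough Java sketch of that kind of approach: keep only the record value (dropping the ConsumerRecord metadata) and write it as Parquet. The method and output path are placeholders, not the OP's actual code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SaveValuesAsParquet {

    // Extracts only the payload from each Kafka record and writes it as Parquet.
    public static void save(SparkSession spark,
                            JavaRDD<ConsumerRecord<String, String>> records,
                            String outputPath) {
        JavaRDD<String> values = records.map(ConsumerRecord::value);
        Dataset<String> ds = spark.createDataset(values.rdd(), Encoders.STRING());
        ds.write().parquet(outputPath);
    }
}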
I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (greater than 1 GB) and I have to use Apache Spark as a requirement.
Googling a little, I found how an XML file can be read by a Spark process using partitioning in a proper way. As described here, the hadoop-streaming library can be used, like this:
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
"org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")
// Load documents, splitting wrt <page> tag.
val documents = sparkContext.hadoopRDD(jobConf, classOf[org.apache.hadoop.streaming.StreamInputFormat], classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])
Every chunk of information can then be processed in a Scala / Java object using dom4j or JAXB (more complex).
Now, the problem is the following: the XML file should be validated before processing it. How can I do this in a way that conforms to Spark? As far as I know, the StreamXmlRecordReader used to split the file does not perform any validation.
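One hedged idea, since neither Spark nor StreamXmlRecordReader validates anything: validate each extracted chunk on the executors with the JDK's javax.xml.validation, e.g. inside a map or filter over documents before the real parsing. Note that this treats each <page> chunk as a standalone document, so it only works if <page> is declared as a global element in the schema; the XSD path is a placeholder and must be readable from every executor:

import java.io.IOException;
import java.io.StringReader;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.xml.sax.SAXException;

public class PageValidator implements java.io.Serializable {

    private final String xsdPath;          // e.g. "/path/to/page.xsd" (placeholder)
    private transient Schema schema;       // rebuilt lazily on each executor

    public PageValidator(String xsdPath) {
        this.xsdPath = xsdPath;
    }

    public boolean isValid(String xmlChunk) {
        try {
            if (schema == null) {
                SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                schema = factory.newSchema(new StreamSource(xsdPath));
            }
            // Validators are not thread-safe, so create one per call.
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xmlChunk)));
            return true;
        } catch (SAXException | IOException e) {
            return false;                  // invalid chunk; log or collect the error as needed
        }
    }
}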
I am working on a batch application using Apache Spark. I want to write the final RDD as a text file; currently I am using the saveAsTextFile("filePath") method available on RDD.
My text file contains fields delimited with the \u0001 delimiter. So in the model class's toString() method I added all the fields separated by the \u0001 delimiter.
Is this the correct way to handle this, or is there a better approach available?
Also, what if I iterate over the RDD and write the file content using the FileWriter class available in Java?
Please advise on this.
Regards,
Shankar
To write as a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you can collect the data to the driver and then use a FileWriter.
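A rough Java sketch of those two options; the RDD is assumed to already hold the final delimited strings, and the paths are placeholders:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.spark.api.java.JavaRDD;

public class SingleFileWrite {

    // Option 1: one partition -> one part file on HDFS/S3 (the data must fit on a single worker).
    public static void writeAsSinglePart(JavaRDD<String> lines, String outputDir) {
        lines.coalesce(1).saveAsTextFile(outputDir);
    }

    // Option 2: pull everything to the driver and write it with plain Java IO
    // (the data must fit in driver memory).
    public static void writeViaDriver(JavaRDD<String> lines, String localFile) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(localFile))) {
            for (String line : lines.collect()) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}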
public static boolean copyMerge(JavaSparkContext sparkContext, JavaRDD<String> rdd, String dstPath) throws IOException, URISyntaxException {
    Configuration hadoopConf = sparkContext.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder);
    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConf);
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath), false, hadoopConf, null);
}
This solution is for S3 or any HDFS system. Achieved in two steps:
Save the RDD with saveAsTextFile; this generates multiple part files in the folder.
Run Hadoop "copyMerge".
Instead of doing a collect and pulling everything to the driver, I would rather suggest using coalesce, which is good for reducing memory problems.