I am trying to process a 6 GB CSV file (750 MB gzipped) using a GCP Dataflow job. I am using machineType n1-standard-4, which has 15 GB of RAM and 4 vCPUs.
My Dataflow code:
PCollection<TableRow> tableRow = lines.apply("ToTableRow", ParDo.of(new StringToRowConverter()));

static class StringToRowConverter extends DoFn<String, TableRow> {
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String inputLine = c.element();
        String[] split = inputLine.split(",");
        c.output(new TableRow().set("id", split[0]).set("apppackage", split[1]));
    }
}
My job has been running for the last 2 hours and the file is still not processed.
Once I manually break this large file into small parts, it works properly.
I have to process 400 GB of compressed files and load them into BigQuery. All of the zipped files are in GCP Storage.
My question is: if a single 6 GB file takes this long, how can I process 400 GB of zipped files?
Is there a way to optimise this process so that I can insert this data into BigQuery?
6 GB of CSV is not much data. CSV is just a really inefficient way of storing numerical data, and for string-like data it still carries significant overhead, is hard to parse, and cannot be seeked into at rest (it has to be parsed first). So we can be pretty optimistic that this will actually work out, data-wise. It's an import problem.
Don't roll your own parser. For example: what about fields that contain a , in their text? There are enough CSV parsers out there.
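For instance, with the Apache Commons CSV dependency on the classpath, the per-line conversion in your DoFn could look roughly like this sketch (untested, just to illustrate the idea):

// Inside processElement(): parse one line with a real CSV parser instead of String.split
try (CSVParser parser = CSVParser.parse(c.element(), CSVFormat.DEFAULT)) {
    CSVRecord record = parser.getRecords().get(0);
    c.output(new TableRow().set("id", record.get(0)).set("apppackage", record.get(1)));
}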
You say you want to get that data into BigQuery, so go Google's way and follow:
https://cloud.google.com/bigquery/docs/loading-data-local#bigquery-import-file-java
as BigQuery already comes with its own builder that supports CSV.
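A minimal sketch of such a load job with the BigQuery Java client (the dataset, table and bucket names are placeholders; BigQuery can read the gzipped CSVs straight from GCS, no Dataflow needed):

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("my_dataset", "my_table");
LoadJobConfiguration loadConfig = LoadJobConfiguration.newBuilder(tableId, "gs://my-bucket/path/*.csv.gz")
        .setFormatOptions(FormatOptions.csv())
        .setAutodetect(true)
        .build();
Job job = bigquery.create(JobInfo.of(loadConfig));
job = job.waitFor(); // blocks until the load job finishes; throws InterruptedException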
Related
I have a Dataflow pipeline in which I parse a file; if I find any incorrect records, I write them to a GCS bucket. But when there are no errors in the input file, TextIO still writes an empty file (containing only the header) to the GCS bucket.
So how can I prevent this, i.e. skip this step when the PCollection is empty?
errorRecords.apply("WritingErrorRecords", TextIO.write().to(options.getBucketPath())
.withHeader("ID|ERROR_CODE|ERROR_MESSAGE")
.withoutSharding()
.withSuffix(".txt")
.withShardNameTemplate("-SSS")
.withNumShards(1));
TextIO.write() always writes at least one shard, even if it is empty. As you are writing to a single shard anyway, you could get around this behavior by doing the write manually in a DoFn that takes the to-be-written elements as a side input, e.g.
PCollectionView<List<String>> errorRecordsView = errorRecords.apply(
    View.<String>asList());

// Your "main" PCollection has a single element,
// so the DoFn will get invoked exactly once.
p.apply(Create.of("whatever"))
    // The side input is your error records.
    .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            List<String> errors = c.sideInput(errorRecordsView);
            if (!errors.isEmpty()) {
              // Open the file manually (e.g. via FileSystems) and write all the errors to it.
            }
          }
        })
        .withSideInputs(errorRecordsView));
Being able to do so with the native Beam writes is a reasonable request, and newer Beam releases support it directly by setting skipIfEmpty.
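If your Beam version is new enough, the manual workaround above becomes unnecessary; something along these lines should work (check your SDK's TextIO javadoc to confirm skipIfEmpty is available):

errorRecords.apply("WritingErrorRecords", TextIO.write().to(options.getBucketPath())
        .withHeader("ID|ERROR_CODE|ERROR_MESSAGE")
        .withoutSharding()
        .withSuffix(".txt")
        .skipIfEmpty());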
I need to process a large file and insert it into a DB, and I don't want to spend a lot of RAM doing so. I know I can read the file line by line in streaming mode using the Apache Commons API or a BufferedReader, but I want to insert into the DB in batch mode, e.g. 1000 insertions in one go rather than one by one. Is reading the file line by line, adding to a list, counting its size, inserting, and then clearing the list the only way to achieve this?
According to your description, Spring Batch fits very well.
Basically, it uses a chunk-oriented approach to read, process, and write the content, and the steps can run concurrently for performance.
@Bean
protected Step loadFeedDataToDbStep() {
    return stepBuilder.get("load new fincon feed").<com.xxx.Group, FinconFeed>chunk(250)
            .reader(itemReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(itemProcessor(OVERRIDDEN_BY_EXPRESSION, OVERRIDDEN_BY_EXPRESSION_DATE, OVERRIDDEN_BY_EXPRESSION))
            .writer(itemWriter())
            .listener(archiveListener())
            .build();
}
You can refer to the Spring Batch reference documentation for more details.
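If you would rather not pull in Spring Batch, the same chunking idea works with plain JDBC batching. A rough sketch (the connection details, table and column names are made up, and this belongs inside a method that declares throws Exception):

// Stream the file line by line and flush inserts to the DB every 1000 rows.
try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-file.csv"));
     Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass");
     PreparedStatement ps = conn.prepareStatement("INSERT INTO feed (id, value) VALUES (?, ?)")) {
    String line;
    int count = 0;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split(",");
        ps.setString(1, parts[0]);
        ps.setString(2, parts[1]);
        ps.addBatch();
        if (++count % 1000 == 0) {
            ps.executeBatch(); // one round trip per 1000 rows
        }
    }
    ps.executeBatch(); // flush the remaining rows
}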
I am trying to read data stored in HDFS that was acquired through Kafka and Spark Streaming.
I am using a Java app which saves some arbitrary data to Hadoop HDFS with the JavaRDD.saveAsTextFile method, basically like this:
kafkaStreams.get(i).foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> consumerRecordJavaRDD) throws Exception {
        consumerRecordJavaRDD.saveAsTextFile("/tmp/abcd_" + System.currentTimeMillis());
    }
});
The lines of a text file are pushed through Kafka. The data is saved, and I can see it in the default Hadoop browser at localhost:50070.
Then, in a pyspark app I am trying to read the data using sparkContext.textFile.
The problem is that the data I read (either with Python or "by hand" at localhost:50070) also contains metadata, so every line looks like this (one long string):
"ConsumerRecord(topic = abcdef, partition = 0, offset = 3, CreateTime = 123456789, checksum = 987654321, serialized key size = -1, serialized value size = 28, key = null, value = aaaa, bbbb, cccc, dddd, eeee)"
I guess reading the data as-is and then splitting and parsing the long string just to get the "value" contents is not the best idea.
How should I address this problem, then? Is it possible to read the "value" field only? Or is the problem in the saving itself?
IMO you are doing this in the wrong order. I would strongly recommend that you consume data from Kafka directly in your pyspark application.
You can write the Kafka topic to HDFS as well if you want to (remember, Kafka persists data, so reading it in pyspark will not change what gets written to HDFS from the same topic).
Coupling your PySpark to HDFS when the data is already in Kafka doesn't make sense.
Here's a simple example of consuming data from Kafka in pyspark directly.
I have solved the issue.
As mentioned in the comments under the original post, I saved the data in the Parquet file format, which is column-oriented and easy to use.
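For anyone who wants to stay with saveAsTextFile instead, a minimal variation of the original code (untested sketch) is to map each ConsumerRecord to its value() before saving, so only the payload lands in HDFS rather than the record's toString() representation:

kafkaStreams.get(i).foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> rdd) throws Exception {
        // Keep only the message payload, not the full ConsumerRecord metadata.
        rdd.map(ConsumerRecord::value)
           .saveAsTextFile("/tmp/abcd_" + System.currentTimeMillis());
    }
});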
How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath -> parse(dumpFilePath))
In parse(), every XML file is validated, parsed, and inserted into several tables using Spark SQL. Only valid XML files produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the parse function to use sparkContext.sql().
If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process, which is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml: A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your million XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted into a physical execution plan, which is ultimately a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to "Spark: DB connection per Spark RDD partition and do mapPartition" for further details, and pay attention to how the dbConnection object is enclosed in the function body.
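Applied to your example, a hedged sketch (the JDBC connection details are placeholders) would create the connection inside the partition function so that it lives on the executor:

dumpFilesRDD.foreachPartition(paths -> {
    // The connection is created here, on the executor, and never serialized from the driver.
    try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")) {
        while (paths.hasNext()) {
            String dumpFilePath = paths.next();
            // parse(dumpFilePath) and insert the resulting rows through conn,
            // instead of calling sparkContext.sql() on the executor.
        }
    }
});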
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
I am working on a batch application using Apache Spark. I want to write the final RDD as a text file; currently I am using the saveAsTextFile("filePath") method available on the RDD.
My text file contains fields delimited with the \u0001 delimiter, so in the model class's toString() method I concatenated all the fields separated by the \u0001 delimiter.
Is this the correct way to handle this, or is there a better approach?
Also, what if I iterate over the RDD and write the file content using the FileWriter class available in Java?
Please advise on this.
Regards,
Shankar
To write as a single file there are a few options. If you are writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you can collect the data to the driver and then use a FileWriter.
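For example, a hedged sketch of the first option (modelRdd, outputPath, getId() and getName() are made-up names for your RDD, output location and model accessors), which also builds the delimited line explicitly instead of relying on toString():

// Build the \u0001-delimited line per record, then write a single part file.
JavaRDD<String> lines = modelRdd.map(m -> m.getId() + "\u0001" + m.getName());
lines.coalesce(1).saveAsTextFile(outputPath); // single partition: data must fit on one worker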
public static boolean copyMerge(JavaSparkContext sparkContext, JavaRDD<String> rdd, String dstPath) throws IOException, URISyntaxException {
    // awsAccessKey / awsSecretKey are assumed to be defined elsewhere in the class.
    Configuration hadoopConf = sparkContext.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder);
    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConf);
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath), false, hadoopConf, null);
}
This solution works for S3 or any HDFS-compatible system, and is achieved in two steps:
Save the RDD with saveAsTextFile; this generates multiple part files in the folder.
Run Hadoop's copyMerge to combine them into a single file.
Instead of doing a collect and bringing everything to the driver, I would rather suggest using coalesce, which is better at avoiding memory problems on the driver.