How to read HDFS Kafka data using PySpark? - java

I am trying to read data stored in HDFS that was acquired through Kafka and Spark Streaming.
I am using a Java app which saves some arbitrary data to Hadoop HDFS using the JavaRDD.saveAsTextFile method. Basically like this:
kafkaStreams.get(i).foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> consumerRecordJavaRDD) throws Exception {
        // Each record is written via its toString(), which includes all the ConsumerRecord metadata.
        consumerRecordJavaRDD.saveAsTextFile("/tmp/abcd_" + System.currentTimeMillis());
    }
});
Text file lines are pushed through Kafka. The data is saved, and I can see it in the default Hadoop browser at localhost:50070.
Then, in a PySpark app, I am trying to read the data using sparkContext.textFile.
The problem is that the data I read (either with Python or "by hand" at localhost:50070) also contains metadata, so every line looks like this (one long string):
"ConsumerRecord(topic = abcdef, partition = 0, offset = 3, CreateTime = 123456789, checksum = 987654321, serialized key size = -1, serialized value size = 28, key = null, value = aaaa, bbbb, cccc, dddd, eeee)"
I guess there is no sense in reading the data as it is, and splitting and parsing the long string just to get the "value" contents is not the best idea.
How should I address this problem, then? Is it possible to read only the "value" field? Or is the problem in the saving itself?

IMO you are doing this in the wrong order. I would strongly recommend that you consume data from Kafka directly in your PySpark application.
You can write the Kafka topic to HDFS as well if you want to (remember, Kafka persists data, so reading it in PySpark does not change what gets written to HDFS from the same topic).
Coupling your PySpark job to HDFS when the data is already in Kafka doesn't make sense.
Here's a simple example of consuming data from Kafka in pyspark directly.
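A minimal Structured Streaming sketch along those lines (assuming the spark-sql-kafka package is available to spark-submit; the broker address and output paths are illustrative, and the topic name is taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-direct-read").getOrCreate()

# Subscribe to the topic and keep only the message value as a string.
values = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "abcdef")
    .option("startingOffsets", "earliest")
    .load()
    .select(col("value").cast("string").alias("value")))

# Optionally persist just the values to HDFS (as Parquet) while the stream runs.
query = (values.writeStream
    .format("parquet")
    .option("path", "/tmp/kafka_values")
    .option("checkpointLocation", "/tmp/kafka_checkpoint")
    .start())
query.awaitTermination()

Run it with something like spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:<your Spark version> so the Kafka source is on the classpath.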

I have solved the issue.
As mentioned in the comments under the original post, I saved the data in the Parquet file format, which is column-oriented and easy to use.
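For reference, the read side in PySpark is then straightforward (a sketch; the path and the column name are assumptions about how the Parquet files were written):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-parquet").getOrCreate()

# Path and column name are assumptions; adjust to match how the data was written.
df = spark.read.parquet("/tmp/abcd_parquet")
df.select("value").show(truncate=False)  # just the Kafka payload, no ConsumerRecord metadata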

Related

Processing large CSVs using Dataflow jobs

I am trying to process a 6 GB CSV file (750 MB gzipped) using GCP Dataflow jobs. I am using machineType n1-standard-4, which has 15 GB of RAM and 4 vCPUs.
My Dataflow code:
PCollection<TableRow> tableRow = lines.apply("ToTableRow", ParDo.of(new StringToRowConverter()));

static class StringToRowConverter extends DoFn<String, TableRow> {
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String inputLine = c.element();
        String[] split = inputLine.split(",");
        c.output(new TableRow().set("id", split[0]).set("apppackage", split[1]));
    }
}
My job has been running for the last 2 hours and has still not finished processing.
Once I manually break this large file into smaller parts, it works properly.
I have to process 400 GB of compressed files to put into BigQuery. All zipped files are in GCP Storage.
My question is: if a single 6 GB file takes this long to process, how can I process 400 GB of zipped files?
Is there a way I can optimise this process so that I am able to insert this data into my BQ?
6 GB of CSV is not much data. CSV is just a really inefficient way of storing numerical data, and for string-like data it still carries significant overhead, is hard to parse, and is impossible to seek into at specific positions at rest (it needs to be parsed first). So we can be pretty optimistic that this will actually work out, data-wise. It's an import problem.
Don't roll your own parser. For example: what about fields that contain a , in their text? There are enough CSV parsers out there.
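To illustrate the point (in Python just for brevity; the same applies to any of the Java CSV libraries), a proper parser handles quoted fields that a plain split on , silently breaks:

import csv

line = '42,"Doe, John",play'
naive = line.split(",")            # ['42', '"Doe', ' John"', 'play'] -- wrong field count
parsed = next(csv.reader([line]))  # ['42', 'Doe, John', 'play']      -- correct
print(naive)
print(parsed)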
You say you want to get that data into BigQuery, so go Google's way and follow:
https://cloud.google.com/bigquery/docs/loading-data-local#bigquery-import-file-java
as BigQuery already comes with its own builder that supports CSV.
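Since the zipped files are already in GCP Storage, the most direct route is a BigQuery load job straight from GCS. The linked docs show the Java builder; here is a minimal sketch with the Python client (the bucket, dataset and table names are placeholders; gzipped CSV is supported natively):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # drop this if the files have no header row
    autodetect=True,       # or pass an explicit schema instead
)

# Placeholder URI and table name; BigQuery decompresses .gz CSV files on load.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.csv.gz",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish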

How to read a CSV uploaded via a Spring REST handler using Spark?

I am new to Spark and DataFrames. I came across the piece of code below, provided by the Databricks library, to read a CSV from a specified path in the file system.
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downlos/2017.csv")
Is there any API in the Databricks CSV package that parses a byte array from an HTTP request instead of reading from the file system?
The use case here is to read a multipart (CSV) file, uploaded via a Spring REST handler, using Spark DataFrames. I'm looking for a DataFrame API that can take a file/byte array as input instead of reading from the file system.
From the file read, I need to select only those columns in each row that match a given condition (e.g. any column value that is not equal to the string "play" in each parsed row) and save only those fields back to the database.
Can anyone suggest whether the above-mentioned use case is feasible in Spark using RDDs/DataFrames? Any suggestions on this would be of much help.
You cannot create an RDD from it directly; you have to convert it to a String first, and then you can create an RDD.
check this: URL contents to a String or file
val html = scala.io.Source.fromURL("https://spark.apache.org/").mkString
val list = html.split("\n").filter(_ != "")
val rdds = sc.parallelize(list)
val count = rdds.filter(_.contains("Spark")).count()
Scala fromURL API
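If the multipart upload arrives as a byte array in the REST handler, one option (a sketch, assuming Spark 2.2+ where DataFrameReader.csv also accepts an RDD of strings; the names are illustrative) is to decode it and parse it entirely in memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-upload").getOrCreate()

def csv_bytes_to_df(csv_bytes):
    # Decode the uploaded bytes and parse the lines without touching the file system.
    lines = csv_bytes.decode("utf-8").splitlines()
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.option("header", "true").csv(rdd)

df = csv_bytes_to_df(uploaded_bytes)               # uploaded_bytes: the multipart payload (placeholder)
filtered = df.filter(df["some_column"] != "play")  # keep rows where the column is not "play"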

How to access SparkContext on executors to save DataFrame to Cassandra?

How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath -> parse(dumpFilePath))
In parse(), every XML file is validated, parsed and inserted into several tables using Spark SQL. Only valid XML files produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the function parse to use sparkContext.sql().
If I were to rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process, and it is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your millions of XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program where certain sections are automatically converted into a physical execution plan, which ultimately becomes a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to Spark : DB connection per Spark RDD partition and do mapPartition for further details. Pay attention to how the dbConnection object is enclosed in the function body.
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
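For illustration, a minimal PySpark sketch of that per-partition pattern (open_db_connection and save_row are hypothetical helpers standing in for your Cassandra/JDBC client):

def save_partition(rows):
    # The connection is created inside the partition function, i.e. on the executor,
    # not on the driver, so nothing unserializable has to be shipped around.
    conn = open_db_connection()      # hypothetical helper
    try:
        for row in rows:
            save_row(conn, row)      # hypothetical per-row insert
    finally:
        conn.close()

dumpFilesRDD.foreachPartition(save_partition)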

VerticaCopyStream is very slow

I use a Vertica flex table to load JSON into Vertica without defining the tables, and I have problems with my loading time.
I connect to my Vertica with the JDBC driver and then use this code:
String copyQuery = "COPY schema.tablename FROM STDIN PARSER fjsonparser()";
VerticaCopyStream vstream = new VerticaCopyStream((VerticaConnection) conn, copyQuery);
InputStream input;

vstream.start();
for (JsonNode json : jsonList) {
    input = new ByteArrayInputStream(json.toString().getBytes());
    vstream.addStream(input);
    input.close();
}
vstream.execute();
vstream.finish();
The command "vstream.execute()" takes 12 seconds for 5000 jsons but when I use COPY command from file it runs for less then a second.
Your problem is not with VerticaCopyStream; the problem is with the different parsers you used. You need to compare apples to apples: a JSON parser is bound to be slower than a simple CSV parser.
COPY FROM STDIN and COPY LOCAL stream data from the client. Running it on the server with just a COPY (no LOCAL or STDIN) will be a direct load straight from the Vertica daemon with no network latency (assuming the file is on local disk and not a NAS).
In addition, regarding your method of re-instantiating the ByteArrayInputStream: wouldn't it be better to turn your jsonList into a single InputStream and pass just that in, instead of creating an input stream for every item?
If you run the same insert using vsql, it solves the problem.

saveAsTextFile() to write the final RDD as a single text file - Apache Spark

I am working on a batch application using Apache Spark. I want to write the final RDD as a text file; currently I am using the saveAsTextFile("filePath") method available on RDD.
My text file contains fields delimited with the \u0001 delimiter, so in the model class's toString() method I added all the fields separated by the \u0001 delimiter.
Is this the correct way to handle this? Or is there a better approach available?
Also, what if I iterate over the RDD and write the file contents using the FileWriter class available in Java?
Please advise on this.
Regards,
Shankar
To write a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you could collect the data to the driver and then use a FileWriter.
public static boolean copyMerge(JavaSparkContext sc, JavaRDD<String> rdd, String dstPath) throws IOException, URISyntaxException {
    Configuration hadoopConf = sc.hadoopConfiguration();
    // awsAccessKey / awsSecretKey are supplied by the caller.
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);

    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder);

    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConf);
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath), false, hadoopConf, null);
}
This solution works for S3 or any HDFS system. It is achieved in two steps:
Save the RDD with saveAsTextFile; this generates multiple part files in the folder.
Run Hadoop "copyMerge".
Instead of doing a collect and bringing everything to the driver, I would suggest using coalesce, which is good for reducing memory problems.
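For example, a minimal sketch of the coalesce approach (shown in PySpark for brevity; the sample data, the \u0001 join and the output path are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="single-file-output")

# Stand-in for the final RDD: join each record's fields with the \u0001 delimiter.
final_rdd = sc.parallelize([("1", "aaaa"), ("2", "bbbb")]) \
              .map(lambda fields: "\u0001".join(fields))

# One partition -> one part file in the output directory (the data must fit on one worker).
final_rdd.coalesce(1).saveAsTextFile("/tmp/single-file-output")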
