I use Vertica flex tables to load JSON into Vertica without defining the tables, and I am having problems with my loading time.
I connect to Vertica with the JDBC driver and then use this code:
String copyQuery = "COPY schema.tablename FROM STDIN PARSER fjsonparser()";
VerticaCopyStream vstream = new VerticaCopyStream((VerticaConnection) conn, copyQuery);
InputStream input;
vstream.start();
for (JsonNode json : jsonList) {
    input = new ByteArrayInputStream(json.toString().getBytes());
    vstream.addStream(input);
    input.close();
}
vstream.execute();
vstream.finish();
The command "vstream.execute()" takes 12 seconds for 5000 jsons but when I use COPY command from file it runs for less then a second.
Your problem is not with VerticaCopyStream; the problem is the different parsers you used. You need to compare apples to apples: the JSON parser will be slower than the simple CSV parser.
COPY FROM STDIN and COPY LOCAL stream data from the client. Running it on the server with a plain COPY (no LOCAL or STDIN) is a direct load straight from the Vertica daemon with no network latency (assuming the file is on local disk and not a NAS).
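For illustration, a hedged sketch of what that server-side variant could look like through the same JDBC connection (the file path is hypothetical, must exist on the Vertica node's local disk, and your user needs the privilege to run a server-side COPY):

Statement stmt = conn.createStatement();
// Plain COPY: the Vertica daemon reads the file directly from its own disk, no client-side streaming.
stmt.execute("COPY schema.tablename FROM '/data/on/vertica/node/records.json' PARSER fjsonparser()");
stmt.close();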
In addition, regarding your method of re-instantiating the ByteArrayInputStream for every item: wouldn't it be better to turn your jsonList into a single InputStream and pass just that in, instead of creating an input stream for every item?
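As a rough sketch of that suggestion (assuming jsonList, conn and copyQuery from the question; note that ByteArrayInputStream.close() is a no-op anyway), the documents could be concatenated into one stream and added once:

// Sketch: one combined stream for all documents instead of one stream per document.
StringBuilder sb = new StringBuilder();
for (JsonNode json : jsonList) {
    sb.append(json.toString()).append('\n'); // fjsonparser reads consecutive JSON records from the stream
}
InputStream combined = new ByteArrayInputStream(sb.toString().getBytes());

VerticaCopyStream vstream = new VerticaCopyStream((VerticaConnection) conn, copyQuery);
vstream.start();
vstream.addStream(combined);
vstream.execute();
vstream.finish();
combined.close();

Whether this closes the 12-second gap is something you would need to measure, but it at least avoids thousands of per-item addStream calls.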
If you run the same insert using vsql, it solves the problem.
Scenario Outline:
  Given I have a stream from system-env
  When I request the streaming url
  Then A http response 200 received
  And I verify "data1" is accurate
  And I verify "data2" is accurate

  Examples:
    | data1 | data2 |
    | abc   | def   |
    | test1 | test2 |
What is the best way to make sure the above scenario is run for different input streams? (Currently a single stream is received from a system property passed to the Gradle task, used as a tag, and the scenario is tagged with the same.)
I want to scale this to 50 or 100 streams later, and I don't want to add all of them to the Examples table because that is too tedious.
I am thinking of collecting all the streams from a YAML file (say 50) and running the above scenario for each stream.
Here is a high-level approach you can follow, using the Jackson library to read the data from YAML and use it in the script. You can pick the stream URL based on the system property stream, which holds the index of the current stream.
Given("^I have the stream from system-env$", () ->
{
String myTargetStream ="";
String[] streams = get_text_from_yaml_using_jackon_library_here;//you have to implement this
int stream = System.getProperty("stream");
if (stream+1>streams.length){
myTargetStream = streams[0];
System.setProperty("stream", "0");
}else{
myTargetStream = streams[stream+1];
System.setProperty("stream", Integer.toString(stream+1));
}
// now use the myTargetStream in your test or generate the feature file here
});
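For the YAML part, here is a minimal sketch of what that placeholder could look like, assuming jackson-dataformat-yaml is on the classpath and a hypothetical streams.yml whose content is a plain list of URLs:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import java.io.File;

public class StreamConfig {

    // streams.yml is assumed to look like:
    //   - http://host/stream1
    //   - http://host/stream2
    static String[] loadStreams(String path) throws Exception {
        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
        return mapper.readValue(new File(path), String[].class);
    }
}

The step definition above could then call StreamConfig.loadStreams("streams.yml") in place of the placeholder.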
I need to process a large file and insert its contents into a DB, and I don't want to spend a lot of RAM doing it. I know we can read lines in streaming mode using the Apache Commons API or a BufferedReader, but I want to insert into the DB in batch mode, e.g. 1000 insertions in one go rather than one by one. Is reading the file line by line, adding to a list, counting its size, inserting, and then clearing the list the only way to achieve this?
According to your description, Spring Batch fits very well.
Basically, it uses the chunk concept to read/process/write the content, and the chunks can be processed concurrently for performance.
@Bean
protected Step loadFeedDataToDbStep() {
    return stepBuilder.get("load new fincon feed").<com.xxx.Group, FinconFeed>chunk(250)
            .reader(itemReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(itemProcessor(OVERRIDDEN_BY_EXPRESSION, OVERRIDDEN_BY_EXPRESSION_DATE, OVERRIDDEN_BY_EXPRESSION))
            .writer(itemWriter())
            .listener(archiveListener())
            .build();
}
You can refer here for more details.
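To make the chunk idea above a bit more concrete, here is a minimal sketch of matching reader and writer beans, assuming a recent Spring Batch (5.x), a delimited text file and a single-table insert. Feed, the file path, the column names and the SQL are made up for illustration; Feed is a simple POJO with field1/field2 getters and setters.

import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class FeedBatchConfig {

    @Bean
    public FlatFileItemReader<Feed> itemReader() {
        // Streams the file line by line; only the current chunk is kept in memory.
        return new FlatFileItemReaderBuilder<Feed>()
                .name("feedReader")
                .resource(new FileSystemResource("/data/feed.txt")) // hypothetical path
                .delimited()
                .names("field1", "field2")                          // hypothetical columns
                .targetType(Feed.class)
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<Feed> itemWriter(DataSource dataSource) {
        // Each chunk (250 items in the step above, or 1000 if you prefer) becomes one JDBC batch.
        return new JdbcBatchItemWriterBuilder<Feed>()
                .dataSource(dataSource)
                .sql("INSERT INTO FEED_FILE (FIELD1, FIELD2) VALUES (:field1, :field2)")
                .beanMapped()
                .build();
    }
}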
I am trying to read data stored in HDFS that was acquired through Kafka and Spark Streaming.
I am using a Java app which saves some arbitrary data using the JavaRDD.saveAsTextFile method to Hadoop HDFS. Basically like this:
kafkaStreams.get(i).foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, String>> consumerRecordJavaRDD) throws Exception {
        consumerRecordJavaRDD.saveAsTextFile("/tmp/abcd_" + System.currentTimeMillis());
    }
});
Text file lines are pushed through Kafka. The data is saved and I can see it in the default Hadoop browser at localhost:50070.
Then, in a pyspark app I am trying to read the data using sparkContext.textFile.
The problem is the data I read (either with python or "by hand" at localhost:50070) also contain metadata. So every line is as follows (one long string):
"ConsumerRecord(topic = abcdef, partition = 0, offset = 3, CreateTime = 123456789, checksum = 987654321, serialized key size = -1, serialized value size = 28, key = null, value = aaaa, bbbb, cccc, dddd, eeee)"
I guess there is no sense in reading the data as it is, and splitting and parsing the long string just to get the "value" contents is not the best idea.
How should I address this problem, then? Is it possible to read the "value" field only? Or is the problem in the saving itself?
IMO you are doing this in the wrong order. I would strongly recommend that you consume data from Kafka directly in your pyspark application.
You can write the Kafka topic to HDFS as well if you want to (remember, Kafka persists data, so reading it in pyspark will not change what gets written to HDFS from the same topic).
Coupling your PySpark to HDFS when the data is already in Kafka doesn't make sense.
Here's a simple example of consuming data from Kafka in pyspark directly.
I have solved the issue.
As mentioned in comments under the original post, I saved the data in parquet file format which is column oriented and easy to use.
How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath -> parse(dumpFilePath))
In parse(), every XML file is validated, parsed, and inserted into several tables using Spark SQL. Only valid XML files will produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, SparkContext is needed in the function parse to use sparkContext.sql().
If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process that is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your millions of XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted into a physical execution plan, ultimately a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to Spark : DB connection per Spark RDD partition and do mapPartition for further details. Pay attention to how the dbConnection object is enclosed in the function body.
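A minimal Java sketch of that pattern is below (the JDBC URL, table and row handling are hypothetical; the point is that the connection is created inside the partition function, on the executor, rather than on the driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class PartitionWriter {

    // Writes every partition through its own JDBC connection and returns per-partition row counts.
    static JavaRDD<Integer> writePerPartition(JavaRDD<String> lines) {
        return lines.mapPartitions((Iterator<String> rows) -> {
            // Opened on the executor, so no non-serializable object crosses the driver/executor boundary.
            Connection conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb"); // hypothetical URL
            PreparedStatement ps = conn.prepareStatement("INSERT INTO my_table(col) VALUES (?)");
            int count = 0;
            while (rows.hasNext()) {
                ps.setString(1, rows.next());
                ps.addBatch();
                count++;
            }
            ps.executeBatch();
            ps.close();
            conn.close();
            List<Integer> result = new ArrayList<>();
            result.add(count);
            return result.iterator();
        });
    }
}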
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
So, I'm trying to save the PDF report in the database using a service method. I saw that there's a way to specify the output of the generated report by calling pdfOptions.setOutputStream(output). But how can I call my save method this way?
I saw this post but I'm stuck at the persist point.
I appreciate any advice.
PDFRenderOption pdfOptions = new PDFRenderOption(options);
pdfOptions.setOutputFormat(FORMAT_PDF);
pdfOptions.setOption(IPDFRenderOption.PAGE_OVERFLOW, IPDFRenderOption.OUTPUT_TO_MULTIPLE_PAGES);
pdfOptions.setOutputStream(response.getOutputStream());//opens report on browser
runAndRenderTask.setRenderOption(pdfOptions);
You are streaming the output directly to the client with
pdfOptions.setOutputStream(response.getOutputStream());//opens report on browser
If you do this, your output gets consumed and you'll not be able to save it to the database.
I would use a "tee" like approach, you know, with one input stream and two output streams.
You could write that yourself, or you could just use something like the Apache Commons IO TeeOutputStream.
This could look like this:
OutputStream blobOutputStream = ...; // for writing to the DB as a BLOB
OutputStream teeStream = new TeeOutputStream(response.getOutputStream(), blobOutputStream);
pdfOptions.setOutputStream(teeStream);
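If buffering the report in memory is acceptable, a hedged sketch of the BLOB side could look like this (reportService.saveReport, reportName and the BLOB column are hypothetical; only runAndRenderTask and pdfOptions come from the snippet above):

// java.io.ByteArrayOutputStream and org.apache.commons.io.output.TeeOutputStream are assumed to be imported.
ByteArrayOutputStream pdfBuffer = new ByteArrayOutputStream();
OutputStream teeStream = new TeeOutputStream(response.getOutputStream(), pdfBuffer);

pdfOptions.setOutputStream(teeStream);
runAndRenderTask.setRenderOption(pdfOptions);
runAndRenderTask.run();
runAndRenderTask.close();

// Hypothetical service call: persist the captured bytes, e.g. into a BLOB column via JDBC or JPA.
reportService.saveReport(reportName, pdfBuffer.toByteArray());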