I want to store Elasticsearch indices as HDFS files without using the ES-Hadoop Connector.
A proposed solution is to use Spark Streaming with custom receivers to read the documents and save them as Parquet files. The code looks like this:
JavaDStream<String> jsonDocs = ssc.union(dsList.get(0), dsList.subList(1, dsList.size())); // I have a couple of receivers
jsonDocs.foreachRDD(rdd -> {
    Dataset<Row> ds = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
    ds.write().mode(SaveMode.Append).option("compression", "gzip").parquet(path);
});
With this I get acceptable performance numbers, but since I am new to Spark I wonder if there is any room for improvement.
For example, I see that the json() and parquet() jobs take most of the time. Is the long-running json() job necessary, or can it be avoided?
(I have omitted some other jobs, e.g. count(), from the code snippet for simplicity.)
Using Structured Streaming looks like a good option, but I haven't found a simple way to combine it with custom receivers.
Thanks in advance,
spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
Looking at the line above, read().json() might not be the best choice for performance-sensitive work: Spark uses JacksonParser in its data source API to read JSON and infer the schema. If your JSON structure is simple, try parsing it yourself with a map() function that creates Rows.
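For illustration, here is a minimal sketch of that idea, reusing jsonDocs, spark, and path from the question and assuming a flat document with two hypothetical string fields, id and message. Supplying an explicit schema and parsing with Jackson in map() avoids the schema-inference pass that read().json() performs.
// uses com.fasterxml.jackson.databind.ObjectMapper/JsonNode and
// org.apache.spark.sql.types.StructType, org.apache.spark.sql.RowFactory
StructType schema = new StructType()
    .add("id", DataTypes.StringType)
    .add("message", DataTypes.StringType);

jsonDocs.foreachRDD(rdd -> {
    JavaRDD<Row> rows = rdd.map(json -> {
        // per-record mapper for brevity; reuse one per partition in real code
        ObjectMapper mapper = new ObjectMapper();
        JsonNode node = mapper.readTree(json);
        return RowFactory.create(node.path("id").asText(), node.path("message").asText());
    });
    Dataset<Row> ds = spark.createDataFrame(rows, schema);
    ds.write().mode(SaveMode.Append).option("compression", "gzip").parquet(path);
});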
I am running an Apache Beam pipeline on GCP Dataflow.
The Dataflow pipeline suggests the following insight:
A fusion break can be inserted after the following transforms to increase parallelism: ReadFromGCS/Match All/Match filepatterns/ParMultiDo(Match). The transforms had the following output-to-input element count ratio, respectively: 1006307.
My pipeline looks something like this:
PCollection<String> records = p.apply("ReadFromGCS", TextIO.read().from(options.getInput())
.withHintMatchesManyFiles());
PCollection<Document> documents = records.apply("ConvertToDocument", ParDo.of(new ProcessJSON(options.getBatch())));
// Write to MongoDB using ParDo transform sink
documents.apply("WriteToMongoDB", MongoDbIO.write()
.withUri("mongodb+srv://"+options.getMongo())
.withDatabase(options.getDatabase())
.withCollection(options.getCollection())
.withBatchSize(options.getBatchSize()));
My input is a GCS bucket matching the pattern 'gs://test-bucket/test/*.json', which contains millions of JSON files.
I want to understand what the suggestion means and how to increase parallelism in my case, as Dataflow suggests.
I tried the following documentation but could not figure out how to solve this:
https://cloud.google.com/dataflow/docs/guides/using-dataflow-insights?&_ga=2.162468924.-1096227812.1671933466#high-fan-out
(screenshot of the Dataflow insight attached)
Please look at Fusion Optimization for some background information on how to enforce or prevent fusion.
A very common way is to add a GroupByKey if there is some natural way to group things, or to use an operation such as Reshuffle.viaRandomKey if you just want to spread elements evenly.
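For example, here is a sketch of what that could look like in the pipeline from the question (reusing its options and ProcessJSON; the step name "BreakFusion" is made up): inserting Reshuffle.viaRandomKey() after the read forces a fusion break, so the expanded elements are redistributed before the ParDo and the Mongo write.
// uses org.apache.beam.sdk.transforms.Reshuffle
PCollection<String> records = p
    .apply("ReadFromGCS", TextIO.read().from(options.getInput())
        .withHintMatchesManyFiles())
    .apply("BreakFusion", Reshuffle.<String>viaRandomKey());   // fusion break here

PCollection<Document> documents =
    records.apply("ConvertToDocument", ParDo.of(new ProcessJSON(options.getBatch())));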
I am very new to Java and have been tasked to use Spring Batch to read in some text files. So far, Spring Batch resources online have helped me get to a point where I am reading, processing, and writing some simple test .csv files into Mongo.
The problem I have now is that the actual file I would like to read has over 600 columns. That means that, with the current way I am reading the file into Java, I would need 600+ fields in my @Document Mongo model.
I have been thinking of a couple of ways to get around this.
First, I was thinking I could read in each line as a String and then, in my processor, split everything up and format the data so I could return a list of documents through my MongoTemplate, but returning a List is not viable from the overridden process() method.
So my question is:
What is the best way to handle reading in files with hundreds of columns in Spring Batch? Or what would be the best resource to start reading to help point me in the right direction?
Thanks!
I had the same problem. I used
http://opencsv.sourceforge.net/apidocs/com/opencsv/CSVReader.html
for reading CSVs.
I suggest you use a Map instead of 600 Java fields.
Besides, 600x600 Java Strings is not a big deal for Java, nor for Mongo.
To work with Mongo, use http://jongo.org/
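Here is a rough sketch of the Map-based reading (the file name and the class are made up for illustration, not taken from the question): each row is keyed by the header names, so no 600-field domain class is needed, and the resulting Map can be stored as one Mongo document.
import com.opencsv.CSVReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class WideCsvReader {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("wide-file.csv"))) {
            String[] header = reader.readNext();            // 600+ column names
            String[] row;
            while ((row = reader.readNext()) != null) {
                Map<String, String> record = new LinkedHashMap<>();
                for (int i = 0; i < header.length && i < row.length; i++) {
                    record.put(header[i], row[i]);
                }
                // each record can now be saved to Mongo as a single document,
                // e.g. via Jongo or MongoTemplate
            }
        }
    }
}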
If you really need batch processing of the data, your flow for batches should be:
Loop here: divide into batches (say 300 per loop).
Read 300x300 Java objects (or a Map) from the file into memory.
Sanitize or process them if needed.
Store them in MongoDB.
Return at EOF.
I ended up just reading in each line as a String object. Then, in the processor, I loop over the String with a delimiter, create my Mongo repository objects, and store them. So I am basically doing all of the writing inside the processor method, which I would say is definitely not best practice, but it gives me the desired end result.
Summary
My question is about how Apache Spark Streaming can handle an output operation that takes a long time by either improving parallelization or by combining many writes into a single, larger write. In this case, the write is a cypher request to Neo4J, but it could apply to other data storage.
Environment
I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:
Java 8
Apache Spark 2.11
Neo4J 3.1.1
Neo4J Java Bolt Driver 1.1.2
The Elasticsearch output was easy enough as I used the Elasticsearch-Hadoop for Apache Spark library.
Our Stream
Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<[OurMessage]> dataStream. I then do transforms on this message to create a cypher query String cypherRequest (using an OurMessage to String transformation) that is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The cypher query produces a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
dataStream.foreachRDD( rdd -> {
rdd.foreach( cypherQuery -> {
BoltDriverSingleton.getInstance().update(cypherQuery);
});
});
Possibilities for Optimization
I have two thoughts about how to improve throughput:
I am not sure whether Spark Streaming parallelization goes down to the RDD element level. Meaning, the output of RDDs can be parallelized (within `stream.foreachRDD()`), but can each element of the RDD be parallelized (within `rdd.foreach()`)? If the latter were the case, would a `reduce` transformation on our `dataStream` increase Spark's ability to output this data in parallel (each JavaRDD would contain exactly one cypher query)?
Even with improved parallelization, our performance would further increase if I could implement some sort of builder that takes every element of the RDD and creates a single cypher query that adds the nodes/edges from all elements, instead of one cypher query per element. But how would I be able to do this without using another Kafka instance, which may be overkill?
Am I over thinking this? I've tried to research so much that I might be in too deep.
Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 with lambdas. As Spark users must know by now, either Spark has a steep learning curve due to its very different paradigm, or I'm an idiot :).
Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.
I think all we need is a simple map/reduce. The following should allow us to parse each message in the RDD and then write them all to the graph DB at once.
dataStream.map( message -> {
return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD( rdd -> {
List<ParseResult> parseResults = rdd.collect();
String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
Neo4JRepository.update(cypherQuery);
// commit offsets
});
By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.
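As a hedged variation on the same idea (reusing the hypothetical Neo4JMessageParser and Neo4JRepository helpers above), the batching could also be done per partition instead of collecting the whole RDD to the driver, so the writes stay distributed across executors while still being combined into larger queries:
// needs java.util.ArrayList and java.util.List
dataStream.map(message -> (ParseResult) Neo4JMessageParser.parse(message))
    .foreachRDD(rdd -> {
        rdd.foreachPartition(partitionIter -> {
            List<ParseResult> batch = new ArrayList<>();
            partitionIter.forEachRemaining(batch::add);
            if (!batch.isEmpty()) {
                // one combined query per partition
                String cypherQuery = Neo4JMessageParser.buildQuery(batch);
                Neo4JRepository.update(cypherQuery);
            }
        });
    });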
I use Google Cloud Dataflow to process bounded data and output to BigQuery, and I want it to process and write results as they become ready (like a stream, not a batch). Is there any way to do this?
Currently, Dataflow waits until the workers have processed all the data and then writes to BigQuery. I tried adding a FixedWindow and using the log timestamp parameter as the window_timestamp, but it doesn't work.
I want to know:
Is windowing the right way to handle this problem?
Does BigQueryIO really write in batch, or does it just not show on my dashboard (does it stream writes in the background)?
Is there any way to do what I need?
My source code is here: http://pastie.org/10907947
Thank you very much!
You need to set the streaming property to true in your PipelineOptions.
See "streaming execution" for more information.
In addition, you'll need to be using sources/sinks that can generate/consume unbounded data. BigQuery can already write in both modes, but currently TextIO only reads bounded data. But it's definitely possible to write a custom unbounded source that scans a directory for new files.
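For reference, enabling the flag looks roughly like this. This is only a sketch, and the exact package and option class names depend on the SDK version in use (the pre-Beam Dataflow SDK is assumed here):
// com.google.cloud.dataflow.sdk.options.* in the pre-Beam Dataflow SDK
DataflowPipelineOptions options = PipelineOptionsFactory
    .fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setStreaming(true);   // run the pipeline as a streaming job

Pipeline p = Pipeline.create(options);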
What is the fastest way to export all the row keys from a column family in Cassandra (0.7.x and later versions) with Java APIs or other tools?
Currently I am using the Java Pelops API and paging through all records, but I'm wondering if there is a better mechanism.
I am specifically interested in exporting only the row keys (no columns/subcolumns), so I'm wondering if there is a section of the Cassandra direct storage APIs that could be used to do this as quickly as possible (bypassing Thrift).
What about using the Java Hector client? Sample taken from
https://github.com/rantav/hector/wiki/User-Guide
RangeSlicesQuery<String, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, stringSerializer,
stringSerializer, stringSerializer);
rangeSlicesQuery.setColumnFamily("Standard1");
rangeSlicesQuery.setKeys("fake_key_", "");
rangeSlicesQuery.setReturnKeysOnly(); // use this
rangeSlicesQuery.setRowCount(5);
Result<OrderedRows<String, String, String>> result = rangeSlicesQuery.execute();
Thrift is the API interface for Cassandra. Going directly to storage would require you to read the data files in binary form. The code above should give you good performance.
If you need this for a one-time export, then I would say it's OK. If you need this in production, you should reconsider your data model - you may be doing something wrong.
You may need to split the query using multiple key ranges in case you need to scan many rows.
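For example, here is a rough paging sketch built on the query above (the loop and page size are illustrative, not part of the Hector sample): restart each range query at the last key returned until a short page signals the end of the column family.
String startKey = "";
int pageSize = 1000;
while (true) {
    rangeSlicesQuery.setKeys(startKey, "");
    rangeSlicesQuery.setReturnKeysOnly();
    rangeSlicesQuery.setRowCount(pageSize);
    OrderedRows<String, String, String> rows = rangeSlicesQuery.execute().get();
    for (Row<String, String, String> row : rows) {
        System.out.println(row.getKey());       // export the row key
    }
    if (rows.getCount() < pageSize) {
        break;                                  // last page reached
    }
    startKey = rows.peekLast().getKey();        // next page starts at the last key
    // the first row of the next page repeats this key, so skip the duplicate if needed
}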