I am trying to implement a solution that writes JSON messages from Pub/Sub into GCS using Dataflow. My question is essentially the same as this one.
I need to write the files based on either windowing or element count.
Here is the code sample for the writes from the above question:
windowedValues.apply(FileIO.<String, String>writeDynamic()
    .by(Event::getKey)
    .via(TextIO.sink())
    .to("gs://data_pipeline_events_test/events/")
    .withDestinationCoder(StringUtf8Coder.of())
    .withNumShards(1)
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));
The suggested solution uses FileIO.writeDynamic(), but I am not able to understand what .by(Event::getKey) does and where it comes from.
Any help on this is greatly appreciated.
It partitions the elements into groups according to each event's key.
From my understanding, the events come from a PCollection of KV elements, since KV exposes a getKey method.
Note that :: is an operator introduced in Java 8 that is used to refer to a method of a class (a method reference).
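To make both points concrete, here is a minimal, self-contained sketch. The Event class below is hypothetical (it stands in for the one from the linked question); the point is that Event::getKey is just shorthand for a lambda, and the function handed to .by(...) simply maps each element to the destination group it should be written to.
import java.util.function.Function;

// Hypothetical Event class, modeled on the one in the linked question.
class Event {
    private final String key;
    private final String payload;

    Event(String key, String payload) {
        this.key = key;
        this.payload = payload;
    }

    String getKey() {
        return key;
    }
}

public class MethodReferenceDemo {
    public static void main(String[] args) {
        // The method reference Event::getKey is shorthand for the lambda below.
        // FileIO's .by(...) takes a function of the same shape (Beam's
        // SerializableFunction) and uses its result as the element's destination.
        Function<Event, String> byLambda = event -> event.getKey();
        Function<Event, String> byMethodRef = Event::getKey;

        Event e = new Event("2018-01-01", "{\"msg\":\"hello\"}");
        System.out.println(byLambda.apply(e));     // 2018-01-01
        System.out.println(byMethodRef.apply(e));  // 2018-01-01
    }
}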
I'm trying to figure out how I can pass multiple side inputs to my DoFn and reference them separately inside ProcessContext.
I wasn't able to find anything on this in the Beam documentation and wanted to get some idea of how I can achieve this in Java.
Though the example of using side inputs only has a single side input, the same pattern holds for multiple side inputs.
Specifically, the withSideInputs method of ParDo takes any number of PCollectionViews, each of which can be used as its own key in ProcessContext.sideInput.
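A hedged sketch of that pattern in Java (all names here, such as enrich, thresholdView, and lookupView, are illustrative, not from any particular codebase): create one PCollectionView per side input, register all of them with withSideInputs, and fetch each one inside the DoFn via its own view reference.
import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class MultipleSideInputs {
    static PCollection<String> enrich(PCollection<String> mainInput,
                                      PCollection<Long> thresholdInput,
                                      PCollection<KV<String, String>> lookupInput) {
        // One PCollectionView per side input.
        final PCollectionView<Long> thresholdView =
            thresholdInput.apply(View.asSingleton());
        final PCollectionView<Map<String, String>> lookupView =
            lookupInput.apply(View.asMap());

        return mainInput.apply(
            ParDo.of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    // Each view is fetched independently by its own reference.
                    Long threshold = c.sideInput(thresholdView);
                    Map<String, String> lookup = c.sideInput(lookupView);
                    c.output(c.element() + "|"
                        + lookup.getOrDefault(c.element(), "") + "|" + threshold);
                }
            }).withSideInputs(thresholdView, lookupView)); // any number of views
    }
}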
I figured out that "qryfldexe" can query across caches with multiple joins, but I couldn't figure out whether "qryexe" can do the same.
The reason I want this (question 1) is that "qryexe" returns items in key-value form, whereas "qryfldexe" returns each item as an array that only contains values. Is there a Java library that can load the kind of JSON returned by "qryfldexe" into a JSON object, based on the fields metadata at the end (or beginning) of the JSON payload?
Many thanks
No, qryexe only works on a single cache (same as its Java counterpart, SqlQuery).
There is no Java client for the Ignite REST API as far as I know, but there is also a more efficient Thin Client Protocol and a work-in-progress Java client for it (see this mailing list thread for a description and some links).
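On the second part (loading "qryfldexe"-style output into keyed JSON objects): any general-purpose JSON library can do it by zipping the fields metadata with each value row. A minimal sketch with Jackson, assuming a payload with an "items" array of value rows and a "fieldsMetadata" array of column names; the exact property names in the real REST response may differ, so check the actual payload.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class FieldsQueryJsonZipper {
    public static void main(String[] args) throws Exception {
        // Assumed layout; verify the property names against the real response.
        String payload = "{\"items\":[[1,\"Ann\"],[2,\"Bob\"]],"
            + "\"fieldsMetadata\":[\"ID\",\"NAME\"]}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(payload);
        JsonNode fields = root.get("fieldsMetadata");

        // Zip each value row with the column names to build keyed objects.
        ArrayNode keyed = mapper.createArrayNode();
        for (JsonNode row : root.get("items")) {
            ObjectNode obj = mapper.createObjectNode();
            for (int i = 0; i < fields.size(); i++) {
                obj.set(fields.get(i).asText(), row.get(i));
            }
            keyed.add(obj);
        }
        System.out.println(keyed); // [{"ID":1,"NAME":"Ann"},{"ID":2,"NAME":"Bob"}]
    }
}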
Summary
My question is about how Apache Spark Streaming can handle an output operation that takes a long time, either by improving parallelization or by combining many writes into a single, larger write. In this case the write is a Cypher request to Neo4j, but the same could apply to other data stores.
Environment
I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:
Java 8
Apache Spark 2.11
Neo4J 3.1.1
Neo4J Java Bolt Driver 1.1.2
The Elasticsearch output was easy enough, as I used the Elasticsearch-Hadoop library for Apache Spark.
Our Stream
Our input is a stream from Kafka received on a particular topic. I deserialize the elements of the stream through a map function to create a JavaDStream<OurMessage> dataStream. I then transform each message (using an OurMessage-to-String transformation) into a Cypher query String cypherRequest, which is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The Cypher query produces a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
dataStream.foreachRDD( rdd -> {
    rdd.foreach( cypherQuery -> {
        // One Bolt round trip per element of the RDD.
        BoltDriverSingleton.getInstance().update(cypherQuery);
    });
});
Possibilities for Optimization
I have two thoughts about how to improve throughput:
1. I am not sure whether Spark Streaming parallelization goes down to the RDD element level. That is, the output of RDDs can be parallelized (within `stream.foreachRDD()`), but can each element of the RDD be parallelized (within `rdd.foreach()`)? If the latter were the case, would a `reduce` transformation on our `dataStream` increase Spark's ability to output this data in parallel (each JavaRDD would contain exactly one Cypher query)?
2. Even with improved parallelization, our performance would further improve if I could implement some sort of builder that takes each element of the RDD to create a single Cypher query that adds the nodes/edges from all elements, instead of one Cypher query for each element. But how would I be able to do this without using another Kafka instance, which may be overkill?
Am I overthinking this? I've tried to research so much that I might be in too deep.
Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 with lambdas. As Spark users must know by now, either Spark has a steep learning curve due to its very different paradigm, or I'm an idiot :).
Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.
I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write the whole batch to the graph DB at once.
dataStream
    .map(message -> (ParseResult) Neo4JMessageParser.parse(message))
    .foreachRDD(rdd -> {
        // collect() pulls the whole micro-batch onto the driver so that a
        // single Cypher statement can cover every message in the batch.
        List<ParseResult> parseResults = rdd.collect();
        String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
        Neo4JRepository.update(cypherQuery);
        // commit offsets
    });
By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.
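If collecting the whole micro-batch on the driver ever becomes a bottleneck, a middle ground is to batch per partition instead, so each executor writes its own partition in parallel. This is only a sketch, not what the code above does: it reuses the hypothetical names from that snippet and assumes the Bolt connection can be obtained on the executors rather than on the driver.
dataStream
    .map(message -> (ParseResult) Neo4JMessageParser.parse(message))
    .foreachRDD(rdd ->
        rdd.foreachPartition(partition -> {
            // Runs on the executors: one batched Cypher statement per partition.
            java.util.List<ParseResult> batch = new java.util.ArrayList<>();
            partition.forEachRemaining(batch::add);
            if (!batch.isEmpty()) {
                Neo4JRepository.update(Neo4JMessageParser.buildQuery(batch));
            }
        }));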
I would appreciate help from anyone familiar with how DynamoDB works.
I need to perform a scan on a large DynamoDB table. I know that the DynamoDBClient scan operation is limited to 1 MB of returned data. Does the same restriction apply to the Table.scan operation? The thing is that Table.scan returns an ItemCollection<ScanOutcome>, while DynamoDBClient scan returns a ScanResult, and it is not clear to me whether these operations work in a similar way.
I have checked this example: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ScanJavaDocumentAPI.html, but it doesn't contain any hints about using the last evaluated key.
My questions are:
Do I still need to make scan calls in a loop until the last evaluated key is null if I use Table.scan? If yes, how do I get the last key? If not, how can I control pagination?
Any links to code examples would be appreciated. I have spent some time googling for examples, but most of them are either using DynamoDBClient or DynamoDBMapper, while I need to use Table and Index objects instead.
Thanks!
If you iterate over the output of Table.scan(), the SDK will do pagination for you.
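A minimal sketch with the Document API (the table name is a placeholder): iterating the returned ItemCollection lazily fetches further 1 MB pages behind the scenes, so you don't have to track the last evaluated key yourself.
import java.util.Iterator;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.ScanOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.ScanSpec;

public class TableScanExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        DynamoDB dynamoDB = new DynamoDB(client);
        Table table = dynamoDB.getTable("MyLargeTable"); // placeholder table name

        ItemCollection<ScanOutcome> items = table.scan(new ScanSpec());

        // The iterator transparently issues follow-up scan requests as each
        // 1 MB page is exhausted; no manual last-evaluated-key handling needed.
        Iterator<Item> iterator = items.iterator();
        while (iterator.hasNext()) {
            System.out.println(iterator.next().toJSONPretty());
        }
    }
}
If you do want to observe the page boundaries yourself (for example, to checkpoint progress), the same collection should also let you iterate page by page via its pages() view.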
Hi, basically we want to use Kafka + Spark Streaming to catch Twitter spam for our thesis, and I want to use StreamingKMeans. But I have a very newbie (and serious) question:
In this spark StreamingKmeans scala example (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala) there is one line of code for prediction:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
Why do I need to pass the label along with the features? Am I getting the whole idea wrong? Don't we want to predict the label? How am I going to predict whether my tweets are spam or not?
For the prediction only lp.features is used; lp.label is just treated as a key that is carried along. Quoting from the docs:
Use the model to make predictions on the values of a DStream and carry over its keys.
I guess that in your example you would simply want to replace predictOnValues with predictOn.
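For completeness, a hedged Java sketch (the feature extraction from tweets into Vectors is assumed to exist elsewhere, and the k value, dimension, and class/method names are placeholders) showing that prediction on unlabeled data needs only the feature vectors:
import org.apache.spark.mllib.clustering.StreamingKMeans;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.streaming.api.java.JavaDStream;

public class TweetClustering {
    public static void attach(JavaDStream<Vector> trainingFeatures,
                              JavaDStream<Vector> tweetFeatures) {
        StreamingKMeans model = new StreamingKMeans()
            .setK(2)                        // e.g. a "spam-like" and a "ham-like" cluster
            .setDecayFactor(1.0)
            .setRandomCenters(10, 0.0, 0L); // dimension must match the feature vectors

        // Continuously update the cluster centers from the training stream.
        model.trainOn(trainingFeatures);

        // Prediction needs only feature vectors (no label). Each vector is
        // mapped to the index of its nearest cluster center.
        JavaDStream<Integer> clusterIds = model.predictOn(tweetFeatures);
        clusterIds.print();
    }
}
Note that k-means only gives you cluster indices, not a spam/ham label; you still have to decide which cluster corresponds to spam, or use a supervised model if you have labeled tweets.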