Running existing production Java applications in Spark

Running existing production Java applications in Spark - java

I've been reading up on Spark and am very interested in the ability to allocate computation across scalable compute clusters. We have production stream processing code (5K lines written in Java 9) which handles AMQP message processing, that we would like to run in a Spark cluster.
However, I feel like I must misunderstand the basic premise of Spark. On the one hand, it runs Java and we should be able to run our applications with it, but on the other hand it seems (from the documentation) that all code must be rewritten to the Spark API (using Dataframes/Datasets). Is this true? Can Java applications be used as-is with Spark, or must they be rewritten? This seems like a major limitation or rather a showstopper for us.
I think, ideally, we would want to use Spark to handle high level message routing (using the Structured Streaming API), which would hand off the message to our Java application to handle computation, database writes etc. The core part of our code is single class interface and Spark could map the message to that class instance. Hence, there would likely be many, many instances processing messages in parallel both within each machine instance and distributed across the cluster.
Am I missing something here?

for your question Can Java applications be used as-is with Spark, or must they be rewritten?
Yes, you have to rewrite the data interaction layer.
spark reads the source data in the form of rdd/dataframe, in your case its streaming Dataframes/Datasets.
Spark parallel processing/job scheduling is based on these dataset/dataframe
Dataframes/dataset is equivalent to an Array which is storing data on multiple nodes.
so if you have a logic in java that iterate a list and writes to file
conn=openFile(..)
Array[value].foreach{
value-> {
updatedValue=/**your business logic on the value**/
conn.write(updatedValue)
}
}
in spark you have to deal with the dataframe
dataframe[value].map{ value->
updatedValue =/**your business logic on the value**/ <-- reuse your logic here
}.saveToFile(/**file path**/)
hope you can see the difference, you can reuse your business logic,
but spark has to handle the dataflow either read/write(recommended).

Related

Impact of using Thead.currentThread

Would there be any impact of using the below snippet at commonplace which gets the frequent application control,
Thread.currentThread().getStackTrace().HasElements
and
Thread.currentThread().getStackTrace()
.hasMatch(\elt1 -> elt1.FileName == "XXXXX.java")
As I need to identify the control comes here from the internal batch file (XXXX.java) to perform the subsequent custom processes. Will the above snippet makes any difference in the Clustered region?

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
Application processes the data in it's own pace, and as incoming data volumes grow, I'm starting to hit REST endpoint timeout.
Meaning, processing speed is less then network throughoutput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've came up so far, is downloading incoming data to a temporary file and reading (processing) said file simultaneously using OutputStream/InputStream.
Sort of buffering, using a file.
This brings it's own problems:
what if processing speed becomes faster then downloading speed for
some time and I get EOF?
file parser operates with
ObjectInputStream and it behaves weird in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we can not match our app performance with network throughoutput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
Shortly: we can download all the data in a given time before timeout occurs, but can not process it.
Having an adapter between inputstream and outpustream, to pefrorm as a blocking queue, will help a ton.

You're using something like new ObjectInputStream(new FileInputStream(..._) and the solution for EOF could be wrapping the FileInputStream first in an WriterAwareStream which would block when hitting EOF as long a the writer is writing.
Anyway, in case latency don't matter much, I would not bother start processing before the download finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).

Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.

This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.

not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a commnd-line util such as aria2c curl wget et. al, an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?

You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/

Camel could maybe help you the regulate the network load between the REST producer and producer ?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.

If you have enough memory, Maybe you can use in-memory data store like Redis.
When you get data from your Rest endpoint you can save your data into Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.

A single output request from multiple elements of a JavaRDD in Apache Spark Streaming

Summary
My question is about how Apache Spark Streaming can handle an output operation that takes a long time by either improving parallelization or by combining many writes into a single, larger write. In this case, the write is a cypher request to Neo4J, but it could apply to other data storage.
Environment
I have an Apache Spark Streaming application in Java that writes to 2 datastores: Elasticsearch and Neo4j. Here are the versions:
Java 8
Apache Spark 2.11
Neo4J 3.1.1
Neo4J Java Bolt Driver 1.1.2
The Elasticsearch output was easy enough as I used the Elasticsearch-Hadoop for Apache Spark library.
Our Stream
Our input is a stream from Kafka received on a particular topic, and I deserialize the elements of the stream through a map function to create a JavaDStream<[OurMessage]> dataStream. I then do transforms on this message to create a cypher query String cypherRequest (using an OurMessage to String transformation) that is sent to a singleton that manages the Bolt Driver connection to Neo4j (I know I should use a connection pool, but maybe that's another question). The cypher query produces a number of nodes and/or edges based on the contents of OurMessage.
The code looks something like the following.
dataStream.foreachRDD( rdd -> {
rdd.foreach( cypherQuery -> {
BoltDriverSingleton.getInstance().update(cypherQuery);
});
});
Possibilities for Optimization
I have two thoughts about how to improve throughput:
I am not sure if Spark Streaming parallelization goes down to the RDD element level. Meaning, the output of RDDs can be parallelized (within `stream.foreachRDD()`, but can each element of the RDD be parallelized (within `rdd.foreach()`). If the latter were the case, would a `reduce` transformation on our `dataStream` increase the ability for Spark to output this data in parallel (each JavaRDD would contain exactly one cypher query)?
Even with improved parallelization, our performance would further increase if I could implement some sort of Builder that takes each element of the RDD to create a single cypher query that adds the nodes/edges from all elements, instead of one cypher query for each RDD. But, how would I be able to do this without using another kafka instance, which may be overkill?
Am I over thinking this? I've tried to research so much that I might be in too deep.
Aside: I apologize in advance if any of this is completely wrong. You don't know what you don't know, and I've just started working with Apache Spark and Java 8 w/ lambdas. As Spark users must know by now, either Spark has a steep learning curve due to it's very different paradigm, or I'm an idiot :).
Thanks to anyone who might be able to help; this is my first StackOverflow question in a long time, so please leave feedback and I will be responsive and correct this question as needed.

I think all we need is a simple Map/Reduce. The following should allow us to parse each message in the RDD and then write it to the Graph DB all at once.
dataStream.map( message -> {
return (ParseResult) Neo4JMessageParser.parse(message);
}).foreachRDD( rdd -> {
List<ParseResult> parseResults = rdd.collect();
String cypherQuery = Neo4JMessageParser.buildQuery(parseResults);
Neo4JRepository.update(cypherQuery);
// commit offsets
});
By doing this, we should be able to reduce the overhead associated with having to do a write for each incoming message.

Dynamically connecting a Kafka input stream to multiple output streams

Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream into multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream, then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?

If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streaming API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement your dynamic "routing" by yourself (for example using KStream#foreach() or KStream#process()). Note, that you need to do synchronous writes to avoid data loss (which are not very performant unfortunately). There are plans to extend Streaming API with dynamic topic routing, but there is no concrete timeline for this feature right now.
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation" you can also use Admin Client (available since v0.10.1) to create topics with correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations

Message processig - should I use JMS?

I'm developing a 'WS oriented' application basing on Spring/CXF/Oracle DB. Now, I stuck with some architectural consideration about right approach to organize message processing (already stored in db).
Briefly, process looks as follows:
(A) Get the message from client -> Validate -> Store -> Send reposponse
(B) Process -> Update data
I consider two general approaches for part B of the process:
1) Use JMS queue
Just after validation and storing incoming message details in DB publish a message to the JSM queue. On the other side define cosumer which will retrieve the message and do the processing
2) Fetch data to be processed
Manually fetch data from with db and process it.
Additional facts:
The processing won't be compute-intensive, so for new I dont think that work distribution will be needed (all in single JVM).
All data in single db schema
So, I'm interested what are key factors to choose JMS in such case?

JMS would be a better approach. In positive scenario, approach #2 works as well. But JMS would provide you some in-built capability, specially for failed case. Though internally JMS would be using a DB-based persistent storage; it would provide a better interface to communicate that data.
For example, you could configure an error queue to track all the messages, whose processing failed.
It would also provide you scalable architecture, where some other app (in future) could starts consuming your message and process.
Reliable: Due to asynchronous messaging, all the pieces don’t need to be up for the application to function as a whole.
Flexible : Think of scenario, in which you might want to process certain type of data before all other (prioritization). JMS would provide more better approach than tweaking logic in a program.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.