How to make sure that the CSV write is complete? - java

I'm writing a dataset to CSV as follows:
df.coalesce(1)
  .write()
  .format("csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite)
  .save(sink);

sparkSession.streams().awaitAnyTermination();
How do I make sure that the output is written completely when the streaming job gets terminated?
I have the problem that the sink folder gets overwritten and ends up empty if I terminate too early or too late.
Additional info: in particular, if the topic has no messages, my Spark job keeps running and overwrites the result with an empty file.

How do I make sure that the output is written completely when the streaming job gets terminated?
The way Spark Structured Streaming works is that the streaming query (job) runs continuously, so whether "the output is done properly" depends entirely on how the query is terminated.
The question I'd ask is how the streaming query got terminated. Was it StreamingQuery.stop, or perhaps Ctrl-C / kill -9?
If a streaming query is terminated forcefully (Ctrl-C / kill -9), well, you get what you asked for: a partial execution with no way to be sure the output is correct, since the process running the streaming query was shut down forcefully.
With StreamingQuery.stop, the streaming query terminates gracefully and writes out everything it has processed up to that point.
I have the problem that the sink folder gets overwritten and ends up empty if I terminate too early/late.
If you terminate too early or too late, what else would you expect? The streaming query simply could not finish its work. Stop it gracefully and you will get the expected output.
Additional info: in particular, if the topic has no messages, my Spark job keeps running and overwrites the result with an empty file.
That's an interesting observation which requires further exploration.
If there are no messages to be processed, no batch would be triggered, hence no jobs and no tasks would run, so nothing should "overwrite the result with an empty file".
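For illustration, a minimal sketch of a graceful stop. It assumes df is actually a streaming Dataset written with writeStream rather than the batch writer shown in the question; the checkpoint path and the timeout value are made up:

StreamingQuery query = df.writeStream()
    .format("csv")
    .option("header", "true")
    .option("path", sink)
    .option("checkpointLocation", "/tmp/checkpoint")   // required for a streaming file sink
    .start();

query.awaitTermination(60_000);   // block for up to 60 seconds (or until the query stops)
query.stop();                     // graceful stop, as described above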

Firstly, I see that you have not used writeStream, so I am not quite sure how your job is a streaming job.
Now, answering your first question: you can use a StreamingQueryListener to monitor the streaming query's progress. Have another streaming query read from the output location and monitor it as well. Once the files are in the output location, use the query name and the input record count from the StreamingQueryListener to gracefully stop any query. awaitAnyTermination should then stop your Spark application. The following code can be of help.
var recordsReadCount = 0L   // running count of input rows across micro-batches

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // logger message to show that the query has started
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      if (event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount = recordsReadCount + event.progress.numInputRows
        // logger messages to show continuous progress
      }
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // logger message to show the reason for termination
    }
  }
})
Answering your 2nd question: I, too, do not think that this is possible, as mentioned in the answer by Jacek.

Related

Java/Quarkus Kafka Streams Reading/Writing to Same Topic based on a condition

Hello, I have this issue that I'm trying to solve. Basically I have a Kafka Streams topology that reads JSON messages from a Kafka topic, and each message gets deserialized into a POJO. Ideally, it then checks that message for a certain boolean flag. If the flag is true, it does some transformation and writes the message back to the topic. However, if the flag is false, I'm trying to have it not write anything, but I'm not sure how to go about that. With MP Reactive Messaging I can just use an RxJava 2 Flowable stream and return something like Flowable.empty(), but it seems I can't use that approach here.
JsonbSerde<FinancialMessage> financialMessageSerde = new JsonbSerde<>(FinancialMessage.class);

StreamsBuilder builder = new StreamsBuilder();
builder.stream(
        TOPIC_NAME,
        Consumed.with(Serdes.Integer(), financialMessageSerde)
    )
    .mapValues(
        message -> checkCondition(message)
    )
    .to(
        TOPIC_NAME,
        Produced.with(Serdes.Integer(), financialMessageSerde)
    );
The function being called looks like this:
public FinancialMessage checkCondition(FinancialMessage rawMessage) {
    FinancialMessage receivedMessage = rawMessage;
    if (receivedMessage.compliance_services) {
        receivedMessage.compliance_services = false;
        return receivedMessage;
    }
    else {
        return null;
    }
}
If the boolean is false, it just returns a JSON body of "null".
I've tried wrapping the return type of the checkCondition function like this:
public Flowable<FinancialMessage> checkCondition(FinancialMessage rawMessage)
and then returning Flowable.just(receivedMessage) or Flowable.empty() from the if/else, but I can't seem to serialize the Flowable object. This might be a silly question, but is there a better way to go about this?
Note that Kafka messages are immutable and are not deleted after being read. If you read from and write to the same topic with a single application, a message would be processed infinitely often (or, to be more precise, different copies of it would) unless you have a condition to "break" the cycle.
Also, if for example 5 services read from the same topic, all 5 services get a copy of every event. And if one service writes back, the other 4 services and the writing service itself will read the message again. Thus, you get quite some data amplification.
If you want different services to react to the original input message consecutively, you could instead have one topic between each pair of consecutive services to really build a pipeline.
Last, you say that if the boolean flag is true you want to transform the message and emit it (I assume for the next service to consume), and for false you want to do nothing. I further assume that for a message only a single flag will be true, and that a successful transformation also switches the flag (to enable processing by the next service). For this case, it's best if you can ensure that each original input message has the same initial boolean flag set, so that you can build your pipeline. Then only the corresponding service will read messages with its boolean flag set (you don't even need to check the flag, as the upstream write ensures it is set; you could keep just a sanity check).
If you don't know which boolean flag is set initially and all services read from the same input topic, just filtering out the message is correct. If all services read all messages, 4 services will filter the message out while one service will process it and emit a new message with a different flag. For this architecture, a single topic might work: if a message has been processed by all services and all boolean flags are false, and you write it back to the input topic, all services will drop the last copy correctly. However, using a single topic implies a lot of redundant reading and writing.
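For the filtering approach, a minimal sketch reusing the names from the question (OUTPUT_TOPIC is hypothetical; writing back to the input topic has the amplification problems described above):

builder.stream(TOPIC_NAME, Consumed.with(Serdes.Integer(), financialMessageSerde))
    // Drop records whose flag is false instead of mapping them to null
    .filter((key, message) -> message.compliance_services)
    .mapValues(message -> {
        message.compliance_services = false;   // the transformation from checkCondition
        return message;
    })
    .to(OUTPUT_TOPIC, Produced.with(Serdes.Integer(), financialMessageSerde));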
Maybe the best architecture is to have your original input topic plus one additional input topic per service. You also add a "dispatcher" service that reads from the original input topic and uses branch() to split the KStream into the service input topics according to the boolean flag. This way, each service reads only messages with its flag set to true. Furthermore, after the message transformation each service also uses branch() to write the message to the input topic of the correct next service. Last, you would want an output topic that each service can write to once a message is fully processed.

Project Reactor Kafka: Perform action at the end of Flux without blocking

I am working on an application that uses the project-reactor Kafka APIs to connect reactively to Kafka brokers. The use case is that there is an input topic that contains paths to files for processing. The application reads each file, processes it, creates a flux of the processed messages, and pushes it to the output topic. The requirement is that a file must be deleted only once it has been processed and its messages have been pushed to the output topic. So the delete action must run after each file has been processed and its flux of messages pushed to the output topic.
public Flux<?> flux() {
    return KafkaReceiver
        .create(receiverOptions(Collections.singleton(sourceTopic)))
        .receive()
        .flatMap(m -> transform(m.value()).map(x -> SenderRecord.create(x, m.receiverOffset())))
        .as(sender::send)
        .doOnNext(m -> {
            m.correlationMetadata().acknowledge();
            deleteFile(path);
        })
        .doOnCancel(() -> close());
}
* The transform() method initiates the processing of the file at the path (m.value()) and returns a flux of messages.
The problem is that the file is deleted before all the messages are pushed to the output topic. Therefore, in case of a failure, the original file is no longer available for a retry.
Since it seems the path variable is accessible in the whole pipeline (method input parameter?), you could delete the file within a separate doFinally. You would need to filter for onComplete or cancel SignalType, because you don't want to delete the file in case of a failure.
Another option would be doOnComplete if you're not interested in deleting the file upon cancellation.
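A minimal sketch of the doFinally variant, reusing the names from the question (path is assumed to be reachable here, e.g. as a method parameter or field):

return KafkaReceiver
        .create(receiverOptions(Collections.singleton(sourceTopic)))
        .receive()
        .flatMap(m -> transform(m.value()).map(x -> SenderRecord.create(x, m.receiverOffset())))
        .as(sender::send)
        .doOnNext(m -> m.correlationMetadata().acknowledge())
        .doFinally(signal -> {
            // Delete only on successful completion or cancellation, never on error,
            // so the original file is still available for a retry.
            if (signal == SignalType.ON_COMPLETE || signal == SignalType.CANCEL) {
                deleteFile(path);
            }
        })
        .doOnCancel(() -> close());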

Spring Integration File Poller with Watch Service and LastModifiedFileListFilter

I was wondering if anyone would know if I could use the watch service in a FileInboundChannelAdapter along with a LastModifiedFileListFilter?
The sample code below is giving me fairly inconsistent results. Sometimes the file just sits in the folder and remains unprocessed.
I suspect that the watch service might be incompatible with the LastModifiedFileListFilter. For example:
Suppose the LastModifiedFileListFilter is set to look for files at least 5 seconds old, and the poller is set to poll every 10 seconds.
At the 9th second, a file is created in the watched folder.
At 10 seconds, the poller queries the watch service for what changed in the past 10 seconds and finds the newly created file.
The newly created file is only about a second old, so the filter does not let it through.
At 20 seconds, the poller queries the watch service a second time; this time it does not see the unprocessed file, as it was created more than 10 seconds ago.
Would anyone else have any experience with this? Would there be a recommended way to get around this issue and allow me to verify that the file has been fully written before proceeding?
@Bean
public IntegrationFlow ftpInputFileWatcher() {
    return IntegrationFlows.from(ftpInboundFolder(), filePoller())
            .handle()
            /*abbreviated*/
            .get();
}

private FileInboundChannelAdapterSpec ftpInboundFolder() {
    LastModifiedFileListFilter lastModifiedFileListFilter = new LastModifiedFileListFilter();
    lastModifiedFileListFilter.setAge(5);
    return Files.inboundAdapter(inboundFolder)
            .preventDuplicates(false)
            .useWatchService(true)
            .filter(fileAgeFilterToPreventPrematurePickup());
}

protected Consumer<SourcePollingChannelAdapterSpec> filePoller() {
    return poller -> poller.poller((Function<PollerFactory, PollerSpec>) p -> p.fixedRate(2000));
}
Thanks!
Yeah, that's a good catch!
Right, they are not compatible. The WatchService is event-based and stores files from the events in an internal queue. When the poller triggers its action, it polls files from that queue and applies its filters. Since the LastModifiedFileListFilter discards the file and there are no further events for it, we won't see that file again.
Please raise a JIRA on the matter and we'll think about how to address it.
Meanwhile, as a workaround, do not use the WatchService for this kind of logic.
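A sketch of that workaround, with names taken from the question and example values: keep the same adapter but drop useWatchService(true), so the directory is re-scanned on every poll and a file that was once "too young" gets re-evaluated until the LastModifiedFileListFilter lets it through:

private FileInboundChannelAdapterSpec ftpInboundFolder() {
    LastModifiedFileListFilter lastModifiedFileListFilter = new LastModifiedFileListFilter();
    lastModifiedFileListFilter.setAge(5);
    return Files.inboundAdapter(inboundFolder)
            .preventDuplicates(false)
            // no .useWatchService(true): plain polling re-lists the directory each time
            .filter(lastModifiedFileListFilter);
}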

504 Gateway Timeout while generating an Excel file

I'm trying to implement an Excel export for a certain amount of data. After 5 minutes I receive a 504 Gateway Timeout; in the backend, the process continues its work.
The whole export takes approximately 15 minutes to finish. Is there anything I can do to prevent this? I don't have access to the production servers.
The app is Spring Boot with an Oracle database, and I'm using Apache POI for the export.
One common way to handle these kinds of problems is to have the first request only start the process in the background. That request finishes immediately, and the user can then check another view to see whether the file has been generated and download the result from there.
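As a rough sketch of that pattern (all class, endpoint, and method names here are made up for illustration): one endpoint starts the export in the background and returns a job id at once, and a second endpoint lets the client poll for the finished file:

import java.io.File;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.Resource;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ExportController {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Map<String, File> finishedExports = new ConcurrentHashMap<>();

    // 1. Start the export in the background and return immediately,
    //    long before any gateway timeout can kick in.
    @PostMapping("/exports")
    public ResponseEntity<String> startExport() {
        String jobId = UUID.randomUUID().toString();
        executor.submit(() -> finishedExports.put(jobId, generateExcelFile()));
        return ResponseEntity.accepted().body(jobId);
    }

    // 2. The client polls this endpoint and downloads the file once it is ready.
    @GetMapping("/exports/{jobId}")
    public ResponseEntity<Resource> getExport(@PathVariable String jobId) {
        File file = finishedExports.get(jobId);
        if (file == null) {
            return ResponseEntity.status(HttpStatus.ACCEPTED).build();   // still running
        }
        return ResponseEntity.ok(new FileSystemResource(file));
    }

    // Placeholder for the existing POI export logic.
    private File generateExcelFile() {
        return new File("export.xlsx");
    }
}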
You can also export the data in smaller chunks. Run a test with, say, 10K records, make a note of the id of the last record, and repeat the export starting at the next record. If 10K finishes quickly, try 50K. A timer might come in handy here. Good luck.
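A sketch of that chunking idea using keyset pagination over plain JDBC (table, column, and helper names are hypothetical; FETCH FIRST needs Oracle 12c or later):

void exportInChunks(DataSource dataSource) throws SQLException {
    String sql = "SELECT id, payload FROM export_data "
               + "WHERE id > ? ORDER BY id FETCH FIRST 10000 ROWS ONLY";
    long lastId = 0;
    try (Connection con = dataSource.getConnection();
         PreparedStatement ps = con.prepareStatement(sql)) {
        while (true) {
            ps.setLong(1, lastId);
            int rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastId = rs.getLong("id");   // remember the id of the last record
                    writeRowToSheet(rs);         // hypothetical: append this row to the POI sheet
                    rows++;
                }
            }
            if (rows < 10_000) {
                break;                           // last (partial) chunk reached
            }
        }
    }
}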
I had the same situation, where the timeout of the network calls wasn't in our hands. I guess you have a similar setup, where the gateway gives up if it has not received the first byte within 5 minutes, and once it has, the timeout no longer applies.
My solution was this: let's assume you have a controller and a query layer that talks to the database. In that case, make the process asynchronous. The call to the controller should just trigger the async execution and return a success status immediately, without waiting; the actual work happens in the background. Futures can be used here, as they are async, and you can handle the result once it completes by using callback methods.
You can implement this with futures and callbacks in Java (the snippet below uses Guava's Futures/FutureCallback) like this:
Futures.addCallback(
    exportData,
    new FutureCallback<String>() {
        public void onSuccess(String message) {
            System.out.println(message);
        }
        public void onFailure(Throwable thrown) {
            thrown.getCause();
        }
    },
    service);
and in Scala like:
val result = Future {
  exportData(data)
}

result.onComplete {
  case Success(message) => println(s"Got the callback result: $message")
  case Failure(e) => e.printStackTrace()
}

How does JMS work in Java?

How does async JMS work? I have the sample code below:
public class JmsAdapter implements MessageListener, ExceptionListener
{
    private ConnectionFactory connFactory = null;
    private Connection conn = null;
    private Session session = null;
    private MessageConsumer consumer = null;

    public void receiveMessages()
    {
        try
        {
            this.session = this.conn.createSession(true, Session.SESSION_TRANSACTED);
            this.conn.setExceptionListener(this);

            Destination destination = this.session.createQueue("SOME_QUEUE_NAME");

            this.consumer = this.session.createConsumer(destination);
            this.consumer.setMessageListener(this);

            this.conn.start();
        }
        catch (JMSException e)
        {
            //Handle JMS exceptions here
        }
    }

    @Override
    public void onMessage(Message message)
    {
        try
        {
            //Do message processing here

            //Message successfully processed... Go ahead and commit the transaction.
            this.session.commit();
        }
        catch (SomeApplicationException e)
        {
            //Message processing failed.
            //Do whatever you need to do here for the exception.

            //NOTE: You may need to check the redelivery count of this message first
            //and just commit it after it fails a predefined number of times (make sure you
            //store it somewhere if you don't want to lose it). This way your process isn't
            //handling the same failed message over and over again.
            this.session.rollback();
        }
    }

    @Override
    public void onException(JMSException e)
    {
        //Handle connection-level exceptions here (required by ExceptionListener)
    }
}
But I'm new to Java and JMS. I'll probably consume messages in the onMessage method, but I don't know exactly how it works.
Do I need to add a main method in the JmsAdapter class? After adding the main method, do I need to create a JAR and then run it as "java -jar abc.jar"?
Any help is much appreciated.
UPDATE: What I want to know is: if I add a main method, should I simply call receiveMessages() in main? And then, after running it, will the listener keep on running? And if there are messages, will they be retrieved automatically in the onMessage method?
Also, if the listener is continuously listening, doesn't it take CPU? In the case of threads, when we create a thread and put it to sleep, the CPU utilization is zero; how does it work in the case of a listener?
Note: I only have a Tomcat server and I will not be using any JMS server. I'm not sure whether the listener needs a specific JMS server such as JBoss, but in any case, please assume that I won't have anything except Tomcat.
Thanks!
You need to learn to walk before you start trying to run.
Read / do a tutorial on Java programming. This should explain (among other things) how to compile and run a Java program from the command line.
Read / do a tutorial on JMS.
Read the Oracle material on how to create an executable JAR file.
Figure out what it is you are trying to do ... and design your application.
Looking at what you've shown and told us:
You could add a main method to that class, but to make an executable JAR file, you've got to create your JAR file with a manifest entry that specifies the name of the class with the main method.
There's a lot more that you have to do before that code will work:
add code to (at least) log the exceptions that you are catching
add code to process the messages
add code to initialize the connection factory and connection objects
And like I said above, you probably need some kind of design ... so that you don't end up with everything in a "kitchen sink" class.
if I add a main method, should I simply call receiveMessages() in main?
That is one approach. But like I said, you really need to design your application.
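A minimal sketch of that approach (the connection setup still has to be added, as listed above, and the executable JAR needs a Main-Class manifest entry pointing at this class):

public static void main(String[] args) throws Exception {
    JmsAdapter adapter = new JmsAdapter();
    // ...initialize the ConnectionFactory / Connection here or inside the class...
    adapter.receiveMessages();

    // Keep the main thread alive; otherwise the JVM may exit while the JMS
    // provider's listener threads are still delivering messages in the background.
    Thread.currentThread().join();
}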
And then after running, will the listener keep on running?
It is not entirely clear. It should keep running as long as the main thread is alive, but it is not immediately obvious what happens when your main method returns. (It depends on whether the JMS threads are created as daemon threads, and that's not specified.)
And if there are messages, will they be retrieved automatically in the onMessage method?
It would appear that each message is retrieved (read from the socket) before your onMessage method is called.
Also, if the listener is continuously listening, doesn't it take CPU?
Not if it is implemented properly.
In the case of threads, when we create a thread and put it to sleep, the CPU utilization is zero; how does it work in the case of a listener?
At a certain level, a listener thread will make a system call that waits for data to arrive on a network socket. I don't know exactly how it is implemented, but it could be as simple as a read() call on the network socket's InputStream. No CPU is used by a thread while it waits in a blocking system call.
This link looks like a pretty good place with examples using Oracle AQ. There's an examples section that tells you how to set up the examples and run them. Hopefully this can help.
Link to Oracle Advanced Queueing
