Apache Camel ZipInputStream closed with parallel processing - java

I am successfully using ZipSplitter() to process files inside a zip file. I would like to use parallel processing if possible, but calling parallelProcessing() causes the stream to be closed prematurely, which results in an IOException when the stream is being cached by DefaultStreamCachingStrategy.
I note that when parallel processing is enabled, ZipIterator#checkNullAnswer(Message) is called, which closes the ZipInputStream. Curiously, everything is dandy if I loiter on this method in my debugger, which suggests that the iterator is being closed before processing has completed. Is this a bug or have I messed something up?
A simplified version of my route which exhibits this behaviour is:
from("file:myDirectory").
split(new ZipSplitter()).streaming().parallelProcessing().
log("Validating filename ${file:name}").
end();
This is using Camel 2.13.1.

Can you try applying the CAMEL-7415 patch to the Camel 2.13.1 branch?
I'm not quite sure whether it fixes your issue, but it is worth a shot.
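If patching is not feasible, a possible workaround (just a sketch; the seda endpoint name and consumer count are illustrative) is to keep the split itself single-threaded, copy each entry out of the shared ZipInputStream with convertBodyTo(byte[].class), and hand the copies to a seda endpoint for concurrent processing:
from("file:myDirectory")
    .split(new ZipSplitter()).streaming()
        // materialize the current entry so the shared ZipInputStream
        // is no longer needed once the exchange leaves the splitter
        .convertBodyTo(byte[].class)
        .to("seda:zipEntries")
    .end();

from("seda:zipEntries?concurrentConsumers=4")
    .log("Validating filename ${file:name}");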

Related

Java BigQuery Storage Write API

I am using the Java BigQuery Storage Write API as documented here: https://cloud.google.com/bigquery/docs/write-api.
I keep the write stream long-lived and refresh it when one of the non-retryable errors occurs, as described here: https://cloud.google.com/bigquery/docs/write-api#error_handling.
I am sticking with the default stream. I have two tables, and different parts of the code are responsible for writing to each table, each maintaining its own stream writer.
If data is flowing, everything is fine and there are no errors. However, I also want to test that refreshing the stream writers works, so I wait for the default stream timeout (10 minutes), which closes the stream, and then try writing again. I can create the new stream writer fine, no error there, but for one of the tables I keep getting a CANCELLED error wrapped in a FAILED_PRECONDITION, making my code refresh again and again.
Original error, caused by the stream being closed due to inactivity:
! io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Stream is closed due to com.google.api.gax.rpc.AbortedException: io.grpc.StatusRuntimeException: ABORTED: Closing the stream because it has been inactive for 600 seconds. Entity: projects/<id>/datasets/<id>/tables/<id>/_default
! at com.google.cloud.bigquery.storage.v1beta2.StreamWriterV2.appendInternal(StreamWriterV2.java:263)
! at com.google.cloud.bigquery.storage.v1beta2.StreamWriterV2.append(StreamWriterV2.java:234)
! at com.google.cloud.bigquery.storage.v1beta2.JsonStreamWriter.append(JsonStreamWriter.java:114)
! at com.google.cloud.bigquery.storage.v1beta2.JsonStreamWriter.append(JsonStreamWriter.java:89)
Further repeating errors on the new stream(s):
! io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Stream is closed due to com.google.api.gax.rpc.CancelledException: io.grpc.StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error
! at com.google.cloud.bigquery.storage.v1beta2.StreamWriterV2.appendInternal(StreamWriterV2.java:263)
! at com.google.cloud.bigquery.storage.v1beta2.StreamWriterV2.append(StreamWriterV2.java:234)
! at com.google.cloud.bigquery.storage.v1beta2.JsonStreamWriter.append(JsonStreamWriter.java:114)
! at com.google.cloud.bigquery.storage.v1beta2.JsonStreamWriter.append(JsonStreamWriter.java:89)
I am not sure why it's being cancelled without error. Any pointers on how I can debug this, or recommendations on how to maintain and refresh a long-lived stream writer?
Updating the Java client library version should solve this problem, since reconnect support was added for the JsonStreamWriter. Instead of throwing this error, it should handle retries.
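If you do want to keep refreshing the writer yourself, a rough recreate-on-failure sketch could look like the following. It assumes the v1 JsonStreamWriter from a recent google-cloud-bigquerystorage release, and the blanket catch is a simplification; production code should inspect the status code before rebuilding.
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;

public class DefaultStreamAppender {

    private final String defaultStream;   // projects/<p>/datasets/<d>/tables/<t>/_default
    private final TableSchema schema;
    private JsonStreamWriter writer;

    public DefaultStreamAppender(String project, String dataset, String table,
                                 TableSchema schema) throws Exception {
        this.defaultStream = TableName.of(project, dataset, table).toString() + "/_default";
        this.schema = schema;
        this.writer = newWriter();
    }

    private JsonStreamWriter newWriter() throws Exception {
        return JsonStreamWriter.newBuilder(defaultStream, schema).build();
    }

    public synchronized ApiFuture<AppendRowsResponse> append(JSONArray rows) throws Exception {
        try {
            return writer.append(rows);
        } catch (Exception streamClosed) {
            // The old writer was closed (e.g. after 600 s of inactivity):
            // rebuild it once and retry the append.
            try {
                writer.close();
            } catch (Exception ignored) {
                // best effort: the old writer is already unusable
            }
            writer = newWriter();
            return writer.append(rows);
        }
    }
}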

How to make sure that write csv is complete?

I'm writing a dataset to CSV as follows:
df.coalesce(1)
    .write()
    .format("csv")
    .option("header", "true")
    .mode(SaveMode.Overwrite)
    .save(sink);

sparkSession.streams().awaitAnyTermination();
How do I make sure that, when the streaming job gets terminated, the output has been written properly?
I have the problem that the sink folder gets overwritten and is empty if I terminate too early/late.
Additional info: particularly if the topic has no messages, my Spark job is still running and overwrites the result with an empty file.
How do I make sure that, when the streaming job gets terminated, the output has been written properly?
The way Spark Structured Streaming works is that the streaming query (job) runs continuously, and when it is terminated gracefully "the output is done properly".
The question I'd ask is how a streaming query got terminated. Is this by StreamingQuery.stop or perhaps Ctrl-C / kill -9?
If a streaming query is terminated in a forceful way (Ctrl-C / kill -9), well, you get what you asked for: a partial execution, with no way to be sure the output is correct, since the process (the streaming query) was shut down forcefully.
With StreamingQuery.stop, the streaming query terminates gracefully and writes out everything it was going to write at that time.
I have the problem that the sink folder gets overwritten and is empty if I terminate too early/late.
If you terminate too early or too late, what else would you expect, since the streaming query could not finish its work? Stop it gracefully and you get the expected output.
Additional info: particularly if the topic has no messages, my Spark job is still running and overwrites the result with an empty file.
That's an interesting observation which requires further exploration.
If there are no messages to be processed, no batch would be triggered, so there would be no jobs and hence no "overwrites the result with an empty file" (as no task would get executed).
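For completeness, a minimal Java sketch of the graceful path (the helper name and the idea of passing your sparkSession and query name are assumptions; StreamingQuery.stop is the documented call):
import java.util.concurrent.TimeoutException;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

// Gracefully stop a named streaming query: the current micro-batch is allowed
// to finish and commit its output before awaitAnyTermination() returns.
public static void stopQueryGracefully(SparkSession sparkSession, String queryName)
        throws TimeoutException {
    for (StreamingQuery query : sparkSession.streams().active()) {
        if (queryName.equals(query.name())) {
            query.stop();
        }
    }
}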
Firstly, I see that you have not used writeStream, so I am not quite sure how your job is a streaming job.
Now, answering your question 1: you can use StreamingQueryListener to monitor the streaming query's progress. Have another streaming query read from the output location and monitor it as well. Once you have the files in the output location, use the query name and input record count from the StreamingQueryListener to gracefully stop any query; awaitAnyTermination should then stop your Spark application. The following code can help:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Running total of input rows seen by the monitored query
var recordsReadCount: Long = 0L

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // logger message to show that the query has started
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      if (event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount = recordsReadCount + event.progress.numInputRows
        // logger messages to show continuous progress
      }
    }
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // logger message to show the reason of termination
    }
  }
})
Answering your 2nd question: I, too, do not think this is possible, as mentioned in the answer by Jacek.

Apache Camel: Cached stream file deletion causing file not found errors

Scenario:
I am trying to stream and process some large XML files. These files are sent from a producer asynchronously.
producerTemplate.sendBodyAndHeaders(endpointUri, inStream, ImmutableMap.of(JOBID_PROPERTY, importJob.getId()));
I need to batch all the file input streams, identify the files by probing them with XPath, and reorder them according to their content. I have the following route:
from("direct:route1")
.streamCaching()
.choice()
.when(xpath("//Tag1")) .setHeader("execOrder", constant(3)) .setHeader("xmlRoute", constant( "direct:some-route"))
.when(xpath("//Tag2")) .setHeader("execOrder", constant(1)) .setHeader("xmlRoute", constant( "direct:some-other-route"))
.when(xpath("//Tag3")) .setHeader("execOrder", constant(2)) .setHeader("xmlRoute", constant( "direct:yet-another-route"))
.otherwise()
.to("direct:somewhereelse")
.end()
.resequence(header("execOrder"))
.batch(new BatchResequencerConfig(300, 10000L))
.allowDuplicates()
.recipientList(header("xmlRoute"))
When running my code I get the following error:
2017-11-23 11:43:13.442 INFO 10267 --- [ - Batch Sender] c.w.n.s.m.DefaultImportJobService : Updating entity ImportJob with id 5a16a61803af33281b22c716
2017-11-23 11:43:13.451 WARN 10267 --- [ - Batch Sender] org.apache.camel.processor.Resequencer : Error processing aggregated exchange: Exchange[ID-int-0-142-bcd-wsint-pro-59594-1511433568520-0-20]. Caused by: [org.apache.camel.RuntimeCamelException - Cannot reset stream from file /var/folders/dc/fkrgdrnx6txbg7jfdjd_58mm0000gn/T/camel/camel-tmp-39abaae8-9bdd-435a-b63d-299ad8b06415/cos1499080503439465502.tmp]
org.apache.camel.RuntimeCamelException: Cannot reset stream from file /var/folders/dc/fkrgdrnx6txbg7jfdjd_58mm0000gn/T/camel/camel-tmp-39abaae8-9bdd-435a-b63d-299ad8b06415/cos1499080503439465502.tmp
at org.apache.camel.converter.stream.FileInputStreamCache.reset(FileInputStreamCache.java:91)
I've read here that the FileInputStreamCache is closed when XPathBuilder.getDocument() is called and the temp file is deleted, so you get the FileNotFoundException when the XPathBuilder wants to reset the InputStream.
The solution seems to be to disable spooling to disk, like this:
camelContext.getStreamCachingStrategy().setSpoolThreshold(-1);
However, I don't want to do that because of RAM restrictions, i.e. files can get up to 600MB and I don't want to keep them in memory. Any ideas how to solve the problem?
The resequencer is a two-leg (stateful) pattern and will cause the original exchange to be completed beforehand, as it keeps a copy in memory while re-sequencing until the gap is filled and then sends the messages out in the new order.
Since your input stream comes from some HTTP service, it would be closed before the resequencer outputs the exchange.
Either do as suggested and store to local disk first, and then let the resequencer work on that, or find a way not to use the resequencer.
I ended up doing what Claus and Ricardo suggested. I made a separate route which saves the files to disk. Then another one which probes the files and resequences the exchanges according to a fixed order.
String xmlUploadDirectory = "file://" + Files.createTempDir().getPath() + "/xmls?noop=true";

from("direct:route1")
    .to(xmlUploadDirectory);

from(xmlUploadDirectory)
    .choice()
        .when(xpath("//Tag1")).setHeader("execOrder", constant(3)).setHeader("xmlRoute", constant("direct:some-route"))
        .when(xpath("//Tag2")).setHeader("execOrder", constant(1)).setHeader("xmlRoute", constant("direct:some-other-route"))
        .when(xpath("//Tag3")).setHeader("execOrder", constant(2)).setHeader("xmlRoute", constant("direct:yet-another-route"))
        .otherwise()
            .to("direct:somewhereelse")
    .end()
    .to("direct:resequencing");

from("direct:resequencing")
    .resequence(header("execOrder"))
        .batch(new BatchResequencerConfig(300, 10000L))
        .allowDuplicates()
    .recipientList(header("xmlRoute"));

Propagate exception to enriching routes

I have 3 routes:
route-file1, which reads file1.csv and converts it to an array
route-file2, which reads file2.csv and converts it to an array
route-final, which enriches with both routes (using a custom aggregator to merge the arrays) and then does something with the result
The problem is when route-file1 succeeds but route-file2 or any other route fails: route-file1 has already completed and moved file1.csv to the .done folder, so I cannot rerun everything again.
Is there a way that, when route-final fails, the exception is propagated to the other routes used in enrich? I tried using a transaction, which works fine for stopping the route execution, but it does not propagate the exception to the route-fileX routes. Is this possible with Camel?
You can set shareUnitOfWork to true on your content enrichers so they work together in the same unit of work. See more details in the documentation: http://camel.apache.org/content-enricher.html
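A rough Java DSL sketch of what that could look like (the route names and the merge logic are placeholders taken from the question, and the four-argument enrich overload, whose last flag is shareUnitOfWork, is assumed to be available in your Camel version):
AggregationStrategy mergeArrays = (original, resource) -> {
    // placeholder: this is where your custom aggregator merges the two arrays
    return original != null ? original : resource;
};

from("direct:route-final")
    .enrich("direct:route-file1", mergeArrays, false, true)   // aggregateOnException=false, shareUnitOfWork=true
    .enrich("direct:route-file2", mergeArrays, false, true)
    .to("direct:do-something");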

Why did I get "FileUploadException: Stream ended unexpectedly" with Apache Commons FileUpload?

What is the reason for encountering this Exception:
org.apache.commons.fileupload.FileUploadException:
Processing of multipart/form-data request failed. Stream ended unexpectedly
The main reason is that the underlying socket was closed or reset. The most common cause is that the user closed the browser before the file was fully uploaded, or the Internet connection was interrupted during the upload. In any case, the server-side code should be able to handle this exception gracefully.
It's been about a year since I dealt with that library, but if I remember correctly, if someone tries to upload a file and then changes the browser URL (clicks a link, opens a bookmark, etc.), you could get that exception.
You could possibly get this exception if you're using FileUpload to receive an upload from Flash.
At least as of version 8, Flash contains a known bug: the multipart stream it produces is broken, because the final boundary doesn't contain the suffix "--", which ought to indicate that no more items are following. Consequently, FileUpload waits for the next item (which it doesn't get) and throws an exception.
The suggested workaround is to use the streaming API and catch the exception:
catch (MalformedStreamException e) {
// Ignore this
}
For more details, please refer to https://commons.apache.org/proper/commons-fileupload/faq.html#missing-boundary-terminator
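A rough sketch of that workaround with the Commons FileUpload streaming API (the servlet wiring is illustrative, and exactly where the MalformedStreamException surfaces can vary):
import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.MultipartStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;

public void handleUpload(HttpServletRequest request) throws Exception {
    ServletFileUpload upload = new ServletFileUpload();
    try {
        FileItemIterator iter = upload.getItemIterator(request);
        while (iter.hasNext()) {
            FileItemStream item = iter.next();
            try (InputStream stream = item.openStream()) {
                byte[] buffer = new byte[8192];
                while (stream.read(buffer) != -1) {
                    // consume/store the bytes of item.getFieldName() here
                }
            }
        }
    } catch (MultipartStream.MalformedStreamException e) {
        // Ignore: Flash omits the trailing "--" on the final boundary,
        // so FileUpload reports that the stream ended unexpectedly.
    }
}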
