I am ingesting data from a remote API service for different time ranges in my custom NiFi Processor.
I have a global counter over the time ranges which is updated on each iteration (I'm using the Timer driven scheduling strategy).
When the counter exceeds the maximum value, I want to transfer just the FlowFile obtained from the request (session.get()) to the SUCCESS relationship, i.e. without performing any additional logic:
session.transfer(requestFlowFile, SUCCESS);
I understand that I can't stop or pause the processor when the collection of time ranges is exhausted, so I'm trying to use the above approach as a workaround.
All iterations go fine until the counter becomes greater than the maximum and the processor tries to transfer the FlowFile from the request (session.get()).
At that point I get this exception:
failed to process session due to org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=459e615b-0ff5-424f-aac7-f95d364cdc13,claim=,offset=0,name=99628180019265,size=0] is not known in this session
What's wrong here? Or is there maybe another approach?
That error means the flow file being passed to session.transfer() came from a different session. You can only call transfer() on the same session from which you called get().
If it's a custom processor, simply don't call session.get() and skip this execution without transferring anything.
Or, if you need the incoming file in order to make a decision, you can get it, do some checks, and roll back the current session with penalization (rollback(true)), so the file(s) you got will stay in the incoming queue for the duration of the processor's Penalty Duration parameter without triggering the processor to run.
Or you can call session.get(FlowFileFilter) to pull from the incoming queue only the files that match your logic (see the sketch below).
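For illustration, here is a rough sketch of how options 1 and 2 might look inside onTrigger(); counter, maxValue and shouldProcess() are hypothetical names standing in for your own logic, and SUCCESS is the relationship from the question:
@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    if (counter.get() > maxValue) {
        // option 1: don't call session.get() at all, just skip this execution
        return;
    }

    FlowFile requestFlowFile = session.get();
    if (requestFlowFile == null) {
        return;
    }

    if (!shouldProcess(requestFlowFile)) {
        // option 2: keep the file in the incoming queue and penalize this run
        session.rollback(true);
        return;
    }

    // ... normal ingestion logic ...
    // transfer on the SAME session that did the get(), otherwise you hit the
    // "is not known in this session" exception
    session.transfer(requestFlowFile, SUCCESS);
}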
I've been reading a bit about the Kafka concurrency model, but I still struggle to understand whether I can have local state in a Kafka Processor, or whether that will fail in bad ways?
My use case is: I have a topic of updates, I want to insert these updates into a database, but I want to batch them up first. I batch them inside a Java ArrayList inside the Processor, and send them and commit them in the punctuate call.
Will this fail in bad ways? Am I guaranteed that the ArrayList will not be accessed concurrently?
I realize that there will be multiple Processors and multiple ArrayLists, depending on the number of threads and partitions, but I don't really care about that.
I also realize I will lose the ArrayList if the application crashes, but I don't care if some events are inserted twice into the database.
This works fine in my simple tests, but is it correct? If not, why?
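(For reference, the approach described above looks roughly like this with the classic, pre-2.0 Processor API; String keys/values, the 30-second punctuation interval and saveBatch() are assumptions, not part of the original question.)
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class BatchingProcessor implements Processor<String, String> {

    private final List<String> buffer = new ArrayList<>();   // local state, touched only by this task's thread
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        context.schedule(30_000);          // punctuate roughly every 30 seconds (stream-time based)
    }

    @Override
    public void process(String key, String value) {
        buffer.add(value);                 // just accumulate; no concurrent access happens here
    }

    @Override
    public void punctuate(long timestamp) {
        if (!buffer.isEmpty()) {
            saveBatch(buffer);             // hypothetical JDBC batch insert
            buffer.clear();
            context.commit();              // commit offsets only after the batch is persisted
        }
    }

    @Override
    public void close() { }

    private void saveBatch(List<String> batch) {
        // the actual DB write would go here
    }
}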
Whatever you use for local state in your Kafka consumer application is up to you, so you can guarantee that only the current thread/consumer will be able to access the local state data in your ArrayList. If you have multiple threads, one per Kafka consumer, each thread can have its own private ArrayList or HashMap to store state in. You could also use something like a local RocksDB database for persistent local state.
A few things to look out for:
If you're batching updates together to send to the DB, are those updates in any way related, say, because they're part of a transaction? If so, you might run into problems when related updates land in different partitions. An easy way to avoid this is to set a key for your messages with a transaction ID, or some other unique identifier for the transaction; that way all the updates with that transaction ID will end up in one specific partition, so whoever consumes them is sure to always have the complete set of updates for that transaction.
How are you validating that you got ALL the transactions before your batch update? Again, this is important if you're dealing with database updates inside transactions. You could simply wait for a pre-determined amount of time to ensure you have all the updates (say, maybe 30 seconds is enough in your case). Or maybe you send an "EndOfTransaction" message that details how many messages you should have gotten, as well as maybe a CRC or hash of the messages themselves. That way, when you get it, you can either use it to validate you have all the messages already, or you can keep waiting for the ones that you haven't gotten yet.
Make sure you don't commit to Kafka the offsets of the messages you're keeping in memory until after you've batched them, sent them to the database, and confirmed that the updates went through successfully. This way, if your application dies, the next time it comes back up it will get again the messages you haven't committed to Kafka yet.
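To make that commit discipline concrete, here is a hedged sketch with a plain KafkaConsumer (the topic name "updates", the group id, the batch size of 100 and writeBatchToDb() are all invented for the example): offsets are only committed after the database write succeeds.
// sketch only; assumes org.apache.kafka.clients.consumer.*, java.util.* and java.time.Duration
// imports, inside a surrounding method
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "db-writer");
props.put("enable.auto.commit", "false");      // we commit manually, after the DB write
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("updates"));

List<String> batch = new ArrayList<>();
while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
        batch.add(record.value());             // local, single-threaded buffer
    }
    if (batch.size() >= 100) {
        writeBatchToDb(batch);                 // hypothetical JDBC batch insert
        consumer.commitSync();                 // commit only after the updates are safely in the DB
        batch.clear();
    }
}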
The documentation for older versions says producer.type is one of the essential properties.
The documentation for newer versions doesn't mention it at all.
Do newer versions of the Kafka producer still have producer.type?
Or are new producers always async, so that I should call future.get() to make a send synchronous?
New producers are always async, and you should call future.get() to make the send synchronous. It's not worth having two API methods when something as simple as adding future.get() gives you basically the same functionality.
From the documentation for send() here
https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
Since the send call is asynchronous it returns a Future for the RecordMetadata that will be assigned to this record. Invoking get() on this future will block until the associated request completes and then return the metadata for the record or throw any exception that occurred while sending the record.
If you want to simulate a simple blocking call you can call the get() method immediately:
byte[] key = "key".getBytes();
byte[] value = "value".getBytes();
ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("my-topic", key, value);
producer.send(record).get();
Why do you want to make send() synchronous?
The asynchronous send is a Kafka feature that batches messages for better throughput.
Asynchronous send
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
There is no way to do a synchronous send because the API only supports the asynchronous method, but there are some configs you can tweak as a workaround.
You could set batch.size to 0. In that case, message batching is disabled.
However, I think you should just leave batch.size at its default and set linger.ms to 0 (which is also the default). In that case, if many messages come in at the same time, they will still be batched into a single send, but that send happens immediately.
The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out.
And if you want to make sure the message is sent and persisted successfully, you could set acks to -1 (or 1) and retries to, e.g., 3.
For more info about the producer configs, refer to https://kafka.apache.org/documentation/#producerconfigs
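As a small illustration of those settings (the bootstrap server, serializers and concrete values below are examples, not recommendations; the snippet assumes it runs inside a method that declares the exceptions thrown by get()):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("linger.ms", "0");     // default: don't wait to fill a batch
props.put("acks", "all");        // same as -1: wait for all in-sync replicas
props.put("retries", "3");

Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();  // block until acked
producer.close();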
I have a webservice ABC
ABC Operations:
A. Call XYZ web service
B. Store the response in the DB
C. Return the result
Overall ABC response time = 18 sec
XYZ response time = 8 sec
ABC-only response time = 18 - 8 = 10 sec
I want to minimize response time of ABC service.
How can this be done?
A few things I thought of:
1. Send a partial request and get a partial response, but that's not possible in my case.
2. Return the response and perform the DB write asynchronously. (Can this be done in a reliable manner?)
3. Is there any way to improve the DB write operation itself?
If it is possible to "perform the DB write in an asynchronous manner", i.e. if you can respond to the caller before the DB write completes, then you can use the 'write behind' pattern to perform the DB writes asynchronously.
The write behind pattern looks like this: queue each data change, and let this queue be subject to a configurable duration (aka the "write behind delay") and a maximum size. When data changes, it is added to the write-behind queue (if it is not already in the queue) and it is written to the underlying store whenever one of the following conditions is met (a minimal sketch follows the list):
The write behind delay expires
The queue exceeds a configurable size
The system enters shutdown mode and you want to ensure that no data is lost
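To make the idea concrete, here is a minimal, hedged sketch in plain Java (no particular caching library); the flush interval, batch size and the store callback are placeholders you would tune for your own DB write:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Minimal write-behind sketch: changes are queued and written out when the
// write-behind delay expires, when the queue grows too large, or at shutdown.
class WriteBehindWriter<T> {
    private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Consumer<List<T>> store;   // e.g. a JDBC batch insert
    private final int maxBatch;

    WriteBehindWriter(Consumer<List<T>> store, long delaySeconds, int maxBatch) {
        this.store = store;
        this.maxBatch = maxBatch;
        // flush whenever the write-behind delay expires
        scheduler.scheduleWithFixedDelay(this::flush, delaySeconds, delaySeconds, TimeUnit.SECONDS);
    }

    void add(T change) {
        queue.offer(change);
        if (queue.size() >= maxBatch) {      // or flush early when the queue exceeds its size limit
            flush();
        }
    }

    private synchronized void flush() {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch);
        if (!batch.isEmpty()) {
            store.accept(batch);             // the slow DB write happens off the request path
        }
    }

    void shutdown() {                        // drain on shutdown so no data is lost
        scheduler.shutdown();
        flush();
    }
}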
There is plenty of prior art in this space. For example, Spring’s Cache Abstraction allows you to add a caching layer and it supports JSR-107 compliant caches such as Ehcache 3.x which provides a write behind cache writer. Spring’s caching service is an abstraction not an implementation, the idea being that it will look after the caching logic for you while you continue to provide the store and the code to interact with the store.
You should also look at whatever else is happening inside ABC, other than the call to XYZ. If the DB call accounts for all of those extra 10 seconds, then 'write behind' will save you ~10 seconds, but if there are other activities happening in those 10 seconds you'll need to address those separately. The key point here is to profile the calls inside ABC so that you can identify exactly where time is spent, and then prioritise each phase according to factors such as (a) how long that phase takes; (b) how easily that time can be reduced.
If you move to a ‘write behind’ approach then the elapsed time of the DB is no longer an issue for your caller but it might still be an issue within ABC since long write times could cause the queue of ‘write behind’ instructions to build up. In that case, you would profile the DB call to understand why it is taking so long. Common candidates include: attempting to write large data items (e.g. a large denormalised data item), attempting to write into a table/store which is heavily indexed.
As far as I know, you can follow these options based on your requirements:
Consider caching the results of the XYZ response and storing them in the database so that you can minimise the calls to XYZ.
There is a possibility of failures with option 2, but you can still handle it by writing the failure cases to an error log and processing them later.
The DB write operation can be improved with proper indexing, normalisation, etc.
I am using Storm to process online problems, but I can't understand why Storm replays tuples from the spout. Retrying only the part that crashed might be more efficient than replaying from the root, right?
Can anyone help me? Thanks.
A typical spout implementation will replay only the FAILED tuples. As explained here, a tuple emitted from the spout can trigger thousands of other tuples, and Storm creates a tree of tuples based on that. A tuple is considered "fully processed" when every message in the tree has been processed. When emitting, the spout adds a message id which is used to identify the tuple in a later phase. This is called anchoring and can be done in the following way:
_collector.emit(new Values("field1", "field2", 3) , msgId);
Now, the link posted above says:
A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured on a topology-specific basis using the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS configuration and defaults to 30 seconds.
If the tuple times out, Storm will call the fail method on the spout; likewise, in case of success, the ack method will be called.
So at this point Storm will let you know which tuples it failed to process, but if you look into the source code you will see that the implementation of the fail method is empty in the BaseRichSpout class, so you need to override BaseRichSpout's fail method in order to have replay capability in your application.
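For example, here is a hedged sketch of a spout that replays only its failed tuples; the pending map, the way message ids are generated, and fetchNextFromSource() are assumptions, and the imports are for the Storm 1.x packages (org.apache.storm.*):
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ReplayingSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private final Map<Object, Values> pending = new HashMap<>();  // in-flight tuples keyed by message id

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Values values = fetchNextFromSource();        // hypothetical read from your source
        if (values != null) {
            String msgId = UUID.randomUUID().toString();
            pending.put(msgId, values);
            collector.emit(values, msgId);            // anchored emit, as in the snippet above
        }
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                        // tree fully processed, forget the tuple
    }

    @Override
    public void fail(Object msgId) {
        Values values = pending.get(msgId);
        if (values != null) {
            collector.emit(values, msgId);            // replay only this failed tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("field1", "field2", "field3"));
    }

    private Values fetchNextFromSource() {
        return null;                                  // plug in your real source here
    }
}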
Such replays of failed tuples should represent only a tiny proportion of the overall tuple traffic, so the efficiency of this simple replay-from start policy is usually not a concern.
Supporting a "replay-from-error-step" would bring lots of complexity, since the location of an error is sometimes hard to determine and there would be a need to support "replay-elsewhere" in case the cluster node where the error happened is currently (or permanently) down. It would also slow down the execution of the whole traffic, which would probably not be compensated by the efficiency gained on error handling (which, again, is assumed to be triggered rarely).
If you think this replay-from-start strategy would impact negatively your topology, try to break it down into several smaller ones separated by some persistent queuing system like Kafka.
Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 sec, 1 minute or 1 hour). The order of event processing has to be assured: I need to process by timestamp order, which is generated on an embedded device inside the vehicle.
What I have now:
A Rest servlet that is called by the consumer and creates a Task (Event data is in the payload).
After this, a worker servlet gets this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure FIFO behavior? Or how can I improve this solution to achieve it?
You need to approach this with three separate steps:
1. Implement a Sharding Counter to generate a monotonically increasing ID. As much as I like to use the timestamp from Google's server to indicate task ordering, it appears that timestamps between GAE servers might vary more than your requirement allows.
2. Add your tasks to a Pull Queue instead of a Push Queue. When constructing your TaskOptions, add the ID obtained from step #1 as a tag. After adding the task, store the ID somewhere in your datastore.
3. Have your worker servlet lease tasks by a certain tag from the Pull Queue. Query the datastore to get the earliest ID that you need to fetch, and use that ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finish your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumption on a Backend.
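A loose illustration of steps 2 and 3 with the com.google.appengine.api.taskqueue pull API; the queue name "events-pull", the tag format, nextId/earliestOutstandingId and process() are assumptions made for the sake of the example:
// enqueue (step 2): tag the task with the monotonically increasing ID
Queue pullQueue = QueueFactory.getQueue("events-pull");
pullQueue.add(TaskOptions.Builder
        .withMethod(TaskOptions.Method.PULL)
        .tag(Long.toString(nextId))           // ID from step 1, also stored in the datastore
        .payload(eventBytes));

// worker (step 3): find the earliest outstanding ID in the datastore, then lease by that tag
String earliestTag = Long.toString(earliestOutstandingId);
List<TaskHandle> tasks = pullQueue.leaseTasksByTag(60, TimeUnit.SECONDS, 1, earliestTag);
for (TaskHandle task : tasks) {
    process(task.getPayload());               // hypothetical event processing
    pullQueue.deleteTask(task);               // don't forget to delete the task when done
}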
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be viable for generating monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use those in lieu of a monotonically increasing ID?
Regardless of how the IDs are generated, the basic idea is to use the datastore's query mechanism to produce a FIFO ordering of Tasks, and to use the task tag to pull specific tasks from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.
I think the simple answer is "no"; however, partly in order to improve the situation, I am using a pull queue, pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore, then complete a batch at a time. You still have to work out what to do with the tasks at the beginning and end of each batch, because they might be out of order with respect to interleaving tasks in other batches.
Ok. This is how I've done it.
1) Rest servlet that is called from the consumer:
If Event sequence doesn't match Vehicle sequence (from datastore):
    create a task on a "wait" queue to call me again
else:
    validate state
    create a task on the "regular" queue (Event data is on the payload)
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue in order to do a data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.
You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.
I don't know the answer myself, but it may be that tasks enqueued using a deferred function execute in the order submitted. Likely you will need an engineer from G. to get an answer. Pull queues, as suggested, seem a good alternative, plus they would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.
The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
When creating the task, don't put the data in the payload but in the datastore, in a Kind ordered on timestamps and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
Run a generic task that fetches the data for the earliest timestamp, processes it, and then deletes it, inside a transaction (see the sketch below).
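A hedged sketch of that second step with the low-level Datastore API; the kind and property names ("PendingEvent", "timestamp"), vehicleId and applyUpdate() are invented for the example:
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Key vehicleKey = KeyFactory.createKey("Vehicle", vehicleId);

// child entities of the Vehicle, ordered by the client-generated timestamp
Query q = new Query("PendingEvent")
        .setAncestor(vehicleKey)
        .addSort("timestamp", Query.SortDirection.ASCENDING);

Transaction txn = ds.beginTransaction();
try {
    List<Entity> oldest = ds.prepare(txn, q).asList(FetchOptions.Builder.withLimit(1));
    if (!oldest.isEmpty()) {
        Entity event = oldest.get(0);      // the event with the earliest timestamp
        applyUpdate(txn, event);           // hypothetical: update the Vehicle entity
        ds.delete(txn, event.getKey());    // remove the processed event
    }
    txn.commit();
} finally {
    if (txn.isActive()) {
        txn.rollback();
    }
}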
Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received or on a per-vehicle basis. The latter offers more options than the former.