NiFi - Update Luwak (Lucene) Indexes in Processor


I am trying to create a custom processor using the Luwak Lucene indexer, so I can run queries on incoming flow files. What I am trying to figure out is the best way to update the query indexes that exist inside of the Luwak monitor (example code below).
EDIT - More Usage Context
By update, I mean allowing an outside user to add / update / remove queries that are being run against the incoming flowfiles. We would start with a fixed set of queries, but would then want to allow one or more users to change the queries being executed against the incoming messages. Herein lies the challenge: changing the queries that are being executed.
Any other options I should consider? It takes about 20s to update the queries if there are 10k of them. Updates would most likely be rare, but re-load / startup time is something I am trying to consider.
Options I have considered:
Use an UpdateAttribute and update on every flowfile. Not ideal, especially if there are a bunch of queries to index.
Use http, AWS SQS, etc. to send a high-priority flow file to update (make higher than any other source). Not terrible, but still doesn't seem right.
Use the NiFi API to start / stop the processor on update. Doesn't seem like a very efficient way to perform the updates, especially if they happen quite frequently.
Instantiate Monitor:
Monitor monitor = new Monitor(new LuceneQueryParser("field"), new TermFilteredPresearcher());
Add Queries - What I am trying to optimize:
//Add queries to the monitor
for (Map.Entry<String, String> entry : bucketList.entrySet()) {
    MonitorQuery q = new MonitorQuery(entry.getKey(), entry.getValue());
    monitor.update(q);
}

When your processor starts you could start a background timer thread that periodically builds a new monitor and then replaces the one being used by the processor.
You would probably want to make a member variable in your processor like:
AtomicReference<Monitor> monitorHolder = new AtomicReference<Monitor>();
Then in @OnScheduled you can build the initial monitor and set it in the holder.
Then in onTrigger you always first get the Monitor:
Monitor localMonitor = monitorHolder.get();
Then in the background thread you can call monitorHolder.set(newMonitor) which won't affect the current execution of the processor, but will take effect the next time onTrigger is called.
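To make the holder-and-swap idea concrete, here is a minimal sketch using only the JDK. It is not NiFi or Luwak code: `ReloadingHolder` and its `builder` supplier are hypothetical names, and the supplier stands in for whatever code builds a fully populated Monitor. The new instance is built off to the side and published with one atomic write, so a thread that already fetched the old instance keeps using it until its current run finishes.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Holds the current instance and swaps in a freshly built one on a timer. */
final class ReloadingHolder<T> {
    private final AtomicReference<T> current = new AtomicReference<>();
    private final Supplier<T> builder;          // hypothetical: builds a complete new instance
    private ScheduledExecutorService scheduler; // created in start(), shut down in stop()

    ReloadingHolder(Supplier<T> builder) {
        this.builder = builder;
    }

    /** Build the first instance, then rebuild periodically in the background. */
    synchronized void start(long period, TimeUnit unit) {
        current.set(builder.get());
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "monitor-reloader");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::reload, period, period, unit);
    }

    /** Builds a new instance, then publishes it with a single atomic write. */
    void reload() {
        try {
            current.set(builder.get());
        } catch (RuntimeException e) {
            // Keep serving the previous instance. An exception escaping a
            // scheduled task would suppress all of its future runs.
        }
    }

    /** Fetch once at the start of a unit of work and use that reference throughout. */
    T get() {
        return current.get();
    }

    /** Stops the background rebuilds. */
    synchronized void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
            scheduler = null;
        }
    }
}
```

In a processor, `start` would be called from the scheduling hook, `get` at the top of onTrigger, and `stop` when the processor is stopped. With roughly 20s per rebuild, `scheduleWithFixedDelay` avoids overlapping rebuilds because the delay is measured from the end of the previous run.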

Related

Way to improve Rest Webservice performance which call other API

I have a webservice ABC
ABC Operations:
A. Call XYZ web service
B. Store response in db
C. return result
Overall ABC Response time = 18 sec
XYZ Response Time = 8 sec.
Only ABC Response time = 18-8 = 10 sec
I want to minimize response time of ABC service.
How can this be done?
A few things I thought of:
1. Send a partial request and get a partial response. But that's not possible in my case.
2. Return the response and perform the db write in an asynchronous manner. (Can this be done in a reliable manner?)
3. Is there any way to improve the db write operation?

If it is possible to perform the DB write asynchronously, i.e. if you can respond to the caller before the DB write completes, then you can use the ‘write behind’ pattern for those writes.
The write behind pattern looks like this: queue each data change, let this queue be subject to a configurable duration (aka the “write behind delay”) and a maximum size. When data changes, it is added to the write-behind queue (if it is not already in the queue) and it is written to the underlying store whenever one of the following conditions is met:
The write behind delay expires
The queue exceeds a configurable size
The system enters shutdown mode and you want to ensure that no data is lost
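The three flush conditions above can be sketched with plain JDK collections. This is an illustration under assumptions, not the Ehcache or Spring implementation: `WriteBehindBuffer`, `tick` and the `store` callback are hypothetical names, time is passed in explicitly to keep the logic easy to follow, and the store is called inline for brevity where a real system would hand the batch to a background thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Coalescing write-behind buffer: flushes on size, on age, or on demand. */
final class WriteBehindBuffer<K, V> {
    private final Map<K, V> pending = new LinkedHashMap<>();
    private final int maxSize;
    private final long maxDelayMillis;
    private final Consumer<Map<K, V>> store;   // hypothetical: performs the real batch write
    private long oldestEntryAt;

    WriteBehindBuffer(int maxSize, long maxDelayMillis, Consumer<Map<K, V>> store) {
        this.maxSize = maxSize;
        this.maxDelayMillis = maxDelayMillis;
        this.store = store;
    }

    /** Queue a change; a repeated key replaces the value that is still waiting. */
    synchronized void put(K key, V value, long nowMillis) {
        if (pending.isEmpty()) {
            oldestEntryAt = nowMillis;
        }
        pending.put(key, value);
        if (pending.size() >= maxSize) {
            flush();                           // condition: queue reached its size limit
        }
    }

    /** Call periodically to enforce the write-behind delay. */
    synchronized void tick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - oldestEntryAt >= maxDelayMillis) {
            flush();                           // condition: delay expired
        }
    }

    /** Writes everything queued; also call this on shutdown so nothing is lost. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        Map<K, V> batch = new LinkedHashMap<>(pending);
        pending.clear();
        store.accept(batch);
    }
}
```

Note that this sketch drops the batch if `store` throws, so the reliability question from the original post still needs an answer, for example retrying or logging the failed batch for later processing.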
There is plenty of prior art in this space. For example, Spring’s Cache Abstraction allows you to add a caching layer and it supports JSR-107 compliant caches such as Ehcache 3.x which provides a write behind cache writer. Spring’s caching service is an abstraction not an implementation, the idea being that it will look after the caching logic for you while you continue to provide the store and the code to interact with the store.
You should also look at whatever else is happening inside ABC, other than the call to XYZ, if the DB call accounts for all of those extra 10s then ‘write behind’ will save you ~10s but if there are other activities happening in those 10s then you’ll need to address those separately. The key point here is to profile the calls inside ABC so that you can identify exactly where time is spent and then prioritise each phase according to factors such as (a) how long that phase takes; (b) how easily that time can be reduced.
If you move to a ‘write behind’ approach then the elapsed time of the DB is no longer an issue for your caller but it might still be an issue within ABC since long write times could cause the queue of ‘write behind’ instructions to build up. In that case, you would profile the DB call to understand why it is taking so long. Common candidates include: attempting to write large data items (e.g. a large denormalised data item), attempting to write into a table/store which is heavily indexed.

As far as I know you can follow the options based on your requirement:
Think of caching the results from XYZ response and store to database so that you can minimise the call.
There could be possibility of failures in option 2 but still you can fix it by writing the failure cases to error log and process it later.
DB write operation can be improved with proper indexing, normalisation etc..

How can I ensure that my Android app doesn't access a file simultaneously?

I am building a fitness app which continually logs activity on the device. I need to log quite often, but I also don't want to unnecessarily drain the battery of my users which is why I am thinking about batching network calls together and transmitting them all at once as soon as the radio is active, the device is connected to a WiFi or it is charging.
I am using a filesystem based approach to implement that. I persist the data first to a File - eventually I might use Tape from Square to do that - but here is where I encounter the first issues.
I am continually writing new log data to the File, but I also need to periodically send all the logged data to my backend. When that happens I delete the contents of the File. The problem now is how can I prevent both of those operations from happening at the same time? Of course it will cause problems if I try to write log data to the File at the same time as some other process is reading from the File and trying to delete its contents.
I am thinking about using an IntentService to essentially act as a queue for all those operations. And since, at least as far as I have read, an IntentService handles Intents sequentially in a single worker Thread, it shouldn't be possible for two of those operations to happen at the same time, right?
Currently I want to schedule a PeriodicTask with the GcmNetworkManager which would take care of sending the data to the server. Is there any better way to do all this?

1) You are overthinking this whole thing!
Your approach is way more complicated than it has to be! And for some reason none of the other answers point this out, but GcmNetworkManager already does everything you are trying to implement! You don't need to implement anything yourself.
2) Optimal way to implement what you are trying to do.
You don't seem to be aware that GcmNetworkManager already batches calls in the most battery efficient way with automatic retries etc and it also persists the tasks across device boots and can ensure their execution as soon as is battery efficient and required by your app.
Whenever you have data to save, just schedule a OneoffTask like this:
final OneoffTask task = new OneoffTask.Builder()
        // The Service which executes the task.
        .setService(MyTaskService.class)
        // A tag which identifies the task
        .setTag(TASK_TAG)
        // Sets a time frame for the execution of this task in seconds.
        // This specifically means that the task can either be
        // executed right now, or must have executed at the latest in one hour.
        .setExecutionWindow(0L, 3600L)
        // Task is persisted on the disk, even across boots
        .setPersisted(true)
        // Unmetered connection required for task
        .setRequiredNetwork(Task.NETWORK_STATE_UNMETERED)
        // Attach data to the task in the form of a Bundle
        .setExtras(dataBundle)
        // If you set this to true and this task already exists
        // (just depends on the tag set above) then the old task
        // will be overwritten with this one.
        .setUpdateCurrent(true)
        // Sets if this task should only be executed when the device is charging
        .setRequiresCharging(false)
        .build();
mGcmNetworkManager.schedule(task);
This will do everything you want:
The Task will be persisted on the disk
The Task will be executed in a batched and battery efficient way, preferably over Wifi
You will have configurable automatic retries with a battery efficient backoff pattern
The Task will be executed within a time window you can specify.
I suggest for starters you read this to learn more about the GcmNetworkManager.
So to summarize:
All you really need to do is implement your network calls in a Service extending GcmTaskService, and later whenever you need to perform such a network call you schedule a OneoffTask and everything else will be taken care of for you!
Of course you don't need to call each and every setter of the OneoffTask.Builder like I do above - I just did that to show you all the options you have. In most cases scheduling a task would just look like this:
mGcmNetworkManager.schedule(new OneoffTask.Builder()
        .setService(MyTaskService.class)
        .setTag(TASK_TAG)
        .setExecutionWindow(0L, 300L)
        .setPersisted(true)
        .setExtras(bundle)
        .build());
And if you put that in a helper method, or even better create factory methods for all the different tasks you need to do, then everything you were trying to do should just boil down to a few lines of code!
And by the way: Yes, an IntentService handles every Intent one after another sequentially in a single worker Thread. You can look at the relevant implementation here. It's actually very simple and quite straight forward.

All UI and Service methods are by default invoked on the same main thread. Unless you explicitly create threads or use AsyncTask there is no concurrency in an Android application per se.
This means that all intents, alarms, and broadcasts are by default handled on the main thread.
Also note that doing I/O and/or network requests may be forbidden on the main thread (depending on Android version, see e.g. How to fix android.os.NetworkOnMainThreadException?).
Using AsyncTask or creating your own threads will bring you to concurrency problems but they are the same as with any multi-threaded programming, there is nothing special to Android there.
One more point to consider when doing concurrency is that background threads need to hold a WakeLock or the CPU may go to sleep.

Just some idea.
You may try using a serial executor for your file, so that only one thread can execute at a time.
http://developer.android.com/reference/android/os/AsyncTask.html#SERIAL_EXECUTOR
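As a rough illustration of the serial-executor idea, the sketch below funnels every file operation through one single-threaded executor, so an append and a read-then-clear can never interleave. It uses a plain `Executors.newSingleThreadExecutor()` as a stand-in for Android's `AsyncTask.SERIAL_EXECUTOR`; the class name `SerialLogFile` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Serializes all access to one log file through a single worker thread. */
final class SerialLogFile {
    private final Path file;
    private final ExecutorService serial = Executors.newSingleThreadExecutor();

    SerialLogFile(Path file) {
        this.file = file;
    }

    /** Queues an append; it runs after every previously queued operation. */
    Future<?> append(String line) {
        return serial.submit(() -> {
            try {
                Files.write(file, (line + "\n").getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Queues a read-then-clear; no append can run between the read and the clear. */
    Future<List<String>> drain() {
        return serial.submit(() -> {
            if (!Files.exists(file)) {
                return Collections.<String>emptyList();
            }
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.write(file, new byte[0]);   // default options truncate the file
            return lines;
        });
    }

    /** Lets queued operations finish, then stops the worker thread. */
    void shutdown() {
        serial.shutdown();
    }
}
```

The caller would upload the lines returned by `drain()`. If the upload fails, those lines are already gone from the file, so a real implementation would need to re-queue them or clear the file only after a successful upload.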

JVM: is it possible to manipulate frame stack?

Suppose I need to execute N tasks in the same thread. The tasks may sometimes need some values from an external storage. I have no idea in advance which task may need such a value and when. It is much faster to fetch M values in one go rather than the same M values in M queries to the external storage.
Note that I cannot expect cooperation from tasks themselves, they can be considered as nothing more than java.lang.Runnable objects.
Now, the ideal procedure, as I see it, would look like
Execute all tasks in a loop. If a task requests an external value, remember this, suspend the task and switch to the next one.
Fetch the values requested at the previous step, all at once.
Remove all completed tasks (suspended ones don't count as completed).
If there are still tasks left, go to step 1, but instead of executing a task, continue its execution from the suspended state.
As far as I see, the only way to "suspend" and "resume" something would be to remove its related frames from JVM stack, store them somewhere, and later push them back onto the stack and let JVM continue.
Is there any standard (not involving hacking at lower level than JVM bytecode) way to do this?
Or can you maybe suggest another possible way to achieve this (other than starting N threads or making tasks cooperate in some way)?

It's possible using something like quasar that does stack-slicing via an agent. Some degree of cooperation from the tasks is helpful, but it is possible to use AOP to insert suspension points from outside.
(IMO it's better to be explicit about what's going on (using e.g. Future and ForkJoinPool). If some plain code runs on one thread for a while and is then "magically" suspended and jumps to another thread, this can be very confusing to debug or reason about. With modern languages and libraries the overhead of being explicit about the asynchronicity boundaries should not be overwhelming. If your tasks are written in terms of generic types then it's fairly easy to pass-through something like scalaz Future. But that wouldn't meet your requirements as given).

As mentioned, Quasar does exactly that (it usually schedules N fibers on M threads, but you can set M to 1), using bytecode transformations. It even gives each task (AKA "fiber") its own stack trace, so you can dump it and get a complete stack trace without any interference from any other task sharing the thread.

Well you could try this
you need
A mechanism to save the current state of the task, because when the task returns its frame is popped from the call stack. Based on the return value or something similar you can determine whether it completed or not. Since you would need to re-execute it from the point where it left off, you need to preserve the state information.
Create a request data structure (DS) for each task. Whenever a task wants to request something it logs it there. The data structure should support all the possible requests a task can make.
Store these DS in a Map. At the end of the loop you can query the DS to determine the kind of resource required by each task.
Get the resource and put it in the DS. Restart the task from the state it was in when it returned.
The task queries the DS and gets the resource.
The task should use this DS whenever it wants to use an external resource.
You would need to design the method through which a resource is requested with special care, since when you re-execute the task you need to call this method yourself so that the task can continue from where it left off.
*DS -> Data Structure
hope it helps.
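A minimal sketch of this cooperative variant follows. It does not manipulate JVM frames and it does require the tasks to cooperate, which the question wanted to avoid. All names (`ResumableTask`, `BatchingRunner`, `batchFetch`) are hypothetical, and a shared map of fetched values plays the role of the request data structure. The sketch assumes the fetch function returns a value for every requested key; otherwise the loop would not terminate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** A task that can stop when a value is missing and be called again later. */
interface ResumableTask {
    /** Returns the key of the value it still needs, or null when it has finished. */
    String step(Map<String, String> available);
}

final class BatchingRunner {
    /** Runs all tasks on the calling thread; returns how many batch fetches were made. */
    static int run(List<ResumableTask> tasks,
                   Function<Set<String>, Map<String, String>> batchFetch) {
        Map<String, String> available = new HashMap<>();
        List<ResumableTask> pending = new ArrayList<>(tasks);
        int fetches = 0;
        while (!pending.isEmpty()) {
            Set<String> wanted = new LinkedHashSet<>();
            // Step 1: run every task until it finishes or asks for a value.
            Iterator<ResumableTask> it = pending.iterator();
            while (it.hasNext()) {
                String key = it.next().step(available);
                if (key == null) {
                    it.remove();               // step 3: drop completed tasks
                } else {
                    wanted.add(key);           // remember the request
                }
            }
            // Step 2: fetch everything requested in this round, all at once.
            if (!wanted.isEmpty()) {
                available.putAll(batchFetch.apply(wanted));
                fetches++;
            }
        }
        return fetches;
    }
}
```

Each task has to keep its own progress between calls to `step`, which is exactly the state-saving burden described above. Two tasks that each need one value finish after a single batched fetch instead of two separate queries.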

constantly check database [duplicate]

I'm using JDBC, need to constantly check the database against changing values.
What I have currently is an infinite loop running, inner loop iterating over a changing values, and each iteration checking against the database.
public void runInBG() { // this method is called from another thread
    while (true) {
        while (els.hasElements()) {
            Test el = (Test) els.next();
            String sql = "SELECT * FROM Test WHERE id = '" + el.getId() + "'";
            // getTestRecord opens a connection, runs executeQuery etc. and returns a Record with the values
            Record r = db.getTestRecord(sql);
            if (r != null) {
                // do something
            }
        }
    }
}
I think this isn't the best way.
The other way I'm thinking is the reverse, to keep iterating over the database.
UPDATE
Thank you for the feedback regarding timers, but I don't think it will solve my problem.
Once a change occurs in the database I need to process the results almost instantaneously against the changing values ("els" from the example code).
Even if the database does not change it still has to check constantly against the changing values.
UPDATE 2
OK, to anyone interested in the answer I believe I have the solution now. Basically the solution is NOT to use the database for this. Load in, update, add, etc... only what's needed from the database to memory.
That way you don't have to open and close the database constantly, you only deal with the database when you make a change to it, and reflect those changes back into memory and only deal with whatever is in memory at the time.
Sure this is more memory intensive but performance is absolute key here.
As to the periodic "timer" answers, I'm sorry but this is not right at all. Nobody has responded with a reason how the use of timers would solve this particular situation.
But thank you again for the feedback, it was still helpful nevertheless.

Another possibility would be using ScheduledThreadPoolExecutor.
You could implement a Runnable containing your logic and register it to the ScheduledExecutorService as follows:
ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(10);
executor.scheduleAtFixedRate(myRunnable, 0, 5, TimeUnit.SECONDS);
The code above creates a ScheduledThreadPoolExecutor with 10 threads in its pool and registers a Runnable with it that runs every 5 seconds, starting immediately.
To schedule your runnable you could use:
scheduleAtFixedRate
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given period; that is executions will commence after initialDelay then initialDelay+period, then initialDelay + 2 * period, and so on.
scheduleWithFixedDelay
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given delay between the termination of one execution and the commencement of the next.
Here you can see the advantages of ThreadPoolExecutor and judge whether it fits your requirements. I also recommend the question "Java Timer vs ExecutorService?" to help you make a good decision.

Keeping the while(true) in runInBG() is a bad idea. You'd better remove it. Instead you can have a Scheduler/Timer (use Timer & TimerTask) which calls runInBG() periodically and checks for updates in the DB.

You could use a Timer:
Timer timer = new Timer("runInBG");
// An instance of the class that contains your repeated method (it must extend TimerTask).
MyClass t = new MyClass();
timer.schedule(t, 0, 2000);

As you said in the comment above, if the application controls the updates and inserts, then you can create a framework which notifies the 'BG' thread or process about changes in the database. Notification can be over the network via JMS, or intra-VM using the observer pattern, or both (local and remote notifications).
You can have a generic notification message like this (it can be a class for local notifications or a text message for remote notifications):
<Notification>
<Type>update/insert</Type>
<Entity>
<Name>Account/Customer</Name>
<Id>id</Id>
</Entity>
</Notification>
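For the intra-VM case, the observer pattern could look roughly like the sketch below. The class and field names are hypothetical and simply mirror the notification message above (type, entity name, id). The code that performs the update or insert would call `publish` right after the database write succeeds, and the background logic registers a listener instead of polling.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Local counterpart of the notification message: what changed, and where. */
final class ChangeNotification {
    enum Type { INSERT, UPDATE }

    final Type type;
    final String entityName;
    final String entityId;

    ChangeNotification(Type type, String entityName, String entityId) {
        this.type = type;
        this.entityName = entityName;
        this.entityId = entityId;
    }
}

/** Implemented by whoever wants to react to database changes. */
interface DbChangeListener {
    void onChange(ChangeNotification notification);
}

/** Registry that fans a change out to every registered listener. */
final class ChangeNotifier {
    // CopyOnWriteArrayList allows listeners to be added while a publish is in progress.
    private final List<DbChangeListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(DbChangeListener listener) {
        listeners.add(listener);
    }

    /** Call right after the application has written the change to the database. */
    void publish(ChangeNotification notification) {
        for (DbChangeListener listener : listeners) {
            listener.onChange(notification);
        }
    }
}
```

Listeners run on the publishing thread in this sketch, so slow processing should be handed off to another thread inside the listener.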

To avoid a 'busy loop', I would try to use triggers. H2 also supports a DatabaseEventListener API, that way you wouldn't have to create a trigger for each table.
This may not always work, for example if you use a remote connection.

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and
executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency
possible for any given task via specially optimized notifications to the scheduler.
Thus, in the case that a queue has a large backlog of tasks, the
system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the
earliest time that a task can execute. App Engine always waits until
after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum
number of seconds to wait before executing a task. Countdown and eta
are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 sec, 1 minute or 1 hour). The order of the event processing has to be assured. I need to process events in timestamp order, and the timestamp is generated on an embedded device inside the vehicle.
What do I have now?
A Rest servlet that is called by the consumer and creates a Task (Event data is on payload).
After this, a worker servlet gets this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure just FIFO behavior? Or how can I improve this solution to get this?

You need to approach this with three separate steps:
Implement a Sharding Counter to generate a monotonically
increasing ID. As much as I like to use the timestamp from
Google's server to indicate task ordering, it appears that timestamps
between GAE servers might vary more than what your requirement is.
Add your tasks to a Pull Queue instead of a Push Queue. When
constructing your TaskOption, add the ID obtained from Step #1 as a tag.
After adding the task, store the ID somewhere on your datastore.
Have your worker servlet lease Tasks by a certain tag from the Pull Queue.
Query the datastore to get the earliest ID that you need to fetch, and use the ID as
the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumptions on the Backend.
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be a viable way to generate monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles. Would it be possible to use these in lieu of a monotonically increasing ID?
Regardless of the way to generate the IDs, the basic idea is to use datastore's query mechanism to produce a FIFO ordering of Tasks, and use task Tag to pull specific task from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.

I think the simple answer is "no", however partly in order to help improve the situation, I am using a pull queue - pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore and then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and ends of the batch - because they might be out of order with interleaving tasks in other batches.

Ok. This is how I've done it.
1) Rest servlet that is called from the consumer:
If Event sequence doesn't match Vehicle sequence (from datastore)
    Creates a task on a "wait" queue to call me again
else
    State validation
    Creates a task on the "regular" queue (Event data is on payload).
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue in order to do a data maintenance without losing events.
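The routing decision in step 1 could be sketched as below. This is only an illustration: `SequenceGate` is a hypothetical name, an in-memory map stands in for the Vehicle sequence kept in the datastore, and it assumes each vehicle numbers its events consecutively starting at 0, which the post does not state.

```java
import java.util.HashMap;
import java.util.Map;

/** Decides which queue an incoming event goes to, based on its sequence number. */
final class SequenceGate {
    enum Queue { REGULAR, WAIT }

    // Assumption: stands in for the per-vehicle sequence stored in the datastore.
    private final Map<String, Long> nextExpected = new HashMap<>();

    /** Returns REGULAR if the event is the next one for its vehicle, WAIT otherwise. */
    synchronized Queue route(String vehicleId, long eventSequence) {
        long expected = nextExpected.getOrDefault(vehicleId, 0L);
        if (eventSequence != expected) {
            return Queue.WAIT;                 // out of order: retry later
        }
        nextExpected.put(vehicleId, expected + 1);
        return Queue.REGULAR;                  // in order: hand to the worker
    }
}
```

An event that arrives early is routed to the wait queue and offered again later. Once the missing earlier event has been accepted, the retried event matches the expected sequence and moves to the regular queue.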
Thank you for your answers. My solution is a mix of them.

You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.

Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function might execute in order submitted. Likely you will need an engineer from G. to get an answer. Pull queues as suggested seem a good alternative, plus this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.

The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
When creating the task, don't put the data on the payload but in the datastore, in a Kind with an ordering on timestamps and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
Run a generic task that fetches the data for the first timestamp, processes it, and then deletes it, inside a transaction.
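The two steps above can be illustrated with an in-memory stand-in for the datastore Kind. This is not App Engine code: `OrderedEventStore` is a hypothetical name, a sorted map per vehicle plays the role of the timestamp-ordered child entities, and events are assumed to have unique timestamps per vehicle (a second event with the same timestamp would overwrite the first here).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Keeps each vehicle's events sorted by the timestamp generated on the device. */
final class OrderedEventStore {
    // vehicleId -> (device timestamp -> payload), sorted by timestamp
    private final Map<String, ConcurrentSkipListMap<Long, String>> byVehicle =
            new ConcurrentHashMap<>();

    /** Step 1: store the event instead of putting it on the task payload. */
    void save(String vehicleId, long deviceTimestamp, String payload) {
        byVehicle.computeIfAbsent(vehicleId, id -> new ConcurrentSkipListMap<>())
                .put(deviceTimestamp, payload);
    }

    /** Step 2: remove and return the earliest event for the vehicle, or null if none. */
    String takeEarliest(String vehicleId) {
        ConcurrentSkipListMap<Long, String> events = byVehicle.get(vehicleId);
        if (events == null) {
            return null;
        }
        Map.Entry<Long, String> first = events.pollFirstEntry();
        return first == null ? null : first.getValue();
    }
}
```

Arrival order no longer matters with this layout: the generic task always processes whatever is earliest for the vehicle at the time it runs. An event that arrives after a later one has already been processed is still handled out of order, so this only helps when late arrivals are bounded.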

Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received, or on a per-vehicle basis. Latter has more options vs. former.
