Efficiently update an element in a DelayQueue - java

I am facing a similar problem as the author in:
DelayQueue with higher speed remove()?
The problem:
I need to process continuously incoming data and check whether the data has been seen within a certain timeframe before. Therefore I calculate a unique ID for incoming data and add the data, indexed by that ID, to a map. At the same time I store the ID and the timeout timestamp in a PriorityQueue, giving me the ability to efficiently check for the next ID to time out. Unfortunately, if the data comes in again before the specified timeout, I need to update the timeout stored in the PriorityQueue. So far I have just removed the old ID and re-added it along with the new timeout. This works well, except that the time-consuming remove method becomes a bottleneck once my PriorityQueue grows beyond 300k elements.
Possible Solution:
I just thought about using a DelayQueue instead, which would make it easier to wait for the first data to time out. Unfortunately, I have not found an efficient way to update the timeout of an element stored in such a DelayQueue without facing the same problem as with the PriorityQueue: the remove method!
Any ideas on how to solve this problem in an efficient way even for a huge Queue?

This actually sounds a lot like a Guava Cache, which is a concurrent on-heap cache supporting "expire this long after the most recent lookup for this entry." It might be simplest just to reuse that, if you can use third-party libraries.
Failing that, the approach that implementation uses looks something like this: it has a hash table, so entries can be efficiently looked up by their key, but the entries are also in a concurrent, custom linked list -- you can't do this with the built-in libraries. The linked list is in the order of "least recently accessed first." When an entry is accessed, it gets moved to the end of the linked list. Every so often, you look at the beginning of the list -- where all the least recently accessed entries live -- and delete the ones that are older than your threshold.
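If the Guava route is an option, a minimal sketch of that usage might look like the following (the 10-minute window, the String key, and the byte[] value type are just placeholders for your own ID and data types):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.TimeUnit;

    class DedupWindow {
        // Entries expire a fixed time after their last access, so re-seeing an ID
        // simply refreshes its timeout -- no explicit remove() is ever needed.
        private final Cache<String, byte[]> seen = CacheBuilder.newBuilder()
                .expireAfterAccess(10, TimeUnit.MINUTES)
                .build();

        /** Returns true if this ID was already seen within the window; the lookup refreshes its expiry. */
        boolean seenBefore(String id, byte[] data) {
            byte[] previous = seen.getIfPresent(id);
            if (previous == null) {
                seen.put(id, data);
                return false;
            }
            return true;
        }
    }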

Related

Losing old aggregated records after adding repartitioning in Kafka

I had a Kafka stream with the following operations:
stream.mapValues.groupByKey.aggregate. The aggregation basically adds records to a list.
Now I changed the implementation to this:
stream.flatMap.groupByKey.aggregate. The flatMap duplicates each record: the first record is exactly the same as in the old implementation and the second one has its key changed. So after the change of implementation repartitioning happens, whereas before it didn't (and that's fine). My problem is that after releasing the change, the old aggregated records for the old key disappeared. From the moment of the change everything works as it should, but I don't understand this behaviour. As I understand it, since I did not change the key, records should land on the same partition as before and the aggregation should continue adding messages to the old list, not start from the beginning. Could anyone help me understand why this is happening?
If you change your processing topology, in general, you need to reset your application and reprocess all data from the input topic to recompute the state.
In your case, I assume that the aggregation operator has a different name after the change and thus, does not "find" its local state and changelog topic any longer.
You can compare the names for both topologies via Topology#describe().
To allow for a smooth upgrade, you would need to provide a fixed name for aggregate() via Materialized.as(...). If you provide a fixed name (i.e., the same in the old and new topology), the issue goes away. However, because your original topology did not provide a fixed name, it's hard to get out of the situation.
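For illustration, a rough sketch of what a fixed store name looks like (Kafka Streams 2.x API assumed; the topic name, value types, and the store name "records-agg" are made up, the flatMap step is omitted, and real code would also need value serdes for the list type):

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> stream = builder.stream("input-topic");

    KTable<String, List<String>> aggregated = stream
            .groupByKey()
            .aggregate(
                    ArrayList::new,                                        // initializer: empty list
                    (key, value, agg) -> { agg.add(value); return agg; },  // add each record to the list
                    Materialized.<String, List<String>, KeyValueStore<Bytes, byte[]>>as("records-agg"));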

Efficient update of sorted JavaFX ObservableList

I have a Java ObservableList with thousands of entries that receives hundreds of updates a second backing a JavaFX TableView.
The ObservableList is backed by an ArrayList. Arbitrary sort orders can be applied to the list. The updates may change the sort order of a single entity in the list. I have performance issues if I try to perform a sort after each update, so currently I have a background task that performs a sort every second. However, I'd like to try to sort in real time if possible.
Assuming that the list is already sorted and I know the index of the element to change, is there a more efficient way to update the index of the element than calling sort on the list again?
I've already determined I can use Collections.binarySearch() to efficiently find the index of the element to update. Is there also a way I can efficiently find the index the updated element needs to move to and shift the ArrayList so it remains in order?
I also need to handle add and remove operations, but those are far less common.
Regarding your answer, FXCollections.sort() should be even faster because it handles the FX-Properties better and is specifically written for ObservableLists.
I would use a TreeSet. It can update the order with O(log N) time complexity, whereas keeping a sorted ArrayList means an O(n) insertion per entry because elements have to be shifted.
A few suggestions when dealing with sorting on a JavaFX ObservableList/TableView combo:
Ensure your model class includes Property accessors.
Due to a weird quirk in the JavaFX 2.2 implementation that is not present in JavaFX 8+, TableView is far less efficient when dealing with large data models that do not have property accessors than it is when dealing with those that do include property accessor functions. See JavaFx tableview sort is really slow how to improve sort speed as in java swing for more details.
Perform bulk changes on the ObservableList.
Each time you modify an ObservableList that is being observed, the list change listeners on the list are fired to communicate the permutations of the change to the observers. By reducing the number of modifications you make on the list, you can cut down on the number of change events which occur and hence on the overhead of observer notification and processing.
An example technique for this might be to keep a mirror copy of the list data in a standard non-observable list, sort that data, then set that sorted data into the observable list in a single operation, as sketched below.
To avoid premature optimization issues, only do this sort of optimization if the operation is initially slow and the optimization itself provides a significant measurable improvement.
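A minimal sketch of that mirror-and-set approach, assuming a Record model class and a comparator representing the current sort order:

    // Sort a plain mirror list, then push the result into the ObservableList in one
    // setAll() call so observers see a single change event instead of many.
    List<Record> mirror = new ArrayList<>(observableList);
    mirror.sort(comparator);
    observableList.setAll(mirror);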
Don't update the ObservableList more often than necessary.
JavaFX display framerate is capped, by default, at 60fps. Updating visible components more often than once a pulse (a frame render trigger) is unnecessary, so batch up all of your changes for each pulse.
For example, if you have a new record coming in every millisecond, collate all records that come in every 20 milliseconds and apply those changes all at once, as sketched below.
To avoid premature optimization issues, only do this sort of optimization if the operation is initially slow and the optimization itself provides a significant measurable improvement.
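One way to batch per pulse, as a sketch (AnimationTimer and ConcurrentLinkedQueue are standard classes; Record and applyBatch are placeholders for your model and your update logic):

    // Producers (e.g. network threads) enqueue records; the timer drains them once per frame.
    Queue<Record> pending = new ConcurrentLinkedQueue<>();

    AnimationTimer flusher = new AnimationTimer() {
        @Override
        public void handle(long now) {        // invoked on the FX thread once per pulse
            List<Record> batch = new ArrayList<>();
            Record r;
            while ((r = pending.poll()) != null) {
                batch.add(r);
            }
            if (!batch.isEmpty()) {
                applyBatch(batch);            // e.g. update the backing list and re-sort once
            }
        }
    };
    flusher.start();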
Java 8 contains some new classes to assist in using sorted content in tables.
I don't really know how the TableView sorting function and SortedList work in Java 8. You can request that Oracle write a tutorial with samples and best practices for the Java 8 TableView sort feature by emailing jfx-docs-feedback_ww@oracle.com
For further reference, see the javadoc:
the sorting section of the JavaFX 8 TableView javadoc.
new SortEvent class.
SortedList class.
What is not quite clear is whether you need the list to be sorted the whole time. If you sort it just in order to retrieve and update your entries quicker, you can do that faster using a HashMap. You can create a HashMap<YourClass, YourClass> if you implement proper hashCode() and equals() methods on the key fields in the class. If you only need to output a sorted list occasionally, also implement the Comparable<YourClass> interface and just create a TreeSet<YourClass>( map.keySet() ); that will create a sorted representation while the data in your HashMap stays in place. If you need it sorted at all times, you can consider using a TreeMap<YourClass,YourClass> instead of a HashMap. Maps are easier than Sets because they provide a way to retrieve the objects.
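A short sketch of that idea, keeping the poster's placeholder type YourClass:

    // Fast lookup/update via hashing; equals()/hashCode() must be based on the key fields.
    Map<YourClass, YourClass> index = new HashMap<>();
    index.put(incoming, incoming);
    YourClass existing = index.get(incoming);

    // Occasional sorted view; relies on YourClass implementing Comparable<YourClass>.
    SortedSet<YourClass> sorted = new TreeSet<>(index.keySet());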
After some research, I concluded that Collections.sort() is pretty fast, even for 1 item. I haven't found a more efficient way than to update the item in the list and just call sort. I can't use a TreeSet since the TableView relies on the List interface and I'd have to rebuild the TreeSet every time the sort order is changed.
I found that I could update at 60 FPS by using a Timer or KeyFrame and still have reasonable performance. I haven't found a better solution without upgrading to JavaFX 8.
You could pull the element out of the array list and insert (in sorted order) the updated element.
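A minimal sketch of that remove-and-reinsert approach, using the Collections.binarySearch the question already mentions (list, element, comparator, and the update call are placeholders):

    // Remove the element while it still sits at its old sorted position, mutate it,
    // then re-insert it at the index binarySearch reports for the new value.
    int oldIndex = Collections.binarySearch(list, element, comparator);
    list.remove(oldIndex);

    element.update(newValue);                 // placeholder for whatever changes the sort key

    int insertAt = Collections.binarySearch(list, element, comparator);
    if (insertAt < 0) {
        insertAt = -insertAt - 1;             // binarySearch encodes the insertion point this way
    }
    list.add(insertAt, element);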

Reindex task/mapper/job for AppEngine Java

Does anybody know of a library or good code sample that could be used to re-index all/some entities in all/some namespaces ?
If I implement this on my own, is MapReduce what I should consider ?
"I need to re-index ?" feels like a problem many developers have run into but the closest I could find is this, which may be a good start ?
Other option is a homebrewn solution using Task Queues that iterate the datastore namespaces and entities but I'd prefer not the re-invent the wheel and go for a robust, proven solution.
What are the options ?
I'm afraid I don't know of any pre-built system. I think you basically need to create a cursor to iterate through all your entities and then do a get and a put on all of them (or optionally check if they're in the index before doing the put - if you have some that won't need updating, that would save you a write at the cost of a read and/or a small operation).
Follow the example here:
https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Cursors
Create a java.util.concurrent.SynchronousQueue to hold batches of datastore keys.
Create 10 new consumer threads (the current limit) using ThreadManager:
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/ThreadManager
Those threads should do the following:
Create a new objectify instance and turn off the session cache and memcache for objectify.
Get a batch of keys from the SynchronousQueue.
Fetch all of those entities using a batch get.
Optionally do a keys-only query for all those entities using the relevant property.
Put all those entities (or exclude the ones that were returned above).
Repeat from step 2.
In a loop, fetch the next 30 keys using a keys-only cursor query and put them into the SynchronousQueue.
Once you've put all of the items into the SynchronousQueue, set a property to stop all the consumer threads once they've done their work.
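Very roughly, the consumer side of that recipe might look like the sketch below. This is only an outline under assumptions: it uses Objectify's ofy() helper and a made-up MyEntity type, skips the session-cache/memcache configuration from step 1, and omits error handling.

    SynchronousQueue<List<Key<MyEntity>>> batches = new SynchronousQueue<>();
    AtomicBoolean finished = new AtomicBoolean(false);

    ThreadFactory factory = ThreadManager.currentRequestThreadFactory();
    for (int i = 0; i < 10; i++) {                          // the current thread limit mentioned above
        factory.newThread(() -> {
            try {
                while (!finished.get()) {
                    List<Key<MyEntity>> batch = batches.poll(1, TimeUnit.SECONDS);
                    if (batch == null) {
                        continue;                           // nothing handed off yet
                    }
                    Map<Key<MyEntity>, MyEntity> loaded = ofy().load().keys(batch);   // batch get
                    ofy().save().entities(loaded.values()).now();                     // re-put triggers re-indexing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
    }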

Deleting multiple items from a Redis hash, based on a certain value

What is the most efficient way to delete a bunch of items from a hash, based on whether an item's value contains a specific substring or not? As far as I know, there is not really a way to do this in one simple block. I have to literally grab all the values of that hash into a Java list, then iterate over this list until I find what I need, then delete its key from the hash, and repeat the same procedure over and over again.
Another approach I tried was to put id references to the hash items in a separate list, so that later on, with a single call, I could grab a list of ids for items which should be deleted. That was a bit better, but still, the Redis client I use (Jedis) does not support the deletion of multiple hash keys, so again I am left with my hands tied.
Redis does not support referential integrity, right? That would mean the keys stored in the Redis list are treated as references to the items in the hash, so deleting the list would delete the corresponding items from the hash. There is nothing like that in Redis, right?
I will have to go through this loop and delete every single item separately. I wish at least there was something like a block, where I could collect all 1000 commands, and send them in one entire call, rather than 1000 separate ones.
I wish at least there was something like a block, where I could collect all 1000 commands, and send them in one entire call, rather than 1000 separate ones.
That's what transactions are for: http://redis.io/topics/transactions
Using a pipeline would allow commands from other connected clients to be issued in between your pipelined commands, since pipelining only guarantees that your client issues commands without waiting for replies; it gives no guarantee of atomicity.
Commands in a transaction (i.e. between MULTI/EXEC) are issued atomically, which I presume is what you want.
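A small sketch with Jedis (the hash key "myhash" and the fieldsToDelete collection are placeholders; newer Jedis versions implement Closeable, hence the try-with-resources):

    // Queue every HDEL inside MULTI/EXEC so the deletions are applied atomically.
    try (Jedis jedis = new Jedis("localhost")) {
        Transaction tx = jedis.multi();
        for (String field : fieldsToDelete) {
            tx.hdel("myhash", field);
        }
        tx.exec();
    }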
Deleting the ids in a Redis List will not affect the Redis Hash Fields. To speed things up consider pipelining. Jedis supports that...
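For comparison, a pipelined sketch with Jedis (same placeholders as above). Note that since Redis 2.4, HDEL also accepts multiple fields in a single command, which Jedis exposes as a varargs hdel, so very often one call is enough:

    // Pipeline variant: commands are buffered client-side and flushed in one go by sync().
    Pipeline p = jedis.pipelined();
    for (String field : fieldsToDelete) {
        p.hdel("myhash", field);
    }
    p.sync();

    // Or, with a recent Redis/Jedis, a single variadic HDEL:
    jedis.hdel("myhash", fieldsToDelete.toArray(new String[0]));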

java efficient de-duplication

Let's say you have a large text file. Each row contains an email id and some other information (say a product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use Map&Reduce framework (e.g. Hadoop). This is a full-blown distributed computing so it's an overkill unless you have TBs of data though. ( j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is a ConcurrentSkipListSet. In that case, you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method, and put them all in a HashSet (a sketch follows below).
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put the unique constraint on the DB anyway...
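A sketch of the in-memory HashSet approach; the Row class, field names, and the comma-separated layout are assumptions about the file format:

    // equals()/hashCode() use only the email, so later rows with the same address are dropped.
    final class Row {
        final String email;
        final String productId;

        Row(String email, String productId) {
            this.email = email;
            this.productId = productId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Row && ((Row) o).email.equals(email);
        }

        @Override
        public int hashCode() {
            return email.hashCode();
        }
    }

    Set<Row> unique = new HashSet<>();
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",", 2);     // assumed comma-separated layout
            unique.add(new Row(parts[0], parts[1])); // first occurrence wins
        }
    }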
I will start with the obvious answer. Make a HashMap and put the email id in as the key and the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check to see if the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options,
do it in Java: you could put together something like a HashSet for testing - adding an email id for each item that comes in if it doesn't already exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record linkage engine written in Java. It uses Lucene to index and reduce the number of comparisons (to avoid the unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then reading via the index should make duplicates of either email or email+prodId easy to identify: read sequentially and simply compare against the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
Load your data into an import schema;
Do whatever transformations you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.
