What is the most efficient way to delete a bunch of items from a hash, based on whether an item's value contains a specific substring or not? As far as I know, there is no way to do this in one simple block. I have to grab all the values of that hash into a Java list, iterate over this list until I find what I need, delete its key from the hash, and repeat the same procedure over and over again.
Another approach I tried was to put id references to the hash items in a separate list, so that later on, with a single call, I could grab a list of ids for the items which should be deleted. That was a bit better, but still, the Redis client I use (Jedis) does not support the deletion of multiple hash fields, so again I am left with my hands tied.
Redis does not support referential integrity, right? By that I mean: if the keys stored in the Redis list were treated as references to the items in the hash, then deleting the list would also delete the corresponding items from the hash. There is nothing like that in Redis, right?
I will have to go through this loop and delete every single item separately. I wish at least there was something like a block, where I could collect all 1000 commands and send them in one call, rather than 1000 separate ones.
That's what transactions are for: http://redis.io/topics/transactions
Using a pipeline would allow commands from other connected clients to be interleaved between your pipelined commands, since pipelining only guarantees that your client issues commands without waiting for replies; it gives no guarantee of atomicity.
Commands in a transaction (i.e. between MULTI and EXEC) are executed atomically, which I presume is what you want.
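For illustration, a minimal sketch of a MULTI/EXEC block with Jedis; the hash key, the field list, and the helper name are placeholders, not anything from the original post:

    import java.util.List;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Transaction;

    public class HashFieldCleaner {
        // Queue one HDEL per field and apply them all in a single atomic EXEC.
        static void deleteFieldsAtomically(Jedis jedis, String hashKey, List<String> fields) {
            Transaction tx = jedis.multi();   // commands after this are queued, not executed
            for (String field : fields) {
                tx.hdel(hashKey, field);
            }
            tx.exec();                        // all queued HDELs run atomically on the server
        }
    }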
Deleting the ids in a Redis list will not affect the Redis hash fields. To speed things up, consider pipelining. Jedis supports that...
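A rough sketch of the pipelined scan-and-delete for the original question (delete every field whose value contains a substring); the key name and the helper are assumptions:

    import java.util.Map;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Pipeline;

    public class HashValueFilter {
        // Read the hash once, then send all HDELs in one round trip instead of one call per field.
        static void deleteFieldsByValueSubstring(Jedis jedis, String hashKey, String substring) {
            Map<String, String> all = jedis.hgetAll(hashKey);
            Pipeline pipe = jedis.pipelined();
            for (Map.Entry<String, String> entry : all.entrySet()) {
                if (entry.getValue().contains(substring)) {
                    pipe.hdel(hashKey, entry.getKey());   // queued locally
                }
            }
            pipe.sync();                                  // flush everything and wait for replies
        }
    }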
Related
I have a pipeline that takes URLs for files, downloads them, and generates BigQuery table rows for each line apart from the header.
To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.
For this to work I need to store the history either in a database that enforces unique values, or perhaps more easily in BigQuery as well, but then access to that table must be strictly serial.
Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?
(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each single file gives rise to 10,000-10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)
Beam is designed for parallel processing of data and it tries to explicitly stop you from synchronizing or blocking except using a few built-in primitives, such as Combine.
It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.
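For illustration, a minimal sketch with the Beam Java SDK, assuming the URLs arrive as a PCollection of strings (the sample input is made up):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Distinct;
    import org.apache.beam.sdk.values.PCollection;

    public class DistinctUrls {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // In the real pipeline the URLs would come from your actual source.
            PCollection<String> urls = p.apply(Create.of(
                    "http://example.com/a.csv",
                    "http://example.com/a.csv",   // duplicate, dropped by Distinct
                    "http://example.com/b.csv"));

            // Distinct emits each distinct URL exactly once, so downstream
            // download/parse steps never see the same URL twice within this run.
            PCollection<String> uniqueUrls = urls.apply(Distinct.<String>create());

            p.run().waitUntilFinish();
        }
    }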
I have some persistent data in an RDBMS and in CSV files (they are independent objects, but I mention it because they live in different mediums; I cannot rely on what the RDBMS provides, and in fact I do not want to make a trip to the database for the next hour even if the data gets stale). I need to store the data in memory for the performance benefit, query the objects (read only, no other operations) based on multiple columns, and refresh the data every hour.
In my case, what is a good way to store and query in-memory objects other than implementing my own object store and query methods? For instance, can you provide an example/link to something that would replace an SQL query such as
select * from employees where emplid like '%input%' or surname like '%input%' or email like '%input%';
Sorry for the dummy query, but it shows what kinds of queries are possible.
Go find yourself a key store implementation with the features you want. Use your query string as the key and the result as the value. Caffeine (https://github.com/ben-manes/caffeine) has quite a few features, including entry timeouts (like an hour).
For my own work, I use an LRU key store (limited to X entries) containing objects with the timeout information, and I manually decide whether a record is stale before I use it. LRU is basically a linked list that moves "read" records to the head of the list and drops the tail when records are added beyond the maximum desired size. This keeps the popular records in the store longer.
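A minimal sketch of that idea with Caffeine, keyed by the raw query input; the class, the loader method, and the one-hour/10k-entry settings are assumptions:

    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.LoadingCache;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class EmployeeQueryCache {
        // Key: the user's search input; value: the matching rows loaded from the RDBMS/CSV.
        private final LoadingCache<String, List<String>> results = Caffeine.newBuilder()
                .maximumSize(10_000)                  // LRU-style cap, like the key store above
                .expireAfterWrite(1, TimeUnit.HOURS)  // entries go stale after an hour
                .build(this::queryBackingStore);      // loader runs only on a cache miss

        public List<String> search(String input) {
            return results.get(input);                // served from memory after the first call
        }

        private List<String> queryBackingStore(String input) {
            // Hypothetical: run the LIKE-style query against the database / scan the CSV here.
            return List.of();
        }
    }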
I am facing a similar problem as the author in:
DelayQueue with higher speed remove()?
The problem:
I need to process continuously incoming data and check whether the data has been seen within a certain timeframe before. Therefore I calculate a unique ID for the incoming data and add the data, indexed by this ID, to a map. At the same time I store the ID and the timeout timestamp in a PriorityQueue, giving me the ability to efficiently check for the next ID to time out. Unfortunately, if the data comes in again before the specified timeout, I need to update the timeout stored in the PriorityQueue. So far I have just removed the old ID and re-added it along with the new timeout. This works well, except that the remove method becomes very expensive once my PriorityQueue grows past 300k elements.
Possible Solution:
I just thought about using a DelayQueue instead, which would make it easier to wait for the first element to time out. Unfortunately, I have not found an efficient way to update a timeout element stored in such a DelayQueue without facing the same problem as with the PriorityQueue: the remove method!
Any ideas on how to solve this problem in an efficient way even for a huge Queue?
This actually sounds a lot like a Guava Cache, which is a concurrent on-heap cache supporting "expire this long after the most recent lookup for this entry." It might be simplest just to reuse that, if you can use third-party libraries.
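For illustration, a minimal sketch of reusing a Guava Cache for the "seen within a timeframe" check; the String id, the Boolean marker, and the 5-minute window are assumptions:

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.TimeUnit;

    public class RecentlySeen {
        // Each entry silently expires this long after its most recent read or write.
        private final Cache<String, Boolean> seen = CacheBuilder.newBuilder()
                .expireAfterAccess(5, TimeUnit.MINUTES)
                .build();

        // Returns true if the id was seen within the window and (re)starts its timeout,
        // replacing the PriorityQueue remove/re-add step (note: check-then-put is not atomic).
        public boolean checkAndMark(String id) {
            boolean wasSeen = seen.getIfPresent(id) != null;
            seen.put(id, Boolean.TRUE);
            return wasSeen;
        }
    }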
Failing that, the approach that implementation uses looks something like this: it has a hash table, so entries can be efficiently looked up by their key, but the entries are also in a concurrent, custom linked list -- you can't do this with the built-in libraries. The linked list is in the order of "least recently accessed first." When an entry is accessed, it gets moved to the end of the linked list. Every so often, you look at the beginning of the list -- where all the least recently accessed entries live -- and delete the ones that are older than your threshold.
I have a polling loop that accesses devices. Some of the devices are linked together, and when one of the linked devices is polled its state must be copied to the other linked device.
I am trying to speed up the process of updating the second linked device. If device 4 is linked to device 5 it is fast, but if device 5 is polled and a change is found, the update has to wait until the polling loop has completed and started again.
My idea is to check, as part of the poll, whether the device is linked so it can be updated immediately. The information on which devices are linked is stored in the database, and making a database call for every poll would, I think, slow down the system dramatically, so what I wanted to do was build a hash table from that database table and check the hash table instead.
So here are my questions:
Would this idea work?
Would the hash table be faster than the database checks?
How often should I recreate the hash table, given that the program will run for weeks at a time?
Is this the best way of doing this, or are there other ways to speed up a polling loop?
Use a hash table if searching is a priority. Hash tables provide very quick lookups by key; searching by value, on the other hand, means scanning every entry.
Use a hash table if you want to be able to remove specific elements (use the Remove method).
Use a hash table when the order in which the elements are stored is irrelevant to you.
Don't use a hash table if you need the elements in some specific order. You cannot rely on how a Hashtable will order its elements.
Don't use a hash table if you need to insert an element at a particular position.
Don't use a hash table if you need to store multiple entries under the same key. All keys inside a hash table must be unique (several keys may map to the same value, though).
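As a rough sketch of the idea in the question (an in-memory copy of the link table, rebuilt on a schedule); every name and the refresh interval here are assumptions:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DeviceLinkCache {
        // deviceId -> linked deviceId, rebuilt from the database periodically.
        private volatile Map<Integer, Integer> links = new ConcurrentHashMap<>();
        private final ScheduledExecutorService refresher =
                Executors.newSingleThreadScheduledExecutor();

        public void start() {
            refresher.scheduleAtFixedRate(this::reload, 0, 1, TimeUnit.HOURS);
        }

        // Called from the polling loop: an O(1) in-memory lookup, no database round trip per poll.
        public Integer linkedDevice(int deviceId) {
            return links.get(deviceId);
        }

        private void reload() {
            Map<Integer, Integer> fresh = new ConcurrentHashMap<>();
            // Hypothetical: SELECT device_id, linked_device_id FROM device_links, then fill 'fresh'.
            links = fresh;   // swap in the new snapshot; readers never block
        }
    }

How often to rebuild depends only on how often the link table actually changes, not on the polling rate; swapping in a fresh map keeps the polling loop lock-free.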
Let's say you have a large text file. Each row contains an email id and some other information (say a product id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data though. ( j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is to use a ConcurrentSkipListSet. In this case, you will implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put the unique constraint on the DB anyway...
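A minimal sketch of the in-memory/line-iterator case above, assuming the email id is the first comma-separated column (the file name, layout, and helper are made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class EmailDedup {
        public static void main(String[] args) throws IOException {
            Set<String> seenEmails = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("rows.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String email = line.split(",", 2)[0];   // assumed layout: email,productId,...
                    if (seenEmails.add(email)) {            // add() returns false for duplicates
                        insertIntoDb(line);                 // hypothetical; batch inserts in practice
                    }
                }
            }
        }

        private static void insertIntoDb(String row) {
            // Placeholder for generating/executing the actual INSERT statements.
        }
    }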
I will start with the obvious answer. Make a HashMap and put the email id in as the key and the rest of the information in as the value (or make an object to hold all the information). When you get to a new line, check whether the key exists; if it does, move on to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options:
Do it in Java: you could put together something like a HashSet for testing - adding the email id for each item that comes in if it doesn't already exist in the set.
Do it in the database: put a unique constraint on the table, such that dups will not be added. An added bonus is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then, reading by index, duplicates of either email or email+prodId should be readily identifiable via sequential reads by simply matching against the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data in an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.