I've been working with LSI and GSI in DynamoDB, but I guess I'm missing something.
I created an index so I could always query for the latest results without using the partition key, only other attributes, and without reading entire items, only the attributes that really matter. With the GSI, at some point my query returns data that is not up-to-date; I understand this is due to the eventual consistency described in the docs (correct me if I'm wrong).
And what about the LSI? Even using ConsistentRead, at some point my data is not queried correctly and the results are not up-to-date. From the docs I understood that LSIs are updated synchronously with their table, and that with the ConsistentRead property set I would always get the latest results, but this is not happening.
I'm using a REST endpoint (API Gateway) to perform inserts into my DynamoDB table (I do some processing before the insertion), so I've been wondering if this has something to do with it. Maybe the code (currently Java) or DynamoDB is slow to update, and since everything seems to work fine in my endpoint I perform the next request too fast; or maybe I have to wait longer before interacting with the table while the index is being updated. However, I have already tested waiting longer and I still receive the same stale results. I'm a bit lost here.
This is the code I'm using to query the index:
QuerySpec spec = new QuerySpec()
        .withKeyConditionExpression("#c = :v_attrib1 and #e = :v_attrib2")
        .withNameMap(new NameMap()
                .with("#c", "attrib1")
                .with("#e", "attrib2"))
        .withValueMap(new ValueMap()
                .withString(":v_attrib1", attrib1Value)
                .withString(":v_attrib2", attrib2Value))
        .withMaxResultSize(1)         // to only bring the latest one
        .withConsistentRead(true)     // is this wrong?
        .withScanIndexForward(false); // what about this one?
I don't know if the Maven library version could interfere, but for the record I'm using 1.11.76 (I know there are many newer versions; if that turns out to be the problem, we'll update).
Thank you all in advance.
After searching for quite some time and running more tests, I finally figured out that the problem was not in the DynamoDB indexes (they work as expected) but in the Lambda functions.
Because I was sending many requests one after another, the indexes had no chance to stay up to date: Lambda functions execute asynchronously (I should have known), so the requests reached the database out of order and my data was not updated properly. We changed our implementation to use atomic counters, which keep the data correct regardless of the number or the order of the requests.
See: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
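To see why atomic counters fixed the ordering problem, here is a small self-contained sketch (a toy model, not the AWS SDK itself): an increment, which is what DynamoDB's `SET cnt = cnt + :inc` update expression performs, is commutative and so survives out-of-order delivery, while a plain overwrite depends on which request arrives last.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class AtomicCounterDemo {
    // Applies a batch of updates the way out-of-order Lambda invocations
    // would: increments commute ("cnt = cnt + :inc" semantics), overwrites
    // ("cnt = :v" semantics) do not.
    static long applyIncrements(List<Long> updates) {
        long counter = 0;
        for (long inc : updates) counter += inc;  // order-independent
        return counter;
    }

    static long applyOverwrites(List<Long> updates) {
        long counter = 0;
        for (long v : updates) counter = v;       // last write wins
        return counter;
    }

    public static void main(String[] args) {
        List<Long> updates = new ArrayList<>(List.of(1L, 2L, 3L, 4L));
        long inOrder = applyIncrements(updates);
        Collections.shuffle(updates, new Random(7)); // racing Lambdas: any order
        System.out.println(applyIncrements(updates) == inOrder);   // true
        System.out.println(applyOverwrites(List.of(1L, 2L, 3L)) ==
                           applyOverwrites(List.of(3L, 2L, 1L)));  // false
    }
}
```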
I had a Kafka Streams topology with the following operations:
stream.mapValues.groupByKey.aggregate. The aggregation basically appends records to a list.
I then changed the implementation to:
stream.flatMap.groupByKey.aggregate. The flatMap duplicates each record: the first record is exactly the same as in the old implementation, and the second one has a changed key. So after the change, repartitioning happens where it didn't before (which is fine). My problem is that after releasing the change, the old aggregated records for the old key disappeared. From the moment of the change everything works as it should, but I don't understand this behaviour. Since I did not change the original key, it should land on the same partition as before, and the aggregation should keep appending messages to the old list rather than starting from scratch. Could anyone help me understand why this happens?
If you change your processing topology, in general, you need to reset your application and reprocess all data from the input topic to recompute the state.
In your case, I assume that the aggregation operator has a different name after the change and thus, does not "find" its local state and changelog topic any longer.
You can compare the names for both topologies via Topology#describe().
To allow for a smooth upgrade, you would need to provide a fixed name for aggregate() via Materialized.as(...). If you provide a fixed name (i.e., the same in the old and new topology), the issue goes away. However, because your original topology did not provide a fixed name, it's hard to get out of this situation.
I'm working on a back-end service which is asynchronous in nature. That is, we have multiple jobs that run asynchronously and write their results to some record.
This record is basically a class wrapping a HashMap of results (the keys are job_ids).
The thing is, I don't want to calculate or know in advance how many jobs are going to run (if I knew, I could cache.invalidate() the key once all the jobs have completed).
Instead, I'd like to have the following scheme:
Set an expiry for new records (i.e. expireAfterWrite)
On expiry, write (actually upsert) the record to the database
If a cache miss occurs, load() is called to fetch the record from the database (if not found, create a new one)
The problem:
I tried to use Caffeine, but the problem is that records aren't expired at the exact time they are supposed to be. I then read this SO answer about Guava's cache, and I gather a similar mechanism applies to Caffeine as well.
So the problem is that a record can "wait" in the cache for quite a while, even though it was already completed. Is there a way to overcome this issue? That is, is there a way to "encourage" the cache to invalidate expired items?
That led me to question my solution. Would you consider my solution good practice?
P.S. I'm willing to switch to other caching solutions, if necessary.
You can have a look at Ehcache with write-behind. It requires more setup effort, but it works quite well.
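The expire-then-upsert scheme from the question can also be sketched without any particular caching library. This is a minimal toy model (not Ehcache or Caffeine): each write schedules its own prompt eviction, and eviction triggers a write-behind callback instead of waiting for a later read to notice the expiry. It simplifies heavily, e.g. re-putting a key does not reset its timer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Minimal sketch: a cache whose entries are promptly evicted when their
// expire-after-write window elapses, invoking a write-behind callback
// (e.g. an upsert to the database) at that moment.
public class PromptExpiryCache<K, V> {
    private final Map<K, V> map = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final long ttlMillis;
    private final Consumer<Map.Entry<K, V>> writeBehind; // e.g. DB upsert

    public PromptExpiryCache(long ttlMillis, Consumer<Map.Entry<K, V>> writeBehind) {
        this.ttlMillis = ttlMillis;
        this.writeBehind = writeBehind;
    }

    public void put(K key, V value) {
        map.put(key, value);
        // Schedule prompt eviction + write-behind at expiry time.
        timer.schedule(() -> {
            V v = map.remove(key);
            if (v != null) writeBehind.accept(Map.entry(key, v));
        }, ttlMillis, TimeUnit.MILLISECONDS);
    }

    public V get(K key) { return map.get(key); }

    public void shutdown() { timer.shutdownNow(); }
}
```

For what it's worth, Caffeine itself can be given a Scheduler (Caffeine.newBuilder().scheduler(...)) so that expired entries are removed promptly rather than lazily on access; combined with a removal listener that upserts to the database, that gets close to the scheme above.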
I'm trying to get 2000 change results from a specific branch with a query request using Gerrit REST API in Java. The problem is that I'm only getting 500 results no matter what I add to the query search.
I have tried the options listed here but I'm not getting the 2000 results that I need. I also read that an admin can increase this limit but would prefer a method that doesn't require this detour.
So what I'm wondering is:
Is it possible to increase the limit without the need to contact the admin?
If not, is it possible to continue or repeat the query in order to get the remaining 1500 results, using a loop that queries the next 500 results after those returned by the previous query, until I finally have 2000 results in total?
When using the list changes REST API, the results are returned as a list of ChangeInfo Elements. If there are more results than were returned, the last entry in that list will have a _more_changes field with value true. You can then query again and set the start option to skip over the ones that you've already received.
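That pagination loop can be sketched as follows. This is a self-contained toy model, not a real HTTP client: queryChanges stands in for GET /changes/?q=...&S=start, the 500-result cap stands in for the server's query limit, and an empty page stands in for the absence of _more_changes.

```java
import java.util.ArrayList;
import java.util.List;

public class GerritPagingDemo {
    static final int SERVER_LIMIT = 500;               // per-query result cap
    static final List<Integer> allChanges = new ArrayList<>();

    // Stand-in for the list-changes endpoint with the S (start) option:
    // returns at most SERVER_LIMIT results beginning at the given offset.
    static List<Integer> queryChanges(int start) {
        int end = Math.min(start + SERVER_LIMIT, allChanges.size());
        return start >= end ? List.of() : allChanges.subList(start, end);
    }

    // Keep querying, advancing the start offset past results already
    // received, until we have enough or the server has no more changes.
    static List<Integer> fetch(int wanted) {
        List<Integer> out = new ArrayList<>();
        while (out.size() < wanted) {
            List<Integer> page = queryChanges(out.size());
            if (page.isEmpty()) break;                 // no _more_changes
            out.addAll(page);
        }
        return out.size() > wanted ? out.subList(0, wanted) : out;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3000; i++) allChanges.add(i);
        System.out.println(fetch(2000).size());        // four pages of 500
    }
}
```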
I want to add a minor workaround to David's great answer.
If you want to crawl Gerrit instances hosted on Google servers (such as Android, Chromium, Golang), you will notice that they block queries with more than 10000 results. You can check this e.g. with
curl "https://android-review.googlesource.com/changes/?q=status:closed&S=10000"
I solved the problem by splitting the list of changes up with after: and before: in the query string, for example like
_url_/changes/?q=after:{2018-01-01 00:00:00.000} AND before:{2018-01-01 00:59:59.999}
_url_/changes/?q=after:{2018-01-01 01:00:00.000} AND before:{2018-01-01 01:59:59.999}
_url_/changes/?q=after:{2018-01-01 02:00:00.000} AND before:{2018-01-01 02:59:59.999}
and so on. I think you get the idea. ;-) Please note that both limits (before: and after:) are inclusive! For each window I use the pagination described by David.
A nice side effect is, that you can track the progress of the crawling.
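Generating those windowed query strings is mechanical; here is a small sketch using java.time (the window length of one hour and the query-string format simply mirror the examples above).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class WindowDemo {
    // Matches the timestamp format used in the example queries above.
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Splits [from, to) into hourly windows; since both after: and before:
    // are inclusive, each window ends 1 ms before the next one begins.
    static List<String> hourlyQueries(LocalDateTime from, LocalDateTime to) {
        List<String> queries = new ArrayList<>();
        for (LocalDateTime t = from; t.isBefore(to); t = t.plusHours(1)) {
            LocalDateTime end = t.plusHours(1).minusNanos(1_000_000);
            queries.add("after:{" + FMT.format(t) + "} AND before:{"
                    + FMT.format(end) + "}");
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> qs = hourlyQueries(
                LocalDateTime.of(2018, 1, 1, 0, 0),
                LocalDateTime.of(2018, 1, 1, 3, 0));
        qs.forEach(System.out::println); // three non-overlapping windows
    }
}
```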
I wrote a small Python tool named "Gerry" to crawl open-source instances. Feel free to use it, adapt it, and send me pull requests!
I had almost the same problem. Since you don't want an admin to increase the query limit, there is no way around your second approach: fire the REST query in a loop with a counter set. That's how I implemented my REST client in Java.
My situation is this: I have the 3 following methods (I'm using couchbase-java-client 2.2 from Scala, and the Couchbase Server version is 4.1):
def findAll() = {
  bucket.query(N1qlQuery.simple(select("*").from(i(DatabaseBucket.USER))))
    .allRows().toList
}

def findById(id: UUID) = {
  Option(bucket.get(id.toString, classOf[RawJsonDocument])).map(i => read[User](i.content()))
}

def upsert(i: User) = {
  bucket.async().upsert(RawJsonDocument.create(i.id.toString, write(i)))
}
Basically, they are insert, find one by id, and find all. I ran an experiment where:
I insert a User and call findById right after; I correctly get the user I inserted.
I insert and call findAll right after; it returns empty.
I insert, wait 3 seconds, and then call findAll; I can find the one I inserted.
From this, I suspect that the N1QL query only searches over a cached layer rather than the "persisted" layer. So, how can I force it to search the "persisted" layer?
In Couchbase 4.0 with N1QL, there are different consistency levels you can specify when querying, which correspond to different costs for updates/changes to propagate through index recalculation. These aren't tied to whether or not data is persisted; rather, it's an option when you issue the query. The default is "not bounded", and to make sure your upsert request is taken into consideration, you'll want to issue the query as "request plus".
To get the effect you're looking for, you'll want to add N1qlParams on your creation of the N1qlQuery by using another form of the simple() method. Add a N1qlParams with ScanConsistency.REQUEST_PLUS. You can read more about this in Couchbase's Developer Guide; there's a Java API example of this. With that change, you won't need a sleep() in there: the system will automatically service the query request once the index recalculation has reached your specified level.
Depending on how you're using this elsewhere in your application, there are times you may want either consistency level.
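The difference between the two levels can be illustrated with a self-contained toy model (this is not the Couchbase client; in the real API you would pass N1qlParams.build().consistency(ScanConsistency.REQUEST_PLUS) to N1qlQuery.simple). Writes land in the data store immediately, but queries read from an index that catches up asynchronously; REQUEST_PLUS waits for the index to catch up to the latest mutation before answering.

```java
import java.util.ArrayList;
import java.util.List;

public class ScanConsistencyDemo {
    enum ScanConsistency { NOT_BOUNDED, REQUEST_PLUS }

    // Toy model: documents are durable immediately, but the index the
    // query reads from lags until the "indexer" catches up.
    static final List<String> dataStore = new ArrayList<>();
    static int indexedUpTo = 0;                 // index high-water mark

    static void upsert(String doc) { dataStore.add(doc); }

    static void indexerCatchUp() { indexedUpTo = dataStore.size(); }

    static List<String> query(ScanConsistency sc) {
        // REQUEST_PLUS blocks until the index reflects all prior mutations;
        // NOT_BOUNDED answers from whatever the index currently holds.
        if (sc == ScanConsistency.REQUEST_PLUS) indexerCatchUp();
        return dataStore.subList(0, indexedUpTo);
    }

    public static void main(String[] args) {
        upsert("user-1");
        System.out.println(query(ScanConsistency.NOT_BOUNDED).size());  // stale
        System.out.println(query(ScanConsistency.REQUEST_PLUS).size()); // fresh
    }
}
```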
You need stronger scan consistency. Add a N1qlParam to the query, using consistency(ScanConsistency.REQUEST_PLUS)
Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which is described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I don't know whether it was designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack, but since I'm new to HBase maybe it's a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is issue the reads that serve your requests with this parameter pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
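The snapshot-read idea can be sketched with a toy model of a single versioned cell (this is not the HBase API; in the real client you would use Get/Scan with setTimeRange): reads return the newest version at or before the chosen timestamp, and "compaction" trims the cell down to its configured maximum number of versions.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SnapshotReadDemo {
    // Toy model of one HBase cell: multiple timestamped versions, trimmed
    // to MAX_VERSIONS at compaction, readable as of a given timestamp.
    static final int MAX_VERSIONS = 3;
    static final NavigableMap<Long, String> cell = new TreeMap<>();

    static void put(long ts, String value) { cell.put(ts, value); }

    // Newest version at or before ts (setTimeRange-style read).
    static String readAsOf(long ts) {
        Map.Entry<Long, String> e = cell.floorEntry(ts);
        return e == null ? null : e.getValue();
    }

    // Compaction discards all but the newest MAX_VERSIONS versions,
    // which is why the snapshot trick fails once old versions rotate out.
    static void compact() {
        while (cell.size() > MAX_VERSIONS) cell.pollFirstEntry();
    }

    public static void main(String[] args) {
        put(100, "snapshot");
        put(200, "batch-update");
        System.out.println(readAsOf(150)); // old state still served
        System.out.println(readAsOf(250)); // current state after the batch
    }
}
```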