Using Amazon S3 as Key Value Store (in production) - java

I am using Wasabi as S3 storage for my project, and I've been thinking of utilizing S3 as key-value storage. Wasabi does not charge for API requests, as noted here: https://wasabi.com/cloud-storage-pricing/
Anyone can easily (in just about any programming language) implement a simple interface like this on top of Amazon S3:
value = store.get(key)
store.put(key, value)
store.delete(key)
where the key is a string and the value is binary data, effectively using S3 as a highly distributed and elastic key-value store.
So one can store a User object for example with
userid:1234567890:username -> johnsmith
userid:1234567890:username:johnsmith:password -> encrypted_password
userid:1234567890:username:johnsmith:profile_picture -> image_binary
userid:1234567890:username:johnsmith:fav_color -> red
Values are serialized into binary.
And so on.
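For illustration, a minimal sketch of that interface in Java (assuming the AWS SDK for Java v2 pointed at a Wasabi/S3-compatible endpoint; the class name is mine):

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3KeyValueStore {
    private final S3Client s3;
    private final String bucket;

    public S3KeyValueStore(S3Client s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    // value = store.get(key)
    public byte[] get(String key) {
        return s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();
    }

    // store.put(key, value)
    public void put(String key, byte[] value) {
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromBytes(value));
    }

    // store.delete(key)
    public void delete(String key) {
        s3.deleteObject(DeleteObjectRequest.builder().bucket(bucket).key(key).build());
    }
}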
I have a few questions for those who have tried to use S3 as either a database or a datastore: what's the best strategy for using Amazon S3 as a key-value store? Although I think it's fairly easy to retrieve the whole user object described here by listing keys with the prefix userid:1234567890 and doing the needed logic in code, the obvious downside is that you can't search by value.
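For example, I imagine fetching the whole user object would look roughly like this (a fragment reusing the S3Client and the store wrapper from the sketch above; the bucket name is illustrative):

import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

// List every key belonging to one user and load its value.
ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
        .bucket("my-kv-bucket")
        .prefix("userid:1234567890:")
        .build();
for (S3Object object : s3.listObjectsV2Paginator(listRequest).contents()) {
    byte[] value = store.get(object.key());
    // ...assemble the user object in code...
}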
What algorithm can be used here to implement a simple key search function, e.g. searching for a user whose username starts with "j" or whose fav_color is "red"? Looking at the very basic get/put key-value interface, I think this is impossible, but maybe someone knows a workaround.
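The only workaround I can imagine is maintaining extra "index" keys myself so that a prefix listing acts as the search; a sketch (the index key layout is my own invention):

// Write empty "index" objects whose keys encode the attribute value.
store.put("index:fav_color:red:1234567890", new byte[0]);
store.put("index:username:johnsmith:1234567890", new byte[0]);

// "Search" then becomes a prefix listing:
//   prefix "index:fav_color:red:"  -> all users whose fav_color is red
//   prefix "index:username:j"      -> all users whose username starts with "j"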
What serialization strategy is best for this kind of key-value store, both for primitive data types (String, Number, Boolean, etc.) and for blob data (images, audio, video, and any other sort of file)? Also, this simple key-value interface has no way to declare what type of value is stored under a key (is it a string, a number, binary data, etc.?); how can that be solved?
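For the type question, one idea would be to record the type in S3 object metadata next to the value rather than inside the value itself; a sketch with the v2 SDK (the metadata key name "value-type" is an arbitrary choice, and s3 is the client from the sketch above):

import java.util.Map;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Store the value's type alongside the value as S3 user metadata.
s3.putObject(PutObjectRequest.builder()
        .bucket("my-kv-bucket")
        .key("userid:1234567890:username:johnsmith:fav_color")
        .contentType("text/plain")
        .metadata(Map.of("value-type", "string"))
        .build(),
        RequestBody.fromString("red"));

// Read the type back without downloading the value.
String type = s3.headObject(HeadObjectRequest.builder()
        .bucket("my-kv-bucket")
        .key("userid:1234567890:username:johnsmith:fav_color")
        .build()).metadata().get("value-type");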
How can transactions be achieved in this kind of scenario? Like in the example above, store the username johnsmith if and only if the other keys are also stored, or not at all. Is S3 Batch Operations enough to solve this?
What are the main design considerations when planning to use this as the main database for an application (in production), both from an algorithmic perspective and considering the limitations of S3 itself?

Related

Store multiple values in a file - best format?

I want to store multiple values (String, Int and Date) in a file via Java in Android Studio.
I don't have much experience in that area, so I tried to Google a bit, but I didn't find the solution I've been looking for. Maybe you can recommend something?
What I've tried so far:
Android offers a SharedPreferences feature, which allows a user to save a primitive value for a key. But I have multiple values for a key, so that won't work for me.
Another option is saving the data as a file on external storage. So far, so good. But I want to keep the file size to a minimum and load the file as fast as possible, and that's where I can't get ahead. If I simply save all values as plain text, I would need to parse the .txt file by hand to load the data, which will take time for many entries.
Is there a possibility to save multiple entries with multiple values for a particular key in an efficient way?
No need to reinvent the wheel. Most probably the best option for your case is using a database. Look into SQLite or Realm.
You don’t divulge enough details about your data structure or volume, so it is difficult to give a specific solution.
Generally speaking, you have these three choices.
Serialize a collection
I have multiple values for a key
You could use a Map with a List or Set as its value. This has been discussed countless times on Stack Overflow.
Then use serialization to write that map to storage and read it back.
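A minimal sketch of that approach (assuming the values are Serializable; the file name is just an example):

import java.io.*;
import java.util.*;

public class SaveMap {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Map<String, List<String>> data = new HashMap<>();
        data.put("entry1", Arrays.asList("John", "42", "2024-01-01"));

        // Write the whole map to a file in one call...
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("data.bin"))) {
            out.writeObject(data);
        }

        // ...and read it back.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("data.bin"))) {
            @SuppressWarnings("unchecked")
            Map<String, List<String>> restored = (Map<String, List<String>>) in.readObject();
            System.out.println(restored);
        }
    }
}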
Text file
Write a text file.
Use Tab-delimited or CSV format if appropriate. I suggest using the Apache Commons CSV library for that.
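A short sketch of writing and reading with Commons CSV (the column names and file name are just examples; the API shown is Commons CSV 1.x):

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class CsvExample {
    public static void main(String[] args) throws IOException {
        // Write records with a header row...
        try (CSVPrinter printer = new CSVPrinter(new FileWriter("data.csv"),
                CSVFormat.DEFAULT.withHeader("name", "count", "date"))) {
            printer.printRecord("John", 42, "2024-01-01");
        }

        // ...and read them back by column name.
        try (FileReader reader = new FileReader("data.csv")) {
            for (CSVRecord record : CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {
                System.out.println(record.get("name") + " -> " + record.get("count"));
            }
        }
    }
}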
Database
If you have much data, or concurrency issues with multiple threads, use a database such as the H2 Database Engine.

Multiple keys pointing to a single value in Redis (Cache) with Java

I want to store multiple keys with a single value using jedis (Redis cache) with Java.
I have three keys, like user_1, driver_10, and admin_5, with value = "this is user", and I want to get the value by using any one of those three keys.
Having multiple keys point to the same value is not supported in Redis for now; see issue #2668.
You would need a workaround.
Some ideas below, possibly obvious or stupid :)
Maybe have an intermediate key:
- user_10 → id_123
- driver_5 → id_123
- id_123 → data_that_you_dont_want_to_duplicate
You could implement that logic in your client code, or in custom Lua scripts on the server, and have your client code use those scripts (but I don't know enough about that to provide details).
If you implement the indirection logic on the client side, and if accesses are unbalanced, for example you access the data via the user key 99% of the time and via the driver key 1% of the time, it might be worth avoiding two client-server round trips for the 99% case. For this you can encode redirections: for example, if the first character is # then the rest is the data, and if the first character is * then the rest is the actual key.
user_10 → #data_that_you_dont_want_to_duplicate
driver_5 → *user_10
Here is a Lua script that can save on traffic and pull the data in one call (using the intermediate-key layout from the first idea):
eval "return redis.call('get',redis.call('get',KEYS[1]))" 1 user_10
The above will return the requested data.
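The same double lookup can also be done client-side with Jedis; a rough sketch (key names follow the example above, and the connection parameters are placeholders):

import redis.clients.jedis.Jedis;

public class RedisIndirection {
    // Resolve a key that may be an alias pointing at the canonical key:
    // the alias stores the canonical key's name, the canonical key stores the data.
    static String getThroughAlias(Jedis jedis, String key) {
        String canonicalKey = jedis.get(key);   // e.g. user_10 -> id_123
        return canonicalKey == null ? null : jedis.get(canonicalKey); // id_123 -> data
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("id_123", "this is user");
            jedis.set("user_10", "id_123");
            jedis.set("driver_5", "id_123");
            System.out.println(getThroughAlias(jedis, "driver_5")); // prints "this is user"
        }
    }
}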

How to search for file contents in an Amazon S3 bucket without downloading the file

I have a number of files uploaded to Amazon S3, and I need to search those files based on the occurrence of a string in their contents. I tried one method: downloading the files from the S3 bucket, converting the input stream to a string, and then searching for the word in the content. But if there are more than five or six files, this takes a lot of time.
Is there any other way to do this? Please help; thanks in advance.
If your files contain CSV, TSV, JSON, Parquet or ORC, you can take a look at AWS's Athena: https://aws.amazon.com/athena/
From their intro:
Amazon Athena is a fast, cost-effective, interactive query service that makes it easy to analyze petabytes of data in S3 with no data warehouses or clusters to manage.
Unlikely to help you though as it sounds like you have plain text to search through.
Thought I'd mention it as it might help others looking to solve a similar problem.
Nope!
If you can't infer where the matches are from object metadata (like, the file name), then you're stuck with downloading & searching manually. If you have spare bandwidth, I suggest downloading a few files at a time to speed things up.
In a single word: NO!!
One thing you can do to improve performance is to cache the files locally so that you don't have to download them again and again. You can probably use the Last-Modified header to check whether the local copy is stale and only then download the file again.
My suggestion, since you seem to own the files, is to index them manually, based on content. If there are a lot of "keywords", or a lot of metadata associated with each file, you can help yourself by using a lightweight database, where you will perform your queries and get the exact file(s) users are looking for. This will preserve bandwidth and also be much faster, at the cost of maintaining a kind of "indexing" system.
Another option (if each file does not contain much metadata) would be to reorganize the files in your buckets, adding prefixes which would "auto-index" them, like follows:
/foo/bar/randomFileContainingFooBar.dat
/foo/zar/anotherRandomFileContainingFooZar.dat.
This way you might end up scanning the whole bucket in order to find the set of files you need (this is why I suggested this option only if you have little metadata), but you will only download the matching ones, which is still much better than your original approach.
Yes, now it is possible with AWS S3 Select, if your objects are stored in CSV, JSON, or Apache Parquet format.
AWS details: https://aws.amazon.com/blogs/developer/introducing-support-for-amazon-s3-select-in-the-aws-sdk-for-javascript/
Aws S3 Select getting started examples: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html
Just in case anyone is looking for the same thing, here's an example with the SDK.
If you have a CSV like this:
user_name,age
jsrocks,13
node4life,22
esfuture,29
...
And we would like, for example, to retrieve something like:
SELECT user_name FROM S3Object WHERE cast(age as int) > 20
Then with the AWS SDK for JavaScript we do the following:
const S3 = require('aws-sdk/clients/s3');

const client = new S3({
  region: 'us-west-2'
});

const params = {
  Bucket: 'my-bucket',
  Key: 'target-file.csv',
  ExpressionType: 'SQL',
  Expression: 'SELECT user_name FROM S3Object WHERE cast(age as int) > 20',
  InputSerialization: {
    CSV: {
      FileHeaderInfo: 'USE',
      RecordDelimiter: '\n',
      FieldDelimiter: ','
    }
  },
  OutputSerialization: {
    CSV: {}
  }
};

// Run the query; the response Payload is an event stream of Records/Stats/End events.
client.selectObjectContent(params, (err, data) => {
  if (err) {
    console.error(err);
    return;
  }
  data.Payload.on('data', (event) => {
    if (event.Records) {
      process.stdout.write(event.Records.Payload.toString());
    }
  });
});
I am not familiar with Amazon S3, but the general way to deal with searching remote files is to use indexing, with the index itself being stored on the remote server. That way each search will use the index to deduce a relatively small number of potential matching files and only those will be scanned directly to verify if they are indeed a match or not. Depending on your search terms and the complexity of the pattern, it might even be possible to avoid the direct file scan altogether.
That said, I do not know whether Amazon S3 has an indexing engine that you can use or whether there are supplemental libraries that do that for you, but the concept is simple enough that you should be able to get something working by yourself without too much work.
EDIT:
Generally the tokens that exist in each file are what is indexed. For example, if you want to search for "foo bar", the index will tell you which files contain "foo" and which contain "bar". The intersection of these results will be the files that contain both "foo" and "bar". You will have to scan those files directly to select the ones (if any) where "foo" and "bar" are right next to each other in the right order.
In any case, the amount of data that is downloaded to the client would be far less than downloading and scanning everything, although that would also depend on how your files are structured and what your search patterns look like.
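As a rough illustration of the idea (a sketch only; how files are fetched and tokenized is entirely up to you), an inverted index is just a map from token to the set of files containing it:

import java.util.*;

public class InvertedIndex {
    // token -> set of file keys that contain the token
    private final Map<String, Set<String>> index = new HashMap<>();

    // Index one file's contents under its key (e.g. its S3 object key).
    public void add(String fileKey, String contents) {
        for (String token : contents.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(token, t -> new HashSet<>()).add(fileKey);
        }
    }

    // Return the files that contain every token of the query;
    // only these candidates need to be downloaded and scanned directly.
    public Set<String> candidates(String query) {
        Set<String> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            Set<String> files = index.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(files);
            } else {
                result.retainAll(files);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}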

Fastest way to export keys from Cassandra

What is the fastest way to export all the row keys from a column family in Cassandra (0.7.x and later versions) with Java APIs or other tools?
Currently I am using the Java Pelops API and paging through all records, but I'm wondering if there is a better mechanism.
I am specifically interested in exporting only the row keys (no columns/subcolumns), so I'm wondering if there is a section of Cassandra's direct storage APIs that could be used to do this as quickly as possible (bypassing Thrift).
What about using the Hector Java client? Sample taken from
https://github.com/rantav/hector/wiki/User-Guide
RangeSlicesQuery<String, String, String> rangeSlicesQuery =
        HFactory.createRangeSlicesQuery(keyspace, stringSerializer,
                stringSerializer, stringSerializer);
rangeSlicesQuery.setColumnFamily("Standard1");
rangeSlicesQuery.setKeys("fake_key_", "");
rangeSlicesQuery.setReturnKeysOnly(); // use this - returns only the row keys, no column data
rangeSlicesQuery.setRowCount(5);
Result<OrderedRows<String, String, String>> result = rangeSlicesQuery.execute();
Thrift is the API interface for Cassandra; going directly to storage would require you to read the data files in binary form. The code above should give you good performance.
If you need this for a one-time export, I would say it's OK. If you need this for production, you should reconsider your data model - you may be doing something wrong.
You may need to split the query using multiple key ranges in case you need to scan many rows.
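A paging sketch along those lines, reusing the rangeSlicesQuery from the snippet above (the page size is arbitrary):

int pageSize = 1000;
String startKey = "";
while (true) {
    rangeSlicesQuery.setKeys(startKey, "");
    rangeSlicesQuery.setRowCount(pageSize + 1); // +1 because the start key is returned again
    OrderedRows<String, String, String> rows = rangeSlicesQuery.execute().get();
    for (Row<String, String, String> row : rows) {
        if (!row.getKey().equals(startKey)) {
            System.out.println(row.getKey()); // export the row key
        }
    }
    if (rows.getCount() < pageSize + 1) {
        break; // last page reached
    }
    startKey = rows.peekLast().getKey(); // continue from the last key seen
}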

Equivalent of Data Protection API on Linux

Microsoft Windows 2000 and later versions expose the Data Protection API (DPAPI) that encrypts data for a per-user or per-system context. The caller does not provide a key with which to encrypt the data. Rather, the data is encrypted with a key derived from the user or system credentials.
This API is conveniently exposed in .NET via the ProtectedData class:
// Encrypts the data in a specified byte array and returns a byte array
// that contains the encrypted data.
public static byte[] Protect(
    byte[] userData,
    byte[] optionalEntropy,
    DataProtectionScope scope
)

// Decrypts the data in a specified byte array and returns a byte array
// that contains the decrypted data.
public static byte[] Unprotect(
    byte[] encryptedData,
    byte[] optionalEntropy,
    DataProtectionScope scope
)
Is there an equivalent API on Linux? A bonus would be that it integrates conveniently with Java.
What are my alternatives if there isn't one?
There are two options for user-level key stores on Linux:
GnomeKeyring
KWallet
This does not address the need for a system-level key store.
Here: https://learn.microsoft.com/en-us/aspnet/core/security/data-protection/introduction?view=aspnetcore-5.0
This documentation is so good that I won't even bother to explain more.
Don't be discouraged by the lack of immediate code samples on the first page. There are examples in the links there, for all scenarios: DI, ASP.NET, console, Windows and Linux.
As the others before me said, AFAIK you don't have default keys for users and the system in Linux. But a key is a key: you can create them, and on Linux it's your (the administrator's / root's) responsibility to protect the key files (meaning make them accessible only to authorized users).
The good part is you do not rely on system specific keys. You just use separate keys, your application keys.
If that is what you need, the linked API is just for you. I wish Linux had built-in default keys for users, but well... it's just one extra step for increased app-level security. Do you want one step more? Use Azure Key Vault; it has a nice REST API you can use anywhere, not necessarily in .NET. Yes, AKV requires a locally stored user credential, but you can disable access remotely, so it's a good additional security layer. If your user or machine is compromised, you just disable the secret and the target app is disabled until you provide the user with a new key. I use it in my sensitive apps a lot. Works like a charm.
BTW, my minimalistic example of Linux DPAPI usage:
using System;
using System.IO;
using Microsoft.AspNetCore.DataProtection;
var keyDirectory = Directory.GetCurrentDirectory();
var dataProtectionProvider = DataProtectionProvider.Create(new DirectoryInfo(keyDirectory));
var protector = dataProtectionProvider.CreateProtector("Test");
var password = "Secret1337";
var protectedPassword = protector.Protect(password);
Console.WriteLine($"Protected: {protectedPassword}");
var decodedPassword = protector.Unprotect(protectedPassword);
Console.WriteLine($"Decoded: {decodedPassword}");
Of course in a real-world app you won't store keys in the current directory, but that's the simplest example I could think of.
It doesn't look any more (or less) advanced than PGP, or Pretty Good Privacy. There are APIs available for PGP, and the one that I recall others speaking kindly of is Bouncy Castle.
Here's an example of how someone used Bouncy Castle.
Better APIs or solutions may be available, depending on your specific needs.
DPAPI does not exist on Linux.
Windows uses a special machine-id to derive a machine key. You can emulate this behavior by reading the registry key "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Cryptography\MachineGuid" and deriving your own special key from it using any encryption library you want.
Under Linux, on the other hand, this machine-id is stored in the file "/etc/machine-id". You can read its contents and derive your special key from it. Be aware that this id may be the same across fast-deployed VM images.
Encrypt your data with this machine-id-derived key and it cannot be read on other machines. First read the machine-id (Linux or Windows) and then try to decrypt the contents of your data; on another machine the result will obviously be different and not correct.
You can code your platform independent wrapper class by using the information from above.
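A minimal Java sketch of that idea (assuming /etc/machine-id is readable; the salt string and the AES/GCM parameters are arbitrary choices of mine, not any standard API):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class MachineBoundCrypto {

    // Derive a 256-bit AES key from /etc/machine-id plus an app-specific salt.
    static SecretKeySpec machineKey() throws Exception {
        String machineId = Files.readString(Paths.get("/etc/machine-id")).trim();
        byte[] key = MessageDigest.getInstance("SHA-256")
                .digest((machineId + ":my-app-salt").getBytes(StandardCharsets.UTF_8));
        return new SecretKeySpec(key, "AES");
    }

    // Encrypt with AES/GCM; the random IV is prepended so decryption can recover it.
    static byte[] protect(byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, machineKey(), new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    static byte[] unprotect(byte[] blob) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, machineKey(),
                new GCMParameterSpec(128, Arrays.copyOfRange(blob, 0, 12)));
        return cipher.doFinal(blob, 12, blob.length - 12);
    }
}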
Hope this helps someone in the future.
Cheers
