Fetch a large volume of data from DynamoDB? - java

I am developing a Spring Boot REST API which has to fetch a large volume of data (100-200k records) from a DynamoDB table based on search conditions and return the response to the API consumer without loading the entire object list into memory. With a SQL-based database, I have used the JdbcTemplate queryForStream method for a similar requirement. But for a NoSQL database like DynamoDB, I could not find a similar method to stream the data.
One sample scenario is to fetch all passengers who booked a business class ticket over the Christmas weekend from the xyz airline DynamoDB database.
note: Edited for clarity.

Reading GBs of data per request from DynamoDB does not seem scalable. Does the end user require all that data? What is the purpose?
DynamoDB can only return 1 MB per request, so for a single end-user API call you would have to make many paginated requests to DynamoDB (a sketch of that loop follows this comment).
If you are using Scan then your solution is not at all scalable, and I would suggest considering a different database.
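For illustration, a minimal sketch of that pagination loop using the AWS SDK for Java v1, assuming a hypothetical Bookings table whose partition key groups the records being fetched (names and values are made up):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import java.util.Map;

public class PaginatedQuery {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> lastKey = null;
        do {
            QueryRequest request = new QueryRequest()
                    .withTableName("Bookings")                         // hypothetical table
                    .withKeyConditionExpression("pk = :pk")
                    .addExpressionAttributeValuesEntry(":pk",
                            new AttributeValue("BUSINESS#2022-12-24")) // hypothetical key value
                    .withExclusiveStartKey(lastKey);                   // resume where the last page ended

            QueryResult result = client.query(request);
            result.getItems().forEach(System.out::println);            // stream each 1 MB page out as it arrives

            lastKey = result.getLastEvaluatedKey();                    // null once there are no more pages
        } while (lastKey != null);
    }
}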

This is not a good use case for REST in general. Have you considered storing the query result in S3?
Your REST API would return a task id that you can then use to check the progress of the query and eventually download the result.
This way you get near-infinite scalability and can run large numbers of parallel DynamoDB scans or queries.
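A rough sketch of that task-based flow in Spring Boot; the endpoint shape, the in-memory status map, and runQueryAndUploadToS3 are all illustrative stand-ins (a real implementation would persist task state and stream DynamoDB results into S3):

import org.springframework.web.bind.annotation.*;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

@RestController
@RequestMapping("/exports")
public class ExportController {

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Map<String, String> status = new ConcurrentHashMap<>();

    @PostMapping
    public String startExport(@RequestBody String searchConditions) {
        String taskId = UUID.randomUUID().toString();
        status.put(taskId, "RUNNING");
        pool.submit(() -> {
            try {
                runQueryAndUploadToS3(taskId, searchConditions); // scan/query DynamoDB, write result to S3
                status.put(taskId, "DONE");                      // client can now fetch the download URL
            } catch (Exception e) {
                status.put(taskId, "FAILED");
            }
        });
        return taskId; // the consumer polls GET /exports/{taskId}
    }

    @GetMapping("/{taskId}")
    public String checkStatus(@PathVariable String taskId) {
        return status.getOrDefault(taskId, "UNKNOWN");
    }

    private void runQueryAndUploadToS3(String taskId, String conditions) {
        // placeholder for the actual export work
    }
}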

The fastest way to do this is going to be a parallel Scan operation. Assuming you have sufficient read capacity on the DynamoDB table, this will give you very high-speed results.
See "parallel scan using Java" at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ScanJavaDocumentAPI.html for an example

DynamoDB
Think of DDB the way Amazon uses it for e-commerce: small, sub-100-item lists of paginated data, where the items are usually small in size but must be easy to update.
In that scenario you would never need to store or fetch GBs of data from the tables.
Your needs as a 'how might we...' question
How might we store GBs of data in AWS and retrieve that data quickly?
AWS Best Practices
Before we dive into solving the 'HMW' question above, we need to understand some core tenets of AWS:
operational excellence
security
reliability
performance efficiency
cost optimisation
sustainability
AWS calls these tenets or 'pillars' its Well-Architected Framework.
You can read more about these here https://aws.amazon.com/architecture/well-architected/
Most of them are what they sound like: monitoring, security, reliability, performance, cost efficiency, and being computationally cheap (which also means environmentally friendly)
A sprawling buffet of solutions
Storage
Your need here is the storage of GBs of data
It still depends on what you're trying to store, but for most storage needs you'd use S3
To keep things 'compliant' with the Well-Architected Framework, we'd need to enable encryption (in transit and at rest), block public bucket access, etc.
To make everything cost efficient, we have to think about when we want to access this data. If it's accessed regularly we'll use 'hot' storage; otherwise the 'cold' S3 storage classes are cheaper, but you trade away retrieval time (see the sketch below).
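For example, with the AWS SDK for Java v1 the storage class is chosen per object at upload time; a sketch with made-up bucket, key, and file names (Standard-IA trades cheaper storage for a per-GB retrieval fee):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;
import java.io.File;

public class ColdUpload {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // hypothetical bucket and key; pick the storage class to match your access pattern
        s3.putObject(new PutObjectRequest("my-archive-bucket", "exports/2022/bookings.parquet",
                new File("bookings.parquet"))
                .withStorageClass(StorageClass.StandardInfrequentAccess));
    }
}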
Notable mentions
If you have specific data science needs you should check out Data Lakes (still S3 under the hood), Glue, and Athena (a query layer on top of S3)
If you're storing text-based data and require near-instant searching and retrieval, use OpenSearch - this is very useful for chat-related data
Data storage
This depends on your app, but most people still keep a DynamoDB table that acts as a map for S3 queries.
DDB is query-optimised and extremely performant when you fully understand your data queries and access patterns.
Design your table around your access patterns, not around entities.
e.g.:

Option 1: One table

  PK                 SK
  type#order         timestamp
  type#transaction   timestamp
  ...

Option 2: Multiple entity-based tables

  Order table:
  PK   SK          Attr
  id   timestamp   productIDs

  Transactions table:
  PK   SK          Attr
  id   timestamp   amount, orderId

  Products table:
  PK   SK
  id   category
The one-table design simplifies retrieving data in a small number of requests, but you do need to play with your table design until it's just right (a query sketch against this layout follows).
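As a sketch, fetching every order since some timestamp from the one-table layout above is a single Query against the partition (table name and values are illustrative):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import java.util.Map;

public class OrdersSince {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        QueryRequest request = new QueryRequest()
                .withTableName("AppTable") // hypothetical single-table name
                .withKeyConditionExpression("PK = :pk AND SK >= :since")
                .withExpressionAttributeValues(Map.of(
                        ":pk", new AttributeValue("type#order"),
                        ":since", new AttributeValue("2023-01-01T00:00:00Z")));
        client.query(request).getItems().forEach(System.out::println);
    }
}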
My recommendation: be creative and mix and match the table styles to your needs. Entity-based tables are still useful in most apps.
Also expect to redo your tables as you discover new things.
It's crucial here that you use an infrastructure-as-code tool to make it easier to teardown and recreate tables - CDK is great for this.
Remember that you are billed per read and write unit. This is where a well-designed table (matched to your access patterns) helps you make concise queries at low cost.
Data retrieval
This is where you have some options, depending on your app
Again, I would recommend storing big items in S3, not DynamoDB; in that case it's relatively easy to download GBs of data from S3.
You can also store data in optimised formats such as Parquet.
Also, if you choose to use DynamoDB as a hash map for the S3 bucket, you can quickly find your files and their locations and then place those in a queue, so that retrieval happens in the background.
You can also copy files within the bucket to a job folder, zip the data, and provide the user with a URL to that zip (see the presigned-URL sketch below).
You can also use DataSync for copying across buckets.
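For the zip-download step, a presigned URL is the usual way to hand out a time-limited link without sharing credentials; a minimal SDK v1 sketch with hypothetical bucket and key names:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.net.URL;
import java.util.Date;

public class ZipLink {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        Date expiry = new Date(System.currentTimeMillis() + 15 * 60 * 1000); // link valid for 15 minutes
        URL url = s3.generatePresignedUrl("my-jobs-bucket", "jobs/1234/result.zip", expiry);
        System.out.println(url); // hand this to the user; no AWS credentials needed to download
    }
}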
Final notes
It sounds to me like you are storing data in AWS and downloading it for processing.
Most teams approach this by moving their processing and storage to AWS, running the whole process in the cloud.

Related

Java: relational database vs static variables

I have a web application in which I maintain many static Maps to store relevant information. The application is deployed on a server, and every hit to the server-side Java uses these maps to match a key and send the appropriate result back to the client. My code contains a rank-and-retrieval feature, so I have to read the entire keySet of each of these Maps.
My questions are:
1. Is working with static variables better than storing this data in a local embedded DB like Apache Derby and then using it?
2. The data is used very frequently. If I use a database, will that be the faster approach? Since I read the full keySet, the WHERE clause may not come in handy for many operations.
3. How does the server's memory get impacted by holding data in static variables?
The number of maps is fixed, but the size of each Map keeps increasing. Please suggest the better solution.
If you want the data to be saved regularly, an embedded database like H2 makes sense. You then also have snapshots of the data, and development and structural changes are a bit safer.
A real database also has incredible power behind it: concurrency, caching, and so on. An embedded (file-based) database less so.
The problem with maps is that data extraction can involve several indirections. It is more versatile to have SQL queries with joins on the tables.
So SQL is more abstract (it does not prescribe the actual query implementation) and easier to test. SQL, for instance, relieves the developer of hand-programming reports.
So go for a database, IMHO, when you are really doing hard work.
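For reference, a minimal sketch of embedded H2 usage (table and data are made up; the file-based JDBC URL means the data survives restarts):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedH2Demo {
    public static void main(String[] args) throws Exception {
        // file-based embedded database: persists to ./data/app.mv.db
        try (Connection con = DriverManager.getConnection("jdbc:h2:./data/app", "sa", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS ranks (term VARCHAR(255) PRIMARY KEY, score INT)");
            st.execute("MERGE INTO ranks KEY(term) VALUES ('spring', 42)"); // upsert
            try (ResultSet rs = st.executeQuery("SELECT term, score FROM ranks ORDER BY score DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("term") + " -> " + rs.getInt("score"));
                }
            }
        }
    }
}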
What you might want to consider is storing data in the map when it is searched for.
For instance, if a user searches for something specific, that result is stored in the map so that the next user who searches for it gets the data directly from the map rather than from the database.
There are some downsides, though: you need to make sure that if the data changes in the database, the map/cache is cleared or updated with the new data, to prevent feeding outdated data to users.
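A minimal sketch of that read-through cache with explicit invalidation; queryDatabase is a stand-in for the real lookup:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SearchCache {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    public Object search(String term) {
        // first searcher pays the DB cost; later searchers hit the map
        return cache.computeIfAbsent(term, this::queryDatabase);
    }

    public void invalidate(String term) {
        // call whenever the underlying row changes, to avoid serving stale data
        cache.remove(term);
    }

    private Object queryDatabase(String term) {
        return "result for " + term; // hypothetical database lookup
    }
}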
As for the impact on the server's memory, it depends on the size of the data you're storing. It's hard to give a precise answer, but you can test it yourself:
long memoryBefore = Runtime.getRuntime().freeMemory(); // free heap before populating the map
// populate your map here
long memoryAfter = Runtime.getRuntime().freeMemory(); // free heap afterwards
System.out.println(memoryBefore - memoryAfter); // approximate bytes consumed by the map
That should give you the number of bytes used (more or less, depending on the operations you run between memoryBefore and memoryAfter, as you may have instantiated other classes/variables unrelated to the map)

Database design decision

I'm going to design a merchant application. After merchants register with the system, they will be able to add their products, discounts, prices, etc. There are also smart mobile apps for browsing each merchant and their products.
So regarding the database design (I hope to use MySQL), I have three options.
Use one database and a single table structure to maintain the catalog, with a column called merchant_id.
Use one database and create the same table structure for each merchant, with a unique prefix in the table name.
Use a separate database with its own table structure for each merchant when they register with the system. In this case, maintain a master DB to keep each merchant's DB information.
We are developing a single application to cater to all merchants' and customers' requests, and there will be a lot of merchants and customers interacting with the system.
Currently we are planning to use Spring MVC and Spring Data JPA.
So I'm struggling to make the correct decision in terms of scalability, maintainability, etc. Your expert advice/recommendations are highly appreciated.
1) Use one database and a single table structure to maintain the catalog, with a column called merchant_id.
This is the easiest route to take.
Pros
Low maintenance. Any changes to the DB make it to one schema / database.
Cons
Does not scale beyond X merchants and N transactions per second on the database.
2) Use one database and create the same table structure for each merchant, with a unique prefix in the table name.
This is a hybrid model of sorts; writing the SQL and tracking which prefix belongs to which merchant can get messy if you do not handle it carefully.
Pros
Can scale a little better
Cons
Maintenance overhead on each table; e.g. adding a new column called created to the user table requires you to modify user_111, user_121, etc.
You can also mix up queries by accidentally joining user_111 with access_121.
3) Use a separate database with its own table structure for each merchant when they register with the system. In this case, maintain a master DB to keep each merchant's DB information.
This provides the most scale but also gives you the most maintenance overhead.
Pros
Can scale each database individually based on the type of customer you have and the traffic they provide.
Cons
High maintenance for each database, because individual parameters are tweaked at the DB level too (SSD, shared buffers, fsync time with the disk, write caches, etc.).
If you're designing a system where you will not know what kind of traffic it will attract on day 1, choose #1. Should the traffic turn out to be unexpectedly large, you can always scale vertically and move high-traffic customers to another DB later (through a hashing mechanism that puts customers into DB buckets).
If you expect the site traffic to be large enough and already have capacity planned out for the customers, go for #3. You must bear the brunt of the maintenance overhead, but at least you get to scale each database based on the traffic that hits it.
I'm not a fan of #2 since I've seen that approach let down some products that implemented it.
In my opinion option 1 is the way to go. The benefit I see is that you can run aggregate queries over this table to perform calculations per merchant, e.g. when your admin view wants to see the top 20 merchants with the highest number of products uploaded.
The drawback you might see in option 1 is that this table will be huge. This can be addressed with partitioning techniques and properly chosen indexes.
Option 2 and 3 are not nice because they introduce redundancy in your schema.
Also consider that with JPA your entity classes naturally map to tables; table prefixes per merchant would be painful to hack into JPA. This is another +1 for option 1 (see the sketch below).
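As a sketch, the option-1 mapping is just an ordinary entity with an indexed merchant_id discriminator column (all names here are illustrative):

import javax.persistence.*;
import java.math.BigDecimal;

@Entity
@Table(name = "products",
       indexes = @Index(name = "idx_products_merchant", columnList = "merchant_id"))
public class Product {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // the tenant discriminator: every query filters or groups on this column
    @Column(name = "merchant_id", nullable = false)
    private Long merchantId;

    private String name;
    private BigDecimal price;

    // getters and setters omitted
}

The admin "top 20 merchants" view then becomes a plain JPQL aggregate along the lines of SELECT p.merchantId, COUNT(p) AS cnt FROM Product p GROUP BY p.merchantId ORDER BY cnt DESC.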
What benefits do you see in options 2 and 3? I don't really see any advantages, only drawbacks.

lightweight data structure for java google app engine

I have a Google App Engine based app which stores data in the Datastore. I want to implement a cron job that will read around 20k rows of data each day, summarize them into a much smaller data set, and store that in a lightweight, easy-to-access data structure that I will later use to serve Google Charts to users.
I think it would be much too costly to read all the instance-level data every time a user needs a chart, therefore I want to compile the data "ahead of time", once per day.
I'm thinking of the following options and I'm interested in any feedback or approaches that would optimize performance and minimize GAE overhead.
Options:
1) Create a small csv or xml file and keep it locally on the server, then read the data from there
2) Persist another "summary level" object in the data store and read that (still might be costly?)
3) Create the google chart SVG and store it locally then re-serve it to users (not sure if this is possible)
Thanks!
Double-check, but I think Datastore + memcache may end up being the cheapest option.
In your cron job you precompute the data you need for each graph and store it in both the Datastore and memcache.
For each graph request you get the data from memcache.
Memcache entries can be evicted at any time, though, so if the data is not there you read it from the Datastore and put it back into memcache.
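A minimal sketch of that read-through pattern with the App Engine memcache API (the key and the Datastore loader are placeholders):

import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class ChartData {
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    public byte[] getSummary(String chartKey) {
        byte[] summary = (byte[]) cache.get(chartKey); // fast path
        if (summary == null) {
            summary = loadFromDatastore(chartKey);     // memcache entry was evicted; fall back
            cache.put(chartKey, summary, Expiration.byDeltaSeconds(24 * 60 * 60));
        }
        return summary;
    }

    private byte[] loadFromDatastore(String chartKey) {
        return new byte[0]; // hypothetical: read the summary entity written by the daily cron
    }
}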
Why not generate the "expensive" data on the first request, then store the results in memcache? Depending on your particular implementation, even that first, expensive request might be slightly cheaper than reading and parsing local files. Subsequent reads will hit memcache and be much cheaper all around.

How to store Search History

I am building a set of 'Now Trending' visualizations to showcase the trending searches and trending documents within my system. The idea is to show the top queries that came into my system, the most-viewed results, etc.
I was wondering what would be the most effective and scalable Java-based backend for this. If it's a database, what should the schema look like? Or is it wise to maintain this info within a Lucene index? Presently, for the prototype, I store them in a flat file in an unstructured format.
You might try storing this kind of data in a key-value store such as Redis. Redis has efficient atomic methods for incrementing counters that you can use for accruing votes for queries.
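For instance, a Redis sorted set gives you trending queries almost for free: ZINCRBY bumps a counter atomically and ZREVRANGE reads the leaderboard. A minimal Jedis sketch (key name and queries are made up):

import redis.clients.jedis.Jedis;

public class TrendingSearches {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // every incoming search bumps its counter in the sorted set, atomically
            jedis.zincrby("trending:searches", 1, "spring boot dynamodb");
            jedis.zincrby("trending:searches", 1, "jpa pagination");

            // top 10 queries, highest count first
            for (String q : jedis.zrevrange("trending:searches", 0, 9)) {
                System.out.println(q);
            }
        }
    }
}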
A schema-less backend might be preferable if you plan on capturing data ad hoc or are unsure of your future data needs. Additionally, a horizontally scalable solution would support growth in the dataset. Regarding your question about whether to store this data in a search engine, here's a great article covering that concept with some examples:
http://www.elasticsearch.org/blog/2011/05/13/data-visualization-with-elasticsearch-and-protovis.html

Server side caching for Java/Java EE application

Here is my situation: I have a Java EE single-page application. All client-server communication is AJAX-based, with JSON used as the data exchange format. One of my requests takes around 1 minute to calculate the data required by the client. This data is also huge (could be > 20 MB), so it is not possible to pass it all to the JavaScript in one go. For this reason I am passing only a few records to the client and using a grid to display the data with a paging option.
Now, when the user clicks the next-page button, I need to get more data. My question is: how do I cache the data on the server side? I need this data for only one user at a time. Would you recommend caching all the data on the first request, using the session id as the key?
Any other suggestions?
I am assuming you are using a DB backend for that. I'd use limits to return small chunks of data; most DB vendors have a solution for this (sketched below). That would make your queries faster, and most JS frameworks with grid-type components support paginated results (ExtJS, for example).
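A bare-bones sketch of that LIMIT/OFFSET chunking with plain JDBC (MySQL syntax; table, columns, and connection details are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PagedQuery {
    static final int PAGE_SIZE = 50;

    // fetch and print one zero-based page of a hypothetical orders table
    static void printPage(Connection con, int page) throws Exception {
        String sql = "SELECT id, total FROM orders ORDER BY id LIMIT ? OFFSET ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, PAGE_SIZE);
            ps.setInt(2, page * PAGE_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getBigDecimal("total"));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/shop", "user", "pass")) {
            printPage(con, 0); // the grid requests later pages on demand
        }
    }
}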
If you are fetching data from a 3rd party and passing it on (with or without modifications), I'd still stick with the database and use this workflow: pull data from the 3rd party, save it in the DB, and serve the small chunks your widget requires to customers.
Hope this helps.
The cheapest (though not the most effective) way of caching data in a Java EE web application is to use the Session object, as you intend to do. It's risky because it requires the developer to ensure that the cache does not leak memory; it is up to the developer to null out the reference once the object is no longer needed.
However, even with this poor man's cache, caching 20 MB of data is not advisable, as it does not scale well. The scalability question arises when multiple users use the same functionality of the application, in which case 20 MB per user is a lot of data.
You're better off returning paginated "datasets" as JSON, based on the Value List Handler design pattern. Each request for a page of data results in a partial retrieval, which is then sent down the wire to the client. That way, you never have to cache the complete results of the query execution, and you can return partial datasets. It is entirely up to you whether you want to cache; usually caching is done for large datasets that are used time and again. (A session-cache sketch follows.)
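If you do go the session-cache route anyway, a rough servlet sketch of serving page slices from a session-scoped list, releasing the reference once the data is no longer needed (computeExpensiveReport and toJson are stand-ins):

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import java.io.IOException;
import java.util.List;

public class ReportPageServlet extends HttpServlet {
    private static final int PAGE_SIZE = 100;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        HttpSession session = req.getSession();
        @SuppressWarnings("unchecked")
        List<String> rows = (List<String>) session.getAttribute("reportRows");
        if (rows == null) {
            rows = computeExpensiveReport();          // the ~1 minute calculation, done once
            session.setAttribute("reportRows", rows); // poor man's cache, scoped to this user
        }
        String p = req.getParameter("page");
        int page = (p == null) ? 0 : Integer.parseInt(p);
        int from = Math.min(page * PAGE_SIZE, rows.size());
        int to = Math.min(from + PAGE_SIZE, rows.size());
        resp.setContentType("application/json");
        resp.getWriter().print(toJson(rows.subList(from, to)));
        if (to == rows.size()) {
            session.removeAttribute("reportRows");    // last page served: release the memory
        }
    }

    private List<String> computeExpensiveReport() { return List.of(); } // hypothetical
    private String toJson(List<String> page) { return page.toString(); } // stand-in serializer
}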
