I have a Google App Engine based app which stores data in the datastore. I want to implement a cron job that will read around 20k rows of data each day, summarize them into a much smaller data set, and store the result in a lightweight, easy-to-access data structure that I will use later to serve Google Charts to users.
I think it would be much too costly to read all the instance-level data every time a user needs a chart, so I want to compile the data "ahead of time", once per day.
I'm thinking of the following options and I'm interested in any feedback or approaches that would optimize performance and minimize GAE overhead.
Options:
1) Create a small CSV or XML file, keep it locally on the server, and read the data from there
2) Persist another "summary level" object in the datastore and read that (still might be costly?)
3) Create the Google Chart SVG and store it locally, then re-serve it to users (not sure if this is possible)
Thanks!
Double-check, but I think datastore + memcache may end up being the cheapest option.
In your cron job, precompute the data you need to return for each graph and store it in both the datastore and memcache.
For each graph request, get the data from memcache.
Memcache data can, however, be evicted at any time, so if it is not available there, read it from the datastore and put it back into memcache.
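Here is a minimal sketch of that read-through pattern using the App Engine low-level Datastore and Memcache APIs; the DailySummary kind and the json property are assumptions about how the cron job might store its output, not a prescribed schema:

import com.google.appengine.api.datastore.*;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class ChartSummaryDao {
    // "DailySummary" and the "json" property are hypothetical names for the
    // precomputed summary written by the daily cron job.
    private static final String KIND = "DailySummary";

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    /** Called by the cron job after summarizing the ~20k rows. */
    public void storeSummary(String chartId, String summaryJson) {
        Entity entity = new Entity(KIND, chartId);
        entity.setUnindexedProperty("json", new Text(summaryJson));
        datastore.put(entity);                            // durable copy
        memcache.put(KIND + ":" + chartId, summaryJson);  // fast copy
    }

    /** Called on each chart request: memcache first, datastore as fallback. */
    public String loadSummary(String chartId) throws EntityNotFoundException {
        String cacheKey = KIND + ":" + chartId;
        String cached = (String) memcache.get(cacheKey);
        if (cached != null) {
            return cached;
        }
        // Memcache entries can be evicted at any time, so fall back to the
        // datastore and repopulate the cache for the next request.
        Entity entity = datastore.get(KeyFactory.createKey(KIND, chartId));
        String json = ((Text) entity.getProperty("json")).getValue();
        memcache.put(cacheKey, json);
        return json;
    }
}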
Why not generate the "expensive" data for the first request, then store those results in memcache? Depending on your particular implementation, even the first, expensive request might be slightly cheaper than reading & parsing local files. Subsequent reads will hit your memcache and be much cheaper all around.
I am developing a Spring Boot REST API which has to fetch a large volume of data (100-200k records) from a DynamoDB table based on search conditions and return the response to the API consumer without loading the entire object list into memory. With a SQL-based database I have used JdbcTemplate's queryForStream method for a similar requirement, but for a NoSQL database like DynamoDB I could not find a similar method to stream the data.
One sample scenario is to fetch all passengers who booked a business class ticket on the Christmas weekend from the xyz airline DynamoDB database.
note: Edited for clarity.
Reading GBs of data per request from DynamoDB does not seem scalable. Does the end user really require all that data, and what is the purpose?
DynamoDB can only return 1MB per request so for a single end user API call you would have to make many paginated requests to DynamoDB.
If you are using Scan then your solution is not at all scalable and I would possibly suggest using a different database.
This is not a good use case for REST in general. Have you considered storing the query result in S3?
Your REST API would return a task ID, which you can then use to check the progress of the query and eventually download the result.
This way you get infinite scalability and can run huge amounts of parallel dynamo scans or queries.
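As a rough Spring Boot sketch of that task-ID pattern (the controller, the SearchConditions body, and the runQueryAndUploadToS3 helper are all hypothetical names, and a real implementation would persist task state rather than keep it in memory):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/exports")
public class ExportController {

    // In-memory status map for illustration only; a real service would persist
    // task state (e.g. in DynamoDB) so it survives restarts and scales out.
    private final Map<String, String> status = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    @PostMapping
    public Map<String, String> startExport(@RequestBody SearchConditions conditions) {
        String taskId = UUID.randomUUID().toString();
        status.put(taskId, "RUNNING");
        executor.submit(() -> {
            // Hypothetical helper that pages through DynamoDB and streams the
            // result to an S3 object for later download.
            runQueryAndUploadToS3(taskId, conditions);
            status.put(taskId, "DONE");
        });
        return Map.of("taskId", taskId);
    }

    @GetMapping("/{taskId}")
    public Map<String, String> checkStatus(@PathVariable String taskId) {
        return Map.of("status", status.getOrDefault(taskId, "UNKNOWN"));
    }

    private void runQueryAndUploadToS3(String taskId, SearchConditions conditions) {
        // paginate through DynamoDB and write the results to S3
    }

    /** Placeholder request body for the search conditions. */
    public record SearchConditions(String route, String cabinClass, String dateRange) {}
}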
The fastest way to do this is going to be using a parallel Scan operation. Assuming you have sufficient read capacity on the DynamoDB table, this is going to give you very high speed results.
See "parallel scan using Java" at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ScanJavaDocumentAPI.html for an example
DynamoDB
Think of DDB in the way Amazon uses it for e-commerce: small, sub-100-item lists of paginated data, where the items are usually small in size but must be easy to update
In that model you would never need to store or fetch GBs of data from the tables
Your needs as a 'how might we...' question
How might we store GBs of data in AWS and retrieve that data quickly?
AWS Best Practices
Before we dive into solving the 'hmw' question above, we need to understand some core tenets of AWS
operational excellence
security
reliability
performance efficiency
cost optimisation
sustainability
AWS calls these tenets, or 'pillars', its Well-Architected Framework
You can read more about these here https://aws.amazon.com/architecture/well-architected/
Most of them are what they sound like: monitoring, security, reliability, performance, cost efficiency, and being computationally cheap (which also means environmentally friendly)
A sprawling buffet of solutions
Storage
Your need is to store GBs of data
It still depends on what you're trying to store here, but for most storage needs you'd use S3
To keep things 'compliant' with the Well-Architected Framework we'd need to enable encryption (in transit and at rest), block public bucket access, etc.
To make everything cost efficient we have to think about when we want to access this data. If it is accessed regularly then we'll have to use 'hot' storage; otherwise the 'cold' S3 storage classes are cheaper, but you trade away retrieval time.
Notable mentions
If you have specific data science needs you should check out: Data Lakes (still uses S3 under the hood), Glue, and Athena (a query layer on top of S3)
If you're storing text-based data and require near-instant searching and retrieval, use OpenSearch - it is very useful for chat-related data
Data storage
This depends on your app, but most people still keep a DynamoDB table that acts as a map for S3 queries.
DDB is query-optimised and super performant when you fully understand your data queries or access patterns.
Design your table around your access patterns, not around entities.
e.g.

Option 1: One table

    PK                  SK
    type#order          timestamp
    type#transaction    timestamp
    ...

Option 2: Multiple entity-based tables

    Order table
    PK    SK           Attr
    id    timestamp    productIDs

    Transactions table
    PK    SK           Attr
    id    timestamp    amount, orderId

    Products table
    PK    SK
    id    category
The one-table design just simplifies the retrieval of data in a small number of requests, but you do need to play with your table design until it's just right.
My recommendation: be creative and mix and match the table styles to your needs. Entity-based tables are still useful in most apps.
Also expect to redo your tables once you find out new things.
It's crucial here that you use an infrastructure-as-code tool to make it easier to tear down and recreate tables - CDK is great for this.
Remember that you are billed per Read and Write units. This is where a well-designed table (to match your access patterns) will help you make concise queries at a low cost.
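As an illustration, a query against the hypothetical one-table layout above might look like this with the AWS SDK for Java v2 (the table name, key names, and sort-key condition are assumptions):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class OrderQueries {
    /** Fetches all orders since a given timestamp from the single-table design. */
    public static QueryResponse ordersSince(DynamoDbClient dynamo, String sinceTimestamp) {
        QueryRequest request = QueryRequest.builder()
                .tableName("AppTable") // hypothetical table name
                .keyConditionExpression("PK = :pk AND SK >= :since")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("type#order").build(),
                        ":since", AttributeValue.builder().s(sinceTimestamp).build()))
                .build();
        // A Query reads only the matching items, so the read-unit cost stays
        // proportional to the result size, unlike a full table Scan.
        return dynamo.query(request);
    }
}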
Data retrieval
This is where you have some options, depending on your app
Again, I would recommend storing big items in S3, not DynamoDB; in that case it's relatively easy to download GBs of data from S3.
You can also store data in optimised formats such as Parquet.
Also, if you choose to use DynamoDB as a hash map for the S3 bucket, you can quickly find your files and their locations and then place those in a queue, so that the retrieval happens in the background.
You can also copy files within the bucket to a job folder, zip the data and provide the user with the URL to that zip.
You can also use DataSync for copying across buckets.
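For the "provide the user with the URL" step, a presigned S3 GET URL is the usual mechanism; here is a sketch using the AWS SDK for Java v2 S3Presigner (bucket and key names are placeholders):

import java.time.Duration;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;

public class DownloadLinks {
    /** Returns a time-limited download URL for an object the background job produced. */
    public static String presignedUrl(S3Presigner presigner, String bucket, String key) {
        GetObjectRequest getObject = GetObjectRequest.builder()
                .bucket(bucket) // e.g. a hypothetical "exports" bucket
                .key(key)       // e.g. "jobs/<taskId>/result.zip"
                .build();
        GetObjectPresignRequest presignRequest = GetObjectPresignRequest.builder()
                .signatureDuration(Duration.ofMinutes(15)) // link expires after 15 minutes
                .getObjectRequest(getObject)
                .build();
        return presigner.presignGetObject(presignRequest).url().toString();
    }
}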
Final notes
It sounds to me like you are storing data in AWS and downloading for processing.
Most teams approach this by moving their processing and storage to AWS, running the whole process in the cloud.
I have a web application in which I maintain many static Maps to store my relevant information. The application is deployed on a server, and every hit to the server-side Java uses these maps to match the key, get the appropriate result, and send it back to the client. My code contains a rank-and-retrieval feature, so I have to read the entire keySet of each of these Maps.
My question is:
1. Is working with static variables better than storing this data in a local embedded DB like Apache Derby and then using it?
2. The use of this data is very frequent. So if I use a database, will that be the faster approach? Since I read the full keySet, the WHERE clause may not come in handy in many operations.
3. How does the server's memory get impacted by holding data in static variables?
The number of maps is fixed, but the size of each Map keeps increasing. Please suggest the better solution.
If you want the data to be saved regularly, an embedded database like H2 makes sense. You then also have snapshots of the data, and development and structural changes are a bit safer.
A real database also has incredible power behind it: concurrency, caching and so on. An embedded (file-based) database has less of that.
The problem with maps is that data extraction can require several levels of indirection. It is more versatile to have SQL queries with joins on the tables.
So SQL is more abstract (it does not prescribe the actual query implementation) and easier to test. SQL, for instance, relieves the developer from programming reports by hand.
So, IMHO, go for a database when you are really doing hard work.
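To make the embedded-database option concrete, here is a small sketch using H2 in file-based embedded mode over plain JDBC; the table and column names are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedH2Example {
    public static void main(String[] args) throws Exception {
        // File-based embedded H2 database; the data survives restarts as ./data/appdb.mv.db
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/appdb", "sa", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS rankings ("
                    + "item_key VARCHAR(255) PRIMARY KEY, score INT)");
            stmt.execute("MERGE INTO rankings KEY(item_key) VALUES ('item-1', 42)");
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT item_key, score FROM rankings ORDER BY score DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("item_key") + " -> " + rs.getInt("score"));
                }
            }
        }
    }
}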
What you might want to consider is storing data in the map when it is searched for.
For instance, if a user searches for something specific, that something is stored in the map so that the next user who searches for it gets the data directly from the map rather than from the database.
There are some downsides though: you need to make sure that if the data is changed in the database, the hashmap/cache is cleared or updated with the new data, so as to prevent feeding outdated data to the user.
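A minimal cache-aside sketch with a thread-safe map; loadFromDatabase is a hypothetical stand-in for your real database lookup:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SearchCache {
    // Thread-safe map so concurrent requests can share cached results.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Cache-aside lookup: serve from the map, fall back to the database once. */
    public String find(String query) {
        return cache.computeIfAbsent(query, this::loadFromDatabase);
    }

    /** Call this whenever the underlying rows change, to avoid serving stale data. */
    public void invalidate(String query) {
        cache.remove(query);
    }

    private String loadFromDatabase(String query) {
        // hypothetical DAO call; replace with your real persistence code
        return "result-for-" + query;
    }
}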
As for the impact on the server's memory, it depends on the size of the data you're storing. It's hard to give you a precise answer, but you can however test that on your own:
// Rough measurement: compare free heap before and after populating the map.
long memoryBefore = Runtime.getRuntime().freeMemory();
// populate your map here
long memoryAfter = Runtime.getRuntime().freeMemory();
// Approximate number of bytes consumed by the map (and anything else allocated in between).
System.out.println(memoryBefore - memoryAfter);
That should give you the number of bytes used (more or less, depending on the operations you run between memoryBefore and memoryAfter, since you may have instantiated other classes/variables unrelated to the hashmap)
I'm fairly new to Android development and I have a general how-to question:
My app gets sensor data from the step detector (detected steps get added up).
Now I need to store those steps (which will be a lot of data).
The steps should be stored like this:
If today's:
steps are stored on a per-hour basis.
Else:
steps are stored on a per-day basis.
SharedPreferences is ruled out, as it only stores key-value pairs.
But can SQLite handle this? Or is there any other way?
A future feature could be to sync this data with a server.
I mean, this could end up as thousands of entries, and the app will also support other large data sets which need to be stored in a similar way.
Try using the Realm NoSQL database for this. The point is that you can save the entire database on the SD card as a separate file for each day and process it later. It is native and works very fast with large amounts of data. You can process all your readings later on: open the database, transform the readings (perhaps interpolate older values to shrink the data in size), then upload it to the cloud and delete the database file.
But anyway, the database is just an implementation detail; consider abstracting out all your operations so you can replace the DB later on.
As far as I know, SQLite stores all tables in a single file, so you will need a column for the date and all records will be stored in a single table. Realm is more flexible for this task.
SQLite can be used; it will be there as long as your application exists on the device. However, if you want, you can use a cloud service: Azure provides a simple and easy-to-use App Service with Easy Tables, in which you can directly call the APIs, and internally it takes care of making the connection and inserting the data into the table. You can use the free tier of App Service to test the concept.
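To show that SQLite can handle this, here is a rough sketch of a step table keyed by the start of its hour or day bucket; the names and the aggregation strategy are assumptions, not a prescription:

import android.content.ContentValues;
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class StepDbHelper extends SQLiteOpenHelper {
    private static final String DB_NAME = "steps.db"; // hypothetical name
    private static final int DB_VERSION = 1;

    public StepDbHelper(Context context) {
        super(context, DB_NAME, null, DB_VERSION);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // One row per hour for today; a background job can later aggregate old
        // hourly rows into one row per day and delete the hourly ones.
        db.execSQL("CREATE TABLE steps ("
                + "bucket_start INTEGER PRIMARY KEY, " // epoch millis of the hour/day start
                + "granularity TEXT NOT NULL, "        // 'HOUR' or 'DAY'
                + "step_count INTEGER NOT NULL)");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        db.execSQL("DROP TABLE IF EXISTS steps");
        onCreate(db);
    }

    public void addSteps(long bucketStart, String granularity, int stepCount) {
        ContentValues values = new ContentValues();
        values.put("bucket_start", bucketStart);
        values.put("granularity", granularity);
        values.put("step_count", stepCount);
        getWritableDatabase().insertWithOnConflict("steps", null, values,
                SQLiteDatabase.CONFLICT_REPLACE);
    }
}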
We have a part of our application that needs to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
On our initial, naïve, implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The result needs to be computed dynamically, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors, both external (e.g. the user's preferences, market conditions) and internal (i.e. from the entity's properties), and outputs a single "investment score" value.
The user could request the derivatives to be sorted by investment score and ask to be presented with the N highest-scored derivatives.
200,000 × 5 KB is 1 GB. You could keep all of this in memory on the largest backend instance, or use multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.
I would have done the same; the map-reduce concept.
It would be great if you could provide more metrics: how many parallel instances do you use, and what are the results of each instance?
Also, does your process include retrieval alone, or retrieval and storing?
How many elements do you have in your datastore? 4,000? 10,000? The reason I ask is that you could cache it from the previous request.
In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Note that, last time I checked, the Blobstore wasn't returning gzipped entities > 1 MB, so at the moment we're loading the content from a compressed entity and expanding it in memory; that way the final payload still gets gzipped. I don't like that, since it introduces latency, and I hope they fix the issues with GZIP soon!
Here is my situation: I have a Java EE single-page application. All client-server communication is AJAX-based, with JSON used as the data exchange format. One of my requests takes around 1 minute to calculate the data required by the client, and this data is huge (it could be > 20 MB), so it is not possible to pass the entire data set to JavaScript in one go. For this reason I am only passing a few records to the client and using a grid with a paging option to display the data.
Now, when the user clicks the next-page button, I need to get more data. My question is: how do I cache the data on the server side? I need this data for only one user at a time. Would you recommend caching all the data on the first request, using the session ID as the key?
Any other suggestions ?
I am assuming you are using a DB backend for that. I'd use limits to return small chunks of data; most DB vendors have a solution for this. That would make your queries faster, and most JS frameworks with grid-type components support paginated results (ExtJS, for example).
If you are fetching data from a third party and passing it on (with or without modifications), I'd still stick with the database and use this workflow: pull the data from the third party, save it in the DB, and have your widget request the small chunks required by customers.
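As an illustration of the limit-based approach, a plain JDBC sketch that fetches one page at a time (the LIMIT/OFFSET syntax and the table/column names are assumptions; Oracle and SQL Server use different pagination syntax):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PagedQuery {
    /** Fetches one page of rows instead of the full 20 MB result set. */
    public static List<String> fetchPage(Connection conn, int page, int pageSize) throws SQLException {
        String sql = "SELECT name FROM records ORDER BY id LIMIT ? OFFSET ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, pageSize);
            stmt.setInt(2, page * pageSize);
            List<String> rows = new ArrayList<>();
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getString("name"));
                }
            }
            return rows;
        }
    }
}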
Hope this helps.
The cheapest (and not entirely ineffective) way of caching data in a Java EE web application is to use the Session object, as you intend to do. The downside is that it requires the developer to ensure that the cache does not leak memory; it is up to the developer to null out the reference to the object once it is no longer needed.
However, even if you wish to implement this poor man's cache, caching 20 MB of data is not advisable, as it does not scale well. The scalability question arises when multiple users use the same functionality of the application, in which case 20 MB per session is a lot of data.
You're better off returning paginated "datasets" as JSON, based on the ValueList design pattern. Each request for data results in a partial retrieval, which is then sent down the wire to the client. That way, you never have to cache the complete results of the query execution, and you can still return partial datasets. Whether you cache at all is entirely up to you; usually caching is done for large datasets that are used time and again.
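A rough JAX-RS sketch of such a paginated endpoint; RecordResource, RecordDto, RecordService, and the query parameters are hypothetical names, not an existing API:

import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/records")
public class RecordResource {

    // Hypothetical service that runs the expensive query one page at a time.
    private final RecordService recordService = new RecordService();

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<RecordDto> page(@QueryParam("page") int page,
                                @QueryParam("size") int size) {
        int pageSize = size == 0 ? 50 : size;
        // Only the requested slice is computed and serialized as JSON; nothing
        // is held in the HTTP session, so memory use stays flat per user.
        return recordService.findPage(page, pageSize);
    }

    /** Hypothetical DTO; the JSON provider (e.g. Jackson) serializes it. */
    public static class RecordDto {
        public long id;
        public String name;
    }

    /** Hypothetical backing service; replace with your real data access code. */
    public static class RecordService {
        List<RecordDto> findPage(int page, int pageSize) {
            return List.of(); // placeholder
        }
    }
}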