Using DynamoDB for Timeseries Data with visualization goal - java

I've been advised to look into DynamoDB to store timeseries data but I'm not quite sure about it given that my final goal is data visualization.
I have sensors that send data once every 10 minutes and I'd like to visualize the data in some charts with a weekly view by default (1008 data points (datetime/values) per week). Let's suppose that I provision 10,000 reads/second (AWS's 'default' maximum) and let's assume that 1 record fits in 1 unit of read capacity (1 KB).
Besides this getting expensive, does it mean that I cannot support even 10 clients simultaneously? Am I wrong, or is DynamoDB just not the right tool for the job?

DynamoDB is very good at storing your incoming event data, but it should not be the only tool you work with. You can integrate DynamoDB with other tools:
Put a cache (ElastiCache, for example) in front of DynamoDB so that repeated queries are served from the cache instead of from DynamoDB.
Put a buffer queue (SQS, for example) in front of DynamoDB so that your sensors can send their reports at varying rates while you keep a lower, steadier rate of writes into DynamoDB.
You can also keep multiple formats of your data inside DynamoDB, based on your access patterns. For example, you can have a single record per sensor that holds the data points of a whole week, and update that record every 10 minutes instead of appending a new record for every report. The record can be weekly, daily, or monthly, as you see fit. Either way, rendering a chart then requires reading only 1, 7, or some other small number of records.
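As an illustration of that weekly-record pattern, here is a minimal sketch using the AWS SDK for Java v2; the table name ("SensorData") and the key/attribute names are invented for this example, not taken from the question. Each report appends one point to the sensor's record for the current week:

    import java.util.List;
    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

    public class WeeklyRecordWriter {
        private final DynamoDbClient ddb = DynamoDbClient.create();

        // Appends one reading to the sensor's record for the current week,
        // creating the points list on the first write of the week.
        public void appendReading(String sensorId, String weekStart,
                                  String isoTimestamp, double value) {
            AttributeValue point = AttributeValue.fromM(Map.of(
                    "t", AttributeValue.fromS(isoTimestamp),
                    "v", AttributeValue.fromN(Double.toString(value))));

            ddb.updateItem(UpdateItemRequest.builder()
                    .tableName("SensorData") // placeholder table name
                    .key(Map.of(
                            "sensorId", AttributeValue.fromS(sensorId),
                            "weekStart", AttributeValue.fromS(weekStart)))
                    .updateExpression(
                            "SET points = list_append(if_not_exists(points, :empty), :p)")
                    .expressionAttributeValues(Map.of(
                            ":empty", AttributeValue.fromL(List.of()),
                            ":p", AttributeValue.fromL(List.of(point))))
                    .build());
        }
    }

With this layout, the default weekly view costs one read per sensor instead of 1008, so 10 simultaneous clients consume about 10 read units rather than 10,080.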
Updated with a link to more on DynamoDB table design from the DynamoDB documentation: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Fetch large volume of data from DynamoDB?

I am developing a Spring Boot REST API which has to fetch a large volume of data (100-200k records) from a DynamoDB table based on search conditions and return the response to the API consumer without loading the entire object list into memory. With a SQL-based database, I have used JdbcTemplate's queryForStream method for a similar requirement. But for a NoSQL database like DynamoDB, I could not find similar methods to stream the data.
One sample scenario is to fetch all passengers who booked a business class ticket on Christmas weekend from the xyz airline DynamoDB database.
Reading GBs of data per request from DynamoDB does not seem scalable. Does the end user require all that data? What is the purpose?
DynamoDB can only return 1 MB per request, so for a single end-user API call you would have to make many paginated requests to DynamoDB.
If you are using Scan, then your solution is not at all scalable, and I would possibly suggest using a different database.
This is not a good use case for REST in general. Have you considered storing the query result in S3?
Your REST API would return a task id that you can then use to check the progress of the query and eventually download the result.
This way you get infinite scalability and can run huge numbers of parallel Dynamo scans or queries.
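For illustration, here is a minimal sketch of the download half of that pattern with the AWS SDK for Java v2, assuming a background job has already written the result to S3; the bucket name and key layout are invented:

    import java.time.Duration;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.S3Presigner;
    import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;

    public class ExportDownloads {
        // Once the background job has written the query result to S3,
        // hand the caller a time-limited URL instead of streaming GBs through the API.
        public static String downloadUrlFor(String jobId) {
            try (S3Presigner presigner = S3Presigner.create()) {
                GetObjectPresignRequest presignRequest = GetObjectPresignRequest.builder()
                        .signatureDuration(Duration.ofMinutes(15))
                        .getObjectRequest(GetObjectRequest.builder()
                                .bucket("query-exports")                  // placeholder bucket
                                .key("jobs/" + jobId + "/result.json.gz") // placeholder layout
                                .build())
                        .build();
                return presigner.presignGetObject(presignRequest).url().toString();
            }
        }
    }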
The fastest way to do this is going to be using a parallel Scan operation. Assuming you have sufficient read capacity on the DynamoDB table, this is going to give you very high speed results.
See "parallel scan using Java" at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ScanJavaDocumentAPI.html for an example
DynamoDB
Think of DDB in the way Amazon uses it for e-commerce: small, sub-100-item paginated lists, where the items are usually small in size but must be easy to update.
Used that way, you would never need to store or fetch GBs of data from the tables.
Your needs as a 'how might we...' question
How might we store GBs of data in AWS and retrieve that data quickly?
AWS Best Practices
Before we dive into solving the 'HMW' question above, we need to understand some core tenets of AWS:
operational excellence
security
reliability
performance efficiency
cost optimisation
sustainability
AWS calls these tenets, or 'pillars', its Well-Architected Framework.
You can read more about them here: https://aws.amazon.com/architecture/well-architected/
Most of them are what they sound like: monitoring, security, reliability, performance, cost efficiency, and computational frugality (which also means environmentally friendly).
A sprawling buffet of solutions
Storage
Your need is the storage of GBs of data.
It still depends on what you're trying to store, but for most storage needs you'd use S3.
To keep things 'compliant' with the Well-Architected Framework, we'd need to enable encryption (in transit and at rest), block public bucket access, etc.
To make everything cost efficient, we have to think about how often we want to access this data. If it's accessed regularly we'll use 'hot' storage; otherwise the 'cold' S3 storage classes are cheaper, but you trade away retrieval time.
Notable mentions
If you have specific data science needs you should check out Data Lakes (still S3 under the hood), Glue, and Athena (a query layer on top of S3).
If you're storing text-based data and require near-instant searching and retrieval, use OpenSearch; it is very useful for chat-related data.
Data storage
This depends on your app, but most people still keep a DynamoDB table that acts as a map for S3 objects.
DDB is query-optimised and super performant once you fully understand your data queries, or access patterns.
Design your table around your access patterns, not around entities. For example:
Option 1: One table

    PK                  SK
    type#order          timestamp
    type#transaction    timestamp
    ...

Option 2: Multiple entity-based tables

Order table:

    PK    SK           Attr
    id    timestamp    productIDs

Transactions table:

    PK    SK           Attr
    id    timestamp    amount, orderId

Products table:

    PK    SK
    id    category
The one-table design simplifies retrieving the data in a small number of requests, but you do need to play with your table design until it's just right.
My recommendation: be creative and mix and match the table styles to your needs. Entity-based tables are still useful in most apps.
Also expect to redo your tables once you find out new things.
It's crucial here that you use an infrastructure-as-code tool to make it easy to tear down and recreate tables - CDK is great for this.
Remember that you are billed per read and write capacity unit. This is where a well-designed table (matched to your access patterns) helps you make concise queries at low cost.
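As a sketch of what one access pattern against the Option 1 table above could look like with the AWS SDK for Java v2 (the table name and date range are illustrative), "all orders in a week" resolves to a single targeted Query rather than a Scan:

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

    public class OrdersInRange {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();

            // The partition key selects the entity type; the sort key
            // (a timestamp) selects the time range.
            QueryRequest request = QueryRequest.builder()
                    .tableName("AppTable") // placeholder name
                    .keyConditionExpression("PK = :pk AND SK BETWEEN :from AND :to")
                    .expressionAttributeValues(Map.of(
                            ":pk", AttributeValue.fromS("type#order"),
                            ":from", AttributeValue.fromS("2023-01-01T00:00:00Z"),
                            ":to", AttributeValue.fromS("2023-01-08T00:00:00Z")))
                    .build();

            ddb.query(request).items().forEach(System.out::println);
        }
    }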
Data retrieval
This is where you have some options, depending on your app
Again, I would recommend storing the big items in S3, not DynamoDB; it is then relatively easy to download GBs of data from S3.
You can also store the data in optimised formats such as Parquet.
Also, if you choose to use DynamoDB as a hash map for the S3 bucket, you can quickly find your files and their locations and then place those in a queue, so that the retrieval happens in the background.
You can also copy files within the bucket to a job folder, zip the data, and give the user the URL to that zip (see the sketch below).
You can also use DataSync for copying across buckets.
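A sketch of the job-folder staging step mentioned above, assuming the DynamoDB lookup has already produced the relevant object keys; the bucket name and prefix layout are invented:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.CopyObjectRequest;

    public class JobStaging {
        // Copies the objects found via the DynamoDB "map" lookup under a per-job
        // prefix, ready to be zipped and handed to the user.
        public static void stage(S3Client s3, String jobId, Iterable<String> objectKeys) {
            for (String key : objectKeys) {
                s3.copyObject(CopyObjectRequest.builder()
                        .sourceBucket("data-bucket")
                        .sourceKey(key)
                        .destinationBucket("data-bucket")
                        .destinationKey("jobs/" + jobId + "/" + key)
                        .build());
            }
        }
    }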
Final notes
It sounds to me like you are storing data in AWS and downloading it for processing.
Most teams approach this by moving their processing and storage to AWS, running the whole process in the cloud.

Apache Beam - what are the limits of Deduplication function

I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day; now, to ignore duplicate records, we are planning to use the Deduplication function provided by the Beam framework.
The documentation states neither the maximum input count for which the Deduplication function will work nor the maximum duration for which it can persist the data.
Would it be a good design to simply throw 50M records at the deduplication function, of which around half would be duplicates, and keep the persistence duration at 7 days?
The deduplication function, as described in the link that you provide, performs deduplication per window.
If you have a window of 1 hour and your duplicates arrive 3 hours apart, the function won't deduplicate them, because they are in different windows.
So you can define a window over 1 day, or more; there is no limit. The data is stored on the workers (to persist it) and also kept in memory (for efficiency), and the more data you have, the bigger and stronger the worker configuration must be to manage that quantity of data.
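For reference, a minimal runnable sketch of Beam's Deduplicate transform in Java; the in-memory Create source stands in for the real streaming input, and in practice you would likely deduplicate on a record ID (for example via Deduplicate.keyedValues()) rather than on whole encoded values:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Deduplicate;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class DedupeSketch {
        public static void main(String[] args) {
            Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Stand-in for the real streaming source; "rec-1" appears twice on purpose.
            PCollection<String> events = pipeline.apply(Create.of("rec-1", "rec-2", "rec-1"));

            // State is kept for 7 days (processing time by default): any value seen
            // again within that horizon is dropped from the output.
            PCollection<String> deduped = events.apply(
                    Deduplicate.<String>values().withDuration(Duration.standardDays(7)));

            pipeline.run().waitUntilFinish();
        }
    }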

Spark application from Kafka stream takes a long time to produce recommendations

I am reading a stream of data in my Spark application from a Kafka stream. My requirement is to produce product recommendations for a user when he makes any request (search/browse etc.).
I already have a trained model containing scores for users. I am using Java and the org.apache.spark.mllib.recommendation.MatrixFactorizationModel model, which I read once at the start of my Spark application. Whenever there is any browsing event, I call the recommendProducts(user_id, num_of_recommended_products) API to produce recommendations for a user from my existing trained model.
This API takes ~3-5 seconds to generate a result per user, which is very slow, and hence my stream processing lags behind. Are there any ways in which I can optimise the time of this API? I am considering increasing the stream duration from 15 seconds to 1 minute as an optimisation (not sure of its results for now).
Calling recommendProducts in real time doesn't make much sense. Since an ALS model can make predictions only for users that have been seen in the training dataset, it is better to call recommendProductsForUsers once, store the output in a store which supports fast lookups by key, and fetch results from there when needed.
If adding a storage layer is not an option, you can also take the output of recommendProductsForUsers, partition it by id, checkpoint and cache the predictions, and then join with the input stream by id.
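A sketch of that precompute-and-store approach; the model path and top-N count are placeholders, and the key-value write is left as a comment since the question doesn't name a target store:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class PrecomputeRecommendations {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("precompute-recommendations"));

            // Load the already-trained ALS model once (path is a placeholder).
            MatrixFactorizationModel model =
                    MatrixFactorizationModel.load(sc.sc(), "hdfs:///models/als");

            // Generate the top 10 products for every user in one batch job,
            // then persist each user's list for fast lookups at request time.
            model.recommendProductsForUsers(10).toJavaRDD()
                    .foreachPartition(partition -> partition.forEachRemaining(entry -> {
                        Object userId = entry._1();
                        Rating[] topProducts = entry._2();
                        // write userId -> topProducts to Redis/DynamoDB/etc.
                    }));

            sc.stop();
        }
    }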

Android: Best way to store large amounts of sensor data over a long time

I'm fairly new to Android development and I have a general question about how to do this:
My app gets sensor data from the step detector (detected steps get added up).
Now I need to store those steps (which will be a lot of data).
The steps should be stored like this:
If Todays
steps are stored on per Hour basis.
Else
steps are stored on per Day basis
SharedPreferences is ruled out, as it only stores key-value pairs.
But can SQLite handle this? Or is there any other way?
A future feature could be to sync this data with a server.
I mean, this could end up being thousands of entries, and the app will also support other large data sets which need to be stored in a similar way.
Try using the Realm NoSQL database for this. The point is, you can save the entire database on the SD card as a separate file for each day and process it later. It is native and works very fast with large amounts of data. You can process all your readings later on: open the database, transform the readings (perhaps interpolating older values to shrink the data in size), then upload it to the cloud and delete the database file.
But, anyway, a database is just an implementation detail; consider abstracting out all your operations so you can replace the DB later on.
As far as I know, SQLite stores all tables in a single file, so you will need a column for the date and all records will be stored in a single table. Realm is more flexible for this task.
SQLite can be used; it will be there as long as your application exists on the device. However, if you want, you can use a cloud service: Azure provides a simple and easy-to-use App Service with Easy Tables, where you can call the APIs directly and it internally takes care of making the connection and inserting the data into the table. You can use the Free Tier of App Service to test the concept.
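To make the SQLite suggestion concrete, here is a sketch of a SQLiteOpenHelper with one row per hour/day bucket; the schema and names are invented for illustration:

    import android.content.ContentValues;
    import android.content.Context;
    import android.database.sqlite.SQLiteDatabase;
    import android.database.sqlite.SQLiteOpenHelper;
    import android.database.sqlite.SQLiteStatement;

    public class StepsDbHelper extends SQLiteOpenHelper {

        public StepsDbHelper(Context context) {
            super(context, "steps.db", null, 1);
        }

        @Override
        public void onCreate(SQLiteDatabase db) {
            // One row per time bucket: hourly buckets for today, daily after roll-up.
            db.execSQL("CREATE TABLE steps ("
                    + "bucket_start INTEGER PRIMARY KEY, " // epoch millis of the hour/day start
                    + "granularity TEXT NOT NULL, "        // 'HOUR' or 'DAY'
                    + "step_count INTEGER NOT NULL)");
        }

        @Override
        public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
            db.execSQL("DROP TABLE IF EXISTS steps");
            onCreate(db);
        }

        // Adds newly detected steps to the current bucket, inserting the row on first use.
        public void addSteps(long bucketStart, String granularity, int steps) {
            SQLiteDatabase db = getWritableDatabase();
            SQLiteStatement update = db.compileStatement(
                    "UPDATE steps SET step_count = step_count + ? WHERE bucket_start = ?");
            update.bindLong(1, steps);
            update.bindLong(2, bucketStart);
            if (update.executeUpdateDelete() == 0) {
                ContentValues row = new ContentValues();
                row.put("bucket_start", bucketStart);
                row.put("granularity", granularity);
                row.put("step_count", steps);
                db.insert("steps", null, row);
            }
        }
    }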

What is the best approach to create aggregation tables?

I have data being collected every 1 second and stored in HSQLDB.
I need to have aggregated data (per 15 sec, 1 min, etc.) for each metric in the collected data.
What is the best approach to calculating the aggregation values? When should they be stored in the DB?
Should I calculate the values online and store them in the DB every 15 seconds? Or maybe query the DB for the latest results and calculate the aggregation on them? Should I use the small aggregation (15 sec) to calculate the large one (1 min)?
Are there free java tools for it?
From previous experience, I would suggest using a real-time database, probably non-relational with a built-in ability to deal with time series. That way, you should be able to avoid storing calculated aggregate data. Using a relational database, you will quickly end up with millions of rows that will be difficult to manage and slow to access. Your other option is to denormalize your data and store every hour of data in a single row, in a BLOB column (in binary format).
You can use HSQLDB in MVCC mode for concurrent reads and writes.
Provided the table for the raw data has an indexed timestamp column, calculating aggregates over a range with a SELECT statement is very fast. Because SELECT statements with aggregate calculations can run concurrently, you can use separate threads to perform the operation every 1 second and every 15 seconds.
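As an illustration of aggregating on read, a sketch using plain JDBC against HSQLDB; the table and column names are invented, and it assumes HSQLDB's built-in UNIX_TIMESTAMP function to form 15-second buckets:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    public class FifteenSecondAggregates {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hsqldb:hsql://localhost/metrics", "SA", "")) {

                // Bucket raw 1-second samples into 15-second windows in SQL;
                // an index on the ts column keeps the range scan fast.
                String sql = "SELECT (UNIX_TIMESTAMP(ts) / 15) * 15 AS bucket_start, "
                        + "AVG(val) AS avg_val, MAX(val) AS max_val "
                        + "FROM raw_samples "
                        + "WHERE ts BETWEEN ? AND ? "
                        + "GROUP BY UNIX_TIMESTAMP(ts) / 15 "
                        + "ORDER BY bucket_start";

                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setTimestamp(1, Timestamp.valueOf("2023-01-01 00:00:00"));
                    ps.setTimestamp(2, Timestamp.valueOf("2023-01-01 01:00:00"));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("%d avg=%.2f max=%.2f%n",
                                    rs.getLong("bucket_start"),
                                    rs.getDouble("avg_val"),
                                    rs.getDouble("max_val"));
                        }
                    }
                }
            }
        }
    }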
