I have data being collected every second and stored in HSQLDB.
I need aggregated data (per 15 sec, 1 min, etc.) for each metric in the collected data.
What is the best approach to calculating the aggregated values? And when should they be stored in the DB?
Should I calculate the values online and store them in the DB every 15 sec? Or should I query the DB for the latest results and calculate the aggregation from them? Should I use the small aggregation (15 sec) to calculate the larger aggregation (1 min)?
Are there free Java tools for this?
From previous experience, I would suggest using a real time database, probably non-relational with a built-in ability to deal with time series. That way, you should be able to avoid storing calculated aggregated data. Using a relational database, you will quickly end up with millions of rows that will be difficult to manage and slow to access. Your other option is to denormalize your data and store every 1 hour of data in a single row, in a BLOB column (in binary format).
You can use HSQLDB in MVCC mode for concurrent reads and writes.
Provided the table for the raw data has an indexed timestamp column, aggregate calculation on a range is very fast using a SELECT statement. Because SELECT statements with aggregate calculations happen concurrently, you can use separate threads to perform the operation every 1 second and every 15 seconds.
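As a rough sketch of the SELECT-per-interval approach (the table name METRICS, the columns METRIC, VAL and TS, and the connection URL are assumptions, not from the question), two scheduled tasks could look like this:

    import java.sql.*;
    import java.util.concurrent.*;

    public class AggregationJob {
        public static void main(String[] args) {
            // One task per aggregation interval; with MVCC these SELECTs run
            // concurrently with the 1-second inserts.
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
            scheduler.scheduleAtFixedRate(() -> aggregate(15), 15, 15, TimeUnit.SECONDS);
            scheduler.scheduleAtFixedRate(() -> aggregate(60), 60, 60, TimeUnit.SECONDS);
        }

        static void aggregate(int seconds) {
            String sql = "SELECT METRIC, AVG(VAL), MIN(VAL), MAX(VAL) "
                       + "FROM METRICS WHERE TS >= ? GROUP BY METRIC";
            Timestamp cutoff = new Timestamp(System.currentTimeMillis() - seconds * 1000L);
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hsqldb:hsql://localhost/metrics", "SA", "");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, cutoff);           // only scan the last interval (uses the TS index)
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // store or publish the aggregated row here
                    }
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }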
I have around 15 million records in MySQL (read only) which will be fetched using joins of 10 tables. Around 50,000 new records are inserted daily, and that number will keep increasing.
Each record will be processed independently by a Java program. Multiple processing steps will be applied to the same record, and the output will be calculated based on that processing.
Results will be stored in another database.
Processing must be completed within an hour.
My questions are:
How do I design the processing engine (a cluster of Java programs) in a distributed manner to make the processing as fast as possible? To be more precise, I want to boot many spot instances at that time and finish the processing.
Will MySQL be a read bottleneck?
I don't have any experience with big data solutions. Should I use Spark or some other MapReduce solution? If yes, how should I proceed?
I was in a similar situation where we were collecting about 15 million records per day. What I did was create some collection tables that I rotated and performed initial processing. Once that was done, I moved the data to the next phase where further processing was done before adding it to the large collection of data. Breaking it down will get the best performance and avoid having to run through a large set of data.
I'm not sure what you mean by processing the data or why you want to do it in Java; you may have a good reason for that. I would imagine that performance would be much better if you offloaded that work to MySQL and let it do as much of the processing as possible.
I'm measuring the performance of MongoDB with its native driver.
For inserts, the more I inserted, the more time it took. For selects, however, the results surprised me: I have 500,000 documents (in one collection), each with 4 other small documents embedded inside it.
I performed a select query with a condition on one subdocument. Selecting 100 documents (from a total of 1,000) takes 0.007 ms, while selecting 50,000 documents (from a total of 500,000) takes 0.008 ms.
I don't quite understand how this is possible. Is it really that fast? I'm measuring the performance using JMH.
Your test setup only measures the time it takes to open a cursor, but not to actually read any data from the cursor.
When you call contactCollection.find(bs), the driver returns a cursor object but doesn't actually do anything yet. The actual query doesn't run until you retrieve data from the cursor. Changing the line to contactCollection.find(bs).toArray() should force the database to return all results, allowing you to measure the query time, the time it takes to transfer the results over the network, and the time the Java driver needs to parse them.
Alternatively you could use contactCollection.find(bs).explain() to measure the time the operation takes only on the database-level and also get a DBObject with some useful performance metrics from the database itself.
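A minimal sketch of the difference, using the legacy DBCollection API the question implies (the collection name, filter and field names are placeholders):

    import com.mongodb.*;
    import java.util.List;

    public class FindBenchmarkSketch {
        static void measure(DB db) {
            // "contacts" and "subdocs.age" are placeholder names.
            DBCollection contactCollection = db.getCollection("contacts");
            DBObject bs = new BasicDBObject("subdocs.age", new BasicDBObject("$gt", 30));

            DBCursor cursor = contactCollection.find(bs);    // only opens a cursor; returns almost immediately
            List<DBObject> results = cursor.toArray();       // actually runs the query and transfers all documents

            DBObject plan = contactCollection.find(bs).explain(); // server-side timing and index usage
            System.out.println(results.size() + " docs, explain: " + plan);
        }
    }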
I have a requirement to run 'n' select queries at fixed time intervals and store the results. These results need to be pulled later on a client's demand.
My questions are:
1) Is it okay to store them as CSV files? Or can you suggest another format?
2) Or should they be stored in a CLOB column in a DB?
Please suggest any compression techniques for storing these query results; also, is it possible to store only the revisions relative to the previous result sets instead of storing the whole result set each time?
Note:
The minimum time interval is hourly.
The number of queries (n) varies (currently 10 to 200 queries).
The result-set size of each query also varies (say 10 to 1,000,000 rows, but mostly around 10k).
The result-set data fetched between time intervals doesn't differ much (the row values are not updated frequently).
I am new to computer science and programming and not very familiar with storage or DB design.
It sounds like you should be building a data warehouse.
Performance-wise, I suppose it would be better to have a table whose purpose is to store the query results.
I think you need to store the data in a database; an SQL database can serve you best.
As for storing the data at fixed time intervals, you only need to store the changes to the data set instead of storing the whole data again and again. I don't know your exact requirements or how much infrastructure you can afford. If you have such heavy queries, I recommend working with a distributed system; use a NoSQL database for better performance.
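As a rough illustration of storing only the deltas (every table and column name here is made up, since the real schema isn't known): keep the last known value of each row and, on every run, insert only the rows that are new or changed.

    import java.sql.*;

    public class DeltaStore {
        // Assumed (made-up) schema:
        //   current_results(query_id, row_key, row_value)        -- this run's result set
        //   latest_results(query_id, row_key, row_value)         -- last known value per row
        //   result_deltas(run_id, query_id, row_key, row_value)  -- only new/changed rows per run
        static void storeDeltas(Connection conn, long runId, int queryId) throws SQLException {
            String insertDeltas =
                "INSERT INTO result_deltas (run_id, query_id, row_key, row_value) " +
                "SELECT ?, c.query_id, c.row_key, c.row_value " +
                "FROM current_results c " +
                "LEFT JOIN latest_results p ON p.query_id = c.query_id AND p.row_key = c.row_key " +
                "WHERE c.query_id = ? AND (p.row_key IS NULL OR p.row_value <> c.row_value)";
            try (PreparedStatement ps = conn.prepareStatement(insertDeltas)) {
                ps.setLong(1, runId);
                ps.setInt(2, queryId);
                ps.executeUpdate();
            }
            // ...then refresh latest_results from current_results for this query.
        }
    }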
The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code; it's not a move from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will cause problems.
What can I do to get the maximum speed (INSERT per second) possible?
Database: MS SQL 2008. Application: Java-based, using Microsoft JDBC driver.
Batch the inserts. That is, send 1000 rows at a time rather than one row at a time, so you hugely reduce round trips/server calls.
See Performing Batch Operations on MSDN for the JDBC driver. This is the easiest method; it doesn't require reengineering to use genuine bulk methods.
Each insert must be parsed, compiled and executed. A batch means a lot less parsing/compiling because 1000 inserts (for example) are compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs.
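A minimal sketch of what that batching looks like with PreparedStatement (the table, columns and row format are made up):

    import java.sql.*;
    import java.util.List;

    public class BatchInsertSketch {
        // Table and column names are made up; rows holds whatever the Java code generates.
        static void insertAll(Connection connection, List<Object[]> rows) throws SQLException {
            String sql = "INSERT INTO measurements (id, payload) VALUES (?, ?)";
            connection.setAutoCommit(false);          // commit per batch instead of per row
            try (PreparedStatement ps = connection.prepareStatement(sql)) {
                int count = 0;
                for (Object[] r : rows) {
                    ps.setObject(1, r[0]);
                    ps.setObject(2, r[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();            // one round trip for 1000 rows
                        connection.commit();
                    }
                }
                ps.executeBatch();                    // flush the remainder
                connection.commit();
            }
        }
    }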
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding one - some indexes (most notably one on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
Have you looked into bulk operations?
Have you considered using batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure you can afford the consequences.
Look into SQL Server's bcp utility.
This would mean a big change in your approach in that you'd be generating a delimited file and using an external utility to import the data. But this is the fastest method for inserting a large number of records into a SQL Server DB and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation you have to perform or something that will occur on a regular basis? If it's one time I would suggest not even coding this process but performing an export/import with a combination of db utilities.
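Roughly, the Java side then only has to write out a delimited file; the actual load happens outside the application. The file name, table name and bcp invocation below are illustrative, not exact:

    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.List;

    public class ExportForBcp {
        // Writes rows as a tab-delimited file that bcp can load, e.g. (roughly):
        //   bcp MyDb.dbo.measurements in rows.tsv -c -S <server> -T
        static void export(List<Object[]> rows, Path file) throws Exception {
            try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (Object[] r : rows) {
                    out.write(r[0] + "\t" + r[1]);
                    out.newLine();
                }
            }
        }
    }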
I would recommend using an ETL engine for it. You can use Pentaho. It's free. The ETL engines are optimized for doing bulk loading on data and also any forms of transformation/validation that are required.
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
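If the calculation can be pushed into a stored procedure, the Java side shrinks to something like this (recalc_metrics and its parameter are hypothetical):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.SQLException;

    public class ServerSideCalc {
        // recalc_metrics is a hypothetical stored procedure that does the work inside
        // the database server, so the millions of rows never cross the network.
        static void runCalculation(Connection conn, int batchId) throws SQLException {
            try (CallableStatement cs = conn.prepareCall("{call recalc_metrics(?)}")) {
                cs.setInt(1, batchId);
                cs.execute();
            }
        }
    }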
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program so that it first fetches the last row processed by the other nodes and then adds an entry to the same table telling the other nodes which range of rows it is going to process (you can decide this number). In our case, let's assume each node can process 1000 rows at a time. Node 1 reads table B and finds it empty, so it inserts a row ('Node1', 1000), announcing that it is processing up to primary key 1000 of A (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that 1000 primary keys are being processed by another node, so it inserts a row ('Node2', 2000), telling the others that it is processing rows 1001 to 2000. Please note that access to table B must be synchronized, i.e. only one node can work on it at a time.
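A minimal sketch of that claiming step (the table and column names are invented, and as noted above access to B still has to be serialized, e.g. with SELECT ... FOR UPDATE on InnoDB or an explicit lock):

    import java.sql.*;

    public class RangeClaimer {
        // Table B is assumed to be B(node_name VARCHAR, last_key BIGINT).
        // Returns {lowExclusive, highInclusive}: this node should process the rows
        // of A whose primary key falls in that range.
        static long[] claimNextRange(Connection conn, String nodeName, int batchSize) throws SQLException {
            conn.setAutoCommit(false);
            long previousMax;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT COALESCE(MAX(last_key), 0) FROM B FOR UPDATE")) {
                rs.next();
                previousMax = rs.getLong(1);
            }
            long claimedMax = previousMax + batchSize;
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO B (node_name, last_key) VALUES (?, ?)")) {
                ps.setString(1, nodeName);
                ps.setLong(2, claimedMax);
                ps.executeUpdate();
            }
            conn.commit();
            return new long[] { previousMax, claimedMax };
        }
    }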
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, thereby decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance and the changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options are to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on an application key.