queries to get data in batches from a huge data table - java

I need to create some files in my own format (say 5000 records each) from a huge data table (it may contain 5 million records), and I want this creation to be multithreaded.
So how can I form queries efficiently to fetch records 1..5000, then 5001..10000, and so on?
I can form something like select * from table where rownum < 5000 and not exists (already fetched records), but that is not efficient.
Please suggest the best way of forming the queries, or any alternative approach to creating the files.

If you're on Oracle 11g you can use the DBMS_PARALLEL_EXECUTE package to run your procedure in multiple threads.
If you're on an earlier version you can implement DIY parallelism using a technique from Tom Kyte; The Hungry DBA provides a good explanation of it on his blog.

Sounds like you need a set of queries using the MySQL LIMIT clause to implement paging (e.g. one query would get the first 1000 rows, another would get the second 1000, and so on).
You could form these queries and submit them as Callables to an ExecutorService with a set number of threads; the executor will manage the threads. I suspect it may be more efficient to both query and write your records within each Callable, but that is an assumption that would need testing.
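A rough sketch of that approach, assuming a MySQL table named big_table with id and payload columns (the table, columns, connection details and page count are placeholders; in practice the page count would come from a count query):

import java.io.PrintWriter;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class PagedExporter {

    private static final int PAGE_SIZE = 5000;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> futures = new ArrayList<>();

        // 5,000,000 rows / 5,000 per file = 1,000 pages; each page becomes one Callable
        for (int page = 0; page < 1000; page++) {
            final int offset = page * PAGE_SIZE;
            final String fileName = "export-" + page + ".dat";
            futures.add(pool.submit((Callable<Void>) () -> {
                // each task opens its own connection; connections are not thread-safe
                try (Connection con = DriverManager.getConnection(
                             "jdbc:mysql://localhost/mydb", "user", "password");
                     PreparedStatement ps = con.prepareStatement(
                             "SELECT id, payload FROM big_table ORDER BY id LIMIT ? OFFSET ?")) {
                    ps.setInt(1, PAGE_SIZE);
                    ps.setInt(2, offset);
                    try (ResultSet rs = ps.executeQuery();
                         PrintWriter out = new PrintWriter(fileName)) {
                        while (rs.next()) {
                            out.println(rs.getLong("id") + "|" + rs.getString("payload"));
                        }
                    }
                }
                return null;
            }));
        }
        for (Future<?> f : futures) {
            f.get();              // propagate any exception from the workers
        }
        pool.shutdown();
    }
}

Note that OFFSET-based paging gets slower for later pages on very large tables; keyset paging on the primary key (WHERE id > lastSeenId) tends to scale better.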

Related

Is there a way to have two queries, as part of a Java implementation, use the same data set?

I am working on a Java service (Hibernate) and I call, sequentially, a count query and a query that fetches the corresponding records (both native queries). There are cases where the count differs from the number of records actually fetched by the second query.
I would like to ensure that both queries operate on the same data set.
Any ideas on this?
I guess it is not a good idea to rely on counts.
Think about what the primary key of a record stands for, or whether other fields identify the records you need.
A result set retrieved on the client gives you what was in the DB at the time you ran your query.
There are ways to lock the table or individual records while your transaction is not yet committed, but I do not recommend trying them if the DB is used by multiple services, clients or threads in parallel; I guess you have such a system, since the counts change while your queries run.
Locks need very careful handling and can easily slow down or hang other threads.

JDBC data retrieval using multithreading

I am using JDBC with MySQL. Let's assume there is a table in my DB called Test with 700k rows. Fetching all the rows takes a huge amount of time. I am using a PreparedStatement, but I want to use multithreading so that, say, there are 10 threads: the 1st thread fetches 70k rows, the 2nd fetches the next 70k, and so on. How do I implement this?
Forgive me if this is too obvious and you have tried it or it won't work in your situation, but caching might be very helpful here.
Regarding actually doing it with multithreading, it might make sense to have some procedure (you might need a new column in your table for this) that assigns ids you can query against, something like WHERE id BETWEEN value1 AND value2, with each thread querying a different range (a sketch of this follows below). This would be faster than using ORDER BY, since it avoids the need for the database to sort.
If you do want to go the ORDER BY route, consider indexing your database so that the ordering doesn't take extra time.
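A minimal sketch of the range-per-thread idea, assuming the Test table has a numeric id primary key and a name column (the column names and the roughly contiguous ids are assumptions):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class RangeFetcher {

    // each worker fetches one contiguous id range, e.g. WHERE id BETWEEN 1 AND 70000
    static Callable<List<String>> rangeTask(long fromId, long toId) {
        return () -> {
            List<String> rows = new ArrayList<>();
            try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost/mydb", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT id, name FROM Test WHERE id BETWEEN ? AND ?")) {
                ps.setLong(1, fromId);
                ps.setLong(2, toId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows.add(rs.getLong("id") + "," + rs.getString("name"));
                    }
                }
            }
            return rows;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<List<String>>> futures = new ArrayList<>();
        long chunk = 70_000;
        for (int i = 0; i < 10; i++) {
            futures.add(pool.submit(rangeTask(i * chunk + 1, (i + 1) * chunk)));
        }
        for (Future<List<String>> f : futures) {
            System.out.println("fetched " + f.get().size() + " rows");
        }
        pool.shutdown();
    }
}

If the ids have gaps, the ranges will return uneven row counts, but each row is still fetched exactly once.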

What kind of Java/SQL solution do I need for 10 reads/writes per second on a million-row table?

Say I have an SQL table that consists of one million rows, e.g. a user table.
What type of software do I need in order to handle 10 reads/writes every second? I was thinking of using a Java NIO server to handle the connections.
But how does the back-end database work? Could I simply use MySQL on the same computer?
Any insight would be great: links, reading, examples, books?
I know SQL. I have done a lot of SQLite but never created a scalable system to handle this kind of load.
Edit, in response to helios' comment:
How many reads vs. writes? 50/50
Do you need up-to-date reads (no delay)? Yes
How big is each item? 10% of the rows have 10-15 columns and the rest have 1-3 columns
Are you accessing them individually? No; none of the user threads interact, but there can be simultaneous DB reads/writes on the same row (just make it synchronized?)
So you need 10 transactions/second on a table with a million rows.
That is really neither a huge data set nor high performance.
MySQL (currently 5.5+, InnoDB engine), running on a single server, can easily handle that.
You may want to read the first five chapters of 'High Performance MySQL', published by O'Reilly.
For a NoSQL DB, I suggest MongoDB; see http://www.mongodb.org/
If you make use of JDBC connection pooling (C3P0, DBCP etc.) you can have parallel inserts, with 10 threads (or more) simultaneously inserting data. Your limit would then be your platform resources (memory, I/O etc.).
All this holds, however, only if the insertion process itself can run in parallel threads (i.e. you do not have a specific requirement to insert records sequentially) and if you are doing simple inserts rather than something complex that locks the table or causes other transactions to wait.
Also consider using JDBC prepared statements, and committing in batches rather than after each record; this speeds things up greatly. A sketch of the combination follows below.
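A rough illustration of pooled connections shared by several inserting threads with batched commits; the users table, its name column and the pool configuration are placeholders (the DataSource would come from whichever pool, C3P0, DBCP etc., you set up):

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledWriter {

    private final DataSource pool;   // a C3P0/DBCP pool configured elsewhere

    public PooledWriter(DataSource pool) {
        this.pool = pool;
    }

    // each call borrows a connection from the pool, so several threads can insert in parallel
    public void insertUsers(List<String> names) throws Exception {
        try (Connection con = pool.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO users(name) VALUES (?)")) {
            con.setAutoCommit(false);
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
            }
            ps.executeBatch();
            con.commit();            // commit the whole batch once, not per row
        }
    }

    public void insertInParallel(List<List<String>> chunks) throws InterruptedException {
        ExecutorService exec = Executors.newFixedThreadPool(10);
        for (List<String> chunk : chunks) {
            exec.submit(() -> {
                try {
                    insertUsers(chunk);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        exec.shutdown();
        exec.awaitTermination(1, TimeUnit.HOURS);
    }
}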

Accessing database multiple times

I am working on a solution to the problem below but could not find any best practice or tool for it.
For a batch of requests (say 5000 unique ids and records) received in a web service call, the service has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received in the call. If a particular piece of data (say a column) has changed, it is updated in the table for that unique id, and in turn the child tables of that table are also affected. For example, if someone changes his laptop model number and country, the model number is updated in one table and the country value in another. It goes on like this, accessing multiple tables within a short time. The number of records coming in a single web service call might reach 70K in an hour.
I don't have any option other than implementing this in Java. Is there a good practice for implementing it, or can it be achieved using any open source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most widely used.
JDBC is the API to use to access a relational database. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1000 max in Oracle); a sketch of chunking the list follows below
use batched statements for your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
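A sketch of chunking the in-clause to stay under the 1000-value limit, using a hypothetical devices table keyed by id with a model_number column (the table and column names are illustrative only, following the laptop example in the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedLoader {

    private static final int IN_CLAUSE_LIMIT = 1000;  // Oracle's maximum per IN list

    // loads id -> model_number for the given ids, at most 1000 per query
    public static Map<Long, String> loadByIds(Connection con, List<Long> ids) throws Exception {
        Map<Long, String> result = new HashMap<>();
        for (int from = 0; from < ids.size(); from += IN_CLAUSE_LIMIT) {
            List<Long> chunk = ids.subList(from, Math.min(from + IN_CLAUSE_LIMIT, ids.size()));
            // build "?, ?, ?" with one placeholder per id in this chunk
            StringBuilder placeholders = new StringBuilder();
            for (int i = 0; i < chunk.size(); i++) {
                placeholders.append(i == 0 ? "?" : ", ?");
            }
            String sql = "SELECT id, model_number FROM devices WHERE id IN (" + placeholders + ")";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setLong(i + 1, chunk.get(i));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        result.put(rs.getLong("id"), rs.getString("model_number"));
                    }
                }
            }
        }
        return result;
    }
}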
This does not sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and do the following for each record (a JDBC sketch follows below):
fetch the corresponding database record
for each field in the record
    if the field was updated
        execute the corresponding update SQL statement
commit // every so many records
70K records per hour should not be the slightest problem for a decent RDBMS.
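A minimal JDBC sketch of that outline, using the hypothetical devices table with a model_number column from above and committing every 500 records (all names and the commit interval are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

public class RecordSync {

    private static final int COMMIT_INTERVAL = 500;   // commit every so many records

    // one incoming web-service record; the fields are illustrative
    public static class Incoming {
        final long id;
        final String modelNumber;
        public Incoming(long id, String modelNumber) {
            this.id = id;
            this.modelNumber = modelNumber;
        }
    }

    public static void sync(Connection con, List<Incoming> batch) throws Exception {
        con.setAutoCommit(false);
        try (PreparedStatement select = con.prepareStatement(
                     "SELECT model_number FROM devices WHERE id = ?");
             PreparedStatement update = con.prepareStatement(
                     "UPDATE devices SET model_number = ? WHERE id = ?")) {
            int pending = 0;
            for (Incoming in : batch) {
                // fetch the corresponding database record
                select.setLong(1, in.id);
                try (ResultSet rs = select.executeQuery()) {
                    // only queue an update if the field actually changed
                    if (rs.next() && !rs.getString("model_number").equals(in.modelNumber)) {
                        update.setString(1, in.modelNumber);
                        update.setLong(2, in.id);
                        update.addBatch();
                        pending++;
                    }
                }
                if (pending >= COMMIT_INTERVAL) {
                    update.executeBatch();
                    con.commit();
                    pending = 0;
                }
            }
            update.executeBatch();   // flush whatever is left
            con.commit();
        }
    }
}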

Fastest way to insert a very large number of records into a table in SQL

The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code; it's not a move from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will cause problems.
What can I do to get the maximum speed (INSERTs per second) possible?
Database: MS SQL 2008. Application: Java-based, using the Microsoft JDBC driver.
Batch the inserts. That is, send 1000 rows at a time rather than one row at a time, so you hugely reduce round trips/server calls.
Performing Batch Operations on MSDN covers this for the JDBC driver. It is the easiest method without re-engineering to use genuine bulk methods.
Each insert must be parsed, compiled and executed; a batch means a lot less parsing/compiling, because 1000 (for example) inserts are compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs. A sketch follows below.
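A sketch of the batched approach with the Microsoft JDBC driver; the table, columns and batch size of 1000 are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInserter {

    private static final int BATCH_SIZE = 1000;   // rows sent per round trip

    public static void insertAll(List<String[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=mydb", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO my_table (col1, col2) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();      // one round trip for the whole batch
                    con.commit();
                }
            }
            ps.executeBatch();              // flush the remainder
            con.commit();
        }
    }
}

Disabling auto-commit and committing once per batch avoids a round trip and a transaction per row, which is usually where most of the time goes.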
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding indexes - some indexes (most notably an index on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware/configuration of the SQL Server itself, so I can't really give you any numbers.
Have you looked into bulk operations?
Have you considered using batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure that you can afford the consequences.
Look into SQL Server's bcp utility.
This would mean a big change in your approach in that you'd generate a delimited file and use an external utility to import the data. But this is the fastest method for inserting a large number of records into a SQL Server DB and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation or something that will occur on a regular basis? If it's one time, I would suggest not even coding this process but performing an export/import with a combination of DB utilities.
I would recommend using an ETL engine for this. You can use Pentaho; it's free. ETL engines are optimized for bulk loading of data and for any forms of transformation/validation that are required.
