From a Java application, if I need to fetch 100,000 records from any RDBMS, what are the things I should consider? Will it be fetched by a simple select statement?
What are the things I should consider?
The most obvious thing is that it could take a long time to transfer 100,000 "huge" records over a JDBC connection.
You might want to look at alternatives ... like a database-specific data extraction tool of some kind.
Will it be fetched by a simple select statement?
If you are willing to wait long enough for the transfer to complete, yes.
Suppose the application has to fetch all the records and display them in the UI of a J2EE application, and the application is using Spring MVC with Hibernate as the ORM layer.
More things to consider:
Attempting to display 100,000 records to a user in a single page is a bit crazy. No user is going to want to scroll through 100,000 records.
If you are doing this kind of thing via an ORM, you are liable to use an inordinate amount of server-side memory.
My advice: don't fetch all 100,000 records. Instead, fetch the first N, and implement a scheme that allows the user to page through the records.
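As a rough sketch of that scheme with Hibernate (the ORM mentioned in the follow-up), a paged query could look like the following; the Customer entity, the ordering column, and the page size are assumptions for illustration, not part of the question.

import java.util.List;
import org.hibernate.Session;

// Hibernate paging sketch: instead of loading all 100,000 rows, fetch one page.
// "Customer" and PAGE_SIZE are hypothetical; adapt them to your own entity and UI.
public class CustomerPageDao {
    private static final int PAGE_SIZE = 50;

    public List<Customer> fetchPage(Session session, int pageNumber) {
        return session.createQuery("from Customer order by id", Customer.class)
                      .setFirstResult(pageNumber * PAGE_SIZE) // rows to skip
                      .setMaxResults(PAGE_SIZE)               // rows per page
                      .list();
    }
}

The UI then only ever holds one page in memory, and the database does the skipping and limiting.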
Related
In Angular 8+, if we need to display a list of records, we display the results in a paginated way.
We have more than 1 million records, and the number will keep growing in the future.
I am using Spring Boot and MySQL as the database.
But what would be the preferable approach:
1. Getting all the data from the server at once and handling pagination on the client side.
2. Getting 10 records at a time, displaying them, and fetching the next 10 records from the server when the user clicks the Next button.
I think you should use pagination rather than getting all the data from the server.
Getting all the data from the server is a costly operation, since, as you mention, your application has more than a million records.
With pagination, the API is called only when needed and returns data for the requested page.
I would strongly advise you to go with variant #2.
The main reason to do pagination is not really because it makes sense to only display a few entries in the UI at once. Instead, pagination allows you to only transfer the necessary entries from large data sets (such as yours). This greatly improves performance and reduces the amount of data that has to be sent from the server to the client.
Variant #1 will have very poor performance, because the client has to fetch all 1,000,000 records to then only display 10 of them. This does not make a lot of sense and goes directly against the idea and the advantages of pagination.
Variant #2 on the other hand will only fetch the entries that are actually displayed. And it will only transfer roughly 0.001% of the data that variant #1 would.
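As a sketch of variant #2 with Spring Boot, Spring Data JPA can do the paging for you; the UserEntity, repository, and endpoint below are hypothetical names for illustration.

import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Pagination sketch: Spring Data issues a LIMIT/OFFSET query against MySQL,
// so only the requested page ever leaves the database.
interface UserRepository extends JpaRepository<UserEntity, Long> { }

@RestController
class UserController {
    private final UserRepository repository;

    UserController(UserRepository repository) {
        this.repository = repository;
    }

    // GET /users?page=0&size=10 returns 10 records, not all 1,000,000.
    @GetMapping("/users")
    Page<UserEntity> listUsers(@RequestParam(defaultValue = "0") int page,
                               @RequestParam(defaultValue = "10") int size) {
        return repository.findAll(PageRequest.of(page, size));
    }
}

The Angular client only needs to pass the page index and size as query parameters and render the returned page.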
I would use something in between and load maybe 100 or 1000 records. With one million your browser will run out of memory, and with 10 your user gets bored...
I have a JPA entity (EclipseLink) while developing a web application with JSF 2. Let's say I have this:
private String table;
@OneToMany(mappedBy = "NodeTypeID")
private Collection<NodeEntity> nodeEntityCollection;
That collection is getting very large because, of course, there are a lot of rows in the table in the database. I don't show all those entities on the web page because you can't; it's too much for a web page. So I limit the collection to 150 objects.
I limit it after the 1,000+ entities are already in memory, so I guess the process of creating all those instances has to be slow. So I just want to know: what would you do in this case? Just make a query that brings only the 150 entities I want? Is there an annotation for that? Is it good practice to leave that process as it is?
In Hibernate's Criteria API there are a couple of methods to handle pagination, i.e. retrieve 150 rows at a time; the client has to keep track of the page number being viewed and send it to the server. Storing 1,500 rows on the server is usually not a big deal for a short duration.
setFirstResult(i * PAGE_SIZE)  // offset: skip the rows of the pages already viewed
setMaxResults(PAGE_SIZE)       // limit: fetch only one page of rows
ref : http://docs.jboss.org/hibernate/envers/3.6/javadocs/org/hibernate/Criteria.html#setMaxResults(int)
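The same pair of calls exists on plain JPA queries (and therefore works with EclipseLink), so you can fetch just the 150 child entities instead of initializing the whole collection first. A minimal sketch, assuming NodeEntity is mapped as in the question and ordered by a numeric id:

import java.util.List;
import javax.persistence.EntityManager;

// JPA paging sketch (EclipseLink or Hibernate). The JPQL assumes NodeEntity has
// a "NodeTypeID" field (implied by mappedBy) and an "id"; adjust to your mapping.
public class NodeEntityDao {
    public List<NodeEntity> loadFirstNodes(EntityManager em, Object parent, int limit) {
        return em.createQuery(
                    "select n from NodeEntity n where n.NodeTypeID = :parent order by n.id",
                    NodeEntity.class)
                 .setParameter("parent", parent)
                 .setMaxResults(limit)   // e.g. 150: only these rows are instantiated
                 .getResultList();
    }
}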
Say I have an SQL table that consists of one million rows, let's say a user table.
What type of software do I need in order to handle 10 reads/writes every second? I was thinking of using a Java NIO server to handle the connections.
But how does the back-end Database work? Could I simply use MySQL on the same computer?
Any insight would be great: links, reading, examples, books?
I know SQL. I have done a lot of SQLite but never created a scalable system to handle this kind of load.
Edit/update, regarding helios' comment:
How many reads vs. writes? 50/50.
Do you need up-to-date reads (no delay)? Yes.
How big is each item? 10% have 10-15 columns and the rest have 1-3 columns.
Are you accessing them individually? No, none of the user threads interact, but there can be simultaneous DB reads/writes on the same row (just make it synchronized?).
So you need 10 transactions/second on a table with a million rows.
That is really neither a huge data set nor high performance.
MySQL (currently 5.5+, InnoDB engine), running on a single server, can easily handle that.
You may want to read the first five chapters of 'High Performance MySQL', published by O'Reilly.
For a NoSQL DB, I suggest MongoDB; see http://www.mongodb.org/
If you make use of JDBC connection pooling (c3p0, DBCP, etc.) you can have 10 threads (or more) simultaneously inserting data in parallel. Your limit would then be your platform resources (memory, I/O, etc.).
All this holds, however, only if the data insertion process itself can be parallelized (i.e. you do not have a specific requirement to insert records sequentially) and if what you are doing are simple inserts, not something complex that locks the table or causes other transactions to wait.
Also consider using JDBC prepared statements, and committing in batches rather than after each record. This speeds things up greatly.
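A rough sketch of that batched, prepared-statement approach is below; the users table, User class, and batch size are placeholders, and the DataSource would come from whichever pool (c3p0, DBCP, HikariCP) you configure.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

// Batched insert sketch: one prepared statement reused for every row, with a
// commit per batch rather than per record. "users" and User are hypothetical.
public class UserBatchWriter {
    private static final int BATCH_SIZE = 500;

    public void insertAll(DataSource pool, List<User> users) throws SQLException {
        try (Connection con = pool.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "insert into users (id, name) values (?, ?)")) {
            con.setAutoCommit(false);
            int count = 0;
            for (User u : users) {
                ps.setLong(1, u.getId());
                ps.setString(2, u.getName());
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // send the accumulated inserts in one round trip
                    con.commit();      // commit per batch, not per record
                }
            }
            ps.executeBatch();         // flush any remaining rows
            con.commit();
        }
    }
}

Several threads can each run this against their own pooled connection to get the parallel inserts described above.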
I am working on a solution for the problem described below, but could not find any best practice or tool for it.
For a batch of requests (say 5,000 unique ids and records) received in a web service, it has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received in the web service. If there is a change in a particular piece of data (say a column), it will be updated in the table for that unique id, and in turn the child tables of that table are also affected. For example, if someone changes his laptop model number and country, the model number will be updated in one table and the country value in another table. It goes on like this, accessing multiple tables within a short time. The number of records coming in a web service call might reach 70K in one call in an hour.
I don't have any option other than implementing it in Java. Is there a good practice for implementing this, or can it be achieved using any open-source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most often used.
JDBC is the API to use to access a relational database. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1,000 max in Oracle); see the sketch below
use batched statements to make your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
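To make the where ... in () advice concrete, here is a hedged sketch that fetches rows in chunks so the IN list stays under the database's limit; the items table and its columns are made up for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Bulk fetch sketch: loads many rows per query via "where id in (...)",
// chunked to respect limits such as Oracle's 1,000-value maximum.
public class BulkFetcher {
    private static final int CHUNK = 1000;

    public Map<Long, String> fetchByIds(Connection con, List<Long> ids) throws SQLException {
        Map<Long, String> result = new HashMap<>();
        for (int from = 0; from < ids.size(); from += CHUNK) {
            List<Long> chunk = ids.subList(from, Math.min(from + CHUNK, ids.size()));
            String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
            String sql = "select id, value from items where id in (" + placeholders + ")";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setLong(i + 1, chunk.get(i));     // bind each id as a parameter
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        result.put(rs.getLong("id"), rs.getString("value"));
                    }
                }
            }
        }
        return result;
    }
}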
This doesn't sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
if updated
execute corresponding update SQL statement
commit // every so many records
70K records per hour should not be the slightest problem for a decent RDBMS.
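A rough Java sketch of that loop, shown against a single hypothetical device table for brevity (the question notes the columns actually live in different tables); the IncomingRecord class and column names are assumptions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.Objects;

// Compare-and-update sketch: read the stored row, update only if a field changed,
// and commit every so many records. Table and class names are placeholders.
public class RecordSynchronizer {
    private static final int COMMIT_EVERY = 500;

    public void synchronize(Connection con, List<IncomingRecord> incoming) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement select = con.prepareStatement(
                 "select model, country from device where id = ?");
             PreparedStatement update = con.prepareStatement(
                 "update device set model = ?, country = ? where id = ?")) {
            int processed = 0;
            for (IncomingRecord r : incoming) {
                select.setLong(1, r.getId());
                try (ResultSet rs = select.executeQuery()) {
                    if (!rs.next()) {
                        continue; // no stored row for this unique id
                    }
                    boolean changed = !Objects.equals(rs.getString("model"), r.getModel())
                            || !Objects.equals(rs.getString("country"), r.getCountry());
                    if (changed) {
                        update.setString(1, r.getModel());
                        update.setString(2, r.getCountry());
                        update.setLong(3, r.getId());
                        update.executeUpdate();
                    }
                }
                if (++processed % COMMIT_EVERY == 0) {
                    con.commit(); // commit every so many records
                }
            }
            con.commit();
        }
    }
}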
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading on different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
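For such a prototype, the Java side of the stored-procedure route reduces to a single call; the procedure name and parameter below are hypothetical.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Sketch of invoking a MySQL stored procedure that does the calculation and the
// update entirely on the database server, so no rows travel to the middle tier.
public class StoredProcedureRunner {
    public int recalculate(Connection con, int batchId) throws SQLException {
        try (CallableStatement call = con.prepareCall("{call recalculate_scores(?)}")) {
            call.setInt(1, batchId);
            return call.executeUpdate(); // rows affected, if the driver reports it
        }
    }
}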
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program so that it first fetches the last row processed by other nodes and then adds an entry to the same table telling the other nodes what range of rows it is going to process (you can decide this number).

In our case, let's assume each node can process 1,000 rows at a time. Node 1 fetches table B and finds it empty, so Node 1 inserts a row ('Node1', 1000), announcing that it is processing up to primary key 1000 of A (assuming the primary key of table A is numeric and in ascending order). Node 2 comes along, finds that 1,000 primary keys are being processed by another node, and inserts a row ('Node2', 2000), informing the others that it is processing rows between 1001 and 2000. Note that access to table B should be synchronized, i.e. only one node can work on it at a time.
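A rough sketch of that claiming step, assuming table B has columns (node_name, last_row) and using a MySQL named lock (GET_LOCK) so that only one node reads and extends B at a time; all names here are placeholders.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Range-claiming sketch for table B (node_name, last_row). GET_LOCK serializes
// access so two nodes cannot claim the same range of table A's primary keys.
public class RangeClaimer {
    private static final int BATCH = 1000;

    /** Claims the next BATCH primary keys of table A and returns the upper bound. */
    public long claimNextRange(Connection con, String nodeName) throws SQLException {
        try (Statement s = con.createStatement()) {
            s.execute("select get_lock('claim_range', 10)"); // wait up to 10s for the lock
            try {
                long lastClaimed = 0;
                try (ResultSet rs = s.executeQuery(
                        "select coalesce(max(last_row), 0) from B")) {
                    if (rs.next()) {
                        lastClaimed = rs.getLong(1);
                    }
                }
                long upperBound = lastClaimed + BATCH;
                try (PreparedStatement claim = con.prepareStatement(
                        "insert into B (node_name, last_row) values (?, ?)")) {
                    claim.setString(1, nodeName);
                    claim.setLong(2, upperBound);
                    claim.executeUpdate();
                }
                return upperBound; // this node processes rows lastClaimed+1 .. upperBound
            } finally {
                s.execute("select release_lock('claim_range')"); // always free the lock
            }
        }
    }
}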
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits and reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.