I'm running queries in parallel against a MySQL database. Each query takes less than a second to execute and another half a second to a second to fetch the results.
This is acceptable for me. But when I run 10 of these queries in parallel and then attempt another set in a different session, everything slows down and a single query can take 20-plus seconds.
My ORM is Hibernate and I'm using C3P0 with <property name="hibernate.c3p0.max_size">20</property>. I'm sending the queries in parallel using Java threads. But I don't think these are related, because the slowdown also happens when I run the queries in MySQL Workbench. So I'm assuming something in my MySQL config is missing, or the machine is not fast enough.
This is the query:
SELECT *
FROM schema.table
WHERE site = 'sitename'
  AND (description LIKE '% family %' OR title LIKE '% family %')
LIMIT 100 OFFSET 0;
How can I make this go faster when facing let's say 100 concurrent queries?
I'm guessing that this is slow because the WHERE clause is effectively doing a free-text scan of the description and title columns: a LIKE pattern with a leading wildcard can't use an ordinary index, so the database has to look through the entire field on every record, and that's never going to scale.
Each of those 10 concurrent queries must read the 1 million rows to fulfill the query. If you have a bottleneck anywhere in the system - disk I/O, memory, CPU - you may not hit that bottleneck with a single query, but you do hit it with 10 concurrent queries. You could use a system monitoring tool (top, iostat, vmstat, or similar) to find out which bottleneck you're hitting.
Most of the time, those bottlenecks (CPU, memory, disk) are too expensive to upgrade - especially if you need to scale to 100 concurrent queries. So it's better to optimize the query/ORM approach.
I'd consider using Hibernate's built-in free-text capability (Hibernate Search) here - it requires some additional configuration, but it works MUCH better when looking for arbitrary strings in a textual field.
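For comparison, here is a rough sketch of the equivalent approach in MySQL itself - a FULLTEXT index plus MATCH ... AGAINST instead of the leading-wildcard LIKE. This is not the Hibernate Search route suggested above (that maintains its own Lucene index); it assumes your MySQL version/engine supports FULLTEXT on this table (InnoDB from 5.6, otherwise MyISAM), and it reuses the placeholder names from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class FullTextExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/schema", "user", "password")) {

            // One-time DDL: index both searchable columns together.
            try (Statement ddl = con.createStatement()) {
                ddl.execute("ALTER TABLE `schema`.`table` "
                        + "ADD FULLTEXT INDEX ft_desc_title (description, title)");
            }

            // MATCH ... AGAINST uses the full-text index instead of scanning
            // every row the way LIKE '% family %' does.
            String sql = "SELECT * FROM `schema`.`table` "
                    + "WHERE site = ? "
                    + "AND MATCH(description, title) AGAINST(?) "
                    + "LIMIT 100";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "sitename");
                ps.setString(2, "family");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("title"));
                    }
                }
            }
        }
    }
}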
Related
I have a scenario where a user will select a bulk input of up to 100K entries, and I need to validate that this data belongs to the user and satisfies X other conditions. Should I use a complex Oracle SQL query with a composite IN (id, column) to validate it, or
should I fetch the data for this user that satisfies the conditions into application memory and use List.containsAll: first get all the data (with all the other conditions) for this particular user, populate it into a dbList, and then validate dbList.containsAll(inputList)?
Which one will be better performance-wise: a composite IN in the DB to send the bulk input, or fetching the data and validating it with containsAll?
I tried running the SQL query in the SIT environment; it takes around 70-90 seconds, which is too bad. It would be better in prod, but I still feel the database has to sift through a huge amount of data even though it is indexed by user ID.
In the DB I am using COUNT(*) with IN, like below.
SQL query:
SELECT COUNT(*)
FROM user_table
WHERE user_id = 'X123'
  AND X conditions
  AND user_input IN (
    ('id', '12344556'),
    ('id', '789954334'),
    ('id', '343432443'),
    ('id', '455543545')
    -- ... 50k entries
);
Also, there are other AND conditions for validating that the user_input values are valid entries.
Sample Java code:
String getConditionedQuery =
        "SELECT user_backend_id FROM user_table WHERE user_id = 'X123' AND X complex conditions";
List<String> userInputList = request.getInputList();
// Pseudocode: runs the conditioned query and collects the user's backend IDs
List<String> userDBList = sqlStatement.execute(getConditionedQuery);
boolean validData = userDBList.containsAll(userInputList);
The SQL query with the composite IN condition takes around 70-90 seconds in lower environments; however, the Java code with containsAll looks much faster.
Incidentally, I don't want to use a temp table and execute a procedure, because again getting the bulk input into the DB is a hassle. I am using the ATG framework and the module is RESTful, so performance matters most here.
I personally believe that you should apply all filters on the database side, for many reasons. First, exchanging that much data over the network will consume unnecessary bandwidth. Second, bringing all that data into the JVM and processing it will consume more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA, give him the query, and ask him to run an analysis. The analysis will tell you if you need to add any indexes to optimise your query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod: although the prod machines are much faster, the amount of data in prod is much, much higher than in SIT, so it will take longer. But that does not mean you should haul the data over the network and process it in the JVM. Plus, a JVM's heap memory is much smaller than the database's memory.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is charged for. For example, if your application is in the cloud and the database is on premises, imagine the amount of data you will move back and forth just to finally filter 10 rows out of a million.
I recommend that you write a good query, optimise it, and process as many conditions as possible on the database side. Hope it helps!
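For the "run an analysis" part, this is roughly what pulling an execution plan from Oracle over JDBC could look like; the statement below is a placeholder, not the real query, and the connection details are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password");
             Statement st = con.createStatement()) {

            // Ask Oracle to compute (not run) the plan for the slow query.
            st.execute("EXPLAIN PLAN FOR "
                    + "SELECT COUNT(*) FROM user_table WHERE user_id = 'X123'");

            // DBMS_XPLAN.DISPLAY prints the plan that was just generated;
            // look for full table scans or missing index usage here.
            try (ResultSet rs = st.executeQuery(
                    "SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}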
In general it's a good idea to push as much of the processing as possible to the database. Even though it might look like a bottleneck, it is generally well optimised and can work over large amounts of data faster than you would.
For read queries like the one you're describing, you can even offload the work to read replicas, so it doesn't overwhelm the master.
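If you do keep the validation in the database, one middle ground (not something from the question, just a sketch) is to check the input IDs in chunks so no single statement carries a 50k-entry IN list; Oracle caps a simple IN list at 1,000 expressions anyway. The table and column names below come from the question, but the plain single-column IN, the chunk size, and the assumption that the input IDs are de-duplicated are mine.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

public class ChunkedValidation {

    private static final int CHUNK_SIZE = 1000; // Oracle limits a simple IN list to 1000 entries

    public static boolean allEntriesBelongToUser(Connection con,
                                                 String userId,
                                                 List<String> userInputList) throws Exception {
        for (int from = 0; from < userInputList.size(); from += CHUNK_SIZE) {
            List<String> chunk = userInputList.subList(
                    from, Math.min(from + CHUNK_SIZE, userInputList.size()));

            // Build "?, ?, ?" with one placeholder per ID in this chunk.
            StringBuilder placeholders = new StringBuilder();
            for (int i = 0; i < chunk.size(); i++) {
                placeholders.append(i == 0 ? "?" : ", ?");
            }
            String sql = "SELECT COUNT(DISTINCT user_input) FROM user_table "
                    + "WHERE user_id = ? "            // plus the other X conditions
                    + "AND user_input IN (" + placeholders + ")";

            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, userId);
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setString(i + 2, chunk.get(i));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    // If any ID in the chunk is missing, the count comes up short.
                    if (rs.getInt(1) < chunk.size()) {
                        return false;
                    }
                }
            }
        }
        return true;
    }
}

Only a count ever crosses the network, and each statement stays small enough for the optimiser to handle comfortably.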
Specifically, I'm using MySQL v5.5.41 and performing the inserts using Java JDBC (the driver is mysql-connector-java-5.1.30), although I don't know if the driver I'm using is relevant.
I'm running a once-off application to insert a reasonably large number of rows across 7 tables. Each of my "entries" (rows relating to the same data) consists of a variable number of rows across 6 of the tables, and a single row in the other table that relates to the others (approx. 10-20 rows to be inserted across all 7 tables per "entry", but occasionally there might be significantly more).
I'm wrapping each "entry" insert inside a transaction that is committed after all the rows for the entry have been inserted.
My question is whether or not it is necessary to batch the row inserts into each of the tables that usually require multiple row inserts, e.g. using addBatch() and executeBatch() in Java.
For example, if I was to call the executeUpdate() function for every row insert in a table (no batching), does the JDBC library I'm using optimize and ultimately only issue a single multi-valued insert when I commit the transaction later? Or, if there is optimization in these circumstances, maybe it's carried out by MySQL itself?
There will still be multiple statements - and thus multiple in-flight requests. Using transactions does not affect how statements are executed (or batched for such execution).
Transactions occur entirely within the MySQL engine itself. Using a transaction is a good step, and it greatly helps performance, mainly because commits (and the associated data flushes/syncs) are reduced.
For a low-latency connection the performance will be equivalent. However, batching can still play a factor on higher-latency connections, because the individual statements must still be round-tripped to the server. (E.g. with a 5 ms round trip, at most 200 statements can be executed per second.)
In any case, the 'definitive performance answer' is a benchmark under the specific load/task/configuration.
Batching is very important, regardless of transactions.
In many tests, I have seen about 10x (not 10%) speed up when doing a single INSERT with 100 rows instead of 100 1-row INSERTs. (For "same machine", latency is low, but not zero.)
Think of all the overhead of a statement -- network latency, process swapping, parsing, and many mutexes.
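As a concrete illustration, here is a minimal sketch that combines one transaction per "entry" with addBatch()/executeBatch(); the entry_detail table and its columns are made up. With Connector/J, adding rewriteBatchedStatements=true to the JDBC URL additionally lets the driver rewrite the batch into multi-row INSERTs, which is typically where the big speedup comes from.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedEntryInsert {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO entry_detail (entry_id, value) VALUES (?, ?)")) {

            con.setAutoCommit(false);          // one transaction per "entry"

            for (int i = 0; i < 20; i++) {     // the ~10-20 rows of one entry
                ps.setLong(1, 42L);
                ps.setString(2, "detail-" + i);
                ps.addBatch();                 // queue the row instead of sending it now
            }
            ps.executeBatch();                 // one (or a few) round trips for all rows
            con.commit();                      // flush/sync once for the whole entry
        }
    }
}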
I have an SQL table that consists of one million rows. Let's say a user table.
What type of software do I need in order to handle 10 reads/writes every second? I was thinking of using a Java NIO server to handle the connections.
But how does the back-end Database work? Could I simply use MySQL on the same computer?
Any insight would be great: links, reading, examples, books?
I know SQL. I have done a lot of SQLite work but never created a scalable system to handle this kind of load.
Edit: update regarding helios' comment
How many reads vs. writes? 50/50.
Do you need up-to-date reads (no delay)? Yes.
How big is each item? 10% have 10-15 columns and the rest have 1-3 columns.
Are you accessing them individually? No; none of the user threads interact, but there can be simultaneous DB reads/writes on the same row (just make it synchronized?).
So you need 10 transactions/second on a table with a million rows.
That is really neither a huge data set nor high performance.
MySQL (currently 5.5+, InnoDB engine), running on a single server, can easily handle that.
You may want to read the first five chapters of 'High Performance MySQL', published by O'Reilly.
For a NoSQL DB, I suggest MongoDB; see http://www.mongodb.org/
If you make use of JDBC connection pooling (like C3P0, DBCP, etc.) you would be able to have 10 (or more) threads inserting data in parallel. Your limit would then be your platform resources (memory, I/O, etc.).
All this holds, however, only if the insertion work can itself run in parallel threads (i.e. you do not have a specific requirement to insert records sequentially) and if what you are doing are simple inserts, not something complex that locks the table or makes other transactions wait.
Also consider using JDBC prepared statements, and committing in batches rather than after each record; this would speed things up greatly.
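To make that concrete, here is a rough sketch of parallel inserts through a c3p0 pool with prepared statements and batched commits; the users table, pool size, and thread count are arbitrary assumptions, not something from your setup.

import com.mchange.v2.c3p0.ComboPooledDataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInserts {
    public static void main(String[] args) throws Exception {
        ComboPooledDataSource pool = new ComboPooledDataSource();
        pool.setDriverClass("com.mysql.jdbc.Driver");
        pool.setJdbcUrl("jdbc:mysql://localhost:3306/mydb");
        pool.setUser("user");
        pool.setPassword("password");
        pool.setMaxPoolSize(10);               // one pooled connection per worker

        ExecutorService workers = Executors.newFixedThreadPool(10);
        for (int t = 0; t < 10; t++) {
            final int worker = t;
            workers.submit(() -> {
                // Each thread borrows its own connection from the pool.
                try (Connection con = pool.getConnection();
                     PreparedStatement ps = con.prepareStatement(
                             "INSERT INTO users (name) VALUES (?)")) {
                    for (int i = 0; i < 1000; i++) {
                        ps.setString(1, "user-" + worker + "-" + i);
                        ps.addBatch();
                    }
                    ps.executeBatch();         // batched, as suggested above
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
        pool.close();
    }
}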
We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.
Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow just because of the sheer amount of data.
A possible solution that I'm currently investigating is to replicate the data in a searchable cache. Now this cache can be in the database (e.g. a materialized view) or it could be outside the DB (a NoSQL approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching it outside the database.
I've created a proof of concept, and indeed, searching in my cache is faster than in the DB. However, the initial full replication takes a long time to complete. Although the full replication will happen just once, and subsequent replications will be incremental, covering only what changed since the last run, it would still be great if I could speed up the initial full replication.
However, during full replication, aside from the slowness of the query's execution, I also have to battle against network latency. In fact, I can deal with the slow query execution time. But the network latency is really really slowing the replication down.
Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each doing its own query? Should I use a scrollable result set?
Replicating the data in a cache seems like replicating the functionality of the database.
From reading other comments, I see that you are not doing this to avoid network roundtrips, but because of costly joins. In many DBMS you can create temporary tables - like this:
CREATE TEMPORARY TABLE abTable AS SELECT * FROM a , b ;
If a and b are large (relatively permanent) tables, then you will have a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, then the subsequent per query cost will be much smaller than
SELECT name, city, ... , FROM a , b ;
Other database systems have a view concept which lets you do something like this
CREATE VIEW abView AS SELECT * FROM a , b ;
Changes in the underlying a and b table will be reflected in the abView.
If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.
A good database management system should be able to handle your data needs. So why reinvent the wheel?
SELECT * FROM YOUR_TABLE
Map results into an object or data structure
Assign a unique key for each object or data structure
Load the key and object or data structure into a WeakHashMap to act as your cache.
I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?
Be sure to think about thread safety.
I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.
How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.
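As a rough sketch of those steps (with a made-up User class, made-up column names, and a numeric id as the unique key - adjust to your actual schema):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class UserCache {

    // synchronizedMap gives basic thread safety; entries can be garbage
    // collected once nothing else holds a strong reference to their key,
    // which is what makes WeakHashMap memory-sensitive.
    private final Map<Long, User> cache =
            Collections.synchronizedMap(new WeakHashMap<>());

    public void load(Connection con) throws Exception {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name, city FROM users")) {
            while (rs.next()) {
                // Map the row into an object and assign its unique key.
                User u = new User(rs.getLong("id"), rs.getString("name"), rs.getString("city"));
                cache.put(u.id, u);
            }
        }
    }

    public User get(long id) {
        return cache.get(id);                  // O(1) lookup by key
    }

    static class User {
        final long id;
        final String name;
        final String city;
        User(long id, String name, String city) {
            this.id = id; this.name = name; this.city = city;
        }
    }
}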
The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code, it's not a move from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will cause problems.
What can I do to get the maximum speed (INSERT per second) possible?
Database: MS SQL 2008. Application: Java-based, using Microsoft JDBC driver.
Batch the inserts. That is, only send 1000 rows at a time, rather than one row at a time, so you hugely reduce round trips/server calls.
See "Performing Batch Operations" on MSDN for the JDBC driver. This is the easiest method that doesn't require re-engineering to use genuine bulk methods.
Each insert must be parsed, compiled, and executed. A batch will mean a lot less parsing/compiling, because 1000 (for example) inserts will be compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding an index - some indexes (most notably one on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
Have you looked into bulk operations?
Have you considered to use batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure that you can afford the consequences.
Look into SQL Server's bcp utility.
This would mean a big change in your approach in that you'd be generating a delimited file and using an external utility to import the data. But this is the fastest method for inserting a large number of records into a SQL Server DB and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation you have to perform or something that will occur on a regular basis? If it's one time I would suggest not even coding this process but performing an export/import with a combination of db utilities.
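If you go down that road, the Java side only has to produce the delimited file; here is a minimal sketch (the record fields and the pipe delimiter are assumptions - pick a delimiter that can't appear in your data):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class BulkFileExport {

    public static void export(List<String[]> records, String path) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get(path), StandardCharsets.UTF_8)) {
            for (String[] record : records) {
                out.write(String.join("|", record)); // one row per line, pipe-separated
                out.newLine();
            }
        }
    }
}

The resulting file can then be loaded with something like bcp in character mode (-c) with a matching field terminator, or a BULK INSERT statement with FIELDTERMINATOR = '|'.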
I would recommend using an ETL engine for it. You can use Pentaho; it's free. ETL engines are optimized for bulk-loading data and for any forms of transformation/validation that are required.