I have a web application that uses SQL Server. In the database I have one big table (3 GB); the whole database is about 4 GB. The problem is that a query against another table (not the big one) is sometimes very slow. Sometimes the query takes a few seconds, but sometimes the same query takes several minutes.
My question is: can one big table slow down queries against another table?
Because I am using SQL Server 2008 Express Edition, with its limits of 1 GB RAM and 10 GB database size, could that be the problem? Would changing the SQL Server edition solve it? There are about 50 users using the application at any given time.
In general, the simple existence of table A should not have any effect on queries against table B that do not involve table A. That said, if application X is querying table B and at the same time application Y is querying table A, and the query against A takes a lot of work, then that can slow down the query against table B, because the server only has so much power.
I can think of ways in which the existence of a large table could slow down queries against another table. For example, if the disk is fragmented with parts of small table B, then big table A, then more of small table B, any access against B has to move across larger sections of the hard drive. But wow, I doubt this would be a big issue.
I suppose there could be background processes, like accumulating table statistics for the optimizer, that would kick in on the big table just as you are running a query against the little table. Maybe someone with more knowledge of the internals could weigh in on a question like that.
It could be RAM related: SQL Server caches data in RAM, and if a very large table gets cached, that can come at the expense of other tables, which are then not cached and are therefore slower to access from disk.
This is just a theory, but you might want to try out the queries in
https://www.mssqltips.com/sqlservertip/2393/determine-sql-server-memory-use-by-database-and-object/
How much RAM have you allocated to SQL Server, and how many other databases/workloads are running on the same instance?
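For a quick look, something along these lines can help; this is only a minimal sketch using the sys.dm_os_buffer_descriptors DMV (available from SQL Server 2005 onwards), the connection details are placeholders, and the linked tip has more detailed queries.

import java.sql.*;

// Minimal sketch: how much of the buffer pool each database currently occupies.
// The connection string and credentials are placeholders.
public class BufferUsageByDatabase {
    public static void main(String[] args) throws SQLException {
        String sql =
            "SELECT DB_NAME(database_id) AS db_name, "
          + "       COUNT(*) * 8 / 1024 AS cached_mb "      // buffer pages are 8 KB each
          + "FROM sys.dm_os_buffer_descriptors "
          + "GROUP BY database_id "
          + "ORDER BY cached_mb DESC";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=master", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%-30s %6d MB%n", rs.getString("db_name"), rs.getInt("cached_mb"));
            }
        }
    }
}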
Related
I have a scenario where a user selects a bulk input of up to 100K entries, and I need to validate that this data belongs to the user and satisfies other X conditions. Should I use a complex Oracle SQL query with a composite IN (id, column) to validate it, or
should I fetch the data for this user that satisfies the conditions into application memory and use List.containsAll, i.e. first get all the data (with all the other conditions) for this particular user, populate it into a dbList, and then validate with dbList.containsAll(inputList)?
Which one will be better performance-wise: the composite IN in the DB with the bulk input, or fetching the data and validating it with containsAll?
I tried running the SQL query in the SIT environment; it takes around 70-90 seconds, which is too bad. It would be better in prod, but I still feel the database has to sift through a huge amount of data even though it is indexed by user ID.
In the DB I am using COUNT(*) with IN, like below.
SQL query:
select count(*) from user_table where user_id = 'X123' and X conditions and user_input IN (
('id','12344556'),
('id','789954334'),
('id','343432443'),
('id','455543545'),
------- 50k entries
);
There are also other AND conditions for validating that the user_input entries are valid.
Sample Java code:
List<String> userInputList = request.getInputList();
String getConditionedQuery = "select user_backend_id from user_table where user_id = 'X123' AND X complex conditions";
List<String> userDBList = queryForStringList(getConditionedQuery); // hypothetical helper that runs the query and collects column 1 into a List
boolean validData = userDBList.containsAll(userInputList);
The SQL query with the composite IN condition takes around 70-90 seconds in lower environments; however, the Java code with containsAll looks much faster.
Incidentally, I don't want to use a temp table and execute a procedure, because bulk-inserting the input into the DB is again a hassle. I am using the ATG framework and the module is RESTful, so performance is most important here.
I personally believe that you should apply all filters on the database side, for several reasons. First, exchanging that much data over the network consumes unnecessary bandwidth. Second, bringing all that data into the JVM and processing it consumes more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA, give him the query and ask him to run an analysis. The analysis will tell you whether you need to add any indexes to optimise your query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod: although the prod machines are much faster, the amount of data in prod is much higher than in SIT, so it will take longer. But that does not mean you should haul the data over the network and process it in the JVM. Besides, a JVM's heap is much smaller than the database's memory.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is charged for. For example, if your application is in the cloud and the database is on premise, imagine the amount of data you would move back and forth just to filter a million rows down to 10.
I recommend that you write a good query, optimise it, and apply as many conditions as possible on the database side. Hope it helps!
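If the composite IN list itself is what is slow, one option that still keeps all the filtering on the database side is to send the input in chunks and let the database return only counts. The sketch below is illustrative only: it uses a single-column IN for brevity (the composite form works the same way with (col1, col2) IN ((?, ?), ...)), assumes user_input is unique per user, and takes the table and column names from the question; the extra "X conditions" and the Connection are placeholders.

import java.sql.*;
import java.util.*;

// Sketch: validate the bulk input on the database side in chunks, so only counts
// travel back over the network. Assumes user_input is unique per user.
public class BulkInputValidator {

    private static final int CHUNK_SIZE = 1000; // Oracle allows at most 1000 expressions per IN list

    public static boolean allEntriesBelongToUser(Connection con, String userId,
                                                 List<String> userInputList) throws SQLException {
        for (int from = 0; from < userInputList.size(); from += CHUNK_SIZE) {
            List<String> chunk =
                userInputList.subList(from, Math.min(from + CHUNK_SIZE, userInputList.size()));

            // Build "?,?,...,?" placeholders for this chunk.
            String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
            String sql = "select count(*) from user_table "
                       + "where user_id = ? "                // plus the other "X conditions" from the question
                       + "and user_input in (" + placeholders + ")";

            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, userId);
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setString(i + 2, chunk.get(i));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    if (rs.getInt(1) != chunk.size()) {
                        return false; // at least one entry is missing or fails the conditions
                    }
                }
            }
        }
        return true;
    }
}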
In general it's a good idea to push as much of the processing as possible to the database. Even though it might actually look like a bottleneck, it is generally well optimised and can work over large amounts of data faster than you can.
For read queries like the one you're describing, you can even offload the work to read replicas, so it doesn't overwhelm the master.
I have a project fetching data from a DB2 database, and we have the following scenario on which I need quality input. Thanks in advance.
The current application fetches data from a table (let's say SALES) in a DB schema, say ORIGIN_X.
The same table, under a different name, exists in another schema, say ORIGIN_Y.
Both tables have more than 5 million records each and are growing.
Problem Statement
I want to merge the data from both schemas/tables to present a combined view in the UI without compromising performance.
No more than 200 records are shown in the UI, but scanning 5 + 5 = 10 million records degrades performance.
Solutions tried so far:
Created a logical view and tried to fetch the data from it, but the query performance is dead slow.
Thinking of an MQT (so that I can create an index on a column), DB2's equivalent of a materialized view; still in progress. A rough sketch of the DDL I have in mind is below.
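For reference, the rough MQT sketch I am experimenting with looks like this; it is only a sketch (I may well be missing MQT restrictions), the column names and the JDBC connection details are placeholders, and the column lists of the two tables are assumed to be compatible.

import java.sql.*;

// Rough sketch: a deferred-refresh MQT that unions the two schemas, refreshed once,
// with an index on the column the UI filters on. All names here are placeholders.
public class CreateCombinedSalesMqt {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:db2://dbhost:50000/MYDB", "user", "password");
             Statement st = con.createStatement()) {

            st.executeUpdate(
                "CREATE TABLE COMBINED_SALES AS ("
              + "  SELECT SALES_ID, SALES_DATE, AMOUNT FROM ORIGIN_X.SALES "
              + "  UNION ALL "
              + "  SELECT SALES_ID, SALES_DATE, AMOUNT FROM ORIGIN_Y.SALES_Y "   // the differently named table in ORIGIN_Y
              + ") DATA INITIALLY DEFERRED REFRESH DEFERRED");

            st.executeUpdate("REFRESH TABLE COMBINED_SALES");                    // populate the MQT
            st.executeUpdate("CREATE INDEX IX_COMBINED_SALES_DATE ON COMBINED_SALES (SALES_DATE)");
        }
    }
}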
Help needed
Are both of these approaches right for the problem statement? If yes, what is the best way to proceed with the MQT?
Is there a better approach than the above two?
Thoughts?
We have a relatively large table in an H2 database with up to 12 million rows. The table contains status information that a user needs to see on a web interface. The user is mainly only interested in the last couple of hundred or thousand entries, or the entries of the last n days. Of course it will sometimes also be necessary to query all entries, but we can assume that this happens seldom and may take its time. Our main problem is that we do not have a full-blown server as the target platform but a more embedded solution, and with tables of that size the embedded system takes a couple of seconds to respond, so the web UI (with AJAX etc.) feels sluggish.
To make the query faster we have already added indexes and tuned max_row_memory and caching. This makes the query impressively faster, but it is still not in the range where we would like it to be.
As I understand it, H2 flushes the table's cache whenever an INSERT/UPDATE/DELETE is performed on the table. A large part of the application depends on the last n rows, and I am searching for a way to always keep these n rows in cache, so that a SELECT for the last n rows is served from the cache even after a preceding INSERT.
As I did not find any solution in H2 directly, my first approach would be to implement the caching as a second level inside the application. That would be OK, but from a design point of view I find it more appealing to have it inside H2. Does anybody have an idea how I could solve this with H2?
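For illustration, the application-side fallback I have in mind looks roughly like the sketch below; StatusEntry, the status table and its columns are placeholders for our real schema, and the cache is simply reloaded after every write.

import java.sql.*;
import java.util.*;

// Rough sketch of a second-level cache for the newest rows, refreshed after each write.
public class LatestStatusCache {

    private final int limit;
    private volatile List<StatusEntry> latest = Collections.emptyList();

    public LatestStatusCache(int limit) { this.limit = limit; }

    /** Reload the newest rows; called once at startup and after every INSERT/UPDATE/DELETE. */
    public void refresh(Connection con) throws SQLException {
        String sql = "SELECT id, created_at, message FROM status ORDER BY id DESC LIMIT ?";
        List<StatusEntry> rows = new ArrayList<>(limit);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, limit);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new StatusEntry(rs.getLong("id"),
                                             rs.getTimestamp("created_at"),
                                             rs.getString("message")));
                }
            }
        }
        latest = Collections.unmodifiableList(rows); // swap atomically; readers never block
    }

    /** Served to the web UI without touching the database. */
    public List<StatusEntry> getLatest() { return latest; }

    public static final class StatusEntry {
        final long id; final Timestamp createdAt; final String message;
        StatusEntry(long id, Timestamp createdAt, String message) {
            this.id = id; this.createdAt = createdAt; this.message = message;
        }
    }
}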
H2 doesn't flush the cache on modification.
Your best bet is to run a profiler over your application to see where the time is going.
We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.
Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow simply because of the sheer amount of data.
A possible solution that I'm currently investigating is to replicate the data into a searchable cache. This cache can live in the database (e.g. a materialized view) or outside it (a NoSQL approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching it outside the database.
I've created a proof of concept, and indeed, searching in my cache is faster than in the db. However, the initial full replication takes a long time to complete. Although the full replication happens only once, and subsequent replications are incremental, covering only what has changed since the last run, it would still be great if I could speed up the initial full replication.
However, during full replication, aside from the slowness of the query's execution, I also have to battle against network latency. In fact, I can deal with the slow query execution time. But the network latency is really really slowing the replication down.
Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each doing a query? Should I use a scrollable result set?
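For illustration, the multi-threaded initial load I am considering looks roughly like the sketch below; the orders table, its numeric id column and the Cache interface are placeholders, and the cache implementation is assumed to be thread-safe.

import java.sql.*;
import java.util.concurrent.*;
import javax.sql.DataSource;

// Rough sketch: split the initial full load into id ranges and fetch them in parallel,
// so several result sets stream over the network at once.
public class ParallelInitialLoad {

    interface Cache { void put(long id, String payload); }   // must be thread-safe

    public static void load(DataSource ds, Cache cache, long minId, long maxId, int workers)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long span = (maxId - minId + workers) / workers;      // size of each id range, rounded up

        for (int w = 0; w < workers; w++) {
            final long from = minId + (long) w * span;
            final long to = Math.min(from + span - 1, maxId);
            pool.submit(() -> {
                String sql = "SELECT id, payload FROM orders WHERE id BETWEEN ? AND ?";
                try (Connection con = ds.getConnection();
                     PreparedStatement ps = con.prepareStatement(sql)) {
                    ps.setFetchSize(1000);                    // stream in batches instead of row by row
                    ps.setLong(1, from);
                    ps.setLong(2, to);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            cache.put(rs.getLong("id"), rs.getString("payload"));
                        }
                    }
                } catch (SQLException e) {
                    throw new RuntimeException("range " + from + "-" + to + " failed", e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}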
Replicating the data in a cache seems like replicating the functionality of the database.
From reading other comments, I see that you are not doing this to avoid network round trips, but because of costly joins. In many DBMSs you can create temporary tables, like this:
CREATE TEMPORARY TABLE abTable AS SELECT * FROM a, b;
If a and b are large (relatively permanent) tables, then you will have a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, then the subsequent per-query cost will be much smaller than
SELECT name, city, ... FROM a, b;
Other database systems have a view concept, which lets you do something like this:
CREATE VIEW abView AS SELECT * FROM a, b;
Changes in the underlying a and b tables will be reflected in abView.
If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.
A good database management system should be able to handle your data needs. So why reinvent the wheel?
SELECT * FROM YOUR_TABLE
Map results into an object or data structure
Assign a unique key for each object or data structure
Load the key and object or data structure into a WeakHashMap to act as your cache.
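A minimal sketch of those four steps, following the WeakHashMap suggestion (the table name, its columns and the Record class are placeholders):

import java.sql.*;
import java.util.*;

// Minimal sketch of the four steps above. With a WeakHashMap, the GC may drop
// entries whose keys are no longer strongly referenced anywhere else.
public class WeakRecordCache {

    public static final class Record {
        final long id; final String name; final String city;
        Record(long id, String name, String city) { this.id = id; this.name = name; this.city = city; }
    }

    // Synchronized wrapper for thread safety, as noted below.
    private final Map<Long, Record> cache = Collections.synchronizedMap(new WeakHashMap<>());

    /** Steps 1-4: select everything, map each row to an object, key it, and put it into the cache. */
    public void loadAll(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name, city FROM your_table")) {
            while (rs.next()) {
                long id = rs.getLong("id");                  // unique key for each object
                Record r = new Record(id, rs.getString("name"), rs.getString("city"));
                cache.put(id, r);
            }
        }
    }

    /** O(1) lookup by key, no sorting involved. */
    public Record get(long id) { return cache.get(id); }
}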
I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?
Be sure to think about thread safety.
I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.
How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, and then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading on different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and passing on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc., you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table you want to process (A) has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program so that it first fetches the last row processed by the other nodes and then adds an entry to the same table telling the other nodes which range of rows it is going to process (you can decide the size of that range). In our case, let's assume each node can process 1000 rows at a time. Node 1 reads table B and finds it empty, so it inserts a row ('Node1', 1000), announcing that it is processing the rows of A with primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along, finds that the first 1000 primary keys are already taken by another node, and inserts a row ('Node2', 2000), announcing that it is processing rows 1001 to 2000. Please note that access to table B must be synchronized, i.e. only one node can work on it at a time.
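A hedged sketch of the claiming step, simplified to a single high-water-mark row that is locked with SELECT ... FOR UPDATE so that only one node can claim a range at a time (table and column names are illustrative; work_claims is assumed to be seeded with one row where id = 1 and last_id = 0):

import java.sql.*;

// Sketch: each node claims the next block of 1000 rows by updating a single
// coordination row inside a transaction. Table and column names are illustrative.
public class RangeClaimer {

    private static final int BLOCK_SIZE = 1000;

    /** Returns {firstId, lastId} of the block this node has claimed from table A. */
    public static long[] claimNextRange(Connection con, String nodeName) throws SQLException {
        con.setAutoCommit(false);
        try {
            long lastProcessed;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT last_id FROM work_claims WHERE id = 1 FOR UPDATE")) {
                rs.next();                                   // row lock is held until commit
                lastProcessed = rs.getLong("last_id");
            }

            long from = lastProcessed + 1;
            long to = lastProcessed + BLOCK_SIZE;

            try (PreparedStatement ps = con.prepareStatement(
                     "UPDATE work_claims SET last_id = ?, node_name = ? WHERE id = 1")) {
                ps.setLong(1, to);
                ps.setString(2, nodeName);
                ps.executeUpdate();
            }

            con.commit();                                    // releases the lock; rows [from, to] now belong to this node
            return new long[] { from, to };
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}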
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and the changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database no amount of parallelism or clustering on the application side will make much difference.
So your best option would be to do the update in pure SQL if that is at all possible, or to use a stored procedure, so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on an application key.
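As an illustration of keeping the work server-side, a call like the one below moves no row data at all; the procedure name recalc_totals and its parameter are hypothetical, as are the connection details.

import java.sql.*;

// Illustration only: invoke a (hypothetical) stored procedure so the calculation
// runs entirely inside MySQL and no rows cross the network.
public class RunServerSideCalculation {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             CallableStatement cs = con.prepareCall("{CALL recalc_totals(?)}")) {
            cs.setDate(1, Date.valueOf("2024-01-01"));       // example parameter: process rows since this date
            cs.execute();
        }
    }
}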