I'm using Logstash to import a lot of data from a MySQL database. I'm facing a java.lang.OutOfMemoryError: Java heap space error.
After a few searches I found out a good solution is to enable paging (here) and set jdbc_page_size.
In the current conf I use, the jdbc_fetch_size is already set.
Now, my questions are:
What are the differences between those two parameters?
Do they interact? In an opposite or similar way?
The information I have gathered so far:
This topic, asking whether:
jdbc_page_size determines the number of rows that are fetched from database per trip
And this one, about hibernate, stating that:
hibernate.jdbc.fetch_size determines the number of rows fetched when there is more than a one row result on select statements
And, of course, the Logstash documentation.
Can anyone clarify it?
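For what it's worth, the Logstash documentation describes the two settings as working at different levels. With paging enabled, the statement is broken up into multiple queries that use limits and offsets, and jdbc_page_size is the limit. jdbc_fetch_size is passed to the JDBC driver as the fetch size for whichever query is running. Below is a minimal Java sketch of that distinction. The pagedQueries helper is hypothetical, and the subquery-wrapping form is only an assumption about the shape of the generated SQL, not the exact text Logstash emits.

```java
import java.util.ArrayList;
import java.util.List;

final class PagingSketch {
    // Hypothetical helper: splits one statement into LIMIT/OFFSET pages,
    // roughly what enabling paging with a page size does.
    static List<String> pagedQueries(String sql, long totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> pages = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            pages.add("SELECT * FROM (" + sql + ") AS t1 LIMIT " + pageSize + " OFFSET " + offset);
        }
        return pages;
    }

    public static void main(String[] args) {
        // 250,000 rows with a page size of 100,000 become three separate queries.
        for (String query : pagedQueries("SELECT * FROM events", 250_000, 100_000)) {
            // Within each query, the fetch size (Statement.setFetchSize) is only a
            // hint to the driver about how many rows to pull per round trip.
            System.out.println(query);
        }
    }
}
```

Read this way, the two settings are not opposites: the page size bounds how many rows a single query can return, while the fetch size influences how those rows travel from the server to the client.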
Related
I recently had an interview and was asked the following question.
We have a table employee(id, name). In our Java code, we write logic to fetch data from this table and display it in the UI. The query is
Select id,name from employee
The issue was that, during debugging, we found that the JDBC call that fires this query and returns the result takes, say, 20 seconds, and we want to reduce this to, say, 5 seconds, or to the optimal time. How can we do that, or how would I tackle this problem?
As there is no WHERE clause in the query, I didn't suggest indexing the column.
As this logic takes 20 seconds every time, some other code holding a lock on this table is also out of the question.
I suggested that limiting the number of records fetched from the table should help, but the interviewer didn't look convinced.
Is there anything else we can do as developers to optimize the call? I guess a DBA might tune the database settings to improve the performance of this query, but is there any other way?
OK, so this is an interview question, so both the problem and the solutions are hypothetical. The interviewer is asking for possible optimizations and / or approaches. Here are some that are most likely to help:
Modify the query to page the data rather than fetching the whole lot. This looks applicable for the example query. Note that this is not just "limiting the number of rows selected from the table" ... which is probably why the interviewer looked doubtful when you said that!
If you do need to display the entire selected record set but in a reduced form (e.g. summed, averaged, sorted, collated etc), do the reduction in the query rather than by fetching the records and doing it in the client.
Tune the fetchSize() as suggested by Ivan.
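The paging idea from the first point can be sketched as a loop that asks for one page at a time, using the last id seen as the starting point of the next page. This is only an illustration: forEachPage and Employee are hypothetical names, ids are assumed to be positive, and the database call is hidden behind a function argument so the loop can run without a database.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

final class KeysetPagingSketch {
    // One row of the employee(id, name) table from the question.
    static final class Employee {
        final long id;
        final String name;
        Employee(long id, String name) { this.id = id; this.name = name; }
    }

    // fetchPage stands in for a JDBC call running something like:
    //   SELECT id, name FROM employee WHERE id > ? ORDER BY id LIMIT ?
    // Returns the total number of rows handed to the handler.
    static long forEachPage(BiFunction<Long, Integer, List<Employee>> fetchPage,
                            int pageSize,
                            Consumer<List<Employee>> handler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long lastId = 0L; // assumption: ids are positive
        long total = 0;
        while (true) {
            List<Employee> page = fetchPage.apply(lastId, pageSize);
            if (!page.isEmpty()) {
                handler.accept(page);
                total += page.size();
                lastId = page.get(page.size() - 1).id;
            }
            if (page.size() < pageSize) {
                return total; // a short page means nothing is left
            }
        }
    }
}
```

In a UI, the handler would typically render a single page and fetch the next one only when the user asks for it, so the first screen does not wait for the whole table.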
Here are some other ideas that are less likely to help and / or will require extensive reworking.
Look at the network configs. For example, you may be able to get better throughput by tuning TCP buffers at the OS level, or by optimizing physical or virtual network paths.
Run the query on the database server itself (to eliminate network overheads)
Use an in-memory table
Query a secondary database server; e.g. a readonly snapshot or a slave
You can try to increase fetchSize() for Statement/PreparedStatement to decrease the number of network round trips between the application server/desktop and the database server.
You can start several threads that each query a portion of the data, and then merge the data from all the threads.
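A rough sketch of the multi-threaded idea, assuming the table can be split by id range. The fetchInParallel name is hypothetical, and the range fetcher is passed in as a function so the example runs without a database. In real code each task would use its own connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

final class ParallelFetchSketch {
    // fetchRange stands in for a JDBC call on its own connection, e.g.
    //   SELECT id, name FROM employee WHERE id >= ? AND id < ?
    static <T> List<T> fetchInParallel(long minId, long maxIdExclusive, int threads,
                                       BiFunction<Long, Long, List<T>> fetchRange) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long span = maxIdExclusive - minId;
            long chunk = (span + threads - 1) / threads; // ceiling division
            List<Future<List<T>>> futures = new ArrayList<>();
            for (long from = minId; from < maxIdExclusive; from += chunk) {
                final long lo = from;
                final long hi = Math.min(from + chunk, maxIdExclusive);
                futures.add(pool.submit(() -> fetchRange.apply(lo, hi)));
            }
            List<T> merged = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                try {
                    merged.addAll(future.get()); // submission order keeps id ranges in order
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps depends on where the time goes: if the database or the network is already saturated, extra threads may not make the call faster.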
EDIT: doesn't apply to this situation because id and name are the only columns on this table, but still useful for other readers to note.
If you create an index covering both id and name, then the database can use that index to read the data faster, since it won't even have to read the table.
See this link for a more thorough explanation.
if the index contains all the columns you’re requesting it doesn’t even need to look in the table. That concept is known as index coverage.
The following query generated by hibernate takes 13+ seconds and locks the table:
SELECT COUNT(auditentit0_.audit_id) AS col_0_0_ FROM Audit auditentit0_ WHERE 1=1;
The growing Microsoft SQL server database table contains 90+ million rows.
For Microsoft SQL server, I have found an accurate meta data way of getting the same information very quickly.
However, I would rather not write custom code for Microsoft sql server and oracle (the next database) if hibernate has a way of getting this information.
Here is an example meta data query for Microsoft sql server that is accurate and almost instant:
SELECT SUM (row_count) FROM sys.dm_db_partition_stats WHERE object_id=OBJECT_ID('huge_audit_table') AND (index_id=0 or index_id=1);
Is there a way to have hibernate issue a similar query for a table row count?
One posted answer has indicated that a view could be of use. I'm investigating this post to see if it can solve the issue:
https://vladmihalcea.com/map-jpa-entity-to-view-or-sql-query-with-hibernate/
In hibernate you should use projections like in the link you provided in order to guarantee that it works on multiple dbms:
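import org.hibernate.Criteria;
import org.hibernate.criterion.DetachedCriteria;
import org.hibernate.criterion.Projections;

// Runs a portable row count for the given criteria via a rowCount projection.
protected Long countByCriteria(DetachedCriteria criteria) {
    Criteria crit = criteria.getExecutableCriteria(getSession());
    crit.setProjection(Projections.rowCount());
    return (Long) crit.uniqueResult();
}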
What engine are you using in MySQL? I never had a blocking problem with row counts in MySQL or Oracle. Maybe the following link will help you: Any way to select without causing locking in MySQL?
Also, after some quick reading, I see that SQL Server does indeed block on count.
Maybe you could use a stored procedure or some other mechanism to pass the problem to the dbms.
Edit:
Projections in Hibernate are used to select the columns to fetch, the columns to group elements by, and to use built-in aggregate functions (sum, count, avg, max, min, countDistinct).
It helps you keep your application database-agnostic. Remember that Hibernate supports around 30 databases.
In your case you have a specific problem with MSSQL, as the count blocks the table, prioritizing accuracy. Using the system views is really quick, as you get an estimate, but it isn't standard.
You could encapsulate the problem in a DBMS-dependent view or stored procedure. Or maybe you could try a NOLOCK hint or READ UNCOMMITTED in Hibernate (for a count on an audit table it should be acceptable).
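If a view or stored procedure is not an option, another way to keep the DBMS-dependent part in one place is a small helper that picks the count statement by database product. This is a sketch only: rowCountSql is a hypothetical name, the product string "Microsoft SQL Server" is assumed to match what DatabaseMetaData.getDatabaseProductName() reports for your driver, and the SQL Server branch simply reuses the metadata query from the question.

```java
final class RowCountSqlSketch {
    // Returns the SQL used to count the rows of a table for a given database product.
    static String rowCountSql(String productName, String table) {
        // The table name is concatenated into the SQL, so only plain identifiers are accepted.
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        if ("Microsoft SQL Server".equals(productName)) {
            // Metadata-based count taken from the question.
            return "SELECT SUM(row_count) FROM sys.dm_db_partition_stats"
                 + " WHERE object_id = OBJECT_ID('" + table + "')"
                 + " AND (index_id = 0 OR index_id = 1)";
        }
        // Portable fallback for every other database.
        return "SELECT COUNT(*) FROM " + table;
    }
}
```

The resulting string could then be run as a native query, leaving the rest of the application on the portable projection.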
To solve this particular problem we stepped back and changed how the UI functions. Through a collaborative effort between UIX and UI developers we agreed that unfiltered queries will NOT ask for total counts. The initial screen load will show only a page full of data. No "page 1 of 60,000" controls will exist. Only when the user enters specific criteria will the total count come into play. Those queries should be very fast. Now... it is still possible for the user to set up a query that will be just as bad as the original problem. It should be the exception rather than the norm.
So there really is not a solid answer for the OP. If you are faced with this type of problem, if you have control of the UI and API, then it is time to rethink the solution. Think of how google handles paging from a UI perspective. The days of showing a "page 1 of (XX)" are gone IMHO.
Every database I've ever seen has a method for retrieving the count of the query prior to actually executing it. But I can't figure how to do this simple task in Accumulo.
Just for clarity, I want the Accumulo analog of this Mongo feature.
I checked the Scanner apidocs but I can't find anything. I'm using Java but answers for other languages would be greatly helpful too.
Accumulo is a lower-level application than a traditional RDBMS. It is based on Google's Bigtable and is not like a relational database. It's more accurately described as a massive parallel sorted map than a database.
It is designed to do different kinds of tasks than a relational database, and its focus is on big data.
To achieve the equivalent of the MongoDB feature you mentioned in Accumulo (to get a count of the size of an arbitrary query's result set), you can write a server-side Iterator which returns counts from each server, which can be summed on the client side to get a total. If you can anticipate your queries, you can also create an index which keeps track of counts during the ingest of your data.
Creating custom Iterators is an advanced activity. Typically, there are important trade-offs (time/space/consistency/convenience) to implementing something as seemingly simple as a count of a result set, so proceed with caution. I would recommend consulting the user mailing list for information and advice.
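The second option, keeping counts up to date during ingest, can be illustrated without any Accumulo API at all. The sketch below is plain Java with hypothetical names (IngestCountSketch, recordIngest, count). In a real deployment the counts would presumably live in a separate Accumulo table rather than in an in-memory map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class IngestCountSketch {
    // One counter per anticipated query key, e.g. "color=red".
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Called once per ingested entry, with the key the anticipated query filters on.
    void recordIngest(String queryKey) {
        counts.computeIfAbsent(queryKey, k -> new LongAdder()).increment();
    }

    // Answers the count from the index, without scanning the data.
    long count(String queryKey) {
        LongAdder adder = counts.get(queryKey);
        return adder == null ? 0L : adder.sum();
    }
}
```

The trade-off is the one mentioned above: the count is cheap to read, but only for queries that were anticipated at ingest time, and it has to be kept consistent with deletes and updates.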
I am going to generate simple CSV file report in Java using Hibernate and MySQL.
I am using the native SQL part of Hibernate to fetch the data (the query is too complex for HQL or a Criteria query, though that doesn't matter here) and simply writing it out with one of the CSVWriter APIs (which one doesn't matter here either).
So far all is well, but this is where the problem starts.
Requirements:
The report can contain 5000K to 15000K records with 25 fields.
It can be run in real time.
There is one report column (let's say finalValue) that I want to sort on, and it is computed like this: (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
MySQL indexing cannot be used for the finalValue column (mentioned above), as it is a complex combination of aggregate functions. So if I execute the query (with or without a limit) with sorting, it takes 40sec; otherwise it takes 0.075sec.
The Solutions:
These are some solutions that I can think of, but each has some limitations.
Sorting using java.util.TreeSet: It will throw an OutOfMemoryError, which is to be expected, as the heap space will be exceeded if I put 15000K heavy objects into it.
Using limit in the MySQL query and writing to the file on each iteration: It will take a long time, as every query takes about the same time, around 50sec, because the limit can't be applied without sorting.
So the main problem here is to overcome two constraints: memory and time. I need to balance both of them.
Any ideas, suggestions?
NOTE: I have not included any code snippets here, but that doesn't mean the question lacks detail. Code is not required here.
I think you can use a streaming ResultSet here, as documented on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.
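Putting the streaming statement and the CSV writing together might look roughly like the sketch below. The class and method names (StreamingCsvSketch, export, csvField) are hypothetical, the quoting rules are a simple assumption rather than those of any particular CSVWriter library, and the fetch-size signal is the MySQL driver convention quoted above.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class StreamingCsvSketch {
    // Quotes a field when it contains a comma, a quote, or a line break.
    static String csvField(Object value) {
        if (value == null) {
            return "";
        }
        String s = value.toString();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    // Streams the result of sql row by row into out and returns the number of rows written.
    static long export(Connection conn, String sql, Writer out) throws SQLException, IOException {
        try (Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE); // MySQL driver signal: stream, do not buffer
            try (ResultSet rs = stmt.executeQuery(sql)) {
                int columns = rs.getMetaData().getColumnCount();
                long rows = 0;
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= columns; i++) {
                        if (i > 1) {
                            line.append(',');
                        }
                        line.append(csvField(rs.getObject(i)));
                    }
                    out.write(line.append('\n').toString());
                    rows++;
                }
                return rows;
            }
        }
    }
}
```

Only one row is held in memory at a time, which addresses the heap problem. Because of the caveats above, the connection cannot be used for other queries until the export finishes, so it is probably worth giving it a dedicated connection and wrapping the Writer in a BufferedWriter.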
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page
I think you can set the read_rnd_buffer_size value, about which the docs say:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which the docs say the following:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is innodb_buffer_pool_size, which allows InnoDB to keep as much table data in memory as possible and avoid some disk seeks.
However, all of these variables require some tuning, some trial and error, and probably some kind of benchmarking to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.
Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select from the temporary table, ordered by finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that when the client session terminates, the temporary data is automatically deleted).
Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.
I am accessing a MySQL table that has over 1 million records. I am using MySQL Query Browser, which is unable to fetch all the records and breaks the connection partway through.
Now I have to write a Java program that accesses that table without the connection breaking partway through, as this table will be modified and accessed frequently.
Can you experts suggest how I should approach this problem?
Should I create an index on the table, and if so, how do I create one?
There are different reasons why a MySQL connection might break during a query. Can you give the exact error message you receive?
A simplified explanation of how to add an index to the table for a simple query:
Look at the field(s) in the WHERE clause of the query.
Add an index on the field(s) using ALTER TABLE ADD INDEX.
Use EXPLAIN on the query and check whether the query is actually using the index.
If you want more specific help, post the SHOW CREATE TABLE output and the EXPLAIN of your query.
MySQL Query Browser limits the number of records displayed for performance reasons, because it is an interactive program and nobody likes to wait for half an hour before the program crashes with an out-of-memory error. You can change these limits in the settings.
Your Java program will face similar problems.
When using large datasets it is important to plan how you are going to access that dataset and create the necessary indexes.
It would be useful to edit the question to show the structure of the data. Generally, creating an index looks like this:
CREATE INDEX idx_customer_name ON customer (name);
Here are more details
If you just want to dump the data to work on the data using Excel you can try this on the commandline
mysqldump -u [username] -p -t -T/path/to/directory [database] --fields-enclosed-by=\" --fields-terminated-by=,
In my experience this is a very painful exercise, as Excel really is not made to deal with this number of rows, and the dump format usually is slightly, but infuriatingly, incompatible.
Your best bet is to invest an hour of your time going through a SQL tutorial like sql fundamentals and playing with MySQL Query Browser to get a feel for what you can do with SQL. I guarantee your investment will have paid for itself by tomorrow.
I am not very familiar with MySQL programming, but generally indexes are used to arrange the values of one or more columns in a database table in a specific order.
SYNTAX
CREATE INDEX IndexName ON tableName (column);
Go through this reference page for more information:
http://dev.mysql.com/doc/refman/5.0/en/create-index.html