My query returns 31,000 rows with 12 columns each, and each row contains roughly 8,000 characters (about 8 KB per row). Here is how I process them:
public List<MyTableObj> getRecords(Connection con) {
    List<MyTableObj> list = new ArrayList<MyTableObj>();
    String sql = "my query...";
    ResultSet rs = null;
    Statement st = null;
    try {
        con.setAutoCommit(false);
        st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(50);
        rs = st.executeQuery(sql);
        System.out.println("Before MemoryFreeSize = " + (double) Runtime.getRuntime().freeMemory() / 1024 / 1024 + " MB");
        while (rs.next()) {
            MyTableObj item = new MyTableObj();
            item.setColumn1(rs.getString("column1"));
            ... ...
            item.setColumn12(rs.getString("column12"));
            list.add(item);
        } // end loop
        // try to release some memory, but it's not working at all
        if (rs != null) rs.close();
        if (st != null) st.close();
        st = null;
        rs = null;
    }
    catch (Exception e) { // do something
    }
    System.out.println("After MemoryFreeSize = " + (double) Runtime.getRuntime().freeMemory() / 1024 / 1024 + " MB");
    return list;
} // end getRecords
If each row takes 8 KB of memory, 31k rows should take about 242 MB. After finishing the loop over the query results, my remaining free memory is only 142 MB, which is not enough to finish the rest of my processing.
I searched for solutions and tried setting my heap to 512 MB (-Xmx512m -Xms512m), and I also set the fetch size with setFetchSize(50).
I suspect the ResultSet is occupying too much memory, and that the results may be stored in a client-side cache. However, after I clean up those objects (st.close() and rs.close()), and even after I manually call the garbage collector with System.gc(), the free memory after the loop never increases (why?).
Let's just assume I cannot change the database design and that I need all the query results. How can I free more memory after processing?
P.S.: I also tried not using ResultSet.getString() and replaced it with a hard-coded String; after the loop I had 450 MB of free memory.
I found that, if I do:
// + counter to make the value different for each row, for testing purpose
item.setColumn1( "Constant String from Column1" + counter );
... ...
item.setColumn12( "Constant String from Column12" + counter );
counter++;
It used only around 60 MB of memory.
But if I do:
item.setColumn1( rs.getString("column1") );
... ...
item.setColumn12( rs.getString("column12") );
It used up to 380 MB of memory.
I already did rs.close(); and rs = null; (rs is the ResultSet instance), but this does not seem to help. Why is there such a big difference in memory usage between these two approaches? In both approaches I only pass in a String.
You should narrow down your queries: be more specific and, if necessary, add a LIMIT clause. Your Java process can't handle results that are too large.
If you need all the data you're getting in memory at the same time (so you can't process it in chunks), then you'll need to have enough memory for it. Try it with 1G of memory.
Forget calling System.gc(), that's not going to help (it will effectively be attempted before an OutOfMemoryError is thrown anyway).
I also noticed you're not closing the connection. You should probably do that as well (if you don't have a connection pool yet, set one up).
And of course you can use a profiler to see where the memory is actually going. What you're doing now is pure guesswork.
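Also note that Runtime.freeMemory() is measured against the currently committed heap (totalMemory()), not against the -Xmx limit, and the committed heap itself can grow during the run, so before/after comparisons of freeMemory() alone can be misleading. A snapshot that prints all three values is more informative; a minimal sketch:

Runtime rt = Runtime.getRuntime();
long mb = 1024 * 1024;
System.out.println("used  = " + (rt.totalMemory() - rt.freeMemory()) / mb + " MB"); // in-use heap
System.out.println("total = " + rt.totalMemory() / mb + " MB"); // committed heap
System.out.println("max   = " + rt.maxMemory() / mb + " MB");   // -Xmx limit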
I don't think many people will encounter this issue, but I still feel like posting my solution for reference.
Before, in my code, the query was:
String sql = "SELECT column1, column2 ... FROM mytable";
and the setter for MyTableObj is:
public void setColumn1(String columnStr) {
    this._columnStr = columnStr == null ? "" : columnStr.trim();
}
After the update:
What I changed is simply to do the trimming in the SQL query instead of in the Java code:
String sql = "SELECT TRIM(column1), TRIM(column2) ... FROM mytable";
public void setColumn1(String columnStr) {
    this._columnStr = columnStr == null ? "" : columnStr;
}
Using this updated code, it takes only roughly 100 MB of memory, a lot less than the previous usage (380 MB).
I still cannot give a valid reason why the Java trim consumes more memory than the SQL trim; if anyone knows the reason, please help me explain it. I would appreciate it a lot.
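One possible explanation, offered as an assumption rather than a confirmed cause: on JDKs before Java 7u6, String.trim() (like substring()) returned a String that shared the backing char[] of the original value, so every trimmed column could still pin the full untrimmed data of its row in memory, while the SQL TRIM sends only the short values to the client. If that is the cause, forcing a copy in the setter should behave much like the SQL TRIM version:

public void setColumn1(String columnStr) {
    // new String(...) copies only the trimmed characters, so on older JDKs the
    // large backing array of the original padded value is not retained.
    this._columnStr = columnStr == null ? "" : new String(columnStr.trim());
}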
After many tests, I found that it is simply the data volume. Each row takes 8 KB, and 31,000 rows take about 240 MB of memory. TRIM in the query only helps when the data is short.
Since the data is large and memory is limited, I can only limit my query results for now.
Related
Which one will give me better performance?
Use Java to simply loop over the values, append them to the SQL string, and execute the statement once? Note that a PreparedStatement is also used.
INSERT INTO tbl ( c1 , c2 , c3 )
VALUES ('r1c1', 'r1c2', 'r1c3'),
('r2c1', 'r2c2', 'r2c3'),
('r3c1', 'r3c2', 'r3c3')
Or use batch execution, as below?
String SQL_INSERT = "INSERT INTO tbl (c1, c2, c3) VALUES (?, ?, ?);";
try (
    Connection connection = database.getConnection();
    PreparedStatement statement = connection.prepareStatement(SQL_INSERT);
) {
    int i = 0;
    for (Entity entity : entities) {
        statement.setString(1, entity.getSomeProperty());
        // ...
        statement.addBatch();
        i++;
        if (i % 1000 == 0 || i == entities.size()) {
            statement.executeBatch(); // Execute every 1000 items.
        }
    }
}
I did a presentation a few years ago called Load Data Fast!, in which I compared many different methods of inserting data as fast as possible and benchmarked them.
LOAD DATA INFILE was much faster than any other method.
But other factors affect the speed, like the type of data, the hardware, and perhaps the load on the system from other concurrent clients of the database. The results I got only describe the performance on a MacBook Pro.
Ultimately, you need to test your specific case on your server to get the most accurate answer.
This is what being a software engineer is about. You don't always get the answers spoon-fed to you. You have to do some testing to confirm them.
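For reference, a LOAD DATA LOCAL INFILE call through JDBC might look roughly like the sketch below. The file path is an assumption, the table and column names follow the example above, and both the MySQL server and the driver have to allow local infile for this to work:

try (Statement st = connection.createStatement()) {
    // Bulk-load a CSV file whose columns line up with (c1, c2, c3).
    st.execute("LOAD DATA LOCAL INFILE '/tmp/rows.csv' "
             + "INTO TABLE tbl "
             + "FIELDS TERMINATED BY ',' "
             + "LINES TERMINATED BY '\\n' "
             + "(c1, c2, c3)");
}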
I'm trying to write a Java function that can work with large result sets.
The table has 1.2 billion rows, which is 189 GB of data.
Currently I query all the data and extract the information, which I store in their respective objects (using a million-row sample DB).
TreeMap<Long, Vessel> vessels = new TreeMap<Long, Vessel>(); //list for all vessels
try {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT mmsi, report_timestamp, position_geom, ST_X(position_geom) AS Long, "
        + "ST_Y(position_geom) AS Lat FROM reports2 WHERE position_geom IS NOT NULL ORDER by report_timestamp ASC");
    while (rs.next()) {
        long mmsi = rs.getLong("mmsi");
        java.util.Date time = rs.getTime("report_timestamp");
        double longitude = rs.getDouble("Long");
        double latitude = rs.getDouble("Lat");
        Coordinate coordinate = new Coordinate(longitude, latitude, time);
        Vessel vessel = new Vessel(mmsi);
        if (!vessels.containsKey(mmsi)) { //if vessel is not present in vessels
            vessel.addCoor(coordinate);
            vessels.put(mmsi, vessel);
        }
        else { //if vessel is already in vessels
            vessels.get(mmsi).addCoor(coordinate);
        }
    }
} catch (Exception e) {
    JOptionPane.showMessageDialog(null, e);
}
With 189 GB of data, my computer's memory won't be able to hold all the information.
I've never touched a table with a billion-plus rows, and some of my methods need all of the table's attributes.
Can I make the ResultSet collect 1,000,000 rows at a time, delete the objects after I run functions on them, then collect another 1,000,000, and so on?
Is it possible to hold a 1.2-billion-row result set in approximately 43,000,000 Vessel objects (would it take too much space/time)?
Do I try to limit my query by selecting a specific key or attribute and running functions on just that data?
Is there another option?
If memory is an issue with the ResultSet you can set the fetch size, though you'll need to clear objects during the fetch to ensure you don't run out of memory. With Postgres you need to turn off auto-commit or the fetch size will not take effect.
connection.setAutoCommit(false);
Statement stmt = connection.createStatement();
stmt.setFetchSize(fetchsize);
You can read more about buffering the Result set at https://jdbc.postgresql.org/documentation/94/query.html#query-with-cursor
From your code it seems that you are building a Java object that collects all the coordinates with the same mmsi field. You did not provide information about how this object (the mmsi and its list of coordinates) is used. Given that, you can query the data sorted by mmsi and then by timestamp (your ORDER BY clause is only by timestamp now); when you find a different mmsi value in the result set, you have collected all the data for that specific mmsi and can use it without reading further data.
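A sketch of that idea, assuming the connection is already set up for cursor-based fetching as in the previous answer (Vessel and Coordinate come from the question; getMmsi and processVessel are assumptions standing in for whatever per-vessel work is needed):

try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
         "SELECT mmsi, report_timestamp, ST_X(position_geom) AS lon, ST_Y(position_geom) AS lat "
         + "FROM reports2 WHERE position_geom IS NOT NULL ORDER BY mmsi, report_timestamp")) {
    Vessel current = null;
    while (rs.next()) {
        long mmsi = rs.getLong("mmsi");
        if (current == null || current.getMmsi() != mmsi) {
            if (current != null) {
                processVessel(current);  // finished with the previous vessel
            }
            current = new Vessel(mmsi);  // start collecting the next vessel
        }
        current.addCoor(new Coordinate(rs.getDouble("lon"), rs.getDouble("lat"),
                                       rs.getTime("report_timestamp")));
    }
    if (current != null) {
        processVessel(current);          // handle the last group
    }
}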
I don't think you really need to get all the data in memory; you can rewrite the query in order to get only a fixed (a sliding window) number of Vessel objects; you must page the data (i.e. retrieve a block of 10 vessels starting from vessel at position x)
In order to provide a more detailed response you have to explain what you are doing with Vessels.
I have an application which accesses about 2 million tweets from a MySQL database. Specifically, one of the fields holds the tweet text (with a maximum length of 140 characters). I am splitting every tweet into word n-grams, where 1 <= n <= 3. For example, consider the sentence:
I am a boring sentence.
The corresponding nGrams are:
I
I am
I am a
am
am a
am a boring
a
a boring
a boring sentence
boring
boring sentence
sentence
With about 2 million tweets, I am generating a lot of data. Regardless, I am surprised to get a heap error from Java:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
at twittertest.Global.main(Global.java:40)
Here is the problem statement (line 49) as identified by the output above in NetBeans:
results = stmt.executeQuery("select * from tweets");
So, if I am running out of memory, it must be that it is trying to return all the results at once and store them in memory. What is the best way to solve this problem? Specifically, I have the following questions:
How can I process pieces of results rather than the whole set?
How would I increase the heap size? (If this is possible)
Feel free to include any suggestions, and let me know if you need more information.
EDIT
Instead of select * from tweets I partitioned the table into equally sized subsets of about 10% of the total size. Then I tried running the program. It looked like it was working fine, but it eventually gave me the same heap error. This is strange to me because I have run the same program successfully in the past with 610,000 tweets. Now I have about 2,000,000 tweets, roughly three times as much data. So if I split the data into thirds it should work, but I went further and split the subsets down to 10%.
Is some memory not being freed? Here is the rest of the code:
results = stmt.executeQuery("select COUNT(*) from tweets");
int num_tweets = 0;
if(results.next())
{
num_tweets = results.getInt(1);
}
int num_intervals = 10; //split into equally sized subets
int interval_size = num_tweets/num_intervals;
for(int i = 0; i < num_intervals-1; i++) //process 10% of the data at a time
{
results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
while(results.next()) //for each row in the tweets database
{
tweetID = results.getLong("tweet_id");
curTweet = results.getString("tweet");
int colPos = curTweet.indexOf(":");
curTweet = curTweet.substring(colPos + 1); //trim off the RT and retweeted
if(curTweet != null)
{
curTweet = removeStopWords(curTweet);
}
if(curTweet == null)
{
continue;
}
reader = new StringReader(curTweet);
tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
//tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
//Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
//tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
tokenizer = new ShingleFilter(tokenizer, 2, 3);
charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
while(tokenizer.incrementToken()) //insert each nGram from each tweet into the DB
{
insertNGram.setInt(1, nGramID++);
insertNGram.setString(2, charTermAttribute.toString().toString());
insertNGram.setLong(3, tweetID);
insertNGram.executeUpdate();
}
}
}
Don't get all the rows from the table. Try to select partial data based on your requirements by setting limits on the query. Since you are using a MySQL database, your query would be select * from tweets limit 0,10. Here 0 is the starting row and 10 means 10 rows from that start.
You can always increase the heap size available to your JVM using the -Xmx argument. You should read up on all the knobs available to you (e.g. perm gen size). Google for other options or read this SO answer.
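For example, a launch with a larger heap might look like this (the heap size is arbitrary and the jar name is a placeholder):

java -Xmx2g -XX:MaxPermSize=256m -jar ngram-counter.jar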
You probably can't do this kind of problem with a 32-bit machine. You'll want 64 bits and lots of RAM.
Another option would be to treat it as a map-reduce problem. Solve it on a cluster using Hadoop and Mahout.
Have you considered streaming the result set? Halfway down the linked page is a section on ResultSet, and it addresses your problem (I think?). Write the n-grams to a file, then process the next row. Or am I misunderstanding your problem?
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
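For reference, that page describes result-set streaming in MySQL Connector/J, which in code looks roughly like the sketch below (the table and column names come from the question; the rest is an assumption):

// A forward-only, read-only statement with fetch size Integer.MIN_VALUE tells
// Connector/J to stream rows one at a time instead of buffering them all.
Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                      ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = stmt.executeQuery("select tweet_id, tweet from tweets");
while (rs.next()) {
    long tweetID = rs.getLong("tweet_id");
    String tweet = rs.getString("tweet");
    // build the n-grams for this row and write them out here
}
rs.close();
stmt.close();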
My problem is this: I am trying to process about 1.5 million rows of data in Spring via JdbcTemplate coming from MySQL. With such a large number of rows, I am using the RowCallbackHandler class as suggested here
The code is actually working, but it's slow... The thing is that no matter what I set the fetch size to, I seem to get approximately 350 records per fetch, with a 2-to-3-second delay between fetches (from observing my logs). I tried commenting out the store command and confirmed that the behavior stayed the same, so the problem is not with the writes.
There are 6 columns, only 1 that is a varchar, and that one is only 25 characters long, so I can't see throughput being the issue.
Ideally I'd like to get more like 30000-50000 rows at a time. Is there a way to do that?
Here is my code:
protected void runCallback(String query, Map params, int fetchSize, RowCallbackHandler rch)
        throws DatabaseException {
    int oldFetchSize = getJdbcTemplate().getFetchSize();
    if (fetchSize > 0) {
        getJdbcTemplate().setFetchSize(fetchSize);
    }
    try {
        getJdbcTemplate().query(getSql(query), rch);
    }
    catch (DataAccessException ex) {
        logger.error(ExceptionUtils.getStackTrace(ex));
        throw new DatabaseException( ex.getMessage() );
    }
    getJdbcTemplate().setFetchSize(oldFetchSize);
}
and the handler:
public class SaveUserFolderStatesCallback implements RowCallbackHandler {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        //Save each row sequentially.
        //Do NOT call ResultSet.next() !!!!
        Calendar asOf = Calendar.getInstance();
        log.info("AS OF DATE: " + asOf.getTime());
        Long x = rs.getLong("x");
        Long xx = rs.getLong("xx");
        String xxx = rs.getString("xxx");
        BigDecimal budgetAmountBD = rs.getBigDecimal("xxxx");
        Double xxxx = (budgetAmountBD == null) ? 0.0 : budgetAmountBD.doubleValue();
        BigDecimal actualAmountBD = rs.getBigDecimal("xxxxx");
        Double xxxxx = (actualAmountBD == null) ? 0.0 : actualAmountBD.doubleValue();
        dbstore(x, xx, xxx, xxxx, xxxxx, asOf);
    }
}
And what is your query? Try to create indexes for the fields you are searching/sorting on. That will help.
Second strategy: an in-memory cache implementation, or using Hibernate plus a second-level cache.
Both of these techniques can significantly speed up your query execution.
A few questions:
How long does it take if you query the DB directly? Another issue could be ASYNC_NETWORK_IO delay between the application and DB hosts.
Did you check it without using Spring?
The answer actually is to do setFetchSize(Integer.MIN_VALUE). While this totally violates the stated contract of Statement.setFetchSize, the MySQL Java connector uses this value to stream the result set. This results in a tremendous performance improvement.
Another part of the fix is that I also needed to create my own subclass of (Spring) JdbcTemplate that would accommodate the negative fetch size... Actually, I took the code example from here, where I first found the idea of setting fetchSize(Integer.MIN_VALUE):
http://javasplitter.blogspot.com/2009/10/pimp-ma-jdbc-resultset.html
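A minimal sketch of such a subclass (the class name is made up; older JdbcTemplate versions only pass positive fetch sizes through to the statement, which is why the override is needed):

import java.sql.SQLException;
import java.sql.Statement;
import org.springframework.jdbc.core.JdbcTemplate;

public class StreamingJdbcTemplate extends JdbcTemplate {
    @Override
    protected void applyStatementSettings(Statement stmt) throws SQLException {
        super.applyStatementSettings(stmt);
        // Re-apply the fetch size unconditionally so Integer.MIN_VALUE reaches
        // the MySQL driver and row-by-row streaming is enabled.
        stmt.setFetchSize(getFetchSize());
    }
}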
Thank you both for your help!
I have a SQL query as shown below.
SELECT O_DEF,O_DATE,O_MOD from OBL_DEFINITVE WHERE OBL_DEFINITVE_ID =?
A collection of IDs is passed to this query, which is run as a batch query. It executes 10,000 times to retrieve values from the database (someone else's mess).
public static Map getOBLDefinitionsAsMap(Collection oblIDs)
        throws java.sql.SQLException
{
    Map retVal = new HashMap();
    if (oblIDs != null && (!oblIDs.isEmpty()))
    {
        BatchStatementObject stmt = new BatchStatementObject();
        stmt.setSql("SELECT O_DEF,O_DATE,O_MOD from OBL_DEFINITVE WHERE OBL_DEFINITVE_ID=?");
        stmt.setParameters(
            PWMUtils.convertCollectionToSubLists(oblIDs, 1));
        stmt.setResultsAsArray(true);
        QueryResults rows = stmt.executeBatchSelect();
        int rowSize = rows.size();
        for (int i = 0; i < rowSize; i++)
        {
            QueryResults.Row aRow = (QueryResults.Row) rows.getRow(i);
            CoblDefinition ctd = new CoblDefinition(aRow);
            retVal.put(aRow.getLong(0), ctd);
        }
    }
    return retVal;
}
Now we have identified that the query can be modified to use an IN clause:
SELECT O_DEF,O_DATE,O_MOD from OBL_DEFINITVE WHERE OBL_DEFINITVE_ID in (???)
so that we can reduce it to a single query.
The problem here is that MSSQL Server throws the exception:
Prepared or callable statement has more than 2000 parameter
And we're stuck here. Can someone provide a better alternative to this?
There is a maximum number of allowed parameters, let's call it n. You can do one of the following:
If you have m*n + k parameters, you can create m batches (or m+1 batches if k is not 0). If you have 10,000 parameters and 2,000 is the maximum number of allowed parameters, you will only need 5 batches.
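A minimal sketch of that first option with plain JDBC (MAX_PARAMS, the extra ID column in the select list, and the CoblDefinition constructor taking a ResultSet are assumptions):

private static final int MAX_PARAMS = 2000;

public static Map<Long, CoblDefinition> loadDefinitions(Connection con, List<Long> oblIDs)
        throws SQLException {
    Map<Long, CoblDefinition> result = new HashMap<Long, CoblDefinition>();
    for (int from = 0; from < oblIDs.size(); from += MAX_PARAMS) {
        List<Long> chunk = oblIDs.subList(from, Math.min(from + MAX_PARAMS, oblIDs.size()));
        // One "?" per ID in this chunk, e.g. "?, ?, ?"
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < chunk.size(); i++) {
            placeholders.append(i == 0 ? "?" : ", ?");
        }
        // The ID column is included in the select list so the results can be keyed.
        String sql = "SELECT OBL_DEFINITVE_ID, O_DEF, O_DATE, O_MOD FROM OBL_DEFINITVE "
                   + "WHERE OBL_DEFINITVE_ID IN (" + placeholders + ")";
        PreparedStatement ps = con.prepareStatement(sql);
        try {
            for (int i = 0; i < chunk.size(); i++) {
                ps.setLong(i + 1, chunk.get(i));
            }
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                result.put(rs.getLong(1), new CoblDefinition(rs)); // hypothetical constructor
            }
            rs.close();
        } finally {
            ps.close();
        }
    }
    return result;
}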
Another solution is to generate the query string in your application and add your parameters directly as strings. This way you will run your query only once. This is an obvious optimization in speed, but you'll end up with a query string generated in your application. You would set your where clause like this:
String myWhereClause = "where TaskID = " + taskIDs[0];
for (int i = 1; i < numberOfTaskIDs; i++)
{
    myWhereClause += " or TaskID = " + taskIDs[i];
}
It looks like you are using your own wrapper around PreparedStatement and addBatch(). You are clearly reaching a limit on how many statements/parameters can be batched at once. You will need to call executeBatch periodically (e.g. every 100 or 1000 statements), instead of letting the batch build up until the limit is reached.
Edit: Based on the comment below, I reread the problem. The solution: make sure you use fewer than 2000 parameters when building the query, breaking it up into two or more queries as required.