I am currently writing a Java program which loops through a folder of around 4000 XML files.
Using a for loop, it extracts the XML from each file, assigns it to a String 'xmlContent', and uses the PreparedStatement method setString(2, xmlContent) to insert the String into a table in my SQL Server database.
Parameter 2 maps to a column called 'Data' of type XML.
The process works, but it is slow: it inserts about 50 rows into the table every 7 seconds.
Does anyone have any ideas as to how I could speed up this process?
Code:
{   // ...declaration, connection etc etc
    PreparedStatement ps = con.prepareStatement("INSERT INTO Table(ID,Data) VALUES(?,?)");
    for (File current : folder.listFiles()) {
        if (current.isFile()) {
            xmlContent = fileRead(current.getAbsoluteFile());
            ps.setString(1, current.getAbsolutePath()); // setString needs a String, not a File
            ps.setString(2, xmlContent);
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch();
            }
        }
    }
    ps.executeBatch(); // performs insertion of leftover rows
    ps.close();
}
private static String fileRead(File file) throws IOException {
    StringBuilder xmlContent = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        br.readLine(); // skips the encoding line; don't need it and it causes problems
        String strLine;
        while ((strLine = br.readLine()) != null) {
            xmlContent.append(strLine);
        }
    }
    return xmlContent.toString();
}
Just from a little reading and a quick test, it looks like you can get a decent speedup by turning off autoCommit on your connection. All of the batch-query tutorials I have seen recommend it as well, such as http://www.tutorialspoint.com/jdbc/jdbc-batch-processing.htm
Turn it off, and then add an explicit commit wherever you want it (at the end of each batch, at the end of the whole function, etc.).
conn.setAutoCommit(false);
PreparedStatement ps = // ... rest of your code

// inside your for loop
if (++count % batchSize == 0) {
    try {
        ps.executeBatch();
        conn.commit();
    } catch (SQLException e) {
        // .. whatever you want to do
        conn.rollback();
    }
}
It is best to make the read and the write parallel:
Use one thread to read the files and store them in a buffer.
Use another thread to read from the buffer and execute the inserts against the database.
You can use more than one thread to write to the database in parallel; that should give you even better performance.
I would suggest you follow this MemoryStreamMultiplexer approach, where you read the XML files in one thread, store them in a buffer, and then use one or more threads to read from the buffer and execute against the database:
http://www.codeproject.com/Articles/345105/Memory-Stream-Multiplexer-write-and-read-from-many
It is a C# implementation, but you get the idea.
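For illustration, here is a minimal Java sketch of that producer/consumer idea (not a port of the C# multiplexer), using a BlockingQueue from java.util.concurrent as the buffer. It assumes the fileRead helper and the INSERT statement from the question, and that autoCommit has been turned off on the connection:

// Producer/consumer sketch: one thread reads files, one thread batches inserts.
static void importInParallel(Connection con, File folder) throws Exception {
    BlockingQueue<String[]> buffer = new ArrayBlockingQueue<>(500); // holds [path, xml] pairs
    String[] poisonPill = new String[] { null, null };              // marks end of input

    Thread reader = new Thread(() -> {
        try {
            for (File current : folder.listFiles()) {
                if (current.isFile()) {
                    buffer.put(new String[] { current.getAbsolutePath(), fileRead(current) });
                }
            }
            buffer.put(poisonPill);
        } catch (Exception e) {
            e.printStackTrace();
        }
    });

    Thread writer = new Thread(() -> {
        try (PreparedStatement ps = con.prepareStatement("INSERT INTO Table(ID,Data) VALUES(?,?)")) {
            int count = 0;
            String[] item;
            while ((item = buffer.take())[0] != null) {
                ps.setString(1, item[0]);
                ps.setString(2, item[1]);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();
                    con.commit(); // assumes con.setAutoCommit(false) was called earlier
                }
            }
            ps.executeBatch();
            con.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    });

    reader.start();
    writer.start();
    reader.join();
    writer.join();
}

With one reader and one writer this mostly overlaps file I/O with the database round trips; adding more writer threads (each with its own connection) is the next step if the database becomes the bottleneck.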
In many try-with-resources examples I have found, the Statement and the ResultSet are declared separately. As the Java documentation mentions, the close methods of resources are called in the opposite order of their creation.
try (Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sql) ) {
} catch (Exception e) {
}
But now I have multiple queries in my function.
Can I create the Statement and the ResultSet in just one line? My code looks like:
try (ResultSet rs = con.createStatement().executeQuery(sql);
ResultSet rs2 = con.createStatement().executeQuery(sql2);
ResultSet rs3 = con.createStatement().executeQuery(sql3)) {
} catch (Exception e) {
}
If I only declare them in one line, does it still close both the ResultSet and the Statement?
If you have a careful look, you will see that the concept is called try-with-resources.
Note the plural! The whole idea is that you can declare one or more resources in that single statement, and the JVM guarantees proper handling.
In other words: when resources belong together semantically, it is good practice to declare them together.
Yes, and it works exactly as you put it in your question: multiple resource declarations separated by semicolons.
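For multiple queries, a short sketch (assuming the same sql/sql2 strings and Connection con from the question) that declares each Statement and its ResultSet as separate resources, so every one of them is guaranteed to be closed in reverse order of creation:

try (Statement stmt = con.createStatement();
     ResultSet rs = stmt.executeQuery(sql);
     Statement stmt2 = con.createStatement();
     ResultSet rs2 = stmt2.executeQuery(sql2)) {
    while (rs.next()) {
        // ... process the first query
    }
    while (rs2.next()) {
        // ... process the second query
    }
} catch (SQLException e) {
    e.printStackTrace();
}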
You may declare one or more resources in a try-with-resources statement. The following example retrieves the names of the files packaged in the zip file zipFileName and creates a text file that contains the names of these files:
try (
    java.util.zip.ZipFile zf = new java.util.zip.ZipFile(zipFileName);
    java.io.BufferedWriter writer = java.nio.file.Files.newBufferedWriter(outputFilePath, charset)
) {
    // Enumerate each entry
    for (java.util.Enumeration entries = zf.entries(); entries.hasMoreElements();) {
        // Get the entry name and write it to the output file
        String newLine = System.getProperty("line.separator");
        String zipEntryName =
            ((java.util.zip.ZipEntry) entries.nextElement()).getName() + newLine;
        writer.write(zipEntryName, 0, zipEntryName.length());
    }
}
https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
ResultSet implements AutoCloseable, which means try-with-resources will also enforce closing it when it finishes using it.
https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html
Input set: thousands (>10,000) of CSV files, each containing >50,000 entries.
Output: store that data in a MySQL DB.
Approach taken:
Read each file and store the data into the database. Below is the code snippet for that. Please suggest whether this approach is OK or not.
PreparedStatement pstmt2 = null;
try {
    pstmt1 = con.prepareStatement(sqlQuery);
    result = pstmt1.executeUpdate();

    con.setAutoCommit(false);
    sqlQuery = "insert into " + tableName + " (x,y,z,a,b,c) values(?,?,?,?,?,?)";
    pstmt2 = con.prepareStatement(sqlQuery);

    Path file = Paths.get(filename);
    lines = Files.lines(file, StandardCharsets.UTF_8);
    final int batchsz = 5000;
    for (String line : (Iterable<String>) lines::iterator) {
        pstmt2.setString(1, "somevalue");
        pstmt2.setString(2, "somevalue");
        pstmt2.setString(3, "somevalue");
        pstmt2.setString(4, "somevalue");
        pstmt2.setString(5, "somevalue");
        pstmt2.setString(6, "somevalue");
        pstmt2.addBatch();
        if (++linecnt % batchsz == 0) {
            pstmt2.executeBatch();
        }
    }
    int batchResult[] = pstmt2.executeBatch();
    pstmt2.close();
    con.commit();
} catch (BatchUpdateException e) {
    log.error(Utility.dumpExceptionMessage(e));
} catch (IOException ioe) {
    log.error(Utility.dumpExceptionMessage(ioe));
} catch (SQLException e) {
    log.error(Utility.dumpExceptionMessage(e));
} finally {
    lines.close();
    try {
        pstmt1.close();
        pstmt2.close();
    } catch (SQLException e) {
        Utility.dumpExceptionMessage(e);
    }
}
I've used LOAD DATA INFILE in situations like this in the past.
The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed. LOAD DATA INFILE is the complement of SELECT ... INTO OUTFILE. (See Section 14.2.9.1, "SELECT ... INTO Syntax".) To write data from a table to a file, use SELECT ... INTO OUTFILE. To read the file back into a table, use LOAD DATA INFILE. The syntax of the FIELDS and LINES clauses is the same for both statements. Both clauses are optional, but FIELDS must precede LINES if both are specified.
The IGNORE number LINES option can be used to ignore lines at the start of the file. For example, you can use IGNORE 1 LINES to skip over an initial header line containing column names:
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE test IGNORE 1 LINES;
http://dev.mysql.com/doc/refman/5.7/en/load-data.html
As @Ridrigo has already pointed out, LOAD DATA INFILE is the way to go. Java is not really needed at all.
If the format of your CSV is not something that can be inserted directly into the database, your Java code can re-enter the picture. Use it to reorganize/transform the CSV and save it as another CSV file instead of writing it into the database.
You can also use the Java code to iterate through the folder that contains the CSVs and then execute the system command that runs LOAD DATA INFILE, for example:
Runtime r = Runtime.getRuntime();
Process p = r.exec("mysql -p password -u user database -e 'LOAD DATA INFILE ....");
You will find that this is much, much faster than running individual SQL queries for each row of the CSV file.
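As a rough sketch of that idea using ProcessBuilder: the table name, credentials, file paths, and FIELDS/LINES options below are placeholders, and it assumes the mysql client is on the PATH and that local_infile is enabled on the server.

import java.io.File;

// Loop over a folder of CSVs and shell out to the mysql client for each one.
public class BulkCsvLoader {
    public static void main(String[] args) throws Exception {
        File folder = new File("/path/to/csv/folder");
        for (File csv : folder.listFiles((dir, name) -> name.endsWith(".csv"))) {
            String sql = "LOAD DATA LOCAL INFILE '" + csv.getAbsolutePath()
                    + "' INTO TABLE mytable"
                    + " FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
                    + " IGNORE 1 LINES";
            Process p = new ProcessBuilder(
                    "mysql", "--local-infile=1", "-u", "user", "-ppassword", "mydb", "-e", sql)
                    .inheritIO()
                    .start();
            int exit = p.waitFor();
            if (exit != 0) {
                System.err.println("Load failed for " + csv.getName());
            }
        }
    }
}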
I read a file, create an object from it, and store it in a PostgreSQL database. My file has 100,000 documents that I read from one file, split, and finally store in the database.
I can't create a List<> and store all documents in it, because I only have a little RAM. My code to read and write to the database is below, but my JVM heap fills up and it cannot continue to store more documents. How can I read the file and store it to the database efficiently?
public void readFile() {
    StringBuilder wholeDocument = new StringBuilder();
    try {
        bufferedReader = new BufferedReader(new FileReader(files));
        String line;
        int count = 0;
        while ((line = bufferedReader.readLine()) != null) {
            if (line.contains("<page>")) {
                wholeDocument.append(line);
                while ((line = bufferedReader.readLine()) != null) {
                    wholeDocument = wholeDocument.append("\n" + line);
                    if (line.contains("</page>")) {
                        System.out.println(count++);
                        addBodyToDatabase(wholeDocument.toString());
                        wholeDocument.setLength(0);
                        break;
                    }
                }
            }
        }
        wikiParser.commit();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            bufferedReader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public void addBodyToDatabase(String wholeContent) {
    Page page = new Page(new Timestamp(System.currentTimeMillis()), wholeContent);
    database.addPageToDatabase(page);
}

public static int counter = 1;

public void addPageToDatabase(Page page) {
    session.save(page);
    if (counter % 3000 == 0) {
        commit();
    }
    counter++;
}
First of all, you should apply a fork/join-style approach here.
The main task parses the file and sends batches of at most 100 items to an ExecutorService. The ExecutorService should have a number of worker threads that equals the number of available database connections. If you have 4 CPU cores, let's say the database can take 8 concurrent connections without doing too much context switching.
You should then configure a connection-pooling DataSource with a minSize equal to the maxSize, both set to 8. Try HikariCP or ViburDBCP for connection pooling.
Then you need to configure JDBC batching. If you're using MySQL, the IDENTITY generator will disable batching. If you're using a database that supports sequences, make sure you also use the enhanced identifier generators (they are the default option in Hibernate 5.x).
This way the entity insert process is parallelized and decoupled from the main parsing thread. The main thread should wait for the ExecutorService to finish processing all tasks before shutting down.
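A rough sketch of that hand-off, where parsePages and pageDao.saveBatch are placeholders for your parsing logic and for whatever performs the batched insert on a pooled connection or Hibernate session:

// Main thread parses and submits batches of at most 100 pages;
// 8 workers (matching the 8 pooled connections) persist them.
ExecutorService workers = Executors.newFixedThreadPool(8);
List<Page> batch = new ArrayList<>(100);

for (Page page : parsePages(file)) {          // parsePages: your parsing logic
    batch.add(page);
    if (batch.size() == 100) {
        final List<Page> toSave = new ArrayList<>(batch);
        workers.submit(() -> pageDao.saveBatch(toSave));
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    workers.submit(() -> pageDao.saveBatch(new ArrayList<>(batch)));
}

workers.shutdown();                           // no new tasks
workers.awaitTermination(1, TimeUnit.HOURS);  // wait for all inserts to finish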
Actually it is hard to make suggestions without doing real profiling to find out what's making your code slow or inefficient.
However, there are several things we can see from your code:
You are using StringBuilder inefficiently.
wholeDocument.append("\n" + line); should be written as wholeDocument.append("\n").append(line); instead,
because what you originally wrote will be translated by the compiler to
wholeDocument.append(new StringBuilder("\n").append(line).toString()). You can see how many unnecessary StringBuilders you are creating :)
Considerations in using Hibernate
I am not sure how you manage your session or how you implemented your commit(); I assume you have done it right, but there are still more things to consider:
Have you properly set up the batch size in Hibernate (hibernate.jdbc.batch_size)? By default, the JDBC batch size is something around 5. You may want to set it to a bigger value so that internally Hibernate will send the inserts in bigger batches.
Given that you do not need the entities in the first-level cache for later use, you may want to do an intermittent session flush() + clear() to trigger the batch inserts mentioned in the previous point and to clear out the first-level cache (see the sketch below).
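A minimal sketch of that flush/clear pattern, assuming batchSize matches hibernate.jdbc.batch_size and that the session, transaction, and Page constructor are the ones used in the question (pages is a placeholder for however you iterate your parsed documents):

int batchSize = 50;   // keep in sync with hibernate.jdbc.batch_size
int i = 0;
for (String content : pages) {
    session.save(new Page(new Timestamp(System.currentTimeMillis()), content));
    if (++i % batchSize == 0) {
        session.flush();   // push the pending INSERTs to the DB as a JDBC batch
        session.clear();   // evict the persisted entities from the first-level cache
    }
}
session.flush();
session.clear();
tx.commit();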
Switch away from Hibernate for this feature.
Hibernate is cool, but it is not a panacea for everything. Given that in this feature you are just saving records into the DB based on text-file content, you neither need any entity behavior nor the first-level cache for later processing, so there is not much reason to use Hibernate here given the extra processing and space overhead. Simply doing JDBC with manual batch handling is going to save you a lot of trouble.
I used @RookieGuy's answer.
stackoverflow.com/questions/14581865/hibernate-commit-and-flush
I use
session.flush();
session.clear();
and finally, after reading all documents and storing them into the database,
tx.commit();
session.close();
and changed
wholeDocument = wholeDocument.append("\n" + line);
to
wholeDocument.append("\n" + line);
I'm not very sure about the structure of your data file. It would be easier to understand if you could provide a sample of your file.
The root cause of the memory consumption is the way of reading/iterating the file: once something is read, it stays in memory. You should rather use either java.io.FileInputStream or org.apache.commons.io.FileUtils.
Here is a sample code to iterate with java.io.FileInputStream:
try (
    FileInputStream inputStream = new FileInputStream("/tmp/sample.txt");
    Scanner sc = new Scanner(inputStream, "UTF-8")
) {
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        addBodyToDatabase(line);
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
Here is a sample code to iterate with org.apache.commons.io.FileUtils:
File file = new File("/tmp/sample.txt");
LineIterator it = FileUtils.lineIterator(file, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        addBodyToDatabase(line);
    }
} finally {
    LineIterator.closeQuietly(it);
}
You should begin a transaction, do the save operation, and then commit the transaction. (Don't begin a transaction after save!) You can try using a StatelessSession to exclude the memory consumed by the session cache.
Also use a smaller value, for example 20, in this code:
if (counter % 20 == 0)
Where possible, you can also try to pass the StringBuilder as a method argument.
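For reference, a minimal sketch of the StatelessSession idea, assuming the same sessionFactory and Page entity used in the question (documents is a placeholder for however you iterate the parsed pages); a StatelessSession has no first-level cache, so nothing accumulates between inserts:

StatelessSession statelessSession = sessionFactory.openStatelessSession();
Transaction tx = statelessSession.beginTransaction();
try {
    for (String content : documents) {
        // insert() goes straight to JDBC; no first-level cache, no cascades
        statelessSession.insert(new Page(new Timestamp(System.currentTimeMillis()), content));
    }
    tx.commit();
} catch (RuntimeException e) {
    tx.rollback();
    throw e;
} finally {
    statelessSession.close();
}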
The problem is: the user clicks a button in a JSP, which will export the displayed data. So what I am doing is creating a temporary file, writing the contents into it [resultSet >> xml >> csv], and then writing the contents to the ServletResponse. After closing the response output stream, I try to delete the file, but it returns false every time.
Code:
public static void writeFileContentToResponse(HttpServletResponse response, String fileName) throws IOException {
    ServletOutputStream responseoutputStream = response.getOutputStream();
    File file = new File(fileName);
    if (file.exists()) {
        file.deleteOnExit();
        DataInputStream dis = new DataInputStream(new FileInputStream(file));
        response.setContentType("text/csv");
        int size = (int) file.length();
        response.setContentLength(size);
        response.setHeader("Content-Disposition", "attachment; filename=\"" + file.getName() + "\"");
        response.setHeader("Pragma", "public");
        response.setHeader("Cache-control", "must-revalidate");
        if (size > Integer.MAX_VALUE) {
        }
        byte[] bytes = new byte[size];
        dis.read(bytes);
        FileCopyUtils.copy(bytes, responseoutputStream);
    }
    responseoutputStream.flush();
    responseoutputStream.close();
    file.delete();
}
I have used file.deleteOnExit(); and file.delete(); but neither of them is working.
file.deleteOnExit() isn't going to produce the result you want here: its purpose is to delete the file when the JVM exits. If this is called from a servlet, that means the file is deleted when the server shuts down.
As for why file.delete() isn't working: all I see in this code is reading from the file and writing to the servlet's output stream. Is it possible that when you wrote the data to the file, you left the file's input stream open? Files won't be deleted while they're currently in use.
Also, even though your method throws IOException, you still need to clean things up if there's an exception while accessing the file: put the file operations in a try block, and put the stream.close() into a finally block.
Don't create that file.
Write your data directly from your ResultSet to your CSV responseoutputStream.
That saves time, memory, disk space, and headache.
If you really need a file, try using the File.createTempFile() method.
Those files will be deleted when your VM stops normally, if they haven't been deleted before.
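A short sketch of that createTempFile idea (the prefix/suffix and the write step are placeholders):

// Create a uniquely named temp file and make sure it is removed once the
// response has been written, falling back to deleteOnExit() if delete() fails.
File tempFile = File.createTempFile("export-", ".csv");
try {
    // ... write resultSet -> tempFile, then copy tempFile to the response output stream
} finally {
    if (!tempFile.delete()) {
        tempFile.deleteOnExit();
    }
}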
I'm assuming you have some sort of concurrency issue going on here. Consider making this method non-static, and use a unique name for your temp file (for example, append the current time, or use a GUID as the filename). Chances are that you're opening the file, then someone else opens it, so the first delete fails.
As I see it, you are not closing the DataInputStream dis; this results in the false status when you then want to delete the file. Also, you should handle the streams in a try-catch-finally block and close them within finally. The code is a bit rough, but it is safe:
DataInputStream dis = null;
try {
    dis = new DataInputStream(new FileInputStream(file));
    // ... your other code
} catch (FileNotFoundException P_ex) {
    // catch only Exceptions you want, react to them
} finally {
    if (dis != null) {
        try {
            dis.close();
        } catch (IOException P_ex) {
            // handle exception, again react only to exceptions that must be reacted on
        }
    }
}
How are you creating the file? You probably need to use createTempFile.
You should be able to delete a temporary file just fine (no need for deleteOnExit). Are you sure the file isn't in use when you are trying to delete it? You should have one file per user request (which is another reason to avoid temp files and keep everything in memory).
You can try piped input and piped output streams. Those buffers need two threads: one to feed the pipe (the exporter) and the other (the servlet) to consume data from the pipe and write it to the response output stream. A sketch follows below.
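A minimal sketch of that piping idea, where exportTo(OutputStream) is a placeholder for whatever writes the result set as CSV, and response is the HttpServletResponse from the question:

PipedOutputStream pipeOut = new PipedOutputStream();
PipedInputStream pipeIn = new PipedInputStream(pipeOut);

// Exporter thread: writes CSV into the pipe.
new Thread(() -> {
    try (OutputStream out = pipeOut) {
        exportTo(out);
    } catch (IOException e) {
        e.printStackTrace();
    }
}).start();

// Servlet thread: copies the pipe to the response, no temp file involved.
response.setContentType("text/csv");
byte[] buf = new byte[8192];
int n;
try (InputStream in = pipeIn; OutputStream out = response.getOutputStream()) {
    while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
    }
}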
You really don't want to create a temporary file for a request. Keep the resulting CSV in memory if at all possible.
You may need to tie the writing of the file directly in with the output. So parse a row of the result set, write it out to the response stream, parse the next row, and so on. That way you only keep one row in memory at a time. The problem there is that the response could time out.
If you want a shortcut method, take a look at Display tag library. It makes it very easy to show a bunch of results in a table and then add pre-built export options to said table. CSV is one of those options.
You don't need a temporary file. The byte buffer which you're creating there based on the file size may also cause OutOfMemoryError. It's all plain inefficient.
Just write the data of the ResultSet immediately to the HTTP response while iterating over the rows. Basically: writer.write(resultSet.getString("columnname")). This way you don't need to write it to a temporary file or to gobble everything in Java's memory.
Further, most JDBC drivers will by default cache everything in Java's memory before giving anything to ResultSet#next(). This is also inefficient. You'd like to let it give the data immediately row-by-row by setting the Statement#setFetchSize(). How to do it properly depends on the JDBC driver used. In case of for example MySQL, you can read it up in its JDBC driver documentation.
Here's a kickoff example, assuming that you're using MySQL:
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    response.setContentType("text/csv");
    response.setCharacterEncoding("UTF-8");

    Connection connection = null;
    Statement statement = null;
    ResultSet resultSet = null;
    PrintWriter writer = response.getWriter();

    try {
        connection = database.getConnection();
        statement = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        statement.setFetchSize(Integer.MIN_VALUE);
        resultSet = statement.executeQuery("SELECT col1, col2, col3 FROM tbl");
        while (resultSet.next()) {
            writer.append(resultSet.getString("col1")).append(',');
            writer.append(resultSet.getString("col2")).append(',');
            writer.append(resultSet.getString("col3")).println();
            // Note: don't forget to escape quotes/commas as per RFC 4180.
        }
    } catch (SQLException e) {
        throw new ServletException("Retrieving CSV rows from DB failed", e);
    } finally {
        if (resultSet != null) try { resultSet.close(); } catch (SQLException logOrIgnore) {}
        if (statement != null) try { statement.close(); } catch (SQLException logOrIgnore) {}
        if (connection != null) try { connection.close(); } catch (SQLException logOrIgnore) {}
    }
}
That's it. This way, effectively only one database row is kept in memory at a time.
I have the following code that executes a query and writes it directly to a string buffer, which then dumps it to a CSV file. I will need to write a large number of records (up to a million). This works for a million records, but it takes about half an hour for a file that is around 200 MB, which seems like a lot of time to me; I'm not sure if this is the best approach. Please recommend better ways, even if they include using other jars/DB connection utils.
....
eventNamePrepared = con.prepareStatement(gettingStats + filterOptionsRowNum + filterOptions);
ResultSet rs = eventNamePrepared.executeQuery();
int i = 0;
try {
    ......
    FileWriter fstream = new FileWriter(realPath + "performanceCollectorDumpAll.csv");
    BufferedWriter out = new BufferedWriter(fstream);
    StringBuffer partialCSV = new StringBuffer();
    while (rs.next()) {
        i++;
        if (current_appl_id_col_display)
            partialCSV.append(rs.getString("current_appl_id") + ",");
        if (event_name_col_display)
            partialCSV.append(rs.getString("event_name") + ",");
        if (generic_method_name_col_display)
            partialCSV.append(rs.getString("generic_method_name") + ",");
        ..... // 23 more columns to be copied same way to buffer
        partialCSV.append(" \r\n");
        // Writing to file after 10000 records to prevent partialCSV
        // from going too big and consuming lots of memory
        if (i % 10000 == 0) {
            out.append(partialCSV);
            partialCSV = new StringBuffer();
        }
    }
    con.close();
    out.append(partialCSV);
    out.close();
Thanks,
Tam
Just write to the BufferedWriter directly instead of constructing the StringBuffer.
Also note that you should likely use StringBuilder instead of StringBuffer... StringBuffer has an internal lock, which is usually not necessary.
Profiling is generally the only sure-fire way to know why something's slow. However, in this example I would suggest two things that are low-hanging fruit:
Write directly to the buffered writer instead of creating your own buffering with the StringBuilder.
Refer to the columns in the result-set by integer ordinal. Some drivers can be slow when resolving column names.
You could tweak various things, but for a real improvement I would try using the native tool of whatever database you are using to generate the file. If it is SQL Server, this would be bcp which can take a query string and generate the file directly. If you need to call it from Java you can spawn it as a process.
As way of an example, I have just run this...
bcp "select * from trading..bar_db" queryout bar_db.txt -c -t, -Uuser -Ppassword -Sserver
...this generated a 170MB file containing 2 million rows in 10 seconds.
I just wanted to add a sample code for the suggestion of Jared Oberhaus:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CSVExport {
    public static void main(String[] args) throws Exception {
        String table = "CUSTOMER";
        int batch = 100;

        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@server:orcl", "user", "pass");
        PreparedStatement pstmt = conn.prepareStatement(
            "SELECT /*+FIRST_ROWS(" + batch + ") */ * FROM " + table);
        ResultSet rs = pstmt.executeQuery();
        rs.setFetchSize(batch);
        ResultSetMetaData rsm = rs.getMetaData();

        File output = new File("result.csv");
        PrintWriter out = new PrintWriter(new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(output), "UTF-8")), false);

        Set<String> columns = new HashSet<String>(
            Arrays.asList("COL1", "COL3", "COL5"));

        while (rs.next()) {
            int k = 0;
            for (int i = 1; i <= rsm.getColumnCount(); i++) {
                if (columns.contains(rsm.getColumnName(i).toUpperCase())) {
                    if (k > 0) {
                        out.print(",");
                    }
                    String s = rs.getString(i);
                    out.print("\"");
                    out.print(s != null ? s.replace("\"", "\"\"") : ""); // escape quotes by doubling them (RFC 4180)
                    out.print("\"");
                    k++;
                }
            }
            out.println();
        }
        out.flush();
        out.close();
        rs.close();
        pstmt.close();
        conn.close();
    }
}
I have two quick thoughts. The first is: are you sure writing to disk is the problem? Could you actually be spending most of your time waiting on data from the DB?
The second is to try removing all the + "," concatenations and using more .append() calls instead. It may help, considering how often you are doing those.
You mentioned that you are using Oracle. You may want to investigate using the Oracle External Table feature or Oracle Data Pump depending on exactly what you are trying to do.
See http://www.orafaq.com/node/848 (Unloading data into an external file...)
Another option could be connecting with sqlplus and running "spool <filename>" prior to the query.
Writing to a buffered writer is normally fast "enough". If it isn't for you, then something else is slowing it down.
The easiest way to profile it is to use jvisualvm, which is available in the latest JDK.