I am working on full-text indexing using the inverted file method: the program extracts all the words in a document and inserts them one by one into a table in MySQL.
So far the program works fine, but I am stuck on how to optimize it further to reduce the time it takes to insert into the database. I am aware that a disadvantage of the inverted file method is that building the index table is slow.
Here is my code:
public class IndexTest {
public static void main(String[] args) throws Exception {
StopWatch stopwatch = new StopWatch();
stopwatch.start();
File folder = new File("D:\\PDF1");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
if (file.isFile()) {
HashSet<String> uniqueWords = new HashSet<>();
String path = "D:\\PDF1\\" + file.getName();
try (PDDocument document = PDDocument.load(new File(path))) {
if (!document.isEncrypted()) {
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
String[] words = line.split(" ");
for (String word : words) {
uniqueWords.add(word);
}
}
// System.out.println(uniqueWords);
}
} catch (IOException e) {
System.err.println("Exception while trying to read pdf document - " + e);
}
Object[] words = uniqueWords.toArray();
String unique = uniqueWords.toString();
// System.out.println(words[1].toString());
for(int i = 1 ; i <= words.length - 1 ; i++ ) {
MysqlAccessIndex connection = new MysqlAccessIndex();
connection.readDataBase(path, words[i].toString());
}
System.out.println("Completed");
}
}
stopwatch.stop();
long timeTaken = stopwatch.getTime();
System.out.println(timeTaken);
MYSQL connection:
public class MysqlAccessIndex {
public Connection connect = null;
public Statement statement = null;
public PreparedStatement preparedStatement = null;
public ResultSet resultSet = null;
public MysqlAccessIndex() throws Exception {
Class.forName("com.mysql.jdbc.Driver");
connect = DriverManager
.getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
+ "user=root&password=root123");
// statement = connect.createStatement();
System.out.print("Connected");
}
public void readDataBase(String path,String word) throws Exception {
try {
preparedStatement = connect
.prepareStatement("insert IGNORE into fulltext_ltat.test_text values (?, ?) ");
preparedStatement.setString(1, path);
preparedStatement.setString(2, word);
preparedStatement.executeUpdate();
} catch (Exception e) {
throw e;
} finally {
close();
}
}
Would it be possible to use some sort of multithreading, say inserting three words into three rows at the same time, to speed up the insert process?
I would appreciate any suggestions.
I think the solution to your problem is to use a bulk (batch) insert.
You could try something like this:
public void readDataBase(String path, HashSet<String> uniqueWords) throws Exception {
PreparedStatement preparedStatement;
try {
String compiledQuery = "insert IGNORE into fulltext_ltat.test_text values (?, ?) ";
preparedStatement = connect.prepareStatement(compiledQuery);
for(String word : uniqueWords) {
preparedStatement.setString(1, path);
preparedStatement.setString(2, word);
preparedStatement.addBatch();
}
long start = System.currentTimeMillis();
int[] inserted = preparedStatement.executeBatch();
} catch (Exception e) {
throw e;
} finally {
close();
}
}
Modify your readDataBase method to take a HashSet<String> uniqueWords parameter.
Then call preparedStatement.addBatch() after binding each item, and call preparedStatement.executeBatch() once at the end instead of preparedStatement.executeUpdate().
I hope this helps.
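For a large document it may also help to send the words in chunks rather than as one huge batch. Below is a minimal sketch of that idea, not a tested drop-in: the class name BatchInsertSketch, the partition helper and the batchSize parameter are my own assumptions, and the SQL is copied from the question. It assumes the caller opens the connection once and passes it in.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class BatchInsertSketch {

    // Split a list into consecutive chunks of at most `size` elements.
    public static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += size) {
            int to = Math.min(from + size, items.size());
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }

    // Insert all words of one document, reusing a single prepared statement
    // and sending one batch per chunk. The caller owns (and closes) the connection.
    public static int insertWords(Connection connect, String path, List<String> words, int batchSize)
            throws SQLException {
        String sql = "insert IGNORE into fulltext_ltat.test_text values (?, ?)";
        int affected = 0;
        try (PreparedStatement ps = connect.prepareStatement(sql)) {
            for (List<String> chunk : partition(words, batchSize)) {
                for (String word : chunk) {
                    ps.setString(1, path);
                    ps.setString(2, word);
                    ps.addBatch();
                }
                // executeBatch may report SUCCESS_NO_INFO (negative), so only
                // positive update counts are summed here.
                for (int count : ps.executeBatch()) {
                    if (count > 0) {
                        affected += count;
                    }
                }
            }
        }
        return affected;
    }
}
```

You would call it once per file, for example insertWords(connect, path, new ArrayList<>(uniqueWords), 500), instead of creating a new MysqlAccessIndex for every word.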
Let's say I want to index a file. The file is stored in a filequeue table. The table structure looks like below:
UniqueID  FilePath             Status
1         C:\Folder1\abc.pdf   Active
2         C:\Folder1\def.pdf   Active
3         C:\Folder1\efg.pdf   Error
There are four different statuses: Active, Processing, Success and Error.
Active: the file has been inserted into the table and is pending indexing.
Processing: the indexing process has started for the file.
Success: the indexing process has completed; the status should be updated to Success.
Error: the processing failed for some reason.
Now suppose abc.pdf does not exist. When I scan the table, the program retrieves every file path with Status = Active, iterates over them and runs the index function on each one. During this process it updates the status to Processing, and then to Complete if there are no issues.
However, it throws a FileNotFoundException on abc.pdf, which is expected since the file does not exist, but it still updates the status to Complete. It should update the status to Error instead.
I was thinking of using an if else statement and it looks like this:
public void doScan_DB() throws Exception {
boolean fileProcessStatus = false;
try {
Statement statement = con.connect().createStatement();
ResultSet rs = statement.executeQuery("select * from filequeue where Status='Active'");
while (rs.next()) {
// get the filepath of the PDF document
String path1 = rs.getString(2);
// while running the process, update status : Processing
updateProcess_DB();
// call the index function
Indexing conn = new Indexing();
conn.doScan(path1);
fileProcessStatus =true;
// After completing the process, update status: Complete
if (fileProcessStatus) {
updateComplete_DB();
}else{
//call function to update status to error if index fails
}
}
}catch(SQLException|IOException e){
e.printStackTrace();
}
My doScan() method:
public void doScan(String path) throws Exception{
/* File folder = new File("D:\\PDF1");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
if (file.isFile()) {
// HashSet<String> uniqueWords = new HashSet<>();
String path = "D:\\PDF1\\" + file.getName();*/
ArrayList<String> list = new ArrayList<String>();
try (PDDocument document = PDDocument.load(new File(path))) {
if (!document.isEncrypted()) {
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
String[] words = line.split(" ");
// words.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
for (String word : words) {
// check if one or more special characters at end of string then remove OR
// check special characters in beginning of the string then remove
// uniqueWords.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
list.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
// uniqueWords.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
}
}
}
} catch (IOException e) {
System.err.println("Exception while trying to read pdf document - " + e);
}
String[] words1 =list.toArray(new String[list.size()]);
// String[] words2 =uniqueWords.toArray(new String[uniqueWords.size()]);
// MysqlAccessIndex connection = new MysqlAccessIndex();
index(words1,path);
System.out.println("Completed");
}
}
updateError_DB():
public void updateError_DB(){
try{
Statement statement = con.connect().createStatement();
statement.execute("update filequeue SET STATUS ='Error' where STATUS ='Processing' ");
}catch(Exception e){
e.printStackTrace();
}
}
updateComplete_DB():
public void updateComplete_DB() {
try {
Statement statement = con.connect().createStatement();
statement.execute("update filequeue SET STATUS ='Complete' where STATUS ='Processing' ");
} catch (Exception e) {
e.printStackTrace();
}
}
However, this doesn't fix the issue of updating the status correctly.
Is there a way to achieve what I want?
Here is the solution based on my understanding of your scenario.
doScan_DB() method:
public void doScan_DB() throws Exception {
boolean fileprocessstatus = false;
try {
Statement statement = con.connect().createStatement();
ResultSet rs = statement.executeQuery("select * from filequeue where Status='Active'");
while (rs.next()) {
//Get the uniqueID of active filepath
String uniqueID = rs.getString(1);
// get the filepath of the PDF document
String path1 = rs.getString(2);
// while running the process, update status : Processing
updateProcess_DB(uniqueID);
// call the index function
Indexing conn = new Indexing();
if (conn.doScan(path1)) {
updateComplete_DB(uniqueID);
} else {
updateError_DB(uniqueID);
}
}
} catch (SQLException | IOException e) {
e.printStackTrace();
}
}
doScan() method:
public boolean doScan(String path) {
/*
* File folder = new File("D:\\PDF1"); File[] listOfFiles = folder.listFiles();
*
* for (File file : listOfFiles) { if (file.isFile()) { // HashSet<String>
* uniqueWords = new HashSet<>();
*
* String path = "D:\\PDF1\\" + file.getName();
*/
ArrayList<String> list = new ArrayList<String>();
boolean isSuccess = true;
try {
File f = new File(path);
if (!f.exists()) {
isSuccess = false;
} else {
// try-with-resources closes the document even if text extraction fails
try (PDDocument document = PDDocument.load(f)) {
if (!document.isEncrypted()) {
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
String[] words = line.split(" ");
for (String word : words) {
// strip one or more special characters from the start and
// the end of each word
list.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
}
}
}
}
}
String[] words1 = list.toArray(new String[list.size()]);
// String[] words2 =uniqueWords.toArray(new String[uniqueWords.size()]);
// MysqlAccessIndex connection = new MysqlAccessIndex();
index(words1, path);
} catch (Exception e) {
System.err.println("Exception while trying to read pdf document - " + e);
isSuccess = false;
}
return isSuccess;
}
Now the update methods. Here I'm using a JDBC PreparedStatement in place of a JDBC Statement, and I'm assuming your UniqueID column is of an integer type.
Note: make the necessary changes for your environment.
public void updateComplete_DB(String uniqueID) {
try {
String sql="UPDATE filequeue SET STATUS ='Complete' WHERE STATUS ='Processing' AND UniqueID=?";
PreparedStatement statement = con.connect().prepareStatement(sql);
statement.setInt(1,Integer.parseInt(uniqueID));
int rows = statement.executeUpdate();
System.out.println("No. of rows updated: " + rows);
} catch (Exception e) {
e.printStackTrace();
}
}
public void updateProcess_DB(String uniqueID) {
try {
String sql="UPDATE filequeue SET STATUS ='Processing' WHERE UniqueID=?";
PreparedStatement statement = con.connect().prepareStatement(sql);
statement.setInt(1,Integer.parseInt(uniqueID));
int rows = statement.executeUpdate();
System.out.println("No. of rows updated: " + rows);
} catch (Exception e) {
e.printStackTrace();
}
}
public void updateError_DB(String uniqueID) {
try {
String sql="UPDATE filequeue SET STATUS ='Error' WHERE STATUS ='Processing' AND UniqueID=?";
PreparedStatement statement = con.connect().prepareStatement(sql);
statement.setInt(1,Integer.parseInt(uniqueID));
int rows = statement.executeUpdate();
System.out.println("No. of rows updated: " + rows);
} catch (Exception e) {
e.printStackTrace();
}
}
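The control flow in this answer (run the indexer for one row, then record Complete or Error for that specific row) can also be sketched without any JDBC. This is only an illustration under my own assumptions: QueueStatusSketch, Status and Indexer are hypothetical names, and the Indexer interface stands in for your doScan call. The point is that every failure, including a missing file, is turned into a return value the caller can act on instead of being swallowed inside the indexing method.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class QueueStatusSketch {

    // Status values as they appear in the filequeue table of the question.
    enum Status { ACTIVE, PROCESSING, COMPLETE, ERROR }

    // Hypothetical stand-in for the indexing step; it may throw on failure.
    interface Indexer {
        void index(Path file) throws Exception;
    }

    // Run the indexer for one queued file and report the status to store.
    // A missing file or any exception from the indexer maps to ERROR.
    public static Status process(Path file, Indexer indexer) {
        try {
            if (!Files.exists(file)) {
                return Status.ERROR;
            }
            indexer.index(file);
            return Status.COMPLETE;
        } catch (Exception e) {
            System.err.println("Indexing failed for " + file + ": " + e);
            return Status.ERROR;
        }
    }
}
```

In doScan_DB() you would then write the returned status back with an update that filters on the row's UniqueID, as in the update methods above.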
I am trying to insert records into my MySQL table after extracting the words from a file and storing them in a HashSet.
I tried using executeBatch() to insert into the database after collecting 500 records, but when execution finished I checked my table and no records had been inserted at all.
Note: when I use executeUpdate(), the records do show up in my table, but not with executeBatch(). I want to insert in batches, not one by one.
What am I doing wrong?
Code:
public void readDataBase(String path,String word) throws Exception {
try {
// Result set get the result of the SQL query
int i=0;
// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager
.getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
+ "user=root&password=root");
// Statements allow to issue SQL queries to the database
// statement = connect.createStatement();
System.out.print("Connected");
// Result set get the result of the SQL query
preparedStatement = connect
.prepareStatement("insert IGNORE into fulltext_ltat.indextable values (default,?, ?) ");
preparedStatement.setString( 1, path);
preparedStatement.setString(2, word);
preparedStatement.addBatch();
i++;
// preparedStatement.executeUpdate();
if(i%500==0){
preparedStatement.executeBatch();
}
preparedStatement.close();
// writeResultSet(resultSet);
} catch (Exception e) {
throw e;
} finally {
close();
}
}
This is the loop that calls that method (words is just an array containing the words to be inserted into the table):
for(int i = 1 ; i <= words.length - 1 ; i++ ) {
connection.readDataBase(path, words[i].toString());
}
My main method:
public static void main(String[] args) throws Exception {
StopWatch stopwatch = new StopWatch();
stopwatch.start();
File folder = new File("D:\\PDF1");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
if (file.isFile()) {
HashSet<String> uniqueWords = new HashSet<>();
String path = "D:\\PDF1\\" + file.getName();
try (PDDocument document = PDDocument.load(new File(path))) {
if (!document.isEncrypted()) {
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
String[] words = line.split(" ");
for (String word : words) {
uniqueWords.add(word);
}
}
// System.out.println(uniqueWords);
}
} catch (IOException e) {
System.err.println("Exception while trying to read pdf document - " + e);
}
Object[] words = uniqueWords.toArray();
MysqlAccessIndex connection = new MysqlAccessIndex();
for(int i = 1 ; i <= words.length - 1 ; i++ ) {
connection.readDataBase(path, words[i].toString());
}
System.out.println("Completed");
}
}
Your pattern for doing batch updates is off. You should be opening the connection and preparing the statement only once. Then, iterate multiple times, binding parameters, and add that statement to the batch.
// define a collection of paths and words somewhere
List<String> paths = new ArrayList<>();
List<String> words = new ArrayList<>();
try {
// presumably you only want to insert so many records
int LIMIT = 10000;
Class.forName("com.mysql.jdbc.Driver");
connect = DriverManager
.getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
+ "user=root&password=root");
String sql = "INSERT IGNORE INTO fulltext_ltat.indextable VALUES (default, ?, ?);";
preparedStatement = connect.prepareStatement(sql);
for (int i=0; i < LIMIT; ++i) {
preparedStatement.setString(1, paths.get(i));
preparedStatement.setString(2, words.get(i));
preparedStatement.addBatch();
if ((i + 1) % 500 == 0) {
preparedStatement.executeBatch();
}
}
// execute remaining batches
preparedStatement.executeBatch();
}
catch (SQLException e) {
e.printStackTrace();
}
finally {
try {
preparedStatement.close();
connect.close();
}
catch (SQLException e) {
e.printStackTrace();
}
}
One key change I made here is to add logic for when you should stop doing inserts. Currently, your code looks to have an infinite loop, which means it would run forever. This is probably not what you were intending to do.
Where is your loop? Try adding rewriteBatchedStatements=true to the connection URL:
connect = DriverManager
.getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
+ "user=root&password=root&rewriteBatchedStatements=true");
Let's say I have a Java program that inserts records into a table in a MySQL database.
Instead of inserting row by row, I want to insert in batches of 1000 records. Using the executeBatch method doesn't seem to work, as it still inserts row by row.
Code (only the relevant snippet):
public void readDataBase(String path,String word) throws Exception {
try {
Class.forName("com.mysql.jdbc.Driver");
connect = DriverManager
.getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
+ "user=root&password=root");
String sql="insert IGNORE into fulltext_ltat.indextable values (default,?, ?) ";
preparedStatement = connect.prepareStatement(sql);
for(int i=0;i<1000;i++) {
preparedStatement.setString(1, path);
preparedStatement.setString(2, word);
preparedStatement.addBatch();
if (i % 1000 == 0) {
preparedStatement.executeBatch();
System.out.print("Add Thousand");
}
}
} catch (SQLException e) {
e.printStackTrace();
} finally {
try {
preparedStatement.close();
connect.close();
}
catch (SQLException e) {
e.printStackTrace();
}
}
}
Code: Main method calling the above
public static void main(String[] args) throws Exception {
StopWatch stopwatch = new StopWatch();
stopwatch.start();
File folder = new File("D:\\PDF1");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
if (file.isFile()) {
HashSet<String> uniqueWords = new HashSet<>();
String path = "D:\\PDF1\\" + file.getName();
try (PDDocument document = PDDocument.load(new File(path))) {
if (!document.isEncrypted()) {
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
String[] words = line.split(" ");
for (String word : words) {
uniqueWords.add(word);
}
}
// System.out.println(uniqueWords);
}
} catch (IOException e) {
System.err.println("Exception while trying to read pdf document - " + e);
}
Object[] words = uniqueWords.toArray();
MysqlAccessIndex connection = new MysqlAccessIndex();
for(int i = 1 ; i <= words.length - 1 ; i++ ) {
connection.readDataBase(path, words[i].toString());
}
System.out.println("Completed");
}
}
As soon as I run the program, the if statement always executes, rather than executing the insert only once 1000 records have accumulated.
Am I doing anything wrong?
i % 1000 == 0 is true when i==0, so you only execute the batch in the first iteration of the loop.
You should execute the batch after the loop:
for (int i=0;i<1000;i++) {
preparedStatement.setString(1, path);
preparedStatement.setString(2, word);
preparedStatement.addBatch();
}
preparedStatement.executeBatch();
System.out.print("Add Thousand");
Now, if you had 10000 records, and you wanted to execute batch insert every 1000, you could write:
for (int i=0;i<10000;i++) {
preparedStatement.setString(1, path);
preparedStatement.setString(2, word);
preparedStatement.addBatch();
if ((i + 1) % 1000 == 0) {
preparedStatement.executeBatch();
System.out.print("Add Thousand");
}
}
EDIT: In order not to insert the same word multiple times to the table, pass an array to your method:
Change
for(int i = 1 ; i <= words.length - 1 ; i++ ) {
connection.readDataBase(path, words[i].toString());
}
to
connection.readDataBase(path, words);
and
public void readDataBase(String path,String word) throws Exception {
to
public void readDataBase(String path,String[] words) throws Exception {
and finally the batch insert loop would become:
for (int i=0;i<words.length;i++) {
preparedStatement.setString(1, path);
preparedStatement.setString(2, words[i]);
preparedStatement.addBatch();
if ((i + 1) % 1000 == 0) {
preparedStatement.executeBatch();
System.out.print("Add Thousand");
}
}
if (words.length % 1000 > 0) {
preparedStatement.executeBatch();
System.out.print("Add Remaining");
}
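To see why the (i + 1) form and the final remainder check matter, here is a small self-contained sketch that only records the loop indexes at which executeBatch() would run. FlushPointsSketch and both method names are made up for illustration; no database is involved.

```java
import java.util.ArrayList;
import java.util.List;

class FlushPointsSketch {

    // 0-based indexes after which executeBatch() would run when the condition
    // is (i + 1) % batchSize == 0, plus one final flush for leftover rows.
    public static List<Integer> flushPoints(int total, int batchSize) {
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            if ((i + 1) % batchSize == 0) {
                points.add(i);
            }
        }
        if (total % batchSize > 0) {
            points.add(total - 1); // remaining rows that did not fill a batch
        }
        return points;
    }

    // Same loop with the condition from the question, i % batchSize == 0,
    // and no flush after the loop.
    public static List<Integer> naiveFlushPoints(int total, int batchSize) {
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            if (i % batchSize == 0) {
                points.add(i);
            }
        }
        return points;
    }
}
```

With 2500 rows and a batch size of 1000, flushPoints gives [999, 1999, 2499], so every row is sent. naiveFlushPoints gives [0, 1000, 2000]: the first flush sends a single row and the last 499 rows are never sent.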
In the url configuration property, add: allowMultiQueries=true
I am trying to convert java.sql.Clob data into a String using the getSubString method (this method gives good performance compared with the others). The CLOB data is close to or more than 32 MB. From what I have observed, getSubString can return at most 33554342 bytes.
If the CLOB data exceeds 33554342 bytes, it throws the following SQL exception:
ORA-24817: Unable to allocate the given chunk for current lob operation
EDIT
CODE:
public static void main(String[] args) throws SQLException {
Main main = new Main();
Connection con = main.getConnection();
if (con == null) {
return;
}
PreparedStatement pstmt = null;
ResultSet rs = null;
String sql = "SELECT Table_ID,CLOB_FILE FROM TableName WHERE SOMECONDITION ";
String table_Id = null;
String directClobInStr = null;
CLOB clobObj = null;
String clobStr = null;
Object obj= null;
try {
pstmt = con.prepareStatement(sql);
rs = pstmt.executeQuery();
while (rs.next()) {
table_Id = rs.getString( "Table_ID" ) ;
directClobInStr = rs.getString( "clob_FILE" ) ;
obj = rs.getObject( "CLOB_FILE");
clobObj = (CLOB) obj;
System.out.println("Table id " + table_Id);
System.out.println("directClobInStr " + directClobInStr);
clobStr = clobObj.getSubString(1L, (int)clobObj.length() );//33554342
System.out.println("clobDataStr = " + clobStr);
}
}
catch (SQLException e) {
e.printStackTrace();
return;
}
catch (Exception e) {
e.printStackTrace();
return;
}
finally {
try {
rs.close();
pstmt.close();
con.close();
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
NOTE: obj = rs.getObject("CLOB_FILE") works here, but this is not my real setup, because I receive the ResultSet value from somewhere else as an Object. I have to convert it and get the data from the CLOB.
Any idea how to achieve this?
Instead of:
clobStr = clobObj.getSubString(1L, (int)clobObj.length() );
try something like:
int toread = (int) clobObj.length();
int read = 0;
final int block_size = 8*1024*1024;
StringBuilder str = new StringBuilder(toread);
while (toread > 0) {
int current_block = Math.min(toread, block_size);
str.append(clobObj.getSubString(read+1, current_block));
read += current_block;
toread -= current_block;
}
clobStr = str.toString();
It extracts the substrings in a loop (8 * 1024 * 1024 characters per iteration).
But remember that, as far as I know, Java Strings are limited to about 2^31 - 1 characters (this is why read is declared as int instead of long), while Oracle CLOBs can be as large as 128 TB.
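Another option that might be worth trying is to read the CLOB through the standard java.sql.Clob.getCharacterStream() method with a fixed-size buffer. I have not verified that this avoids ORA-24817 on your driver, so treat it as a sketch under that assumption; ClobReadSketch, readAll and the 8192-character buffer are names and values I picked for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.sql.Clob;
import java.sql.SQLException;

class ClobReadSketch {

    // Drain a Reader into a String using a fixed-size char buffer.
    public static String readAll(Reader reader, int bufferChars) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[bufferChars];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Read the whole CLOB through its character stream instead of getSubString.
    public static String clobToString(Clob clob) throws SQLException {
        try (Reader reader = clob.getCharacterStream()) {
            return readAll(reader, 8192);
        } catch (IOException e) {
            // only close() can raise this here; reads are wrapped in readAll
            throw new UncheckedIOException(e);
        }
    }
}
```

Since you receive the value as an Object, you would cast it to java.sql.Clob first and then call clobToString on it.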
I'm trying to import all googlebooks-1gram files into a PostgreSQL database. I wrote the following Java code for that:
public class ToPostgres {
public static void main(String[] args) throws Exception {
String filePath = "./";
List<String> files = new ArrayList<String>();
for (int i =0; i < 10; i++) {
files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv");
}
Connection c = null;
try {
c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks",
"postgres", "xxxxxx");
} catch (SQLException e) {
e.printStackTrace();
}
if (c != null) {
try {
PreparedStatement wordInsert = c.prepareStatement(
"INSERT INTO words (word) VALUES (?)", Statement.RETURN_GENERATED_KEYS
);
PreparedStatement countInsert = c.prepareStatement(
"INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " +
"VALUES (?,?,?,?,?)"
);
String lastWord = "";
Long lastId = -1L;
for (String filename: files) {
BufferedReader input = new BufferedReader(new FileReader(new File(filename)));
String line = "";
while ((line = input.readLine()) != null) {
String[] data = line.split("\t");
Long id = -1L;
if (lastWord.equals(data[0])) {
id = lastId;
} else {
wordInsert.setString(1, data[0]);
wordInsert.executeUpdate();
ResultSet resultSet = wordInsert.getGeneratedKeys();
if (resultSet != null && resultSet.next())
{
id = resultSet.getLong(1);
}
}
countInsert.setLong(1, id);
countInsert.setInt(2, Integer.parseInt(data[1]));
countInsert.setInt(3, Integer.parseInt(data[2]));
countInsert.setInt(4, Integer.parseInt(data[3]));
countInsert.setInt(5, Integer.parseInt(data[4]));
countInsert.executeUpdate();
lastWord = data[0];
lastId = id;
}
}
} catch (SQLException e) {
e.printStackTrace();
}
}
}
}
However, after running for about 3 hours it had only placed 1.000.000 entries in the wordcounts table. The entire 1gram dataset has 500.000.000 lines, so importing everything at this rate would take about 62.5 days. I could accept an import taking about a week, but two months? I think I'm doing something seriously wrong here. (I do have a server that runs 24/7, so I could actually run it for that long, but faster would be nice.)
EDIT: This code is how I solved it:
public class ToPostgres {
public static void main(String[] args) throws Exception {
String filePath = "./";
List<String> files = new ArrayList<String>();
for (int i =0; i < 10; i++) {
files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv");
}
Connection c = null;
try {
c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks",
"postgres", "xxxxxx");
} catch (SQLException e) {
e.printStackTrace();
}
if (c != null) {
c.setAutoCommit(false);
try {
PreparedStatement wordInsert = c.prepareStatement(
"INSERT INTO words (id, word) VALUES (?,?)"
);
PreparedStatement countInsert = c.prepareStatement(
"INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " +
"VALUES (?,?,?,?,?)"
);
String lastWord = "";
Long id = 0L;
for (String filename: files) {
BufferedReader input = new BufferedReader(new FileReader(new File(filename)));
String line = "";
int i = 0;
while ((line = input.readLine()) != null) {
String[] data = line.split("\t");
if (!lastWord.equals(data[0])) {
id++;
wordInsert.setLong(1, id);
wordInsert.setString(2, data[0]);
wordInsert.executeUpdate();
}
countInsert.setLong(1, id);
countInsert.setInt(2, Integer.parseInt(data[1]));
countInsert.setInt(3, Integer.parseInt(data[2]));
countInsert.setInt(4, Integer.parseInt(data[3]));
countInsert.setInt(5, Integer.parseInt(data[4]));
countInsert.executeUpdate();
lastWord = data[0];
if (i % 10000 == 0) {
c.commit();
}
if (i % 100000 == 0) {
System.out.println(i+" mark file "+filename);
}
i++;
}
c.commit();
}
} catch (SQLException e) {
e.printStackTrace();
}
}
}
}
I reached 1.5 million rows in about 15 minutes now. That's fast enough for me, thanks all!
JDBC connections have autocommit enabled by default, which carries a per-statement overhead. Try disabling it:
c.setAutoCommit(false);
then commit in batches, something along the lines of:
long ops = 0;
for(String filename : files) {
// ...
while ((line = input.readLine()) != null) {
// insert some stuff...
ops ++;
if(ops % 1000 == 0) {
c.commit();
}
}
}
c.commit();
If your table has indexes, it might be faster to delete them, insert the data, and recreate the indexes later.
Turning autocommit off and committing manually every 10 000 records or so (check the documentation for a reasonable value; there is some limit) could speed things up as well.
Generating the index/foreign key yourself and keeping track of it should be faster than wordInsert.getGeneratedKeys(), but I'm not sure whether that is possible with your content.
There is an approach called 'bulk insert'. I don't remember the details, but it's a starting point for a search.
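As a rough sketch of keeping track of the key yourself: the class below hands out increasing ids and only advances when the word changes. It assumes, as the question's lastWord check does, that all lines for the same word are adjacent in the input. WordIdSketch and its method names are hypothetical.

```java
class WordIdSketch {

    private String lastWord = null;
    private long lastId = 0;

    // Returns true when `word` differs from the previous one, meaning a new
    // row is needed in the words table; currentId() then holds its id.
    public boolean advance(String word) {
        if (word.equals(lastWord)) {
            return false;
        }
        lastWord = word;
        lastId++;
        return true;
    }

    // Id of the most recently seen word (0 before any word was seen).
    public long currentId() {
        return lastId;
    }
}
```

In the import loop you would call advance(data[0]), insert into words with currentId() only when it returns true, and use currentId() as word_id for every wordcounts row.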
Write it with threading, running 4 threads at the same time, or split the input into sections (read from a config file) and distribute them to X machines, then have them bring the data together.
Use batch statements to execute multiple inserts at the same time, rather than one INSERT at a time.
In addition, I would remove the part of your algorithm that updates the word count after each insert into the words table; instead, calculate all of the word counts once inserting the words is complete.
Another approach would be to do bulk inserts rather than single inserts. See the question "What's the fastest way to do a bulk insert into Postgres?" for more information.
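A batched version of the wordcounts insert could look roughly like the sketch below. It is not benchmarked; BatchedCountLoadSketch, CountRow, parse and insertAll are names I made up, and the SQL and column order are taken from the question. It turns autocommit off and leaves it off, so the caller decides what to do with the connection afterwards.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchedCountLoadSketch {

    // Hypothetical value type for one parsed line of the input file.
    static final class CountRow {
        final long wordId;
        final int year, totalCount, totalPages, totalBooks;

        CountRow(long wordId, int year, int totalCount, int totalPages, int totalBooks) {
            this.wordId = wordId;
            this.year = year;
            this.totalCount = totalCount;
            this.totalPages = totalPages;
            this.totalBooks = totalBooks;
        }
    }

    // Parse "word<TAB>year<TAB>count<TAB>pages<TAB>books" as the question's code does.
    public static CountRow parse(long wordId, String line) {
        String[] data = line.split("\t");
        return new CountRow(wordId, Integer.parseInt(data[1]), Integer.parseInt(data[2]),
                Integer.parseInt(data[3]), Integer.parseInt(data[4]));
    }

    // Queue rows with addBatch and send plus commit them every batchSize rows.
    public static void insertAll(Connection c, Iterable<CountRow> rows, int batchSize)
            throws SQLException {
        c.setAutoCommit(false);
        String sql = "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) "
                + "VALUES (?,?,?,?,?)";
        try (PreparedStatement ps = c.prepareStatement(sql)) {
            int pending = 0;
            for (CountRow r : rows) {
                ps.setLong(1, r.wordId);
                ps.setInt(2, r.year);
                ps.setInt(3, r.totalCount);
                ps.setInt(4, r.totalPages);
                ps.setInt(5, r.totalBooks);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    c.commit();
                    pending = 0;
                }
            }
            if (pending > 0) { // rows that did not fill a whole batch
                ps.executeBatch();
                c.commit();
            }
        }
    }
}
```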
Create threads
String lastWord = "";
Long lastId = -1L;
PreparedStatement wordInsert;
PreparedStatement countInsert ;
public class ToPostgres {
public void main(String[] args) throws Exception {
String filePath = "./";
List<String> files = new ArrayList<String>();
for (int i =0; i < 10; i++) {
files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv");
}
Connection c = null;
try {
c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks",
"postgres", "xxxxxx");
} catch (SQLException e) {
e.printStackTrace();
}
if (c != null) {
try {
wordInsert = c.prepareStatement(
"INSERT INTO words (word) VALUES (?)", Statement.RETURN_GENERATED_KEYS
);
countInsert = c.prepareStatement(
"INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " +
"VALUES (?,?,?,?,?)"
);
for (String filename: files) {
new MyThread(filename). start();
}
} catch (SQLException e) {
e.printStackTrace();
}
}
}
}
class MyThread extends Thread{
String file;
public MyThread(String file) {
this.file = file;
}
@Override
public void run() {
try {
super.run();
BufferedReader input = new BufferedReader(new FileReader(new File(file)));
String line = "";
while ((line = input.readLine()) != null) {
String[] data = line.split("\t");
Long id = -1L;
if (lastWord.equals(data[0])) {
id = lastId;
} else {
wordInsert.setString(1, data[0]);
wordInsert.executeUpdate();
ResultSet resultSet = wordInsert.getGeneratedKeys();
if (resultSet != null && resultSet.next())
{
id = resultSet.getLong(1);
}
}
countInsert.setLong(1, id);
countInsert.setInt(2, Integer.parseInt(data[1]));
countInsert.setInt(3, Integer.parseInt(data[2]));
countInsert.setInt(4, Integer.parseInt(data[3]));
countInsert.setInt(5, Integer.parseInt(data[4]));
countInsert.executeUpdate();
lastWord = data[0];
lastId = id;
}
} catch (NumberFormatException e) {
e.printStackTrace();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
}