Optimizing indexing lucene 5.2.1 - java

I have developed my own indexer in Lucene 5.2.1. I am trying to index a 1.5 GB file, and I need to do some non-trivial calculation at indexing time on every single document of the collection.
The problem is that it takes almost 20 minutes to do all the indexing! I have followed this very helpful wiki, but it is still way too slow. I have tried increasing the Eclipse heap space and the Java VM memory, but the bottleneck seems to be the hard disk rather than virtual memory (I am using a laptop with 6 GB of RAM and an ordinary hard disk).
I have read this discussion, which suggests using RAMDirectory or mounting a RAM disk. The problem with a RAM disk would be persisting the index to my filesystem (I don't want to lose indexes after a reboot). The problem with RAMDirectory is that, according to the API docs, I should not use it because my index is larger than "several hundred megabytes"...
Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
Here you can find my code:
public class ReviewIndexer {
private JSONParser parser;
private PerFieldAnalyzerWrapper reviewAnalyzer;
private IndexWriterConfig iwConfig;
private IndexWriter indexWriter;
public ReviewIndexer() throws IOException{
parser = new JSONParser();
reviewAnalyzer = new ReviewWrapper().getPFAWrapper();
iwConfig = new IndexWriterConfig(reviewAnalyzer);
//change ram buffer size to speed things up
//#url https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
//little speed increase
// Set to overwrite the existing index
indexWriter = new IndexWriter(FileUtils.openDirectory("review_index"), iwConfig);
/**
 * Indexes every review.
 * @param file_path the path of the yelp_academic_dataset_review.json file
 * @throws IOException
 * @return true if everything goes fine.
 */
public boolean indexReviews(String file_path) throws IOException{
BufferedReader br;
try {
//open the file
br = new BufferedReader(new FileReader(file_path));
String line;
//define fields
StringField type = new StringField("type", "", Store.YES);
String reviewtext = "";
TextField text = new TextField("text", "", Store.YES);
StringField business_id = new StringField("business_id", "", Store.YES);
StringField user_id = new StringField("user_id", "", Store.YES);
LongField stars = new LongField("stars", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
LongField date = new LongField("date", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
StringField votes = new StringField("votes", "", Store.YES);
Date reviewDate;
JSONObject jsonVotes;
try {
//scan the file line by line
//TO-DO: split in chunks and use parallel computation
while ((line = br.readLine()) != null) {
try {
JSONObject jsonline = (JSONObject) parser.parse(line);
Document review = new Document();
//add values to fields
type.setStringValue((String) jsonline.get("type"));
business_id.setStringValue((String) jsonline.get("business_id"));
user_id.setStringValue((String) jsonline.get("user_id"));
stars.setLongValue((long) jsonline.get("stars"));
reviewtext = (String) jsonline.get("text");
//non-trivial function being calculated here
reviewDate = DateTools.stringToDate((String) jsonline.get("date"));
jsonVotes = (JSONObject) jsonline.get("votes");
//add fields to document
//write the document to index
} catch (ParseException | java.text.ParseException e) {
return false;
}//end of while
} catch (IOException e) {
return false;
//close buffer reader and commit changes
} catch (FileNotFoundException e1) {
return false;
return true;
public void close() throws IOException {
What is the best thing to do then? Should I build a RAM disk and copy the index to the filesystem once it is done, or should I use RAMDirectory anyway, or maybe something else? Many thanks.

Lucene claims 150GB/hour on modern hardware - that is with 20 indexing threads on a 24 core machine.
You have 1 thread, so expect about 150/20 = 7.5 GB/hour; at that rate, 1.5 GB alone would take roughly 12 minutes. You will probably see that 1 core is working at 100% and the rest are only working when merging segments.
You should use multiple indexing threads to speed things up. See for example the luceneutil Indexer.java for inspiration.
As you have a laptop I suspect you have either 4 or 8 cores, so multi-threading should be able to give your indexing a nice boost.
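The multi-threading idea can be sketched with the standard library alone. This is a minimal sketch, not the luceneutil approach: one thread reads lines, and a fixed pool of workers processes them. The `indexLine` callback is a hypothetical stand-in for "parse the line, build the Document, call `indexWriter.addDocument(...)`". Lucene documents `IndexWriter` as thread-safe, but whether the JSON parser and the reused `Field` instances from the question can be shared between threads is an assumption worth checking; giving each worker its own is the safer choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

final class ParallelLineIndexer {
    // Sentinel instance telling a worker that no more lines will arrive.
    private static final String POISON = new String("<<EOF>>");

    /**
     * Reads lines on the calling thread and hands them to `threads` workers.
     * `indexLine` (hypothetical) must be safe to call from several threads.
     * Returns the number of lines processed without an exception.
     */
    static long indexAll(BufferedReader reader, int threads, Consumer<String> indexLine)
            throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        AtomicLong processed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    String line;
                    // Identity comparison on purpose: only the sentinel stops a worker.
                    while ((line = queue.take()) != POISON) {
                        try {
                            indexLine.accept(line);
                            processed.incrementAndGet();
                        } catch (RuntimeException e) {
                            // Keep the worker alive; a real indexer would log this.
                            System.err.println("Skipping line: " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        String line;
        while ((line = reader.readLine()) != null) {
            queue.put(line); // blocks when the workers fall behind
        }
        for (int i = 0; i < threads; i++) {
            queue.put(POISON); // one sentinel per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return processed.get();
    }
}
```

The bounded queue keeps memory use flat: the reader blocks instead of loading the whole file when the workers are slower than the disk.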

You can try setMaxThreadStates in IndexWriterConfig.


Encoding PDF files to Base64 takes too long when there are 100k documents to encode

I am trying to encode PDF documents to Base64. With a small number of documents (around 2,000) it works nicely, but I have more than 100k documents to encode.
Encoding all those files takes a long time. Is there a better approach for encoding a large data set?
Please find my current approach below:
String filepath=doc.getPath().concat(doc.getFilename());
file = new File(filepath);
if(file.exists() && !file.isDirectory()) {
try {
FileInputStream fileInputStreamReader = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
} catch (FileNotFoundException e) {
Try this:
Figure out how many files you need to encode.
long files = Files.list(Paths.get(directory)).count();
Split them up into batches of a size that a thread can reasonably handle in Java. For example, if you have 100k files to encode, split them into 100 lists of 1,000, or something like that.
int currentIndex = 0;
for (File file : filesInDir) {
if (fileMap.get(currentIndex).size() >= cap)
/** It's going to take a little more effort than this, but it's the idea I'm trying to show you. */
Execute each worker thread one after another if the computer's resources are available.
for (Integer key : fileMap.keySet()) {
new WorkerThread(fileMap.get(key)).start();
You can check the current resources available with:
public boolean areResourcesAvailable() {
return imNotThatNice();
/**
 * Gets the resource utility instance
 * @return the current instance of the resource utility
 */
private static OperatingSystemMXBean getInstance() {
if (ResourceUtil.instance == null) {
ResourceUtil.instance = ManagementFactory.getOperatingSystemMXBean();
return ResourceUtil.instance;
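
As an alternative to starting one thread per batch by hand, the same idea can be sketched with a fixed thread pool from `java.util.concurrent`. This is a minimal sketch under assumptions: the class and method names are hypothetical, the pool is sized to the CPU count, and every file is small enough to be read fully into memory. It reads the bytes with `Files.readAllBytes`, which also avoids encoding an array that was never filled from the stream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelBase64 {
    /** Encodes each file to Base64 on a pool sized to the number of CPUs. */
    static Map<Path, String> encodeAll(List<Path> files)
            throws InterruptedException, ExecutionException {
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(pool.submit(() -> encodeOne(file)));
            }
            // Collect results in the same order as the input list.
            Map<Path, String> result = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                result.put(files.get(i), futures.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    /** Reads the whole file and returns its Base64 representation. */
    static String encodeOne(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            return Base64.getEncoder().encodeToString(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For 100k documents it is probably better to hand each encoded string to its consumer as soon as its future completes rather than keeping all of them in one map; the map is used here only to keep the sketch short.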

Writing a large amount of data to Excel: GC overhead limit exceeded

I have a list of strings read from MongoDB (~200k lines).
Then I want to write it to an excel file with Java code:
public class OutputToExcelUtils {
private static XSSFWorkbook workbook;
private static final String DATA_SEPARATOR = "!";
public static void clusterOutToExcel(List<String> data, String outputPath) {
workbook = new XSSFWorkbook();
FileOutputStream outputStream = null;
writeData(data, "Data");
try {
outputStream = new FileOutputStream(outputPath);
} catch (IOException e) {
public static void writeData(List<String> data, String sheetName) {
int rowNum = 0;
XSSFSheet sheet = workbook.getSheet(sheetName);
sheet = workbook.createSheet(sheetName);
for (int i = 0; i < data.size(); i++) {
System.out.println(sheetName + " Processing line: " + i);
int colNum = 0;
// Split into value of cell
String[] valuesOfLine = data.get(i).split(DATA_SEPARATOR);
Row row = sheet.createRow(rowNum++);
for (String valueOfCell : valuesOfLine) {
Cell cell = row.createCell(colNum++);
Then I get an error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead
limit exceeded at
org.apache.xmlbeans.impl.store.Cur$Locations.<init>(Cur.java:497) at
org.apache.xmlbeans.impl.store.Locale.<init>(Locale.java:168) at
org.apache.xmlbeans.impl.store.Locale.getLocale(Locale.java:242) at
org.apache.xmlbeans.impl.store.Locale.newInstance(Locale.java:593) at
Source) at
at ups.mongodb.App.main(App.java:74)
Please give me some advice for that?
Thank you with my respect.
Update, solution: use SXSSFWorkbook instead of XSSFWorkbook
public class OutputToExcelUtils {
private static SXSSFWorkbook workbook;
private static final String DATA_SEPERATOR = "!";
public static void clusterOutToExcel(ClusterOutput clusterObject, ClusterOutputTrade clusterOutputTrade,
ClusterOutputDistance ClusterOutputDistance, String outputPath) {
workbook = new SXSSFWorkbook();
FileOutputStream outputStream = null;
writeData(clusterOutputTrade.getTrades(), "Data");
try {
outputStream = new FileOutputStream(outputPath);
} catch (IOException e) {
public static void writeData(List<String> data, String sheetName) {
int rowNum = 0;
SXSSFSheet sheet = workbook.createSheet(sheetName);
sheet.setRandomAccessWindowSize(100); // Keep 100 rows in memory; older rows are flushed to disk as new ones are written
for (int i = 0; i < data.size(); i++) {
System.out.println(sheetName + " Processing line: " + i);
int colNum = 0;
// Split into value of cell
String[] valuesOfLine = data.get(i).split(DATA_SEPERATOR);
Row row = sheet.createRow(rowNum++);
for (String valueOfCell : valuesOfLine) {
Cell cell = row.createCell(colNum++);
Your application is spending too much time doing garbage collection. This doesn't necessarily mean that it is running out of heap space; however, it spends too much time in GC relative to performing actual work, so the Java runtime shuts it down.
Try to enable throughput collection with the following JVM option:
While you're at it, give your application as much heap space as possible:
(where ???? stands for the amount of heap space in MB, e.g. -Xms8192m)
If this doesn't help, try to set a more lenient throughput goal with this option:
This specifies that your application should do 19 times more useful work than GC-related work, i.e. it allows the GC to consume up to 5% of the processor time (I believe the stricter 1% default goal may be causing the above runtime error)
No guarantee that this will work. Can you check and post back so others who experience similar problems may benefit?
Your root problem remains the fact that you need to hold the entire spreadsheet and all its related objects in memory while you are building it. Another solution would be to serialize the data, i.e. to write the actual spreadsheet file incrementally instead of constructing it in memory and saving it at the end. However, this requires reading up on the XLSX format and creating a custom solution.
Another option would be looking for a less memory-intensive library (if one exists). Possible alternatives to POI are JExcelAPI (open source) and Aspose.Cells (commercial).
I've used JExcelAPI years ago and had a positive experience (however, it appears that it is much less actively maintained than POI, so may no longer be the best choice).
Looks like POI offers a streaming model (https://poi.apache.org/spreadsheet/how-to.html#sxssf), so this may be the best overall approach.
Well, try not to load all the data into memory. Even if the binary representation of 200k lines is not that big, the hydrated objects in memory may be too big. As a hint: if you have a POJO, each attribute in it holds a reference, and each reference takes 4 or 8 bytes depending on whether it is compressed or not. This means that if each record is a POJO with 4 attributes, the references alone cost 200,000 × 4 × 4 bytes (about 3.2 MB), or twice that with 8-byte references.
Theoretically you can increase the amount of memory given to the JVM, but this is not a good solution, or more precisely it is not a good solution for a live system. For a non-interactive system it might be fine.
Hint: Use -Xmx -Xms jvm arguments to control the heap size.
Instead of getting the entire list from the data, iterate line wise.
If too cumbersome, write the list to a file, and reread it linewise, for instance as a Stream<String>:
Path path = Files.createTempFile(...);
Files.write(path, list, StandardCharsets.UTF_8);
Files.lines(path, StandardCharsets.UTF_8)
.forEach(line -> { ... });
On the Excel side: although xlsx uses shared strings, XSSF may still hold duplicate String instances if it was implemented carelessly;
the following cache would use a single String instance for repeated string values.
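import java.util.HashMap;
import java.util.Map;

public class StringCache {
    private static final int MAX_LENGTH = 40;
    // Maps each short string to its first-seen instance.
    private Map<String, String> identityMap = new HashMap<>();

    public String cached(String s) {
        if (s == null) {
            return null;
        }
        // Long strings are unlikely to repeat, so they are not cached.
        if (s.length() > MAX_LENGTH) {
            return s;
        }
        String t = identityMap.get(s);
        if (t == null) {
            t = s;
            identityMap.put(t, t);
        }
        return t;
    }
}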
StringCache strings = new StringCache();
for (String valueOfCell : valuesOfLine) {
Cell cell = row.createCell(colNum++);

Reading a csv file with millions of row via java as fast as possible

I want to read a CSV file with millions of rows and use the attributes for my decision tree algorithm. My code is below:
String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList();
String line = "";
String cvsSplitBy = ",";
String encoding = "UTF-8";
BufferedReader br2 = null;
try {
int counterRow = 0;
br2 = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding));
while ((line = br2.readLine()) != null) {
line=line.replaceAll(",,", ",NA,");
String[] object = line.split(cvsSplitBy);
System.out.println("counterRow is: "+counterRow);
for(int i=1;i<rowList.size();i++){
//this method includes many if elses only.
catch(Exception ex){
System.out.println("Exception occurred");
catch(Exception ex){
It is working fine when the size of the csv file is not large. However, it is large indeed. Therefore I need another way to read a csv faster. Is there any advice? Appreciated, thanks.
Just use uniVocity-parsers' CSV parser instead of trying to build your custom parser. Your implementation will probably not be fast or flexible enough to handle all corner cases.
It is extremely memory efficient and you can parse a million rows in less than a second. This link has a performance comparison of many java CSV libraries and univocity-parsers comes on top.
Here's a simple example of how to use it:
CsvParserSettings settings = new CsvParserSettings(); // you'll find many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);
// parses all rows in one go (you should probably use a RowProcessor or iterate row by row if there are many rows)
List<String[]> allRows = parser.parseAll(new File("/path/to/your.csv"));
BUT, that loads everything into memory. To stream all rows, you can do this:
String[] row;
while ((row = parser.parseNext()) != null) {
//process row here.
The faster approach is to use a RowProcessor, it also gives more flexibility:
CsvParser parser = new CsvParser(settings);
Lastly, it has built-in routines that use the parser to perform some common tasks (iterate java beans, dump ResultSets, etc)
This should cover the basics, check the documentation to find the best approach for your case.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
In this snippet I see two issues which will slow you down considerably:
while ((line = br2.readLine()) != null) {
line=line.replaceAll(",,", ",NA,");
String[] object = line.split(cvsSplitBy);
First, rowList starts with the default capacity and will have to be increased many times, always causing a copy of the old underlying array to the new.
Worse, however, is the excessive blow-up of the data into String[] objects. You need the columns/cells only when you call ImplementDecisionTreeRulesFor2012 for that row, not all the time while you read the file and process all the other rows. Move the split (or something better, as suggested in the comments) to the second loop.
(Creating many objects is bad, even if you can afford the memory.)
Perhaps it would be better to call ImplementDecisionTreeRulesFor2012 while you read the "millions"? It would avoid the rowList ArrayList altogether.
Postponing the split reduces the execution time for 10 million rows
from 1m8.262s (when the program ran out of heap space) to 13.067s.
If you aren't forced to read all rows before you can call ImplementDecisionTreeRulesFor2012, the time reduces to 4.902s.
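Processing each row while reading, without an intermediate list, could look like the sketch below. It is only an illustration: `forEachRow` and the `handler` callback (standing in for the call to ImplementDecisionTreeRulesFor2012) are hypothetical names, and the plain `split` assumes there are no quoted fields containing commas. Empty cells are replaced with "NA" after the split, which also covers several consecutive empty cells.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Consumer;

final class CsvRowStreamer {
    /**
     * Reads the input line by line and passes the cells of each row to
     * `handler` immediately, so no list of all rows is ever built.
     * Returns the number of rows handled.
     */
    static long forEachRow(BufferedReader reader, Consumer<String[]> handler)
            throws IOException {
        long rows = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // A limit of -1 keeps trailing empty cells.
            String[] cells = line.split(",", -1);
            for (int i = 0; i < cells.length; i++) {
                if (cells[i].isEmpty()) {
                    cells[i] = "NA";
                }
            }
            handler.accept(cells);
            rows++;
        }
        return rows;
    }
}
```

Each `String[]` becomes garbage as soon as the handler returns, so the heap stays small no matter how many rows the file has.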
Finally writing the split and replace by hand:
String[] object = new String[7];
String x = line + ","; // trailing comma so the last cell is found too
int iPos = 0;
int iStr = 0;
int iNext = -1;
while( (iNext = x.indexOf( ',', iPos )) != -1 && iStr < 7 ){
    if( iNext == iPos ){
        object[iStr++] = "NA"; // empty cell
    } else {
        object[iStr++] = x.substring( iPos, iNext );
    }
    iPos = iNext + 1;
}
// add more "NA" if rows can have less than 7 cells
reduces the time to 1.983s. This is about 30 times faster than the original code, which runs into OutOfMemory anyway.
On top of the aforementioned univocity, it's worth checking
http://simpleflatmapper.org/0101-getting-started-csv.html; it also has a low-level API that bypasses the String creation.
The three of them would, as of the time of this comment, be the fastest CSV parsers.
Chances are that writing your own parser would be slower and buggier.
If you're aiming for objects (i.e. data-binding), I've written a high-performance library sesseltjonna-csv you might find interesting. Benchmark comparison with SimpleFlatMapper and uniVocity here.

Most efficient merging of 2 text files.

So I have large (around 4 gigs each) txt files in pairs and I need to create a 3rd file which would consist of the 2 files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2) and this is repeated until I hit the end of file 1 (both input files will have the same length; this is by definition). Here is the code I'm using now, but it doesn't scale very well to large files. I was wondering if there is a more efficient way to do this. Would working with a memory-mapped file help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
try {
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
String forwardLine = null;
System.out.println("Begin merging Fastq files");
int readsMerge = 0;
while ((forwardLine = inputReaderForward.readLine()) != null) {
//append the forward file
//append the reverse file
if(readsMerge % 10000 == 0) {
System.out.println("[" + now() + "] Merged 10000");
readsMerge = 0;
} catch (IOException ex) {
Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
Maybe you also want to try to use a BufferedWriter to cut down your file IO operations.
A simple answer is to use a bigger buffer, which helps to reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with FileChannel (see Java NIO) is used for handling large data files. It does not help much in this case, however, as you need to inspect the file content in order to find the boundary after every 4 lines.
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of language used, what I would do is try to manage memory myself. I would create two large buffers, say 128MB or more each and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm will start moving characters one by one from input buffer #1 to destination buffer, and at the same time count EOLs. Once you reach the 4th line you store the current position on that buffer away and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing the buffers when you consume all the data in them. Each time you have to refill the input buffers you can also write the destination buffer and empty it.
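A simplified version of this idea can be sketched in Java. Note the swap: instead of hand-managed character buffers and manual EOL counting, this sketch relies on `BufferedReader` and `BufferedWriter`, which do the buffering internally; the class and method names are hypothetical. It copies 4 lines from the first input, then 4 from the second, until the first input ends.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;

final class FastqInterleaver {
    private static final int LINES_PER_RECORD = 4;

    /**
     * Copies 4 lines from `forward`, then 4 lines from `reverse`, and repeats
     * until `forward` is exhausted. Returns the number of record pairs written.
     */
    static long interleave(BufferedReader forward, BufferedReader reverse,
                           BufferedWriter out) throws IOException {
        long pairs = 0;
        while (copyRecord(forward, out)) {
            if (!copyRecord(reverse, out)) {
                throw new IOException("reverse input ended before forward input");
            }
            pairs++;
        }
        out.flush();
        return pairs;
    }

    /** Copies one 4-line record; returns false if the input is already at EOF. */
    private static boolean copyRecord(BufferedReader in, BufferedWriter out)
            throws IOException {
        for (int i = 0; i < LINES_PER_RECORD; i++) {
            String line = in.readLine();
            if (line == null) {
                if (i == 0) {
                    return false;
                }
                throw new IOException("input ended in the middle of a record");
            }
            out.write(line);
            out.write('\n');
        }
        return true;
    }
}
```

To get the large buffers suggested above, the streams can be opened with an explicit size, for example `new BufferedReader(new FileReader(forwardFile), 1 << 20)` and `new BufferedWriter(new FileWriter(outputFile), 1 << 20)`; the 1 MB figure is an arbitrary starting point to tune, not a measured optimum.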
Buffer your read and write operations. Buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; // optimize the size of buffer to your needs
    int num;
    // read() returns the number of bytes read, or -1 at end of stream
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
I just realized that you need to shuffle the lines, so this code will not work for you as is but, the concept still remains the same.

Java Heap Space Error, OutofMemory Exception while writing large data to excel sheet

I am getting Java Heap Space Error while writing large data from database to an excel sheet.
I don't want to use the JVM -Xmx option to increase memory.
Following are the details:
1) I am using org.apache.poi.hssf api
for excel sheet writing.
2) JDK version 1.5
3) Tomcat 6.0
The code I have written works well for around 23 thousand records, but it fails for more than 23K records.
Following is the code:
ArrayList l_objAllTBMList= new ArrayList();
l_objAllTBMList = (ArrayList) m_objFreqCvrgDAO.fetchAllTBMUsers(p_strUserTerritoryId);
ArrayList l_objDocList = new ArrayList();
m_objTotalDocDtlsInDVL= new HashMap();
Object l_objTBMRecord[] = null;
Object l_objVstdDocRecord[] = null;
int l_intDocLstSize=0;
VisitedDoctorsVO l_objVisitedDoctorsVO=null;
int l_tbmListSize=l_objAllTBMList.size();
System.out.println(" getMissedDocDtlsList_NSM ");
for(int i=0; i<l_tbmListSize;i++)
l_objTBMRecord = (Object[]) l_objAllTBMList.get(i);
l_objDocList = (ArrayList) m_objGenerateVisitdDocsReportDAO.fetchAllDocDtlsInDVL_NSM((String) l_objTBMRecord[1], p_divCode, (String) l_objTBMRecord[2], p_startDt, p_endDt, p_planType, p_LMSValue, p_CycleId, p_finYrId);
try {
l_objVOFactoryForDoctors = new VOFactory(l_intDocLstSize, VisitedDoctorsVO.class);
/* Factory class written to create and maintain limited no of Value Objects (VOs)*/
} catch (ClassNotFoundException ex) {
m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:"+ex);
} catch (InstantiationException ex) {
m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:"+ex);
} catch (IllegalAccessException ex) {
m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:"+ex);
for(int j=0; j<l_intDocLstSize;j++)
l_objVstdDocRecord = (Object[]) l_objDocList.get(j);
l_objVisitedDoctorsVO = (VisitedDoctorsVO) l_objVOFactoryForDoctors.getVo();
if (((String) l_objVstdDocRecord[6]).equalsIgnoreCase("-"))
if (!"null".equals(String.valueOf(l_objVstdDocRecord[2])))
l_objVisitedDoctorsVO.setEmpcode((String) l_objTBMRecord[1]);
l_objVisitedDoctorsVO.setEmpname((String) l_objTBMRecord[0]);
l_objVisitedDoctorsVO.setDoctorid((String) l_objVstdDocRecord[1]);
l_objVisitedDoctorsVO.setDr_name((String) l_objVstdDocRecord[4] + " " + (String) l_objVstdDocRecord[5]);
l_objVisitedDoctorsVO.setDoctor_potential((String) l_objVstdDocRecord[3]);
l_objVisitedDoctorsVO.setSpeciality((String) l_objVstdDocRecord[7]);
l_objVisitedDoctorsVO.setActualpractice((String) l_objVstdDocRecord[8]);
m_objTotalDocDtlsInDVL.put((String) l_objVstdDocRecord[1], l_objVisitedDoctorsVO);
}// End of While
writeExcelSheet(); // Pasting this method at the end
// Clean up code
m_objTotalDocDtlsInDVL.clear();// Clear the used map
}// End of While
private void writeExcelSheet() throws IOException
HSSFRow l_objRow = null;
HSSFCell l_objCell = null;
VisitedDoctorsVO l_objVisitedDoctorsVO = null;
Iterator l_itrDocMap = m_objTotalDocDtlsInDVL.keySet().iterator();
while (l_itrDocMap.hasNext())
Object key = l_itrDocMap.next();
l_objVisitedDoctorsVO = (VisitedDoctorsVO) m_objTotalDocDtlsInDVL.get(key);
l_objRow = m_objSheet.createRow(m_iRowCount++);
l_objCell = l_objRow.createCell(0);
l_objCell = l_objRow.createCell(1);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getEmpname() + " (" + l_objVisitedDoctorsVO.getEmpcode() + ")"); // TBM Name
l_objCell = l_objRow.createCell(2);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getDr_name());// Doc Name
l_objCell = l_objRow.createCell(3);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getPotential_score());// Freq potential score
l_objCell = l_objRow.createCell(4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getDoctor_potential());// Freq potential score
l_objCell = l_objRow.createCell(5);
l_objCell = l_objRow.createCell(6);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getActualpractice());// Actual practise
l_objCell = l_objRow.createCell(7);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getPreviousmet());// Lastmet
l_objCell = l_objRow.createCell(8);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getLastmet());// Previousmet
// Write OutPut Stream
try {
out = new FileOutputStream(m_objFile);
outBf = new BufferedOutputStream(out);
} catch (Exception ioe) {
System.out.println(" Exception in chunk write");
} finally {
if (outBf != null) {
Instead of populating the complete list in memory before starting to write to excel you need to modify the code to work in such a way that each object is written to a file as it is read from the database. Take a look at this question to get some idea of the other approach.
Well, I'm not sure if POI can handle incremental updates but if so you might want to write chunks of say 10000 Rows to the file. If not, you might have to use CSV instead (so no formatting) or increase memory.
The problem is that you need to make objects written to the file eligible for garbage collection (no references from a live thread anymore) before writing the file is finished (before all rows have been generated and written to the file).
If you can write smaller chunks of data to the file, you'd also have to load only the necessary chunks from the db. So it doesn't make sense to load 50000 records at once and then try to write 5 chunks of 10000, since those 50000 records are likely to consume a lot of memory already.
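The chunked approach, combined with the CSV fallback mentioned above, could look like the following sketch. It makes several assumptions: `fetchChunk` is a hypothetical stand-in for a paged database query, no cell contains a comma or quote (there is no CSV escaping), and the code targets Java 8 or later rather than the JDK 1.5 from the question.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.List;
import java.util.function.BiFunction;

final class ChunkedCsvExport {
    /**
     * Repeatedly asks `fetchChunk(offset, chunkSize)` for the next rows and
     * writes them straight to `out`. An empty list signals that no rows
     * remain. Returns the number of rows written.
     */
    static long export(BiFunction<Integer, Integer, List<String[]>> fetchChunk,
                       int chunkSize, BufferedWriter out) throws IOException {
        long written = 0;
        int offset = 0;
        while (true) {
            List<String[]> chunk = fetchChunk.apply(offset, chunkSize);
            if (chunk.isEmpty()) {
                break;
            }
            for (String[] row : chunk) {
                out.write(String.join(",", row));
                out.write('\n');
            }
            written += chunk.size();
            offset += chunk.size();
            // The reference to this chunk is overwritten on the next iteration,
            // so its rows become eligible for garbage collection.
        }
        out.flush();
        return written;
    }
}
```

Only one chunk is referenced at a time, so peak memory depends on `chunkSize` rather than on the total number of records.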
As Thomas points out, you have too many objects taking up too much space, and need a way to reduce that. There are a couple of strategies for this that I can think of:
Do you need to create a new factory each time in the loop, or can you reuse it?
Can you start with a loop getting the information you need into a new structure, and then discarding the old one?
Can you split the processing into a thread chain, sending information forwards to the next step, avoiding building a large memory consuming structure at all?

