Hadoop DistributedCache object changed during job - java

I'm trying to run KMeans on AWS, and I ran into the following exception when trying to read updated cluster centroids from the DistributedCache:
java.io.IOException: The distributed cache object s3://mybucket/centroids_6/part-r-00009 changed during the job from 4/8/13 2:20 PM to 4/8/13 2:20 PM
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:401)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:475)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1246)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1237)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1152)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2541)
at java.lang.Thread.run(Thread.java:662)
What sets this question apart from this one is that the error appears intermittently. I've run the same code successfully on a smaller dataset. Furthermore, when I change the number of centroids from 12 (seen above in the code) to 8, it fails on iteration 5 instead of 6 (which you can see in the centroids_6 name above).
Here's the relevant DistributedCache code in the main driver that runs the KMeans loop:
int iteration = 1;
long changes = 0;
do {
    // First, write the previous iteration's centroids to the dist cache.
    Configuration iterConf = new Configuration();
    Path prevIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration - 1));
    FileSystem fs = prevIter.getFileSystem(iterConf);
    Path pathPattern = new Path(prevIter, "part-*");
    FileStatus [] list = fs.globStatus(pathPattern);
    for (FileStatus status : list) {
        DistributedCache.addCacheFile(status.getPath().toUri(), iterConf);
    }

    // Now, set up the job.
    Job iterJob = new Job(iterConf);
    iterJob.setJobName("KMeans " + iteration);
    iterJob.setJarByClass(KMeansDriver.class);
    Path nextIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration));
    KMeansDriver.delete(iterConf, nextIter);

    // Set input/output formats.
    iterJob.setInputFormatClass(SequenceFileInputFormat.class);
    iterJob.setOutputFormatClass(SequenceFileOutputFormat.class);

    // Set Mapper, Reducer, Combiner.
    iterJob.setMapperClass(KMeansMapper.class);
    iterJob.setCombinerClass(KMeansCombiner.class);
    iterJob.setReducerClass(KMeansReducer.class);

    // Set MR formats.
    iterJob.setMapOutputKeyClass(IntWritable.class);
    iterJob.setMapOutputValueClass(VectorWritable.class);
    iterJob.setOutputKeyClass(IntWritable.class);
    iterJob.setOutputValueClass(VectorWritable.class);

    // Set input/output paths.
    FileInputFormat.addInputPath(iterJob, data);
    FileOutputFormat.setOutputPath(iterJob, nextIter);

    iterJob.setNumReduceTasks(nReducers);

    if (!iterJob.waitForCompletion(true)) {
        System.err.println("ERROR: Iteration " + iteration + " failed!");
        System.exit(1);
    }
    iteration++;
    changes = iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).getValue();
    iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).setValue(0);
} while (changes > 0);
How else would the files be modified? The only possibility I can think of is that, at the completion of one iteration, the loop begins again before the centroids from the previous job have finished writing. But as you can see in the code, I invoke the job with waitForCompletion(true), so there shouldn't be any residual parts of the job running when the loop starts over. Any ideas?

This isn't really an answer, but I did realize it was silly to use the DistributedCache in the way I was, as opposed to reading the results from the previous iteration directly from HDFS. I instead wrote this method in the main driver:
public static HashMap<Integer, VectorWritable> readCentroids(Configuration conf, Path path)
        throws IOException {
    HashMap<Integer, VectorWritable> centroids = new HashMap<Integer, VectorWritable>();
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    FileStatus [] list = fs.globStatus(new Path(path, "part-*"));
    for (FileStatus status : list) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        IntWritable key = null;
        VectorWritable value = null;
        try {
            key = (IntWritable) reader.getKeyClass().newInstance();
            value = (VectorWritable) reader.getValueClass().newInstance();
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
        while (reader.next(key, value)) {
            centroids.put(new Integer(key.get()),
                    new VectorWritable(value.get(), value.getClusterId(), value.getNumInstances()));
        }
        reader.close();
    }
    return centroids;
}
This is invoked in the setup() method of the Mapper and Reducer during each iteration, to read the centroids of the previous iteration.
protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path centroidsPath = new Path(conf.get(KMeansDriver.CENTROIDS));
    centroids = KMeansDriver.readCentroids(conf, centroidsPath);
}
This allowed me to remove the block of code in the loop in my original question which writes the centroids to the DistributedCache. I tested it, and it now works on both large and small datasets.
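For completeness, the only other driver-side change this needs is pointing the job configuration at the previous iteration's output directory so that setup() can find it. A minimal sketch of that replacement, assuming the KMeansDriver.CENTROIDS constant used in the conf.get() call above:
// Inside the do/while loop, in place of the DistributedCache block:
Configuration iterConf = new Configuration();
Path prevIter = new Path(centroidsPath.getParent(),
        String.format("centroids_%s", iteration - 1));
// Record where the previous iteration's centroids live; the Mapper and Reducer
// setup() methods read this key and call readCentroids() on it.
iterConf.set(KMeansDriver.CENTROIDS, prevIter.toString());
Job iterJob = new Job(iterConf);
// ... the rest of the job setup is unchanged ...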
I still don't know why I was getting the error I posted about (how would something in the read-only DistributedCache be changed? especially when I was changing HDFS paths on every iteration?), but this seems to both work and be a much less hack-y way of reading the centroids.

Related

FileNotFound exception even though file is in the place during watch service in java

I have a watch service running on a folder. When I modify an existing file while handling eventKind == ENTRY_MODIFY (basically pasting the same file over it without removing the current one), I get a FileNotFoundException (The process cannot access the file because it is being used by another process.)
if (eventKind == StandardWatchEventKinds.ENTRY_MODIFY) {
    String newFileChecksum = null;
    if (eventPath.toFile().exists()) {
        newFileChecksum = getFileChecksum(eventPath.toFile());
    }
    if (fileMapper.containsKey(eventPath)) {
        String existingFileChecksum = fileMapper.get(eventPath);
        if (!existingFileChecksum.equals(newFileChecksum)) {
            fileMapper.replace(eventPath, existingFileChecksum, newFileChecksum);
            log.info("listener.filemodified IN");
            for (DirectoryListener listener : this.listeners) {
                listener.fileModified(this, eventPath);
            }
            log.info("listener.filemodified OUT");
        } else {
            log.info("existing checksum");
            log.debug(String.format(
                    "Checksum for file [%s] has not changed. Skipping plugin processing.",
                    eventPath.getFileName()));
        }
    }
}
In the code above, getFileChecksum() is called from:
    if (eventPath.toFile().exists()) {
        newFileChecksum = getFileChecksum(eventPath.toFile());
    }
So eventPath.toFile().exists() is true and the code enters the if block, but getFileChecksum() then goes to this method:
private synchronized String getFileChecksum(File file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md5Digest = MessageDigest.getInstance("MD5");
    FileInputStream fis = null;
    if (file.exists()) {
        try {
            fis = new FileInputStream(file);
        } catch (Exception e) {
            e.printStackTrace();
        }
    } else {
        log.warn("File not detected.");
    }
    byte[] byteArray = new byte[1024];
    int bytesCount = 0;
    while ((bytesCount = fis.read(byteArray)) != -1) {
        md5Digest.update(byteArray, 0, bytesCount);
    }
    fis.close();
    byte[] bytes = md5Digest.digest();
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0; i < bytes.length; i++) {
        stringBuilder.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
    }
    return stringBuilder.toString();
}
The exception is thrown at fis = new FileInputStream(file); even though the file is present in the folder:
FileNotFoundException (The process cannot access the file because it is being used by another process.)
I created a RandomAccessFile and a channel to release any lock placed on the file, but it did not work. Please suggest what could be happening here.
UPDATE: below is the infinite while loop that I have. What is happening: when I put a file into the folder, one CREATE and two MODIFY events are fired; when I delete the file, one DELETE and one MODIFY are fired. If I then put the same file back into the folder, I get a CREATE, but before the CREATE handling finishes a MODIFY is fired, so the modify path runs instead of the create path.
I worked around this by putting a Thread.sleep(500) between the take() and pollEvents() calls:
WatchKey wk = watchService.take();
Thread.sleep(500);
for (WatchEvent<?> event : wk.pollEvents()) {
But I don't think I can justify the use of sleep here. Please help.
WatchService watchService = null;
WatchKey watchKey = null;
while (!this.canceled && (watchKey == null)) {
    watchService = watchService == null
            ? FileSystems.getDefault().newWatchService() : watchService;
    watchKey = this.directory.register(watchService,
            StandardWatchEventKinds.ENTRY_MODIFY, StandardWatchEventKinds.ENTRY_DELETE,
            StandardWatchEventKinds.ENTRY_CREATE);
}
while (!this.canceled) {
    try {
        WatchKey wk = watchService.take();
        for (WatchEvent<?> event : wk.pollEvents()) {
            Kind<?> eventKind = event.kind();
            System.out.println("Event kind : " + eventKind);
            Path dir = (Path) wk.watchable();
            Path eventPath = (Path) event.context();
            Path fullPath = dir.resolve(eventPath);
            fireEvent(eventKind, fullPath);
        }
        wk.reset();
    }
A better approach: use a while loop on a flag such as isFileReady, and keep retrying until the file can actually be opened:
boolean isFileReady = false;
while (!isFileReady) {
    try (FileInputStream fis = new FileInputStream(file)) {
        isFileReady = true;
    } catch (IOException e) {
        // file not ready yet (still locked by the writer); optionally sleep briefly before retrying
    }
}
This will solve your problem.
The WatchService is verbose and may report multiple ENTRY_MODIFY events for a single save operation - even when another application is only part way through a write, or is writing repeatedly. Your code is probably acting on a modify event while the other app is still writing, and a second ENTRY_MODIFY may be on its way.
A safer strategy for using the WatchService is to collate the events you receive and only act on the changes when there is a pause. Something like this will ensure that you block on the first event but then poll the watch service with a small timeout to see if more changes arrive before you act on the previous set:
WatchService ws = ...
HashSet<Path> modified = new HashSet<>();
while (appIsRunning) {
    int countNow = modified.size();
    WatchKey k = countNow == 0 ? ws.take() : ws.poll(1, TimeUnit.MILLISECONDS);
    if (k != null) {
        // Loop through k.pollEvents() and put modify file path into modified set:
        // DO NOT CALL fireEvent HERE, save the path instead:
        ...
        if (eventKind == ENTRY_MODIFY)
            modified.add(filePath);
        k.reset();   // re-arm the key so it can be signalled again
    }
    // Don't act on changes unless no new events arrived:
    if (countNow == modified.size()) {
        // ACT ON modified list here - the watch service did not report new changes
        for (Path filePath : modified) {
            // call fireEvent HERE:
            fireEvent(filePath);
        }
        // reset the set so the next watch call is take() not poll(1)
        modified.clear();
    }
}
If you are also watching for CREATE and DELETE operations alongside MODIFY, you will have to collate the events and ignore some of the earlier ones, because the last recorded event type can take precedence over a previously recorded type. For example, when calling take() and then poll(1) until nothing new is reported:
Any DELETE then CREATE => you might want to treat it as MODIFY
Any CREATE then MODIFY => you might want to treat it as CREATE
Any CREATE or MODIFY then a DELETE => treat it as DELETE
Your logic would then only act when the value of modified.size() + created.size() + deleted.size() stops changing between runs; a small sketch of this collation follows.
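A minimal sketch of that collation step, assuming the same take()/poll(1) loop as above and three sets named created, modified and deleted (these names are illustrative, not from the original code):
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.util.Set;

// Apply the precedence rules to one incoming event. The three sets are assumed
// to be HashSet<Path> fields that get cleared once the pending changes have been acted on.
void collate(WatchEvent.Kind<?> kind, Path path,
             Set<Path> created, Set<Path> modified, Set<Path> deleted) {
    if (kind == StandardWatchEventKinds.ENTRY_CREATE) {
        if (deleted.remove(path)) {
            modified.add(path);      // DELETE then CREATE => treat as MODIFY
        } else {
            created.add(path);
        }
    } else if (kind == StandardWatchEventKinds.ENTRY_MODIFY) {
        if (!created.contains(path)) {
            modified.add(path);      // CREATE then MODIFY => keep it as a CREATE
        }
    } else if (kind == StandardWatchEventKinds.ENTRY_DELETE) {
        created.remove(path);        // CREATE or MODIFY then DELETE => just a DELETE
        modified.remove(path);
        deleted.add(path);
    }
}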
Let me guess...
A modify event gets fired when you modify a file. To modify the file you most likely use a separate tool such as Notepad, which opens and LOCKS the file.
Your watcher gets an event that the file was modified (right now), but you cannot open it again (which FileInputStream wants to do) because it is already locked.
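Building on the retry idea from the earlier answer, one way to cope with this, sketched under the assumption that waiting a bounded amount of time for the writer is acceptable (the attempt count and sleep are arbitrary illustrative values): keep retrying the open until the other process releases the file, and only open it once its size has stopped changing.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Returns an open stream once the file can actually be read, or null if it is
// still locked or still growing after maxAttempts tries.
FileInputStream openWhenReady(File file, int maxAttempts) throws InterruptedException {
    long lastLength = -1;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        long length = file.length();
        if (length == lastLength) {                 // size stopped changing between checks
            try {
                return new FileInputStream(file);   // still throws while the writer holds the file
            } catch (IOException e) {
                // locked by the other process; fall through and retry
            }
        }
        lastLength = length;
        Thread.sleep(200);                          // small pause between attempts
    }
    return null;
}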

Reading Data From FTP Server in Hadoop/Cascading

I want to read data from an FTP server. I am providing the path of the file, which resides on the FTP server, in the format ftp://Username:Password@host/path.
When I use a plain MapReduce program to read data from the file it works fine. I want to read data from the same file through the Cascading framework. I am using the Hfs tap of the Cascading framework to read data. It throws the following exception:
java.io.IOException: Stream closed
at org.apache.hadoop.fs.ftp.FTPInputStream.close(FTPInputStream.java:98)
at java.io.FilterInputStream.close(Unknown Source)
at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:254)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Below is the Cascading code from which I am reading the files:
public class FTPWithHadoopDemo {
    public static void main(String args[]) {
        Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd@xx.xx.xx.xx//input1");
        Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
        Pipe pipe = new Pipe("First");
        pipe = new Each(pipe, new RegexSplitGenerator("\\s+"));
        pipe = new GroupBy(pipe);
        Pipe tailpipe = new Every(pipe, new Count());
        FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}
I looked in the Hadoop source code for the same exception. I found that in the MapTask class there is a method runOldMapper which deals with the stream, and in the same method there is a finally block where the stream gets closed (in.close()). When I remove that line from the finally block it works fine. Below is the code:
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
        final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
        throws IOException, InterruptedException, ClassNotFoundException {
    InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());
    updateJobWithSplit(job, inputSplit);
    reporter.setInputSplit(inputSplit);
    RecordReader<INKEY, INVALUE> in = isSkipping()
            ? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
            : new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
    job.setBoolean("mapred.skip.on", isSkipping());
    int numReduceTasks = conf.getNumReduceTasks();
    LOG.info("numReduceTasks: " + numReduceTasks);
    MapOutputCollector collector = null;
    if (numReduceTasks > 0) {
        collector = new MapOutputBuffer(umbilical, job, reporter);
    } else {
        collector = new DirectMapOutputCollector(umbilical, job, reporter);
    }
    MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(),
            job);
    try {
        runner.run(in, new OldOutputCollector(collector, conf), reporter);
        collector.flush();
    } finally {
        // close
        in.close(); // close input
        collector.close();
    }
}
Please assist me in solving this problem.
Thanks,
Arshadali
After some effort I found out that Hadoop uses the org.apache.hadoop.fs.ftp.FTPFileSystem class for FTP.
This class doesn't support seek, i.e. seeking to a given offset from the start of the file. Data is read in one block and then the file system seeks to the next block to read. The default block size is 4 KB for FTPFileSystem. Since seek is not supported, it can only read data less than or equal to 4 KB.
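One workaround that sidesteps FTPFileSystem inside the Cascading flow entirely, sketched here as an illustration rather than a tested fix (the paths and credentials are placeholders): copy the file from the FTP server to HDFS with the ordinary Hadoop FileSystem API first, and point the Hfs source tap at the HDFS copy.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Source on the FTP server and a staging location on HDFS (placeholder values).
Path ftpPath = new Path("ftp://user:pwd@xx.xx.xx.xx/input1");
Path hdfsPath = new Path("/tmp/staging/input1");

FileSystem ftpFs = FileSystem.get(URI.create(ftpPath.toString()), conf);
FileSystem hdfs = FileSystem.get(conf);

// Copy the remote file to HDFS without deleting the source (throws IOException on failure).
FileUtil.copy(ftpFs, ftpPath, hdfs, hdfsPath, false, conf);

// Then build the tap against the HDFS copy instead of the ftp:// URI:
// Tap source = new Hfs(new TextLine(new Fields("line")), hdfsPath.toString());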

Mapreduce job to HBase throws IOException: Pass a Delete or a Put

I am trying to write to an HBase table directly from my Mapper, using Hadoop 2.4.0 with HBase 0.94.18 on EMR.
I am getting a nasty IOException: Pass a Delete or a Put when executing the code below.
public class TestHBase {

    static class ImportMapper
            extends Mapper<MyKey, MyValue, ImmutableBytesWritable, Writable> {

        private byte[] family = Bytes.toBytes("f");

        @Override
        public void map(MyKey key, MyValue value, Context context) {
            MyItem item = //do some stuff with key/value and create item
            byte[] rowKey = Bytes.toBytes(item.getKey());
            Put put = new Put(rowKey);
            for (String attr : Arrays.asList("a1", "a2", "a3")) {
                byte[] qualifier = Bytes.toBytes(attr);
                put.add(family, qualifier, Bytes.toBytes(item.get(attr)));
            }
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String input = args[0];
        String table = "table";
        Job job = Job.getInstance(conf, "stuff");
        job.setJarByClass(ImportMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(input));
        TableMapReduceUtil.initTableReducerJob(
                table,  // output table
                null,   // reducer class
                job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Does anyone know what I am doing wrong?
Stacktrace
Error: java.io.IOException: Pass a Delete or a Put
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:125)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:84)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:646)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:775)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
It would be better if you could show the full stack trace, so that I can help you solve it more easily. I haven't executed your code, but as far as I can see, this could be the issue:
job.setNumReduceTasks(0);
The Mapper will be expecting your Put object to be written directly to Apache HBase.
You could increase setNumReduceTasks, or, if you look at the API, you can find its default value and comment that line out.
Thanks for adding the stack trace. Unfortunately you didn't include the code that threw the exception so I can't fully trace it for you. Instead I did a little searching around and discovered a few things for you.
Your stack trace is similar to one in another SO question here:
Pass a Delete or a Put error in hbase mapreduce
That one solved the issue by commenting out job.setNumReduceTasks(0);
There is another similar SO question that hit the same exception but couldn't solve the problem that way; in that case the problem was with annotations:
"java.io.IOException: Pass a Delete or a Put" when reading HDFS and storing HBase
Here are some good examples of how to write working code both with setNumReduceTasks at 0 and at 1 or more.
"51.2. HBase MapReduce Read/Write Example
The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
        sourceTable,      // input table
        scan,             // Scan instance to control CF and attribute selection
        MyMapper.class,   // mapper class
        null,             // mapper output key
        null,             // mapper output value
        job);
TableMapReduceUtil.initTableReducerJob(
        targetTable,      // output table
        null,             // reducer class
        job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
This is the example with one or more reducers:
51.4. HBase MapReduce Summary to HBase Example
The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummary");
job.setJarByClass(MySummaryJob.class);    // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
        sourceTable,        // input table
        scan,               // Scan instance to control CF and attribute selection
        MyMapper.class,     // mapper class
        Text.class,         // mapper output key
        IntWritable.class,  // mapper output value
        job);
TableMapReduceUtil.initTableReducerJob(
        targetTable,            // output table
        MyTableReducer.class,   // reducer class
        job);
job.setNumReduceTasks(1);       // at least one, adjust as required

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
http://hbase.apache.org/book.html#mapreduce.example
You seem to be more closely following the first example. I wanted to show that sometimes there is a reason to set the number of reduce tasks to zero.

How Can I Read The Next Row From A CSV Data Set Config In JMeter?

I am in the process of creating a test plan in JMeter which visits a random number of pages (from 2 to 10), whose URLs are to be fetched from a CSV Data Set. I have created the CSV Data Set and the samplers, which are working fine, except that only one row is read from the Data Set per thread, which is not what I need - I want a new row to be read after the sampler has completed (or before, I'm not fussed).
I saw that this question is very similar and the solution was to use the Raw Data Source Pre-Processor, which does work but requires arduous alterations to the file in question (adding chunk sizes before each line), which is a bit of a pain when the file is about 500 lines long.
Is there a way I can set the CSV Data Set to advance to the next row on each read, or use some post- or pre-processor, such as BeanShell, to do this? I have seen people state that the CSVRead function can do this, but its access is per-thread, which would be no good for me.
As a side note - ultimately all I want to do is access a random line in the file and pass it to an HTTP sampler; if there is an easier or better way to do this, I'm open to suggestions.
For this you can use BeanShell (i.e. Java) code executed from a BeanShell Sampler / BeanShell PostProcessor / BeanShell PreProcessor.
The following code will read all the lines from your file and then select a single random one:
import java.text.*;
import java.io.*;
import java.util.*;

// The BeanShell "Parameters" field is expected to hold "<csv file name>,<csv directory>".
String[] params = Parameters.split(",");
String csvTest = params[0];
String csvDir = params[1];

ArrayList strList = new ArrayList();

try {
    File file = new File(System.getProperty("user.dir") + File.separator + csvDir + File.separator + csvTest);
    if (!file.exists()) {
        throw new Exception("ERROR: file " + csvTest + " not found in " + csvDir + " directory.");
    }

    BufferedReader bufRdr = new BufferedReader(new FileReader(file));
    String line = null;
    while ((line = bufRdr.readLine()) != null) {
        strList.add(line);
    }
    bufRdr.close();

    Random rnd = new java.util.Random();
    vars.put("csvUrl", strList.get(rnd.nextInt(strList.size())));
}
catch (Exception ex) {
    IsSuccess = false;
    log.error(ex.getMessage());
    System.err.println(ex.getMessage());
}
catch (Throwable thex) {
    System.err.println(thex.getMessage());
}
Then you can access the extracted URL via a variable (${csvUrl} in this example).
My only doubt is whether reading the full file on each iteration (if you have to execute this in a loop) is a good solution from a performance point of view.
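If that becomes a concern, one option (a sketch, not part of the original answer; the property name urlLines and the file path are illustrative) is to read the file only once per JVM, cache the lines in a JMeter property shared by all threads, and then just pick a random line on each iteration. props and vars are the standard objects JMeter exposes to BeanShell/JSR223 elements.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Random;

// Load the file only once and cache it in a JMeter property (not synchronized;
// in the worst case a couple of threads load it in parallel, which is harmless).
ArrayList lines = (ArrayList) props.get("urlLines");
if (lines == null) {
    lines = new ArrayList();
    BufferedReader rdr = new BufferedReader(new FileReader("urls.csv"));  // placeholder path
    String line;
    while ((line = rdr.readLine()) != null) {
        lines.add(line);
    }
    rdr.close();
    props.put("urlLines", lines);
}

// Pick a random line for this iteration and expose it as ${csvUrl}.
Random rnd = new Random();
vars.put("csvUrl", (String) lines.get(rnd.nextInt(lines.size())));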

Java Heap Space Error, OutofMemory Exception while writing large data to excel sheet

I am getting a Java heap space error while writing large data from the database to an Excel sheet.
I don't want to use the JVM -Xmx option to increase memory.
Following are the details:
1) I am using the org.apache.poi.hssf API for writing the Excel sheet.
2) JDK version 1.5
3) Tomcat 6.0
The code I have written works well for around 23 thousand records, but it fails for more than 23K records.
Following is the code:
ArrayList l_objAllTBMList = new ArrayList();
l_objAllTBMList = (ArrayList) m_objFreqCvrgDAO.fetchAllTBMUsers(p_strUserTerritoryId);

ArrayList l_objDocList = new ArrayList();
m_objTotalDocDtlsInDVL = new HashMap();

Object l_objTBMRecord[] = null;
Object l_objVstdDocRecord[] = null;
int l_intDocLstSize = 0;
VisitedDoctorsVO l_objVisitedDoctorsVO = null;
int l_tbmListSize = l_objAllTBMList.size();
System.out.println(" getMissedDocDtlsList_NSM ");

for (int i = 0; i < l_tbmListSize; i++)
{
    l_objTBMRecord = (Object[]) l_objAllTBMList.get(i);
    l_objDocList = (ArrayList) m_objGenerateVisitdDocsReportDAO.fetchAllDocDtlsInDVL_NSM((String) l_objTBMRecord[1], p_divCode, (String) l_objTBMRecord[2], p_startDt, p_endDt, p_planType, p_LMSValue, p_CycleId, p_finYrId);
    l_intDocLstSize = l_objDocList.size();

    try {
        l_objVOFactoryForDoctors = new VOFactory(l_intDocLstSize, VisitedDoctorsVO.class);
        /* Factory class written to create and maintain limited no of Value Objects (VOs) */
    } catch (ClassNotFoundException ex) {
        m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:" + ex);
    } catch (InstantiationException ex) {
        m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:" + ex);
    } catch (IllegalAccessException ex) {
        m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:" + ex);
    }

    for (int j = 0; j < l_intDocLstSize; j++)
    {
        l_objVstdDocRecord = (Object[]) l_objDocList.get(j);
        l_objVisitedDoctorsVO = (VisitedDoctorsVO) l_objVOFactoryForDoctors.getVo();

        if (((String) l_objVstdDocRecord[6]).equalsIgnoreCase("-"))
        {
            if (String.valueOf(l_objVstdDocRecord[2]) != "null")
            {
                l_objVisitedDoctorsVO.setPotential_score(String.valueOf(l_objVstdDocRecord[2]));
                l_objVisitedDoctorsVO.setEmpcode((String) l_objTBMRecord[1]);
                l_objVisitedDoctorsVO.setEmpname((String) l_objTBMRecord[0]);
                l_objVisitedDoctorsVO.setDoctorid((String) l_objVstdDocRecord[1]);
                l_objVisitedDoctorsVO.setDr_name((String) l_objVstdDocRecord[4] + " " + (String) l_objVstdDocRecord[5]);
                l_objVisitedDoctorsVO.setDoctor_potential((String) l_objVstdDocRecord[3]);
                l_objVisitedDoctorsVO.setSpeciality((String) l_objVstdDocRecord[7]);
                l_objVisitedDoctorsVO.setActualpractice((String) l_objVstdDocRecord[8]);
                l_objVisitedDoctorsVO.setLastmet("-");
                l_objVisitedDoctorsVO.setPreviousmet("-");
                m_objTotalDocDtlsInDVL.put((String) l_objVstdDocRecord[1], l_objVisitedDoctorsVO);
            }
        }
    } // End of inner for loop

    writeExcelSheet(); // Pasting this method at the end

    // Clean up code
    l_objVOFactoryForDoctors.resetFactory();
    m_objTotalDocDtlsInDVL.clear(); // Clear the used map
    l_objDocList = null;
    l_objTBMRecord = null;
    l_objVstdDocRecord = null;
} // End of outer for loop

l_objAllTBMList = null;
m_objTotalDocDtlsInDVL = null;
-------------------------------------------------------------------
private void writeExcelSheet() throws IOException
{
    HSSFRow l_objRow = null;
    HSSFCell l_objCell = null;
    VisitedDoctorsVO l_objVisitedDoctorsVO = null;
    Iterator l_itrDocMap = m_objTotalDocDtlsInDVL.keySet().iterator();
    while (l_itrDocMap.hasNext())
    {
        Object key = l_itrDocMap.next();
        l_objVisitedDoctorsVO = (VisitedDoctorsVO) m_objTotalDocDtlsInDVL.get(key);

        l_objRow = m_objSheet.createRow(m_iRowCount++);

        l_objCell = l_objRow.createCell(0);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(String.valueOf(l_intSrNo++));

        l_objCell = l_objRow.createCell(1);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getEmpname() + " (" + l_objVisitedDoctorsVO.getEmpcode() + ")"); // TBM Name

        l_objCell = l_objRow.createCell(2);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getDr_name()); // Doc Name

        l_objCell = l_objRow.createCell(3);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getPotential_score()); // Freq potential score

        l_objCell = l_objRow.createCell(4);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getDoctor_potential()); // Freq potential score

        l_objCell = l_objRow.createCell(5);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getSpeciality()); // CP_GP_SPL

        l_objCell = l_objRow.createCell(6);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getActualpractice()); // Actual practise

        l_objCell = l_objRow.createCell(7);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getPreviousmet()); // Lastmet

        l_objCell = l_objRow.createCell(8);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getLastmet()); // Previousmet
    }

    // Write output stream
    try {
        out = new FileOutputStream(m_objFile);
        outBf = new BufferedOutputStream(out);
        m_objWorkBook.write(outBf);
    } catch (Exception ioe) {
        ioe.printStackTrace();
        System.out.println(" Exception in chunk write");
    } finally {
        if (outBf != null) {
            outBf.flush();
            outBf.close();
            out.close();
            l_objRow = null;
            l_objCell = null;
        }
    }
}
Instead of populating the complete list in memory before starting to write to Excel, you need to modify the code so that each object is written to the file as it is read from the database. Take a look at this question to get some idea of the other approach.
Well, I'm not sure whether POI can handle incremental updates, but if so you might want to write chunks of, say, 10000 rows to the file. If not, you might have to use CSV instead (so no formatting) or increase memory.
The problem is that you need to make the objects that have been written to the file eligible for garbage collection (no references from a live thread any more) before writing of the file is finished (before all rows have been generated and written to the file).
Edit:
If you can write smaller chunks of data to the file, you also have to load only the necessary chunks from the db. It doesn't make sense to load 50000 records at once and then try to write 5 chunks of 10000, since those 50000 records are likely to consume a lot of memory already.
As Thomas points out, you have too many objects taking up too much space and need a way to reduce that. There are a couple of strategies for this I can think of:
Do you need to create a new factory each time in the loop, or can you reuse it?
Can you start with a loop that copies only the information you need into a new structure, and then discard the old one?
Can you split the processing into a chain of threads, each sending information forward to the next step, so that you avoid building one large, memory-consuming structure at all?
A streaming-workbook sketch along these lines follows below.
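Sketched under the assumption that a newer POI with the OOXML module can be used (the question targets JDK 1.5 and HSSF, so this may require an upgrade): POI's streaming SXSSFWorkbook keeps only a sliding window of rows on the heap and flushes older rows to a temporary file, so the sheet size is no longer bounded by memory.
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// Keep at most 100 rows in memory; older rows are flushed to disk automatically.
SXSSFWorkbook workbook = new SXSSFWorkbook(100);
Sheet sheet = workbook.createSheet("report");

int rowCount = 0;
// Write each record as soon as it is read from the database instead of
// collecting everything into a HashMap first. fetchNextRecord() is a
// placeholder for whatever paging your DAO provides.
VisitedDoctorsVO vo;
while ((vo = fetchNextRecord()) != null) {
    Row row = sheet.createRow(rowCount++);
    Cell cell = row.createCell(0);
    cell.setCellValue(vo.getDr_name());
    // ... remaining columns as in writeExcelSheet() ...
}

FileOutputStream out = new FileOutputStream("report.xlsx");
workbook.write(out);
out.close();
workbook.dispose();   // deletes the temporary files backing the flushed rows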
