How can I merge more than 100k PDF files (each PDF around 160 KB) into one PDF file?
Tutorial
I already read this tutorial. That code works for a few PDFs, but when I tried it with 10k PDF files I got this error: "java.lang.OutOfMemoryError: GC overhead limit exceeded".
I already tried using -Xmx or -Xms; the error then becomes "java.lang.OutOfMemoryError: Java heap space".
I am also calling pdf.flushCopiedObjects(firstSourcePdf);, but it doesn't help. Or maybe I am using it incorrectly?
File file = new File(pathName);
File[] listFile = file.listFiles();
if (listFile == null) {
    throw new Exception("File not Found at " + pathName);
}
Arrays.sort(listFile);

PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
        PdfAConformanceLevel.PDF_A_1A,
        new PdfOutputIntent("Custom", "", "http://www.color.org",
                "sRGB IEC61966-2.1", null));

// Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
        new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-1a example");

// Create PdfMerger instance
PdfMerger merger = new PdfMerger(pdf);

// Merge the pages of each source document into the target
for (File filePdf : listFile) {
    System.out.println("filePdf = " + filePdf.getName());
    PdfDocument firstSourcePdf = new PdfDocument(new PdfReader(filePdf));
    merger.merge(firstSourcePdf, 1, firstSourcePdf.getNumberOfPages());
    pdf.flushCopiedObjects(firstSourcePdf);
    firstSourcePdf.close();
}
pdf.close();
Thank You
This is a known issue when merging a large number of PDF documents (or large PDFs).
iText tries to make the resulting PDF as small as possible by reusing objects. For instance, if you have an image that occurs multiple times, instead of embedding that image every time, it embeds it once and simply uses a reference for the other occurrences.
That means iText has to keep all objects in memory, because there is no way of knowing beforehand whether an object will get reused.
A solution that usually helps is splitting the process into batches.
Instead of merging 1000 files into 1, try merging the 1000 files in pairs (resulting in 500 documents), then merging each of those in pairs (resulting in 250 documents), and so on.
That allows iText to flush its buffers regularly, which should stop the memory overhead from crashing the VM.
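A rough sketch of that batching idea, reusing the iText 7 classes from the question. The batch size, the folder and temp-file names, and the omission of the PDF/A setup are assumptions for illustration, not part of the original answer:

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.utils.PdfMerger;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchMerge {

    // Merge one batch of source files into a single intermediate PDF.
    static void mergeBatch(List<File> batch, File target) throws IOException {
        try (PdfDocument out = new PdfDocument(new PdfWriter(target.getAbsolutePath()))) {
            PdfMerger merger = new PdfMerger(out);
            for (File f : batch) {
                try (PdfDocument src = new PdfDocument(new PdfReader(f.getAbsolutePath()))) {
                    merger.merge(src, 1, src.getNumberOfPages());
                    out.flushCopiedObjects(src); // release the objects copied from this source
                }
            }
        }
    }

    // Repeatedly merge in batches until only one file remains.
    public static void main(String[] args) throws IOException {
        final int batchSize = 500; // tune to the available heap
        File[] sources = new File("input").listFiles(); // hypothetical source folder
        Arrays.sort(sources);
        List<File> current = new ArrayList<>(Arrays.asList(sources));
        int round = 0;
        while (current.size() > 1) {
            List<File> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += batchSize) {
                File target = new File("tmp_round" + round + "_" + (i / batchSize) + ".pdf");
                mergeBatch(current.subList(i, Math.min(i + batchSize, current.size())), target);
                next.add(target);
            }
            current = next;
            round++;
        }
        System.out.println("Final merged file: " + current.get(0));
    }
}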
If it doesn't have to be iText, you could try using a command line application that supports merging of files. PDFtk, QPDF and HexaPDF CLI (note: I'm the author of HexaPDF) are some CLI tools that support basic PDF file merging.
Related
The problem in my title is very similar to a question previously raised here (Azure storage: Uploaded files with size zero bytes), but that one was for .NET. The context for my Java scenario is that I am uploading small CSV files on a daily basis (less than 5 KB per file). In addition, my code uses the latest version of the Azure API, in contrast to the 2010 version used in the other question.
I couldn't figure out what I have missed. The other alternative is to do it in File Storage, but of course the blob approach was recommended by a few of my peers.
So far, I have mostly based my code on uploading a file as a block blob, following the sample shown on the Azure Samples GitHub page (https://github.com/Azure-Samples/storage-blob-java-getting-started/blob/master/src/BlobBasics.java). I have already done the container setup and file renaming steps, which aren't a problem, but after uploading, the file in the blob storage container on my Azure domain shows a size of 0 bytes.
I've also tried converting the file into a FileInputStream and uploading it as a stream, but it still behaves the same way.
fileName = event.getFilename(); // fileName is e.g. eod1234.csv
String tempdir = System.getProperty("java.io.tmpdir");
file = new File(tempdir + File.separator + fileName);
try {
    PipedOutputStream pos = new PipedOutputStream();
    stream = new PipedInputStream(pos);
    buffer = new byte[stream.available()];
    stream.read(buffer);

    FileInputStream fils = new FileInputStream(file);
    int content = 0;
    while ((content = fils.read()) != -1) {
        System.out.println((char) content);
    }

    // OutputStream was written as a test previously but didn't work
    OutputStream outStream = new FileOutputStream(file);
    outStream.write(buffer);
    outStream.close();

    // container name is "testing1"
    CloudBlockBlob blob = container.getBlockBlobReference(fileName);
    if (fileName.length() > 0) {
        blob.upload(fils, file.length()); // this is testing with FileInputStream
        blob.uploadFromFile(fileName);    // preferred, just upload from file
    }
} catch (Exception e) {
    e.printStackTrace();
}
No error messages are shown; the file just reaches blob storage and shows a size of 0 bytes. It's a one-way process that only uploads CSV files. In the blob container, each uploaded file should show a size of 1-5 KB.
Instead of blob.uploadFromFile(fileName); you should use blob.uploadFromFile(file.getAbsolutePath()); because the uploadFromFile method requires an absolute path. And you don't need blob.upload(fils, file.length());.
Refer to Microsoft Docs: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-java#upload-blobs-to-the-container
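For reference, a minimal sketch of the corrected upload step, assuming container already points at the "testing1" container and file is the local CSV; the helper method itself is just for illustration:

import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;

public class CsvUploadSketch {
    // Upload the local CSV directly from its absolute path; no extra stream-based upload is needed.
    static void uploadCsv(CloudBlobContainer container, File file)
            throws URISyntaxException, StorageException, IOException {
        CloudBlockBlob blob = container.getBlockBlobReference(file.getName());
        blob.uploadFromFile(file.getAbsolutePath());
    }
}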
The Azure team replied to the same query I sent them by mail, and I have confirmed that the problem was not in the API but in the Upload component in Vaadin, which behaves differently from the usual (https://vaadin.com/blog/uploads-and-downloads-inputs-and-outputs). Either the CloudBlockBlob or the BlobContainerUrl approach works.
The out-of-the-box Upload component requires you to implement the FileOutputStream to a temporary object manually, unlike the usual servlet object seen everywhere. Since there was limited time, I used one of their add-ons, EasyUpload, because it has the Viritin UploadFileHandler incorporated into it, instead of figuring out how to stream the object from scratch. Had there been more time, I would definitely have tried out the MultiFileUpload add-on, which has additional interesting features, in my sandbox workspace.
I had this same problem working with .png files (copied from multipart files). I was doing this:
File file = new File(multipartFile.getOriginalFilename());
and the blobs on Azure were 0 bytes, but when I changed it to this:
File file = new File("C://uploads//"+multipartFile.getOriginalFilename());
it started saving the files properly.
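One possible reading is that the relative path did not point at a file that actually contained the uploaded bytes. A minimal sketch of the full flow, assuming Spring's MultipartFile; the transferTo call and the upload path are my assumptions, not something stated in the original answer:

import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import java.io.File;
import org.springframework.web.multipart.MultipartFile;

public class PngUploadSketch {
    // Write the multipart content to a real file first, then upload it by absolute path.
    static void upload(CloudBlobContainer container, MultipartFile multipartFile) throws Exception {
        File file = new File("C://uploads//" + multipartFile.getOriginalFilename());
        multipartFile.transferTo(file); // assumption: this is how the "copy from multipart" is done
        CloudBlockBlob blob = container.getBlockBlobReference(file.getName());
        blob.uploadFromFile(file.getAbsolutePath());
    }
}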
I have millions of XML files per day. The XML files range in size from 10 KB to 50 MB.
I have written a SAX parser to parse the XML files and write the output into text files.
I am creating 35 unique text files from all of these millions of XML files.
I have to parse the XML files on a first-come, first-served basis so that the order of the records is maintained.
I have to process the files very quickly.
The total size of the XML files will be approximately 1 TB.
I have not implemented multithreading to process the XML files because I have to process them on a first-come, first-served basis.
How can I process all the XML files quickly?
Before moving my code into production, I just wanted to check whether I need to rethink my implementation.
This is how I read and process the XML files:
public static void main(String[] args) {
    File folder = new File("c://temp//SDIFILES");
    File[] files = folder.listFiles();

    Arrays.sort(files, new Comparator<Object>() {
        public int compare(Object o1, Object o2) {
            if (((File) o1).lastModified() > ((File) o2).lastModified()) {
                return -1;
            } else if (((File) o1).lastModified() < ((File) o2).lastModified()) {
                return +1;
            } else {
                return 0;
            }
        }
    });

    for (File file : files) {
        System.out.println("Started Processing file :" + Arrays.asList(file));
        new MySaxParser(file);
    }
}
I am not sure my processing will work for millions of XML files.
As you said, you have to process the files on a first-come, first-served basis.
You can treat each XML file as a task of its own (think of it as a Java method) and then use multiple threads to process the XML files. I think you can save a lot of time this way.
Immediately:
return Long.compare(((File) o1).lastModified(), ((File) o2).lastModified());
read and write with buffering (see the sketch after the comparator code below)
be careful with String operations
skip validation
for DTDs use XML catalogs
use a profiler! (it saved me once with Excel generation)
if possible use a database instead of 35 output files
consider a RAM disk or similar
and of course plenty of memory (-Xmx)
The last resort, an XML pull parser (StAX) instead of Xalan/Xerces, or plain-text parsing, is what you are trying to avoid, so no comment on that.
Arrays.sort(files, new Comparator<File>() {
    @Override
    public int compare(File o1, File o2) {
        return Long.compare(o1.lastModified(), o2.lastModified());
    }
});
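And for the buffered read/write point, a minimal sketch of writing one of the 35 output files through a single buffered writer; the file name and record list are placeholders:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class BufferedOutput {
    // Write all records for one output file through a single buffered writer,
    // instead of opening and closing the file for every record.
    static void writeRecords(String targetFile, List<String> records) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get(targetFile), StandardCharsets.UTF_8)) {
            for (String record : records) {
                out.write(record);
                out.newLine();
            }
        }
    }
}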
There are a number of things to consider...
Is it a batch process, where all the files are already in the c://temp//SDIFILES folder, or is it a kind of event listener that waits for the next file to appear there?
Do you have XSD schemas for all those XMLs? If so, you may think about using a JAXB unmarshaller upfront instead of a custom SAX parser.
IMHO, at first glance...
If it is a batch process: separate the parsing from combining the results into text files. Then you can apply multithreading to the parsing, using temp/stage files or objects before putting the results into the text files.
i.e.:
run as many parsing threads as your resources allow (memory/CPU)
place each parser result aside in a temp file/DB/in-memory map etc., together with its order number or timestamp
combine the ready results into the text files as the last step of the whole process. That way you do not have to wait for the previous XML file to finish parsing before starting on the next one.
If it is a listener, it can also use multithreading to parse, but a little more may be needed. For example, run the step that combines results into text files periodically (say, every 10 seconds), picking up the temp result files marked as ready.
In both cases it will be a "portioned" process.
Let's say you run 5 parsing threads for the next 5 files from the list sorted by timestamp, then wait until all 5 parsing threads have completed (the result does not have to be a temp file; it can stay in memory if possible), then combine into the text files... then pick the next 5 files, and so on (see the sketch below).
... something like that...
A sequential process over that large a number of files will definitely take time, and most of it will be spent parsing the XML.
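A rough sketch of that portioned approach. The parseToText and appendToOutput helpers are placeholders standing in for the existing SAX parsing and the text-file writing; the batch size is an assumption:

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PortionedParsing {

    static final int BATCH_SIZE = 5; // assumption: tune to available CPU and memory

    public static void main(String[] args) throws Exception {
        File[] files = new File("c://temp//SDIFILES").listFiles();
        Arrays.sort(files, Comparator.comparingLong(File::lastModified)); // oldest first

        ExecutorService pool = Executors.newFixedThreadPool(BATCH_SIZE);
        List<File> sorted = Arrays.asList(files);

        for (int i = 0; i < sorted.size(); i += BATCH_SIZE) {
            // Parse one batch in parallel...
            List<Future<String>> batch = new ArrayList<>();
            for (File f : sorted.subList(i, Math.min(i + BATCH_SIZE, sorted.size()))) {
                batch.add(pool.submit(() -> parseToText(f)));
            }
            // ...but combine in submission order, preserving first-come, first-served.
            for (Future<String> result : batch) {
                appendToOutput(result.get());
            }
        }
        pool.shutdown();
    }

    // Placeholder: run the existing SAX parsing for one file and return its extracted records.
    static String parseToText(File file) {
        return ""; // e.g. a MySaxParser variant that returns its output instead of writing it
    }

    // Placeholder: append parsed records to the appropriate output text file(s).
    static void appendToOutput(String records) {
    }
}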
In PDFBox 1.8.xx, merging documents with mergePdf.mergeDocuments() works fine. Now PDFBox version 2.0.0 takes an argument: org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(MemoryUsageSetting arg0).
What is MemoryUsageSetting, and how do I use it with mergeDocuments? The documentation just says "Merge the list of source documents, saving the result in the destination file." Kindly provide some code equivalent for version 2.0.0.
public void combine() {
    try {
        PDFMergerUtility mergePdf = new PDFMergerUtility();
        String folder = "pdf";
        File _folder = new File(folder);
        File[] filesInFolder = _folder.listFiles();
        for (File file : filesInFolder) {
            mergePdf.addSource(file);
        }
        mergePdf.setDestinationFileName("Combined.pdf");
        mergePdf.mergeDocuments();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
According to the javadoc, MemoryUsageSetting controls how memory/temporary files are used for buffering.
The two easiest usages are:
MemoryUsageSetting.setupMainMemoryOnly()
this sets buffering memory usage to only use main-memory (no temporary file) which is not restricted in size.
MemoryUsageSetting.setupTempFileOnly()
this sets buffering memory usage to only use temporary file(s) (no main-memory) which is not restricted in size.
So for you, the call would be
mergePdf.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
or
mergePdf.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
Or just pass null; this defaults to main memory only. That's also what the javadoc says: "memUsageSetting defines how memory is used for buffering PDF streams; in case of null unrestricted main memory is used."
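Putting that together with the combine() method from the question, a 2.0.0 version might look like this; using the temp-file setting here is just one choice, setupMainMemoryOnly() or null works the same way:

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class Combine {
    public void combine() {
        try {
            PDFMergerUtility mergePdf = new PDFMergerUtility();
            File folder = new File("pdf");
            for (File file : folder.listFiles()) {
                mergePdf.addSource(file);
            }
            mergePdf.setDestinationFileName("Combined.pdf");
            // Buffer through temporary files so large merges do not exhaust the heap.
            mergePdf.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}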
I am working on an App Engine application whose content we want to make available to offline users. This means we need to get all the blobstore files it uses and save them off for the offline user. I am doing this on the server side so that it is only done once, and not for every end user. I am using the task queue to run this process, as it can easily time out. Assume all this code is running as a task.
Small collections work fine, but larger collections result in App Engine error 202 and the task restarts again and again. Here is the sample code, which comes from a combination of Writing Zip Files to GAE Blobstore and following the advice for large zip files in Google Appengine JAVA - Zip lots of images saving in Blobstore by reopening the channel as needed. Also referenced AppEngine Error Code 202 - Task Queue for the error.
// Set up the zip file that will be saved to the blobstore
AppEngineFile assetFile = fileService.createNewBlobFile("application/zip", assetsZipName);
FileWriteChannel writeChannel = fileService.openWriteChannel(assetFile, true);
ZipOutputStream assetsZip = new ZipOutputStream(new BufferedOutputStream(Channels.newOutputStream(writeChannel)));

HashSet<String> blobsEntries = getAllBlobEntries(); // gets the blobs that I need
saveBlobAssetsToZip(blobsEntries);

writeChannel.closeFinally();
.....

private void saveBlobAssetsToZip(HashSet<String> blobsEntries) throws IOException {
    for (String blobId : blobsEntries) {
        /* Gets the blobstore key that will result in the blobstore entry - ignore the bsmd, as
           that is internal to our wrapper for blobstore. */
        BlobKey blobKey = new BlobKey(bsmd.getBlobId());

        // Gets the blob file as a byte array
        byte[] blobData = blobstoreService.fetchData(blobKey, 0, BlobstoreService.MAX_BLOB_FETCH_SIZE - 1);

        String extension = ...; // type of file saved from our metadata (e.g. .jpg, .png, .pdf)
        assetsZip.putNextEntry(new ZipEntry(blobId + "." + extension));
        assetsZip.write(blobData);
        assetsZip.closeEntry();
        assetsZip.flush();

        /* I have found that if I don't close the channel and reopen it, I can get an IOException
           because the files in the blobstore are too large; thus write a file, then close and reopen. */
        assetsZip.close();
        writeChannel.close();
        String assetsPath = assetFile.getFullPath();
        assetFile = new AppEngineFile(assetsPath);
        writeChannel = fileService.openWriteChannel(assetFile, true);
        assetsZip = new ZipOutputStream(new BufferedOutputStream(Channels.newOutputStream(writeChannel)));
    }
}
What is the proper way to get this to run on App Engine? Again, small projects work fine and the zip saves, but larger projects with more blob files result in this error.
I bet that the instance is running out of memory. Are you using appstats? It can consume a large amount of memory. If that doesn't work you will probably need to increase the instance size.
I'm trying to export some files from a system and save them to my drive. The problem is that some files are pretty big and I get the Java out-of-memory error.
FileOutputStream fileoutstream = new FileOutputStream(filenameExtension);
fileoutstream.write(dataManagement.getContent(0).getData());
fileoutstream.flush();
fileoutstream.close();
Any recommendations I can try? I added the flush, but there is no difference. This calls the export method, generates the file, and saves it. I'm using a cursor to run over the data I'm exporting, not an array. I tried adding more memory, but the files are too big.
You are loading the whole file into memory before writing it. Instead, you should:
load only a chunk of data
write it
repeat the steps above until you have processed all the data (see the sketch after this list).
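A minimal sketch of that chunked loop, assuming the data can be read through an InputStream; the file paths and chunk size are placeholders:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedExport {
    // Copy from source to destination in fixed-size chunks so the whole file
    // never has to fit in memory at once.
    static void export(String sourcePath, String destPath) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(sourcePath));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(destPath))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n); // write only the bytes actually read
            }
        }
    }
}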
If the files are really big, you may need to read/write them in chunks. If the files are small enough to fit in memory, you can instead increase the size of the virtual machine's memory, e.g.:
java -Xmx512M ...
FileInputStream fi = infile;
FileOutputStream fo = outfile;
byte[] buffer = new byte[5000];
int n;
while ((n = fi.read(buffer)) > 0) {
    fo.write(buffer, 0, n);
}
Hope this helps to get the idea.
You can use the Spring Batch framework to read and write the file in chunks.
http://static.springsource.org/spring-batch/