PDFBox - Splitting one single pdf into multiple pdf files

I need to split a large PDF file into many small PDF files. I have a 10,000-page PDF and want to split it into 1,000 files of 10 pages each. I tried this with the PDFBox API. The split works as required, and it runs fine on files with a small number of pages. With 10,000 pages, however, it takes hours. In the real scenario I may get PDF files with more than 20,000 pages that need more than 5,000 splits.
The time taken depends on the number of splits: splitting the same file into 100 files of 100 pages each takes less time. Can anyone please review my code and tell me whether I am doing this the right way, or what I can change to improve performance?
Note: I cannot use iText, since this is a client-specific project. Is there any API other than iText and PDFBox for splitting a PDF file?
Please see my code below.
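import java.io.File;
import java.io.IOException;
import java.util.List;
// PDFBox 2.x package names are assumed here; PDFBox 1.8 keeps Splitter in org.apache.pdfbox.util
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class Test {
    private static String sourceFolderPath = "/local_path/PDFSplitter_perf/10000_pages/";
    private static String outputPath = sourceFolderPath + "output/";
    private static String pdfFileName = sourceFolderPath + "test_1.pdf";
    private static int pageCount = 10;

    public static void main(String[] args) throws IOException {
        splitUsingPDFBox(pdfFileName);
    }

    public static void splitUsingPDFBox(String pdfFilePath) throws IOException {
        try (final PDDocument document = PDDocument.load(new File(pdfFilePath))) {
            int totalPages = document.getNumberOfPages();
            int i = 1;
            while (i <= totalPages) {
                int startPage = i;
                // Clamp the last chunk so it never runs past the end of the document
                int endPage = Math.min(i + (pageCount - 1), totalPages);
                String childPdfFile = outputPath + startPage + "_" + endPage + ".pdf";
                // A new Splitter is created and run over the document for every chunk
                Splitter splitter = new Splitter();
                splitter.setStartPage(startPage);
                splitter.setEndPage(endPage);
                splitter.setSplitAtPage(endPage);
                List<PDDocument> pages = splitter.split(document);
                PDDocument pd = null;
                try {
                    pd = pages.get(0);
                    pd.save(childPdfFile);
                } finally {
                    if (pd != null) {
                        pd.close();
                    }
                }
                // Advance to the next chunk; without this the loop never terminates
                i += pageCount;
            }
        }
    }
}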
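The page-range arithmetic in the loop above can be separated from the PDF handling and checked on its own. The following is a minimal sketch in plain Java with no PDFBox calls; the class name PageRanges and the method chunk are hypothetical names introduced for illustration only. It returns 1-based inclusive ranges and shortens the last range when the page count is not a multiple of the chunk size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes the page ranges for a split.
final class PageRanges {

    /** Returns 1-based inclusive {start, end} pairs covering pages 1..totalPages. */
    static List<int[]> chunk(int totalPages, int pagesPerFile) {
        if (totalPages < 0 || pagesPerFile < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and pagesPerFile >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += pagesPerFile) {
            // The final chunk may be shorter than pagesPerFile
            int end = Math.min(start + pagesPerFile - 1, totalPages);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] range : chunk(25, 10)) {
            System.out.println(range[0] + "_" + range[1] + ".pdf");
        }
    }
}
```

Each pair could then drive the file name and the start/end pages used in the loop. Whether precomputing the ranges changes the running time is not something this sketch shows; it only removes the hard-coded page limit and the manual counter.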

Related

Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge two docx files, each of which has its own bullet numbering. After merging the Word documents, the numbering is updated automatically.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the numbering becomes 1 2 3 4 5 6.
How can I stop this?
I am using the following code:
if (counter == 1)
{
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    //Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Table of contents - End
}
else
{
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);
    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Add Page Break before new content
    main.addObject(objBr);
    //Add new content
    main.addObject(chunk);
}

Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them, by calling main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.

I was able to fix my issue using the following code, which I found at (http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml). You can also generate your own custom code there: they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir") + "/";
public final static String DIR_OUT = System.getProperty("user.dir") + "/";

public static void main(String[] args) throws Exception
{
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
    List blockRanges = new ArrayList();
    for (int i = 0; i < files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add(block);
        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }
    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT + "OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

How to convert an image (rgb/gray) inside a pdf to monochrom/bitonal one using itext and java

I'm writing a Java program to swap images inside a PDF. Because of the way they are generated, the images are stored as high-dpi RGB images, although their content is bitonal/monochrome. I'm using iText 7.1.1, but have also tested the latest dev version (7.1.2 snapshot).
I'm already able to extract the images from the PDF and convert them to PNG or TIFF using indexed colours or gray (0 & 255 only) in ImageMagick (also tested with GIMP).
I modified some code from iText to replace the images inside the PDF. It works for DeviceRGB and DeviceGray images, but not for bitonal ones:
public static Image readPng(String pImageFolder, int pImageNumber) throws IOException {
    String url = "./" + pImageFolder + "/" + pImageNumber + ".png";
    File ifile = new File(url);
    if (ifile.exists() && ifile.isFile()) {
        return new Image(ImageDataFactory.create(url));
    } else {
        return null;
    }
}

public static void replaceStream(PdfStream orig, PdfStream stream) throws IOException {
    orig.clear();
    orig.setData(stream.getBytes());
    for (PdfName name : stream.keySet()) {
        orig.put(name, stream.get(name));
    }
}

public static void replaceImages(String pFilename, String pImagefolder, String pOutputFilename) throws IOException {
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(pFilename), new PdfWriter(pOutputFilename));
    for (int i = 0; i < pdfDoc.getNumberOfPages(); i++) {
        PdfDictionary page = pdfDoc.getPage(i + 1).getPdfObject();
        PdfDictionary resources = page.getAsDictionary(PdfName.Resources);
        PdfDictionary xobjects = resources.getAsDictionary(PdfName.XObject);
        Iterator<PdfName> iter = xobjects.keySet().iterator();
        PdfName imgRef;
        PdfStream stream;
        Image img;
        int number;
        while (iter.hasNext()) {
            imgRef = iter.next();
            number = xobjects.get(imgRef).getIndirectReference().getObjNumber();
            stream = xobjects.getAsStream(imgRef);
            img = readPng(pImagefolder, number);
            if (img != null) {
                replaceStream(stream, img.getXObject().getPdfObject());
            }
        }
    }
    pdfDoc.close();
}
If I convert the images to TIFF and use them as replacements, the images inside the PDF are dark (all pixels are black). If I try to use PNG images, they are not shown, and pdfimages complains "Unknown compression method in flate stream".

FYI:
There was an error in my replaceStream: getBytes() decodes (inflates) the PdfStream by default. All stream attributes were copied as well, so the Filter entry still said that FlateDecode was necessary, although the copied data was no longer compressed.
I had to tell getBytes() not to decode by setting the decoded parameter to false: getBytes(false)
public static void replaceStream(PdfStream orig, PdfStream stream) throws IOException {
    orig.clear();
    orig.setData(stream.getBytes(false));
    for (PdfName name : stream.keySet()) {
        orig.put(name, stream.get(name));
    }
}
Now everything works fine, except:
Bitonal images are not stored as CCITT4, which they should be. (This doesn't matter here, because they are converted to JBIG2.)
Acrobat reports an error for the images, but every other viewer displays them just fine. There seems to be an error in the ColorSpace information: it should be DeviceGray, but is CalGray with some Gamma information and a missing WhitePoint. Changing it to DeviceGray by hand makes it work. A workaround is to strip gAMA and cHRM from the PNG.
Both are conversion errors in iText7:
CCITT4: PNGImageHelper line 254 should be RawImageHelper.updateRawImageParameters(png.image, png.width, png.height, components, bpc, png.idat.toByteArray(), null); to trigger conversion.
WhitePoint is correctly read from the file and stored inside the ImageData-Class, but is discarded inside PdfImageXObject -> createPdfStream.
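The mismatch behind the replaceStream fix can be reproduced without iText. The sketch below uses only java.util.zip and is an illustration under assumptions, not iText code: the class name FlateMismatchDemo and its two methods are hypothetical. It shows that a reader which is told the data is Flate-compressed fails when handed bytes that were already decoded, which is the situation created by copying decoded bytes together with the original Filter entry.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows why decoded bytes must not be labelled as Flate data.
final class FlateMismatchDemo {

    /** Compresses raw bytes into a zlib (Flate) stream. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns true if data is a complete, valid zlib stream. */
    static boolean inflatesCleanly(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // truncated input or not a self-contained stream
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // the bytes are not Flate data at all
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] pixels = new byte[64];          // stand-in for decoded image data
        byte[] compressed = deflate(pixels);   // what a FlateDecode filter expects
        System.out.println("compressed bytes inflate: " + inflatesCleanly(compressed));
        System.out.println("decoded bytes inflate:    " + inflatesCleanly(pixels));
    }
}
```

Passing false to getBytes keeps the stored bytes and the Filter entry consistent with each other, which corresponds to the first case in this demo.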

Printing multiple word documents corrupts every 2nd file

I'm saving (generated) Word documents as files via JACOB by printing them to a file. (I have to do it this way because the file is required by legacy programs.)
The problem is that when I do this for multiple files, every second file is not written correctly.
The first file is OK.
The 2nd is only about 80% written.
The 3rd is OK.
The 4th is the same as the 2nd (exactly the same file size as the 2nd)
... and so on.
This is my code:
public static void main(String args[]) {
    Variant background = new Variant(false);
    Variant append = new Variant(false);
    Variant range = new Variant(0); //wdPrintAllDocument
    ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
    Variant var = Dispatch.get(oleComponent, "Documents");
    Dispatch document = var.getDispatch();
    Dispatch doc = Dispatch.call(document, "Open", "c:/temp/Test.rtf").toDispatch();
    for (int i = 0; i < 10; i++) {
        Dispatch.call(doc, "PrintOut", background, append, range, new Variant("c:/temp/test" + i));
    }
    Dispatch.call(doc, "Close", 0);
    Dispatch.call(oleComponent, "Quit");
}
The problem appears with several different printers (the PDF printer, for example, works).
Why every 2nd file? Is it a Word problem, or a printer (driver) problem?
Help is very much appreciated.

To search a particular file in PDF document using Java

Hi, I have a PDF file and I need to search for a particular string in it. I tried various methods, and I am able to read all the contents of the PDF file, but I am unable to find a particular string.
In this file, I need to search for strings such as Telephone, Garbage, Rent etc. individually.
Could you please help me?
I have the code below for reading the file.
public class PDFBoxReader {
    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc;
    private COSDocument cosDoc;
    private String Text;
    private String filePath;
    private File file;

    public PDFBoxReader() {
    }

    public String ToText() throws IOException
    {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;
        file = new File("D:\\report.pdf");
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(10);
        // reading text from page 1 to 10
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }
}
It would be great if someone could help me with a code that searches for a particular string. Thanks in advance.

Try String.indexOf("substring"), called on the String returned from your ToText() method, where substring is the string you wish to search for. (Side note: the convention in Java is camel-case method names, which would be toText() in this case.)
This method finds the index of the first occurrence of the given substring in your long String of text, or returns -1 if there is none. So you could call indexOf("Telephone") to find the first occurrence of the word Telephone in your String.
If you want the text directly after that substring, the index is simply String.indexOf("substring") + "substring".length()
You can also find the next occurrence (or the one after that) with another variant of this method: String.indexOf("substring", indexOfLastOccurrence + "substring".length())
Example:
String myPDF = new PDFBoxReader().ToText();
int found = myPDF.indexOf("Rent"); //Find 1st occurrence of "Rent", or -1 if there is none
if (found >= 0) {
    int rentIndex = found + "Rent".length();
    //Get a few characters after "Rent". (I assume you only want a few numbers afterwards; the length 10 is an arbitrary example value.)
    String rent = myPDF.substring(rentIndex, Math.min(rentIndex + 10, myPDF.length()));
    //process rent e.g. Integer.parseInt(rent.trim()) or something
    found = myPDF.indexOf("Rent", rentIndex); //Next occurrence of "Rent"
}
//Repeat to find the next occurrence, and the one after that. (Until found gets set to a negative value, indicating that no more occurrences exist.)
Both methods can be found in the Java API:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String)
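The repeated indexOf calls described above can be wrapped in a loop that collects every occurrence. This is a minimal sketch in plain Java; the class name TextSearch, the method findAll and the sample text are hypothetical and stand in for the String returned by ToText().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the start index of every occurrence of a search term.
final class TextSearch {

    /** Returns the start index of each non-overlapping occurrence of term in text. */
    static List<Integer> findAll(String text, String term) {
        List<Integer> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits; // an empty term would match at every position
        }
        int from = 0;
        while (true) {
            int at = text.indexOf(term, from);
            if (at < 0) {
                break; // -1 means there are no more occurrences
            }
            hits.add(at);
            from = at + term.length(); // continue searching after this match
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample text standing in for the extracted PDF content
        String text = "Rent 1200 Telephone 45 Rent 1250";
        System.out.println(findAll(text, "Rent"));    // prints [0, 23]
        System.out.println(findAll(text, "Garbage")); // prints []
    }
}
```

The search is case-sensitive, as indexOf is. If the PDF text may use different capitalisation, one option is to lower-case both the text and the term before searching.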

How to get the page count of a microsoft word document in java?

For a server-based J2EE application, I need to retrieve the number of pages from Word documents. Any ideas what works?

If the documents are modern Word 2007 format you can use direct XML-based manipulation, through OOXML. This is by far the better long term solution, though I realize it may not be realistic for an entire organization to change overnight.
If they are older Word formats, you're probably stuck with the server-side Word/Excel/PowerPoint/Outlook programmable object models, although you're not supposed to use those on a server.
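One way to read the OOXML route suggested above, sketched with the JDK only: a .docx file is a ZIP package, and Word normally records a page count in the Pages element of the docProps/app.xml part. That value is a cached figure written by whichever application last saved the file, so it can be missing or out of date. The class name DocxPageCount and the method readCachedPageCount are hypothetical, and the regular expression is a simplification that assumes the element is written as plain text content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads the cached page count from a .docx package.
final class DocxPageCount {

    // Matches <Pages>12</Pages>, with or without a namespace prefix
    private static final Pattern PAGES =
            Pattern.compile("<(?:\\w+:)?Pages>(\\d+)</(?:\\w+:)?Pages>");

    /** Returns the cached page count, or -1 if the package does not record one. */
    static int readCachedPageCount(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/app.xml".equals(entry.getName())) {
                    continue; // skip every other part of the package
                }
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    xml.write(buf, 0, n);
                }
                String content = new String(xml.toByteArray(), StandardCharsets.UTF_8);
                Matcher m = PAGES.matcher(content);
                return m.find() ? Integer.parseInt(m.group(1)) : -1;
            }
        }
        return -1; // no docProps/app.xml part found
    }
}
```

A caller could pass a FileInputStream for the document. Because the number is only cached metadata, a document generated by a tool that never laid out the pages may report no value or a stale one; an exact count needs a library that performs layout.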

Regarding Office Open XML support, the latest beta of Java-POI is supposed to support it.

Haven't used it before but you could try Apache POI. Looks like it has a WordCount function.

//Open the Word Document
Document doc = new Document("C:\\Temp\\file.doc");
//Get page count
int pageCount = doc.getPageCount();

To read the page count of MS Office files you can use the Aspose libraries (aspose-words, aspose-cells, aspose-slides).
Examples:
Excel:
number of pages of the printable version of the Workbook:
import com.aspose.cells.*;

public int getPageCount(String filePath) throws Exception {
    Workbook book = new Workbook(filePath);
    ImageOrPrintOptions imageOrPrintOptions = new ImageOrPrintOptions();
    // Default 0 Prints all pages.
    // IgnoreBlank 1 Don't print the pages which the cells are blank.
    // IgnoreStyle 2 Don't print the pages which cells only contain styles.
    imageOrPrintOptions.setPrintingPage(PrintingPageType.IGNORE_STYLE);
    int pageCount = 0;
    for (int i = 0; i < book.getWorksheets().getCount(); i++) {
        Worksheet sheet = book.getWorksheets().get(i);
        PageSetup pageSetup = sheet.getPageSetup();
        pageSetup.setOrientation(PageOrientationType.PORTRAIT);
        pageSetup.setPaperSize(PaperSizeType.PAPER_LETTER);
        pageSetup.setTopMarginInch(1);
        pageSetup.setBottomMarginInch(1);
        pageSetup.setRightMarginInch(1);
        pageSetup.setLeftMarginInch(1);
        SheetRender sheetRender = new SheetRender(sheet, imageOrPrintOptions);
        int sheetPageCount = sheetRender.getPageCount();
        pageCount += sheetPageCount;
    }
    return pageCount;
}
Word: number of pages:
import com.aspose.words.Document;

public int getPageCount(String filePath) throws Exception {
    Document document = new Document(filePath);
    return document.getPageCount();
}
PowerPoint: number of slides:
import com.aspose.slides.*;

public int getPageCount(String filePath) throws Exception {
    Presentation presentation = new Presentation(filePath);
    return presentation.getSlides().toArray().length;
}
