Need to split docx file based on string using docx4j Java?

I am new to docx4j and need help splitting a docx file at a given string using docx4j in Java, so that the output is written to multiple files.
I tried the same with Apache POI and got the output. However, when I converted the result to HTML, styles were missing. I added the styles afterwards and still face the same issue.
Below is the code using Apache POI:
public static int pos = 0;
public static int posc = 0;
public static String ind = "n";
final static int DEFAULT_FONT_SIZE = 10;
public static void main(String[] args) throws FileNotFoundException,
IOException, XmlException {
File file = null;
File outfilep = null;
File outfilec = null;
File dir = new File(PropertyUtils.getProperty("INPUT_DIR"));
String[] files = dir.list();
if (files.length == 0) {
System.out.println("The directory is empty");
} else {
for (String aFile : files) {
System.out.println(aFile);
file = new File(PropertyUtils.getProperty("INPUT_DIR") + aFile
+ "/" + aFile + ".docx");
outfilep = new File(PropertyUtils.getProperty("INPUT_DIR")
+ aFile + "/" + aFile + "-Product.docx");
outfilec = new File(PropertyUtils.getProperty("INPUT_DIR")
+ aFile + "/" + aFile + "-Component.docx");
// Write Source file
}
}
XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
XWPFDocument destDoc = new XWPFDocument();
copyLayout(doc, destDoc);
XWPFDocument destDocc = new XWPFDocument();
OutputStream out = new FileOutputStream(outfilep);
OutputStream outc = new FileOutputStream(outfilec);
for (IBodyElement bodyElement : doc.getBodyElements()) {
BodyElementType elementType = bodyElement.getElementType();
if (elementType.name().equals("PARAGRAPH")) {
XWPFParagraph pr = (XWPFParagraph) bodyElement;
if (pr.getText().contains("CONSTRUCTION DETAILS:"))
{
ind = "y";
System.out.println("ind is Y++++++++++++");
}
if ("n".equals(ind)) // compare string contents, not object references
{
copyStyle(doc, destDoc,
doc.getStyles().getStyle(pr.getStyleID()));
XWPFParagraph dstPr = destDoc.createParagraph();
dstPr.createRun();
pos = destDoc.getParagraphs().size() - 1;
CTPPr ppr = pr.getCTP().getPPr();
if (ppr == null) ppr = pr.getCTP().addNewPPr();
CTSpacing spacing = ppr.isSetSpacing()? ppr.getSpacing() : ppr.addNewSpacing();
spacing.setAfter(BigInteger.valueOf(0));
spacing.setBefore(BigInteger.valueOf(0));
spacing.setLineRule(STLineSpacingRule.AUTO);
spacing.setLine(BigInteger.valueOf(240));
destDoc.setParagraph(pr, pos);
// System.out.println("prod "
// + destDoc.getParagraphArray(pos).getParagraphText());
}
else {
copyStyle(doc, destDocc,
doc.getStyles().getStyle(pr.getStyleID()));
XWPFParagraph dstPrr = destDocc.createParagraph();
dstPrr.createRun();
pos = destDocc.getParagraphs().size() - 1;
CTPPr ppr = pr.getCTP().getPPr();
if (ppr == null) ppr = pr.getCTP().addNewPPr();
CTSpacing spacing = ppr.isSetSpacing()? ppr.getSpacing() : ppr.addNewSpacing();
spacing.setAfter(BigInteger.valueOf(0));
spacing.setBefore(BigInteger.valueOf(0));
spacing.setLineRule(STLineSpacingRule.AUTO);
spacing.setLine(BigInteger.valueOf(240));
destDocc.setParagraph(pr, pos);
//// System.out.println("comp "
//// + destDoc.getParagraphArray(pos).getParagraphText());
}
} else if (elementType.name().equals("TABLE")) {
XWPFTable table = (XWPFTable) bodyElement;
if ("n".equals(ind)) // compare string contents, not object references
{
copyStyle(doc, destDoc,
doc.getStyles().getStyle(table.getStyleID()));
destDoc.createTable();
pos = destDoc.getTables().size() - 1;
destDoc.setTable(pos, table);
// System.out.println("prodtable " + destDoc.getParagraphArray(pos).getParagraphText());
}
else {
copyStyle(doc, destDocc,
doc.getStyles().getStyle(table.getStyleID()));
destDocc.createTable();
pos = destDocc.getTables().size() - 1;
destDocc.setTable(pos, table);
// System.out.println("comptable " + destDoc.getParagraphArray(pos).getParagraphText());
}
}
}
destDoc.write(out);
destDocc.write(outc);
}
// Copy Styles of Table and Paragraph.
private static void copyStyle(XWPFDocument srcDoc, XWPFDocument destDoc,
XWPFStyle style) {
if (destDoc == null || style == null)
return;
if (destDoc.getStyles() == null) {
destDoc.createStyles();
}
List<XWPFStyle> usedStyleList = srcDoc.getStyles().getUsedStyleList(
style);
for (XWPFStyle xwpfStyle : usedStyleList) {
destDoc.getStyles().addStyle(xwpfStyle);
}
}
private static void copyLayout(XWPFDocument srcDoc, XWPFDocument destDoc)
{
CTPageMar pgMar = srcDoc.getDocument().getBody().getSectPr().getPgMar();
BigInteger bottom = pgMar.getBottom();
BigInteger footer = pgMar.getFooter();
BigInteger gutter = pgMar.getGutter();
BigInteger header = pgMar.getHeader();
BigInteger left = pgMar.getLeft();
BigInteger right = pgMar.getRight();
BigInteger top = pgMar.getTop();
CTPageMar addNewPgMar = destDoc.getDocument().getBody().addNewSectPr().addNewPgMar();
addNewPgMar.setBottom(bottom);
addNewPgMar.setFooter(footer);
addNewPgMar.setGutter(gutter);
addNewPgMar.setHeader(header);
addNewPgMar.setLeft(left);
addNewPgMar.setRight(right);
addNewPgMar.setTop(top);
CTPageSz pgSzSrc = srcDoc.getDocument().getBody().getSectPr().getPgSz();
BigInteger code = pgSzSrc.getCode();
BigInteger h = pgSzSrc.getH();
Enum orient = pgSzSrc.getOrient();
BigInteger w = pgSzSrc.getW();
// Reuse the section properties created above instead of adding a second sectPr.
CTPageSz addNewPgSz = destDoc.getDocument().getBody().getSectPr().addNewPgSz();
addNewPgSz.setCode(code);
addNewPgSz.setH(h);
addNewPgSz.setOrient(orient);
addNewPgSz.setW(w);
}

Splitting a docx is easy enough to do in a brute force kind of a way: you can delete the content (paragraphs etc) you don't want, then save the result.
This way, the original relationships will stay intact, but your docx container may be bigger than necessary, since it might have images etc which are no longer used.
Done this way, there are still things you need to look out for:
- splitting between a bookmark start and end tag (the same applies to comments)
- automatic numbering might give the wrong start number, unless you set "start at"
Obviously you could write code to address such issues.
Alternatively, with our commercial Enterprise edition of docx4j, you can use its "merge" code to say you want, for example, paragraphs X to Y, and it'll give you a docx containing only that (i.e. no extraneous images in the docx container, split bookmarks taken care of, etc.).
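The partitioning step of the brute-force approach can be sketched without any docx library. This is only an illustration, not docx4j code: each String is a hypothetical stand-in for the text of one body-level element, and SplitAtMarker and split are made-up names. With docx4j you would apply the same decision to the objects in the main document part's content list, deleting the unwanted ones from each copy of the document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: each String plays the role of one body-level element's text.
final class SplitAtMarker {

    /** Returns two lists: the elements before the marker, then the marker element and everything after it. */
    static List<List<String>> split(List<String> elements, String marker) {
        List<String> before = new ArrayList<>();
        List<String> after = new ArrayList<>();
        boolean found = false;
        for (String element : elements) {
            if (!found && element.contains(marker)) {
                found = true; // the marker element starts the second part
            }
            (found ? after : before).add(element);
        }
        return List.of(before, after);
    }

    public static void main(String[] args) {
        List<String> body = List.of("Product name", "Price", "CONSTRUCTION DETAILS:", "Frame", "Finish");
        List<List<String>> parts = split(body, "CONSTRUCTION DETAILS:");
        System.out.println(parts.get(0));
        System.out.println(parts.get(1));
    }
}
```

Here the marker element goes into the second part, which matches what the Apache POI code in the question does. Using contains rather than an exact comparison tolerates extra text or whitespace around the marker.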

I hope this will solve the issue.
public class SplitUsingDocx4j {
/**
* @param args
* @throws Docx4JException
* @throws FileNotFoundException
*/
public static void main(String[] args) throws Docx4JException,
FileNotFoundException {
File dir = new File(PropertyUtils.getProperty("INPUT_DIR"));
String[] files = dir.list();
File file = null;
if (files.length == 0) {
System.out.println("The directory is empty");
} else {
for (String aFile : files) {
System.out.println(aFile);
file = new File(PropertyUtils.getProperty("INPUT_DIR") + aFile
+ "/" + aFile + ".docx");
}
}
// Creating new documents
WordprocessingMLPackage doc1 = WordprocessingMLPackage.createPackage();
WordprocessingMLPackage doc2 = WordprocessingMLPackage.createPackage();
// loading existing document
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(new java.io.File(file.getPath()));
MainDocumentPart tempDocPart = wordMLPackage.getMainDocumentPart();
List<Object> obj = wordMLPackage.getMainDocumentPart().getContent();
// for copying styles from existing doc to new docs
StyleDefinitionsPart sdp = tempDocPart.getStyleDefinitionsPart();
Styles tempStyle = sdp.getJaxbElement();
doc1.getMainDocumentPart().getStyleDefinitionsPart()
.setJaxbElement(tempStyle);
doc2.getMainDocumentPart().getStyleDefinitionsPart()
.setJaxbElement(tempStyle);
boolean flag = false;
for (Object object : obj) {
if (!flag) {
if (object.toString().equalsIgnoreCase("CONSTRUCTION DETAILS:")) {
flag = true;
}
doc1.getMainDocumentPart().addObject(object);
} else {
doc2.getMainDocumentPart().addObject(object);
}
}
String fileName = file.getName().toString().replace(".docx", "");
doc1.save(new File(fileName + "-1.docx"));
doc2.save(new File(fileName + "-2.docx"));
}}

Related

The value "name" and "surname" aren't read apache poi

My goal is to read a docx file, find the placeholders "#name#" and "#surname#", and replace them with some other text:
This is my docx file:
I do this:
XWPFDocument docx = new XWPFDocument(OPCPackage.open("..."));
for (XWPFParagraph p : docx.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
if (runs != null) {
for (XWPFRun r : runs) {
String text = r.getText(0);
if (text != null && text.startsWith("#") && text.endsWith("#")) {
text = text.replace("#", "new ");
r.setText(text, 0);
}
}
}
}
for (XWPFTable tbl : docx.getTables()) {
for (XWPFTableRow row : tbl.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph p : cell.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.startsWith("#") && text.endsWith("#")) {
text = text.replace("#", "new ");
r.setText(text,0);
}
}
}
}
}
}
The problem is that my code reads all the other labels in the docx file, but it never reads the labels "#surname#" and "#name#". Can anyone help me?

From your screenshot it looks like "#name#" and "#surename#" are not directly in the document body but in a drawing (a text box or a shape, for example). Such elements are not covered by XWPFDocument.getParagraphs, .getTables, or any other high-level method in Apache POI. So your main problem is that the paragraphs containing your text are simply never traversed by your code.
The only way to really get all paragraphs out of the document's body is to use an XmlCursor that selects all w:p elements from the XML directly.
The code below shows that. It traverses all XWPFParagraphs in the document's body using an XmlCursor and replaces the text where it is found.
For the replacement itself I prefer the TextSegment approach shown in "Apache POI: ${my_placeholder} is treated as three different runs already". This is necessary because, even when the containing paragraph is traversed, the text may be split across several text runs because of formatting, spell checking, or other reasons. Microsoft Word has a nearly endless number of reasons to split text into separate runs.
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
public class WordReplaceTextSegment {
/**
* Parses the paragraph and searches for the string searched, starting at startPos.
* If the string is found, a TextSegment describing where it begins and ends
* is returned; otherwise null is returned.
*
* @param searched
* @param startPos
*/
static TextSegment searchText(XWPFParagraph paragraph, String searched, PositionInParagraph startPos) {
int startRun = startPos.getRun(),
startText = startPos.getText(),
startChar = startPos.getChar();
int beginRunPos = 0, candCharPos = 0;
boolean newList = false;
//CTR[] rArray = paragraph.getRArray(); //This does not contain all runs. It lacks hyperlink runs for ex.
java.util.List<XWPFRun> runs = paragraph.getRuns();
int beginTextPos = 0, beginCharPos = 0; //must be outside the for loop
//for (int runPos = startRun; runPos < rArray.length; runPos++) {
for (int runPos = startRun; runPos < runs.size(); runPos++) {
//int beginTextPos = 0, beginCharPos = 0, textPos = 0, charPos; //int beginTextPos = 0, beginCharPos = 0 must be outside the for loop
int textPos = 0, charPos;
//CTR ctRun = rArray[runPos];
CTR ctRun = runs.get(runPos).getCTR();
XmlCursor c = ctRun.newCursor();
c.selectPath("./*");
try {
while (c.toNextSelection()) {
XmlObject o = c.getObject();
if (o instanceof CTText) {
if (textPos >= startText) {
String candidate = ((CTText) o).getStringValue();
if (runPos == startRun) {
charPos = startChar;
} else {
charPos = 0;
}
for (; charPos < candidate.length(); charPos++) {
if ((candidate.charAt(charPos) == searched.charAt(0)) && (candCharPos == 0)) {
beginTextPos = textPos;
beginCharPos = charPos;
beginRunPos = runPos;
newList = true;
}
if (candidate.charAt(charPos) == searched.charAt(candCharPos)) {
if (candCharPos + 1 < searched.length()) {
candCharPos++;
} else if (newList) {
TextSegment segment = new TextSegment();
segment.setBeginRun(beginRunPos);
segment.setBeginText(beginTextPos);
segment.setBeginChar(beginCharPos);
segment.setEndRun(runPos);
segment.setEndText(textPos);
segment.setEndChar(charPos);
return segment;
}
} else {
candCharPos = 0;
}
}
}
textPos++;
} else if (o instanceof CTProofErr) {
c.removeXml();
} else if (o instanceof CTRPr) {
//do nothing
} else {
candCharPos = 0;
}
}
} finally {
c.dispose();
}
}
return null;
}
static void replaceTextSegment(XWPFParagraph paragraph, String textToFind, String replacement) {
TextSegment foundTextSegment = null;
PositionInParagraph startPos = new PositionInParagraph(0, 0, 0);
//while((foundTextSegment = paragraph.searchText(textToFind, startPos)) != null) { // search all text segments having text to find
while((foundTextSegment = searchText(paragraph, textToFind, startPos)) != null) { // search all text segments having text to find
System.out.println(foundTextSegment.getBeginRun()+":"+foundTextSegment.getBeginText()+":"+foundTextSegment.getBeginChar());
System.out.println(foundTextSegment.getEndRun()+":"+foundTextSegment.getEndText()+":"+foundTextSegment.getEndChar());
// maybe there is text before textToFind in begin run
XWPFRun beginRun = paragraph.getRuns().get(foundTextSegment.getBeginRun());
String textInBeginRun = beginRun.getText(foundTextSegment.getBeginText());
String textBefore = textInBeginRun.substring(0, foundTextSegment.getBeginChar()); // we only need the text before
// maybe there is text after textToFind in end run
XWPFRun endRun = paragraph.getRuns().get(foundTextSegment.getEndRun());
String textInEndRun = endRun.getText(foundTextSegment.getEndText());
String textAfter = textInEndRun.substring(foundTextSegment.getEndChar() + 1); // we only need the text after
if (foundTextSegment.getEndRun() == foundTextSegment.getBeginRun()) {
textInBeginRun = textBefore + replacement + textAfter; // if we have only one run, we need the text before, then the replacement, then the text after in that run
} else {
textInBeginRun = textBefore + replacement; // else we need the text before followed by the replacement in begin run
endRun.setText(textAfter, foundTextSegment.getEndText()); // and the text after in end run
}
beginRun.setText(textInBeginRun, foundTextSegment.getBeginText());
// runs between begin run and end run needs to be removed
for (int runBetween = foundTextSegment.getEndRun() - 1; runBetween > foundTextSegment.getBeginRun(); runBetween--) {
paragraph.removeRun(runBetween); // remove not needed runs
}
}
}
static List<XmlObject> getCTPObjects(XWPFDocument doc) {
List<XmlObject> result = new ArrayList<XmlObject>();
//create cursor selecting all paragraph elements
XmlCursor cursor = doc.getDocument().newCursor();
cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:p");
while(cursor.hasNextSelection()) {
cursor.toNextSelection();
XmlObject obj = cursor.getObject();
// add only if the paragraph contains at least a run containing text
if (obj.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' ./w:r/w:t").length > 0) {
result.add(obj);
}
}
return result;
}
static void traverseAllParagraphsAndReplace(XWPFDocument doc, Map<String, String> replacements) throws Exception {
//This gets all XWPFParagraphs out of the stored XML and does the replacing
//first get all CTP objects
List<XmlObject> allCTPObjects = getCTPObjects(doc);
//then traverse them and create XWPFParagraphs from them and do the replacing
for (XmlObject obj : allCTPObjects) {
XWPFParagraph paragraph = null;
if (obj instanceof CTP) {
CTP p = (CTP)obj;
paragraph = new XWPFParagraph(p, doc);
} else {
CTP p = CTP.Factory.parse(obj.xmlText());
paragraph = new XWPFParagraph(p, doc);
}
if (paragraph != null) {
for (String textToFind : replacements.keySet()) {
String replacement = replacements.get(textToFind);
if (paragraph.getText().contains(textToFind)) replaceTextSegment(paragraph, textToFind, replacement);
}
}
obj.set(paragraph.getCTP());
}
}
public static void main(String[] args) throws Exception {
XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
Map<String, String> replacements;
replacements = new HashMap<String, String>();
replacements.put("#name#", "Axel");
replacements.put("#surename#", "Richter");
traverseAllParagraphsAndReplace(doc, replacements);
FileOutputStream out = new FileOutputStream("result.docx");
doc.write(out);
out.close();
doc.close();
}
}
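To illustrate why matching run by run misses a placeholder that Word has split across runs, here is a small sketch using plain strings only. The run texts are hypothetical and SplitRunsDemo is a made-up name; nothing here is Apache POI API. It only shows that the placeholder is absent from every single run while present in the paragraph text as a whole, which is the situation the TextSegment search above is designed to handle.

```java
import java.util.List;

final class SplitRunsDemo {

    /** True if any single run contains the placeholder on its own. */
    static boolean foundInSingleRun(List<String> runs, String placeholder) {
        for (String run : runs) {
            if (run.contains(placeholder)) {
                return true;
            }
        }
        return false;
    }

    /** True if the placeholder appears in the text of all runs joined together. */
    static boolean foundInParagraph(List<String> runs, String placeholder) {
        return String.join("", runs).contains(placeholder);
    }

    public static void main(String[] args) {
        // Hypothetical run texts: the placeholder "#name#" is spread over three runs.
        List<String> runs = List.of("Dear ", "#", "name", "#", ",");
        System.out.println(foundInSingleRun(runs, "#name#"));
        System.out.println(foundInParagraph(runs, "#name#"));
    }
}
```

A check like r.getText(0).startsWith("#") corresponds to the first method, so it can only succeed when the whole placeholder happens to sit in one run.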

Read test data from excel

I am using a POM framework and kept the test data in a properties file (that code works without any issue), but per the current requirement I have to keep the test data in an Excel file. With my code the data is read from Excel, but the values are not sent to Chrome (I cross-checked by printing the values in the console). When debugging, I found that readTestData returns null. The issue seems to be the p.load(fs1); line, because the data is not loaded.
// below code is for properties file and its working without any issue.
/* public static String readTestData(String key) throws IOException {
String filename = "testData";
String path = System.getProperty("user.dir") + "/data/testData.properties";
if (path == null || path.length() == 0) {
path = System.getProperty("user.dir") + "/data/" + filename + ".properties";
}
Properties p = new Properties();
FileInputStream fs = new FileInputStream(path);
System.out.print("File Input Stream value is "+fs);
p.load(fs);
System.out.println("Value of login username is "+(String)p.get(key));
return (String) p.get(key);
}*/
// Below code is for reading test data from xlsx
public static String readTestData(String key) throws IOException {
String filename = "testData";
String path = System.getProperty("user.dir") + "/data/testData.xlsx";
if (path == null || path.length() == 0) {
path = System.getProperty("user.dir") + "/data/" + filename + ".xlsx";
}
Properties p = new Properties();
FileInputStream fs = new FileInputStream(path);
Workbook SapWorkbook = null;
StringBuffer sbf = new StringBuffer();
SapWorkbook = new XSSFWorkbook(fs);
Sheet SapSheet = SapWorkbook.getSheet("Sheet1");
int rowCount = SapSheet.getLastRowNum()-SapSheet.getFirstRowNum();
for (int i = 0; i < rowCount+1; i++) {
Row row = SapSheet.getRow(i);
//Create a loop to print cell values in a row
for (int j = 0; j < row.getLastCellNum(); j++) {
//Print Excel data in console
sbf.append(row.getCell(j).getStringCellValue());
System.out.print(row.getCell(j).getStringCellValue()+"|| ");
}
System.out.println();
}
byte[] bytes = sbf.toString().getBytes();
ByteArrayInputStream fs1 = new ByteArrayInputStream(bytes);
p.load(fs1);
System.out.println("Value of login username is "+(String)p.get(key));
return (String) p.get(key);
}
public static void enterText(String key, String data) throws IOException, InterruptedException {
try {
waitForPresenceAndVisibilityOfElementry(readobjectRepo(key));
WebElement ele = driver.findElement(By.xpath(readobjectRepo(key)));
ele.clear();
Thread.sleep(1200);
System.out.println("about to read Base page");
ele.sendKeys(readTestData(data));
System.out.println("data read");
startExtent.log(LogStatus.PASS, "Entering data.. " + readTestData(data) + " is sucessful");
Thread.sleep(1200);
} catch (Exception e) {
e.printStackTrace();
reportFailure("Click on element is unsucessful");
}
}
Console output:
loginUserName|| ABCD@gmail.com||
loginPassword|| abc#1a||
Value of login username is null

Your code looks quite messy and could do with some clean up. This block is pointless:
String filename = "testData";
String path = System.getProperty("user.dir") + "/data/testData.xlsx";
if (path == null || path.length() == 0) {
path = System.getProperty("user.dir") + "/data/" + filename + ".xlsx";
}
You are just hard-coding the same path in two different ways; stick to:
String path = System.getProperty("user.dir") + "/data/testData.xlsx";
It's pretty unclear what you are trying to do here, but I'm going to guess you want the value in the second column that is associated with a key in the first column. Everything that builds a Properties object is unnecessary, so I've rewritten your code as:
Workbook SapWorkbook = new XSSFWorkbook(fs);
Sheet SapSheet = SapWorkbook.getSheet("Sheet1");
int rowCount = SapSheet.getLastRowNum() - SapSheet.getFirstRowNum();
for (int i = 0; i < rowCount + 1; i++) {
Row row = SapSheet.getRow(i);
if (key.equals(row.getCell(0).getStringCellValue())) {
return row.getCell(1).getStringCellValue();
}
}
throw new IllegalArgumentException(String.format("%s not found!", key));
This will now loop through each row searching for the first instance of your "key" and will then return the associated "value". If it doesn't find the key, it will throw an exception. To package it all up as a single method:
public static String readTestData(String key) throws IOException {
String path = System.getProperty("user.dir") + "/data/testData.xlsx";
// try-with-resources closes the stream and the workbook, even on an early return
try (FileInputStream fs = new FileInputStream(path);
Workbook sapWorkbook = new XSSFWorkbook(fs)) {
Sheet sapSheet = sapWorkbook.getSheet("Sheet1");
for (int i = sapSheet.getFirstRowNum(); i <= sapSheet.getLastRowNum(); i++) {
Row row = sapSheet.getRow(i);
// skip blank rows and rows that lack a key or value cell
if (row == null || row.getCell(0) == null || row.getCell(1) == null) {
continue;
}
if (key.equals(row.getCell(0).getStringCellValue())) {
return row.getCell(1).getStringCellValue();
}
}
}
throw new IllegalArgumentException(String.format("%s not found!", key));
}
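As for why p.load(fs1) ends up with null in the question: a likely explanation is that the cell values are appended to the StringBuffer back to back, with no "=" and no line breaks, so Properties.load sees one long key with an empty value. The sketch below reproduces that with plain strings. The sample values and the names PropertiesLoadDemo and lookup are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PropertiesLoadDemo {

    /** Loads the text as a properties file and returns the value for key, or null if absent. */
    static String lookup(String propertiesText, String key) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty(key);
    }

    public static void main(String[] args) {
        // Cell values appended back to back, as the question's loop does (hypothetical values).
        String concatenated = "loginUserName" + "user@example.com" + "loginPassword" + "secret";
        System.out.println(lookup(concatenated, "loginUserName"));

        // The format Properties.load expects: one key=value pair per line.
        String wellFormed = "loginUserName=user@example.com\nloginPassword=secret\n";
        System.out.println(lookup(wellFormed, "loginUserName"));
    }
}
```

Looking the key up directly in the sheet, as in the method above, avoids the round trip through Properties altogether.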

How to merge PDF documents with correct orientation? [duplicate]

How can I merge multiple PDF files (generated at run time) with iTextSharp and then print them?
I found the following link, but that method requires the PDF file names, i.e. it assumes the PDF files are stored on disk, and this is not my case.
I have multiple reports that I convert to PDF files with this method:
private void AddReportToResponse(LocalReport followsReport)
{
string mimeType;
string encoding;
string extension;
string[] streams = new string[100];
Warning[] warnings = new Warning[100];
byte[] pdfStream = followsReport.Render("PDF", "", out mimeType, out encoding, out extension, out streams, out warnings);
//Response.Clear();
//Response.ContentType = mimeType;
//Response.AddHeader("content-disposition", "attachment; filename=Application." + extension);
//Response.BinaryWrite(pdfStream);
//Response.End();
}
Now I want to merge all those generated files (byte arrays) into one PDF file and print it.

If you want to merge source documents using iText(Sharp), there are two basic situations:
You really want to merge the documents, acquiring the pages in their original format, transferring as much of their content and their interactive annotations as possible. In this case you should use a solution based on a member of the Pdf*Copy* family of classes.
You actually want to integrate pages from the source documents into a new document but want the new document to govern the general format and don't care for the interactive features (annotations...) in the original documents (or even want to get rid of them). In this case you should use a solution based on the PdfWriter class.
You can find details in chapter 6 (especially section 6.4) of iText in Action — 2nd Edition. The Java sample code can be accessed here and the C#'ified versions here.
A simple sample using PdfCopy is Concatenate.java / Concatenate.cs. The central piece of code is:
byte[] mergedPdf = null;
using (MemoryStream ms = new MemoryStream())
{
using (Document document = new Document())
{
using (PdfCopy copy = new PdfCopy(document, ms))
{
document.Open();
for (int i = 0; i < pdf.Count; ++i)
{
PdfReader reader = new PdfReader(pdf[i]);
// loop over the pages in that document
int n = reader.NumberOfPages;
for (int page = 0; page < n; )
{
copy.AddPage(copy.GetImportedPage(reader, ++page));
}
}
}
}
mergedPdf = ms.ToArray();
}
Here pdf can either be defined as a List<byte[]> immediately containing the source documents (appropriate for your use case of merging intermediate in-memory documents) or as a List<String> containing the names of source document files (appropriate if you merge documents from disk).
An overview at the end of the referenced chapter summarizes the usage of the classes mentioned:
PdfCopy: Copies pages from one or more existing PDF documents. Major downsides: PdfCopy doesn’t detect redundant content, and it fails when concatenating forms.
PdfCopyFields: Puts the fields of the different forms into one form. Can be used to avoid the problems encountered with form fields when concatenating forms using PdfCopy. Memory use can be an issue.
PdfSmartCopy: Copies pages from one or more existing PDF documents. PdfSmartCopy is able to detect redundant content, but it needs more memory and CPU than PdfCopy.
PdfWriter: Generates PDF documents from scratch. Can import pages from other PDF documents. The major downside is that all interactive features of the imported page (annotations, bookmarks, fields, and so forth) are lost in the process.

I used iTextsharp with c# to combine pdf files. This is the code I used.
string[] lstFiles=new string[3];
lstFiles[0]=@"C:/pdf/1.pdf";
lstFiles[1]=@"C:/pdf/2.pdf";
lstFiles[2]=@"C:/pdf/3.pdf";
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage;
string outputPdfPath=@"C:/pdf/new.pdf";
sourceDocument = new Document();
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
//Open the output file
sourceDocument.Open();
try
{
//Loop through the files list
for (int f = 0; f < lstFiles.Length; f++) // Length, not Length-1, so the last file is merged too
{
int pages =get_pageCcount(lstFiles[f]);
reader = new PdfReader(lstFiles[f]);
//Add pages of current file
for (int i = 1; i <= pages; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
reader.Close();
}
//At the end save the output file
sourceDocument.Close();
}
catch (Exception ex)
{
throw; // rethrow without resetting the stack trace
}
private int get_pageCcount(string file)
{
using (StreamReader sr = new StreamReader(File.OpenRead(file)))
{
Regex regex = new Regex(@"/Type\s*/Page[^s]");
MatchCollection matches = regex.Matches(sr.ReadToEnd());
return matches.Count;
}
}

Here is some code I pulled out of an old project I had. It was a web application but I was using iTextSharp to merge pdf files then print them.
public static class PdfMerger
{
/// <summary>
/// Merge pdf files.
/// </summary>
/// <param name="sourceFiles">PDF files being merged.</param>
/// <returns></returns>
public static byte[] MergeFiles(List<Stream> sourceFiles)
{
Document document = new Document();
MemoryStream output = new MemoryStream();
try
{
// Initialize pdf writer
PdfWriter writer = PdfWriter.GetInstance(document, output);
writer.PageEvent = new PdfPageEvents();
// Open document to write
document.Open();
PdfContentByte content = writer.DirectContent;
// Iterate through all pdf documents
for (int fileCounter = 0; fileCounter < sourceFiles.Count; fileCounter++)
{
// Create pdf reader
PdfReader reader = new PdfReader(sourceFiles[fileCounter]);
int numberOfPages = reader.NumberOfPages;
// Iterate through all pages
for (int currentPageIndex = 1; currentPageIndex <=
numberOfPages; currentPageIndex++)
{
// Determine page size for the current page
document.SetPageSize(
reader.GetPageSizeWithRotation(currentPageIndex));
// Create page
document.NewPage();
PdfImportedPage importedPage =
writer.GetImportedPage(reader, currentPageIndex);
// Determine page orientation
int pageOrientation = reader.GetPageRotation(currentPageIndex);
if ((pageOrientation == 90) || (pageOrientation == 270))
{
content.AddTemplate(importedPage, 0, -1f, 1f, 0, 0,
reader.GetPageSizeWithRotation(currentPageIndex).Height);
}
else
{
content.AddTemplate(importedPage, 1f, 0, 0, 1f, 0, 0);
}
}
}
}
catch (Exception exception)
{
throw new Exception("An unexpected exception occurred" +
" during the PDF merging process.", exception);
}
finally
{
document.Close();
}
// ToArray returns only the bytes written; GetBuffer may include unused capacity
return output.ToArray();
}
}
/// <summary>
/// Implements custom page events.
/// </summary>
internal class PdfPageEvents : IPdfPageEvent
{
#region members
private BaseFont _baseFont = null;
private PdfContentByte _content;
#endregion
#region IPdfPageEvent Members
public void OnOpenDocument(PdfWriter writer, Document document)
{
_baseFont = BaseFont.CreateFont(BaseFont.HELVETICA,
BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
_content = writer.DirectContent;
}
public void OnStartPage(PdfWriter writer, Document document)
{ }
public void OnEndPage(PdfWriter writer, Document document)
{ }
public void OnCloseDocument(PdfWriter writer, Document document)
{ }
public void OnParagraph(PdfWriter writer,
Document document, float paragraphPosition)
{ }
public void OnParagraphEnd(PdfWriter writer,
Document document, float paragraphPosition)
{ }
public void OnChapter(PdfWriter writer, Document document,
float paragraphPosition, Paragraph title)
{ }
public void OnChapterEnd(PdfWriter writer,
Document document, float paragraphPosition)
{ }
public void OnSection(PdfWriter writer, Document document,
float paragraphPosition, int depth, Paragraph title)
{ }
public void OnSectionEnd(PdfWriter writer,
Document document, float paragraphPosition)
{ }
public void OnGenericTag(PdfWriter writer, Document document,
Rectangle rect, string text)
{ }
#endregion
private float GetCenterTextPosition(string text, PdfWriter writer)
{
return writer.PageSize.Width / 2 - _baseFont.GetWidthPoint(text, 8) / 2;
}
}
I didn't write this, but made some modifications. I can't remember where I found it. After I merged the PDFs I would call this method to insert javascript to open the print dialog when the PDF is opened. If you change bSilent to true then it should print silently to their default printer.
public Stream addPrintJStoPDF(Stream thePDF)
{
MemoryStream outPutStream = null;
PRStream finalStream = null;
PdfDictionary page = null;
string content = null;
//Open the stream with iTextSharp
var reader = new PdfReader(thePDF);
// finalStream is still null at this point, so write into a fresh, expandable stream
outPutStream = new MemoryStream();
var stamper = new PdfStamper(reader, (MemoryStream)outPutStream);
var jsText = "var res = app.setTimeOut('this.print({bUI: true, bSilent: false, bShrinkToFit: false});', 200);";
//Add the javascript to the PDF
stamper.JavaScript = jsText;
stamper.FormFlattening = true;
stamper.Writer.CloseStream = false;
stamper.Close();
//Set the stream to the beginning
outPutStream.Position = 0;
return outPutStream;
}
Not sure how well the above code is written since I pulled it from somewhere else and I haven't worked in depth at all with iTextSharp but I do know that it did work at merging PDFs that I was generating at runtime.
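The six numbers passed to AddTemplate for rotated pages in MergeFiles are the coefficients of an affine transform, and the arithmetic can be checked independently of iTextSharp. The sketch below uses Java's java.awt.geom.AffineTransform, whose six-argument constructor takes the coefficients in the same order, to show where the corners of an unrotated page end up. The page size (595 x 842 with rotation 90) and the name RotatedPageMatrix are assumptions chosen for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class RotatedPageMatrix {

    /** Applies the matrix (0, -1, 1, 0, 0, rotatedHeight) to a point of the unrotated page. */
    static Point2D place(double x, double y, double rotatedHeight) {
        // Same coefficient order as the six AddTemplate arguments: a, b, c, d, e, f.
        AffineTransform m = new AffineTransform(0, -1, 1, 0, 0, rotatedHeight);
        return m.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Assumed example: a 595 x 842 page stored with rotation 90,
        // so its size with rotation is 842 wide and 595 high.
        double width = 595, height = 842, rotatedHeight = 595;
        System.out.println(place(0, 0, rotatedHeight));          // lower-left corner of the source page
        System.out.println(place(width, height, rotatedHeight)); // upper-right corner of the source page
    }
}
```

Under these assumptions the matrix maps x to y and y to rotatedHeight minus x, so the source page lands inside the 842 x 595 landscape page. Note that MergeFiles uses the same matrix for 90 and 270 degrees, so pages rotated by 270 may come out upside down; that is worth checking against your own documents.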

Tested with iTextSharp-LGPL 4.1.6:
public static byte[] ConcatenatePdfs(IEnumerable<byte[]> documents)
{
using (var ms = new MemoryStream())
{
var outputDocument = new Document();
var writer = new PdfCopy(outputDocument, ms);
outputDocument.Open();
foreach (var doc in documents)
{
var reader = new PdfReader(doc);
for (var i = 1; i <= reader.NumberOfPages; i++)
{
writer.AddPage(writer.GetImportedPage(reader, i));
}
writer.FreeReader(reader);
reader.Close();
}
writer.Close();
outputDocument.Close();
// ToArray returns only the bytes written; GetBuffer may include unused capacity
return ms.ToArray();
}
}

To avoid the memory issues mentioned, I used a file stream instead of a memory stream (as suggested in "ITextSharp Out of memory exception merging multiple pdf") to merge the PDF files:
var parentDirectory = Directory.GetParent(SelectedDocuments[0].FilePath);
var savePath = parentDirectory + "\\MergedDocument.pdf";
using (var fs = new FileStream(savePath, FileMode.Create))
{
using (var document = new Document())
{
using (var pdfCopy = new PdfCopy(document, fs))
{
document.Open();
for (var i = 0; i < SelectedDocuments.Count; i++)
{
using (var pdfReader = new PdfReader(SelectedDocuments[i].FilePath))
{
for (var page = 1; page <= pdfReader.NumberOfPages; page++)
{
pdfCopy.AddPage(pdfCopy.GetImportedPage(pdfReader, page));
}
}
}
}
}
}

For printing multiple PDFs:
<button type="button" id="btnPrintMultiplePdf" runat="server" class="btn btn-primary btn-border btn-sm"
onserverclick="btnPrintMultiplePdf_click">
<i class="fa fa-file-pdf-o"></i>Print Multiple pdf</button>
protected void btnPrintMultiplePdf_click(object sender, EventArgs e)
{
if (ValidateForMultiplePDF() == true)
{
#region Declare Temp Variables..!!
CheckBox chkList = new CheckBox();
HiddenField HidNo = new HiddenField();
string Multi_fofile, Multi_listfile;
Multi_fofile = Multi_listfile = "";
Multi_fofile = Server.MapPath("PDFRNew");
#endregion
for (int i = 0; i < grdRnew.Rows.Count; i++)
{
#region Find Grd Controls..!!
CheckBox Chk_One = (CheckBox)grdRnew.Rows[i].FindControl("chkOne");
Label lbl_Year = (Label)grdRnew.Rows[i].FindControl("lblYear");
Label lbl_No = (Label)grdRnew.Rows[i].FindControl("lblCode");
#endregion
if (Chk_One.Checked == true)
{
HidNo.Value = lbl_No.Text.Trim() + lbl_Year.Text;
if (File.Exists(Multi_fofile + "\\" + HidNo.Value.ToString() + ".pdf"))
{
#region Get Multiple Files Name And Paths..!!
if (Multi_listfile != "")
{
Multi_listfile = Multi_listfile + ",";
}
Multi_listfile = Multi_listfile + Multi_fofile + "\\" + HidNo.Value.ToString() + ".pdf";
#endregion
}
}
}
#region For Generate Multiple Pdf..!!
if (Multi_listfile != "")
{
String[] Multifiles = Multi_listfile.Split(',');
string DestinationFile = Server.MapPath("PDFRNew") + "\\Multiple.Pdf";
MergeFiles(DestinationFile, Multifiles);
Response.ContentType = "application/pdf";
Response.AddHeader("Content-Disposition", "attachment;filename=\"" + System.IO.Path.GetFileName(DestinationFile) + "\"");
Response.TransmitFile(DestinationFile);
Response.End();
}
else
{
}
#endregion
}
}
private void MergeFiles(string DestinationFile, string[] SourceFiles)
{
try
{
int f = 0;
/**we create a reader for a certain Document**/
PdfReader reader = new PdfReader(SourceFiles[f]);
/**we retrieve the total number of pages**/
int n = reader.NumberOfPages;
/**Console.WriteLine("There are " + n + " pages in the original file.")**/
/**Step 1: creation of a document-object**/
Document document = new Document(reader.GetPageSizeWithRotation(1));
/**Step 2: we create a writer that listens to the Document**/
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(DestinationFile, FileMode.Create));
/**Step 3: we open the Document**/
document.Open();
PdfContentByte cb = writer.DirectContent;
PdfImportedPage page;
int rotation;
/**Step 4: We Add Content**/
while (f < SourceFiles.Length)
{
int i = 0;
while (i < n)
{
i++;
document.SetPageSize(reader.GetPageSizeWithRotation(i));
document.NewPage();
page = writer.GetImportedPage(reader, i);
rotation = reader.GetPageRotation(i);
if (rotation == 90 || rotation == 270)
{
cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(i).Height);
}
else
{
cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
/**Console.WriteLine("Processed page " + i)**/
}
f++;
if (f < SourceFiles.Length)
{
reader = new PdfReader(SourceFiles[f]);
/**we retrieve the total number of pages**/
n = reader.NumberOfPages;
/**Console.WriteLine("There are"+n+"pages in the original file.")**/
}
}
/**Step 5: we Close the Document**/
document.Close();
}
catch (Exception e)
{
string strOb = e.Message;
}
}
private bool ValidateForMultiplePDF()
{
bool chkList = false;
foreach (GridViewRow gvr in grdRnew.Rows)
{
CheckBox Chk_One = (CheckBox)gvr.FindControl("ChkSelectOne");
if (Chk_One.Checked == true)
{
chkList = true;
}
}
if (chkList == false)
{
divStatusMsg.Style.Add("display", "");
divStatusMsg.Attributes.Add("class", "alert alert-danger alert-dismissable");
divStatusMsg.InnerText = "ERROR: Please check at least one checkbox.";
grdRnew.Focus();
set_timeout();
return false;
}
return true;
}

Apache POI Word to HTML conversion - word boundary

I am using the code below to convert a Word file to HTML:
public Map convert(String wordDocPath, String htmlPath,
Map conversionParams)
{
log.info("Converting word file "+wordDocPath)
try
{
String workingFolder = "C:\\temp" // escape the backslash; "\t" would be read as a tab
File workingFolderFile = new File(workingFolder)
FileInputStream fis = new FileInputStream(wordDocPath);
XWPFDocument document = new XWPFDocument(fis);
XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(workingFolderFile));
options.setExtractor(new FileImageExtractor(workingFolderFile))
File htmlFile = new File(htmlPath);
OutputStream out = new FileOutputStream(htmlFile)
XHTMLConverter.getInstance().convert(document, out, options);
log.info("Converted to HTML file "+htmlPath)
}
catch(Exception e)
{
log.error("Exception :"+e.getMessage(),e)
}
}
The code generates the HTML output correctly.
I need to put parameters such as [[AGENT_NAME]] in the document and replace them with a regex later in the code. However, Apache POI does not treat this pattern as a single word: it sometimes splits it into "[[", "AGENT_NAME" and "]]" and inserts styled tags in between, so my regex no longer matches and I cannot replace the parameters.
How does Apache POI decide word boundaries? Is there a way to control this?

After all my efforts, I finally decided to write code that parses the Word document and merges the split runs. Here is the code; I hope it helps someone else.
Note: I used ${pattern} as the placeholder pattern.
void mergeSplittedPatterns(XWPFDocument document)
{
List<XWPFParagraph> paragraphs = document.paragraphs
for(XWPFParagraph paragraph : paragraphs)
{
List<XWPFRun> runs = paragraph.getRuns()
int firstCharRun,closingCharRun
boolean firstCharFound = false;
boolean secondCharFoundImmediately = false;
boolean closingCharFound = false;
boolean gotoNextRun = true
boolean scan = (runs!=null && runs.size()>0)
int index = 0
while(scan)
{
gotoNextRun = true;
XWPFRun run = runs.get(index)
String runText = run.getText(0)
if(runText!=null)
for (int i = 0; i < runText.length(); i++)
{
char character = runText.charAt(i);
if(secondCharFoundImmediately)
{
closingCharFound = (character=="}")
if(closingCharFound)
{
closingCharRun = index
if(firstCharRun==closingCharRun)
{
firstCharFound = secondCharFoundImmediately = closingCharFound = false
continue;
}
else
{
String mergedText= ""
for(int j=firstCharRun;j<=closingCharRun;j++)
{
mergedText += runs.get(j).getText(0)
}
runs.get(firstCharRun).setText(mergedText,0)
for(int j=closingCharRun;j>firstCharRun;j--)
{
paragraph.removeRun(j)
}
firstCharFound = secondCharFoundImmediately = closingCharFound = gotoNextRun = false
index = firstCharRun
break;
}
}
}
else if(firstCharFound)
{
secondCharFoundImmediately = (character=="{")
if(!secondCharFoundImmediately)
{
firstCharFound = secondCharFoundImmediately = closingCharFound = false
}
}
else if(character=="\$")
{
firstCharFound = true;
firstCharRun = index
}
}
if(gotoNextRun)
{
index++;
}
if(index>=runs.size())
{
scan = false;
}
}
}
}
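The merging idea above can be tried without POI on the classpath. The following is only a sketch under stated assumptions: each String in the list stands in for the text of one run (it is not the XWPFRun API), all texts are non-null, and the class and method names are made up for illustration. It regroups the fragments so that every ${...} placeholder ends up inside a single fragment, which is the property a later regex replacement relies on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a paragraph's runs: each String plays the role of one run's text. */
final class PlaceholderRunMerger {

    /** Returns the fragments regrouped so that no "${...}" placeholder is split across two of them. */
    static List<String> merge(List<String> fragments) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String fragment : fragments) {
            pending.append(fragment);
            // Keep accumulating while a placeholder is still open at the end of the text.
            if (!endsInsideOpenPlaceholder(pending.toString())) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            // Unterminated placeholder at the end of the paragraph: keep the text unchanged.
            merged.add(pending.toString());
        }
        return merged;
    }

    private static boolean endsInsideOpenPlaceholder(String text) {
        // A trailing "$" may be the first half of "${" split between two runs.
        if (text.endsWith("$")) {
            return true;
        }
        int open = text.lastIndexOf("${");
        return open >= 0 && text.indexOf('}', open) < 0;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("Dear ", "$", "{NA", "ME}", ", welcome");
        for (String fragment : merge(runs)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

One trade-off of this simplification: a fragment that merely ends in "$" is also joined with its successor, so in a real document the formatting boundary between those two runs would be lost.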

Replace table column value in Apache POI

I am using Apache POI 3.7 and trying to replace the value of a table column in a Word document (docx). With my current code, the new value is appended to the value already in the cell instead of replacing it; only when the cell is empty does the new value appear on its own. How can I resolve this? Below is the code I have so far.
Thanks in advance.
package test.doc;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;
public class POIDocXTableTest {
public static void main(String[] args)throws IOException {
String fileName = "C:\\Test.docx";
InputStream fis = new FileInputStream(fileName);
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
for (int x=0; x<paragraphs.size();x++)
{
XWPFParagraph paragraph = paragraphs.get(x);
System.out.println(paragraph.getParagraphText());
}
List<XWPFTable> tables = document.getTables();
for (int x=0; x<tables.size();x++)
{
XWPFTable table = tables.get(x);
List<XWPFTableRow> tableRows = table.getRows();
tableRows.remove(x);
for (int r=0; r<tableRows.size();r++)
{
System.out.println("Row "+ (r+1)+ ":");
XWPFTableRow tableRow = tableRows.get(r);
List<XWPFTableCell> tableCells = tableRow.getTableCells();
for (int c=0; c<tableCells.size();c++)
{
System.out.print("Column "+ (c+1)+ ": ");
XWPFTableCell tableCell = tableCells.get(c);
//tableCell.setText("TAE");
String tableCellVal = tableCell.getText();
if ((c+1)==2){
if (tableCellVal!=null){
if (tableCellVal.length()>0){
char c1 = tableCellVal.charAt(0);
String s2 = "-TEST";
char c2 = s2.charAt(0);
String test = tableCell.getText().replace(tableCellVal,s2);
tableCell.setText(test);
}else{
//tableCell.setText("NULL");
}
}
}
System.out.println("tableCell.getText(" + (c) + "):" + tableCellVal);
}
}
System.out.println("\n");
}
OutputStream out = new FileOutputStream(fileName);
document.write(out);
out.close();
}
}

To preserve the styles in a paragraph and still find search strings whose parts carry different styles, the best solution I have found is this method:
private long replaceInParagraphs(Map<String, String> replacements, List<XWPFParagraph> xwpfParagraphs) {
long count = 0;
for (XWPFParagraph paragraph : xwpfParagraphs) {
List<XWPFRun> runs = paragraph.getRuns();
for (Map.Entry<String, String> replPair : replacements.entrySet()) {
String find = replPair.getKey();
String repl = replPair.getValue();
TextSegement found = paragraph.searchText(find, new PositionInParagraph());
if ( found != null ) {
count++;
if ( found.getBeginRun() == found.getEndRun() ) {
// whole search string is in one Run
XWPFRun run = runs.get(found.getBeginRun());
String runText = run.getText(run.getTextPosition());
String replaced = runText.replace(find, repl);
run.setText(replaced, 0);
} else {
// The search string spans over more than one Run
// Put the Strings together
StringBuilder b = new StringBuilder();
for (int runPos = found.getBeginRun(); runPos <= found.getEndRun(); runPos++) {
XWPFRun run = runs.get(runPos);
b.append(run.getText(run.getTextPosition()));
}
String connectedRuns = b.toString();
String replaced = connectedRuns.replace(find, repl);
// The first Run receives the replaced String of all connected Runs
XWPFRun partOne = runs.get(found.getBeginRun());
partOne.setText(replaced, 0);
// Removing the text in the other Runs.
for (int runPos = found.getBeginRun()+1; runPos <= found.getEndRun(); runPos++) {
XWPFRun partNext = runs.get(runPos);
partNext.setText("", 0);
}
}
}
}
}
return count;
}
This method works with search strings spanning over more than one Run. The replaced part gets the style from the first found Run.
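The cross-run replacement described above can be modelled with plain strings to see what happens to each run. This is a sketch under assumptions, not the POI API: the list elements stand in for run texts, and the names CrossRunReplace and replaceAcrossRuns are invented for the example. It locates the first match in the concatenated text, works out which fragments it starts and ends in, writes the replaced text into the first of them and blanks the rest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model: each list element stands in for the text of one run. */
final class CrossRunReplace {

    /** Replaces the first match of find, even if it spans several fragments; returns true on success. */
    static boolean replaceAcrossRuns(List<String> runs, String find, String repl) {
        if (find.isEmpty()) {
            return false;
        }
        String all = String.join("", runs);
        int start = all.indexOf(find);
        if (start < 0) {
            return false;
        }
        int end = start + find.length() - 1; // index of the last matched character

        // Map the character positions back to fragment indexes.
        int beginRun = -1;
        int endRun = -1;
        int offset = 0;
        for (int i = 0; i < runs.size(); i++) {
            int len = runs.get(i).length();
            if (beginRun < 0 && start < offset + len) {
                beginRun = i;
            }
            if (end < offset + len) {
                endRun = i;
                break;
            }
            offset += len;
        }

        // The first fragment receives the replaced text of all connected fragments.
        StringBuilder joined = new StringBuilder();
        for (int i = beginRun; i <= endRun; i++) {
            joined.append(runs.get(i));
        }
        runs.set(beginRun, joined.toString().replace(find, repl));
        // The remaining fragments are emptied rather than removed, so indexes stay stable.
        for (int i = beginRun + 1; i <= endRun; i++) {
            runs.set(i, "");
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>(Arrays.asList("Hello [[AG", "ENT_", "NAME]]!", " Bye"));
        replaceAcrossRuns(runs, "[[AGENT_NAME]]", "Bob");
        System.out.println(runs);
    }
}
```

Emptying the trailing fragments instead of deleting them mirrors the method above, and it is also why the replaced text carries the formatting of the first run only.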

Well, I have done something similar to replace markers in a Word template with specified words:
public DotxTemplateFiller() {
String filename = "/poi/ls_Template_modern_de.dotx";
String outputPath = "/poi/output/output" + new Date().getTime()
+ ".dotx";
OutputStream out = null;
try {
File file = new File(filename);
XWPFDocument template = new XWPFDocument(new FileInputStream(file));
List<XWPFParagraph> xwpfParagraphs = template.getParagraphs();
replaceInParagraphs(xwpfParagraphs);
List<XWPFTable> tables = template.getTables();
for (XWPFTable xwpfTable : tables) {
List<XWPFTableRow> tableRows = xwpfTable.getRows();
for (XWPFTableRow xwpfTableRow : tableRows) {
List<XWPFTableCell> tableCells = xwpfTableRow
.getTableCells();
for (XWPFTableCell xwpfTableCell : tableCells) {
xwpfParagraphs = xwpfTableCell.getParagraphs();
replaceInParagraphs(xwpfParagraphs);
}
}
}
out = new FileOutputStream(new File(outputPath));
template.write(out);
out.flush();
out.close();
//System.exit(0);
} catch (IOException e) {
e.printStackTrace();
} finally {
if (out != null) {
try {
out.close();
} catch (IOException e) {
// nothing to do ....
}
}
}
}
/**
* @param xwpfParagraphs
*/
private void replaceInParagraphs(List<XWPFParagraph> xwpfParagraphs) {
for (XWPFParagraph xwpfParagraph : xwpfParagraphs) {
List<XWPFRun> xwpfRuns = xwpfParagraph.getRuns();
for (XWPFRun xwpfRun : xwpfRuns) {
String xwpfRunText = xwpfRun.getText(xwpfRun
.getTextPosition());
for (Map.Entry<String, String> entry : replacements
.entrySet()) {
if (xwpfRunText != null
&& xwpfRunText.contains(entry.getKey())) {
// replace() treats the key literally; replaceAll() would interpret it as a regex
xwpfRunText = xwpfRunText.replace(
entry.getKey(), entry.getValue());
}
}
xwpfRun.setText(xwpfRunText, 0);
}
}
}
public static void main(String[] args) {
new DotxTemplateFiller();
}
First I did it for regular paragraphs in the MS Word template and then for paragraphs inside table cells.
I hope this is helpful and that I understood your problem correctly. :-)
Best wishes.

Adding on to Josh's solution: the map I am building has ended up with over a thousand tags and continues to grow. To cut down on processing, I build a small subset of the tags that I know appear in the paragraph, typically a map of only one or two tags, and pass that as the Map to the replaceInParagraphs method provided above. Storing the substitution text in a Substitution object also lets me add methods to that object (such as formatting) that I can call once the substitution has been completed. The subset Map additionally tells me which replacements were made in any given paragraph.
private Map<String, Substitution> buildTagList(Map<String, Substitution> replacements, List<XWPFParagraph> xwpfParagraphs, String start, String end) {
Map<String, Substitution> returnMap = new HashMap<String, Substitution> ();
for (XWPFParagraph paragraph : xwpfParagraphs) {
List<XWPFRun> runs = paragraph.getRuns();
// Check if there is a tag in the paragraph
TextSegment found = paragraph.searchText(start, new PositionInParagraph());
String runText = "";
XWPFRun run = null;
if ( found != null ) {
StringBuilder b = new StringBuilder();
for (int runPos = found.getBeginRun(); runPos < runs.size(); runPos++) {
run = runs.get(runPos);
b.append(run.getText(run.getTextPosition()));
runText = b.toString();
}
// Now we need to find all tags in the run
boolean finished = false;
int tagStart = 0;
int tagEnd = 0;
while ( ! finished ) {
// get the next tag; look for the closing delimiter only after the opening one
tagStart = runText.indexOf(start, tagStart);
tagEnd = ( tagStart >= 0 ) ? runText.indexOf(end, tagStart + start.length()) : -1;
if ( tagEnd >= 0 ) {
String tag = runText.substring(tagStart, tagEnd + end.length());
Substitution s = replacements.get(tag);
if (s != null) {
returnMap.putIfAbsent(tag,s);
}
// continue scanning after this tag
tagStart = tagEnd + end.length();
}
else {
// no further complete tag in this paragraph
finished = true;
}
}
}
}
return returnMap;
}
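The subset-building step can be sketched without POI as follows. The assumptions are loud ones: the paragraph is represented by its already concatenated text, plain String values stand in for the Substitution objects (whose definition is not shown here), and the class and method names are hypothetical. The sketch scans for start/end delimiter pairs and keeps only those replacement entries whose tag actually occurs in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: String values stand in for the Substitution objects mentioned above. */
final class TagSubset {

    /** Returns the entries of replacements whose tag occurs in paragraphText, in order of first appearance. */
    static Map<String, String> tagsPresent(Map<String, String> replacements,
                                           String paragraphText, String start, String end) {
        if (start.isEmpty() || end.isEmpty()) {
            throw new IllegalArgumentException("delimiters must not be empty");
        }
        Map<String, String> present = new LinkedHashMap<>();
        int from = 0;
        while (true) {
            int tagStart = paragraphText.indexOf(start, from);
            if (tagStart < 0) {
                break; // no further opening delimiter
            }
            // Search for the closing delimiter only after the opening one.
            int tagEnd = paragraphText.indexOf(end, tagStart + start.length());
            if (tagEnd < 0) {
                break; // opening delimiter without a matching close
            }
            String tag = paragraphText.substring(tagStart, tagEnd + end.length());
            String value = replacements.get(tag);
            if (value != null) {
                present.putIfAbsent(tag, value);
            }
            from = tagEnd + end.length();
        }
        return present;
    }

    public static void main(String[] args) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("${NAME}", "Alice");
        all.put("${CITY}", "Oslo");
        all.put("${DATE}", "today");
        System.out.println(tagsPresent(all, "Dear ${NAME}, see you in ${CITY}.", "${", "}").keySet());
    }
}
```

The returned map can then be passed to a replacement routine in place of the full map, and its key set doubles as a record of which tags the paragraph contained.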
