Apache POI replaceText() side effect, changes line space - java

I am using POI 3.15 in Java to replace some text in my .doc template.
private HWPFDocument replaceText(HWPFDocument doc, String findText, String replaceText) {
Range r = doc.getRange();
for (int i = 0; i < r.numSections(); ++i) {
Section s = r.getSection(i);
for (int j = 0; j < s.numParagraphs(); j++) {
Paragraph p = s.getParagraph(j);
for (int k = 0; k < p.numCharacterRuns(); k++) {
CharacterRun run = p.getCharacterRun(k);
String text = run.text();
if (text.contains(findText)) {
run.replaceText(findText, replaceText);
}
}
}
}
return doc;
}
After I save the document. All content inside are correct. But the style of the document is not. The space between lines is changed. The original gap between lines is missing. All line are closely packed together.
Why? How do I keep the style of my template?

The HWPF library may not support all features, which exist in your doc file and this may result in changed formats. It may also result in unreadable files.
Some years ago I created a customized HWPF library, which could properly modify and write a wide variety of doc files for one of my clients and I gained a lot of experience about the doc file format and the HWPF library.
The problem is, that one has to properly support all features in HWPF, which may be present in the doc file. For instance, if clipart is included in the file, there will be separate tables, which maintain position and properties of the cliparts. If the content (text) is changed without adjusting the addresses in the other internal tables, formats etc. can be shifted, ignored or lost. (or in worst case, the document is unreadable)
I am not sure about the status of HWPF these days, but I expect, that it does not fully support the main relevant doc file features.
If you want to use HWPF for modifying / writing doc files, you may succeed with files, which have a reduced "feature set". For instance no tables, no cliparts, no text boxes - things like that. If you need to support almost any document, which a user may provide, I'd recommend to find a different solution.
One option could be to use rtf files, which are named .doc. Or use the XWPF library, which works for .docx files.

Related

Extract Reading Order Sequence in a Tagged PDF

I'm currently validating the correct order of the content in a Tagged PDF File.
Is there any way to extract the reading order numbers of Tagged PDF Files programmatically?
I've tried converting the tagged PDF to XML but I can't figure out which tags belong to a certain text.
I've tried the following Libraries:
Syncfusion
IText7
but I can't find any methods that get its reading order numbers.
Is it really possible? Thanks in advance!
You can extract the marked content tree of tagged pdf using the PdfPig (.Net) library. My understanding is that the reading order is indicated by the Marked-content identifier (MCID).
If a marked content element does not contain an MCID (like pagination elements), the MCID is set to -1.
Each MarkedContentElement will contain the letters, images and paths that belong to it:
using UglyToad.PdfPig;
[...]
using (PdfDocument document = PdfDocument.Open(pathToFile))
{
for (int p = 0; p < document.NumberOfPages; p++)
{
var page = document.GetPage(p + 1);
// extract the page's marked content
var markedContents = page.GetMarkedContents();
var orderedMarkedContents = markedContents
.OrderBy(mc => mc.MarkedContentIdentifier);
foreach (var mc in orderedMarkedContents)
{
// do something
}
}
}
If you want to extract the result to XML, you can have a look at the PageXmlTextExporter class. Have a look at the wiki for more information on ITextExporter and IReadingOrderDetector.
Note: I am an active contributer to this library.

How to clone a generated PDDocument in pdfbox 2.0.x?

I'm using pdfbox 2.0.12 to generate reports.
I want to create 2 versions in one go, with partly similar content.
(ie: generate 1-3 pages, clone, add more pages to each version, save)
What is the correct way to copy a PDDocument to a new PDDocument?
My files are fairly simple, just text and an image per page.
The existing StackO questions[1] use code from pdfbox 1.8, or whatever doesn't work today.
The multipdf.PDCloneUtility is marked deprecated for public use and also not for use for generated PDF:s.
I could not find an example in PDFbox tree that does this.
I'm using function importPage. This almost works, except there is some mixup with fonts.
The copied pages are correct in layout (some lines and an image), but the text is just dots because it cannot find the fonts used.
The ADDED pages in the copied doc are using copies of the same fonts, the text is fine.
When looking at font resources in Adobe Reader, in the copied doc, the used fonts are listed 2 times:
Roboto-Regular (Embedded Subset)
Type: TrueType (CID)
Encoding: Identity-H
Roboto-Regular
Type: TrueType (CID)
Encoding: Identity-H
Actual Font: Unknown
(etc)
When opening the copied doc, there's a warning
"Cannot find or create the font Roboto-Bold. Some characters may not display or print correctly"
In the source document, the fonts are listed once, exactly like the first entry above.
My code:
// Close content stream before copying
myContentStream.endText();
myContentStream.close();
// Copy pages
PDDocument result = new PDDocument();
result.setDocumentInformation(doc.getDocumentInformation());
int pageCount = doc.getNumberOfPages();
for (int i = 0; i < pageCount; ++i) {
PDPage page = doc.getPage(i);
PDPage importedPage = result.importPage(page);
// This is mentioned in importPage docs, bizarrely it's said to copy resources
importedPage.setRotation(page.getRotation());
// while this seems intuitive
importedPage.setResources(page.getResources());
}
// Fonts are recreated for copy by reloading from file
copy_plainfont = PDType0Font.load(result, new java.io.ByteArrayInputStream(plainfont_bytes));
//....etc
I have tried all combinations with and without importedPage.setRotation/setResources.
I've also tried using doc.getDocumentCatalog().getPages() and rolling through that. Same result.
[1]
I looked at
pdfbox: how to clone a page
Can duplicating a pdf with PDFBox be small like with iText?
and half a dozen more of varying irrelevance.
Grateful for any tips
/rasmus

Comparing two PDF files text using PDFBox is failing eventhough both files are having same text

I am using PDFBOX as a utility in my selenium automation for export testing . We are comparing actual exported pdf file with the expected ones using pdfbox and then pass/fail test accordingly. This works pretty much smoothly . However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing
Expected pdf file
Actual pdf file
Below is the general utility i am using to compare pdf files
private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
LOG.info("Comparing PDF files ("+pdfFile1+","+pdfFile2+")");
PDDocument pdf1 = PDDocument.load(pdfFile1);
PDDocument pdf2 = PDDocument.load(pdfFile2);
PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
try
{
if (pdf1pages.getCount() != pdf2pages.getCount())
{
String message = "Number of pages in the files ("+pdfFile1+","+pdfFile2+") do not match. pdfFile1 has "+pdf1pages.getCount()+" no pages, while pdf2pages has "+pdf2pages.getCount()+" no of pages";
LOG.debug(message);
throw new TestException(message);
}
PDFTextStripper pdfStripper = new PDFTextStripper();
LOG.debug("pdfStripper is :- " + pdfStripper);
LOG.debug("pdf1pages.size() is :- " + pdf1pages.getCount());
for (int i = 0; i < pdf1pages.getCount(); i++)
{
pdfStripper.setStartPage(i + 1);
pdfStripper.setEndPage(i + 1);
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!pdf1PageText.equals(pdf2PageText))
{
String message = "Contents of the files ("+pdfFile1+","+pdfFile2+") do not match on Page no: " + (i + 1)+" pdf1PageText is : "+pdf1PageText+" , while pdf2PageText is : "+pdf2PageText;
LOG.debug(message);
System.out.println("fff");
LOG.debug("pdf1PageText is " + pdf1PageText);
LOG.debug("pdf2PageText is " + pdf2PageText);
String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
LOG.debug("difference is "+difference);
throw new TestException(message+" [[ Difference is ]] "+difference);
}
}
LOG.info("Returning True , as PDF Files ("+pdfFile1+","+pdfFile2+") get matched");
} finally {
pdf1.close();
pdf2.close();
}
}
Eclipse shows this differences in console
https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png
I can see it is failing because of symbols like (curley braces , {} , hash # , exclamation mark !) however i don't know how to fix this one ..
Can anyone please tell me how to fix this one ?
However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing
That this might happen, should not surprise you. After all your test does not compare the looks of the pages in question but the results of text extraction.
While the look of textual data on the pages depends on the drawing instructions for the glyphs in question in the respective (in case of your files) embedded font file, the result of text extraction of the same textual data on the pages depends on the ToUnicode table or Encoding value of the PDF font information structures for that font file.
And indeed, while the textual data of the expected and the actual document use the same glyphs of the respective fonts, the ToUnicode tables in the expected and the actual document for one font claim that certain glyphs represent different Unicode code points.
The font in question has these three glyphs:
The ToUnicode map for that font in your expected document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]
which claim that these three characters correspond to U+0000, U+F125, and U+F128.
The ToUnicode map for that font in your actual document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]
which claim that these three characters correspond to U+0000, U+F126, and U+F129.
Thus, your test correctly has found a difference between expected and actual document, so its failure result is correct. Thus, you don't have to fix anything, the software producing the actual document has an issue!
(One could argue that the differences are inside Unicode private use areas and don't matter. In that case you'd have to update your test to ignore differences of characters from Unicode private use areas. But that should have been told you before you started creating tests.)
This is a tough one, since similar or even the same Unicode characters might have different byte representation, depending on font, encoding and other factors during PDF generation.
A possible solution I can think of if you can safely assume that the relevant text pieces are represented by 8 bit characters:
String stripUnicode(String s) {
StringBuilder sb = new StringBuilder(s.length());
for (char c : s.toCharArray()) {
if (c <= 0xFF) {
sb.append(c);
}
}
return sb.toString();
}
...
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...
If you need Unicode support, you need to implement your own custom comparison algorithm that is able to identify similar characters and treat them as equal.

how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique

I have some questions about parsing pdf anfd how to:
what is the purpose of using
PDDocument.loadNonSeq method that include a scratch/temporary file?
I have big pdf and i need to parse it and get text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (pdfstripper have got setStartPage(n) and setEndPage(n)
where n=n+1 every page loop ). Is more efficient for memory using loadNonSeq instead load?
For example
File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, READ_WRITE));
int index=1;
int numpages = doc.getNumberOfPages();
for (int index = 1; index <= numpages; index++){
PDFTextStripper stripper = new PDFTextStripper();
Writer destination = new StringWriter();
String xml="";
stripper.setStartPage(index);
stripper.setEndPage(index);
stripper.writeText(this.doc, destination);
.... //filtering text and then convert it in xml
}
Is this code above a right loadNonSeq use and is it a good practice to read PDF page per page without vaste in memory?
I use page per page reading because I need to write text in XML using DOM memory (using stripping technique, I decide to produce an XML for every page)
what is the purpose of using PDDocument.loadNonSeq method that include a scratch/temporary file?
PDFBox implements two ways to read a PDF file.
loadNonSeq is the way documents should be loaded
load is the way documents should not be loaded but one might try to repair flles with broken cross references this way
In the 2.0.0 development branch, the algorithm formerly used for loadNonSeq is now used for load and the algorithm formerly used for load is not used anymore.
I have big pdf and i need to parse it and get text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (pdfstripper have got setStartPage(n) and setEndPage(n) where n=n+1 every page loop ). Is more efficient for memory using loadNonSeq instead load?
Using loadNonSeq instead of load may improve memory usage for multi-revision PDFs because it only reads objects still referenced from the reference table while load can keep more in memory.
I don't know, though, whether using a scratch file makes a big difference.
is it a good practice to read PDF page per page without vaste in memory?
Internally PDFBox parses the given range page after page, too. Thus, if you process the stripper output page-by-page, it certainly is ok to parse it page by page.

Relace HWPFDocument paragraph text using java results strange output

I require to replace a HWPFDocument paragraph text of .doc file if it contains a particular text using java. It replaces the text. But the process writes the output text in a strange way. Please help me to rectify this issue.
Code snippet used:
public static HWPFDocument processChange(HWPFDocument doc)
{
try
{
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++)
{
Paragraph paragraph = range.getParagraph(i);
if (paragraph.text().contains("Place Holder"))
{
String text = paragraph.text();
paragraph.replaceText(text, "*******");
}
}
}
catch (Exception ex)
{
ex.printStackTrace();
}
return doc;
}
Input:
Place Holder
Textvalue1
Textvalue2
Textvalue3
Output:
*******Textvalue1
Textvalue1
Textvalue2
Textvalue3
The HWPF library is not in a perfect state for changing / writing .doc files. (At least at the last time that I looked. Some time ago I developed a custom variant of HWPF for my client which - among many other things - provides correct replace and save operations, but that library is not publicly available.)
If you absolutely must use .doc files and Java you may get away by replacing with strings of exactly same length. For instance "12345" -> "abc__" (_ being spaces or whatever works for you). It might make sense to find the absolute location of the to be replaced string in the doc file (using HWPF) and then changing it in the doc file directly (without using HWPF).
Word file format is very complicated and "doing it right" is not a trivial task. Unless you are willing to spend many man months, it will also not be possible to fix part of the library so that just saving works. Many data structures must be handled very precisely and a single "slip up" lets Word crash on the generated output file.

Categories

Resources