I need to replace the text of an HWPFDocument paragraph in a .doc file when it contains a particular string, using Java. The replacement happens, but the process writes the output text in a strange way. Please help me rectify this issue.
Code snippet used:
public static HWPFDocument processChange(HWPFDocument doc)
{
    try
    {
        Range range = doc.getRange();
        for (int i = 0; i < range.numParagraphs(); i++)
        {
            Paragraph paragraph = range.getParagraph(i);
            if (paragraph.text().contains("Place Holder"))
            {
                String text = paragraph.text();
                paragraph.replaceText(text, "*******");
            }
        }
    }
    catch (Exception ex)
    {
        ex.printStackTrace();
    }
    return doc;
}
Input:
Place Holder
Textvalue1
Textvalue2
Textvalue3
Output:
*******Textvalue1
Textvalue1
Textvalue2
Textvalue3
The HWPF library is not in a perfect state for changing / writing .doc files. (At least at the last time that I looked. Some time ago I developed a custom variant of HWPF for my client which - among many other things - provides correct replace and save operations, but that library is not publicly available.)
If you absolutely must use .doc files and Java, you may get away with replacing strings with strings of exactly the same length. For instance "12345" -> "abc__" (_ being spaces or whatever works for you). It might also make sense to find the absolute location of the to-be-replaced string in the doc file (using HWPF) and then change it in the doc file directly (without using HWPF).
Word file format is very complicated and "doing it right" is not a trivial task. Unless you are willing to spend many man months, it will also not be possible to fix part of the library so that just saving works. Many data structures must be handled very precisely and a single "slip up" lets Word crash on the generated output file.
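To illustrate the same-length trick, here is a minimal sketch, assuming a placeholder of your choosing and that Range.replaceText copes with your particular document; it pads the replacement with spaces so the text keeps exactly the same length:

public static void replaceSameLength(HWPFDocument doc, String placeholder, String replacement)
{
    // Sketch only: pad the replacement with spaces so it has exactly
    // the same length as the placeholder it overwrites.
    if (replacement.length() > placeholder.length())
    {
        throw new IllegalArgumentException("replacement must not be longer than placeholder");
    }
    StringBuilder padded = new StringBuilder(replacement);
    while (padded.length() < placeholder.length())
    {
        padded.append(' ');
    }
    doc.getRange().replaceText(placeholder, padded.toString());
}

Even with this trick, whether the saved file stays intact depends on which features your document uses.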
I ran into some problems when using PDFBox to extract text. There are Type3 embedded fonts in my PDF, and the digits in that part are not extracted correctly. Can someone give me some guidance? Thank you.
My version is 2.0.22
The correct output is [USD-001], the wrong output is [USD- ]
public static String readPDF(File file) throws IOException {
    RandomAccessBufferedFileInputStream rbi = null;
    PDDocument pdDocument = null;
    String text = "";
    try {
        rbi = new RandomAccessBufferedFileInputStream(file);
        PDFParser parser = new PDFParser(rbi);
        parser.setLenient(false);
        parser.parse();
        pdDocument = parser.getPDDocument();
        PDFTextStripper textStripper = new PDFTextStripper();
        text = textStripper.getText(pdDocument);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the document as well, and guard against the stream
        // never having been opened
        if (pdDocument != null) {
            pdDocument.close();
        }
        if (rbi != null) {
            rbi.close();
        }
    }
    return text;
}
I tried using PDFBox to convert the PDF to an image and found that everything rendered fine. I just want to get the content as normal text.
PDFDebugger output
The pdf file : http://tmp.link/f/6249a07f6e47f
There are a number of aspects of this file making text extraction difficult.
First of all, the font itself boycotts text extraction. In its ToUnicode stream we find the mappings:
1 begincodespacerange
<00> <ff> endcodespacerange
2 beginbfchar
<22> <0000> <23> <0000> endbfchar
I.e. the two character codes of interest both are mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
Also the Encoding is not helping at all:
<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>
The glyph names /g121 and /g122 don't have a standardized meaning either.
For text extraction, PDFBox works with these two properties of a font and, therefore, fails here.
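You can check this yourself with PDFBox by asking each font what a given code maps to. A rough sketch (the file name is an assumption; the codes 0x22 and 0x23 are the ones from the ToUnicode excerpt above):

import java.io.File;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class DumpToUnicode {
    public static void main(String[] args) throws Exception {
        // Walk all pages and their fonts, printing what the suspect codes map to.
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) { // file name assumed
            for (PDPage page : doc.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // toUnicode returns the mapped string, or null if there is no usable mapping
                    System.out.printf("%s: 0x22 -> %s, 0x23 -> %s%n",
                            font.getName(), font.toUnicode(0x22), font.toUnicode(0x23));
                }
            }
        }
    }
}

For the file in question this should show the bogus U+0000 mappings instead of '0' and '1'.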
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
In the file there are such entries. Unfortunately, though, they are erroneous, like this for the digit '0':
/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC
The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
Due to that Adobe Acrobat also used to not extract the actual text here. Only recently, probably as recently as the early 2022 releases, Acrobat started extracting a '0' here.
Actually, one known "trick" to prevent one's PDFs from being text-extracted by regular text extractor programs is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So it's possible the error in your file is actually an application of this trick, maybe even by design with the erroneous ActualText twist: it leads text extractors with some ActualText support astray while still allowing copy & paste from Adobe Acrobat.
Following up on the question I asked before, How to have my java project to use some files without using their absolute path?, I found the solution, but another problem popped up when creating text files that I want to write into. Here's my code:
private String pathProvider() throws Exception {
    // finding the location where the jar file has been located
    String jarPath = URLDecoder.decode(getClass().getProtectionDomain().getCodeSource().getLocation().getPath(), "UTF-8");
    // creating the full and final path
    String completePath = jarPath.substring(0, jarPath.lastIndexOf("/")) + File.separator + "Records.txt";
    return completePath;
}

public void writeRecord() {
    try (Formatter writer = new Formatter(new FileWriter(new File(pathProvider()), true))) {
        writer.format("%s %s %s %s %s %s %s %s %n", whichIsChecked(), nameInput.getText(), lastNameInput.getText(),
                idInput.getText(), fieldOfStudyInput.getText(), date.getSelectedItem().toString(),
                month.getSelectedItem().toString(), year.getSelectedItem().toString());
        successful();
    } catch (Exception e) {
        failure();
    }
}
This works and creates the text file wherever the jar file is run from, but my problem is that when the information is written to the file, the numbers, symbols, and English characters remain intact, while the Persian characters are turned into question marks, like: ????? 111 ????? ????. Running the app in Eclipse doesn't cause this problem; running the jar does.
Note: I found the code inside the pathProvider method in another person's question.
Your pasted code and the linked question are complete red herrings - they have nothing whatsoever to do with the error you ran into. Also, that protection domain stuff is a hack and you've been told before not to write data files next to your jar files, it's not how OSes (are supposed to) work. Use user.home for this.
There is nothing in this method that explains the question marks - the string, as returned, has plenty of issues (see above), but NOT that it will result in question marks in the output.
Files are fundamentally bytes. Strings are fundamentally characters. Therefore, when you write code that writes a string to a file, some code somewhere is converting chars to bytes.
Make sure the place where that happens includes a charset encoding.
Use the new API (I think you've also been told to do this, by me, in an earlier question of yours), which defaults to UTF-8. Alternatively, specify UTF-8 when you write. Note that the "UTF-8" already in your code is about decoding the file name, not the contents of the file (as in: if you put Persian symbols in the file name, it's not about Persian symbols in the contents of the file / in the contents you want to write).
Because you didn't paste the code, I can't give you specific details as there are hundreds of ways to do this, and I do not know which one you used.
To write to a file given a String representing its path:
Path p = Paths.get(completePath);
Files.writeString(p, "Hello, World!");
is all you need (Java 11+). This will write as UTF-8, which can handle Persian symbols (because the Files API defaults to UTF-8 if you specify no encoding, unlike e.g. new File, FileOutputStream, FileWriter, etc).
If you're using outdated APIs: new BufferedWriter(new OutputStreamWriter(new FileOutputStream(thePath), StandardCharsets.UTF_8)) - but note that this is a resource leak bug unless you add the appropriate try-with-resources.
If you're using FileWriter: FileWriter is broken, never use this class. Use something else.
If you're converting the string on its own, it's str.getBytes(StandardCharsets.UTF_8), not str.getBytes().
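Applied to the writeRecord method from the question, a minimal sketch could look like this (keeping the Formatter and the question's helper methods, but routing the output through a writer that defaults to UTF-8):

// Requires java.nio.file.{Files, Paths, StandardOpenOption} and java.util.Formatter.
public void writeRecord() {
    // Files.newBufferedWriter defaults to UTF-8; CREATE + APPEND mirrors
    // the original new FileWriter(file, true) behaviour.
    try (Formatter writer = new Formatter(Files.newBufferedWriter(
            Paths.get(pathProvider()),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
        writer.format("%s %s %s %s %s %s %s %s %n", whichIsChecked(), nameInput.getText(),
                lastNameInput.getText(), idInput.getText(), fieldOfStudyInput.getText(),
                date.getSelectedItem().toString(), month.getSelectedItem().toString(),
                year.getSelectedItem().toString());
        successful();
    } catch (Exception e) {
        failure();
    }
}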
I am using POI 3.15 in Java to replace some text in my .doc template.
private HWPFDocument replaceText(HWPFDocument doc, String findText, String replaceText) {
    Range r = doc.getRange();
    for (int i = 0; i < r.numSections(); ++i) {
        Section s = r.getSection(i);
        for (int j = 0; j < s.numParagraphs(); j++) {
            Paragraph p = s.getParagraph(j);
            for (int k = 0; k < p.numCharacterRuns(); k++) {
                CharacterRun run = p.getCharacterRun(k);
                String text = run.text();
                if (text.contains(findText)) {
                    run.replaceText(findText, replaceText);
                }
            }
        }
    }
    return doc;
}
After I save the document, all the content is correct, but the style of the document is not: the spacing between lines has changed, the original gaps are missing, and all lines are packed closely together.
Why? How do I keep the style of my template?
The HWPF library may not support all features that exist in your doc file, and this may result in changed formatting. It may also result in unreadable files.
Some years ago I created a customized HWPF library, which could properly modify and write a wide variety of doc files for one of my clients, and I gained a lot of experience with the doc file format and the HWPF library.
The problem is that one has to properly support in HWPF all the features which may be present in the doc file. For instance, if clipart is included in the file, there will be separate tables which maintain the position and properties of the cliparts. If the content (text) is changed without adjusting the addresses in the other internal tables, formats etc. can be shifted, ignored, or lost (or, in the worst case, the document becomes unreadable).
I am not sure about the status of HWPF these days, but I expect that it does not fully support the main relevant doc file features.
If you want to use HWPF for modifying / writing doc files, you may succeed with files which have a reduced "feature set" - for instance no tables, no cliparts, no text boxes, things like that. If you need to support almost any document a user may provide, I'd recommend finding a different solution.
One option could be to use RTF files which are named .doc. Or use the XWPF library, which works for .docx files.
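If the .docx route is an option, a rough XWPF equivalent of your method might look like this (a simplified sketch: it only handles the case where the placeholder sits entirely inside a single run, which Word does not guarantee):

private XWPFDocument replaceText(XWPFDocument doc, String findText, String replaceText) {
    // Sketch only: Word may split a placeholder across several runs;
    // this handles the single-run case.
    for (XWPFParagraph p : doc.getParagraphs()) {
        for (XWPFRun run : p.getRuns()) {
            String text = run.getText(0);
            if (text != null && text.contains(findText)) {
                run.setText(text.replace(findText, replaceText), 0);
            }
        }
    }
    return doc;
}

Because XWPF edits runs in place, the run's character formatting is preserved.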
I am using PDFBox as a utility in my Selenium automation for export testing. We compare the actual exported PDF file with the expected one using PDFBox and then pass/fail the test accordingly. This works smoothly. However, I recently came across an actual exported file which looks the same as the expected one (as far as the data is concerned), yet comparing the two with PDFBox fails.
Expected pdf file
Actual pdf file
Below is the general utility i am using to compare pdf files
private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
    LOG.info("Comparing PDF files (" + pdfFile1 + "," + pdfFile2 + ")");
    PDDocument pdf1 = PDDocument.load(pdfFile1);
    PDDocument pdf2 = PDDocument.load(pdfFile2);
    PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
    PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
    try
    {
        if (pdf1pages.getCount() != pdf2pages.getCount())
        {
            String message = "Number of pages in the files (" + pdfFile1 + "," + pdfFile2
                    + ") do not match. pdfFile1 has " + pdf1pages.getCount()
                    + " pages, while pdfFile2 has " + pdf2pages.getCount() + " pages";
            LOG.debug(message);
            throw new TestException(message);
        }
        PDFTextStripper pdfStripper = new PDFTextStripper();
        LOG.debug("pdfStripper is :- " + pdfStripper);
        LOG.debug("pdf1pages.getCount() is :- " + pdf1pages.getCount());
        for (int i = 0; i < pdf1pages.getCount(); i++)
        {
            pdfStripper.setStartPage(i + 1);
            pdfStripper.setEndPage(i + 1);
            String pdf1PageText = pdfStripper.getText(pdf1);
            String pdf2PageText = pdfStripper.getText(pdf2);
            if (!pdf1PageText.equals(pdf2PageText))
            {
                String message = "Contents of the files (" + pdfFile1 + "," + pdfFile2
                        + ") do not match on page " + (i + 1) + ". pdf1PageText is: "
                        + pdf1PageText + ", while pdf2PageText is: " + pdf2PageText;
                LOG.debug(message);
                LOG.debug("pdf1PageText is " + pdf1PageText);
                LOG.debug("pdf2PageText is " + pdf2PageText);
                String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
                LOG.debug("difference is " + difference);
                throw new TestException(message + " [[ Difference is ]] " + difference);
            }
        }
        LOG.info("Returning true, as PDF files (" + pdfFile1 + "," + pdfFile2 + ") matched");
    } finally {
        pdf1.close();
        pdf2.close();
    }
}
Eclipse shows these differences in the console:
https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png
I can see it is failing because of symbols like curly braces {}, hash #, and exclamation mark !, but I don't know how to fix this.
Can anyone please tell me how to fix it?
However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing
That this might happen should not surprise you. After all, your test does not compare the looks of the pages in question but the results of text extraction.
While the look of textual data on the pages depends on the drawing instructions for the glyphs in question in the respective (in case of your files) embedded font file, the result of text extraction of the same textual data on the pages depends on the ToUnicode table or Encoding value of the PDF font information structures for that font file.
And indeed, while the textual data of the expected and the actual document use the same glyphs of the respective fonts, the ToUnicode tables in the expected and the actual document for one font claim that certain glyphs represent different Unicode code points.
The font in question has these three glyphs:
The ToUnicode map for that font in your expected document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]
which claim that these three characters correspond to U+0000, U+F125, and U+F128.
The ToUnicode map for that font in your actual document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]
which claim that these three characters correspond to U+0000, U+F126, and U+F129.
Thus, your test has correctly found a difference between the expected and actual documents, so its failure result is correct. You don't have to fix anything in the test; the software producing the actual document has an issue!
(One could argue that the differences are inside Unicode private use areas and don't matter. In that case you'd have to update your test to ignore differences in characters from Unicode private use areas. But that should have been specified before you started creating tests.)
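If you do decide to ignore such differences, a small sketch of a filter (treating the basic private use area U+E000..U+F8FF as ignorable) could be:

static String stripPrivateUseArea(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        // Drop characters in the basic Unicode private use area.
        if (c < '\uE000' || c > '\uF8FF') {
            sb.append(c);
        }
    }
    return sb.toString();
}

and then compare stripPrivateUseArea(pdf1PageText) with stripPrivateUseArea(pdf2PageText) in your loop.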
This is a tough one, since similar or even identical Unicode characters might have different byte representations depending on font, encoding, and other factors during PDF generation.
A possible solution I can think of if you can safely assume that the relevant text pieces are represented by 8 bit characters:
String stripUnicode(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c <= 0xFF) {
            sb.append(c);
        }
    }
    return sb.toString();
}
...
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...
If you need Unicode support, you need to implement your own custom comparison algorithm that is able to identify similar characters and treat them as equal.
I was wondering if someone could help me figure out why my text is not lining up when I read a .doc file. So far in my code I am using WordExtractor, but I am having formatting issues with text not lining up correctly. Here is my code, written using Java 1.7.
public class Doc {
    File docFile = null;
    WordExtractor docExtractor = null;
    WordExtractor exprExtractor = null;

    public void read() {
        docFile = new File("blue.doc");
        try {
            FileInputStream fis = new FileInputStream(docFile.getAbsolutePath());
            HWPFDocument doc = new HWPFDocument(fis);
            docExtractor = new WordExtractor(doc);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
        System.out.println(docExtractor.getText());
    }
}
How the program displays the document.
A E
I'm stuck in Folsom Prison, and time keeps draggin on.
It is supposed to be displayed like this
A                                          E
I'm stuck in Folsom Prison, and time keeps draggin on.
Of course this will not work. You are extracting the content of a document file into a string variable (which distorts the formatting, document-like paragraphs and all). Further, you are printing the text to the console, and then you expect it to look exactly like it does in Microsoft Word?
Next, you should think about what you want to do. Assuming that you want to verify both the formatting and the content of the document, my answer follows. Converting a document into plain text using getText() gives you the content of the document in a distorted format, which does not help you. Using the POI library, you should instead try to access each paragraph and table in the document and verify/read/write whatever you want.
doc.getRange() would give you a Range object. Play with this object by referring to http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/Range.html and you will be able to access all paragraphs, tables, and sections in the document. That should help you work through the Word document programmatically.
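As a starting point, a rough sketch of walking the Range (the names follow the HWPF usermodel API; how you handle each piece depends on your document):

Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    Paragraph paragraph = range.getParagraph(i);
    if (paragraph.isInTable()) {
        // Table cells also show up as paragraphs in this loop;
        // range.getTable(paragraph) gives the enclosing Table when
        // called with the table's first paragraph.
        System.out.println("table text: " + paragraph.text());
    } else {
        System.out.println("paragraph: " + paragraph.text());
    }
}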