Apache POI Formatting issue - java

I was wondering if someone could help me figure out why my text does not line up when I read a .doc file. So far I am using WordExtractor, but I am having formatting issues: things do not line up correctly. Here is my code, written for Java 1.7.
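import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class Doc {
    File docFile = null;
    WordExtractor docExtractor = null;
    WordExtractor exprExtractor = null;

    public void read() {
        docFile = new File("blue.doc");
        // try-with-resources closes the input stream even if loading fails
        try (FileInputStream fis = new FileInputStream(docFile.getAbsolutePath())) {
            HWPFDocument doc = new HWPFDocument(fis);
            docExtractor = new WordExtractor(doc);
            // print inside the try block so a failed load cannot cause a NullPointerException
            System.out.println(docExtractor.getText());
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}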
How the program displays the document.
A E
I'm stuck in Folsom Prison, and time keeps draggin on.
It is supposed to be displayed like this
A E
I'm stuck in Folsom Prison, and time keeps draggin on.

This cannot work the way you expect. You are extracting the content of the document into a String, which distorts the document's formatting (paragraphs and so on). You then print that text to the console and expect it to look exactly as it does in Microsoft Word?
First, decide what you actually want to do. Assuming that you want to verify both the formatting and the content of the document, my answer follows. Converting the document to plain text with getText() gives you its content with the formatting distorted, which does not help you. With the POI library you should instead access each paragraph and table in the document and verify, read or write whatever you need.
doc.getRange() gives you a Range object. Explore this object with the help of http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/Range.html and you will be able to access all paragraphs, tables and sections in the document. That should help you work with the Word document programmatically.

Related

PDFBOX digit garble

I ran into a problem when using PDFBox to extract text. My PDF contains embedded Type3 fonts, and the digits drawn with them are not extracted correctly. Can someone give me some guidance? Thank you.
My version is 2.0.22
The correct output is [USD-001], the wrong output is [USD- ]
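import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String readPDF(File file) throws IOException {
    RandomAccessBufferedFileInputStream rbi = null;
    PDDocument pdDocument = null;
    String text = "";
    try {
        rbi = new RandomAccessBufferedFileInputStream(file);
        PDFParser parser = new PDFParser(rbi);
        parser.setLenient(false);
        parser.parse();
        pdDocument = parser.getPDDocument();
        PDFTextStripper textStripper = new PDFTextStripper();
        text = textStripper.getText(pdDocument);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // guard against a failed open: both references may still be null here
        if (pdDocument != null) {
            pdDocument.close();
        }
        if (rbi != null) {
            rbi.close();
        }
    }
    return text;
}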
I also used PDFBox to render the PDF to an image, and everything looked fine. I just want to get the same content as normal text.
PDFDebugger output
The pdf file : http://tmp.link/f/6249a07f6e47f
There are a number of aspects of this file making text extraction difficult.
First of all the font itself boycotts text extraction. In its ToUnicode stream we find the mappings:
1 begincodespacerange
<00> <ff>
endcodespacerange
2 beginbfchar
<22> <0000>
<23> <0000>
endbfchar
I.e. the two character codes of interest both are mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
Also the Encoding is not helping at all:
<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>
The glyph names /g121 and /g122 don't have a standardized meaning either.
PDFBox relies on these two font properties for text extraction and therefore fails here.
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
In the file there are such entries. Unfortunately, though, they are erroneous, like this for the digit '0':
/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC
The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
Because of that, Adobe Acrobat did not use to extract the actual text here either. Only recently, probably as recently as the early 2022 releases, did Acrobat start extracting a '0' here.
Incidentally, one known "trick" to keep regular text extraction programs from extracting the text of a PDF is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So it is possible that the error in your file is an application of this trick, maybe even by design, with the erroneous ActualText twist meant to lead astray text extractors that have some ActualText support while still allowing copy & paste from Adobe Acrobat.
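As an aside, the hex string in the ActualText entry quoted above can be decoded by hand to confirm what the file intended. PDF text strings that start with the bytes FE FF are UTF-16BE, so <FEFF0030> stands for the digit '0'. A minimal sketch in plain Java follows; the class and method names are made up for illustration, and this is not PDFBox API:

```java
import java.nio.charset.StandardCharsets;

public class ActualTextDecodeSketch {

    /** Decodes the body of a PDF hex string such as "FEFF0030" as UTF-16 text. */
    public static String decodeHexUtf16(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // two hex digits make one byte
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        // The UTF_16 charset honours a leading byte order mark (FE FF here)
        // and does not include it in the decoded string.
        return new String(bytes, StandardCharsets.UTF_16);
    }
}
```

Calling decodeHexUtf16("FEFF0030") yields the one-character string "0". This only illustrates the encoding; it does not make PDFBox pick up the malformed entries during extraction.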

How do I add line breaks, e.g. \n, in an Apache POI HWPF document

I have to modify a Word document in the old .doc format, using Apache POI with the HWPF representation of the document. I am struggling to insert line breaks into a table cell: in the modified document the line breaks show up as empty boxes.
table cell with added line break
The code I used for this, after selecting the specific cell:
cell.insertBefore("Test "+System.lineSeparator()+" Test");
The following also doesn't work:
cell.insertBefore("Test "+System.getProperty("line.separator")+" Test");
cell.insertBefore("Test \n Test");
cell.insertBefore("Test \r\n Test");
Everything I tried was turned into boxes.
I also tried writing the document to a temp file and then replacing a placeholder with HWPF, with the same result: empty boxes. Does anybody know a solution to this?
Thanks in advance.
Forget about Apache POI HWPF. It lives in the scratchpad and has made no progress for years. There are no usable methods to insert or create new paragraphs. All Range.insertBefore and Range.insertAfter methods that take more than just text are private and deprecated, and they have not worked properly for years either. The reason may be that the binary file format of Microsoft Word (HWPF) is the most horrible of all the horrible binary file formats, such as those behind HSSF and HSLF. So who wants to bother with this?
But to answer your question:
In word processing, text is structured in paragraphs that contain text runs. Each paragraph starts on a new line by default. "Text\nText", "Text\rText" or "Text\r\nText" stored in a text run would only mark a line break within that text run, not a new paragraph. Would, that is, because Microsoft Word of course has its own rules: there, \u000B marks the line break within a text run.
So what you could do is the following:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;

public class ReadAndWriteDOCTable {
    public static void main(String[] args) throws Exception {
        HWPFDocument document = new HWPFDocument(new FileInputStream("TemplateDOC.doc"));
        Range bodyRange = document.getRange();
        System.out.println(bodyRange);
        TableIterator tableIterator = new TableIterator(bodyRange);
        while (tableIterator.hasNext()) {
            Table table = tableIterator.next();
            System.out.println(table);
            TableCell cell = table.getRow(0).getCell(0); // first cell in table
            System.out.println(cell);
            Paragraph paragraph = cell.getParagraph(0); // first paragraph in cell
            System.out.println(paragraph);
            CharacterRun run = paragraph.insertBefore("Test\u000BTest");
            System.out.println(run);
        }
        FileOutputStream out = new FileOutputStream("ResultDOC.doc");
        document.write(out);
        out.close();
        document.close();
    }
}
That inserts the text run "Test\u000BTest" before the first paragraph in the first cell of each table in the document. The \u000B marks a line break within that text run.
Maybe that is what you wanted to achieve? But, as I said, forget about Apache POI HWPF. The next unsolvable problem is only a step away.
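If the text to insert comes from somewhere else and already contains ordinary newlines, they can be converted to \u000B before the text is handed to insertBefore. A small sketch in plain Java, with a hypothetical helper name that is not part of POI:

```java
public class WordLineBreakSketch {

    /** Replaces CRLF, CR and LF with U+000B, the in-run line break described above. */
    public static String toWordLineBreaks(String text) {
        // handle the two-character sequence first so it becomes a single break
        return text.replace("\r\n", "\u000B")
                   .replace('\r', '\u000B')
                   .replace('\n', '\u000B');
    }
}
```

With that in place the call from the example would read paragraph.insertBefore(WordLineBreakSketch.toWordLineBreaks("Test\nTest")). Whether Word renders the result as intended still depends on HWPF writing the file correctly.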

How to read / write into docx file using commons.io.FileUtils?

Need some quick help. I am trying to write a java program to generate a report. I have the report template in a docx file.
What I want to do is, use that docx file as template and put data in it multiple times for various records and write that to a new docx file. The main thing is I want to maintain the formatting and indentation of the contents inside the docx file. They are bullets data. And that's where the problem is.
Below is the piece of code handling the above operation,
public void readWriteDocx(HashMap<String, String> detailsMap) {
    try {
        File reportTemplateFile = new File("ReportTemplate.docx");
        File actualReportFile = new File("ActualReport.docx");
        StringBuilder preReport = new StringBuilder();
        preReport.append("Some details about pre report goes here...: ");
        preReport.append(System.lineSeparator());
        String docxContent = "";
        for (Map.Entry<String, String> entry : detailsMap.entrySet()) {
            docxContent = FileUtils.readFileToString(reportTemplateFile, StandardCharsets.UTF_8);
            // code to fetch and get data to insert into docxContent
            docxContent = docxContent.replace("$filename", keyFilename);
            docxContent = docxContent.replace("$expected", expectedFile);
            docxContent = docxContent.replace("$actual", actualFile);
            docxContent = docxContent.replace("$reportCount", String.valueOf(reportCount));
            docxContent = docxContent.replace("$diffMessage", key);
            FileUtils.writeStringToFile(actualReportFile, docxContent, StandardCharsets.UTF_8, true);
        }
        preReport.append(FileUtils.readFileToString(actualReportFile, StandardCharsets.UTF_8));
        System.out.print(preReport.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
As you can see, I am using the FileUtils read and write methods with UTF_8 encoding. That is just a guess; I am not sure it is right. I also append the contents of the newly generated docx file to a StringBuilder and print them on the console, but that is secondary. The main thing is that the docx should be written properly. But no luck.
When this prints, it is all weird characters and nothing is readable. When I try to open the newly generated docx file, it doesn't even open.
Any idea what I should do to get the data in the proper format? I am attaching an image of how my ReportTemplate.docx looks, which I am using as the template to generate this report. I am using commons-io-2.4.jar.
Please guide if you can. Thanks a lot.
You can use a library such as Apache POI or docx4j to create and edit Word files. Without such a library there is no simple way to edit .doc or .docx files.
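To see why reading the file as text fails: a .docx file is a ZIP archive of XML parts, so decoding it as UTF-8 and writing it back corrupts the archive, and appending to it corrupts it further. The sketch below, in plain Java, shows roughly what a library-free approach would involve. The class and method names are hypothetical, and it assumes each placeholder such as $filename appears unbroken in word/document.xml, which Word does not guarantee because it may split text across several runs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxPlaceholderSketch {

    /** Copies every entry of a .docx archive and rewrites placeholders in word/document.xml only. */
    public static byte[] fillTemplate(byte[] docx, Map<String, String> values) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // read the current entry completely
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                byte[] data = buffer.toByteArray();
                if (entry.getName().equals("word/document.xml")) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    for (Map.Entry<String, String> e : values.entrySet()) {
                        xml = xml.replace(e.getKey(), escapeXml(e.getValue()));
                    }
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(data);
                out.closeEntry();
            }
        }
        return result.toByteArray();
    }

    // Inserted values end up inside XML, so the markup characters must be escaped.
    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The template bytes could come from Files.readAllBytes and the result could be written with Files.write. This fills one copy of the template; combining several filled copies into a single report means merging document bodies, which is the kind of work the libraries mentioned above are meant for.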

Replacing HWPFDocument paragraph text using Java gives strange output

I need to replace the text of a paragraph in a .doc file (HWPFDocument) if it contains a particular text, using Java. The text is replaced, but the output is written in a strange way. Please help me fix this issue.
Code snippet used:
public static HWPFDocument processChange(HWPFDocument doc)
{
    try
    {
        Range range = doc.getRange();
        for (int i = 0; i < range.numParagraphs(); i++)
        {
            Paragraph paragraph = range.getParagraph(i);
            if (paragraph.text().contains("Place Holder"))
            {
                String text = paragraph.text();
                paragraph.replaceText(text, "*******");
            }
        }
    }
    catch (Exception ex)
    {
        ex.printStackTrace();
    }
    return doc;
}
Input:
Place Holder
Textvalue1
Textvalue2
Textvalue3
Output:
*******Textvalue1
Textvalue1
Textvalue2
Textvalue3
The HWPF library is not in a perfect state for changing and writing .doc files. (At least it was not the last time I looked. Some time ago I developed a custom variant of HWPF for a client which, among many other things, provides correct replace and save operations, but that library is not publicly available.)
If you absolutely must use .doc files and Java, you may get away with replacing strings by strings of exactly the same length, for instance "12345" -> "abc__" (_ being spaces or whatever works for you). It might make sense to find the absolute location of the string to be replaced in the .doc file (using HWPF) and then change it in the file directly (without using HWPF).
The Word file format is very complicated, and "doing it right" is not a trivial task. Unless you are willing to spend many man-months, it will also not be possible to fix part of the library so that just saving works. Many data structures must be handled very precisely, and a single slip-up makes Word crash on the generated output file.
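The same-length workaround suggested above can be sketched with a small padding helper in plain Java. The names are hypothetical, and padding with spaces is just one choice of filler:

```java
public class SameLengthReplacementSketch {

    /** Pads the replacement with trailing spaces so it is exactly targetLength characters long. */
    public static String padToLength(String replacement, int targetLength) {
        if (replacement.length() > targetLength) {
            // a longer replacement cannot be made to fit; the caller has to shorten it
            throw new IllegalArgumentException("replacement is longer than the text it replaces");
        }
        StringBuilder padded = new StringBuilder(replacement);
        while (padded.length() < targetLength) {
            padded.append(' ');
        }
        return padded.toString();
    }
}
```

In the loop from the question this would mean replacing only the placeholder string rather than the whole paragraph text, for example paragraph.replaceText("Place Holder", SameLengthReplacementSketch.padToLength("*******", "Place Holder".length())). Whether that is enough to keep the saved file intact is not guaranteed, for the reasons given above.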

How to extract data from a .docx file including image, table, formula etc?

I am working on a task in which I have to extract data from a Word document, mainly images, tables and special text (formulas etc.).
I am able to save the images from a Word file downloaded from the web, but when I apply the same code to my own .docx file it gives an error.
The code is:
// create a FileInputStream to read from the binary file
FileInputStream fs = new FileInputStream(filename);
// create an Office Word 2007+ document object to wrap the Word file
XWPFDocument docx = new XWPFDocument(fs);
// get all images from the document and store them in the list piclist
List<XWPFPictureData> piclist = docx.getAllPictures();
// traverse the list and write each image to a file
Iterator<XWPFPictureData> iterator = piclist.iterator();
System.out.println(piclist.size());
while (iterator.hasNext()) {
    XWPFPictureData pic = iterator.next();
    byte[] bytepic = pic.getData();
    int i = 0;
    BufferedImage imag = ImageIO.read(new ByteArrayInputStream(bytepic));
    //captureimage(imag,i,flag,j);
    if (imag != null) {
        ImageIO.write(imag, "jpg", new File("D:/imagefromword" + i + ".jpg"));
    } else {
        System.out.println("imag is empty");
    }
}
It gives an incorrect format error, but I cannot change the doc file.
Secondly, with the above code, if the document has more than one image, every file that gets saved contains the same image. Suppose we have 3 images: it will save 3 images, but all three will be the latest one.
Any help will be appreciated.
Without the actual error one can only guess.
But there are two POI implementations, HWPF and XWPF, depending on which version of Word document you read: the old binary .doc or the newer XML-based .docx. Typically the format error comes when you try to open the document with the wrong one.
Also, you need the full poi-ooxml-schemas jar to read more complicated documents.
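One way to check up front which of the two implementations a file calls for is to look at its first bytes: old binary .doc files are OLE2 containers, while .docx files are ZIP archives. A rough sketch in plain Java, with hypothetical class and method names; it only tells the container types apart and does not prove that the file is actually a Word document:

```java
import java.util.Arrays;

public class WordFormatSniffSketch {

    public enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature of OLE2 compound files, the container used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of ZIP archives, the container used by .docx files.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies a file by the first bytes of its content. */
    public static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Read the first eight bytes of the file and pass them to sniff; an OLE2 result points to HWPFDocument and a ZIP result to XWPFDocument, regardless of the file extension.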
