I want to read the number of characters without spaces in a Word document using Apache POI.
I can get the number of characters with spaces using the SummaryInformation.getCharCount() method as in the following code:
public void countCharacters() throws FileNotFoundException, IOException {
    File wordFile = new File(BASE_PATH, "test.doc");
    POIFSFileSystem p = new POIFSFileSystem(new FileInputStream(wordFile));
    HWPFDocument doc = new HWPFDocument(p);
    SummaryInformation props = doc.getSummaryInformation();
    int numOfCharsWithSpaces = props.getCharCount();
    System.out.println(numOfCharsWithSpaces);
}
However, there seems to be no method that returns the number of characters without spaces.
How do I find this value?
If you want to base this on the metadata of the document, all you will get is estimates (according to the Microsoft specs). There are essentially two values which you can play around with:
GKPIDSI_CHARCOUNT (which is what you already accessed in your own code sample)
GKPIDDSI_CCHWITHSPACES
Don't ask me about the exact differences of those two values, though. I haven't designed this stuff...
Below is a code sample to illustrate the access to them (GKPIDDSI_CCHWITHSPACES is a little awkward):
HWPFDocument document = [...];

SummaryInformation summaryInformation = document.getSummaryInformation();
System.out.println("GKPIDSI_CHARCOUNT: " + summaryInformation.getCharCount());

DocumentSummaryInformation documentSummaryInformation = document.getDocumentSummaryInformation();
Integer count = null;
for (Property property : documentSummaryInformation.getProperties()) {
    if (property.getID() == 0x11) {
        count = (Integer) property.getValue();
        break;
    }
}
System.out.println("GKPIDDSI_CCHWITHSPACES: " + count);
The moment at which Word's internal algorithm that updates those values kicks in is rather unpredictable to me. So what you see in Word's own statistics may not necessarily be the same as when running the above code.
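If the stored estimates are not good enough, a workaround (just a sketch of mine, not something from the spec) is to extract the document text with POI's WordExtractor and count the non-whitespace characters yourself. This covers the body text only and will not necessarily match Word's own statistics exactly:

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Counts characters excluding all whitespace by walking the extracted body text.
// This is an approximation and may differ slightly from Word's "characters (no spaces)".
static int countCharsWithoutSpaces(HWPFDocument doc) {
    WordExtractor extractor = new WordExtractor(doc);
    String text = extractor.getText();
    int count = 0;
    for (char c : text.toCharArray()) {
        if (!Character.isWhitespace(c)) {
            count++;
        }
    }
    return count;
}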
Related
I am trying to merge 2 docx files which have their own bullet numbering; after merging the Word docs, the bullet numbers are automatically updated.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How do I stop this?
I am using the following code:
if (counter == 1)
{
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    //Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Table of contents - End
}
else
{
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);

    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">" + ReqName + "</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Add Page Break before new content
    main.addObject(objBr);
    //Add new content
    main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks, and have docx4j-ImportXHTML convert them via main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
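For illustration, a rough sketch of that alternative, assuming docx4j-ImportXHTML is on the classpath and the markup you build is well-formed XHTML (convertAltChunks() is the call mentioned above):

// Add the fragment as XHTML instead of HTML (it must be well-formed XHTML)
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Xhtml,
        htmlCode.toString().getBytes());

// Convert the XHTML altChunks to real WordprocessingML now (via docx4j-ImportXHTML),
// instead of leaving the conversion - and the bullet renumbering - to Word
main.convertAltChunks();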
I was able to fix my issue using the following code. I found it at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your custom code there; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir") + "/";
public final static String DIR_OUT = System.getProperty("user.dir") + "/";

public static void main(String[] args) throws Exception
{
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};

    List<BlockRange> blockRanges = new ArrayList<BlockRange>();
    for (int i = 0; i < files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add(block);

        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }

    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);

    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT + "OUT_MergeWholeDocumentsUsingBlockRange.docx");
}
I am trying to populate repeated forms with PDFBox. I am using a TreeMap and populating the forms with individual records. The format of the PDF form is such that there are six records listed on page one and a static page inserted on page two. (For a TreeMap larger than six records, the process repeats.) The error I'm getting is specific to the size of the TreeMap, and therein lies my problem: I can't figure out why, when I populate the TreeMap with more than 35 entries, I get this warning:
Apr 23, 2018 2:36:25 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
public class test {

    public static void main(String[] args) throws IOException {
        File dataFile = new File("dataFile.csv");
        File fi = new File("form.pdf");
        Scanner fileScanner = new Scanner(dataFile);
        fileScanner.nextLine();
        TreeMap<String, String[]> assetTable = new TreeMap<String, String[]>();
        int x = 0;
        while (x <= 36) {
            String lineIn = fileScanner.nextLine();
            String[] elements = lineIn.split(",");
            elements[0] = elements[0].toUpperCase().replaceAll(" ", "");
            String key = elements[0];
            key = key.replaceAll(" ", "");
            assetTable.put(key, elements);
            x++;
        }
        PDDocument newDoc = new PDDocument();
        int control = 1;
        PDDocument doc = PDDocument.load(fi);
        PDDocumentCatalog cat = doc.getDocumentCatalog();
        PDAcroForm form = cat.getAcroForm();
        for (String s : assetTable.keySet()) {
            if (control <= 6) {
                PDField IDno1 = (form.getField("IDno" + control));
                PDField Locno1 = (form.getField("locNo" + control));
                PDField serno1 = (form.getField("serNo" + control));
                PDField typeno1 = (form.getField("typeNo" + control));
                PDField maintno1 = (form.getField("maintNo" + control));
                String IDnoOne = assetTable.get(s)[1];
                //System.out.println(IDnoOne);
                IDno1.setValue(assetTable.get(s)[0]);
                IDno1.setReadOnly(true);
                Locno1.setValue(assetTable.get(s)[1]);
                Locno1.setReadOnly(true);
                serno1.setValue(assetTable.get(s)[2]);
                serno1.setReadOnly(true);
                typeno1.setValue(assetTable.get(s)[3]);
                typeno1.setReadOnly(true);
                String type = "";
                if (assetTable.get(s)[5].equals("1"))
                    type += "Hydrotest";
                if (assetTable.get(s)[5].equals("6"))
                    type += "6 Year Maintenance";
                String maint = assetTable.get(s)[4] + " - " + type;
                maintno1.setValue(maint);
                maintno1.setReadOnly(true);
                control++;
            } else {
                PDField dateIn = form.getField("dateIn");
                dateIn.setValue("1/2019 Yearlies");
                dateIn.setReadOnly(true);
                PDField tagDate = form.getField("tagDate");
                tagDate.setValue("2019 / 2020");
                tagDate.setReadOnly(true);
                newDoc.addPage(doc.getPage(0));
                newDoc.addPage(doc.getPage(1));
                control = 1;
                doc = PDDocument.load(fi);
                cat = doc.getDocumentCatalog();
                form = cat.getAcroForm();
            }
        }
        PDField dateIn = form.getField("dateIn");
        dateIn.setValue("1/2019 Yearlies");
        dateIn.setReadOnly(true);
        PDField tagDate = form.getField("tagDate");
        tagDate.setValue("2019 / 2020");
        tagDate.setReadOnly(true);
        newDoc.addPage(doc.getPage(0));
        newDoc.addPage(doc.getPage(1));
        newDoc.save("PDFtest.pdf");
        Desktop.getDesktop().open(new File("PDFtest.pdf"));
    }
}
I can't figure out for the life of me what I'm doing wrong. This is the first week I've been working with PDFBox, so I'm hoping it's something simple.
Updated Error Message
WARNING: Warning: You did not close a PDF Document
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:77)
at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:125)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1200)
at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:383)
at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:158)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1204)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1192)
at test.test.main(test.java:87)
The warning by itself
You appear to have misread the warning. It says:
Warning: You did not close a PDF Document
So in contrast to what you think ("PDFBox saying PDDocument closed when it's not"), PDFBox says that you did not close a document!
After your edit one sees that the exception actually says that a COSStream has been closed and that a possible cause is that the enclosing PDDocument has already been closed. This is a mere possibility!
The warning in your case
That being said, by adding pages from one document to another you probably end up having references to those pages from both documents. In that case, in the course of closing both documents (e.g. automatically via garbage collection), the one closed second may indeed stumble across some already closed COSStream instances.
So my first advice, to simply close the documents at the end with
doc.close();
newDoc.close();
probably won't remove the warnings, merely change their timing.
Actually, you don't merely create two documents doc and newDoc; you create new PDDocument instances and assign them to doc again and again, in the process setting the previous document object in that variable free for garbage collection. So you eventually have a whole bunch of documents to be closed as soon as they are no longer referenced.
I don't think it would be a good idea to close all those documents in doc early, in particular not before saving newDoc.
But if your code will eventually be run as part of a larger application instead of as a small, one-shot test application, you should collect all those PDDocument instances in some Collection and explicitly close them right after saving newDoc and then clear the collection.
Actually, your exception looks like one of those lost PDDocument instances has already been closed by garbage collection, so you should collect the documents even in the case of a simple one-shot utility, to keep them from being disposed of by the GC.
(@Tilman, please correct me if I'm wrong...)
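A rough sketch of that bookkeeping, using the variable names from your code (not a complete rewrite, just the relevant pieces):

// Keep every loaded source document reachable until the merged result has been saved
List<PDDocument> sourceDocs = new ArrayList<>();

PDDocument doc = PDDocument.load(fi);
sourceDocs.add(doc);
// ... fill fields, add pages; whenever you reload doc, add the new instance to sourceDocs ...

newDoc.save("PDFtest.pdf");
newDoc.close();
// Only now close the source documents whose pages were reused
for (PDDocument source : sourceDocs) {
    source.close();
}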
Importing pages
To prevent problems with different documents sharing pages, you can try and import the pages to the target document and thereafter add the imported page to the target document page tree. I.e. replace
newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));
by
newDoc.addPage(newDoc.importPage(doc.getPage(0)));
newDoc.addPage(newDoc.importPage(doc.getPage(1)));
This should allow you to close each PDDocument instance in doc before losing it.
There are certain drawbacks to this, though, cf. the method JavaDoc and this answer here.
An actual issue in your code
In your combined document you will have many fields with the same name (at least in case of a sufficiently high number of entries in your CSV file) which you initially set to different values. And you access the fields from the PDAcroForm of the respective original document but don't add them to the PDAcroForm of the combined result document.
This is asking for trouble! The PDF format does consider forms to be document-wide with all fields referenced (directly or indirectly) from the AcroForm dictionary of the document, and it expects fields with the same name to effectively be different visualizations of the same field and therefore to all have the same value.
Thus, PDF processors might handle your document fields in unexpected ways, e.g.
by showing the same value in all fields with the same name (as they are expected to have the same value) or
by ignoring your fields (as they are not in the document AcroForm structure).
In particular, programmatic reading of your PDF field values will fail, because in that context the form is definitively considered document-wide and based in AcroForm. PDF viewers, on the other hand, might at first show your set values and make things look OK.
To prevent this you should rename the fields before merging. You might consider using the PDFMergerUtility which does such a renaming under the hood. For an example usage of that utility class have a look at the PDFMergerExample.
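For illustration, a minimal sketch of that utility (PDFBox 2.x; the file names here are placeholders for your individually filled copies):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

// Merge separately filled copies of the form; clashing field names get renamed during the merge
PDFMergerUtility merger = new PDFMergerUtility();
merger.addSource(new File("filled-page-1.pdf"));
merger.addSource(new File("filled-page-2.pdf"));
merger.setDestinationFileName("PDFtest.pdf");
merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());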
Even though the above answer was marked as the solution to the problem, since the solution is buried in the comments, I wanted to add this answer at this level. I spent several hours searching for the solution.
My code snippets and comments.
// Collection solely for purpose of preventing premature garbage collection
List<PDDocument> sourceDocuments = new ArrayList<>( );
...
// Source document (actually inside a loop)
PDDocument docIn = PDDocument.load( artifactBytes );
// Add document to collection before using it to prevent the problem
sourceDocuments.add( docIn );
// Extract from source document
PDPage extractedPage = docIn.getPage( 0 );
// Add page to destination document
docOut.addPage( extractedPage );
...
// This was failing with "COSStream has been closed and cannot be read."
// Now it works.
docOut.save( bundleStream );
I am trying to read a PDF file through iText.
The program successfully reads the PDF file but the extracted text is missing spaces.
Program:
public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    PdfReaderContentParser pdfReaderContentParser = new PdfReaderContentParser(reader);
    TextExtractionStrategy strategy = null;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        String text = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
        System.out.println(text);
    }
}
Here is the data I need to get from the PDF.
When the program reads the PDF, the output is:
DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
01-04-2017 B/F 54,396.82
As you can see in the image, DATE is 01-04-2017, MODE is empty, PARTICULARS is B/F, DEPOSITS and WITHDRAWALS are also empty, and BALANCE is 54,396.82.
I need the same data in text form,
e.g.:
DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
01-04-2017 B/F 54,396.82
Need help, thanks in advance.
You are extracting text from the PDF; the result is correct, and it is not missing spaces, as there are no spaces in the raw text.
However (I missed that earlier, so I'm editing), you are using a LocationTextExtractionStrategy, which is "table-aware". This is good, but at the end getTextFromPage discards that table-aware information.
So instead you could create your own strategy implementation that extends LocationTextExtractionStrategy and adds a getTabulatedText() method to spit out the text with spaces inserted where you want them. Take inspiration from getResultantText(): see how it inserts a single space between cells. In your code you would insert as many spaces (or tabs) as needed. See this answer for an example.
MyTextExtractionStrategy strategy = new MyTextExtractionStrategy();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    String rawText = PdfTextExtractor.getTextFromPage(reader, i, strategy);
    String tabulatedText = strategy.getTabulatedText();
    System.out.println(tabulatedText);
}
(Maybe there is a "strategy" implementation that already does that, but I don't know of one.)
This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
I have a process that parses an XML file using JDOM and XPath, as shown below:
private static SAXBuilder builder = null;
private static Document doc = null;
private static XPath xpathInstance = null;

builder = new SAXBuilder();
Text list = null;

try {
    doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
    throw new Exception(e);
}

try {
    xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
    list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
    throw new Exception(e);
}
The above works fine. The XPath expressions are stored in a properties file, so they can be changed anytime. Now I have to process some more XML files that come from a legacy system that will only send the XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, with each chunk as one row in the database (making any changes to the legacy system, or to the processing that stores the chunks as rows in the database, is out of the question).
I can build the complete valid XML document by extracting all the rows related to a specific XML document and merging them, and then use the existing processing (shown above) to parse the document.
The thing is, though, the data I need to extract from the XML document will always be in the first 4000 bytes. This chunk of course is not a valid XML document, as it will be incomplete, but it will contain all the data I need. I can't parse just the one chunk, as the JDOM builder will reject it.
I am wondering whether I can parse the malformed XML chunk without having to merge all parts (which could get to quite many) in order to get a valid XML document. This would save me several trips to the database to check whether a chunk is available, and I won't have to merge hundreds of chunks only to be able to use the first 4000 bytes.
I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser or even XPath? Or do they both expect the XML document to be well-formed before they can parse it?
You could try to use JSoup to parse the invalid XML. By definition XML should be well-formed, otherwise it's invalid and should not be used.
UPDATE - example:
import java.util.List;

import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public static void main(String[] args) {
    for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
            new Element(Tag.valueOf("p"), ""),
            "")) {
        print(node, 0);
    }
}

public static void print(Node node, int offset) {
    for (int i = 0; i < offset; i++) {
        System.out.print(" ");
    }
    System.out.print(node.nodeName());
    for (Attribute attribute : node.attributes()) {
        System.out.print(", ");
        System.out.print(attribute.getKey() + "=" + attribute.getValue());
    }
    System.out.println();
    for (Node child : node.childNodes()) {
        print(child, offset + 4);
    }
}
For a server-based J2EE application, I need to retrieve the number of pages from Word documents. Any ideas what works?
If the documents are modern Word 2007 format you can use direct XML-based manipulation, through OOXML. This is by far the better long term solution, though I realize it may not be realistic for an entire organization to change overnight.
If they are older Word formats, you're probably stuck with the server-side Word/Excel/PowerPoint/Outlook programmable object models, although you're not supposed to do that on the server.
Regarding Office Open XML support, the latest beta of Java-POI is supposed to support it.
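With a POI release that includes the OOXML (XWPF) support, a sketch along these lines should read the page count stored for a .docx; note that this is the value Word last wrote to the document properties, so it can be stale:

import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public static int getDocxPageCount(String path) throws Exception {
    try (FileInputStream in = new FileInputStream(path);
         XWPFDocument doc = new XWPFDocument(in)) {
        // <Pages> from docProps/app.xml - an estimate recorded by Word, not recomputed by POI
        return doc.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
    }
}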
Haven't used it before but you could try Apache POI. Looks like it has a WordCount function.
//Open the Word document with POI's HWPF
HWPFDocument doc = new HWPFDocument(new FileInputStream("C:\\Temp\\file.doc"));
//Get the page count stored in the document's summary information
int pageCount = doc.getSummaryInformation().getPageCount();
To read the page count of MS Office files you can use the Aspose libraries (aspose-words, aspose-cells, aspose-slides).
Examples:
Excel:
Number of pages of the printable version of the Workbook:
import com.aspose.cells.*;

public int getPageCount(String filePath) throws Exception {
    Workbook book = new Workbook(filePath);
    ImageOrPrintOptions imageOrPrintOptions = new ImageOrPrintOptions();
    // Default 0 Prints all pages.
    // IgnoreBlank 1 Don't print the pages which the cells are blank.
    // IgnoreStyle 2 Don't print the pages which cells only contain styles.
    imageOrPrintOptions.setPrintingPage(PrintingPageType.IGNORE_STYLE);
    int pageCount = 0;
    for (int i = 0; i < book.getWorksheets().getCount(); i++) {
        Worksheet sheet = book.getWorksheets().get(i);
        PageSetup pageSetup = sheet.getPageSetup();
        pageSetup.setOrientation(PageOrientationType.PORTRAIT);
        pageSetup.setPaperSize(PaperSizeType.PAPER_LETTER);
        pageSetup.setTopMarginInch(1);
        pageSetup.setBottomMarginInch(1);
        pageSetup.setRightMarginInch(1);
        pageSetup.setLeftMarginInch(1);
        SheetRender sheetRender = new SheetRender(sheet, imageOrPrintOptions);
        int sheetPageCount = sheetRender.getPageCount();
        pageCount += sheetPageCount;
    }
    return pageCount;
}
Word: number of pages:
import com.aspose.words.Document;

public int getPageCount(String filePath) throws Exception {
    Document document = new Document(filePath);
    return document.getPageCount();
}
PowerPoint: number of slides:
import com.aspose.slides.*;

public int getPageCount(String filePath) throws Exception {
    Presentation presentation = new Presentation(filePath);
    return presentation.getSlides().toArray().length;
}