Counting pages in a Word document - java

I'm trying to count pages from a word document with java.
This is my actual code, i'm using the Apache POI libraries
String path1 = "E:/iugkh";
File f = new File(path1);
File[] files = f.listFiles();
int pagesCount = 0;
for (int i = 0; i < files.length; i++) {
POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(files[i]));
HWPFDocument wdDoc = new HWPFDocument(fis);
int pagesNo = wdDoc.getSummaryInformation().getPageCount();
pagesCount += pagesNo;
System.out.println(files[i].getName()+":\t"+pagesNo);
}
The output is:
ten.doc: 1
twelve.doc: 1
nine.doc: 1
one.doc: 1
eight.doc: 1
4teen.doc: 1
5teen.doc: 1
six.doc: 1
seven.doc: 1
And this is not what i expected, as the first three documents' page length is 4 and the other are from 1 to 5 pages long.
What am i missing?
Do i have to use another library to count the pages correctly?
Thanks in advance

This may help you. It counts the number of form feeds (sometimes used to separate pages), but I'm not sure if it's gonna work for all documents (I guess it does not).
WordExtractor extractor = new WordExtractor(document);
String[] paragraphs = extractor.getParagraphText();
int pageCount = 1;
for (int i = 0; i < paragraphs.length; ++i) {
if (paragraphs[i].indexOf("\f") >= 0) {
++pageCount;
}
}
System.out.println(pageCount);

This alas is a bug some versions of Word (pre-2010 versions apparently, possibly just in Word 9.0 aka 2000) or at least in some versions of the COM previewer that's used to count the pages. The apache devs refused to implement a workaround for it: https://issues.apache.org/jira/browse/TIKA-1523
In fact when you open the file in Word, it of course shows the real pages and it also recalculates the count, but initially it also shows "1". But here, the metadata as saved in the file is simply "1" or maybe nothing (see below). POI does not "reflow" the layout to calculate that information.
This is why the metadata is only updated by the word processing program on opening and editing the file. If you instruct Word 2010 to open the file "read only" (which it does because its downloaded from internet), it shows "" in the page column. See 2nd screenshot. So clearly a bug in this file, not TIKA's or POI's issue.
I also found in there that the bug (for Word 9.0/2000) was confirmed by MS: http://support.microsoft.com/kb/212653/en-us
If opening and re-closing with a new version of Word is not possible/available, another workaround would be to covert the documents to pdf (or even xps) and count the pages of that.

Related

IText Unable to read whitespace in PDF using Java

I am trying to read a PDF file trough IText,
Program successfully read pdf file but unable to include spaces.
program:
public void parse(String filename) throws IOException {
PdfReader reader = new PdfReader(filename);
PdfReaderContentParser pdfReaderContentParser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy = null;
for (int i=1; i<= reader.getNumberOfPages(); i++) {
String text = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
System.out.println(text);
}
}
here is data need to get from pdf
When program is reading the pdf then output is:
DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
01-04-2017 B/F 54,396.82
if you see in image Date is 01-04-2017 , MODE have empty PARTICULARS value is B/F, DEPOSITS and WITHDRAWALS is also empty value and BALANCE is 54,396.82
same data i need in text form
e.g.-->
DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
01-04-2017 B/F 54,396.82
Need help, thanks in advance.
You are extracting text from the PDF, the result is correct, it is not missing spaces, as there are no spaces in the raw text.
However (I missed that earlier, so I'm editing), you are using a LocationTextExtractionStrategy, which is "table-aware". This is good, but at the end getTextFromPage discards that table-aware information.
So instead you could create your own strategy implementation that would extend LocationTextExtractionStrategy, add a getTabulatedText() method to spit out the text with spaces inserted where you want them. Take inspiration from getResultantText(), see how it inserts a single space between each cell... In your code you would insert as many spaces (or tabs) as needed. See this answer for an example.
MyTextExtractionStrategy strategy = new MyTextExtractionStrategy();
for (int i=1; i<= reader.getNumberOfPages(); i++) {
String rawText = PdfTextExtractor.getTextFromPage(reader, i, strategy);
String tabulatedText = strategy.getTabulatedText();
System.out.println(text);
}
(maybe there is a "strategy" implementation that already does that, but I don't know it)

Change order pages of PDF document in iTextSharp

I'm trying to change reorder pages of my PDF document, but i can't and I don't know why.
I read several articals about changing order, it's java(iText) and i have got few problems with it.(exampl1, exampl2, example3). This example on c#, but there is using other method(exampl4)
I want take my TOC on 12 page and put to 2 page. After 12 page I have other content. This is my template for change order of pages:
String.Format("1,%s, 2-%s, %s-%s", toc, toc-1, toc+1, n)
This is my method for changing order of pages:
public void ChangePageOrder(string path)
{
MemoryStream baos = new MemoryStream();
PdfReader sourcePDFReader = new PdfReader(path);
int toc = 12;
int n = sourcePDFReader.NumberOfPages;
sourcePDFReader.SelectPages(String.Format("1,%s, 2-%s, %s-%s", toc, toc-1, toc+1, n));
using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
{
PdfStamper stamper = new PdfStamper(sourcePDFReader, fs);
stamper.Close();
}
}
Here is call to method:
...
doc.Close();
ChangePageOrder(filePath);
What am I doing not right?
Thank you.
Your code can't work because you are using path to create the PdfReader as well as to create the FileStream. You probably get an error such as "The file is in use" or "The file can't be accessed".
This is explained here:
StackOverflow: How to update a PDF without creating a new PDF?
Official web site:
How to update a PDF without creating a new PDF?
You create a MemoryStream() named baos, but you aren't using that object anywhere. One way to solve your problem, is to replace the FileStream when you first create your PDF by that MemoryStream, and then use the bytes stored in that memory stream to create a PdfReader instance. In that case, PdfStamper won't be writing to a file that is in use.
Another option would be to use a different path. For instance: first you write the document to a file named my_story_unordered.pdf (created by PdfWriter), then you write the document to a file named my_story_reordered.pdf (created by PdfStamper).
It's also possible to create the final document in one go. In that case, you need to switch to linear mode. There's an example in my book "iText in Action - Second Edition" that shows how to do this: MovieHistory1
In the C# port of this example, you have:
writer.SetLinearPageMode();
In normal circumstances, iText will create a page tree with branches and leaves. As soon a a branch has more than 10 leaves, a new branch is created. With setLinearPageMode(), you tell iText not to do this. The complete page tree will consist of one branch with nothing but leaves (no extra branches). This is bad from the point of view of performance when viewing the document, but it's acceptable if the number of pages in your document is limited.
Once you've switched to page mode, you can reorder the pages like this:
document.NewPage();
// get the total number of pages that needs to be reordered
int total = writer.ReorderPages(null);
// change the order
int[] order = new int[total];
for (int i = 0; i < total; i++) {
order[i] = i + toc;
if (order[i] > total) {
order[i] -= total;
}
}
// apply the new order
writer.ReorderPages(order);
Summarized: if your document doesn't have many pages, use the ReorderPages method. If your document has many pages, use the method you've been experimenting with, but do it correctly. Don't try to write to the file that you are still trying to read.
Without going into details about what you should do you can loop through all pages from a pdf, put them into a new pdf doc with all the pages. You can put your logic inside the for loop.
reader = new PdfReader(sourcePDFpath);
sourceDocument = new Document(reader.GetPageSizeWithRotation(startpage));
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPDFpath, System.IO.FileMode.Create));
sourceDocument.Open();
for (int i = startpage; i <= endpage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();

Printing multiple word documents corrupts every 2nd file

Im saving (generated) Word documents as file via jacob by printing them into a file (i have to do it like this because the file is required from legacy programms)
The problem is if i do this for multiple files, every second file is not written correctly.
The first file is ok.
2nd is only written about 80% of the file.
3rd is ok
4th is the same as the 2nd (exactly the same filesize as the 2nd)
... and so on
This is my code.
public static void main(String args[]) {
Variant background = new Variant(false);
Variant append = new Variant(false);
Variant range = new Variant(0);//wdPrintAllDocument
ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch doc = Dispatch.call(document, "Open", "c:/temp/Test.rtf").toDispatch();
for (int i = 0; i < 10; i++) {
Dispatch.call(doc, "PrintOut", background, append, range, new Variant("c:/temp/test" + i));
}
Dispatch.call(doc, "Close", 0);
Dispatch.call(oleComponent, "Quit");
}
The problem appears on different printers (the pdf printer for example works)
why every 2nd file? word problem? printer (driver) problem?
help is very much appreciated

Java use 1 billion digits of pi

I am trying to make a program that will search the first 1 billion digits of pi and will find a user specified number, the problem I am facing is how to use pi... I have a .txt file that contains pi (I also broke it to 96 different files because java couldn't handle such a big file) all the digits are in the first line....
Code (only to read and save pi using the 96 files):
for(int i = 1;i <= 96; i++){
String filename = "";
if(i <= 9){
filename = "res//t//output2_00000" + i + "(500001).txt";
}else{
filename = "res//t//output2_0000" + i + "(500001).txt";
}
Scanner inFile = new Scanner(new FileReader(filename));
ar.add(inFile.nextLine());
}
List<String> pi = new ArrayList<String>();
for(int i = 0; i<97;i++){
System.out.println(i);
for(String j : ar.get(i).split("")){
pi.add(j);
}
}
This seems to work fine up to a point where it crashes with the following error (the last print statement is 3):
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.String.substring(Unknown Source)
at java.lang.String.subSequence(Unknown Source)
at java.util.regex.Pattern.split(Unknown Source)
at java.lang.String.split(Unknown Source)
at java.lang.String.split(Unknown Source)
at main.Main.main(Main.java:29)
Is there a way to overcome that, and is there a way to make it go faster?
Thanks in advance.
You are not required to load the whole file in memory. With RandomAccessFile, you can open a file, place the cursor at the place you want and read from it :
RandomAccessFile raf = new RandomAccessFile(
new File("/home/adenoyelle/dev/pi.txt"), "r");
raf.seek(1_000_000);
System.out.println(raf.read());
Note : raf.read() returns a byte of data. You might need to reinterpret it depending on what you need.
Example :
for(int i = 0; i< 10; i++) {
raf.seek(i);
System.out.println((char)raf.read());
}
Output :
3
.
1
4
1
5
9
2
6
5
Note 2 : As stated by SaviourSelf, if you need to read multiple bytes at a time, prefer read(byte [] b).
Don't split up the text file: that's the wrong solution, finding a number that's split across a file will be a pain. Of course Java can handle large files: how else do you think databases written in Java work?!
Consider using the Apache Commons IO, which gives you a LineIterator:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8"/*probably*/);
try {
while (it.hasNext()) {
String line = it.nextLine();
// do something with line
}
} finally {
LineIterator.closeQuietly(it);
}
It is likely that you go out of heap memory if you try to load over 1GB of data into the heap. Just check every single file for the search string and then close the file.

How to get the page count of a microsoft word document in java?

for a server based j2ee application, I need to retrieve the number of pages from word documents.. any ideas what works?
If the documents are modern Word 2007 format you can use direct XML-based manipulation, through OOXML. This is by far the better long term solution, though I realize it may not be realistic for an entire organization to change overnight.
If they are older Word formats, you're probably stuck with server-side Word/Excel/Powerpoint/Outlook programmable object models, although you're not supposed to do that on the server..
Regarding Office Open XML support, the latest beta of Java-POI is supposed to support it.
Haven't used it before but you could try Apache POI. Looks like it has a WordCount function.
//Open the Word Document
Document doc = new Document("C:\\Temp\\file.doc");
//Get page count
int pageCount = doc.getPageCount();
To read the page count of MS Office files you can use aspose libraries (aspose-words, aspose-cells, aspose-slides).
Examples:
Excel:
number of pages of the printable version of the Woorkbook:
import com.aspose.cells.*;
public int getPageCount(String filePath) throws Exception {
Workbook book = new Workbook(filePath);
ImageOrPrintOptions imageOrPrintOptions = new ImageOrPrintOptions();
// Default 0 Prints all pages.
// IgnoreBlank 1 Don't print the pages which the cells are blank.
// IgnoreStyle 2 Don't print the pages which cells only contain styles.
imageOrPrintOptions.setPrintingPage(PrintingPageType.IGNORE_STYLE);
int pageCount = 0;
for (int i = 0; i < book.getWorksheets().getCount(); i++) {
Worksheet sheet = book.getWorksheets().get(i);
PageSetup pageSetup = sheet.getPageSetup();
pageSetup.setOrientation(PageOrientationType.PORTRAIT);
pageSetup.setPaperSize(PaperSizeType.PAPER_LETTER);
pageSetup.setTopMarginInch(1);
pageSetup.setBottomMarginInch(1);
pageSetup.setRightMarginInch(1);
pageSetup.setLeftMarginInch(1);
SheetRender sheetRender = new SheetRender(sheet, imageOrPrintOptions);
int sheetPageCount = sheetRender.getPageCount();
pageCount += sheetPageCount;
}
return pageCount;
}
Word: number of pages:
import com.aspose.words.Document;
public int getPageCount(String filePath) throws Exception {
Document document = new Document(filePath);
return document.getPageCount();
}
PowerPoint: number of slides:
import com.aspose.slides.*;
public int getPageCount(String filePath) throws Exception {
Presentation presentation = new Presentation(filePath);
return presentation.getSlides().toArray().length;
}

Categories

Resources