IText Unable to read whitespace in PDF using Java

I am trying to read a PDF file through iText.
The program reads the PDF successfully, but the extracted text does not preserve the spacing between columns.
Program:
public void parse(String filename) throws IOException {
PdfReader reader = new PdfReader(filename);
PdfReaderContentParser pdfReaderContentParser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy = null;
for (int i=1; i<= reader.getNumberOfPages(); i++) {
String text = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
System.out.println(text);
}
}
The data I need from the PDF is laid out as a table.
When the program reads the PDF, the output is:
DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
01-04-2017 B/F 54,396.82
In the PDF, DATE is 01-04-2017, MODE is empty, PARTICULARS is B/F, DEPOSITS and WITHDRAWALS are also empty, and BALANCE is 54,396.82.
I need the same data in text form, with each value under its column heading, e.g.:
DATE        MODE  PARTICULARS  DEPOSITS  WITHDRAWALS  BALANCE
01-04-2017        B/F                                 54,396.82
Need help, thanks in advance.

You are extracting text from the PDF, and the result is correct: no spaces are missing, because there are no spaces in the raw text. The columns are placed by coordinates, not by space characters.
However (I missed that earlier, so I'm editing), you are using a LocationTextExtractionStrategy, which is "table-aware" in the sense that it keeps track of where each piece of text sits on the page. This is good, but in the end getTextFromPage discards that location information.
So instead you could create your own strategy implementation that extends LocationTextExtractionStrategy and adds a getTabulatedText() method that outputs the text with spaces inserted where you want them. Take inspiration from getResultantText() and see how it inserts a single space between each cell. In your code you would insert as many spaces (or tabs) as needed. See this answer for an example.
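for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Use a fresh strategy per page so text from earlier pages does not accumulate.
    MyTextExtractionStrategy strategy = new MyTextExtractionStrategy();
    // Runs the extraction; the plain single-space result is not needed here.
    PdfTextExtractor.getTextFromPage(reader, i, strategy);
    String tabulatedText = strategy.getTabulatedText();
    System.out.println(tabulatedText);
}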
(maybe there is a "strategy" implementation that already does that, but I don't know it)
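To make the "insert as many spaces as needed" idea concrete, here is a minimal sketch in plain Java. It does not use the iText API: PositionedText, TabulatedLine and the unitsPerChar value are all hypothetical names made up for illustration. In a real strategy the x coordinate would presumably come from the render info that iText passes to renderText, and the average character width would have to be tuned for the document.

```java
import java.util.List;

/** Hypothetical stand-in for one piece of text and its position; not an iText class. */
final class PositionedText {
    final String text;
    final float x; // left edge of the text on the page, in PDF units

    PositionedText(String text, float x) {
        this.text = text;
        this.x = x;
    }
}

final class TabulatedLine {
    /**
     * Lays out the pieces of one text line on a character grid.
     * unitsPerChar is an assumed average glyph width used to map x to a column.
     * The pieces are expected to be sorted by x already.
     */
    static String render(List<PositionedText> piecesSortedByX, float unitsPerChar) {
        StringBuilder line = new StringBuilder();
        for (PositionedText piece : piecesSortedByX) {
            int column = Math.round(piece.x / unitsPerChar);
            // Always keep at least one space between neighbouring pieces.
            if (line.length() > 0 && column <= line.length()) {
                column = line.length() + 1;
            }
            // Pad with spaces until the target column is reached.
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(piece.text);
        }
        return line.toString();
    }
}
```

With pieces at x = 0, 120 and 300 and unitsPerChar = 6, the second piece starts at column 20 and the third at column 50, so empty cells show up as runs of spaces instead of being collapsed.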

Related

ArrayList<String> in PDF from a new row

I want to send a survey as a PDF from Java, and I have tried different methods. I tried with StringBuffer and without it, but the text in the PDF always ends up on one line.
public void writePdf(OutputStream outputStream) throws Exception {
Paragraph paragraph = new Paragraph();
Document document = new Document();
PdfWriter.getInstance(document, outputStream);
document.open();
document.addTitle("Survey PDF");
ArrayList nameArrays = new ArrayList();
StringBuffer sb = new StringBuffer();
int i = -1;
for (String properties : textService.getAnswer()) {
nameArrays.add(properties);
i++;
}
for (int a= 0; a<=i; a++){
System.out.println("nameArrays.get(a) -"+nameArrays.get(a));
sb.append(nameArrays.get(a));
}
paragraph.add(sb.toString());
document.add(paragraph);
document.close();
}
textService.getAnswer() returns an ArrayList<String>.
Could you please advise how to separate the text so that each sentence starts on a new line?
Currently everything appears on a single line.

You forgot the newline character \n and your code seems a bit overcomplicated.
Try this:
StringBuffer sb = new StringBuffer();
for (String property : textService.getAnswer()) {
sb.append(property);
sb.append('\n');
}
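If you would rather not write the loop at all, the standard library can join the strings for you. This is only a sketch: AnswerText and onSeparateLines are made-up names, and the list stands in for whatever textService.getAnswer() returns. Unlike the loop above, it puts the newline between entries only, so there is no trailing newline.

```java
import java.util.List;

final class AnswerText {
    /** Joins the answers so that each one starts on its own line. */
    static String onSeparateLines(List<String> answers) {
        // String.join places the separator between elements only.
        return String.join("\n", answers);
    }
}
```

The result can then be passed to paragraph.add(...) in place of sb.toString().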

What about:
nameArrays.add(properties+"\n");

You might be able to fix that by simply appending "\n" to the strings that you are collecting in your list, but I think that very much depends on the PDF library you are using.
You see, "newlines" or "paragraphs" are to a certain degree about formatting. It seems like a conceptual problem to add that "formatting" information to the data that you are processing.
Meaning: you might want to check whether your library allows you to provide plain strings and then does the formatting for you!
In other words: instead of passing in strings with newlines, check whether you can keep using strings without newlines and have the PDF library add line breaks where appropriate.
Side note on code quality: you are using raw types:
ArrayList nameArrays = new ArrayList();
would be better written as
ArrayList<String> names = new ArrayList<>();
[ I also changed the name - there is no point in putting the type of a collection into the variable name! ]

This method saves the values of an ArrayList into a PDF document. In the mFilePath variable you can put a folder name in place of "/", for example "/example/".
For the mFileName variable you can use any name. I use the date and time at which the document is created. Don't use a static name, otherwise each run overwrites the same PDF.
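private void savePDF()
{
    com.itextpdf.text.Document mDoc = new com.itextpdf.text.Document();
    // Pattern letters are case-sensitive: yyyy = year, MM = month, dd = day of month, mm = minutes, ss = seconds.
    String mFileName = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.getDefault()).format(System.currentTimeMillis());
    String mFilePath = Environment.getExternalStorageDirectory() + "/" + mFileName + ".pdf";
    try
    {
        PdfWriter.getInstance(mDoc, new FileOutputStream(mFilePath));
        mDoc.open();
        // answers is the list of strings to write; each entry becomes its own paragraph.
        for (int d = 0; d < answers.size(); d++)
        {
            String mtext = answers.get(d);
            mDoc.add(new Paragraph(mtext));
        }
        mDoc.close();
    }
    catch (Exception e)
    {
        // Report the failure instead of silently swallowing it.
        e.printStackTrace();
    }
}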
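A note on the timestamp in the file name: date pattern letters are case-sensitive, so in both SimpleDateFormat and java.time "MM" means month while "mm" means minutes, and "dd" is the day of the month while "DD" is the day of the year. Here is a small sketch of the naming part using java.time; PdfFileNames and timestamped are names I made up, and on older Android versions without java.time the same pattern string should work with SimpleDateFormat.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class PdfFileNames {
    // yyyy = year, MM = month, dd = day of month, HH = hour (0-23), mm = minute, ss = second
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss");

    /** Builds a file name such as 2017-04-01-13-05-09.pdf for the given moment. */
    static String timestamped(LocalDateTime moment) {
        return STAMP.format(moment) + ".pdf";
    }
}
```

Calling PdfFileNames.timestamped(LocalDateTime.now()) gives a name that changes every second, so consecutive runs do not overwrite each other.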

Change order pages of PDF document in iTextSharp

I'm trying to reorder the pages of my PDF document, but I can't and I don't know why.
I have read several articles about changing the page order, but they are for Java (iText) and I ran into a few problems with them (example1, example2, example3). There is also a C# example, but it uses a different method (example4).
I want to take my TOC from page 12 and move it to page 2. After page 12 I have other content. This is my template for changing the page order:
String.Format("1,%s, 2-%s, %s-%s", toc, toc-1, toc+1, n)
This is my method for changing order of pages:
public void ChangePageOrder(string path)
{
MemoryStream baos = new MemoryStream();
PdfReader sourcePDFReader = new PdfReader(path);
int toc = 12;
int n = sourcePDFReader.NumberOfPages;
sourcePDFReader.SelectPages(String.Format("1,%s, 2-%s, %s-%s", toc, toc-1, toc+1, n));
using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
{
PdfStamper stamper = new PdfStamper(sourcePDFReader, fs);
stamper.Close();
}
}
Here is call to method:
...
doc.Close();
ChangePageOrder(filePath);
What am I doing wrong?
Thank you.

Your code can't work because you are using path to create the PdfReader as well as to create the FileStream. You probably get an error such as "The file is in use" or "The file can't be accessed".
This is explained here:
StackOverflow: How to update a PDF without creating a new PDF?
Official web site:
How to update a PDF without creating a new PDF?
You create a MemoryStream() named baos, but you aren't using that object anywhere. One way to solve your problem, is to replace the FileStream when you first create your PDF by that MemoryStream, and then use the bytes stored in that memory stream to create a PdfReader instance. In that case, PdfStamper won't be writing to a file that is in use.
Another option would be to use a different path. For instance: first you write the document to a file named my_story_unordered.pdf (created by PdfWriter), then you write the document to a file named my_story_reordered.pdf (created by PdfStamper).
It's also possible to create the final document in one go. In that case, you need to switch to linear mode. There's an example in my book "iText in Action - Second Edition" that shows how to do this: MovieHistory1
In the C# port of this example, you have:
writer.SetLinearPageMode();
In normal circumstances, iText will create a page tree with branches and leaves. As soon as a branch has more than 10 leaves, a new branch is created. With setLinearPageMode(), you tell iText not to do this. The complete page tree will consist of one branch with nothing but leaves (no extra branches). This is bad from the point of view of performance when viewing the document, but it's acceptable if the number of pages in your document is limited.
Once you've switched to linear page mode, you can reorder the pages like this:
document.NewPage();
// get the total number of pages that needs to be reordered
int total = writer.ReorderPages(null);
// change the order
int[] order = new int[total];
for (int i = 0; i < total; i++) {
order[i] = i + toc;
if (order[i] > total) {
order[i] -= total;
}
}
// apply the new order
writer.ReorderPages(order);
Summarized: if your document doesn't have many pages, use the ReorderPages method. If your document has many pages, use the method you've been experimenting with, but do it correctly. Don't try to write to the file that you are still trying to read.
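For the specific case in the question (TOC on page 12 moved to position 2), the new order can be computed independently of iText. The following is a plain Java sketch with made-up names (PageOrder, tocToSecondPage, asSelection); I am assuming 1-based page numbers, which is what the loop above produces, and that a plain comma-separated list is acceptable as a page selection. Also note that C#'s String.Format expects {0}-style placeholders rather than %s, so the template in the question would need adapting either way.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class PageOrder {
    /** Returns the new 1-based page order: page 1, the TOC page, then all remaining pages. */
    static int[] tocToSecondPage(int toc, int total) {
        if (toc < 2 || toc > total) {
            throw new IllegalArgumentException("toc must be between 2 and " + total);
        }
        int[] order = new int[total];
        int next = 0;
        order[next++] = 1;   // the first page stays where it is
        order[next++] = toc; // the TOC becomes the second page
        for (int page = 2; page <= total; page++) {
            if (page != toc) {
                order[next++] = page; // everything else keeps its relative order
            }
        }
        return order;
    }

    /** Renders an order as a comma-separated page list, e.g. "1,4,2,3". */
    static String asSelection(int[] order) {
        return Arrays.stream(order)
                .mapToObj(String::valueOf)
                .collect(Collectors.joining(","));
    }
}
```

The int[] would be the argument for the reorder call, while the string form corresponds to the page-selection approach from the question.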

Without going into the details of your particular case: you can loop through all the pages of a PDF and copy them into a new PDF document. You can put your reordering logic inside the for loop.
reader = new PdfReader(sourcePDFpath);
sourceDocument = new Document(reader.GetPageSizeWithRotation(startpage));
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPDFpath, System.IO.FileMode.Create));
sourceDocument.Open();
for (int i = startpage; i <= endpage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();

Printing multiple word documents corrupts every 2nd file

I'm saving (generated) Word documents as files via Jacob by printing them to a file (I have to do it this way because the file is required by legacy programs).
The problem is that when I do this for multiple files, every second file is not written correctly.
The first file is OK.
The 2nd file is only about 80% written.
The 3rd is OK.
The 4th is the same as the 2nd (exactly the same file size as the 2nd)
... and so on
This is my code.
public static void main(String args[]) {
Variant background = new Variant(false);
Variant append = new Variant(false);
Variant range = new Variant(0);//wdPrintAllDocument
ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch doc = Dispatch.call(document, "Open", "c:/temp/Test.rtf").toDispatch();
for (int i = 0; i < 10; i++) {
Dispatch.call(doc, "PrintOut", background, append, range, new Variant("c:/temp/test" + i));
}
Dispatch.call(doc, "Close", 0);
Dispatch.call(oleComponent, "Quit");
}
The problem appears with different printers (the PDF printer, for example, works).
Why every 2nd file? Is it a Word problem, or a printer (driver) problem?
Help is very much appreciated.

Apache POI: find characters in Word document without spaces

I want to read the number of characters without spaces in a Word document using Apache POI.
I can get the number of characters with spaces using the SummaryInformation.getCharCount() method as in the following code:
public void countCharacters() throws FileNotFoundException, IOException {
File wordFile = new File(BASE_PATH, "test.doc");
POIFSFileSystem p = new POIFSFileSystem(new FileInputStream(wordFile));
HWPFDocument doc = new HWPFDocument(p);
SummaryInformation props = doc.getSummaryInformation();
int numOfCharsWithSpaces = props.getCharCount();
System.out.println(numOfCharsWithSpaces);
}
However there seems to be no method for returning the number of characters without spaces.
How do I find this value?

If you want to base this on the metadata of the document, all you will get is estimates (according to the Microsoft specs). There are essentially two values which you can play around with:
GKPIDSI_CHARCOUNT (which is what you already accessed in your own code sample)
GKPIDDSI_CCHWITHSPACES
Don't ask me about the exact differences of those two values, though. I haven't designed this stuff...
Below is a code sample to illustrate the access to them (GKPIDDSI_CCHWITHSPACES is a little awkward):
HWPFDocument document = [...];
SummaryInformation summaryInformation = document.getSummaryInformation();
System.out.println("GKPIDSI_CHARCOUNT: " + summaryInformation.getCharCount());
DocumentSummaryInformation documentSummaryInformation = document.getDocumentSummaryInformation();
Integer count = null;
for (Property property : documentSummaryInformation.getProperties()) {
if (property.getID() == 0x11) {
count = (Integer) property.getValue();
break;
}
}
System.out.println("GKPIDDSI_CCHWITHSPACES: " + count);
The moment at which Word's internal algorithm that updates those values kicks in is rather unpredictable to me. So what you see in Word's own statistics may not necessarily be the same as when running the above code.
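If an estimate from the metadata is not good enough, a different route is to count the characters yourself from the extracted text. How you get the text is up to you (POI's WordExtractor would be one candidate, but that part is an assumption and not shown here); the counting itself is plain Java. CharCounter and withoutSpaces are names made up for this sketch.

```java
final class CharCounter {
    /** Counts the Unicode code points in the text that are not whitespace. */
    static long withoutSpaces(String text) {
        return text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
    }
}
```

Be aware that Character.isWhitespace does not treat the non-breaking space (U+00A0) as whitespace, so depending on what you want to count you may prefer Character.isSpaceChar, or a combination of both. The result may also differ from Word's own statistics, which follow Word's rules rather than Java's.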

How to get the page count of a microsoft word document in java?

For a server-based J2EE application, I need to retrieve the number of pages from Word documents. Any ideas what works?

If the documents are modern Word 2007 format you can use direct XML-based manipulation, through OOXML. This is by far the better long term solution, though I realize it may not be realistic for an entire organization to change overnight.
If they are older Word formats, you're probably stuck with the server-side Word/Excel/PowerPoint/Outlook programmable object models, although you're not supposed to do that on the server.
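For the OOXML route, a .docx file is a zip archive, and the extended properties part (docProps/app.xml) usually carries a Pages element with the page count that Word cached when it last saved the file. Here is a rough sketch using only the JDK (Java 9+ for readAllBytes). DocxPageCount and its methods are made-up names, the regex is a shortcut instead of a proper XML parser, and the value is only as fresh as the last save by an application that bothers to update it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class DocxPageCount {
    private static final Pattern PAGES = Pattern.compile("<Pages>\\s*(\\d+)\\s*</Pages>");

    /** Extracts the cached page count from the text of docProps/app.xml, or -1 if absent. */
    static int pagesFromAppXml(String appXml) {
        Matcher m = PAGES.matcher(appXml);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Reads docProps/app.xml from a .docx file and returns its cached page count, or -1. */
    static int pageCount(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("docProps/app.xml");
            if (entry == null) {
                return -1; // no extended properties part in this file
            }
            try (InputStream in = zip.getInputStream(entry)) {
                String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                return pagesFromAppXml(xml);
            }
        }
    }
}
```

A return value of -1 means the information is not stored in the file, in which case the document would have to be laid out by something that can actually render it.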

Regarding Office Open XML support, the latest beta of Java-POI is supposed to support it.

Haven't used it before but you could try Apache POI. Looks like it has a WordCount function.

//Open the Word Document
Document doc = new Document("C:\\Temp\\file.doc");
//Get page count
int pageCount = doc.getPageCount();

To read the page count of MS Office files you can use the Aspose libraries (aspose-words, aspose-cells, aspose-slides).
Examples:
Excel:
number of pages of the printable version of the Workbook:
import com.aspose.cells.*;
public int getPageCount(String filePath) throws Exception {
Workbook book = new Workbook(filePath);
ImageOrPrintOptions imageOrPrintOptions = new ImageOrPrintOptions();
// Default 0 Prints all pages.
// IgnoreBlank 1 Don't print the pages which the cells are blank.
// IgnoreStyle 2 Don't print the pages which cells only contain styles.
imageOrPrintOptions.setPrintingPage(PrintingPageType.IGNORE_STYLE);
int pageCount = 0;
for (int i = 0; i < book.getWorksheets().getCount(); i++) {
Worksheet sheet = book.getWorksheets().get(i);
PageSetup pageSetup = sheet.getPageSetup();
pageSetup.setOrientation(PageOrientationType.PORTRAIT);
pageSetup.setPaperSize(PaperSizeType.PAPER_LETTER);
pageSetup.setTopMarginInch(1);
pageSetup.setBottomMarginInch(1);
pageSetup.setRightMarginInch(1);
pageSetup.setLeftMarginInch(1);
SheetRender sheetRender = new SheetRender(sheet, imageOrPrintOptions);
int sheetPageCount = sheetRender.getPageCount();
pageCount += sheetPageCount;
}
return pageCount;
}
Word: number of pages:
import com.aspose.words.Document;
public int getPageCount(String filePath) throws Exception {
Document document = new Document(filePath);
return document.getPageCount();
}
PowerPoint: number of slides:
import com.aspose.slides.*;
public int getPageCount(String filePath) throws Exception {
Presentation presentation = new Presentation(filePath);
return presentation.getSlides().toArray().length;
}
