IText Unable to read whitespace from tabular data from PDF using Java

IText Unable to read whitespace from tabular data from PDF using Java - java

This question is already asked but the query i have is not answered. i have a pdf with table in which some columns are not having any values. I need to read those blank spaces.
I have used Itext pdf for extracting data from pdf but while reading the data from table it is read col by col and the column having no value is not read with white spaces but the next column is read.
I have customized LocationTextExtractionStrategy and have overridden getResultantText()
In below image if there is no value for MD and TD col 1,2,3 then while reading the PDF after 1 it is not giving me spaces but giving the next value that is 2. Is there any solution for this to read the blank spaces
PdfReader reader = new PdfReader(filename);
FontRenderFilter fontFilter = new FontRenderFilter();
TextExtractionStrategy strategy = new FilteredTextRenderListener(new MyLocationTextExtractionStrategy(),fontFilter);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
String finalText = PdfTextExtractor.getTextFromPage(reader, i, strategy);
System.out.println("finalText.." + finalText);
}

Related

Convert PDF to CSV or EXCEL

I am trying to convert PDF file to CSV or EXCEL format.
Here is the code I use to convert to CSV format:
public void convert() throws Exception {
PdfReader pdfReader = new PdfReader("example.pdf");
PdfDocument pdf = new PdfDocument(pdfReader);;
int pages = pdf.getNumberOfPages();
FileWriter csvWriter = new FileWriter("student.csv");
for (int i = 1; i <= pages; i++) {
PdfPage page = pdf.getPage(i);
String content = PdfTextExtractor.getTextFromPage(page);
String[] splitContents = content.split("\n");
boolean isTitle = true;
for (int j = 0; j < splitContents.length; j++) {
if (isTitle) {
isTitle = false;
continue;
}
csvWriter.append(splitContents[j].replaceAll(" ", " "));
csvWriter.append("\n");
}
}
csvWriter.flush();
csvWriter.close();
}
This code works correctly, but the fact is that the CSV format groups rows without taking into account existing columns (some of them are empty), so I would like to convert this file (PDF) to EXCEL format.
The PDF file itself is formed as a table.
What do I mean about spaces. For example, in a PDF file, in a table
| name | some data | | | some data 1 | |
+----------+----------------+------------+-------------+-------------------+--------------+
After converting to a CSV file, the line looks like this:
name some data some data 1
How can I get the same result as a PDF table?

I'd suggest to use PDFBox, like here: Parsing PDF files (especially with tables) with PDFBox
or another library that will allow you to check the data in the Table point by point, and will allow you to create a table by column width (something like Table table = page.getTable(dividers)); ).
If the width of the columns changes, you'll have to implement it based on the headers/first data column ([e.g. position.x of the last character of the first word] minus [position.x of the first character of the new word] - you'll have to figure it out yourself), it's hard so you could make it hardcoded in the beginning. Using Foxit Reader PDF App you can easily measure column width. Then, if you don't find any data in a particular column, you will be able to add an empty column in the CSV file. I know from my own experience that it is not easy, so I wish you good luck.

How to compute =SUM(Above) function in docx using apache poi

I am trying to work with apache poi for docx format file and I am stuck at using formulas in table. For instance see the image :
I did try setting text to "=SUM(ABOVE)" but it doesnt work this way.
I think I might need to set custom xml data here but I am not sure how to proceed. I tried following piece of code :
XWPFTable table = document.createTable();
//create first row
XWPFTableRow tableRowOne = table.getRow(0);
table.getRow(0).createCell();
table.getRow(0).getCell(0).setText("10");
table.getRow(0).createCell();
table.getRow(0).getCell(1).setText("=SUM(ABOVE)");

What I am doing in case of such requirements is as follows:
First, creating the simplest possible Word document having the required things in it using the Word GUI. Then have a look into what Word has created to get a idea what needs to be created using apache poi.
In concrete here:
Do creating the simplest possible table in Word which has a field {=SUM(ABOVE)} in it. Save that as *.docx. Now unzip that *.docx (Office Open XML files like *.docx are simply ZIP archive). Have a look at /word/document.xml in that archive. There you will find something like:
<w:tc>
<w:p>
<w:fldSimple w:instr="=SUM(ABOVE)"/>
...
</w:p>
</w:tc>
This is XML for a table cell having a paragraph having a fldSimple element in it where instr attribute contains the formula.
Now we know, we need the table cell XWPFTableCell and the XWPFParagraph in it. Then we need set a fldSimple element in this paragaraph where instr attribute contains the formula.
This would be as simple as
paragraphInCell.getCTP().addNewFldSimple().setInstr("=SUM(ABOVE)");
But of course something must tell Word the need to calculate the formula when the document opens. The simplest solution for this is setting the field "dirty". That leads to the need for updating the field while opening the document in Word. It also leads to a confirming message dialog about the need for updating.
Complete example using apache poi 4.1.0:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSimpleField;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
public class CreateWordTableSumAbove {
public static void main(String[] args) throws Exception {
XWPFDocument document= new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run=paragraph.createRun();
run.setText("The table:");
//create the table
XWPFTable table = document.createTable(4,3);
table.setWidth("100%");
for (int row = 0; row < 3; row++) {
for (int col = 0; col < 3; col++) {
if (col < 2) table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
else table.getRow(row).getCell(col).setText("" + ((row + 1) * 1234));
}
}
//set Sum row
table.getRow(3).getCell(0).setText("Sum:");
//get paragraph from cell where the sum field shall be contained
XWPFParagraph paragraphInCell = null;
if (table.getRow(3).getCell(2).getParagraphs().size() == 0) paragraphInCell = table.getRow(3).getCell(2).addParagraph();
else paragraphInCell = table.getRow(3).getCell(2).getParagraphs().get(0);
//set sum field in
CTSimpleField sumAbove = paragraphInCell.getCTP().addNewFldSimple();
sumAbove.setInstr("=SUM(ABOVE)");
//set sum field dirty, so it must be calculated while opening the document
sumAbove.setDirty(STOnOff.TRUE);
paragraph = document.createParagraph();
FileOutputStream out = new FileOutputStream("create_table.docx");
document.write(out);
out.close();
document.close();
}
}
That all only works properly when the document is opened using Microsoft Word. LibreOffice Writer is not able storing such formula fields into Office Open XML (*.docx) format nor is it able reading such Office Open XML formula fields properly.

replace string with unicode text in pdf file using PDFbox?

I need to read the strings from PDF file and replace it with the Unicode text.If it is ASCII chars everything is fine. But with Unicode characters, it showing question marks/junk text.No problem with font file(ttf) I am able to write a unicode text to the pdf file with a different class (PDFContentStream). With this class, there is no option to replace text but we can add new text.
Sample unicode text
Bɐɑɒ
issue (Address column)
https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing
I am using PDFBox.
Please help me with this.....
check the code I am using.....
enter image description herepublic static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
for (PDPage page : document.getPages()) {
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
//PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
resources.add(font);
//resources.add(font2);
//resources.add(PDType1Font.TIMES_ROMAN);
page.setResources(resources);
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operator and that is the string to display so lets update that
// operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
if (j == prej) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
if (searchString.equals(pstring.trim())) {
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size() - 1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
out.close();
page.setContents(updatedStream);
}
return document;
}

Your code utterly breaks the PDF, cf. the Adobe Preflight output:
The cause is obvious, your code
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);
drops the pre-existing page Resources and your replacement contains only a single font the name of which you allow PDFBox to choose arbitrarily.
You must not drop existing resources as they are used in your document.
Inspecting the content of your PDF page it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 either is a single byte encoding or a mixed single/multi-byte encoding; the lower single byte values appear to be encoded ASCII-like.
I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task
to read the strings from PDF file and replace it with the Unicode text
cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.
What you can implement instead is:
First run your source PDF through a customized text stripper which instead of extracting the plain text searches for your strings to replace and returns their positions. There are numerous questions and answers here that show you how to determine coordinates of strings in text stripper sub classes, a recent one being this one.
Next remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resource, obviously), replacing the strings by equally long strings of spaces might work even it is a dirty hack.
Finally add your replacements at the determined positions using a PDFContentStream in append mode; for this add your new font to the existing resources.
Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.

How to display barcodes in a matrix-like structure?

How can I display different barcodes in multiple columns in a PDF page using itext library to generate pdfs in java? I have to display 12 barcodes in the same PDF page in three columns, each one contains 4 barcodes (in other words it is a 4 by 3 matrix).

I've made a Barcodes example that does exactly what you need. See the resulting pdf: barcodes_table.pdf
There's nothing difficult about it. You just create a table with 4 column and you add 12 cell:
PdfPTable table = new PdfPTable(4);
table.setWidthPercentage(100);
for (int i = 0; i < 12; i++) {
table.addCell(createBarcode(writer, String.format("%08d", i)));
}
The createBarcode() method creates a cell with a barcode:
public static PdfPCell createBarcode(PdfWriter writer, String code) throws DocumentException, IOException {
BarcodeEAN barcode = new BarcodeEAN();
barcode.setCodeType(Barcode.EAN8);
barcode.setCode(code);
PdfPCell cell = new PdfPCell(barcode.createImageWithBarcode(writer.getDirectContent(), BaseColor.BLACK, BaseColor.GRAY), true);
cell.setPadding(10);
return cell;
}

Identify hidden text Word 2003/2007 using Apache POI

I am converting a Word (2003 and 2007) document to HTML format. I have managed to read the text, formats etc from the Word document. But the document contains some hidden text like 'Header Change History' which need not be displayed on the page. Is there any way to identify hidden texts from a Word document.
Any help will be much valuable.

I am not sure if this is a complete (or even accurate) solution, but for the files in the DOCX format, it seems that you can check if a character run is hidden by
XWPFRun cr;
if (cr.getCTR().getRPr().getVanish() != null){
// it is hidden
}
Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.

The following code snippet helps in identifying if the text is hidden
POIFSFileSystem fs = null;
boolean isHidden = false;
try {
fs = new POIFSFileSystem(new FileInputStream(filesname));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
String[] paragraphs = we.getParagraphText();
System.out.println("Word Document has " + paragraphs.length
+ " paragraphs");
Range range = doc.getRange();
for (int k = 0; k < range.numParagraphs(); k++) {
org.apache.poi.hwpf.usermodel.Paragraph paragraph = range
.getParagraph(k);
paragraph.text().trim();
paragraph.text().replaceAll("\\cM?\r?\n", "");
for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph
.getCharacterRun(j);
if (cr.isVanished()) {
// it is hidden
System.out.println("text is hidden ");
isHidden = true;
break;
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.