Convert PDF to CSV or EXCEL

I am trying to convert a PDF file to CSV or Excel format.
Here is the code I use to convert it to CSV:
public void convert() throws Exception {
    PdfReader pdfReader = new PdfReader("example.pdf");
    PdfDocument pdf = new PdfDocument(pdfReader);
    int pages = pdf.getNumberOfPages();
    FileWriter csvWriter = new FileWriter("student.csv");
    for (int i = 1; i <= pages; i++) {
        PdfPage page = pdf.getPage(i);
        String content = PdfTextExtractor.getTextFromPage(page);
        String[] splitContents = content.split("\n");
        boolean isTitle = true;
        for (int j = 0; j < splitContents.length; j++) {
            // skip the first line of every page (the title)
            if (isTitle) {
                isTitle = false;
                continue;
            }
            csvWriter.append(splitContents[j].replaceAll(" ", " "));
            csvWriter.append("\n");
        }
    }
    csvWriter.flush();
    csvWriter.close();
}
This code runs, but the CSV output collapses each row without taking the existing columns into account (some of them are empty), so I would like to convert this PDF file to Excel format instead.
The PDF file itself is laid out as a table.
Here is what I mean about the spaces. For example, the PDF file contains this table row:
| name | some data | | | some data 1 | |
+----------+----------------+------------+-------------+-------------------+--------------+
After converting to a CSV file, the line looks like this:
name some data some data 1
How can I get the same column layout as in the PDF table?

I'd suggest using PDFBox, as in: Parsing PDF files (especially with tables) with PDFBox
or another library that lets you inspect the table data position by position and build a table from column widths (something like Table table = page.getTable(dividers);).
If the column widths vary, you'll have to derive them from the headers or the first data column (e.g. [position.x of the last character of the first word] minus [position.x of the first character of the next word]; you'll have to figure out the details yourself). That is hard, so you could hard-code the widths to begin with. With the Foxit Reader PDF app you can easily measure column widths. Then, if you don't find any data in a particular column, you can add an empty column to the CSV file. I know from my own experience that this is not easy, so I wish you good luck.
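The column-by-position idea above can be sketched without any PDF library. The following is only an illustration under assumptions: ColumnAligner, Chunk and toCsvRow are hypothetical names (not part of iText or PDFBox), the x coordinates would come from whatever position-aware extraction your library offers, and the column edges are hard-coded as suggested above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of iText or PDFBox.
final class ColumnAligner {

    // A piece of extracted text plus the x coordinate where it starts.
    static final class Chunk {
        final String text;
        final float x;

        Chunk(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    // columnStarts holds the left edge of every column in ascending order
    // (assumed to be measured by hand or derived from the header row).
    // Note: values are not CSV-escaped; text containing commas needs quoting.
    static String toCsvRow(List<Chunk> chunks, float[] columnStarts) {
        String[] cells = new String[columnStarts.length];
        Arrays.fill(cells, "");
        for (Chunk chunk : chunks) {
            int col = 0;
            // pick the last column whose left edge is at or before the chunk
            for (int i = 0; i < columnStarts.length; i++) {
                if (chunk.x >= columnStarts[i]) {
                    col = i;
                }
            }
            cells[col] = cells[col].isEmpty() ? chunk.text : cells[col] + " " + chunk.text;
        }
        // columns that received no chunk stay empty but are still emitted
        return String.join(",", cells);
    }

    public static void main(String[] args) {
        float[] columnStarts = {0f, 60f, 150f, 220f, 290f, 400f};
        List<Chunk> row = new ArrayList<>();
        row.add(new Chunk("name", 5f));
        row.add(new Chunk("some data", 65f));
        row.add(new Chunk("some data 1", 295f));
        System.out.println(toCsvRow(row, columnStarts));
    }
}
```

With the made-up coordinates in main, the row from the question is printed as name,some data,,,some data 1, so the three empty columns survive as empty CSV fields.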

Related

How to compute =SUM(Above) function in docx using apache poi

I am working with Apache POI on a docx file and I am stuck on using formulas in a table.
I tried setting the cell text to "=SUM(ABOVE)", but it doesn't work that way.
I think I might need to set custom XML data here, but I am not sure how to proceed. I tried the following piece of code:
XWPFTable table = document.createTable();
//create first row
XWPFTableRow tableRowOne = table.getRow(0);
table.getRow(0).createCell();
table.getRow(0).getCell(0).setText("10");
table.getRow(0).createCell();
table.getRow(0).getCell(1).setText("=SUM(ABOVE)");

What I do for such requirements is as follows:
First, create the simplest possible Word document containing the required things using the Word GUI. Then look at what Word has created to get an idea of what needs to be created using Apache POI.
Concretely, here:
Create the simplest possible table in Word that has a field {=SUM(ABOVE)} in it. Save it as *.docx. Now unzip that *.docx (Office Open XML files like *.docx are simply ZIP archives). Have a look at /word/document.xml in that archive. There you will find something like:
<w:tc>
<w:p>
<w:fldSimple w:instr="=SUM(ABOVE)"/>
...
</w:p>
</w:tc>
This is XML for a table cell having a paragraph having a fldSimple element in it where instr attribute contains the formula.
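If you prefer not to unzip the file by hand, the inspection step can be done with the JDK's own ZIP classes. This is a minimal sketch, assuming the part is UTF-8 encoded; DocxXmlReader and readDocumentXml are hypothetical names chosen for illustration, and the file path is whatever *.docx you saved from Word.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper that reads one part out of a *.docx ZIP archive.
final class DocxXmlReader {

    // Returns the content of word/document.xml, or null if the archive
    // has no such entry. Assumes the XML is UTF-8 encoded.
    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    // read() returns -1 at the end of the current entry
                    while ((n = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, n);
                    }
                    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the *.docx file saved from Word
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(readDocumentXml(in));
        }
    }
}
```

Searching the printed XML for w:fldSimple then shows how Word stored the field.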
Now we know that we need the table cell XWPFTableCell and the XWPFParagraph in it. Then we need to set a fldSimple element in this paragraph whose instr attribute contains the formula.
This would be as simple as
paragraphInCell.getCTP().addNewFldSimple().setInstr("=SUM(ABOVE)");
But of course something must tell Word to calculate the formula when the document opens. The simplest solution is to mark the field as "dirty". This forces Word to update the field when it opens the document. It also makes Word show a confirmation dialog about the update.
Complete example using Apache POI 4.1.0:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSimpleField;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;

public class CreateWordTableSumAbove {

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument();
        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun run = paragraph.createRun();
        run.setText("The table:");

        //create the table
        XWPFTable table = document.createTable(4, 3);
        table.setWidth("100%");
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                if (col < 2) table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
                else table.getRow(row).getCell(col).setText("" + ((row + 1) * 1234));
            }
        }

        //set Sum row
        table.getRow(3).getCell(0).setText("Sum:");

        //get paragraph from cell where the sum field shall be contained
        XWPFParagraph paragraphInCell = null;
        if (table.getRow(3).getCell(2).getParagraphs().size() == 0) paragraphInCell = table.getRow(3).getCell(2).addParagraph();
        else paragraphInCell = table.getRow(3).getCell(2).getParagraphs().get(0);

        //set sum field in
        CTSimpleField sumAbove = paragraphInCell.getCTP().addNewFldSimple();
        sumAbove.setInstr("=SUM(ABOVE)");
        //set sum field dirty, so it must be calculated while opening the document
        sumAbove.setDirty(STOnOff.TRUE);

        paragraph = document.createParagraph();

        FileOutputStream out = new FileOutputStream("create_table.docx");
        document.write(out);
        out.close();
        document.close();
    }
}
All of this only works properly when the document is opened in Microsoft Word. LibreOffice Writer can neither store such formula fields in Office Open XML (*.docx) format nor read such Office Open XML formula fields properly.

IText Unable to read whitespace from tabular data from PDF using Java

This question has been asked before, but my query was not answered. I have a PDF with a table in which some columns have no values. I need to read those blank spaces.
I have used iText to extract data from the PDF, but the table is read column by column, and a column with no value is not returned as white space; the next column's value is returned instead.
I have customized LocationTextExtractionStrategy and overridden getResultantText().
If there is no value for MD and TD col 1,2,3, then after 1 the extracted text contains no spaces and goes straight to the next value, 2. Is there any way to read the blank spaces?
PdfReader reader = new PdfReader(filename);
FontRenderFilter fontFilter = new FontRenderFilter();
TextExtractionStrategy strategy = new FilteredTextRenderListener(new MyLocationTextExtractionStrategy(), fontFilter);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    String finalText = PdfTextExtractor.getTextFromPage(reader, i, strategy);
    System.out.println("finalText.." + finalText);
}

CSV to XLXS format with data in java [closed]

I have a .csv file in which the data is in the format below:
TEST;"TEST1";"TEST2";"TEST3";"TEST4" in each column.
I need to convert the .csv file to an .xlsx file in which each value is in a different column.
I tried using Apache POI; however, it just converts the file to .xlsx format, and the data remains in one column.
Can anyone share sample code?

Here is a simple example (without exception handling, encoding, file paths, ...) that handles a CSV with semicolons (in that case csv translates to "character separated file") and creates an XLSX file:
//open input file
BufferedReader br = new BufferedReader(new FileReader("input.csv"));
//create sheet
Workbook wb = new XSSFWorkbook();
Sheet sheet = wb.createSheet();
//read from file
String line = br.readLine();
for (int rows = 0; line != null; rows++) {
    //create one row per line
    Row row = sheet.createRow(rows);
    //split by semicolon
    String[] items = line.split(";");
    //ignore first item
    for (int i = 1, col = 0; i < items.length; i++) {
        //strip quotation marks
        String item = items[i].substring(1, items[i].length() - 1);
        Cell cell = row.createCell(col++);
        //set item
        cell.setCellValue(item);
    }
    //read next line
    line = br.readLine();
}
//write to xlsx
FileOutputStream out = new FileOutputStream("Output.xlsx");
wb.write(out);
//close resources
br.close();
out.close();
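The splitting step in the loop above assumes that every item after the first is wrapped in quotation marks; an empty or unquoted item would make substring() fail or cut off real characters. A slightly more defensive variant is sketched here, with no POI dependency so it can be tried on its own. SemicolonLine and parse are hypothetical names, and the sketch still assumes there are no semicolons or escaped quotes inside a quoted value. Unlike the loop above, it keeps the first item.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for splitting one semicolon-separated line.
final class SemicolonLine {

    // Splits on semicolons and strips one pair of surrounding quotation
    // marks where present. Assumption: quoted values contain neither
    // semicolons nor escaped quotes.
    static List<String> parse(String line) {
        List<String> items = new ArrayList<>();
        // limit -1 keeps trailing empty items instead of dropping them
        for (String raw : line.split(";", -1)) {
            String item = raw.trim();
            if (item.length() >= 2 && item.startsWith("\"") && item.endsWith("\"")) {
                item = item.substring(1, item.length() - 1);
            }
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parse("TEST;\"TEST1\";\"TEST2\";\"TEST3\";\"TEST4\""));
    }
}
```

Each returned item can then be written with row.createCell(col++).setCellValue(item) exactly as in the loop above.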
Given an input.csv like this:
TEST;"TEST1";"TEST2";"TEST3";"TEST4"
TEST;"TEST5";"TEST6";"TEST7";"TEST8"
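the Output.xlsx written by the POI example contains two rows of four cells each: TEST1 to TEST4 in the first row and TEST5 to TEST8 in the second. The unquoted first item of each line is skipped.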

Printing different values (Copy number) in different copies of a particular document

My application uses an RTF file with merge fields as its source and creates a PDF file from it using Aspose.Words. The users of this application give the resulting document to their clients, so a copy of the same document is printed for each client. The only difference between the copies is the copy number at the end of each one.
For now, let's say there are 4 clients, so 4 copies of the same document are printed that differ only in their copy numbers. I achieve this by creating the same document 4 times; each time I insert my HTML text, merge the fields, add the copy number and then append the document. In the end, I have one big document containing all 4 created documents.
Here is my code block for it. There was a lot of code, so I cut it down to only the relevant parts:
import com.aspose.words.*;

Document docAllAppended = new Document(loadDocument("/documents/" + RTFFileName));
Document docTemp = null;
for (int i = 1; i <= copyNumber; i++) {
    docTemp = new Document(loadDocument("/documents/" + RTFFileName));
    DocumentBuilder builder = new DocumentBuilder(docTemp);
    //insert html which includes file context
    builder.insertHtml(htmlText);
    //insert Copy number
    builder.moveToBookmark("sayfa");
    Font font = builder.getFont();
    font.setBold(true);
    font.setSize(8);
    builder.write("Copy Number-" + i + " / ");
    font.setBold(false);
    docAllAppended.appendDocument(docTemp, ImportFormatMode.USE_DESTINATION_STYLES);
}
This seems wasteful and performs poorly. Also, each time my users change the number of copies to print, my application recomputes the whole thing from the start. Is there a way to make this faster, or to avoid recreating everything when the number of copies changes? So far I haven't found much.
Thanks in advance.

If the only difference is the copy number, then you can prepare the document just once by inserting the HTML, merging the fields, etc.
Then, in a for loop, set the copy number and save the document as docx or pdf. Appending the document in the loop is not necessary; you can save each copy under a different name.
import com.aspose.words.*;

// Prepare the document once
Document docTemp = new Document(loadDocument("/documents/" + RTFFileName));
DocumentBuilder builder = new DocumentBuilder(docTemp);
//insert html which includes file context
builder.insertHtml(htmlText);
// In for loop, only update the copy number
for (int i = 1; i <= copyNumber; i++) {
    // Use DocumentBuilder for font setting
    builder.moveToBookmark("sayfa");
    Font font = builder.getFont();
    font.setBold(true);
    font.setSize(8);
    builder.write("dummy value");
    font.setBold(false);
    // Use Bookmark for setting the actual value; it is read from docTemp,
    // the document that was prepared above
    Bookmark bookmark = docTemp.getRange().getBookmarks().get("sayfa");
    bookmark.setText("Copy Number-" + i + " / ");
    // Save the prepared document for each client
    docTemp.save(Common.DATA_DIR + "Letter-Client-" + i + ".docx");
}
I work with Aspose as Developer Evangelist.

How to display barcodes in a matrix-like structure?

How can I display different barcodes in multiple columns in a PDF page using itext library to generate pdfs in java? I have to display 12 barcodes in the same PDF page in three columns, each one contains 4 barcodes (in other words it is a 4 by 3 matrix).

I've made a Barcodes example that does exactly what you need. See the resulting pdf: barcodes_table.pdf
There's nothing difficult about it. You just create a table with 4 columns and add 12 cells:
PdfPTable table = new PdfPTable(4);
table.setWidthPercentage(100);
for (int i = 0; i < 12; i++) {
    table.addCell(createBarcode(writer, String.format("%08d", i)));
}
The createBarcode() method creates a cell with a barcode:
public static PdfPCell createBarcode(PdfWriter writer, String code) throws DocumentException, IOException {
    BarcodeEAN barcode = new BarcodeEAN();
    barcode.setCodeType(Barcode.EAN8);
    barcode.setCode(code);
    PdfPCell cell = new PdfPCell(barcode.createImageWithBarcode(writer.getDirectContent(), BaseColor.BLACK, BaseColor.GRAY), true);
    cell.setPadding(10);
    return cell;
}
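A side note on the codes themselves: the loop above passes String.format("%08d", i) as the code, but in EAN-8 the eighth digit is a check digit computed from the first seven, so plain zero padding will generally not produce a valid check digit. If scanner-valid codes matter to you, a helper along these lines could build them from a 7-digit payload. Ean8 and withCheckDigit are hypothetical names and not part of iText.

```java
// Hypothetical helper, independent of iText.
final class Ean8 {

    // Appends the EAN-8 check digit to a 7-digit payload.
    // Digits at positions 1, 3, 5 and 7 (counted from the left) are
    // weighted 3, the others 1; the check digit brings the weighted sum
    // up to the next multiple of 10.
    static String withCheckDigit(String sevenDigits) {
        if (!sevenDigits.matches("\\d{7}")) {
            throw new IllegalArgumentException("expected exactly 7 digits: " + sevenDigits);
        }
        int sum = 0;
        for (int i = 0; i < 7; i++) {
            int digit = sevenDigits.charAt(i) - '0';
            sum += (i % 2 == 0) ? digit * 3 : digit;
        }
        int check = (10 - sum % 10) % 10;
        return sevenDigits + check;
    }

    public static void main(String[] args) {
        System.out.println(withCheckDigit("7351353"));
    }
}
```

In the loop, the call would then become createBarcode(writer, Ean8.withCheckDigit(String.format("%07d", i))).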
