How to convert a pdf file into CSV file?

How to convert a pdf file into CSV file? - java

I want to convert a PDF file into a CSV file.
I am using iText library for this.
The program is working fine but the output is not in desired format.
All the data is coming in first line of the csv file. The output should be exactly same as pdf file(means with line breaks).
Please help.
Thanks in advance.
Document document = new Document();
document.open();
PdfReader reader = new PdfReader("C:\\Indiaops-projects\\PREMIUM_PAID_ACKNOWLEDGEMENT.pdf");
PdfDictionary dictionary = reader.getPageN(1);
AcroFields fileds = reader.getAcroFields();
PRIndirectReference reference = (PRIndirectReference)
dictionary.get(PdfName.CONTENTS);
PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
byte[] bytes = PdfReader.getStreamBytes(stream);
PRTokeniser tokenizer = new PRTokeniser(bytes);
FileOutputStream fos=new FileOutputStream("C:\\Indiaops-projects\\pdf.csv");
StringBuffer buffer = new StringBuffer();
StringBuffer data = new StringBuffer();
int i=0;
while (tokenizer.nextToken()) {
if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
String value = tokenizer.getStringValue();
if("x-none".equals(value)){
String datastr =data.toString();
if(!"".equals(datastr)){
buffer.append("\""+datastr+"\",");
data = new StringBuffer();
}
}else{
data.append(value);
}
}
}
String test=buffer.toString();
StringReader stReader = new StringReader(test);
int t;
while((t=stReader.read())>0)
fos.write(t);
document.add(new Paragraph(".."));
document.close();

You need to introduce a line break '\n' in the buffer after each table row.
buffer.append("\n");

Related

iText7 : how to use the same reader in multiple page of the same pdf

As the title said,I have an existing pdf that contains an empty table with a custom design, I want to create a pdf file that write data in that table, but I need to have that same table on every page of my new pdf.
public byte[] pdfRdv(Rdv rdv, List<RdvClient> clients) throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
//this file contains a empty table in which I want to insert each iteration of my
//table clients
String src = "C://Users//ASUS//Downloads//rdv.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(baos);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
Document document = new Document(pdfDoc);
for(int i = 0; i < clients.size(); i++){
if(i > 0){
// I tried to add for each itaration a new page but i don't know how to insert in
//it the table in the existing pdf (reader)
pdfDoc.addNewPage(i+2);
}
Paragraph p1 = new Paragraph(clients.get(i).getId().toString()).setFontSize(10);
p1.setFixedPosition(140, 687, 400);
document.add(p1);
//..
}
document.close();
pdfDoc.close();
byte[] pdf = baos.toByteArray();
return pdf;
}
Thank you in advance

Merging of different file format into single pdf using itext is giving corrupt pdf

I have a multipart form in which different type of files will be uploaded in single request like pdf,png,jpg etc. I need to read these files and create final merged single pdf file. Does iText will support this kind of merger of different file types.
I can able to merge all image type but when i try to merge pdf with imgae getting some issues.Any sample example on this.Below code creating final pdf without images.
public void generateMergedPDF(Map<String, String> dataMap, MultipartFile[] files) throws Exception {
ClassPathResource resource = new ClassPathResource("templatepdfForm.pdf");
FileOutputStream userInfo = new FileOutputStream(new File("C:\\pdf\\megepath\\updateddocument.pdf"));
//Update filed values to template Starts
PdfReader reader = new PdfReader(resource.getInputStream());
PdfStamper stamper = new PdfStamper(reader, userInfo);
stamper.setFormFlattening(true);
AcroFields form = stamper.getAcroFields();
Map<String, Item> fieldMap = form.getFields();
for (String key : fieldMap.keySet()) {
String fieldValue = dataMap.get(key);
if (fieldValue != null) {
form.setField(key, fieldValue);
}
}
stamper.close();
//Update filed values to template Ends
Document mergePdfDoc = new Document();
PdfCopy pdfCopy;
boolean smartCopy = false;
FileOutputStream finalFile = new FileOutputStream("C:\\pdf\\finalmergedfile.pdf");
//Merge updated pdf with multipart content
if(smartCopy)
pdfCopy = new PdfSmartCopy(mergePdfDoc, finalFile);
else
pdfCopy = new PdfCopy(mergePdfDoc, finalFile);
PdfWriter writer = PdfWriter.getInstance(mergePdfDoc, finalFile);
mergePdfDoc.open();
PdfReader mergeReader = new PdfReader(new FileInputStream(new File("C:\\pdf\\megepath\\updateddocument.pdf")));
pdfCopy.addDocument(mergeReader);
pdfCopy.freeReader(mergeReader);
mergeReader.close();
PdfReader[] pdfReader = new PdfReader[files.length];
for(int i=0; i<=files.length-1;i++) {
if(FileContentType.APPLICATION_TYPE.getContentTypes().contains(files[i].getContentType())) {
//To add multipart pdf content
pdfReader[i] = new PdfReader(files[i].getInputStream());
pdfCopy.addDocument(pdfReader[i]);
pdfCopy.freeReader(pdfReader[i]);
pdfReader[i].close();
}else if(FileContentType.IMAGE_TYPE.getContentTypes().contains(files[i].getContentType())) {
//To add multipart image content
System.out.println("Image Type Loop");
Image fileImage = Image.getInstance(files[i].getBytes());
mergePdfDoc.setPageSize(fileImage);
mergePdfDoc.newPage();
claimImage.setAbsolutePosition(0, 0);
mergePdfDoc.add(fileImage);
}
}
pdfCopy.setMergeFields();
//mergePdfDoc.close(); //If i enable this close stream closed error
//writer.close(); //If i enable this close stream closed error
memInfo.close();
finalFile.close();
}

merge multiple pdfs in order

hey guys sorry for long post and bad language and if there is unnecessary details
i created multiple 1page pdfs from one pdf template using excel document
i have now
something like this
tempfile0.pdf
tempfile1.pdf
tempfile2.pdf
...
im trying to merge all files in one single pdf using itext5
but it semmes that the pages in the resulted pdf are not in the order i wanted
per exemple
tempfile0.pdf in the first page
tempfile1. int the 2000 page
here is the code im using.
the procedure im using is:
1 filling a from from a hashmap
2 saving the filled form as one pdf
3 merging all the files in one single pdf
public void fillPdfitext(int debut,int fin) throws IOException, DocumentException {
for (int i =debut; i < fin; i++) {
HashMap<String, String> currentData = dataextracted[i];
// textArea.appendText("\n"+pdfoutputname +" en cours de preparation\n ");
PdfReader reader = new PdfReader(this.sourcePdfTemplateFile.toURI().getPath());
String outputfolder = this.destinationOutputFolder.toURI().getPath();
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(outputfolder+"\\"+"tempcontrat"+debut+"-" +i+ "_.pdf"));
// get the document catalog
AcroFields acroForm = stamper.getAcroFields();
// as there might not be an AcroForm entry a null check is necessary
if (acroForm != null) {
for (String key : currentData.keySet()) {
try {
String fieldvalue=currentData.get(key);
if (key=="ADRESSE1"){
fieldvalue = currentData.get("ADRESSE1")+" "+currentData.get("ADRESSE2") ;
acroForm.setField("ADRESSE", fieldvalue);
}
if (key == "IMEI"){
acroForm.setField("NUM_SERIE_PACK", fieldvalue);
}
acroForm.setField(key, fieldvalue);
// textArea.appendText(key + ": "+fieldvalue+"\t\t");
} catch (Exception e) {
// e.printStackTrace();
}
}
stamper.setFormFlattening(true);
}
stamper.close();
}
}
this is the code for merging
public void Merge() throws IOException, DocumentException
{
File[] documentPaths = Main.objetapp.destinationOutputFolder.listFiles((dir, name) -> name.matches( "tempcontrat.*\\.pdf" ));
Arrays.sort(documentPaths, NameFileComparator.NAME_INSENSITIVE_COMPARATOR);
byte[] mergedDocument;
try (ByteArrayOutputStream memoryStream = new ByteArrayOutputStream())
{
Document document = new Document();
PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
document.open();
for (File docPath : documentPaths)
{
PdfReader reader = new PdfReader(docPath.toURI().getPath());
try
{
reader.consolidateNamedDestinations();
PdfImportedPage pdfImportedPage = pdfSmartCopy.getImportedPage(reader, 1);
pdfSmartCopy.addPage(pdfImportedPage);
}
finally
{
pdfSmartCopy.freeReader(reader);
reader.close();
}
}
document.close();
mergedDocument = memoryStream.toByteArray();
}
FileOutputStream stream = new FileOutputStream(this.destinationOutputFolder.toURI().getPath()+"\\"+
this.sourceDataFile.getName().replaceFirst("[.][^.]+$", "")+".pdf");
try {
stream.write(mergedDocument);
} finally {
stream.close();
}
documentPaths=null;
Runtime r = Runtime.getRuntime();
r.gc();
}
my question is how to keep the order of the files the same in the resulting pdf

It is because of naming of files. Your code
new FileOutputStream(outputfolder + "\\" + "tempcontrat" + debut + "-" + i + "_.pdf")
will produce:
tempcontrat0-0_.pdf
tempcontrat0-1_.pdf
...
tempcontrat0-10_.pdf
tempcontrat0-11_.pdf
...
tempcontrat0-1000_.pdf
Where tempcontrat0-1000_.pdf will be placed before tempcontrat0-11_.pdf, because you are sorting it alphabetically before merge.
It will be better to left pad file number with 0 character using leftPad() method of org.apache.commons.lang.StringUtils or java.text.DecimalFormat and have it like this tempcontrat0-000000.pdf, tempcontrat0-000001.pdf, ... tempcontrat0-9999999.pdf.
And you can also do it much simpler and skip writing into file and then reading from file steps and merge documents right after the form fill and it will be faster. But it depends how many and how big documents you are merging and how much memory do you have.
So you can save the filled document into ByteArrayOutputStream and after stamper.close() create new PdfReader for bytes from that stream and call pdfSmartCopy.getImportedPage() for that reader. In short cut it can look like:
// initialize
PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
for (int i = debut; i < fin; i++) {
ByteArrayOutputStream out = new ByteArrayOutputStream();
// fill in the form here
stamper.close();
PdfReader reader = new PdfReader(out.toByteArray());
reader.consolidateNamedDestinations();
PdfImportedPage pdfImportedPage = pdfSmartCopy.getImportedPage(reader, 1);
pdfSmartCopy.addPage(pdfImportedPage);
// other actions ...
}

Java Pdf content to String

I'm wondering if is there a way to obtain the content of a pdf file (raw bytes) as a String using Apache PdfBox 2.0.8. What I'm doing is to save the PDDocument object to a ByteArrayOutputStream and then create a new String getting ByteArrayOutputStream's byte array. But if I save the String to a file, the result is a blank pdf. The reason for this is because pdf's stream section bytes are different from a pdf created directly from PdDocument object to a file. After knowing this, I tried to get the ByteArrayOutputStream's character encoding using juniversalchardet, but no luck. So, is there a way to acomplish this?
This is what I have tried so far:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PDDocument doc = new PDDocument();
... //Add page, font, pdPageContentStream and text only to doc object with some latin chars (áéíóú)
doc.save(baos);
So, if I create a file using baos object, the pdf file looks as expected, but if I do this:
String str = new String(baos.toByteArray());
And then create a file using str bytes, the pdf file only shows a blank page.
Hope I was clear enough this time :)

Using this, just append everything to a String.
StringBuilder sb = new StringBuilder();
try (PDDocument document = PDDocument.load(new File("your\\path\\file.pdf"))) {
document.getClass();
if (!document.isEncrypted()) {
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
sb.append(line);
}
}
}
return sb.toString();

Unable to read unicode character in pdf using java

I am trying to convert Pdf document that contains Tamil unicode characters into a word document retaining all the formatting. I am not able to read the unicode character in the Pdf they are appearing as junk character in word. I am using the below code can someone please help?
public static void main(String[] args) throws IOException {
System.out.println("Document converted started");
XWPFDocument doc = new XWPFDocument();
String pdf = "D:\\sample1.pdf";
PdfReader reader = new PdfReader(pdf);
// InputStreamReader isr = new InputStreamReader(reader,"UTF8");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i,
new SimpleTextExtractionStrategy());
System.out.println(strategy.getResultantText());
String text = strategy.getResultantText();
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
// run.setFontFamily(new Font("Arial"));
run.setFontSize(14);
run.setText(text);
// run.addBreak(BreakType.PAGE);
}
FileOutputStream out = new FileOutputStream("D:\\tamildoc.docx");
doc.write(out);
out.close();
reader.close();
System.out.println("Document converted successfully");
}

You can use the library Apache PDFBox https://pdfbox.apache.org/download.cgi . With the component PDFTextStripper, invoking method getText(PDDocument doc) you will obtain a simple String that represents the content of .pdf file
Here an example :
UploadedFile file = new UploadedFile(fileName);
InputStream is = file.getInputStream();
PDDocument doc = PDDocument.load(is);
String content = new PDFTextStripper().getText(doc);
doc.close();
And after that you can write on your file

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to convert a pdf file into CSV file? - java

You need to introduce a line break '\n' in the buffer after each table row. buffer.append("\n");

Related

iText7 : how to use the same reader in multiple page of the same pdf

Merging of different file format into single pdf using itext is giving corrupt pdf

merge multiple pdfs in order

Java Pdf content to String

Unable to read unicode character in pdf using java

Categories

Resources