Convert pdf to byte[] and vice versa with pdfbox - java

I've read the documentation and the examples, but I'm having a hard time putting it all together. I'm trying to take a test PDF file, convert it to a byte array, convert the byte array back into a PDF, and then write that PDF to disk.
It probably doesn't help much, but this is what I've got so far:
package javaapplication1;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
public class JavaApplication1 {
private COSStream stream;
public static void main(String[] args) {
try {
PDDocument in = PDDocument.load("C:\\Users\\Me\\Desktop\\JavaApplication1\\in\\Test.pdf");
byte[] pdfbytes = toByteArray(in);
PDDocument out;
} catch (Exception e) {
System.out.println(e);
}
}
private static byte[] toByteArray(PDDocument pdDoc) throws IOException, COSVisitorException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
try {
pdDoc.save(out);
pdDoc.close();
} catch (Exception ex) {
System.out.println(ex);
}
return out.toByteArray();
}
public void PDStream(PDDocument document) {
stream = new COSStream(document.getDocument().getScratchFile());
}
}

You can use Apache Commons IO, which is essential in any Java project IMO.
Then you can use FileUtils's readFileToByteArray(File file) and writeByteArrayToFile(File file, byte[] data).
(Here is commons-io, which is where FileUtils lives: http://commons.apache.org/proper/commons-io/download_io.cgi )
For example, I just tried this and it worked beautifully.
try {
File file = new File("/example/path/contract.pdf");
byte[] array = FileUtils.readFileToByteArray(file);
FileUtils.writeByteArrayToFile(new File("/example/path/contract2.pdf"), array);
} catch (IOException e) {
e.printStackTrace();
}
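If adding Commons IO is not an option, the same round trip can be sketched with the JDK alone, assuming Java 7 or later. The class and method names below are hypothetical, and the paths are taken from the command line rather than from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfBytesRoundTrip {

    /** Reads the whole file into memory as a byte array. */
    public static byte[] readPdf(Path source) throws IOException {
        return Files.readAllBytes(source);
    }

    /** Writes the bytes to the target path, creating or overwriting the file. */
    public static void writePdf(Path target, byte[] data) throws IOException {
        Files.write(target, data);
    }

    public static void main(String[] args) throws IOException {
        // args[0] = existing PDF, args[1] = copy to create (both hypothetical)
        byte[] pdfBytes = readPdf(Paths.get(args[0]));
        writePdf(Paths.get(args[1]), pdfBytes);
        System.out.println("Copied " + pdfBytes.length + " bytes");
    }
}
```

This treats the PDF as opaque bytes, so PDFBox is only needed if the document has to be parsed or modified in between.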

Related

Cyrillic text coming from Document Properties is corrupt in PDF file in docx4j

I am trying to convert docx to PDF using docx4j 3.7.7. The PDF is generated properly, but a docProperty containing Cyrillic text does not come through: it is rendered as #####. A normal paragraph with Cyrillic text is generated properly. The issue is reproducible only on Linux. On Windows, the docProperty is converted properly.
The file for testing can be found here
file
Below is the code :
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.docx4j.Docx4J;
import org.docx4j.convert.out.FOSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class TestRussian {
public static void main(String[] args) {
new TestRussian().convertWordToPdf();
}
public void convertWordToPdf() {
FileOutputStream fileOutputStream =null;
try {
File file = new File("Test1.docx");
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
boolean checkViaFo = Docx4J.pdfViaFO();
FOSettings foSettings = Docx4J.createFOSettings();
fileOutputStream= new FileOutputStream("PDFRussian1.pdf");
foSettings.setWmlPackage(wordMLPackage);
//Getting error in update() during complex field update
//FieldUpdater updater = new FieldUpdater(wordMLPackage);
//updater.update(true);
Docx4J.toPDF(wordMLPackage,fileOutputStream);
System.out.println("Done");
} catch (Exception ex) {
} finally {
try {
if (fileOutputStream != null) {
fileOutputStream.close();
}
} catch (IOException e) {
}
}
}
}
I have read something about MERGEFORMAT and CHARFORMAT but don't know much about them.

How to write extracted image from pdf to a file

Hopefully this is simple.
I am using pdfbox to extract images from a pdf. I want to write the images to a folder. I don't seem to get any output (the folder has read and write privileges).
I am probably not writing the output stream properly I think.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
public final class JavaImgExtactor
{
public static void main(String[] args) throws IOException{
Stuff();
}
@SuppressWarnings("resource")
public static void Stuff() throws IOException{
File inFile = new File("/Users/sebastianzeki/Documents/Images Captured with Proc Data Audit.pdf");
PDDocument document = new PDDocument();
//document=null;
try {
document = PDDocument.load(inFile);
} catch (Exception e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
PDPage page = (PDPage) iter.next();
System.out.println("page"+page);
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
if (pageImages != null) {
Iterator imageIter = pageImages.keySet().iterator();
System.out.println("Success"+imageIter);
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
FileOutputStream out = new FileOutputStream("/Users/sebastianzeki/Documents/ImgPDF.jpg");
try {
image.write2OutputStream(out);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
}
}
You are not closing the output stream, and the file name is always the same.
try (FileOutputStream out = new FileOutputStream("/Users/sebastianzeki/Documents/ImgPDF" + key + ".jpg")) {
image.write2OutputStream(out);
} catch (Exception e) {
e.printStackTrace();
}
try-with-resources will automatically close out. Not sure whether key is usable as file name part.
image.write2OutputStream(out); writes the bytes from the image object to the out FileOutputStream object, but it doesn't flush the buffer of out.
Adding this should do the job:
out.flush();
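On the open question of whether key is usable in a file name, here is a minimal sketch that sanitizes the key and writes each image to its own file with try-with-resources. The class, the helper names and the "ImgPDF_" prefix are hypothetical, and the sketch writes a plain byte array; in the extraction loop above, image.write2OutputStream(out) would take the place of out.write(imageBytes).

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class UniqueImageFiles {

    /** Replaces every character outside letters, digits, '.', '_' and '-' with '_'. */
    public static String safeName(String key) {
        return key.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    /** Writes one image to its own file; the stream is closed even if the write fails. */
    public static File writeImage(File dir, String key, byte[] imageBytes) throws IOException {
        File target = new File(dir, "ImgPDF_" + safeName(key) + ".jpg");
        try (OutputStream out = new FileOutputStream(target)) {
            out.write(imageBytes);
        }
        return target;
    }
}
```

Keys are only unique within a page's resources, so including the page number in the file name may also be needed to avoid overwriting files across pages.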

Why can't I use FileInputStream to feed MessageDigest object?

Why must I use DigestInputStream and not FileInputStream to get a digest of a file?
I have written a program that reads ints from a FileInputStream, converts them to bytes and passes them to the update method of a MessageDigest object. But I suspect that it doesn't work properly, because it calculates the digest of a very large file instantly. Why doesn't it work?
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class DigestDemo {
public static byte[] getSha1(String file) {
FileInputStream fis = null;
MessageDigest md = null;
try {
fis = new FileInputStream(file);
} catch(FileNotFoundException exc) {
System.out.println(exc);
}
try {
md = MessageDigest.getInstance("SHA-1");
} catch (NoSuchAlgorithmException exc) {
System.out.println(exc);
}
byte b = 0;
do {
try {
b = (byte) fis.read();
} catch (IOException e) {
System.out.println(e);
}
if (b != -1)
md.update(b);
} while(b != -1);
return md.digest();
}
public static void writeBytes(byte[] a) {
for (byte b : a) {
System.out.printf("%x", b);
}
}
public static void main(String[] args) {
String file = "C:\\Users\\Mike\\Desktop\\test.txt";
byte[] digest = getSha1(file);
writeBytes(digest);
}
}
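You need to change the type of b to int: read() returns an int, and casting it to byte makes any 0xFF byte in the file look like the end-of-stream value -1, so the loop stops early. Cast to byte only when calling update(). Finish with MessageDigest.digest(), as the code already does (MessageDigest has no doFinal(); that method belongs to Mac and Cipher). Even when fixed, reading one byte at a time is horrifically inefficient. Try reading into a byte array and updating from that.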
There's too much try-catching in this code. Reduce it to one try and two catches, outside the loop.
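A minimal sketch of the byte-array approach follows. The class name, the helper names and the 8192-byte buffer size are assumptions, not part of the question. It also prints with %02x instead of %x, because %x drops the leading zero of any byte below 0x10.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Buffered {

    /** Feeds the stream to SHA-1 in chunks and returns the 20-byte digest. */
    public static byte[] sha1(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int n; // number of bytes read, or -1 at end of stream
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        return md.digest();
    }

    /** Formats each byte as two lowercase hex digits. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the file to hash
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(toHex(sha1(in)));
        }
    }
}
```

The end-of-stream check is done on the int returned by read(byte[]), so bytes with the value 0xFF inside the file cannot end the loop early.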

see the content of a .bson file using java

I have a very large .bson file.
Now I have two questions:
How can I see the content of that file? (I know it can be done with "bsondump", but this command is slow, especially for a large database.) (In fact I want to see the structure of that file.)
How can I see the content of that file using java?
You can easily read/parse a bson file in Java using a BSONDecoder instance such as BasicBSONDecoder or DefaultBSONDecoder. These classes are included in mongo-java-driver.
Here's a simple example of a Java implementation of bsondump.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.bson.BSONDecoder;
import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;
public class BsonDump {
public void bsonDump(String filename) throws FileNotFoundException {
File file = new File(filename);
InputStream inputStream = new BufferedInputStream(new FileInputStream(file));
BSONDecoder decoder = new BasicBSONDecoder();
int count = 0;
try {
while (inputStream.available() > 0) {
BSONObject obj = decoder.readObject(inputStream);
if(obj == null){
break;
}
System.out.println(obj);
count++;
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
inputStream.close();
} catch (IOException e) {
}
}
System.err.println(String.format("%s objects read", count));
}
public static void main(String args[]) throws Exception {
if (args.length < 1) {
//TODO usage
throw new IllegalArgumentException("Expected <bson filename> argument");
}
String filename = args[0];
BsonDump bsonDump = new BsonDump();
bsonDump.bsonDump(filename);
}
}

Cannot append data to a binary file with code?

import java.io.FileOutputStream;
import java.io.File;
public class AppendBinaryFile
{
public static void main (String[] args)
{
FileOutputStream toFile = null;
try
{
toFile = new FileOutputStream(new File("numbers.dat"), true);
toFile.write(15);
toFile.write(30);
toFile.close();
}
catch (Exception e)
{
}
}
}
I run another program to get the data from a binary file after running the program but data in the binary file does not change. What is wrong with the code?
You need to close your file output stream I believe.
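One way to check whether the append actually happened is to read the file back after writing. The sketch below uses try-with-resources, so the stream is closed even if a write fails, and it lets exceptions propagate instead of swallowing them as the empty catch block in the question does. The class and method names are hypothetical.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class AppendCheck {

    /** Appends the low 8 bits of each value to the file, creating it if needed. */
    public static void appendBytes(Path file, int... values) throws IOException {
        // The second constructor argument 'true' opens the file in append mode
        try (FileOutputStream out = new FileOutputStream(file.toFile(), true)) {
            for (int v : values) {
                out.write(v);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("numbers.dat");
        appendBytes(file, 15, 30);
        // Each run should make the file two bytes longer
        System.out.println(Arrays.toString(Files.readAllBytes(file)));
    }
}
```

If the printed array does not grow between runs, the reading program may be looking at a different working directory than the one where numbers.dat is written.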
