PDFBox document to InputStream

I'm trying to take a PDDocument object and pass it to another module as an InputStream without saving the document to the file system.
I read about PDStream and thought I understood its purpose, so I tried something like this:
PDStream stream = new PDStream(document);
InputStream is = stream.createInputStream();
But when I try to load that input stream into a PDDocument, I get this error:
Exception in thread "main" java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1111)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:1885)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1868)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:245)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1098)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:995)
at app.DGDCreator.main(DGDCreator.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:143)
Later I discovered that the resulting file is 0 KB in size.

I'm posting this so that anyone else searching can find a good answer. I ran into the same situation: I didn't want to save the file to any machine, only handle the stream itself. I found an answer here and will repeat it below.
ByteArrayOutputStream out = new ByteArrayOutputStream();
pdDoc.save(out);
pdDoc.close();
ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
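The same pattern works for any object that can write itself to an OutputStream. Below is a minimal sketch using only the JDK; the class InMemoryStreams and the interface StreamWriter are hypothetical names introduced for illustration and are not part of PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of PDFBox.
final class InMemoryStreams {

    /** Anything that can serialize itself to an OutputStream. */
    @FunctionalInterface
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Runs the writer against an in-memory buffer and exposes the bytes as an InputStream. */
    static InputStream toInputStream(StreamWriter writer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writer.writeTo(buffer);
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```

With PDFBox this could presumably be called as InMemoryStreams.toInputStream(pdDoc::save), relying on the save(OutputStream) method used in the answer above. The whole document is held in memory, so the approach suits documents of moderate size.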

I don't understand why you want to do this, but the following code will do it:
public static void main(String[] args) throws IOException {
byte[] file = FileUtils.readFileToByteArray(new File(
"C:\\temp\\a_file.pdf"));
PDDocument document = null;
InputStream is = null;
ByteArrayOutputStream out = null;
try {
document = PDDocument.load(file);
out = new ByteArrayOutputStream();
document.save(out);
byte[] data = out.toByteArray();
is = new ByteArrayInputStream(data);
FileUtils.writeByteArrayToFile(new File(
"C:\\temp\\denemeTEST123.pdf"), IOUtils.toByteArray(is));
} finally {
IOUtils.closeQuietly(out);
IOUtils.closeQuietly(is);
IOUtils.closeQuietly(document);
}
}

Related

Read the binary image data from a URL into a ByteArrayInputStream from HttpUrlConnect::URL

I'm trying to pull the image from a URL and read it directly into a ByteArrayInputStream. I found one way of doing it, but it requires an image type, and there will be various image types, so I'd like to find a simple way to just read the binary data right in.
Here is my latest attempt. I'm using a BufferedImage, which I don't think is necessary.
URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
//Read in the image
BufferedImage image = ImageIO.read(url);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(image, "png", baos);
is = new ByteArrayInputStream(baos.toByteArray());

URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
url.openStream().transferTo(baos);
ByteArrayInputStream in = new ByteArrayInputStream(baos.toByteArray());
The transferTo() method has existed since Java 9. If you use an older version of Java, please see here for an alternative. The main drawback of this solution is that it reads the whole file into memory first. If you plan to forward the binary data to another process anyway, you could omit the byte array streams and transfer the content directly to an OutputStream.
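For Java 8 and earlier, a copy loop can stand in for transferTo(). The sketch below is one possible replacement using only the JDK; StreamCopy is a hypothetical name. It also works when the target is the final OutputStream rather than a byte array buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for Java versions where InputStream.transferTo() is unavailable.
final class StreamCopy {

    /** Copies all bytes from in to out in fixed-size chunks; returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) { // -1 signals end of stream
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

Only one 8 KB buffer is held in memory at a time, so the size of the downloaded file no longer matters when the target is a file or socket stream.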

As an alternative to the solution proposed by @rmunge, the Apache Commons IO library provides the class IOUtils, which can be very useful in your use case.
If you are using Maven, for instance, you can import the library by including the following dependency in your pom.xml:
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.8.0</version>
</dependency>
Then, you can use IOUtils like this:
URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
try (
InputStream imageInputStream = url.openStream();
ByteArrayOutputStream bOut = new ByteArrayOutputStream()
) {
// You can obtain a byte[] as well if required.
// Consider writing to the actual final OutputStream instead of the
// intermediate byte array output stream to reduce memory consumption.
IOUtils.copy(imageInputStream, bOut);
// Create an input stream from the read bytes
ByteArrayInputStream in = new ByteArrayInputStream(bOut.toByteArray());
// ...
} catch (IOException ioe) {
ioe.printStackTrace();
}
Or simply this approach:
URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
byte[] imageBytes = IOUtils.toByteArray(url);
ByteArrayInputStream in = new ByteArrayInputStream(imageBytes);
Regarding your comments: if you are trying to avoid network latency problems and a ByteArrayInputStream is not strictly required, the following code may be helpful as well (see the javadocs):
URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
try (InputStream imageInputStream = url.openStream()) {
InputStream in = IOUtils.toBufferedInputStream(imageInputStream);
//...
}
Of course, you can always perform the read and write "manually" using the standard Java InputStream and OutputStream mechanisms:
URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
try (
InputStream inputStream = url.openStream();
BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(outputStream);
) {
byte[] buffer = new byte[8192];
int bytesRead;
while ((bytesRead = bufferedInputStream.read(buffer)) != -1) {
bufferedOutputStream.write(buffer, 0, bytesRead);
}
bufferedOutputStream.flush();
// Create an input stream from the read bytes
ByteArrayInputStream in = new ByteArrayInputStream(outputStream.toByteArray());
// ...
} catch (IOException ioe) {
ioe.printStackTrace();
}
If you need more control over the underlying URL connection, you can use URLConnection or HttpURLConnection, or one of the many HTTP client libraries such as Apache HttpClient or OkHttp.
Take as an example the problem pointed out by @LuisCarlos in his comment. To avoid possible leaked connections, set timeouts:
URLConnection urlConn = null;
try {
urlConn = url.openConnection();
urlConn.setConnectTimeout(5000);
urlConn.setReadTimeout(30000);
InputStream inputStream = urlConn.getInputStream();
// the rest of the code...
} catch (Exception e) {
e.printStackTrace(); // do not swallow connection errors silently
}
If you need to detect the actual image type consider the use of Tika or JMimeMagic.
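If an extra library is not an option, the JDK offers a much more limited check based on the leading bytes of the data. The sketch below swaps in URLConnection.guessContentTypeFromStream for Tika or JMimeMagic; it recognizes only a small set of formats, and ImageTypeSniffer is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URLConnection;

// Hypothetical helper; a JDK-only stand-in, far less thorough than Tika.
final class ImageTypeSniffer {

    /** Returns a MIME type guessed from the leading bytes, or null if they are not recognized. */
    static String guessType(byte[] leadingBytes) throws IOException {
        // ByteArrayInputStream supports mark/reset, which the JDK method needs to peek at the header
        return URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(leadingBytes));
    }
}
```

Since the answers above already hold the image as a byte[], that array can be passed in directly; a null result means the format has to be determined some other way.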

Here's the solution I found to work. Thanks for the two approaches above. I'd rather avoid external libraries, because the environment is a real pain. Similarly, I should have access to Java 9 and transferTo(), but that's not working.
This answer was also helpful: Convert InputStream(Image) to ByteArrayInputStream
URL url = new URL("http://hobbylesson.com/wp-content/uploads/2015/04/Simple-Acrylic-Painting-Ideas00005.jpg");
InputStream source = url.openStream();
byte[] buf = new byte[8192];
int bytesRead = 0;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
while((bytesRead = source.read(buf)) != -1) {
baos.write(buf, 0, bytesRead);
}
is = new ByteArrayInputStream(baos.toByteArray());

Blank pages in pdf after downloading it from web

I am trying to download a PDF file with HttpClient. It downloads the PDF file, but the pages are blank. I can see the bytes on the console if I print them from the response, but when I write them to a file, the result is a blank document.
FileUtils.writeByteArrayToFile(new File(outputFilePath), bytes);
The files show the correct sizes (103 KB and 297 KB, as expected), but they are just blank!
I also tried an output stream:
FileOutputStream fileOutputStream = new FileOutputStream(outFile);
fileOutputStream.write(bytes);
I also tried writing with UTF-8 encoding:
Writer out = new BufferedWriter( new OutputStreamWriter(
new FileOutputStream(outFile), "UTF-8"));
String str = new String(bytes, StandardCharsets.UTF_8);
try {
out.write(str);
} finally {
out.close();
}
Nothing is working for me. Any suggestion is highly appreciated.
Update: I am using DefaultHttpClient.
HttpGet httpget = new HttpGet(targetURI);
HttpResponse response = null;
String htmlContents = null;
try {
httpget = new HttpGet(url);
response = httpclient.execute(httpget);
InputStreamReader dataStream=new InputStreamReader(response.getEntity().getContent());
byte[] bytes = IOUtils.toByteArray(dataStream);
...

You do
InputStreamReader dataStream=new InputStreamReader(response.getEntity().getContent());
byte[] bytes = IOUtils.toByteArray(dataStream);
As has already been mentioned in comments, using a Reader class can damage binary data, e.g. PDF files. Thus, you should not wrap your content in an InputStreamReader.
As your content can be used to construct an InputStreamReader, though, I assume response.getEntity().getContent() returns an InputStream. Such an InputStream usually can be directly used as IOUtils.toByteArray argument.
So:
InputStream dataStream=response.getEntity().getContent();
byte[] bytes = IOUtils.toByteArray(dataStream);
should already work for you!
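To see why a Reader damages binary data, consider what happens when arbitrary bytes are decoded to text and encoded again. The following is a small illustration, assuming UTF-8 as the charset (the InputStreamReader in the question uses the platform default); ReaderCorruptionDemo and the sample bytes are made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the sample bytes are invented, not taken from a real PDF.
final class ReaderCorruptionDemo {

    /** Decodes bytes as UTF-8 text and encodes them again, as a Reader/Writer pair would. */
    static byte[] throughText(byte[] binary) {
        // Byte sequences that are not valid UTF-8 are replaced by U+FFFD during decoding
        String text = new String(binary, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] binary = {'%', 'P', 'D', 'F', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};
        byte[] damaged = throughText(binary);
        System.out.println("unchanged: " + Arrays.equals(binary, damaged));
    }
}
```

The ASCII part survives, but the high bytes are replaced, so the program reports that the data changed. This matches the symptom in the question: the PDF structure remains readable while the compressed page content is destroyed.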

Here is a method I use to download a PDF file from a specific URL. The method requires two string arguments: a URL string (example: "https://www.ibm.com/support/knowledgecenter/SSWRCJ_4.1.0/com.ibm.safos.doc_4.1/Planning_and_Installation.pdf") and a destination folder path to download the PDF file (or whatever) into. If the destination path does not exist in the local file system, it is created automatically:
public boolean downloadFile(String urlString, String destinationFolderPath) {
boolean result = false; // will turn to true if download is successful
if (!destinationFolderPath.endsWith("/") && !destinationFolderPath.endsWith("\\")) {
destinationFolderPath+= "/";
}
// If the destination path does not exist then create it.
File foldersToMake = new File(destinationFolderPath);
if (!foldersToMake.exists()) {
foldersToMake.mkdirs();
}
try {
// Open Connection
URL url = new URL(urlString);
// Get just the file Name from URL
String fileName = new File(url.getPath()).getName();
// Try with Resources....
try (InputStream in = url.openStream(); FileOutputStream outStream =
new FileOutputStream(new File(destinationFolderPath + fileName))) {
// Read from resource and write to file...
int length = -1;
byte[] buffer = new byte[1024]; // buffer for portion of data from connection
while ((length = in.read(buffer)) > -1) {
outStream.write(buffer, 0, length);
}
}
// File successfully downloaded
result = true;
}
catch (MalformedURLException ex) { ex.printStackTrace(); }
catch (IOException ex) { ex.printStackTrace(); }
return result;
}

How to zip and base64 an org.w3c.dom.Document

I have an org.w3c.dom.Document and have to zip and Base64-encode it to send it with the EBICS protocol via HTTP/HTTPS.
I tried
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(doc);
Result outputTarget = new StreamResult(outputStream);
TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
InputStream inflated_stream = new InflaterInputStream(new ByteArrayInputStream(outputStream.toByteArray()));
final byte[] bytes64bytes = Base64.encodeBase64(IOUtils.toByteArray(inflated_stream));
OrderData = new String(bytes64bytes);
but get an exception
java.util.zip.ZipException: incorrect header check
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:218)

I haven't tried this, but wouldn't the following do what you need?
OutputStream outputStream = new ZipOutputStream(new ByteArrayOutputStream());
I think your problem might be down to the use of InflaterInputStream: aren't you trying to deflate this stream? Your code may work if you just change InflaterInputStream to DeflaterInputStream.

Changing
InputStream inflated_stream = new InflaterInputStream(new ByteArrayInputStream(outputStream.toByteArray()));
to
InputStream inflated_stream = new DeflaterInputStream(new ByteArrayInputStream(outputStream.toByteArray()));
solved the issue
Thanks
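The distinction is that an inflater decompresses and a deflater compresses, so wrapping uncompressed XML in an InflaterInputStream fails on the first bytes. Here is a JDK-only sketch of the compress-then-encode step and its inverse. It uses java.util.Base64 (Java 8+) in place of the Commons Codec call from the question, and DeflateBase64 is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper; a sketch, not an EBICS implementation.
final class DeflateBase64 {

    /** Compresses with zlib/deflate, then Base64-encodes the compressed bytes. */
    static String compressAndEncode(byte[] raw) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(compressed)) {
            deflater.write(raw);
        } // closing the stream finishes the compressed data
        return Base64.getEncoder().encodeToString(compressed.toByteArray());
    }

    /** Reverses compressAndEncode: Base64-decodes, then inflates. */
    static byte[] decodeAndDecompress(String encoded) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(encoded);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (InflaterInputStream inflater =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = inflater.read(buffer)) != -1) {
                raw.write(buffer, 0, read);
            }
        }
        return raw.toByteArray();
    }
}
```

The byte array produced by the Transformer in the question would be the input to compressAndEncode. The decode method is included only to show that the round trip restores the original bytes.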

GZIPInputStream throws exception when reading GZIP file

I am trying to read files from a public anonymous FTP server and I am running into a problem. I can read the plain text files just fine, but when I try to read gzip files, I get this exception:
Exception in thread "main" java.util.zip.ZipException: invalid distance too far back
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at java_io_FilterInputStream$read.call(Unknown Source)
at GenBankFilePoc.main(GenBankFilePoc.groovy:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
I have tried downloading the file and using a FileInputStream wrapped in a GZIPInputStream and got the exact same problem, so I don't think it is a problem with the FTP client (which is apache).
Here is some test code that reproduces the problem. It is just trying to print to stdout:
FTPClient ftp = new FTPClient();
ftp.connect("ftp.ncbi.nih.gov");
ftp.login("anonymous", "");
InputStream is = new GZIPInputStream(ftp.retrieveFileStream("/genbank/gbbct1.seq.gz"));
try {
byte[] buffer = new byte[65536];
int noRead;
while ((noRead = is.read(buffer)) != -1) {
System.out.write(buffer, 0, noRead);
}
} finally {
is.close();
ftp.disconnect();
}
I cannot find any documentation on why this would be happening, and following it through the code in a debugger is not getting me anywhere. I feel like I am missing something obvious.
EDIT: I manually downloaded the file and read it in with a GZIPInputStream and was able to print it out just fine. I have tried this with 2 different Java FTP Clients

Ah, I found out what was wrong. You have to set the file type to FTP.BINARY_FILE_TYPE. Otherwise the transfer runs in ASCII mode, which is the FTPClient default and translates line endings, corrupting binary data such as gzip files.
The following code works:
FTPClient ftp = new FTPClient();
ftp.connect("ftp.ncbi.nih.gov");
ftp.login("anonymous", "");
ftp.setFileType(FTP.BINARY_FILE_TYPE);
InputStream is = new GZIPInputStream(ftp.retrieveFileStream("/genbank/gbbct1.seq.gz"));
try {
byte[] buffer = new byte[65536];
int noRead;
while ((noRead = is.read(buffer)) != -1) {
System.out.write(buffer, 0, noRead);
}
} finally {
is.close();
ftp.disconnect();
}
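Independent of FTP, the decompression loop itself can be checked in isolation. The sketch below reads a gzip stream fully into memory and stops at -1, the value read() returns at end of stream; GzipReader is a hypothetical name, and the input can be any InputStream that delivers the compressed bytes unmodified.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; works on any stream that delivers the gzip bytes unmodified.
final class GzipReader {

    /** Decompresses a gzip stream fully into memory. */
    static byte[] gunzip(InputStream compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(compressed)) {
            byte[] buffer = new byte[65536];
            int read;
            while ((read = gzip.read(buffer)) != -1) { // read() returns -1 at end of stream
                out.write(buffer, 0, read);
            }
        }
        return out.toByteArray();
    }
}
```

For very large files, writing each chunk to its destination inside the loop would avoid holding the whole decompressed content in memory.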

You need to download the file completely first, since ftp.retrieveFileStream() doesn't support file seeking.
Your code should be:
FTPClient ftp = new FTPClient();
ftp.connect("ftp.ncbi.nih.gov");
ftp.login("anonymous", "");
File downloaded = new File("");
FileOutputStream fos = new FileOutputStream(downloaded);
ftp.retrieveFile("/genbank/gbbct1.seq.gz", fos);
InputStream is = new GZIPInputStream(new FileInputStream(downloaded));
try {
byte[] buffer = new byte[65536];
int noRead;
while ((noRead = is.read(buffer)) != -1) {
System.out.write(buffer, 0, noRead);
}
} finally {
is.close();
ftp.disconnect();
}

How to read file with cyrillic path

I have the following code:
String f = LauncherFrame.class.getProtectionDomain().getCodeSource().getLocation().getPath(); // path to launcher
java.lang.System.out.println(f);
String launcherHash = "";
try {
MessageDigest md5 = MessageDigest.getInstance("MD5");
launcherHash = calculateHash(md5, f);
} catch (Exception e) {
java.lang.System.out.println(e);
return;
}
calculateHash function:
public static String calculateHash(MessageDigest algorithm,String fileName) throws Exception{
FileInputStream fis = new FileInputStream(fileName);
BufferedInputStream bis = new BufferedInputStream(fis);
DigestInputStream dis = new DigestInputStream(bis, algorithm);
while (dis.read() != -1);
byte[] hash = algorithm.digest();
return byteArray2Hex(hash);
}
It works fine on Unix/Windows when the path to my .jar file has no Cyrillic characters. But when it does, I get the following exception:
java.io.FileNotFoundException:
C:\Users\%d0%90%d1%80%d1%82%d1%83%d1%80\Documents\NetBeansProjects\artcraft-client\build\classes (Can't find file)
How can I fix it?

This is from memory, so the syntax may be a bit off, but the best option is probably to use the built-in URL support for opening streams directly from a URL:
URL url = LauncherFrame.class.getProtectionDomain().getCodeSource().getLocation();
Then pass the URL to calculateHash instead of the file name and use URL's openStream method to get a stream with the content:
InputStream is = url.openStream();
BufferedInputStream bis = new BufferedInputStream(is);
...
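The underlying cause is that URL.getPath() returns the path in percent-encoded form, which FileInputStream then treats as a literal file name. A possible shape for the URL-based variant is sketched below; UrlHasher and hashOf are hypothetical names, and the hex conversion stands in for the byteArray2Hex helper from the question, whose code is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; hashes whatever content the URL points at.
final class UrlHasher {

    /** Hashes the content behind a URL and returns the digest as lowercase hex. */
    static String hashOf(URL url, String algorithm) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        try (InputStream in = new DigestInputStream(url.openStream(), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading drives the digest; the bytes themselves are not needed
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

If a file path is needed rather than a stream, converting with new File(url.toURI()) decodes the percent escapes, unlike url.getPath().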
