Downloading a PDF file from a protected webpage


I've been trying to get this working for a couple of days, and I'm running out of time because the project is due tomorrow, so I'm hoping someone can help. I'm trying to download a PDF file from this link, which points to a page that serves PDF content. I tried Jsoup, but Jsoup parses HTML and cannot handle a response that is a PDF. This is the code I've been using:
System.out.println("opening connection");
URL url = new URL("https://www.capitaliq.com/CIQDotNet/Filings/DocumentRedirector.axd?versionId=1257051021&type=pdf&forcedownload=false");
InputStream in = url.openStream();
FileOutputStream fos = new FileOutputStream("/Users/HIDDEN/Desktop/fullreport.pdf");
System.out.println("reading file...");
int length = -1;
byte[] buffer = new byte[1024]; // buffer for a portion of data from the connection
while ((length = in.read(buffer)) > -1) {
    fos.write(buffer, 0, length);
}
fos.close();
in.close();
System.out.println("file was downloaded");
The problem is that the request is automatically redirected to a login page that asks for a username and password. So I need a way to log in to my account and then fetch the page without using Jsoup (which, as mentioned, cannot read PDF content). Could someone look at the HTML of the login page and adjust this code so that it logs in and then downloads the PDF? I would be eternally grateful. Thank you!

HtmlUnit is what I use for stuff like this, especially when speed is not critical.
Here's a random-ish piece of pseudocode from another one of my answers:
WebClient wc = new WebClient(BrowserVersion.CHROME);
HtmlPage p = wc.getPage(url);
((HtmlTextInput) p.getElementById(userNameId)).setText(userName);
// An <input type="password"> field is an HtmlPasswordInput, not an HtmlTextInput
((HtmlPasswordInput) p.getElementById(passId)).setText(pass);
p = ((HtmlElement) p.getElementById(submitBtnId)).click();
// Just as an example for something I've had to do, I use
// UnexpectedPage when the "content-type" is "application/zip"
UnexpectedPage up = ((HtmlElement) p.getElementById(downloadBtn)).click();
InputStream in = up.getInputStream();
...
Then use another library to read the PDF.
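However the InputStream is obtained, it is worth checking that the bytes really are a PDF before saving them, because a failed login usually returns the HTML login page instead. Here is a minimal sketch using only the JDK, assuming the stream is the `in` from the snippet above. The class and method names are made up for illustration and are not part of HtmlUnit. The check relies on the fact that a PDF file starts with the bytes `%PDF-`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: saves a stream only if it looks like a PDF.
final class PdfSaver {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the data begins with the "%PDF-" signature.
    static boolean looksLikePdf(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the whole stream into memory, checks the signature, then writes the file.
    static void savePdf(InputStream in, Path target) throws IOException {
        byte[] data = in.readAllBytes(); // requires Java 9 or later
        if (!looksLikePdf(data)) {
            throw new IOException("Response is not a PDF (possibly the login page)");
        }
        Files.write(target, data);
    }
}
```

Reading everything into memory keeps the sketch short. For very large reports, it would be better to check only the first few bytes and stream the rest.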

Related

Not able to generate multiple documents using ServletOutputStream in Java [duplicate]

For example, I would like to download one zip file and one CSV file in one response. Is there any way to do this other than compressing the two files into one zip file?

Although ServletResponse is not meant to do this, we can programmatically tweak it to send multiple files, which all client browsers except IE seem to handle properly. A sample code snippet is given below.
response.setContentType("multipart/x-mixed-replace;boundary=END");
ServletOutputStream out = response.getOutputStream();
for (File f : files) {
    // Each part starts with the boundary, its own headers, and a blank line
    out.println("--END");
    out.println("Content-Type: application/octet-stream");
    out.println();
    // Stream the current file f into the response
    try (BufferedInputStream fif = new BufferedInputStream(new FileInputStream(f))) {
        int data;
        while ((data = fif.read()) != -1) {
            out.write(data);
        }
    }
    out.println();
    out.flush();
}
// The closing boundary ends the multipart body
out.println("--END--");
out.flush();
out.close();
This will not work in IE browsers.
N.B. try/catch blocks are not included.

Code developed by Jason Hunter for handling servlet requests and responses with multiple parts has been the de facto standard for years. You can find it at servlets.com.

No, you cannot do that. The reason is that whenever you want to send data in a request, you use the stream available in the request and retrieve the data using request.getPart("streamParamName").getInputStream() (Servlet 3.0 and later). Also note that once you have consumed this stream, you will not be able to read it again.
The example mentioned above is a tweak that Google also uses when sending multipart email with multiple attachments. To achieve that, they define boundaries for each attachment, and the client has to take care of these boundaries when retrieving and rendering the information.
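To make the boundary handling concrete, here is a minimal sketch of how a multipart body is framed, written against a plain OutputStream so it runs without a servlet container. The class name, method name, and the choice to key the parts by `Content-Type` are illustrative assumptions. In a servlet, the target stream would be `response.getOutputStream()`. Each part starts with `--boundary`, carries its own headers followed by a blank line, and the whole body ends with `--boundary--`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: frames several payloads as one multipart body.
final class MultipartSketch {
    private static final String CRLF = "\r\n";

    // parts maps a Content-Type value to the bytes of that part (hypothetical shape;
    // two parts with the same type would need a different data structure).
    static void writeParts(OutputStream out, String boundary, Map<String, byte[]> parts)
            throws IOException {
        for (Map.Entry<String, byte[]> part : parts.entrySet()) {
            writeAscii(out, "--" + boundary + CRLF);
            writeAscii(out, "Content-Type: " + part.getKey() + CRLF);
            writeAscii(out, CRLF); // blank line separates headers from data
            out.write(part.getValue());
            writeAscii(out, CRLF);
        }
        writeAscii(out, "--" + boundary + "--" + CRLF); // closing delimiter
        out.flush();
    }

    private static void writeAscii(OutputStream out, String s) throws IOException {
        out.write(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("text/csv", "a,b\n1,2\n".getBytes(StandardCharsets.US_ASCII));
        parts.put("application/zip", new byte[] {'P', 'K', 3, 4});
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        writeParts(body, "END", parts);
        System.out.println(body.size() + " bytes written");
    }
}
```

The receiving side has to split on the same boundary string, which is why the boundary must not occur inside any of the payloads.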

Not able to get the pdf from internet

I am trying to download a PDF from the internet and save it to a local file.
I am using iText to read the PDF and a Java stream to write it.
After writing, the new file cannot be opened in a PDF reader.
PdfReader reader = new PdfReader(strURL);
FileOutputStream fos = new FileOutputStream(new File(fileName));
fos.write(reader.getPageContent(1));
fos.flush();
fos.close();
I am trying to get the PDF from this link.
I debugged a few things. Here are the findings.
reader.getEofPos()
gives 291633, which is the same as the file length. But
reader.getPageContent(1).length;
gives only 42360 bytes. Clearly, fewer bytes are read than the actual size.
The PDF has only one page:
reader.getNumberOfPages() = 1
Do I need to specify a few more things to the reader to read the entire PDF file?

If all you're trying to do is download a PDF from the internet and save it locally, you don't need to parse it at all. getPageContent(1) returns only the content stream of the first page, not the complete file, so writing those bytes out does not produce a valid PDF. A download is normally just an HTTP GET request, which you can do with something like this:
URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Charset", charset);
InputStream response = connection.getInputStream();
Once you get the response, you can save the bytes to a path of your choosing.
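The last step, saving the bytes, can be sketched with the JDK alone. This is a minimal example rather than a complete downloader: the class and method names are hypothetical, and it assumes the URL can be fetched without authentication.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: copies whatever the URL returns, byte for byte, into a local file.
final class UrlToFile {
    // Returns the number of bytes written to target.
    static long download(String url, Path target) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        try (InputStream response = connection.getInputStream()) {
            // Files.copy streams the data and reports how many bytes it copied
            return Files.copy(response, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Because the bytes are copied unchanged, the saved file should have the same length as the original (291633 bytes in the question), which is an easy sanity check.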

Protecting PDF using PDFBox

I'm really struggling with the documentation for PDFBox. For such a popular library, information seems a little thin on the ground (for me!).
Anyway, the problem I'm having relates to protecting the PDF. At the moment, all I want is to control the users' access permissions. Specifically, I want to prevent the user from modifying the PDF.
If I omit the access permission code, everything works perfectly. I read in a PDF from an external resource, then read and populate the fields and add some images before saving the new PDF. That all works perfectly.
The problem comes when I add the following code to manage the access:
/* Secure the PDF so that it cannot be edited */
try {
    String ownerPassword = "DSTE$gewRges43";
    String userPassword = "";
    AccessPermission ap = new AccessPermission();
    ap.setCanModify(false);
    StandardProtectionPolicy spp = new StandardProtectionPolicy(ownerPassword, userPassword, ap);
    pdf.protect(spp);
} catch (BadSecurityHandlerException ex) {
    Logger.getLogger(PDFManager.class.getName()).log(Level.SEVERE, null, ex);
}
When I add this code, all the text and images are stripped from the outgoing PDF. The fields are still present in the document, but they are all empty, and all the text and images that were part of the original PDF or that were added dynamically in the code are gone.
UPDATE:
OK, as best I can tell, the problem comes from a bug relating to the form fields. I'm going to try a different approach without the form fields and see what happens.

I found the solution to this problem. It appears that a PDF from an external source is sometimes already protected or encrypted.
If you get blank output after loading a PDF document from an external source and adding protections, you are probably working with an encrypted document. I have a stream processing system that works on PDF documents, so the following code works for me. If you are only working with PDF inputs, you could integrate the code below into your flow.
public InputStream convertDocument(InputStream dataStream) throws Exception {
    // Just acts as a pass-through since the data is already in PDF format.
    // The result is buffered in memory before being handed on.
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "2024768"); // for large files
    PDDocument doc = PDDocument.load(dataStream, true);
    if (doc.isEncrypted()) { // remove the security before adding protections
        doc.decrypt("");
        doc.setAllSecurityToBeRemoved(true);
    }
    doc.save(os);
    doc.close();
    dataStream.close();
    return new ByteArrayInputStream(os.toByteArray());
}
Now take the returned InputStream and use it for your security application:
ByteArrayOutputStream os = new ByteArrayOutputStream();
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "2024768");
InputStream dataStream = secureData.data();
PDDocument doc = PDDocument.load(dataStream, true);
AccessPermission ap = new AccessPermission();
// add whatever permissions you need
ap.setCanModify(false);
ap.setCanExtractContent(false);
ap.setCanPrint(false);
ap.setCanPrintDegraded(false);
ap.setReadOnly();
StandardProtectionPolicy spp = new StandardProtectionPolicy(UUID.randomUUID().toString(), "", ap);
doc.protect(spp);
doc.save(os);
doc.close();
dataStream.close();
// os.toByteArray() now holds the protected PDF
This should now return a proper document with no blank output!
The trick is to remove the encryption first!
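A note on handing a document from one processing step to the next as a stream: a `PipedOutputStream` has a small fixed buffer, so writing a whole document to it from the same thread that later reads the matching `PipedInputStream` can block once the buffer fills. Buffering in memory avoids that. Here is a minimal sketch using only JDK classes. The class, interface, and method names are assumptions for illustration, and a call such as `doc.save(out)` would be what the writer does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

final class InMemoryHandoff {
    // Hypothetical callback type: anything that writes a document to a stream.
    interface StreamWriter {
        void writeTo(OutputStream out) throws Exception;
    }

    // Runs the writer against a growable buffer, then exposes the result for reading.
    static InputStream toInputStream(StreamWriter writer) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writer.writeTo(buffer);
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```

The trade-off is that the whole document is held in memory. For very large files, a piped stream fed from a separate writer thread would be the alternative.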

Generating a pdf, and sending it to user with using HttpServletResponse(headers)

Hello everyone.
I'm trying to create a PDF and send it to the user without first saving the file on my server.
I'm using Hibernate + Struts2.
My sample code:
CreatePDF.java (class for generating the PDF)
Method buildPdf():
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try {
    document = new Document();
    PdfWriter.getInstance(document, baos);
    document.open();
    buildPage(document, snippet, snippetContent);
    document.close();
    response.setContentType("application/pdf");
    response.setContentLength(baos.size());
    response.setHeader("Content-Disposition", "attachment;filename=document.pdf");
    ServletOutputStream out = response.getOutputStream();
    baos.writeTo(out);
    out.flush();
    response.flushBuffer();
} catch (Exception e) {
    Log4jUtil.debug(logger, "Cannot build PDF file", e);
}
My sample action:
method index():
pdf = new CreatePDF();
pdf.buildPdf(snippet, snippetContent);
return SUCCESS;
Could you please check my code for errors?
Please help. I need ideas or example code to solve this.

First, Hibernate is completely irrelevant here. Struts2 is relevant, but you are not using it; you are using the plain (low-level) servlet API. That should probably work, but if your webapp is built around Struts2, it's not the recommended way. You should use the Stream result instead.
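Here is a rough sketch of what an action for the Stream result could look like. Struts2 actions can be plain classes, so this compiles without Struts on the classpath. The class name, the `inputStream` property name, and the placeholder bytes are assumptions. The real PDF bytes would come from the ByteArrayOutputStream that iText fills in the question. The assumed `struts.xml` mapping is shown in the leading comment.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical action. Assumed mapping in struts.xml:
//   <result name="success" type="stream">
//     <param name="contentType">application/pdf</param>
//     <param name="inputName">inputStream</param>
//     <param name="contentDisposition">attachment;filename="document.pdf"</param>
//   </result>
// In a real application, declare the class public in its own file.
class PdfDownloadAction {
    private InputStream inputStream;

    // Struts2 calls execute(); the stream result then reads getInputStream().
    public String execute() {
        byte[] pdfBytes = buildPdf();
        inputStream = new ByteArrayInputStream(pdfBytes);
        return "success";
    }

    public InputStream getInputStream() {
        return inputStream;
    }

    // Placeholder for the iText code from the question (baos.toByteArray()).
    private byte[] buildPdf() {
        return "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
    }
}
```

With this approach the action never touches HttpServletResponse. The framework sets the headers from the result parameters and copies the stream to the client.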

