How to set the default zoom value of a PDF response in Java

I have written a RESTful web service which returns a PDF file, and this PDF is displayed in an IFrame in the browser.
That part works fine.
The difficulty I am facing is that the PDF opens in the browser with the zoom value 'Automatic Zoom' selected, but I want it to open with the zoom value 'Page Width' selected.
Please find below the method which returns the PDF.
/**
 * @param filePath
 * @return Response object.
 */
private Response processRequest(final String filePath)
{
    File file = new File(filePath);
    PDPageFitDestination dest = new PDPageFitDestination();
    PDActionGoTo action = new PDActionGoTo();
    action.setDestination(dest);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    PDDocument pd = null;
    try
    {
        pd = PDDocument.load(file);
        pd.getDocumentCatalog().setOpenAction(action);
        pd.save(output);
    }
    catch(IOException e)
    {
        e.printStackTrace();
    }
    catch(COSVisitorException e)
    {
        e.printStackTrace();
    }
    //ResponseBuilder responseBuilder = Response.ok((Object)file);
    ResponseBuilder responseBuilder = Response.ok(output.toByteArray());
    responseBuilder.header("Content-Type", "application/pdf; filename=return.pdf");
    responseBuilder.header("Content-Disposition", "inline");
    return responseBuilder.build();
}
I think providing some header value specific to the zoom level would return the PDF with the zoom value 'Page Width' selected, but I cannot work out which header relates to it.
Please provide your suggestions in this regard.

I solved my problem. I just needed to use PDF-specific open parameters in the request URL. For details, see Adobe's PDF Open Parameters documentation.
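For illustration, a minimal sketch of that idea: the IFrame points at the service URL with open parameters appended (the parameter names come from the PDF Open Parameters specification; whether they are honoured depends on the browser's PDF viewer, and the URL below is a made-up example):
    // Hypothetical illustration: IFrame src = service URL plus open parameters.
    // "view=FitH" asks the viewer to fit the page width; "zoom=<percent>" would set a fixed zoom instead.
    String pdfUrl = "/rest/documents/return.pdf#view=FitH";
    String iframeHtml = "<iframe src=\"" + pdfUrl + "\" width=\"100%\" height=\"800\"></iframe>";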

Related

How to download embedded images from websites in Java

I am trying to download the first 20 images/comics from the xkcd website.
The code I've written allows me to download a text file of the website, or the image if I change the fileName to "xkcd.jpg" and the URL to "http://imgs.xkcd.com/comics/monty_python.jpg".
The problem is that I need to download the embedded image on the site without having to go back and forth copying the image URLs of each comic over and over; that defeats the purpose of this program. I am guessing I need a for loop at some point, but I can't write that if I don't know how to download the embedded image on the website itself.
I hope my explanation isn't too complicated.
Below is my code:
String fileName = "xkcd.txt";
URL url = new URL("http://xkcd.com/16/");
InputStream in = new BufferedInputStream(url.openStream());
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
int n = 0;
while (-1 != (n = in.read(buf))) {
    out.write(buf, 0, n);
}
out.close();
in.close();
byte[] response = out.toByteArray();
FileOutputStream fos = new FileOutputStream(fileName);
fos.write(response);
fos.close();
This can be solved using the debugging console of your browser and JSoup.
Finding the image URL
What we get from the debugging console (Firefox here, but this should work with any browser):
This already shows pretty clearly that the path to the comic itself would be the following:
html -> div with id "middleContainer" -> div with id "comic" -> img element
Just use "Inspect Element" (or whatever it's called in your browser) from the context menu, and the respective element should be highlighted (like in the screenshot).
I'll leave figuring out how the relevant elements and attributes can be extracted to you, since it's already covered in quite a few other questions and I don't want to ruin your project by doing all of it ;).
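For reference, a minimal JSoup sketch of that extraction (the selector simply follows the element path described above; treat it as an assumption to verify against the live page):
    // Hypothetical sketch: fetch one comic page and pull the image URL out of div#comic.
    Document doc = Jsoup.connect("http://xkcd.com/16/").get();
    String imageUrl = doc.select("#comic img").attr("abs:src");
    System.out.println(imageUrl);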
Now creating a list can be done in numerous ways:
The simple way:
Posts all come with a sequential ID. Simply start with the number of the first comic, extract that ID and increment or decrement it as needed. This works if you have a hard-coded link pointing to a specific comic.
A bit harder, but more generic
Actually these are two ways, assuming you start from xkcd.com:
1.)
There's a bit of text on the site, that helps finding the ID of the respective comic:
Extracting the ID from the plain-text HTML isn't too hard, since it's pre- and postfixed by some text that should be pretty unique on the site.
2.)
Directly extract the path of the previous or next comic from the buttons for going to the next/previous comic. As shown above, use the development console to extract the respective information from the HTML file. This method should be more bulletproof than the first, as it relies only on the structure of the HTML file rather than on its textual content.
Note though that any of the above methods only work by downloading the HTML-file in which a specific comic is embedded. The image-URL won't be of much help (other than brute-force searching, which you shouldn't do for a number of reasons).
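A minimal JSoup sketch of that second approach, assuming the Prev/Next buttons carry rel="prev"/rel="next" attributes (something the debugging console would confirm):
    // Hypothetical sketch: starting from the front page, follow the "Prev" link backwards.
    Document page = Jsoup.connect("http://xkcd.com/").get();
    String prevUrl = page.select("a[rel=prev]").attr("abs:href");
    Document previousComic = Jsoup.connect(prevUrl).get();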
You could use JSoup... and it would probably be a more stable option, but if you just wanted to hack something together you might choose the more fragile approach of parsing the HTML yourself:
package com.jbirdvegas.q41231970;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class Download {

    public static void main(String[] args) {
        Download download = new Download();
        // go through comics 1 - 19 (the end of IntStream.range is exclusive)
        IntStream.range(1, 20)
                // parse the image url from the html page
                .mapToObj(download::findImageLinkFromHtml)
                // download and save each item in the image url list
                .forEach(download::downloadImage);
    }

    /**
     * Warning: manual HTML parsing below...
     * <p>
     * get XKCD image url for a given pageNumber
     *
     * @param pageNumber index of a given cartoon image
     * @return url of the page's image
     */
    private String findImageLinkFromHtml(int pageNumber) {
        // text we are looking for
        String textToFind = "Image URL (for hotlinking/embedding):";
        String url = String.format("https://xkcd.com/%d/", pageNumber);
        try (InputStream inputStream = new URL(url).openConnection().getInputStream();
             BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
            Stream<String> stream = reader.lines();
            String foundLine = stream.filter(lineOfHtml -> lineOfHtml.contains(textToFind))
                    .collect(Collectors.toList()).get(0);
            String[] split = foundLine.split(":");
            return String.format("%s:%s", split[1], split[2]);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Download a url to a file
     *
     * @param url downloads an image to a local file
     */
    private void downloadImage(String url) {
        try {
            System.out.println("Downloading image url: " + url);
            URL image = new URL(url);
            ReadableByteChannel rbc = Channels.newChannel(image.openStream());
            String[] urlSplit = url.split("/");
            try (FileOutputStream fos = new FileOutputStream(urlSplit[urlSplit.length - 1])) {
                fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Outputs:
Downloading image url: http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/tree_cropped_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/island_color.jpg
Downloading image url: http://imgs.xkcd.com/comics/landscape_cropped_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/blownapart_color.jpg
Downloading image url: http://imgs.xkcd.com/comics/irony_color.jpg
Downloading image url: http://imgs.xkcd.com/comics/girl_sleeping_noline_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/red_spiders_small.jpg
Downloading image url: http://imgs.xkcd.com/comics/firefly.jpg
Downloading image url: http://imgs.xkcd.com/comics/pi.jpg
Downloading image url: http://imgs.xkcd.com/comics/barrel_mommies.jpg
Downloading image url: http://imgs.xkcd.com/comics/poisson.jpg
Downloading image url: http://imgs.xkcd.com/comics/canyon_small.jpg
Downloading image url: http://imgs.xkcd.com/comics/copyright.jpg
Downloading image url: http://imgs.xkcd.com/comics/just_alerting_you.jpg
Downloading image url: http://imgs.xkcd.com/comics/monty_python.jpg
Downloading image url: http://imgs.xkcd.com/comics/what_if.jpg
Downloading image url: http://imgs.xkcd.com/comics/snapple.jpg
Downloading image url: http://imgs.xkcd.com/comics/george_clinton.jpg
Also note there are plenty of issues with parsing websites... xkcd particularly likes helping parser developers find bugs :D See https://xkcd.com/859/ for an example.

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract images from a PDF using PDFBox. I have taken help from this post. It worked for some of the PDFs, but for others/most it did not. For example, I am not able to extract the figures in this file.
After doing some research I found that PDResources.getImages is deprecated, so I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find a solution. Please assist if anyone can.
////// UPDATE AS A REPLY TO COMMENTS //////
I am using pdfbox-1.8.10
Here is the code:
public void getimg() throws Exception {
    try {
        String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
        String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
        File oldFile = new File(sourceDir);
        if (oldFile.exists()) {
            PDDocument document = PDDocument.load(sourceDir);
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            String fileName = oldFile.getName().replace(".pdf", "_cover");
            int totalImages = 1;
            for (PDPage page : list) {
                PDResources pdResources = page.getResources();
                Map pageImages = pdResources.getXObjects();
                if (pageImages != null) {
                    Iterator imageIter = pageImages.keySet().iterator();
                    while (imageIter.hasNext()) {
                        String key = (String) imageIter.next();
                        Object obj = pageImages.get(key);
                        if (obj instanceof PDXObjectImage) {
                            PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
                            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                            totalImages++;
                        }
                    }
                }
            }
        } else {
            System.err.println("File not exist");
        }
    }
    catch (Exception e) {
        System.err.println(e.getMessage());
    }
}
//// PARTIAL SOLUTION ////
I have solved the problem of the error message and have updated the code in the post accordingly. However, the problem remains the same: I am still not able to extract the images from a few of the files, like the one I have mentioned in this post. Any solution in that regard?
The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so it is necessary to check the instance. The second problem is that the code doesn't walk PDXObjectForm recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels.
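For illustration, a condensed sketch of that recursive walk against the PDFBox 1.8 API (the full, maintained versions are the ExtractImages tools linked below; the helper name here is made up):
    // Hypothetical helper: images directly in the resources are written out,
    // form XObjects are descended into because they carry their own resources.
    private void extractImages(PDResources resources, String prefix) throws IOException {
        Map<String, PDXObject> xObjects = resources.getXObjects();
        if (xObjects == null) {
            return;
        }
        int count = 0;
        for (PDXObject xObject : xObjects.values()) {
            if (xObject instanceof PDXObjectImage) {
                ((PDXObjectImage) xObject).write2file(prefix + "_" + (++count));
            } else if (xObject instanceof PDXObjectForm) {
                PDResources formResources = ((PDXObjectForm) xObject).getResources();
                if (formResources != null) {
                    extractImages(formResources, prefix);
                }
            }
        }
    }
    // In 1.8, start the walk with page.findResources() rather than page.getResources().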
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
The fourth problem is that your file doesn't have any XObjects at all. All the "graphics" are really vector drawings, and these can't be "extracted" like embedded images. All you could do is convert the PDF pages to images and then mark and cut out what you need.
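A minimal sketch of that page-to-image conversion with the 1.8 API (resolution and output format are arbitrary choices here; needs java.awt.image.BufferedImage and javax.imageio.ImageIO):
    // Render each page to a PNG instead of looking for embedded images.
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    int pageNumber = 1;
    for (PDPage page : pages) {
        BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
        ImageIO.write(image, "png", new File(destinationDir + "page_" + (pageNumber++) + ".png"));
    }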

Converting a docx containing a chart to PDF

I've got a docx4j-generated file which contains several tables, titles and, finally, an Excel-generated curve chart.
I have tried many approaches to convert this file to PDF, but did not get any successful result.
docx4j with XSL-FO did not work; most of the things included in the docx file are not yet implemented and show up as red text reading "not implemented".
JODConverter did not work either; I got a resulting PDF in which everything was pretty good (just little formatting/styling issues), BUT the graph did not show up.
Finally, the closest approach was using Apache POI: the resulting PDF was identical to my docx file, but still no chart showing up.
I already know Aspose would solve this pretty easily, but I am looking for an open source, free solution.
The code I am using with Apache POI is as follows:
public static void convert(String inputPath, String outputPath)
        throws XWPFConverterException, IOException {
    PdfConverter converter = new PdfConverter();
    converter.convert(new XWPFDocument(new FileInputStream(new File(inputPath))),
            new FileOutputStream(new File(outputPath)), PdfOptions.create());
}
I do not know what to do to get the chart inside the PDF, could anybody tell me how to proceed?
Thanks in advance.
I don't know if this helps you, but you could use "Jacob" (I don't know if it's possible with Apache POI or docx4j).
With this solution you open Word yourself and export the document as PDF.
Word needs to be installed on the computer!
Here's the download page: http://sourceforge.net/projects/jacob-project/
try {
    if (System.getProperty("os.arch").contains("64")) {
        System.load(DLL_64BIT_PATH);
    } else {
        System.load(DLL_32BIT_PATH);
    }
} catch (UnsatisfiedLinkError e) {
    //TODO
}

ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
oleComponent.setProperty("Visible", false);
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch activeDoc = Dispatch.call(document, "Open", fileName).toDispatch();
// https://msdn.microsoft.com/EN-US/library/office/ff845579.aspx
Dispatch.call(activeDoc, "ExportAsFixedFormat", new Object[] { "path to pdfFile.pdf", new Integer(17), false, 0 });
Object args[] = { new Integer(0) }; //private static final int DO_NOT_SAVE_CHANGES = 0;
Dispatch.call(activeDoc, "Close", args);
Dispatch.call(oleComponent, "Quit");

Save a file from a website with Java

I'm trying to build a Jsoup-based Java app to automatically download English subtitles for films (I'm lazy, I know; it was inspired by a similar Python-based app). It's supposed to ask you the name of the film and then download an English subtitle for it from Subscene.
I can make it reach the download link, but I get an "Unhandled content type" error when I try to 'go' to that link. Here's my code:
public static void main(String[] args) {
    try {
        String videoName = JOptionPane.showInputDialog("Title: ");
        subscene(videoName);
    }
    catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

public static void subscene(String videoName) {
    try {
        String siteName = "http://www.subscene.com";
        String[] splits = videoName.split("\\s+");
        String codeName = "";
        String text = "";
        if (splits.length > 1) {
            for (int i = 0; i < splits.length; i++) {
                codeName = codeName + splits[i] + "-";
            }
            videoName = codeName.substring(0, videoName.length());
        }
        System.out.println("videoName is " + videoName);
        // String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
        String url = "http://www.subscene.com/subtitles/title?q=" + videoName + "&l=";
        System.out.println("url is " + url);
        Document doc = Jsoup.connect(url).get();
        Element exact = doc.select("h2.exact").first();
        Element yuel = exact.nextElementSibling();
        Elements lis = yuel.children();
        System.out.println(lis.first().children().text());
        String hRef = lis.select("div.title > a").attr("href");
        hRef = siteName + hRef + "/english";
        System.out.println("hRef is " + hRef);
        doc = Jsoup.connect(hRef).get();
        Element nonHI = doc.select("td.a40").first();
        Element papa = nonHI.parent();
        Element link = papa.select("a").first();
        text = link.text();
        System.out.println("Subtitle is " + text);
        hRef = link.attr("href");
        hRef = siteName + hRef;
        Document subDownloadPage = Jsoup.connect(hRef).get();
        hRef = siteName + subDownloadPage.select("a#downloadButton").attr("href");
        Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
    }
    catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted, because then I won't be able to read the name of the downloaded zip file (I want to unzip it after saving, using Java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
    throw new IllegalStateException("invalid file content");
}
...
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
It looks like Jsoup expects the result of Jsoup.connect(hRef) to be HTML or some text that it's able to parse; that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually, and the last URL you're trying to access returns a content type of application/x-zip-compressed, hence the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection/URL, or use a third-party library like Apache HttpComponents to fire a GET request, retrieve the result as an InputStream, wrap it into a proper writer and write the file to your disk.
Here's an example about doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
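Since the asker also wants to read the name of the downloaded archive's contents and unzip it afterwards, here is a minimal sketch using java.util.zip (the path "D:\\foo.zip" is carried over from the snippet above; extracting into the working directory is an arbitrary choice; needs java.util.zip.ZipInputStream and ZipEntry):
    // Open the archive that was just saved and extract every entry next to it.
    try (ZipInputStream zis = new ZipInputStream(new FileInputStream("D:\\foo.zip"))) {
        ZipEntry entry;
        byte[] buf = new byte[4096];
        while ((entry = zis.getNextEntry()) != null) {
            System.out.println("Extracting " + entry.getName()); // e.g. the subtitle file name
            try (FileOutputStream fos = new FileOutputStream(entry.getName())) {
                int len;
                while ((len = zis.read(buf)) > 0) {
                    fos.write(buf, 0, len);
                }
            }
            zis.closeEntry();
        }
    }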

InputStream handled by different objects depending on the content

I am writing a crawler/parser that should be able to process different types of content: RSS, Atom and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.
Unfortunately, checking the content type using the method provided in URLConnection doesn't always work. For example,
String contentType = url.openConnection().getContentType();
doesn't always provide the correct content type (e.g. "text/html" where it should be RSS) or doesn't allow distinguishing between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design, I first wrote a separate class that determines the correct content type; next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when the method 'parse()' is called, downloads the entire InputStream a second time.
public Parser createParser() {
    InputStream inputStream = null;
    String contentType = null;
    String contentEncoding = null;
    ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
    Parser parser = null;
    try {
        inputStream = new BufferedInputStream(this.url.openStream());
        contentTypeParser.parse(inputStream);
        contentType = contentTypeParser.getContentType();
        contentEncoding = contentTypeParser.getContentEncoding();
        assert (contentType != null);
        inputStream = new BufferedInputStream(this.url.openStream());
        if (contentType.equals(ContentTypes.rss))
        {
            logger.info("RSS feed detected");
            parser = new RssParser(this.url);
            parser.parse(inputStream);
        }
        else if (contentType.equals(ContentTypes.atom))
        {
            logger.info("Atom feed detected");
            parser = new AtomParser(this.url);
        }
        else if (contentType.equals(ContentTypes.html))
        {
            logger.info("html detected");
            parser = new HtmlParser(this.url);
            parser.setContentEncoding(contentEncoding);
        }
        else if (contentType.equals(ContentTypes.UNKNOWN))
            logger.debug("Unable to recognize content type");
        if (parser != null)
            parser.parse(inputStream);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return parser;
}
Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".
Any help would be greatly appreciated!
Side note 1: Just for the sake of completeness, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but it returns null way too often.
Side note 2: The XML parsers (Atom and RSS) are based on SAXParser, the HTML parser on Jsoup.
Can you just call mark and reset?
inputStream = new BufferedInputStream(this.url.openStream());
inputStream.mark(2048); // Or some other sensible number
contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();
inputStream.reset(); // Let the parser have a crack at it now
Perhaps your ContentTypeParser should cache the content internally and feed it to the appropriate ContentParser instead of reacquiring data from the InputStream.
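A minimal sketch of that caching idea (ContentTypeParser and the parser variable are the ones from the question; everything else is illustrative):
    // Read the URL once and keep the bytes around.
    byte[] content;
    try (InputStream in = this.url.openStream();
         ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        content = buffer.toByteArray();
    }
    // Sniff the content type from the cached bytes...
    contentTypeParser.parse(new ByteArrayInputStream(content));
    // ...and later hand the chosen parser a fresh stream over the same bytes.
    parser.parse(new ByteArrayInputStream(content));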
