I am trying to download the first 20 images/comics from the xkcd website.
The code I've written lets me download the site as a text file, or an image if I change fileName to "xkcd.jpg" and the URL to "http://imgs.xkcd.com/comics/monty_python.jpg".
The problem is that I need to download the image embedded in each page without copying the image URL of every comic back and forth by hand, which defeats the purpose of the program. I'm guessing I need a for-loop at some point, but that won't help until I know how to download the embedded image from the page itself.
I hope my explanation isn't too complicated.
Below is my code
String fileName = "xkcd.txt";
URL url = new URL("http://xkcd.com/16/");
InputStream in = new BufferedInputStream(url.openStream());
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
int n = 0;
while (-1 != (n = in.read(buf))) {
out.write(buf, 0, n);
}
out.close();
in.close();
byte[] response = out.toByteArray();
FileOutputStream fos = new FileOutputStream(fileName);
fos.write(response);
fos.close();
This can be solved using your browser's debugging console and jsoup.
Finding the image URL
What we get from the debugging console (Firefox here, but this should work in any browser):
This already shows pretty clearly that the path to the comic itself is the following:
html -> div with id "middleContainer" -> div with id "comic" -> img element
Just use "Inspect Element" (or whatever it's called in your browser) from the context menu, and the respective element should be highlighted (like in the screenshot).
I'll leave figuring out how to extract the relevant elements and attributes to you, since it's already covered in quite a few other questions and I don't want to ruin your project by doing all of it ;).
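For orientation only, here's a minimal sketch of the idea using nothing but the JDK's regex classes (jsoup would be more robust). The HTML string below is a deliberately trimmed stand-in for the real page source, so there's still something left for you to do:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ComicImageExtractor {
    // Pulls the src attribute of the <img> inside <div id="comic">.
    // DOTALL keeps the match working when the div spans multiple lines.
    static String extractComicImageUrl(String html) {
        Pattern p = Pattern.compile(
                "<div id=\"comic\">.*?<img[^>]*\\bsrc=\"([^\"]+)\"",
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Trimmed-down stand-in for the real page source
        String html = "<div id=\"middleContainer\">"
                + "<div id=\"comic\">"
                + "<img src=\"//imgs.xkcd.com/comics/monty_python.jpg\" alt=\"Monty Python\"/>"
                + "</div></div>";
        // prints //imgs.xkcd.com/comics/monty_python.jpg
        System.out.println(extractComicImageUrl(html));
    }
}
```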
Now creating a list can be done in numerous ways:
The simple way:
Comics all come with a sequential ID. Simply start from the ID of a known comic and increment or decrement that number in the URL. This works if you have a hard-coded link pointing to a specific comic.
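Since the IDs are sequential, generating the first twenty page URLs takes only a couple of lines (a sketch; whether you use http or https doesn't change the idea):

```java
import java.util.ArrayList;
import java.util.List;

public class ComicUrls {
    // Build the page URLs for comics 1..n directly from the sequential IDs
    static List<String> firstN(int n) {
        List<String> urls = new ArrayList<>();
        for (int id = 1; id <= n; id++) {
            urls.add("https://xkcd.com/" + id + "/");
        }
        return urls;
    }

    public static void main(String[] args) {
        firstN(20).forEach(System.out::println);
    }
}
```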
A bit harder, but more generic
Actually these are two ways, assuming you start from xkcd.com:
1.)
There's a bit of text on the site that helps with finding the ID of the respective comic:
Extracting the ID from the plain-text HTML isn't too hard, since it's prefixed and postfixed by text that should be pretty unique on the site.
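As a sketch of this approach: assuming the marker text reads "Permanent link to this comic: https://xkcd.com/NNN/" (verify the exact wording against the live page source), the ID can be pulled out with a regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ComicIdExtractor {
    // The marker text is an assumption; check it against the live page source.
    private static final Pattern PERMALINK = Pattern.compile(
            "Permanent link to this comic:\\s*https?://xkcd\\.com/(\\d+)/");

    static int extractId(String html) {
        Matcher m = PERMALINK.matcher(html);
        if (m.find()) {
            return Integer.parseInt(m.group(1));
        }
        return -1; // marker not found
    }

    public static void main(String[] args) {
        // prints 16
        System.out.println(extractId(
                "Permanent link to this comic: https://xkcd.com/16/"));
    }
}
```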
2.)
Directly extract the path of the previous or next comic from the elements of the buttons for going to the next/previous comic. As shown above, use the development console to find the respective elements in the HTML. This method should be more bulletproof than the first, as it relies only on the structure of the HTML rather than on surrounding text.
Note though that all of the above methods only work by downloading the HTML file in which a specific comic is embedded. The image URL alone won't be of much help (other than brute-force searching, which you shouldn't do for a number of reasons).
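As a sketch of method 2, here's a stdlib-only extraction of the prev/next hrefs from a hardcoded snippet modeled on xkcd's nav buttons. The `rel="prev"`/`rel="next"` attributes are an assumption to verify against the live page; jsoup's `select("a[rel=prev]")` would be cleaner and attribute-order-independent:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrevNextExtractor {
    // Handles both attribute orders (rel before href and href before rel).
    static String extractHref(String html, String rel) {
        Pattern p = Pattern.compile(
                "<a[^>]*\\brel=\"" + rel + "\"[^>]*\\bhref=\"([^\"]+)\"|"
              + "<a[^>]*\\bhref=\"([^\"]+)\"[^>]*\\brel=\"" + rel + "\"");
        Matcher m = p.matcher(html);
        if (!m.find()) return null;
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        String html = "<a rel=\"prev\" href=\"/15/\" accesskey=\"p\">&lt; Prev</a>";
        // prints the relative path /15/ for the snippet above
        System.out.println(extractHref(html, "prev"));
    }
}
```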
You could use jsoup, and it would probably be the more stable option, but if you just want to hack something together you might choose the more fragile approach of parsing the HTML by hand:
package com.jbirdvegas.q41231970;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
public class Download {
public static void main(String[] args) {
Download download = new Download();
// go through each number 1 - 19 (IntStream.range excludes the
// upper bound; use IntStream.rangeClosed(1, 20) for all 20)
IntStream.range(1, 20)
// parse the image url from the html page
.mapToObj(download::findImageLinkFromHtml)
// download and save each item in the image url list
.forEach(download::downloadImage);
}
/**
 * Warning: manual HTML parsing below...
 * <p>
 * Gets the XKCD image url for a given pageNumber.
 *
 * @param pageNumber index of a given cartoon image
 * @return url of the page's image
 */
private String findImageLinkFromHtml(int pageNumber) {
// text we are looking for
String textToFind = "Image URL (for hotlinking/embedding):";
String url = String.format("https://xkcd.com/%d/", pageNumber);
try (InputStream inputStream = new URL(url).openConnection().getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
Stream<String> stream = reader.lines();
String foundLine = stream.filter(lineOfHtml -> lineOfHtml.contains(textToFind))
.collect(Collectors.toList()).get(0);
String[] split = foundLine.split(":");
return String.format("%s:%s", split[1], split[2]);
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
/**
 * Downloads a url to a file.
 *
 * @param url image url to download to a local file
 */
private void downloadImage(String url) {
System.out.println("Downloading image url: " + url);
String[] urlSplit = url.split("/");
// try-with-resources closes the channel and stream even on failure
try (ReadableByteChannel rbc = Channels.newChannel(new URL(url).openStream());
FileOutputStream fos = new FileOutputStream(urlSplit[urlSplit.length - 1])) {
fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Outputs:
Downloading image url: http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/tree_cropped_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/island_color.jpg
Downloading image url: http://imgs.xkcd.com/comics/landscape_cropped_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/blownapart_color.jpg
Downloading image url: http://imgs.xkcd.com/comics/irony_color.jpg
Downloading image url: http://imgs.xkcd.com/comics/girl_sleeping_noline_(1).jpg
Downloading image url: http://imgs.xkcd.com/comics/red_spiders_small.jpg
Downloading image url: http://imgs.xkcd.com/comics/firefly.jpg
Downloading image url: http://imgs.xkcd.com/comics/pi.jpg
Downloading image url: http://imgs.xkcd.com/comics/barrel_mommies.jpg
Downloading image url: http://imgs.xkcd.com/comics/poisson.jpg
Downloading image url: http://imgs.xkcd.com/comics/canyon_small.jpg
Downloading image url: http://imgs.xkcd.com/comics/copyright.jpg
Downloading image url: http://imgs.xkcd.com/comics/just_alerting_you.jpg
Downloading image url: http://imgs.xkcd.com/comics/monty_python.jpg
Downloading image url: http://imgs.xkcd.com/comics/what_if.jpg
Downloading image url: http://imgs.xkcd.com/comics/snapple.jpg
Downloading image url: http://imgs.xkcd.com/comics/george_clinton.jpg
Also note there are plenty of gotchas when parsing websites... xkcd particularly likes helping parser developers find bugs :D (see https://xkcd.com/859/ for an example).
Related
I'm working on a .opus music library software which converts audio/video files to .opus files and tags them with metadata automatically.
Previous versions of the program have saved the album art as binary data apparently as revealed by exiftool.
The thing is that when I run the command to output data as binary using the -b option, seemingly the entire output is binary. I'm not sure how to get the program to parse it. I was kind of expecting an entry like Picture : 11010010101101101011....
The output looks similar to this though:
How can I parse the picture data so I can reconstruct the image for newer versions of the program? (I'm using Java8_171 on Kubuntu 18.04)
It looks like you're trying to open the raw bytes in a text editor, which will of course give you gobbledygook, since those raw bytes don't represent characters that any text editor can display. I can see from your exiftool output that you know the length of the image in bytes. Provided you also know the byte position where the image starts in the file, the task is relatively easy with a little Java code. If you can get the starting position of the image inside your file, you should be able to do something like:
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.*;
public class SaveImage {
public static void main(String[] args) throws IOException {
byte[] imageBytes;
try (RandomAccessFile binaryReader =
new RandomAccessFile("your-file.xxx", "r")) {
int dataLength = 0; // Assign this the byte length shown in your
// post instead of zero
long startPos = 0; // Byte offset where the image starts.
// I assume you can find this somehow.
imageBytes = new byte[dataLength];
binaryReader.seek(startPos); // jump to the image's start in the file
binaryReader.readFully(imageBytes); // note: read()'s offset argument
// indexes into the array, not the file
}
try (InputStream in = new ByteArrayInputStream(imageBytes)) {
BufferedImage bImageFromConvert = ImageIO.read(in);
ImageIO.write(bImageFromConvert,
"jpg", // or whatever file format is appropriate
new File("/path/to/your/file.jpg"));
}
}
}
I have written a RESTful web service which returns a PDF file, and this PDF appears in an iframe in the browser.
That part works fine.
The difficulty I'm facing is that the PDF opens in the browser with the zoom value 'Automatic Zoom' selected, but I want to show the PDF with the zoom value 'Page Width' selected.
Please find below the method which returns the PDF.
/**
 * @param filePath path of the PDF file to serve
 * @return Response object.
 */
private Response processRequest(final String filePath)
{
File file = new File(filePath);
PDPageFitDestination dest = new PDPageFitDestination();
PDActionGoTo action = new PDActionGoTo();
action.setDestination(dest);
ByteArrayOutputStream output = new ByteArrayOutputStream();
PDDocument pd=null;
try
{
pd = PDDocument.load(file);
pd.getDocumentCatalog().setOpenAction(action);
pd.save(output);
}
catch(IOException e)
{
e.printStackTrace();
}
catch(COSVisitorException e)
{
e.printStackTrace();
}
//ResponseBuilder responseBuilder = Response.ok((Object)file);
ResponseBuilder responseBuilder = Response.ok(output.toByteArray());
responseBuilder.header("Content-Type", "application/pdf; filename=return.pdf");
responseBuilder.header("Content-Disposition", "inline");
return responseBuilder.build();
}
I think providing some header value specific to the zoom level would return the PDF with 'Page Width' selected, but I can't figure out which header relates to it.
Please provide your suggestions in this regard.
I solved my problem. I just needed to use PDF-specific parameters in the request URL. For details, go to PDFOpenParameters.
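For the record, the fix boils down to appending an open parameter to the URL the iframe loads. A hedged sketch (the servlet path is a placeholder; view=FitH is the Adobe open parameter closest to 'Page Width', and pdf.js-based viewers also understand zoom=page-width):

```java
public class PdfUrl {
    // Append the open-parameter fragment to whatever URL the iframe uses.
    static String withPageWidthZoom(String pdfUrl) {
        return pdfUrl + "#view=FitH";
    }

    public static void main(String[] args) {
        // prints /myapp/report.pdf#view=FitH
        System.out.println(withPageWidthZoom("/myapp/report.pdf"));
    }
}
```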
I have an existing PDF from which I want to retrieve images.
NOTE:
In the documentation, this is the RESULT variable:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
I am not getting why this image is needed? I just want to extract the images from my PDF file.
So now when I use MyImageRenderListener listener = new MyImageRenderListener(RESULT);
I am getting the error:
results\part4\chapter15\Img16.jpg (The system cannot find the path specified)
This is the code that I have.
package part4.chapter15;
import java.io.IOException;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
/**
* Extracts images from a PDF file.
*/
public class ExtractImages {
/** The sample PDF file from which images will be extracted. */
public static final String RESOURCE = "resources/pdfs/samplefile.pdf";
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
/**
 * Parses a PDF and extracts all the images.
 * @param filename the source PDF
 */
public void extractImages(String filename)
throws IOException, DocumentException {
PdfReader reader = new PdfReader(filename);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener(RESULT);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
parser.processContent(i, listener);
}
reader.close();
}
/**
 * Main method.
 * @param args no arguments needed
 * @throws DocumentException
 * @throws IOException
 */
public static void main(String[] args) throws IOException, DocumentException {
new ExtractImages().extractImages(RESOURCE);
}
}
You have two questions and the answer to the first question is the key to the answer of the second.
Question 1:
You refer to:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
And you ask: why is this image needed?
That question is wrong, because Img%s.%s is not a filename of an image, it's a pattern of the filename of an image. While parsing, iText will detect images in the PDF. These images are stored in numbered objects (e.g. object 16) and these images can be exported in different formats (e.g. jpg, png,...).
Suppose that an image is stored in object 16 and that this image is a jpg, then the pattern will resolve to Img16.jpg.
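That resolution is just String.format at work, which a one-liner makes concrete:

```java
public class ResultPattern {
    public static final String RESULT = "results/part4/chapter15/Img%s.%s";

    public static void main(String[] args) {
        // object number 16, file type jpg
        // prints: results/part4/chapter15/Img16.jpg
        System.out.println(String.format(RESULT, 16, "jpg"));
    }
}
```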
Question 2:
Why do I get an error:
results\part4\chapter15\Img16.jpg (The system cannot find the path specified)
In your PDF, there's a jpg stored in object 16. You are asking iText to store that image using this path: results\part4\chapter15\Img16.jpg (as explained in my answer to Question 1). However, your working directory doesn't have the subdirectories results\part4\chapter15\, hence an IOException (or a FileNotFoundException?) is thrown.
What is the general problem?
You have copy/pasted the ExtractImages example I wrote for my book "iText in Action - Second Edition", but:
You didn't read that book, so you have no idea what that code is supposed to do.
You aren't telling the readers on Stack Overflow that this example depends on the MyImageRenderListener class, which is where all the magic happens.
How can you solve your problem?
Option 1:
Change RESULT like this:
public static final String RESULT = "Img%s.%s";
Now the images will be stored in your working directory.
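A variation on this option, not from the original answer: keep RESULT as it is and create the missing directory tree before parsing, for example at the top of extractImages:

```java
import java.io.File;

public class EnsureDirs {
    public static void main(String[] args) {
        // Create the directory tree the RESULT pattern expects, so the
        // FileOutputStream inside MyImageRenderListener can't fail on a
        // missing path. mkdirs() is a no-op if the tree already exists.
        File dir = new File("results/part4/chapter15");
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IllegalStateException("Could not create " + dir);
        }
        System.out.println(dir.isDirectory());
    }
}
```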
Option 2:
Adapt the MyImageRenderListener class, more specifically this method:
public void renderImage(ImageRenderInfo renderInfo) {
try {
String filename;
FileOutputStream os;
PdfImageObject image = renderInfo.getImage();
if (image == null) return;
filename = String.format(path,
renderInfo.getRef().getNumber(), image.getFileType());
os = new FileOutputStream(filename);
os.write(image.getImageAsBytes());
os.flush();
os.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
iText calls this method whenever an image is encountered. It passes an ImageRenderInfo object to the method, containing plenty of information about that image.
In this implementation, we store the image bytes as a file. This is how we create the path to that file:
String.format(path,
renderInfo.getRef().getNumber(), image.getFileType())
As you can see, the pattern stored in RESULT is used in such a way that the first occurrence of %s is replaced with a number and the second occurrence with a file extension.
You could easily adapt this method so that it stores the images as byte[] in a List if that is what you want.
Here's my problem. I have a txt file called "sites.txt". In it I type random internet sites. My goal is to save the first image of each site. I tried to filter the server response by the img tag, and it actually works for some sites, but for others it doesn't.
The sites where it works have an img src starting with http://; on the sites where it doesn't, the src starts with anything else.
I also tried to add the http:// to the img src values which didn't have it, but I still get the same error:
Exception in thread "main" java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(Unknown Source)
My current code is:
public static void main(String[] args) throws IOException{
try {
File file = new File ("sites.txt");
Scanner scanner = new Scanner (file);
String url;
int counter = 0;
while(scanner.hasNext())
{
url=scanner.nextLine();
URL page = new URL(url);
URLConnection yc = page.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine = in.readLine();
while (!inputLine.toLowerCase().contains("img"))inputLine = in.readLine();
in.close();
String[] parts = inputLine.split(" ");
int i=0;
while(!parts[i].contains("src"))i++;
String destinationFile = "image"+(counter++)+".jpg";
saveImage(parts[i].substring(5,parts[i].length()-1), destinationFile);
String tmp=scanner.nextLine();
System.out.println(url);
}
scanner.close();
}
catch (FileNotFoundException e)
{
System.out.println ("File not found!");
System.exit (0);
}
}
public static void saveImage(String imageUrl, String destinationFile) throws IOException {
// TODO Auto-generated method stub
URL url = new URL(imageUrl);
String fileName = url.getFile();
String destName = fileName.substring(fileName.lastIndexOf("/"));
System.out.println(destName);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(destinationFile);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
}
I also got a tip to use the Apache Jakarta HTTP client libraries, but I have absolutely no idea how I could use those. I would appreciate any help.
A URL (a type of URI) requires a scheme in order to be valid. In this case, http.
When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.
Make sure you always have http://. You can easily fix this using regex:
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");
or
if(!stringUrl.startsWith("http://"))
stringUrl = "http://" + stringUrl;
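A slightly more defensive variant of the startsWith check (my own sketch, not from the original answer): only prepend a scheme when none is present at all, so https and ftp URLs are left alone:

```java
public class UrlFixer {
    // Prepend http:// only when the string has no scheme of its own.
    static String ensureScheme(String url) {
        if (url.matches("^[a-zA-Z][a-zA-Z0-9+.-]*://.*")) {
            return url; // already has a scheme (http, https, ftp, ...)
        }
        return "http://" + url;
    }

    public static void main(String[] args) {
        // prints http://www.google.com
        System.out.println(ensureScheme("www.google.com"));
        // prints https://example.com (left untouched)
        System.out.println(ensureScheme("https://example.com"));
    }
}
```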
An alternative solution
Simply try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters, and for performing simple encoding and decoding.
Sample code:
// read a image from the URL
// I used the URL that is your profile pic on StackOverflow
BufferedImage image = ImageIO
.read(new URL(
"https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));
// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.
Some examples are:
resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png
For the second through fifth ones, you'll need to build the rest of the URL.
The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:
url = site.getProtocol() + ":" + resource
For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.
Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.
import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class SiteScraper {
public static void main(String[] args) throws IOException {
URL site = new URL("https://google.com/");
Document doc = Jsoup.connect(site.toString()).get();
Elements images = doc.select("img");
for (Element image : images) {
String src = image.attr("src");
System.out.println(buildResourceUrl(site, src));
}
}
static URL buildResourceUrl(URL site, String resource)
throws MalformedURLException {
if (!resource.matches("^(http|https|ftp)://.*$")) {
if (resource.startsWith("//")) {
return new URL(site.getProtocol() + ":" + resource);
} else {
return new URL(site.getProtocol() + "://" + site.getHost() + "/"
+ resource.replaceAll("^/", ""));
}
}
return new URL(resource);
}
}
This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (i.e., http://some.place/under/the/rainbow.html). You may even encounter base64 encoded data URI's in the src attribute... It really depends on the individual case and how far you're willing to go.
I'm not sure if anyone else has encountered or asked about this before, but for my application I make use of two Yahoo! RSS feeds: Top News and Weather Forecast. I'm new to the idea of using these in the first place, but from what I've read, I simply need to make an HTTP GET request to a specific URL to retrieve an XML file which I can parse for the information I want. I have the parser working just fine, for I tested it with a sample XML file from each feed; however, a strange error occurs when I use the AJAX GET call to the URLs:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
Whitespace is not allowed at this location.
Error processing resource 'http://localhost:8080/BBS/fservlet?p=n'. Line 28, P...
for (i = 0; i < s.length; i++){
-------------------^
Note that I have this application "BBS" currently deployed on my local system with Tomcat. I looked into whitespace errors like this, and most seem to point to some line within the XML file itself having a problem. In most cases, it had something to do with escaping the "&" symbol, but it appears as though IE is telling me that the error is within a for-loop. I'm no XML expert, but I've never seen a for-loop inside XML. Even so, I've gone to the URL directly in my browser and viewed the XML file (it's the one I used to test my parsing) and found no such line. In addition, no such loop exists anywhere in my code. In other words, I'm not sure if this is an error on my end or some configuration setting. Here's the code I'm working with, however:
jQuery Code
// Located in my JSP file
var baseContext = "<%=request.getContextPath()%>";
$(document).ready(function() {
ParseWeather();
ParseNews();
});
// Located in a separate JS file
function ParseWeather() {
$.get(baseContext + "/servlet?p=w", function(data) {
// XML Parser
});
// Data Manipulation
}
function ParseNews() {
$.get(baseContext + "/servlet?p=n", function(data) {
// XML Parser
});
// Data Manipulation
}
Java Code
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import javax.servlet.http.HttpServlet;
import java.net.URL;
public class FeedServlet extends HttpServlet {
protected void doGet(final HttpServletRequest request, final HttpServletResponse response) throws ServletException, IOException {
try {
response.setContentType("text/xml");
final URL url;
String line = "";
if(request.getParameter("p").equals("w")) {
// Configuration setting that returns: "http://xml.weather.yahoo.com/forecastrss?p=USOR0186"
url = new URL(AppConfiguration.getInstance().getForcastUrl());
} else {
// Configuration setting that returns: "http://news.yahoo.com/rss/"
url = new URL(AppConfiguration.getInstance().getNewsUrl());
}
final BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
final PrintWriter writer = response.getWriter();
while((line = reader.readLine()) != null) {
writer.println(line);
writer.flush();
}
writer.close();
} catch(IOException e) {
e.printStackTrace();
}
}
}
My company has an AppConfiguration class that allows certain variables, like the URLs, to be changed through the configuration page. At any rate, those two calls simply return the URLs...
Yahoo! Forecast RSS Feed:
http://xml.weather.yahoo.com/forecastrss?p=USOR0186
Yahoo! News: Top Stories Feed:
http://news.yahoo.com/rss/
Anyway, any help would be incredibly helpful.
for (i = 0; i < s.length; i++){
The error is at the less-than symbol, which means the XML parser is reading your source code! Use wget to fetch the resource and check that actual XML is returned, not source code.
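Along the same lines as the wget check, a rough sanity test can be written in Java (the heuristic and sample strings here are mine, not from the question): an RSS response should start with an XML declaration or an rss root element, while leaked source code will not.

```java
public class XmlSniffer {
    // Rough check: does the start of the response body look like RSS/XML?
    static boolean looksLikeXml(String body) {
        String trimmed = body.trim();
        return trimmed.startsWith("<?xml") || trimmed.startsWith("<rss");
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(looksLikeXml(
                "<?xml version=\"1.0\"?><rss version=\"2.0\"></rss>"));
        // prints false: this is source code, not XML
        System.out.println(looksLikeXml("for (i = 0; i < s.length; i++){"));
    }
}
```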