I have this code where I am trying to read an image from a URL:
public class question_insert {
public static String latex(String tex) throws IOException {
String urltext = "http://chart.apis.google.com/chart?cht=tx&chl="+tex;
URL url = new URL(urltext);
BufferedReader in = new BufferedReader(new InputStreamReader(url
.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
// Process each line.
System.out.println(inputLine.toString());
}
in.close();
return inputLine;
}
}
But what I get is unreadable output. The URL returns a single image; try it:
http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\frac{3}{4}
What should I do to embed the image into Html?
First of all, it is not clear what you mean by an image in HTML format. You could Base64-encode its binary data, but is that what you really want?
How do you expect to output a PNG picture returned by your URL to a text console (that is System.out)?
Second, the way you're retrieving the image would not work even if you were to store it on disk as a PNG file, because Reader and its derivatives like BufferedReader are used to read character data. From the Reader API:
Abstract class for reading character streams
You need to read binary (byte) data, so you need to stick with an InputStream such as BufferedInputStream.
After some thinking I realized that embedding image into HTML is what you really want:
// Needs java.io.*, java.net.URL and javax.xml.bind.DatatypeConverter.
public static void main(String[] args) throws Exception {
    String urltext = "http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\\frac{3}{4}";
    URL url = new URL(urltext);
    ByteArrayOutputStream imageBytes = new ByteArrayOutputStream();
    try (InputStream in = new BufferedInputStream(url.openStream())) {
        byte[] buffer = new byte[4096];
        int n;
        // Append only the bytes that each read() actually returned.
        while ((n = in.read(buffer)) != -1) {
            imageBytes.write(buffer, 0, n);
        }
    }
    System.out.println("<img src='data:image/png;base64," + DatatypeConverter.printBase64Binary(imageBytes.toByteArray()) + "'>");
}
The result is:
<img src=''>
Isn't that great? Anything for you!
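On newer JDKs the javax.xml.bind classes are no longer bundled (they were removed in Java 11), so the same idea can be sketched with java.util.Base64 instead. This is only a minimal sketch: the class and method names (DataUriSketch, readAll, toDataUri) are made up for illustration, and the three stand-in bytes replace the real image stream so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;

// Hypothetical helper class; the names are illustrative only.
class DataUriSketch {

    // Reads the stream to the end, appending only the bytes each read() returned.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Builds a data: URI that can be used as the src attribute of an <img> element.
    static String toDataUri(byte[] bytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the real case the stream would come from url.openStream().
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println("<img src='" + toDataUri(readAll(in), "image/png") + "'>");
    }
}
```

The MIME type is passed in rather than hard-coded because the sketch cannot know what the server actually returned.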
Well, I don't know if that is what you want because it seems that nobody does. But if you want to get this output
<img style="-webkit-user-select: none"
src="http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\frac{3}{4}" />
you will have to use this code
public static String latex(String tex) {
String url = "http://chart.apis.google.com/chart?cht=tx&chl=" + tex;
return "<img style=\"-webkit-user-select: none\" src=\"" + url + "\"/>";
}
Also you might have to escape some characters like \ in the tex parameter.
To get your image, you should try to use ImageIO API like this
try {
URL url = new URL(urltext);
BufferedImage img = ImageIO.read(url);
} catch (IOException e) {
e.printStackTrace();
}
http://chart.apis.google.com/chart?cht=tx&chl=2+2%20\frac{3}{4}
Note that this URL is wrong. It shows 22 3/4 instead of the intended 2 + 2 3/4, because an unencoded + in a query string is decoded as a space. A request parameter containing special characters needs to be URL-encoded as follows.
http://chart.apis.google.com/chart?cht=tx&chl=2%2B2%20%5Cfrac%7B3%7D%7B4%7D
You can achieve this with URLEncoder#encode().
String chl = "2+2 \\frac{3}{4}";
String url = "http://chart.apis.google.com/chart?cht=tx&chl=" + URLEncoder.encode(chl, "UTF-8");
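To make the encoding step concrete, here is a small sketch that wraps it in two helpers. The class and method names (ChartUrlSketch, chartUrl, imgTag) are assumptions for illustration. Note that URLEncoder produces form encoding, so the space comes out as + rather than %20; both are read back as a space in a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class; the names are illustrative only.
class ChartUrlSketch {

    static final String BASE = "http://chart.apis.google.com/chart?cht=tx&chl=";

    // Appends the URL-encoded TeX source to the chart endpoint.
    static String chartUrl(String tex) throws UnsupportedEncodingException {
        return BASE + URLEncoder.encode(tex, "UTF-8");
    }

    // Wraps the URL in an <img> element; & is written as &amp; inside the attribute value.
    static String imgTag(String tex) throws UnsupportedEncodingException {
        return "<img src=\"" + chartUrl(tex).replace("&", "&amp;") + "\" />";
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String tex = "2+2 \\frac{3}{4}";
        System.out.println(chartUrl(tex));
        // prints http://chart.apis.google.com/chart?cht=tx&chl=2%2B2+%5Cfrac%7B3%7D%7B4%7D
        System.out.println(imgTag(tex));
    }
}
```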
Back to your functional requirement:
What should I do to embed the image into Html?
If your sole functional requirement is to display the image behind the mentioned URL with an HTML <img> element in an HTML/JSP page, then you need to use the JSTL <c:url> tag to URL-encode request parameters containing special characters.
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
...
<c:url var="url" value="http://chart.apis.google.com/chart">
<c:param name="cht" value="tx" />
<c:param name="chl" value="2+2 \\frac{3}{4}" />
</c:url>
Then you can just refer to it as ${url} (as declared in the var attribute of <c:url>) in the src attribute of the HTML <img> element:
<img src="${url}" />
Reading a binary image stream from a URL as a character stream and storing it in a string, as you initially attempted, makes no sense at all. You also wouldn't open image files in Notepad, for example.
Related
I have a text box and a submit button in my JSP page. When I submit a URL through the text box, I fetch the response of that URL using URLConnection:
String strUrl = request.getParameter("url");
URL url = new URL(strUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
connection.setRequestMethod("GET");
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
try {
fWriter = new FileWriter(new File("f:\\new.html"));
writer = new BufferedWriter(fWriter);
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
In the resulting HTML page, all CSS, JS and images are missing, because their paths now resolve against the local file.
For example, a script is referenced as follows in my generated HTML page:
<script src="/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
But the actual src should be:
<script src="https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
I know that there are many solutions for replacing every src and href with the URL host, and I found many answers related to that.
I used the following solution:
if (s.contains(("href="))) {
if (s.contains("\"../") || s.contains("\"/")) {
s = s.replace("\"../", "\"http://" + url.getHost() + "/");
s = s.replace("\"/", "\"http://" + url.getHost() + "/");
writer.write(s);
out.println(s);
}
}
Now I am able to get the links, but this does not work for all websites; it only helps for sites whose src and href values just need the host as a prefix.
Some websites define links such as href="frmArticles.aspx". In this case it is not enough to prefix the href with the host, because the resulting URL still differs from the real one. For example, the following URL has href links of this kind:
http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2
தை தை தை
If I add the host to this href, it becomes:
தை தை தை
This URL does not exist, because the actual URL is:
தை தை தை
There are essentially two ways to get the absolute URL:
Using Jsoup's abs:href attribute getter. It works like this:
Element a = myDoc.select("a").first(); // selects the first link on the page; replace the selector with whatever matches your link (a element)
String url = a.attr("abs:href"); //gets the absolute url of the link (href attribute)
Note that Jsoup needs the URL of the HTML document so that it can resolve links correctly. This happens automatically if you use Jsoup.connect(myHtmlUrl).get(). If you are parsing HTML from a String or from a file, use the Jsoup.parse() overload that accepts a base URL.
The other way is with Java's built in URL class, which is probably what you should use in your case. You can use it like this:
String absoluteUrl = new URL(new URL("http://example.com/example.html"), "script.js").toString();
Printing absoluteUrl would give:
http://example.com/script.js
To clarify a bit, the first parameter (in this case http://example.com/example.html) is the URL your HTML document came from, and the second parameter ("script.js") is the URL found in your HTML.
In your case, you could use it like:
String absoluteUrl = new URL(new URL("https://www.url.com/"), "/ajax/libs/jquery/2.1.1/jquery.min.js").toString();
Printing absoluteUrl would give:
https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js
The URL class has a constructor URL(URL context, String url) that does what you tried doing with regexps.
Edit: In your case the context URL is the source URL of the parsed resource. Let's say you parse something from URL context = new URL("http://example.com/path/to/some.html#where?is+carmen+sandiego"). Then you just take the reference of any link and create a URL ref = new URL(context, src).
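As a rough sketch of how the URL(URL context, String spec) constructor handles the different kinds of links discussed above, the helper below resolves a link against the page it was found on. The class and method names (AbsoluteUrlSketch, resolve) and the example.com URLs are assumptions for illustration; the other URLs are the ones from the question.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class; the names are illustrative only.
class AbsoluteUrlSketch {

    // Resolves a link found in a page against the URL the page was loaded from.
    static String resolve(String pageUrl, String link) throws MalformedURLException {
        return new URL(new URL(pageUrl), link).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Page-relative link: resolved against the directory of the page.
        System.out.println(resolve("http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2", "frmArticles.aspx"));
        // Root-relative link: resolved against the host.
        System.out.println(resolve("https://www.url.com/some/page.html", "/ajax/libs/jquery/2.1.1/jquery.min.js"));
        // Parent-directory link: the ../ segment is collapsed.
        System.out.println(resolve("http://example.com/a/b/page.html", "../style.css"));
    }
}
```

Because the page URL itself is the context, there is no need to special-case "../" or "/" prefixes by string replacement.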
I'm trying to build a jsoup based java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired from a similar python based app). It's supposed to ask you the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link but I get an Unhandled content type error when I try to 'go' to that link. Here's my code
public static void main(String[] args) {
try {
String videoName = JOptionPane.showInputDialog("Title: ");
subscene(videoName);
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
public static void subscene(String videoName){
try {
String siteName = "http://www.subscene.com";
String[] splits = videoName.split("\\s+");
String codeName = "";
String text = "";
if(splits.length>1){
for(int i=0;i<splits.length;i++){
codeName = codeName+splits[i]+"-";
}
videoName = codeName.substring(0, videoName.length());
}
System.out.println("videoName is "+videoName);
// String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
System.out.println("url is "+url);
Document doc = Jsoup.connect(url).get();
Element exact = doc.select("h2.exact").first();
Element yuel = exact.nextElementSibling();
Elements lis = yuel.children();
System.out.println(lis.first().children().text());
String hRef = lis.select("div.title > a").attr("href");
hRef = siteName+hRef+"/english";
System.out.println("hRef is "+hRef);
doc = Jsoup.connect(hRef).get();
Element nonHI = doc.select("td.a40").first();
Element papa = nonHI.parent();
Element link = papa.select("a").first();
text = link.text();
System.out.println("Subtitle is "+text);
hRef = link.attr("href");
hRef = siteName+hRef;
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers return an HTML page when the file cannot be found (for example, for a broken hyperlink).
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
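Since the file in this thread is a ZIP archive, another way to verify the content is to look at its first bytes instead of searching for "<body>". The sketch below checks for the ZIP local file header signature (the bytes P, K, 3, 4); the class and method names are made up for illustration, and an empty archive, which starts with a different signature, would be rejected by this simple check.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class; the names are illustrative only.
class ZipCheckSketch {

    // True if the body starts with the ZIP local file header signature "PK\003\004".
    static boolean looksLikeZip(byte[] body) {
        return body != null && body.length >= 4
                && body[0] == 'P' && body[1] == 'K' && body[2] == 3 && body[3] == 4;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory archive as a stand-in for a downloaded file.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("subtitle.srt"));
            zip.write("1\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        System.out.println(looksLikeZip(buffer.toByteArray()));
        // An error page served in place of the file fails the check.
        System.out.println(looksLikeZip("<html><body>Not found</body></html>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

With Jsoup, the bytes to check would be the ones returned by bodyAsBytes() in the snippet above.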
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
Looks like jsoup expects that the result of Jsoup.connect(hRef) should be an HTML or some text that it's able to parse, that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection or URL classes, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, and copy it through an OutputStream to a file on disk.
Here's an example about doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
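The manual buffer loop can also be replaced by java.nio.file.Files.copy, which does the copying and closes nothing on its own, so the sketch wraps it in try-with-resources. The names (DownloadSketch, save) and the temporary file are assumptions for illustration; in the real case the stream would come from url.openStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class; the names are illustrative only.
class DownloadSketch {

    // Copies the whole stream to the target file and returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("download-sketch", ".zip");
        // Stand-in bytes instead of a network stream.
        long written = save(new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5}), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```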
Here's my problem. I have a text file called "sites.txt" in which I list random internet sites. My goal is to save the first image of each site. I tried to filter the server response by the img tag, and it works for some sites but not for others.
On the sites where it works, the img src starts with http://; on the sites where it doesn't, it starts with something else.
I also tried adding http:// to the img src values that didn't have it, but I still get the same error:
Exception in thread "main" java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(Unknown Source)
My current code is:
public static void main(String[] args) throws IOException{
try {
File file = new File ("sites.txt");
Scanner scanner = new Scanner (file);
String url;
int counter = 0;
while(scanner.hasNext())
{
url=scanner.nextLine();
URL page = new URL(url);
URLConnection yc = page.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine = in.readLine();
while (!inputLine.toLowerCase().contains("img"))inputLine = in.readLine();
in.close();
String[] parts = inputLine.split(" ");
int i=0;
while(!parts[i].contains("src"))i++;
String destinationFile = "image"+(counter++)+".jpg";
saveImage(parts[i].substring(5,parts[i].length()-1), destinationFile);
String tmp=scanner.nextLine();
System.out.println(url);
}
scanner.close();
}
catch (FileNotFoundException e)
{
System.out.println ("File not found!");
System.exit (0);
}
}
public static void saveImage(String imageUrl, String destinationFile) throws IOException {
// TODO Auto-generated method stub
URL url = new URL(imageUrl);
String fileName = url.getFile();
String destName = fileName.substring(fileName.lastIndexOf("/"));
System.out.println(destName);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(destinationFile);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
}
I also got a tip to use the Apache Jakarta HTTP client libraries, but I have absolutely no idea how to use them. I would appreciate any help.
A URL (a type of URI) requires a scheme in order to be valid. In this case, http.
When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.
Make sure you always have http://. You can easily fix this using regex:
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");
or
if(!stringUrl.startsWith("http://"))
stringUrl = "http://" + stringUrl;
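One caveat with both variants above is that they only look for http://, so a value that already starts with https:// would get a second scheme prepended. A slightly more defensive sketch is shown below; the class and method names are made up, and defaulting to http for scheme-less and protocol-relative values is an assumption, not something the site guarantees.

```java
// Hypothetical helper class; the names are illustrative only.
class SchemeSketch {

    // Returns the URL unchanged if it already has a scheme, otherwise prepends http.
    static String ensureScheme(String url) {
        String trimmed = url.trim();
        if (trimmed.matches("(?i)^[a-z][a-z0-9+.-]*://.*")) {
            return trimmed; // already has a scheme such as http, https or ftp
        }
        if (trimmed.startsWith("//")) {
            return "http:" + trimmed; // protocol-relative: only the scheme is missing
        }
        return "http://" + trimmed; // assumed default when nothing is given
    }

    public static void main(String[] args) {
        System.out.println(ensureScheme("www.google.com"));
        System.out.println(ensureScheme("https://google.com/images/srpr/logo9w.png"));
        System.out.println(ensureScheme("//google.com/images/srpr/logo9w.png"));
    }
}
```

This still treats every scheme-less value as a host name, so paths such as /images/logo.png need to be resolved against the page URL instead.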
An alternative solution
Simply try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters, and for performing simple encoding and decoding.
Sample code:
// read an image from the URL
// I used the URL that is your profile pic on StackOverflow
BufferedImage image = ImageIO
.read(new URL(
"https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));
// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.
Some examples are:
resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png
For the second through fifth ones, you'll need to build the rest of the URL.
The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:
url = site.getProtocol() + ":" + resource
For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.
Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.
import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class SiteScraper {
public static void main(String[] args) throws IOException {
URL site = new URL("https://google.com/");
Document doc = Jsoup.connect(site.toString()).get();
Elements images = doc.select("img");
for (Element image : images) {
String src = image.attr("src");
System.out.println(buildResourceUrl(site, src));
}
}
static URL buildResourceUrl(URL site, String resource)
throws MalformedURLException {
if (!resource.matches("^(http|https|ftp)://.*$")) {
if (resource.startsWith("//")) {
return new URL(site.getProtocol() + ":" + resource);
} else {
return new URL(site.getProtocol() + "://" + site.getHost() + "/"
+ resource.replaceAll("^/", ""));
}
}
return new URL(resource);
}
}
This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (i.e., http://some.place/under/the/rainbow.html). You may even encounter base64 encoded data URI's in the src attribute... It really depends on the individual case and how far you're willing to go.
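For the subdirectory and data URI cases just mentioned, one option is to let java.net.URI do the resolution against the full page URL rather than only the host. The following is a minimal sketch under that assumption; the class and method names are made up, the second case (google.com/images/...) is still treated as a relative path, and URI.resolve throws IllegalArgumentException for src values that are not valid URIs (for example, ones containing spaces).

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper class; the names are illustrative only.
class ResourceUrlSketch {

    // Resolves an img src against the page URI; data: URIs and empty values are skipped.
    static Optional<URI> resolve(URI page, String src) {
        String trimmed = src.trim();
        if (trimmed.isEmpty() || trimmed.regionMatches(true, 0, "data:", 0, 5)) {
            return Optional.empty();
        }
        return Optional.of(page.resolve(trimmed));
    }

    public static void main(String[] args) {
        URI page = URI.create("http://some.place/under/the/rainbow.html");
        String[] sources = {
            "https://google.com/images/srpr/logo9w.png",
            "//google.com/images/srpr/logo9w.png",
            "/images/srpr/logo9w.png",
            "images/srpr/logo9w.png",
            "data:image/png;base64,AAAA"
        };
        for (String src : sources) {
            System.out.println(src + " -> " + resolve(page, src).map(URI::toString).orElse("(skipped)"));
        }
    }
}
```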
I am using JSoup to get the H1 tag value from a webpage, this tag contains the following HTML.
Hexyl β-D-glucopyranoside
When I use the .text() method I get the following (note the ?). I assume this is because it cannot work out the HTML for the "β" character. How do I get this value as rendered on a webpage?
Hexyl ?-D-glucopyranoside
Do I need to do some kind of conversion after I have picked up the text I want?
Here is my code.
String check = "<title>Hexyl β-D-glucopyranoside ≥98.0% (TLC) | ≥ ≥</title>";
Document doc3 = Jsoup.parse(check);
doc3.outputSettings().escapeMode(Entities.EscapeMode.base); // default
doc3.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc3.html());
//doc3.outputSettings().charset("ISO 8859-1");
doc3.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc3.html());
-----Output at console-----
UTF-8: <html>
<head>
<title>Hexyl ?-D-glucopyranoside ?98.0% (TLC) | ? ? </title>
</head>
<body></body>
</html>
ASCII: <html>
<head>
<title>Hexyl β-D-glucopyranoside ≥98.0% (TLC) | ≥ ≥</title>
</head>
<body></body>
</html>
Looks like the IDE you're using is using the wrong character encoding.
It has nothing to do with your code: I've run it and it's fine (it outputs the special characters). If you're using Eclipse, go to the run configuration settings for that particular project, click the 'Common' tab, then choose UTF-8.
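The effect is easy to reproduce without Jsoup. The sketch below encodes the string with a given charset and decodes it again, which is roughly what happens when text passes through a console that uses that encoding; the class and method names are made up for illustration. Wrapping System.out in a UTF-8 PrintStream only helps if the console itself interprets UTF-8, which is an assumption about the environment.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class ConsoleCharsetSketch {

    // \u03b2 is the Greek small letter beta.
    static final String NAME = "Hexyl \u03b2-D-glucopyranoside";

    // Encodes and decodes with one charset; unmappable characters become the charset's replacement.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // US-ASCII cannot represent beta, so it is replaced.
        System.out.println(roundTrip(NAME, StandardCharsets.US_ASCII));
        // prints Hexyl ?-D-glucopyranoside

        // Writing UTF-8 bytes explicitly; readable only if the console expects UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(NAME);
    }
}
```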
It's too late to set charset after parsing a document. I had the same problem once, tried to do it your way and failed miserably.
This worked for me:
String url = "url to html page";
InputStream is = new URL(url).openStream();
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(is , "ISO-8859-2", url);
If I have the HTML text only as a String, I convert it to an InputStream first (http://www.kodejava.org/examples/265.html)
InputStream is = new ByteArrayInputStream(text.getBytes("UTF-8"));
then read it with correct charset:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "UTF-8"), 4*1024);
StringBuilder total = new StringBuilder();
String line = "";
while ((line = r.readLine()) != null) {
total.append(line);
}
r.close();
is.close();
String html = total.toString();
...and parse:
doc = org.jsoup.Jsoup.parse(html);
The important thing is to somehow get an InputStream object; from there, there are ways to use your desired charset with it. Maybe it can be done in a more straightforward way, but it works.
What's the best approach to read an image via URL and render it on a JSP page?
So far, I've coded two JSP pages.
EDIT START:
*Experimental: Obviously the ImageServ will be a servlet, not a jsp.
EDIT END:
index.jsp
<%@ page ....
<html>
......
<img src="ImageServ.jsp?url=http://serveripaddress/folder/image.jpg" />
.....
ImageServ.jsp
<%@ page import="javax.imageio.ImageIO"%>
<%@ page import="java.net.URL"%>
<%@ page import="java.io.*, java.awt.*, java.awt.image.*,com.sun.image.codec.jpeg.*" %>
<%
try {
String urlStr = "";
if(request.getParameter("url") != null)
{
urlStr = request.getParameter("url");
URL url = new URL(urlStr);
BufferedImage img = null;
try{
img = ImageIO.read(url);
out.println(" READ SUCCESS" + "<br>");
}catch(Exception e) {
out.println("READ ERROR " + "<br>");
e.printStackTrace(new PrintWriter(out));
}
try {
response.setContentType("image/jpeg");
JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(response.getOutputStream());
encoder.encode(img);
}catch(Exception ee) {
response.setContentType("text/html");
out.println("ENCODING ERROR " + "<br>");
ee.printStackTrace(new PrintWriter(out));
}
}
} catch (Exception e) {
e.printStackTrace(new PrintWriter(out));
}
%>
But this doesn't seem to be working; I always see this error:
READ SUCCESS
ENCODING ERROR
java.io.IOException: reading encoded JPEG Stream
at sun.awt.image.codec.JPEGImageEncoderImpl.writeJPEGStream(Native Method)
at sun.awt.image.codec.JPEGImageEncoderImpl.encode(JPEGImageEncoderImpl.java:476)
at sun.awt.image.codec.JPEGImageEncoderImpl.encode(JPEGImageEncoderImpl.java:228)
Any ideas on how to get this working???
Your image data is already encoded so you can simply write it: ImageIO.write(img, "jpeg", response.getOutputStream());. You don't need to (and can't) use JPEGImageEncoder.
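A minimal sketch of that ImageIO.write call, using an in-memory stream as a stand-in for response.getOutputStream() so it can run outside a servlet container. The class and method names are made up for illustration. ImageIO.write returns false when no writer accepts the image, so the sketch checks the result rather than assuming success.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper class; the names are illustrative only.
class JpegWriteSketch {

    // Encodes the image as JPEG; in a servlet the target would be response.getOutputStream().
    static byte[] toJpeg(BufferedImage img) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(img, "jpeg", out)) {
            throw new IOException("no JPEG writer accepted this image type");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in image; in the question it would come from ImageIO.read(url).
        BufferedImage img = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        byte[] jpeg = toJpeg(img);
        System.out.println(jpeg.length + " bytes of JPEG data");
    }
}
```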
Classic question. Here's an example: http://www.exampledepot.com/egs/javax.servlet/GetImage.html
Also, don't do all that coding in a JSP - keep that for front-end rendering coding only; do the Java coding in a backend class.
Terrible and awful code. NEVER EVER write controller logic in a JSP; that's why I hate JSP to the guts. You cannot write binary data to a JSP output stream, because the stream has already been initialized for text output. Put your logic in a servlet and pipe the input stream to the response output stream with Commons IO. This will work. If you still insist on that crappy solution, you will need to write a filter which completely wraps the response and serves binary data instead. See this for reference and examine its code. Good luck.
Edit:
doGet(...) {
response.setContentType("image/jpeg");
String url = request.getParameter("url");
...
InputStream is = ....getInputStream();
IOUtils.copy(is, response.getOutputStream());
// cleanup
} // done
This is how I pipe PDF from local disk but there is no difference to serving from a URL.
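If adding Commons IO is not an option, the piping step can be sketched with InputStream.transferTo, which is available from Java 9 onwards. The class and method names are made up for illustration, and the in-memory streams stand in for the remote input stream and response.getOutputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class; the names are illustrative only.
class PipeSketch {

    // Copies everything from in to out, closes in, and returns the number of bytes copied.
    // The output stream is left open because a servlet container owns the response stream.
    static long pipe(InputStream in, OutputStream out) throws IOException {
        try (InputStream source = in) {
            return source.transferTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = pipe(new ByteArrayInputStream(new byte[] {10, 20, 30}), out);
        System.out.println(copied + " bytes copied");
    }
}
```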