Here's my problem. I have a txt file called "sites.txt" . In these i type random internet sites. My Goal is to save the first image of each site. I tried to filter the Server response by the img tag and it actually works for some sites, but for some not.
The sites where it works the img src starts with http:// ... the sites it doesnt work start with anything else.
I also tried to add the http:// to the img src images which didnt have it, but i still get the same error:
Exception in thread "main" java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(Unknown Source)
My current code is:
public static void main(String[] args) throws IOException{
try {
File file = new File ("sites.txt");
Scanner scanner = new Scanner (file);
String url;
int counter = 0;
while(scanner.hasNext())
{
url=scanner.nextLine();
URL page = new URL(url);
URLConnection yc = page.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine = in.readLine();
while (!inputLine.toLowerCase().contains("img"))inputLine = in.readLine();
in.close();
String[] parts = inputLine.split(" ");
int i=0;
while(!parts[i].contains("src"))i++;
String destinationFile = "image"+(counter++)+".jpg";
saveImage(parts[i].substring(5,parts[i].length()-1), destinationFile);
String tmp=scanner.nextLine();
System.out.println(url);
}
scanner.close();
}
catch (FileNotFoundException e)
{
System.out.println ("File not found!");
System.exit (0);
}
}
public static void saveImage(String imageUrl, String destinationFile) throws IOException {
// TODO Auto-generated method stub
URL url = new URL(imageUrl);
String fileName = url.getFile();
String destName = fileName.substring(fileName.lastIndexOf("/"));
System.out.println(destName);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(destinationFile);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
}
I also got a tip to use the apache jakarte http client libraries but i got absolutely no idea how i could use those i would appreciate any help.
A URL (a type of URI) requires a scheme in order to be valid. In this case, http.
When you type www.google.com into your browser, the browser is inferring you mean http:// and automatically prepends it for you. Java doesn't do this, hence your exception.
Make sure you always have http://. You can easily fix this using regex:
String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");
or
if(!stringUrl.startsWith("http://"))
stringUrl = "http://" + stringUrl;
An alternative solution
Simply try with ImageIO that contains static convenience methods for locating ImageReaders and ImageWriters, and performing simple encoding and decoding.
Sample code:
// read a image from the URL
// I used the URL that is your profile pic on StackOverflow
BufferedImage image = ImageIO
.read(new URL(
"https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));
// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
When you're scraping the site's HTML for image elements and their src attributes, you'll run into several different representations of URLs.
Some examples are:
resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png
For the second through fifth ones, you'll need to build the rest of the URL.
The second one may be more difficult to differentiate from the fourth and fifth ones, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it as often, because I don't think it's technically valid.
The third case is pretty simple. If the resource variable starts with //, then you just need to prepend the protocol/scheme to it. You can do this with the site object you have:
url = site.getProtocol() + ":" + resource
For the fourth and fifth cases, you'll need to prepend the resource with the entire site's URL.
Here's a sample application that uses jsoup to parse the HTML, and a simple utility method to build the resource URL. You're interested in the buildResourceUrl method. Also, it doesn't handle the second case; I'll leave that to you.
import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class SiteScraper {
public static void main(String[] args) throws IOException {
URL site = new URL("https://google.com/");
Document doc = Jsoup.connect(site.toString()).get();
Elements images = doc.select("img");
for (Element image : images) {
String src = image.attr("src");
System.out.println(buildResourceUrl(site, src));
}
}
static URL buildResourceUrl(URL site, String resource)
throws MalformedURLException {
if (!resource.matches("^(http|https|ftp)://.*$")) {
if (resource.startsWith("//")) {
return new URL(site.getProtocol() + ":" + resource);
} else {
return new URL(site.getProtocol() + "://" + site.getHost() + "/"
+ resource.replaceAll("^/", ""));
}
}
return new URL(resource);
}
}
This obviously won't cover everything, but it's a start. You may run into problems when the URL you're trying to access is in a subdirectory of the root of the site (i.e., http://some.place/under/the/rainbow.html). You may even encounter base64 encoded data URI's in the src attribute... It really depends on the individual case and how far you're willing to go.
Related
My Java program downloads static map images from Google maps which shows the route line. If I go to this link in my browser, I get the correct image which is a map with directional polyline.
But when I download the image from the same URL with my Java program, I get this instead:
Both URLs look the same to me, I can't work out what's wrong. Here is my code if anyone can spot something out of place?
Code:
try {
String mapImgUrl = "https://maps.googleapis.com/maps/api/staticmap?size=300x300&path=enc:" + polyline + "&key=AIzaSyBn2qYJcHoNCgNQZv1mcycnUo06sJDZPBs";
String imageFileName = houseNumber + " " + address + ".jpg";
URL url = new URL(mapImgUrl);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(imageFileName);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
ImageIcon imgIcon = new ImageIcon((new ImageIcon(imageFileName))
.getImage().getScaledInstance(400, 400, java.awt.Image.SCALE_SMOOTH));
SwingUtilities.invokeLater(new Runnable() {
public void run() {
JLabel labelMap = new JLabel();
labelMap.setIcon(imgIcon);
panelMap.add(labelMap);
}
});
The polyline data is correct, I have compared the data I get from the JSON in my browser with the data I get from my program, and they match. This is the polyline data straight from the API if it helps anyway:
c{utHdfqJJaA`AoI\\oATo#Xe#P[R_#NYFYHs#AGAQDe#LYHGNCF?h#[Za#bBuDtA_Dp#gAR[J[#i#?SF]FIJEJAj#m#f#iAbByDlQoa#Pm#By#CS?YBSFOHKLCXOr#SbE}G~#gBfAiBdH_MjAwBFa#j#_ARYQg#kAkDIBIEEI?M#GMMQYgBiEaD{HJOH[`Hy`#d#iC{CgBgEcC}CeBb#gC
I think the issue is you haven't URL encoded the poly line.
Use: java.net.URLEncoder
String mapImgUrl = "https://maps.googleapis.com/maps/api/staticmap?size=300x300&path=enc:"
+ URLEncoder.encode(polyline, "utf-8") + "&key=<key>";
Note the encode(String) method which doesn't require the character encoding is deprecated. Also, make sure you import from the public package, java.net
I found the problem. The polyline portion of the URL was null due to the thread that gets the polyline data not finishing before the thread that sets up the URL started. I solved this by joining the threads.
I am having a textbox and submit button in my jsp page. When submitting this button with some url in textbox, I am getting the response of that url using URLConnection
String strUrl = request.getParameter("url");
URL url = new URL(strUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
try {
fWriter = new FileWriter(new File("f:\\new.html"));
writer = new BufferedWriter(fWriter);
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
In the resulting html page, every css and js and images were missing as they are pointed to get from local.
for example, js is placed as followed in my generated html page.
<script src="/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
But this actual src is as follows,
<script src="https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
I know that there are many solution to replace all src, href with url host. Found many answers related to that.
I used a solution as follows,
if (s.contains(("href="))) {
if (s.contains("\"../") || s.contains("\"/")) {
s = s.replace("\"../", "\"http://" + url.getHost() + "/");
s = s.replace("\"/", "\"http://" + url.getHost() + "/");
writer.write(s);
out.println(s);
}
}
Now I am able to get link,but its not useful in all the web sites. which means that it will helpful for only sites having that kind of host only prefix with src and hrefs.
In some websites, links are defined as href="frmArticles.aspx". In this case its not enough to add host with href url, because href and src are different even though I prefix with host. For example, folowing URL having href links as different than its URL.
http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2
தை தை தை
If, I am adding host to this href it becomes as follows,
தை தை தை
And this is not available. Because, the actual url is
தை தை தை
There are essentially two ways to get the absolute URL:
Using Jsoup's abs:href attribute getter. It works like this:
Element a = myDoc.select("a").first(); //selects tue first link on the page, replace with whatever selector you need to get your link (a element)
String url = a.attr("abs:href"); //gets the absolute url of the link (href attribute)
Note that you need to provide Jsoup with the URL of the HTML document you are using, so it can resolve the URL correctly, this is done automatically if you use Jsoup.connect(myHtmlUrl).get(), if you are parsing HTML from a String or from a file, you need to provide it, use the appropriate Jsoup.parse() method which allows you to provide a base URL
The other way is with Java's built in URL class, which is probably what you should use in your case. You can use it like this:
String absoluteUrl = new URL(new URL("http://example.com/example.html"), "script.js")
Which would print:
http://example.com/script.js
To clarify a bit, the first parameter (in this case example.com) is the url your HTML document is from, and the second parameter ("script.js") is the URL found in your HTML.
In your case, you could use it like:
String absoluteUrl = new URL(new URL("https://www.url.com/"), "/ajax/libs/jquery/2.1.1/jquery.min.js")
Which will print:
https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js
The URL class has a constructor URL(URL context, String url) that does what you tried doing with regexps.
Edit: In your case the context URL is the source URL of the parsed resource. Let's say you parse something from URL context = new URL("http://example.com/path/to/some.html#where?is+carmen+sandiego"). Then you just take the reference of any link and create a URL ref = new URL(context, src).
I'm trying to build a jsoup based java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired from a similar python based app). It's supposed to ask you the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link but I get an Unhandled content type error when I try to 'go' to that link. Here's my code
public static void main(String[] args) {
try {
String videoName = JOptionPane.showInputDialog("Title: ");
subscene(videoName);
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
public static void subscene(String videoName){
try {
String siteName = "http://www.subscene.com";
String[] splits = videoName.split("\\s+");
String codeName = "";
String text = "";
if(splits.length>1){
for(int i=0;i<splits.length;i++){
codeName = codeName+splits[i]+"-";
}
videoName = codeName.substring(0, videoName.length());
}
System.out.println("videoName is "+videoName);
// String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
System.out.println("url is "+url);
Document doc = Jsoup.connect(url).get();
Element exact = doc.select("h2.exact").first();
Element yuel = exact.nextElementSibling();
Elements lis = yuel.children();
System.out.println(lis.first().children().text());
String hRef = lis.select("div.title > a").attr("href");
hRef = siteName+hRef+"/english";
System.out.println("hRef is "+hRef);
doc = Jsoup.connect(hRef).get();
Element nonHI = doc.select("td.a40").first();
Element papa = nonHI.parent();
Element link = papa.select("a").first();
text = link.text();
System.out.println("Subtitle is "+text);
hRef = link.attr("href");
hRef = siteName+hRef;
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend to verify the content before saving.
Because some servers will just return a HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
Looks like jsoup expects that the result of Jsoup.connect(hRef) should be an HTML or some text that it's able to parse, that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection, URL or use a third party library like Apache HttpComponents to fire a GET request and retrieve the result as an InputStream, wrap it into a proper writer and write your file into your disk.
Here's an example about doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();
I was curious how to pull information from a website with Java, and I found JSoup ( HTML Parser) Was a popular suggestion. I have found quite a few examples online but nothing really explaining how to use it. Say I wanted to get the temperature for Toronto using this url, http://weather.gc.ca/city/pages/on-143_metric_e.html , how would I go about doing so?
I guess you have to specify tags, but in the html for that site, the information I want is in a tag, but so is more inforation so when when I run my code
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("dd").text();
System.out.println("Title: " + temp);
I get a lot more information than I want.
For the temperature try this:
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("p").get(1).text();
System.out.println("Temperature: " + temp);
For formulating the CSS queries refer to the syntax sheet: http://jsoup.org/cookbook/extracting-data/selector-syntax
Also try: http://try.jsoup.org/, great for testing!
Let say I want to read the contents of mywebsite.com. This is how i'll do it:
import java.net.*;
import java.io.*;
class MyClass {
public static void main(String[] arg) throws Exception {
URL u = new URL("http://www.mywebsite.com");
InputStream ins = u.openStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader br = new BufferedReader(isr);
System.out.println(br.readLine());
}
}
Hopefully this should get you started..
So I was attempting to use this String in a URL :-
http://site-test.com/Meetings/IC/DownloadDocument?meetingId=c21c905c-8359-4bd6-b864-844709e05754&itemId=a4b724d1-282e-4b36-9d16-d619a807ba67&file=\\s604132shvw140\Test-Documents\c21c905c-8359-4bd6-b864-844709e05754_attachments\7e89c3cb-ce53-4a04-a9ee-1a584e157987\myDoc.pdf
In this code: -
String fileToDownloadLocation = //The above string
URL fileToDownload = new URL(fileToDownloadLocation);
HttpGet httpget = new HttpGet(fileToDownload.toURI());
But at this point I get the error: -
java.net.URISyntaxException: Illegal character in query at index 169:Blahblahblah
I realised with a bit of googling this was due to the characters in the URL (guessing the &), so I then added in some code so it now looks like so: -
String fileToDownloadLocation = //The above string
fileToDownloadLocation = URLEncoder.encode(fileToDownloadLocation, "UTF-8");
URL fileToDownload = new URL(fileToDownloadLocation);
HttpGet httpget = new HttpGet(fileToDownload.toURI());
However, when I try and run this I get an error when I try and create the URL, the error then reads: -
java.net.MalformedURLException: no protocol: http%3A%2F%2Fsite-test.testsite.com%2FMeetings%2FIC%2FDownloadDocument%3FmeetingId%3Dc21c905c-8359-4bd6-b864-844709e05754%26itemId%3Da4b724d1-282e-4b36-9d16-d619a807ba67%26file%3D%5C%5Cs604132shvw140%5CTest-Documents%5Cc21c905c-8359-4bd6-b864-844709e05754_attachments%5C7e89c3cb-ce53-4a04-a9ee-1a584e157987%myDoc.pdf
It looks like I can't do the encoding until after I've created the URL else it replaces slashes and things which it shouldn't, but I can't see how I can create the URL with the string and then format it so its suitable for use. I'm not particularly familiar with all this and was hoping someone might be able to point out to me what I'm missing to get string A into a suitably formatted URL to then use with the correct characters replaced?
Any suggestions greatly appreciated!
You need to encode your parameter's values before concatenating them to URL.
Backslash \ is special character which have to be escaped as %5C
Escaping example:
String paramValue = "param\\with\\backslash";
String yourURLStr = "http://host.com?param=" + java.net.URLEncoder.encode(paramValue, "UTF-8");
java.net.URL url = new java.net.URL(yourURLStr);
The result is http://host.com?param=param%5Cwith%5Cbackslash which is properly formatted url string.
I have the same problem, i read the url with an properties file:
String configFile = System.getenv("system.Environment");
if (configFile == null || "".equalsIgnoreCase(configFile.trim())) {
configFile = "dev.properties";
}
// Load properties
Properties properties = new Properties();
properties.load(getClass().getResourceAsStream("/" + configFile));
//read url from file
apiUrl = properties.getProperty("url").trim();
URL url = new URL(apiUrl);
//throw exception here
URLConnection conn = url.openConnection();
dev.properties
url = "https://myDevServer.com/dev/api/gate"
it should be
dev.properties
url = https://myDevServer.com/dev/api/gate
without "" and my problem is solved.
According to oracle documentation
Thrown to indicate that a malformed URL has occurred. Either no legal protocol could be found in a specification string or the string
could not be parsed.
So it means it is not parsed inside the string.
You want to use URI templates. Look carefully at the README of this project: URLEncoder.encode() does NOT work for URIs.
Let us take your original URL:
http://site-test.test.com/Meetings/IC/DownloadDocument?meetingId=c21c905c-8359-4bd6-b864-844709e05754&itemId=a4b724d1-282e-4b36-9d16-d619a807ba67&file=\s604132shvw140\Test-Documents\c21c905c-8359-4bd6-b864-844709e05754_attachments\7e89c3cb-ce53-4a04-a9ee-1a584e157987\myDoc.pdf
and convert it to a URI template with two variables (on multiple lines for clarity):
http://site-test.test.com/Meetings/IC/DownloadDocument
?meetingId={meetingID}&itemId={itemID}&file={file}
Now let us build a variable map with these three variables using the library mentioned in the link:
final VariableMap = VariableMap.newBuilder()
.addScalarValue("meetingID", "c21c905c-8359-4bd6-b864-844709e05754")
.addScalarValue("itemID", "a4b724d1-282e-4b36-9d16-d619a807ba67e")
.addScalarValue("file", "\\\\s604132shvw140\\Test-Documents"
+ "\\c21c905c-8359-4bd6-b864-844709e05754_attachments"
+ "\\7e89c3cb-ce53-4a04-a9ee-1a584e157987\\myDoc.pdf")
.build();
final URITemplate template
= new URITemplate("http://site-test.test.com/Meetings/IC/DownloadDocument"
+ "meetingId={meetingID}&itemId={itemID}&file={file}");
// Generate URL as a String
final String theURL = template.expand(vars);
This is GUARANTEED to return a fully functional URL!
Thanks to Erhun's answer I finally realised that my JSON mapper was returning the quotation marks around my data too! I needed to use "asText()" instead of "toString()"
It's not an uncommon issue - one's brain doesn't see anything wrong with the correct data, surrounded by quotes!
discoveryJson.path("some_endpoint").toString();
"https://what.the.com/heck"
discoveryJson.path("some_endpoint").asText();
https://what.the.com/heck
This code worked for me
public static void main(String[] args) {
try {
java.net.URL url = new java.net.URL("http://path");
System.out.println("Instantiated new URL: " + url);
}
catch (MalformedURLException e) {
e.printStackTrace();
}
}
Instantiated new URL: http://path
Very simple fix
String encodedURL = UriUtils.encodePath(request.getUrl(), "UTF-8");
Works no extra functionality needed.