URL is accessible with browser but still FileNotFoundException with URLConnection - java

I use an HttpURLConnection to connect to a website and receive a ResponseCode=404 (HTTP_NOT_FOUND). However, I have no problem opening the website in my browser (IE).
Why the difference, and what can I do about it?
Regards, Pavan
This is my program:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class TestGet {
    private static URL source;

    public static void main(String[] args) {
        doGet();
    }

    public static void doGet() {
        try {
            source = new URL("http://localhost:8080/");
            System.out.println("Url is" + source.toString());
            URLConnection connection = source.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 ( compatible ) ");
            connection.setRequestProperty("Accept", "*/*");
            connection.setDoInput(true);
            connection.setDoOutput(true);
            System.out.println(((HttpURLConnection) connection).getResponseCode());
            BufferedReader rdr = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            StringBuffer b = new StringBuffer();
            String line = null;
            while (true) {
                line = rdr.readLine();
                if (line == null)
                    break;
                b.append(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println(e.toString());
        }
    }
}
Stack Trace
Url ishttp://localhost:8080/
404
java.io.FileNotFoundException: http://localhost:8080/
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$6.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at TestGet.doGet(TestGet.java:28)
at TestGet.main(TestGet.java:11)
Caused by: java.io.FileNotFoundException: http://localhost:8080/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at TestGet.doGet(TestGet.java:26)
... 1 more
java.io.FileNotFoundException: http://localhost:8080/

You are getting a 404 error, which means the response for the request was not found. First you need to make sure that there is a server serving at http://localhost:8080/ and that it returns some content with code 200. If not, there is nothing we can do to help you.
The easiest way to test whether there is anything at the URL is to paste it into the web browser's address bar and hit go. However, this does not guarantee that the Java code will be able to access it. For example, the server may be designed to respond with 404 if it cannot find a web browser User-Agent header.
Since the server returns a status code, either 200 or 404, it means this is not a firewall problem.
According to the latest edit of your question, you can view the page with the web browser but cannot download it with your Java code, and the headers seem to be set correctly. There are only two problems I can see:
You should not call connection.setDoOutput(true). This forces the connection to do an HTTP POST instead of a GET, and the server may not support POST (see the sketch below).
Your server may always be returning 404 even when it should return 200. Since the web browser doesn't care about the error status and tries to render the content anyway, it appears to work from the browser. If so, you should fix the server to respond correctly first; otherwise, read the error stream instead via HttpURLConnection#getErrorStream().
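For illustration, here is a minimal sketch of both points, assuming the same localhost URL as the question: a plain GET (no setDoOutput) that prints the status code and falls back to getErrorStream() when the server answers with an error status.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PlainGet {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:8080/").openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 ( compatible ) ");
        // no setDoOutput(true) here -- that would turn the request into a POST

        int status = conn.getResponseCode();
        // read the normal body on success, the error body otherwise
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();

        StringBuilder sb = new StringBuilder();
        if (body != null) {
            try (BufferedReader rdr = new BufferedReader(new InputStreamReader(body))) {
                String line;
                while ((line = rdr.readLine()) != null) {
                    sb.append(line);
                }
            }
        }
        System.out.println(status);
        System.out.println(sb);
    }
}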

I had a similar issue. For me it helped to inspect the packets using RawCap. RawCap is one of the few Windows packet sniffers that lets you sniff localhost.
In my case the server was returning a 404 due to an authentication issue.

If the URL http://localhost:8080/ can be accessed in the web browser, the code should work. I ran the program on my machine and it works fine. So you should check whether the web server service is OK.

I know this is very late in the game, but I was recently having the same issue and none of the solutions here worked for me. In my case, I actually had another process running on the same port that was stealing the requests from the Java app. Using yair's answer here you can check for a process running on the same port like this: in the command prompt, run netstat -nao | find "8080" on Windows or netstat -nap | grep 8080 on Linux. It should show a line with LISTENING and 127.0.0.1:8080, and next to it the process ID. Just terminate that process and you should be good to go.

I had the problem too. In my case I had an invisible Unicode character in the URL string, so the connection couldn't open it (that is what the FileNotFoundException indicates). I removed it and it worked.
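A quick way to check for this (an illustrative sketch, not from the original answer) is to strip control and format characters, which include zero-width spaces, and compare string lengths to see whether anything was removed:

String raw = "http://localhost:8080/\u200B";           // example URL with a zero-width space
String cleaned = raw.replaceAll("\\p{C}", "");          // \p{C} matches control and format characters
System.out.println(raw.length() + " -> " + cleaned.length());  // lengths differ if something was stripped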

I had a similar scenario where the web service processed POST requests from the browser (in my case Postman, an API-testing Chrome extension) correctly, but HttpURLConnection kept failing with a 404 for large payloads. I mistakenly assumed that the problem must be in my HttpURLConnection client code.
When I later tried to replicate the request from cURL with a large payload, I got the same 404 error. Even though I used the cURL command generated by Postman, which should therefore be identical to Postman's request, the web service reacted differently to the two requests. Some client middleware in Postman may have intercepted and modified the requests.
TL;DR
Check the web service. It may be the culprit. Try another non-browser, bare-bones HTTP client like cURL to see how the web service reacts to it.

Related

Parse https with jsoup (java)

I'm trying to parse a document with jsoup (Java). This is my Java code:
package test;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class crawler {
    private static final int TIMEOUT_IN_MS = 5000;

    public static void main(String[] args) throws MalformedURLException, IOException {
        Document doc = Jsoup.parse(new URL("http://www.internet.com/"), TIMEOUT_IN_MS);
        System.out.println(doc.html());
    }
}
OK, this works. But when I want to parse an https site, I get this error message:
Document doc = Jsoup.parse(new URL("https://www.somesite.com/"), TIMEOUT_IN_MS);
System.out.println(doc.html());
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.somesite.com/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
at org.jsoup.Jsoup.parse(Jsoup.java:183)
at test.crawler.main(crawler.java:14)
I only get this error message when I try to parse https; http works fine.
Jsoup supports https fine - it just uses Java's URLConnection under the hood.
A 403 server response indicates that the server has 'forbidden' the request, normally due to authorization issues. If you're getting a HTTP response status code, the TLS (https) negotiation has worked.
The issue here is probably not related to HTTPS; it's just that the URL you're having trouble fetching happens to be HTTPS. You need to understand why the server is giving you a 403. My guess is that either you need to send some authorization tokens (cookies or URL params), or it is blocking the request because of the user agent (which defaults to "Java" unless you specify it). Lots of services block requests that way. Try setting the user agent to a common browser string; use the Jsoup.connect methods to do that.
(People won't be able to help you more without real example URLs, because we can't tell what the server is doing just with this info.)
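For example, a browser-like user agent can be set through jsoup's fluent connect API; the user-agent string below is just an example, and the URL is the placeholder from the question.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpsCrawler {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.somesite.com/")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")  // browser-like UA
                .timeout(5000)  // same 5 s timeout as in the question
                .get();
        System.out.println(doc.html());
    }
}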
You would need to provide authentication when hitting the URL. Also try the solution in 403 Forbidden with Java but not web browser? if the request works in a browser but not from Java code.
You could also just ignore SSL certificate validation if required:
Jsoup.connect("https://example.com").validateTLSCertificates(false).get()

HTTP 302 redirects fail when using URLConnection in Java, possibly due to load balancer

I have the following code snippet
...
URLConnection conn = sampleURL.openConnection();
int len = conn.getContentLength();
...
Most of the time, when sampleURL sends a 302 redirect response, I see that the above code successfully calls the new URL and fetches the response from there.
However, there are times when the above code fails, i.e. a 302 redirect response is received from sampleURL but the actual redirect never happens. In the Apache logs on the calling side, I see the 302 redirect response being logged, but nothing after that. The call to the new URL is never logged.
I have seen this happening primarily when the application which runs the above code is behind a load balancer, but I'm not sure if this is actually the reason for the problem.
Has anyone faced this issue? Thanks for your help.
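One way to narrow this down (a debugging sketch, with a placeholder URL) is to switch off automatic redirect handling and print the status and Location header yourself; note also that HttpURLConnection will not automatically follow a redirect that switches protocol (for example http to https), which a load balancer may well issue.

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectDebug {
    public static void main(String[] args) throws Exception {
        URL sampleURL = new URL("http://example.com/some/path");  // placeholder for the real sampleURL
        HttpURLConnection conn = (HttpURLConnection) sampleURL.openConnection();
        conn.setInstanceFollowRedirects(false);                   // inspect the 302 itself
        System.out.println("Status:   " + conn.getResponseCode());
        System.out.println("Location: " + conn.getHeaderField("Location"));
    }
}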

Piwik returns 400 bad request when trying to send tracking data via API

I'm developing an app in which I plan to implement Piwik Analytics. Although there seems to be no real documentation (unless I'm blind) on how to use the Java Tracking API (this one), I think I managed to figure it out, but when I try it, I get this error:
WARNING: Warning:400 Bad Request
org.piwik.PiwikException: error:400 Bad Request
at org.piwik.SimplePiwikTracker.sendRequest(SimplePiwikTracker.java:878)
at test.Main.main(Main.java:32)
Here is my code:
SimplePiwikTracker tracker = new SimplePiwikTracker("http://example.com/piwik/piwik.php");
tracker.setIdSite(2);
tracker.setPageUrl("http://example.com/javatest/test");
tracker.setPageCustomVariable("test", "10");
//This is line 32
tracker.sendRequest(tracker.getLinkTrackURL("http://example.com/piwik/piwik.php"));
I also tried this:
String datURL = tracker.getLinkTrackURL("http://example.com/piwik/piwik.php").toString();
URL url = new URL(datURL);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println(connection.getResponseCode());
connection.disconnect();
But this just prints out response code 404, which doesn't make any sense, because if I paste the link generated by tracker.getLinkTrackURL into my browser it works perfectly and logs correctly in Piwik (I have debug mode on).
So my question is, why is Piwik returning a bad request and why can't Java find the URL (404) when it works perfectly fine in my browser?
I figured it out. Instead of having the URL point to http://example.com/piwik/piwik.php, it should just go to http://example.com/piwik.

Infinite redirect loop in HTTP request

I'm getting a "too many redirects" error from URLConnection when trying to fetch www.palringo.com:
URL url = new URL("http://www.palringo.com/");
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println("Response code = " + connection.getResponseCode());
outputs the dreaded:
Exception in thread "main" java.net.ProtocolException: Server redirected too many times (20)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
According to wget there is just one redirect, from www.palringo.com to www.palringo.com/en/gb/
Any ideas why my request using URLConnection for /en/gb results in another 302 response for the same resource?
The problem is exemplified by:
URL url = new URL("http://www.palringo.com/en/gb/");
HttpURLConnection.setFollowRedirects(false);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// Just for testing, use Chrome header, to eliminate "anti-crawler" response!
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30");
System.out.println("Response code = " + connection.getResponseCode());
This outputs:
Response code = 302
Redirected to /en/gb/
hence an infinite redirect loop.
Interestingly, although browsers and wget handle it, curl does not:
joel#bohr:/tmp$ curl http://www.palringo.com/en/gb/
curl: (7) couldn't connect to host
A request for /en/gb/ is redirected to /en/gb/ precisely once.
The problem is that your HttpURLConnection (or whatever code you use -- sorry, I'm NOT familiar with Java) does not use cookies.
Disable cookies in browser and observe exactly the same behaviour -- infinite redirect.
The reason: the server checks whether the cookie is set. If it is not set, it sets the cookie and redirects. Because cookies are not supported/disabled, the script on the server side redirects over and over again.
Solution: Enable/add cookie support to your code and try again.
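A minimal sketch of that for plain HttpURLConnection: install the JDK's default CookieManager once, before making requests, and the connection will send back whatever cookie the server sets on the first 302.

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectWithCookies {
    public static void main(String[] args) throws Exception {
        // process-wide cookie store; HttpURLConnection consults it automatically
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        HttpURLConnection connection =
                (HttpURLConnection) new URL("http://www.palringo.com/en/gb/").openConnection();
        System.out.println("Response code = " + connection.getResponseCode());
    }
}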
I think the redirect is defined with a pattern like /* -> /en/gb.
So, when you arrive at /en/gb, the redirect rule fires again.
Check your redirect rules. Where are they defined? In the Apache web server or somewhere else? Check them all, verify whether this is the case, and fix the rules accordingly.
The problem is on the server side. It might be a broken Apache httpd rewrite rule that is sending redirects that loop back to the same place. It might be something else. Whatever it is, you are unlikely to be able to fix it on the client side.
I'm basically running a crawler and just noticed this issue.
Ah.
It is possible that it is an anti-crawler defence measure. "Hmmm ... looks like one of those pesky crawlers who ignore my robots.txt file, waste all of my bandwidth and steal my precious content. Let's cause him some pain with a redirect loop!!"
Check that your crawler is obeying the "robots.txt" protocol. Check the ToS for the site you are crawling to see if what you are doing is allowed.
You could be right, but if so how come wget and browsers handle this with just the one redirect?
Maybe because the server is looking at the request headers, or at your pattern of requests.
The Terms of Service (that I see) say this:
"You agree to not use the Service to: ... xiii - Run any automated systems, processes, scripts or bots for any purpose without the express written permission of Palringo."
Arguably, crawling their site is in violation of that.
You will also get this error if you're trying to connect to a service that requires authentication and you provide the wrong username and password.

java.io.IOException: Server returned HTTP response code: 403 for URL [duplicate]

This question already has answers here:
403 Forbidden with Java but not web browser?
(4 answers)
Closed 4 years ago.
My code goes like this:
URL url;
URLConnection uc;
StringBuilder parsedContentFromUrl = new StringBuilder();
String urlString="http://www.example.com/content/w2e4dhy3kxya1v0d/";
System.out.println("Getting content for URl : " + urlString);
url = new URL(urlString);
uc = url.openConnection();
uc.connect();
uc.getInputStream();
BufferedInputStream in = new BufferedInputStream(uc.getInputStream());
int ch;
while ((ch = in.read()) != -1) {
    parsedContentFromUrl.append((char) ch);
}
System.out.println(parsedContentFromUrl);
However, when I access the URL through a browser there is no problem, but when I try to access it through a Java program, it throws an exception:
java.io.IOException: Server returned HTTP response code: 403 for URL
What is the solution?
Add the code below in between uc.connect(); and uc.getInputStream();:
uc = url.openConnection();
uc.addRequestProperty("User-Agent",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
However, it is a nice idea to only allow certain types of user agents on the server. This will keep your website safe and bandwidth usage low.
There are some bad 'User-Agents' you might want to block from your server if you don't want people leeching your content and bandwidth. But, as you can see in my example above, the user agent can be spoofed.
403 means forbidden. From here:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
You need to contact the owner of the site to make sure the permissions are set properly.
EDIT: I see your problem. I ran the URL through Fiddler and noticed that I am getting a 407, which means the following. This should help you go in the right direction.
10.4.8 407 Proxy Authentication Required
This code is similar to 401 (Unauthorized), but indicates that the client must first authenticate itself with the proxy. The proxy MUST return a Proxy-Authenticate header field (section 14.33) containing a challenge applicable to the proxy for the requested resource. The client MAY repeat the request with a suitable Proxy-Authorization header field (section 14.34). HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication".
Also see this relevant question.
java.io.IOException: Server returned HTTP response code: 403 for URL
If the browser can access the page and your code cannot, then there's something different between the browser request and your request. You can look at the browser request using, say, Firebug, to see what the differences are. Some things I can think of are:
The site sets a cookie (maybe during login). You may be able to handle this in code, but you will have to explicitly add support for passing the cookie. This is most likely.
The site filters based on user agents. You can set the user agent. This is not as likely.
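If the browser tools show that a cookie or the user agent is the difference, both can be supplied on the connection before it is opened; the header values below are placeholders you would copy from the real browser request.

import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeRequest {
    public static void main(String[] args) throws Exception {
        HttpURLConnection uc = (HttpURLConnection)
                new URL("http://www.example.com/content/w2e4dhy3kxya1v0d/").openConnection();
        // placeholder values -- copy the real ones from the browser request
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        uc.setRequestProperty("Cookie", "sessionid=PASTE_VALUE_FROM_BROWSER");
        System.out.println(uc.getResponseCode());
    }
}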
