I am trying out a simple program for reading the HTML content from a given URL. The URL I am trying in this case doesn't require any cookie/username/password, but I am still getting a java.io.IOException: Server returned HTTP response code: 403 error. Can anyone tell me what I am doing wrong here? (I know there are similar questions on SO, but they didn't help.)
import java.io.*;
import java.net.*;

public class urlcont {
    public static void main(String[] args) {
        try {
            URL u = new URL("http://www.amnesty.org/");
            URLConnection uc = u.openConnection();
            uc.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
            uc.connect();
            InputStream in = uc.getInputStream();
            int b;
            File f = new File("C:\\Users\\kausta\\Desktop\\urlcont.txt");
            f.createNewFile();
            OutputStream s = new FileOutputStream(f);
            while ((b = in.read()) != -1) {
                s.write(b);
            }
            // Close the streams so the file is fully flushed to disk
            s.close();
            in.close();
        } catch (MalformedURLException e) {
            System.err.println(e);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}
If you can fetch the URL in a browser but not via Java, that suggests the server is blocking programmatic access via user-agent filtering. Try setting the user agent on your connection so that your code appears, to the web server, to be a web browser.
See this thread for help on that: What is the proper way of setting headers in a URLConnection?
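For example, a minimal sketch (the user-agent string here is just an illustration; any common browser string will do):

URLConnection uc = u.openConnection();
// Present a browser-like User-Agent so user-agent filters let the request through
uc.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0");
uc.connect();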
There is a permission problem:
A web server may return a 403 Forbidden HTTP status code in response to a request from a client for a web page or resource, to indicate that the server refuses to allow the requested action.
You are not doing anything "wrong"; the server you are trying to access is blocking your request because you are not allowed to access the file.
HTTP error 403 means Forbidden: the remote server blocks the request.
Check whether you need to provide authentication to access the document you want, and in that case supply it with the request ;)
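For instance, if the resource uses HTTP Basic auth, a minimal sketch (the credentials are hypothetical; assumes Java 8+ for java.util.Base64, and reuses the uc connection from the question):

String credentials = "user:password"; // hypothetical credentials
String encoded = java.util.Base64.getEncoder()
        .encodeToString(credentials.getBytes(java.nio.charset.StandardCharsets.UTF_8));
// Attach the Authorization header before connecting
uc.setRequestProperty("Authorization", "Basic " + encoded);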
I'm working on an integration with a REST API and I need to make calls to their URLs to receive the data.
I'm just wondering if it's possible to use a REST web service which will be mapped to that certain URL instead of the local one, and later on I will write the client side that will be mapped to these calls.
For example:
#Path("/URL")
public class MessageRestService {
#GET
#Path("/{param}")
public Response printMessage(#PathParam("param") String msg) {
String result = "Restful example : " + msg;
return Response.status(200).entity(result).build();
}
}
I can't make straight API calls from the client side, for example using AngularJS, because I get this error:
Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:63342' is therefore not allowed access. The response had HTTP status code 400.
I did find code samples for making straight API calls to URLs from Java, but it looks messy, especially when you have to write it for a lot of API calls:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class Connection {
    public static void main(String[] args) {
        try {
            URL url = new URL("INSERT URL HERE");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            String messageToPost = "POST";
            OutputStream os = conn.getOutputStream();
            // Write the request body (the original snippet referenced an
            // undeclared variable "input" here)
            os.write(messageToPost.getBytes());
            os.flush();
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    conn.getInputStream()));
            String output;
            System.out.println("Output from Server .... \n");
            while ((output = br.readLine()) != null) {
                System.out.println(output);
            }
            conn.disconnect();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
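If you do end up making many such calls, the boilerplate can be factored into a helper. A hypothetical sketch (the class and method names are mine):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestCaller {
    // Hypothetical helper: POSTs a JSON body to the given URL and returns the response text
    public static String postJson(String urlString, String jsonBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(jsonBody.getBytes("UTF-8"));
        }
        StringBuilder response = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                response.append(line);
            }
        }
        conn.disconnect();
        return response.toString();
    }
}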
You are facing a same origin policy issue.
This is because your client-side (web browser) application is fetched from Server-A, while it tries to interact with data on Server-B.
Server-A is wherever your application is fetched from (before it is displayed to the user in their web browser).
Server-B is localhost, where your mock service is deployed.
For security reasons, by default, only code originating from Server-B can talk to Server-B (over-simplifying a little bit). This is meant to prevent malicious code from Server-A from hijacking a legitimate application from Server-B and tricking it into manipulating data on Server-B behind the user's back.
To overcome this, if a legitimate application from Server-A needs to talk to Server-B, Server-B must explicitly allow it. For this you need to implement CORS (Cross-Origin Resource Sharing); try googling this, you will find plenty of resources that explain how to do it. https://www.html5rocks.com/en/tutorials/cors/ is also a great starting point.
However, as your Server-B/localhost service is just a mock service used during development and test, if your application is simple enough, you may get away with simply having the mock service add the following HTTP headers to all its responses:
Access-Control-Allow-Origin:*
Access-Control-Allow-Headers:Keep-Alive,User-Agent,Content-Type,Accept [extend with whatever you use in your app]
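As an illustration, a minimal sketch of a JAX-RS 2.x response filter that adds these headers to every response (the class name is mine; dev/test only):

import java.io.IOException;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerResponseContext;
import javax.ws.rs.container.ContainerResponseFilter;
import javax.ws.rs.ext.Provider;

// Permissive CORS headers for a mock service; do not ship this to production
@Provider
public class CorsFilter implements ContainerResponseFilter {
    @Override
    public void filter(ContainerRequestContext request,
                       ContainerResponseContext response) throws IOException {
        response.getHeaders().add("Access-Control-Allow-Origin", "*");
        response.getHeaders().add("Access-Control-Allow-Headers",
                "Keep-Alive,User-Agent,Content-Type,Accept");
    }
}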
As an alternative solution (during dev/tests only!) you may try forcing the web browser to disregard the same origin policy (e.g. --disable-web-security for Chrome), but this is dangerous if you do not take care to use separate instances of the web browser for your tests and for your regular web browsing.
I'm trying to fetch a CSV-formatted webpage to use as a rudimentary database. The test page is at http://prog.bhstudios.org/bhmi/database/get, and browsers open it no problem. However, when I run the following code, Java throws a 403 error:
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Main
{
    static
    {
        Logger.getGlobal().setLevel(Level.ALL);
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException
    {
        InputStream is = null;
        try
        {
            System.out.println("Starting...");
            URL url = new URL("http://prog.bhstudios.org/prog/bhmi/database/get/");
            URLConnection urlc = url.openConnection();
            urlc.connect();
            is = urlc.getInputStream();
            int data;
            while ((data = is.read()) != -1)
            {
                System.out.print((char) data);
            }
            System.out.println("\r\nSuccess!");
        }
        catch (IOException ex)
        {
            Logger.getGlobal().log(Level.SEVERE, ex.getMessage(), ex);
            System.out.println("\r\nFailure!");
        }
        if (is != null)
            is.close();
    }
}
Here's the console output:
Starting...
Nov 18, 2013 3:01:48 PM org.bh.mi.Main main
SEVERE: Server returned HTTP response code: 403 for URL: http://prog.bhstudios.org/prog/bhmi/database/get/
java.io.IOException: Server returned HTTP response code: 403 for URL: http://prog.bhstudios.org/prog/bhmi/database/get/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
at org.bh.mi.Main.main(Main.java:36)
Failure!
Note that 403 means the server is up and received the request properly, but refuses to do anything further.
Now here's the kicker: If I get, say, http://example.com, it works just fine!
How can I get my Java app to read this file from my webserver?
I tested against your server, and if I submit the request (using TamperData) with User-Agent: Java/1.6.0_14 (I just picked a random Java version), your webserver responds with 403 Forbidden.
My browser shows the following error message:
Error 1010
Access denied
What happened?
The owner of this website (prog.bhstudios.org) has banned your access based on your browser's signature (cf7ab9f58210755-ua21).
In other words, your server (or more likely your proxy, as the headers indicate use of cloudflare-nginx and ASP.NET) filters based on user-agent strings. This is probably done to prevent bots and screen scrapers from accessing your websites.
You either need to drop this filter (ask your proxy administrator), or set a different user agent for URLConnection; see Setting user agent of a java URLConnection and How to modify the header of a HttpUrlConnection.
Your server for some reason is configured to forbid access when the request header
User-Agent: Java/...
is present. I was able to reproduce the problem, and also got it to work by doing:
URLConnection urlc = url.openConnection();
urlc.setRequestProperty("User-Agent", "");
urlc.connect();
I'm using simple code to fetch the HTML of http://www.ip-adress.com, but it returns HTTP error code 403.
Other sites like google.com work fine in the program, and I can open www.ip-adress.com in a browser, so why can't I access it from my Java program?
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class urlconnection
{
    public static void main(String[] args)
    {
        StringBuffer document = new StringBuffer();
        try
        {
            URL url = new URL("http://www.ip-adress.com");
            URLConnection conn = url.openConnection();
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line = null;
            while ((line = reader.readLine()) != null)
                document.append(line + " ");
            reader.close();
        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        System.out.println(document.toString());
    }
}
java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.ip-adress.com/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at urlconnection.main(urlconnection.java:14)
This is the line you need:
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
Refer to this.
The web server can detect that your request is not coming from a regular browser, so it rejects it. There are ways to fake the request to trick the server into thinking that you are a browser.
I suppose the site checks the User-Agent header and blocks what looks to it like "a robot". You need to mimic a normal browser. Check this solution: Setting user agent of a java URLConnection, or try using Commons HttpClient and set the user agent there.
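For example, a minimal sketch using the newer Apache HttpClient (HttpComponents) 4.3+ builder API rather than the legacy Commons HttpClient (the user-agent string is just an illustration):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class UserAgentExample {
    public static void main(String[] args) throws Exception {
        // Build a client that sends a browser-like User-Agent on every request
        CloseableHttpClient client = HttpClients.custom()
                .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0")
                .build();
        HttpGet get = new HttpGet("http://www.ip-adress.com");
        try (CloseableHttpResponse response = client.execute(get)) {
            System.out.println(EntityUtils.toString(response.getEntity()));
        } finally {
            client.close();
        }
    }
}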
I don't believe that this is fundamentally a Java problem. You're doing the right thing to make an HTTP connection, and the server is doing "the right thing" from its perspective by responding to your request with a 403 response.
Let's be clear about this - the response you're getting is due to whatever logic is being employed by the target webserver.
So if you were to ask "how can I modify my request so that http://www.ip-adress.com returns a 200 response", then people may be able to come up with workarounds that keep that server happy. But this is a host-specific process; your Java code is arguably correct, though it should have better error handling, because you can always get non-2xx responses.
Try changing the connection's User-Agent to something browser-like; most of the time I use Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1.
I am not able to open a URLConnection to a particular web resource; I am getting
"java.net.ConnectException: Connection timed out". Is it because that domain is blocking the direct URL connection? If so, how are they blocking it? Below is the code snippet I wrote.
Code snippet:
import java.io.*;
import java.net.*;

public class TestFileRead {
    public static void main(String[] args) {
        try {
            String serviceUrl = "http://xyz.com/examples.zip";
            HttpURLConnection serviceConnection = (HttpURLConnection) new URL(serviceUrl).openConnection();
            System.out.println(serviceConnection);
            serviceConnection.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)");
            DataInputStream din = new DataInputStream(serviceConnection.getInputStream());
            FileOutputStream fout = new FileOutputStream("downloaded");
            DataOutputStream dout = new DataOutputStream(fout);
            // Read until end of stream; available() is unreliable on network streams
            int b;
            while ((b = din.read()) != -1) {
                dout.write(b);
            }
            dout.close();
            din.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
You are probably using the proxy setup in your browser to access the Yahoo home page, which explains why it works in your browser and not in your code. You require a proxy configuration for your Java application.
The simplest way would be to set the system properties http.proxyHost and http.proxyPort when running the code (in Eclipse, or when running from the command line, just add -Dhttp.proxyHost=your.host.com -Dhttp.proxyPort=80) and you should be good to go. Pick up the proxy settings from your browser configuration/settings.
EDIT: This link does a decent job of explaining the possible solutions when dealing with proxies in Java.
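Alternatively, you can set the same properties programmatically before opening the connection; a minimal sketch (the proxy host and port are placeholders, take the real values from your browser settings):

// Placeholder proxy settings; use the values from your browser configuration
System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "80");
// Any URLConnection opened after this point will go through the proxy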
Try this; it works fine for me, returning the index page.
String serviceUrl = "http://yahoo.com";
HttpURLConnection serviceConnection = (HttpURLConnection) new URL(serviceUrl).openConnection();
serviceConnection.addRequestProperty("User-Agent", "blah"); //some sites deny access to some pages when User-Agent is Java
BufferedReader in = new BufferedReader(new InputStreamReader(serviceConnection.getInputStream()));
I am trying to make a request to a webpage that requires cookies. I'm using HttpURLConnection, but the response always comes back saying:
<div class="body"><p>Your browser's cookie functionality is turned off. Please turn it on.
How can I make the request such that the queried server thinks I have cookies turned on? My code goes something like this.
private String readPage(String page) throws MalformedURLException {
    StringBuilder sb = new StringBuilder(); // was undeclared in the original snippet
    try {
        URL url = new URL(page);
        HttpURLConnection uc = (HttpURLConnection) url.openConnection();
        uc.connect();
        InputStream in = uc.getInputStream();
        int v;
        while ((v = in.read()) != -1) {
            sb.append((char) v);
        }
        in.close();
        uc.disconnect();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}
You need to add a CookieHandler to the system for it to handle cookies. Before Java 6, there was no CookieHandler implementation in the JRE; you had to write your own. On Java 6 or later, you can do this:
CookieHandler.setDefault(new CookieManager());
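Spelled out as a runnable sketch (the ACCEPT_ALL policy and class name are my choices for illustration):

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

public class CookieSetup {
    public static void main(String[] args) {
        // Install a JVM-wide in-memory cookie store; URLConnections opened
        // afterwards will store and resend cookies automatically.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }
}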
URLConnection's cookie handling is really weak. It barely works. It doesn't handle all the cookie rules correctly. You should use Apache HttpClient if you are dealing with sensitive cookies like authentication.
I think the server can't determine on the first request that a client does not support cookies, so it probably sends a redirect. Try disabling redirects:
uc.setInstanceFollowRedirects(false);
Then you will be able to get the cookies from the response and use them (if you need to) on the next request.
uc.getHeaderFields()
// get cookie (set-cookie) here
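Putting that together, a minimal sketch of the manual round trip (the URL is a placeholder; the Set-Cookie handling here is simplified and ignores cookie attributes):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

public class ManualCookieExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/"); // placeholder URL
        HttpURLConnection first = (HttpURLConnection) url.openConnection();
        first.setInstanceFollowRedirects(false); // keep the Set-Cookie response
        // Collect cookies from the first response
        List<String> cookies = first.getHeaderFields().get("Set-Cookie");
        first.disconnect();

        HttpURLConnection second = (HttpURLConnection) url.openConnection();
        if (cookies != null) {
            for (String cookie : cookies) {
                // Send back only the name=value part, not the attributes
                second.addRequestProperty("Cookie", cookie.split(";", 2)[0]);
            }
        }
        second.connect();
        System.out.println("Second response: " + second.getResponseCode());
    }
}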
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2");
conn.addRequestProperty("Referer", "http://xxxx");
conn.addRequestProperty("Cookie", "...");
If you're trying to scrape large volumes of data after a login, you may even be better off with a scripted web scraper like WebHarvest (http://web-harvest.sourceforge.net/); I've used it to great success in some of my own projects.