Trying to get html texts from google url, but error 401


I am trying to retrieve the HTML of a list of pages returned by Google. Most of them work fine, but URLs such as https://www.google.com/patents/US6034687 always fail with a 401 error:
Server returned HTTP response code: 401 for URL: https://www.google.com/patents/US6034687
I am using Java. I looked up this error code and it seems to be authentication related, but this kind of URL can be opened in any browser without a login prompt. So I am confused about why only this kind of URL fails for me.
Here is my code for retrieving the HTML:
URL u=new URL(url);
StringBuilder html =new StringBuilder();
HttpURLConnection conn = (HttpURLConnection) u.openConnection();
conn.setRequestMethod("GET");
conn.setRequestProperty("Accept", "text/html");
BufferedReader br;
try {
br = new BufferedReader(new InputStreamReader((conn.getInputStream())));
String out="";
while ((out= br.readLine()) != null) {
// System.out.println(out);
html.append(out+"\n");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Any idea?
thanks

Try sending a User-Agent header with the request. The 401 status is misleading here: some servers reject requests from clients that do not look like a browser.
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.2; rv:21.0) Gecko/20100101 Firefox/21.0");
BTW, when you do openConnection() for an https scheme, the return value is HttpsURLConnection, which extends HttpURLConnection.
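A minimal sketch of the suggestion above, assuming the server only checks for a browser-like User-Agent. The class name BrowserLikeFetch and its two methods are hypothetical, and UTF-8 is assumed for the response body. getErrorStream() is used so that the body of a 4xx/5xx response can still be read for diagnostics instead of being lost in an IOException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class BrowserLikeFetch {
    // Example User-Agent string taken from the answer above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 5.2; rv:21.0) Gecko/20100101 Firefox/21.0";

    // Creates the connection object and sets the headers; no network I/O happens yet.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "text/html");
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Sends the request and returns the body, reading the error stream for 4xx/5xx.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = open(url);
        int status = conn.getResponseCode();
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (in == null) {
            return ""; // the server sent no body
        }
        StringBuilder html = new StringBuilder();
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader br =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Calling BrowserLikeFetch.fetch(url) for each URL in the list would replace the loop body in the question. If the server still answers 401, the returned error body may explain why.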

The request requires user authentication. The response MUST include a WWW-Authenticate header field containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field. If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information. HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication".

Related

HTTP Response code 401 on HttpUrlConnection.getInputStream

Please consider the following code snippet:
URL url = new URL("https://wfs.geodienste.ch/av/deu?&LAYERS=LCSFC,SOLI,SOSFC,LOCPOS,HADR,LNNA,OSNR,RESF,OSBP,MBSF&FORMAT=image%2Fjpeg&DPI=100&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A2056&BBOX=2637158.3069220297,1236087.7425729688,2639935,1237632&WIDTH=2463&HEIGHT=1369");
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
InputStream is = uc.getInputStream();
For the given URL I get a 401 Exception:
java.io.IOException: Server returned HTTP response code: 401 for URL: https://wfs.geodienste.ch/av/deu?&LAYERS=LCSFC,SOLI,SOSFC,LOCPOS,HADR,LNNA,OSNR,RESF,OSBP,MBSF&FORMAT=image%2Fjpeg&DPI=100&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A2056&BBOX=2637158.3069220297,1236087.7425729688,2639935,1237632&WIDTH=2463&HEIGHT=1369
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1839)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
at HTTPTestMain.doit(HTTPTestMain.java:43)
at HTTPTestMain.main(HTTPTestMain.java:28)
Now I'd expect the URL to ask for credentials when opened in a browser. Surprisingly, no credentials are needed and the response in, e.g., Firefox is 200.
Just out of curiosity I added the following line of code:
uc.setRequestProperty("Authorization", "Basic ");
Still the same 401 response.
Can you tell me what's needed to get the Response right with Java?
Kind regards
Klib

401 means "Unauthorized", so there must be something wrong with your credentials. You could use an Authenticator:
Authenticator.setDefault(new Authenticator(){
@Override
protected PasswordAuthentication getPasswordAuthentication(){
return new PasswordAuthentication(login, password.toCharArray());
}
});
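For completeness, a self-contained sketch of the same idea. The class name DefaultAuthenticatorSketch and the install method are hypothetical, and the credentials are whatever the caller passes in. Note the assumption behind this approach: the Authenticator is only consulted when the server actually sends an authentication challenge, so it does not help if the server rejects the request for some other reason.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

class DefaultAuthenticatorSketch {
    // Registers a JVM-wide Authenticator that answers every
    // authentication challenge with the same credentials.
    static void install(String login, String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(login, password.toCharArray());
            }
        });
    }
}
```

Call DefaultAuthenticatorSketch.install(login, password) once, before opening the connection.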

I was also able to access that page from a browser without any password.
It is possible that they are treating requests from a web browser differently from programmatic requests; i.e. from an automated scraper. For example, they may be looking at the "User-Agent" header in your requests.
But if you try to reverse engineer this to evade possible restrictions, they may decide to block you using other mechanisms.
I think you need to contact the site's support: https://geodienste.ch/support. Ask them how to deal with this. Find out if you need an account, and how to get one.

How to make a get MS Graph Request with access code?

I'm trying to list items from some list in Office 365 SharePoint from a java native Windows app.
I'm using deprecated office-365-java-sdk to authenticate and get an access token. Yes, this SDK is deprecated but authentication still works. So, I have an access token.
So, next step is to make GET request. In Graph Explorer this URL works fine:
/v1.0/sites/root/lists/{site-id}/items
I followed documentation to build the request and I have to add a header with authentication token so this is my code:
StringBuilder result = new StringBuilder();
URL url = new URL("https://graph.microsoft.com/v1.0/sites/root/lists/{0a506dcb-ecbc-40ed-bf2c-5912e78e3ca8}/items");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
conn.setRequestProperty("Authorization", "Bearer " + access_token);
conn.setRequestProperty("Content-type", "application/json");
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = rd.readLine()) != null)
{
result.append(line);
}
rd.close();
System.out.println(result.toString());
Authentication is working, because if access token header is not added, it returns a status error code of 401 Required authentication information is either missing or not valid for the resource. But with an access code, it returns error code 400 Cannot process the request because it is malformed or incorrect.
I'm stuck with this, I read the documentation again and again and after checking URL is right using Graph Explorer, I don't know if this is not the right way to include headers or what....

The correct header to pass is Accept: application/json.
So, replace the conn.setRequestProperty("Content-type","application/json"); with
conn.setRequestProperty("Accept","application/json");

How to authenticate spring boot endpoint programmatically?

In my Spring Boot application I enable one endpoint, the metrics endpoint. At the same time I don't want it to be public, so I configured it with the following settings:
security.user.name=admin
security.user.password=ENC(l2y+PuJeGIOMshbv+ddZgK8lOe2TRdt9YIuMwB5g5Ws=)
security.basic.enabled=false
management.context-path=/manager
management.port=8082
management.address=127.0.0.2
management.security.enabled=false
management.security.roles=SUPERUSER
management.ssl.enabled=true
management.ssl.key-store=file:keystore.jks
management.ssl.key-password=ENC(l2y+PuJeGIOMshbv+ddZgK8lOe2TRdt9YIuMwB5g5Ws=)
endpoints.metrics.id=metrics
endpoints.metrics.sensitive=true
endpoints.metrics.enabled=true
Basically, anyone trying to access https://127.0.0.2:8082/manager/metrics through a browser has to supply the username (security.user.name=admin) and the password (security.user.password=ENC(l2y+PuJeGIOMshbv+ddZgK8lOe2TRdt9YIuMwB5g5Ws=)) in a popup.
Now I have a Java client running on the same machine (in future it may run in a remote location) but on a different host and port, 127.0.0.1:8081. It tries to access the above URI programmatically, but fails with response code 401.
401 is the response code for unauthorised access, which is expected. My question: is it possible to supply the username and password programmatically to access https://127.0.0.2:8082/manager/metrics, or is a firewall the only way to secure it?
My java client code:
System.setProperty("javax.net.ssl.trustStore","D://SpringBoot/SpringWebAngularJS/truststore.ts");
System.setProperty("javax.net.ssl.trustStorePassword","p#ssw0rd");
try {
HttpsURLConnection.setDefaultHostnameVerifier(new HostnameVerifier(){
public boolean verify(String hostname,SSLSession sslSession) {
return hostname.equals("127.0.0.2");
}
});
URL obj = new URL("https://127.0.0.2:8082/manager/metrics");
HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
con.setRequestProperty( "Content-Type", "application/json" );
con.setRequestProperty("Accept", "application/json");
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", "Mozilla/5.0");
int responseCode = con.getResponseCode();
System.err.println("GET Response Code :: " + responseCode); //Response code 401
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

Based on the described behavior of the browser, I assume that the endpoint is secured by basic authentication.
Basic authentication expects an "Authorization" header whose value is the word "Basic", a space, and the string "username:password" encoded in Base64.
If you use Java 8 or later, you can use the Base64 encoder provided in the java.util package as follows:
import java.nio.charset.StandardCharsets;
import java.util.Base64;
// encodeToString returns a String; the "Basic " prefix is required
con.setRequestProperty("Authorization", "Basic " + Base64.getEncoder().encodeToString("yourUsername:yourPassword".getBytes(StandardCharsets.UTF_8)));
Just to provide a little further information:
This is the exact same thing your browser does. It sends the request and gets a 401 Response containing a header WWW-Authenticate: Basic. So the browser knows that the authentication method is basic auth and asks you to provide your username and password which then will be encoded by base64 and added to the authorization header when the browser performs the same request a second time.
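The header value the browser sends on that second request can be built with a small helper. This is a sketch only; the class name BasicAuthHeader and the method name of are hypothetical, and UTF-8 is assumed for the credentials.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthHeader {
    // Builds the Authorization header value: "Basic " + base64("user:password").
    static String of(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

With the client from the question this would be used as con.setRequestProperty("Authorization", BasicAuthHeader.of(username, password)). For the example credentials in the Basic authentication specification, user "Aladdin" with password "open sesame", the helper returns "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==".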

java.io.IOException: Server returned HTTP response code: 403 for URL [duplicate]

My code goes like this:
URL url;
URLConnection uc;
StringBuilder parsedContentFromUrl = new StringBuilder();
String urlString="http://www.example.com/content/w2e4dhy3kxya1v0d/";
System.out.println("Getting content for URl : " + urlString);
url = new URL(urlString);
uc = url.openConnection();
uc.connect();
uc.getInputStream();
BufferedInputStream in = new BufferedInputStream(uc.getInputStream());
int ch;
while ((ch = in.read()) != -1) {
parsedContentFromUrl.append((char) ch);
}
System.out.println(parsedContentFromUrl);
When I access the URL through a browser there is no problem, but when I access it through a Java program it throws an exception:
java.io.IOException: Server returned HTTP response code: 403 for URL
What is the solution?

Set a User-Agent header right after url.openConnection() and before uc.connect(); request properties cannot be changed once the connection is open:
uc = url.openConnection();
uc.addRequestProperty("User-Agent",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
From the server's side, it can be a good idea to allow only certain types of user agents. Blocking known bad user agents keeps bandwidth usage low and discourages people from leeching your content.
Keep in mind, though, that the user agent can be spoofed, as the example above shows.

403 means Forbidden. From the HTTP/1.1 specification (RFC 2616):
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
You need to contact the owner of the site to make sure the permissions are set properly.
EDIT I see your problem. I ran the URL through Fiddler. I noticed that I am getting a 407 which means below. This should help you go in the right direction.
10.4.8 407 Proxy Authentication Required
This code is similar to 401 (Unauthorized), but indicates that the client must first authenticate itself with the proxy. The proxy MUST return a Proxy-Authenticate header field (section 14.33) containing a challenge applicable to the proxy for the requested resource. The client MAY repeat the request with a suitable Proxy-Authorization header field (section 14.34). HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication"
Also see this relevant question.
java.io.IOException: Server returned HTTP response code: 403 for URL

If the browser can access the page and your code cannot, then something differs between the browser's request and yours. You can inspect the browser request with, say, Firebug to see what the differences are. Some things I can think of:
The site sets a cookie (maybe during login). You may be able to handle this in code, but you will have to explicitly add support for passing the cookie. This is the most likely cause.
The site filters based on user agents. You can set the user agent. This is not as likely.

Cookies turned off with Java URLConnection

I am trying to make a request to a webpage that requires cookies. I'm using HTTPUrlConnection, but the response always comes back saying
<div class="body"><p>Your browser's cookie functionality is turned off. Please turn it on.
How can I make the request so that the queried server thinks I have cookies turned on? My code goes something like this:
private String readPage(String page) throws MalformedURLException {
try {
URL url = new URL(page);
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.connect();
InputStream in = uc.getInputStream();
int v;
while( (v = in.read()) != -1){
sb.append((char)v);
}
in.close();
uc.disconnect();
} catch (IOException e){
e.printStackTrace();
}
return sb.toString();
}

You need to install a CookieHandler for the JVM to handle cookies. Before Java 6 there is no CookieHandler implementation in the JRE, so you have to write your own. On Java 6 or later you can do this:
CookieHandler.setDefault(new CookieManager());
URLConnection's cookie handling is really weak. It barely works. It doesn't handle all the cookie rules correctly. You should use Apache HttpClient if you are dealing with sensitive cookies like authentication.
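To see what the default CookieManager does, it can be exercised directly without any network traffic. This is a sketch under assumptions: the class name CookieManagerSketch, the URL http://example.com/page, and the cookie sid=abc123 are all placeholders made up for illustration.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class CookieManagerSketch {
    // Installs a JVM-wide cookie handler backed by an in-memory store.
    // URLConnection consults the default handler on every HTTP request.
    static CookieManager install() {
        CookieManager manager =
                new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Simulates a response carrying Set-Cookie, then asks which Cookie header
    // would be attached to a follow-up request to the same URL.
    static Map<String, List<String>> roundTrip(CookieManager manager) throws IOException {
        URI uri = URI.create("http://example.com/page"); // placeholder URL
        Map<String, List<String>> responseHeaders = Collections.singletonMap(
                "Set-Cookie", Collections.singletonList("sid=abc123; Path=/"));
        manager.put(uri, responseHeaders);
        return manager.get(uri, Collections.<String, List<String>>emptyMap());
    }
}
```

In the readPage method from the question, calling CookieManagerSketch.install() once before the first openConnection() should be enough for cookies set by one response to be sent with later requests.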

I think the server cannot tell from the first request alone whether a client supports cookies, so it probably sets a cookie and sends a redirect to check. Try disabling automatic redirects:
uc.setInstanceFollowRedirects(false);
Then you will be able to get cookies from response and use them (if you need) on the next request.
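Carrying the cookies over by hand could look like the sketch below. The class name ManualCookies and the method toCookieHeader are hypothetical. The helper takes the map returned by uc.getHeaderFields() and keeps only name=value pairs, ignoring attributes such as Path or Expires.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ManualCookies {
    // Turns the Set-Cookie lines of a response into one Cookie request-header value.
    static String toCookieHeader(Map<String, List<String>> headerFields) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : headerFields.entrySet()) {
            // The status line is stored under a null key, so guard before comparing.
            if (e.getKey() == null || !e.getKey().equalsIgnoreCase("Set-Cookie")) {
                continue;
            }
            for (String line : e.getValue()) {
                for (HttpCookie c : HttpCookie.parse(line)) {
                    pairs.add(c.getName() + "=" + c.getValue());
                }
            }
        }
        return String.join("; ", pairs);
    }
}
```

After the first request, the result of ManualCookies.toCookieHeader(uc.getHeaderFields()) can be passed to setRequestProperty("Cookie", ...) on the connection for the redirect target, which is available in the Location response header.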

uc.getHeaderFields()
// get cookie (set-cookie) here
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2");
conn.addRequestProperty("Referer", "http://xxxx");
conn.addRequestProperty("Cookie", "...");

If you're trying to scrape large volumes of data after a login, you may even be better off with a scripted web scraper like WebHarvest (http://web-harvest.sourceforge.net/). I've used it to great success in some of my own projects.
