How can I parse HTML data as if I were in another country?
I've tried using a proxy (code):
System.setProperty("http.proxyHost", "some proxy");
System.setProperty("http.proxyPort", "some port");
but it doesn't work properly. I still get data in my country's language.
I've also tried using a VPN, but when I do, my program (a Jsoup parser) doesn't download anything.
EDIT:
Thanks for your time; the marked answer helped me solve the problem. I found the complete solution there.
That depends on the site you're trying to download. If the site uses IP geolocation, the only solution is to use an appropriate proxy: https://stackoverflow.com/a/1433296/1608594
If the site only uses HTTP headers to determine the language, you can send the Accept-Language, Accept-Charset and Accept-Encoding headers with the proper values.
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Request_fields
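For the header-based case, a minimal sketch with plain HttpURLConnection could look like the following. The class name LanguageHeaders, the helper openAs, the example.com URL and the language tag are placeholders chosen for illustration, not something from the answer above. With Jsoup, the same header can be set through Connection.header(...) before fetching.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class LanguageHeaders {

    // Prepares (but does not yet send) a request that asks for content in the
    // given language. languageTag is an Accept-Language value such as
    // "en-US,en;q=0.9".
    public static HttpURLConnection openAs(String pageUrl, String languageTag) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(pageUrl).openConnection();
        con.setRequestProperty("Accept-Language", languageTag);
        con.setRequestProperty("Accept-Charset", "utf-8");
        return con;
    }

    public static void main(String[] args) throws IOException {
        // example.com is a placeholder URL (assumption).
        HttpURLConnection con = openAs("http://example.com/", "en-US,en;q=0.9");
        // Nothing has been sent yet; this only shows the header that will go out.
        System.out.println("Accept-Language: " + con.getRequestProperty("Accept-Language"));
    }
}
```

This only helps when the site picks the language from the request headers; if it uses IP geolocation, the headers alone will not change the result.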
I wrote a Java program like the one I saw here:
How to read the https page content using java?
but for some sites the code does not work.
I get the error "Server returned HTTP response code: 403 for URL: https://research.investors.com/stock-quotes/nyse-sailpoint-tech-holdings-sail.htm".
It works for
url = "https://maven.apache.org/guides/mini/guide-repository-ssl.html";
Can someone help me?
HTTP status 403 stands for "Forbidden"; most likely investors.com checks your request headers and denies access to the resource.
Try modifying the request headers to send a User-Agent that the site might accept.
403 Forbidden
The request contained valid data and was understood by the server, but the server is refusing action. This may be due to the user not having the necessary permissions for a resource or needing an account of some sort, or attempting a prohibited action (e.g. creating a duplicate record where only one is allowed). This code is also typically used if the request provided authentication by answering the WWW-Authenticate header field challenge, but the server did not accept that authentication. The request should not be repeated.
So the website you want to scrape has probably just restricted requests like yours (I mean requests that were not made from a browser).
But you can try Selenium.
OK, I solved it.
I used con.setRequestProperty to set "User-Agent", "Accept", "Content-Type" and "Accept-Language".
Thank you.
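A rough sketch of that fix is below. The header values are illustrative guesses, since the thread does not say which strings the site accepts, and the class and method names are made up for the example. Content-Type is left out here because it describes a request body, which a plain GET does not have.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserLikeRequest {

    // Illustrative browser-style headers (assumed values, not known to be
    // what any particular site requires).
    public static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36");
        headers.put("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.9");
        return headers;
    }

    // Prepares a connection with the headers applied; nothing is sent yet.
    public static HttpURLConnection open(String pageUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(pageUrl).openConnection();
        for (Map.Entry<String, String> header : browserHeaders().entrySet()) {
            con.setRequestProperty(header.getKey(), header.getValue());
        }
        return con;
    }

    // Sends the request and returns the status code; 403 means the server
    // still refuses the request.
    public static int status(String pageUrl) throws IOException {
        HttpURLConnection con = open(pageUrl);
        try {
            return con.getResponseCode();
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Network access happens only when a URL is passed on the command line.
            System.out.println("Response code is " + status(args[0]));
        } else {
            System.out.println(browserHeaders());
        }
    }
}
```

If the site still answers 403 with these headers, it is probably checking something else (cookies, JavaScript, IP), and a browser-driving tool such as Selenium may be the next thing to try.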
I checked this page and got some useful code for using a proxy in Java when connecting to a webpage.
I can confirm that pages like whatsmyip do indeed tell me that the proxy is working: they show the proxy's IP. The problem is that the page I am accessing from Java code somehow detects my true IP and blocks the content. I do know how it does that (header, return IP, etc.); what I do not know is how to bypass it.
Another interesting thing is that this page works with no problems through one of the best-known online proxy sites: it shows the content. What is even more interesting is that I tried taking that site's IP and using it as the proxy in my program, but there it didn't work: my true IP got detected, which is really strange.
Edit: This is my new code:
System.setProperty("java.net.useSystemProxies","false");
System.setProperty("http.proxyHost", "94.230.208.147");
System.setProperty("http.proxyPort", "9001");
System.setProperty("http.nonProxyHosts", "localhost|127.0.0.1");
System.setProperty("https.proxyHost", "94.230.208.147");
System.setProperty("https.proxyPort", "9001");
System.setProperty("https.nonProxyHosts", "localhost|127.0.0.1");
I can confirm that https://whatsmyip.com/ isn't fooled by this proxy and can see my true IP. What did I forget to include?
Add this at the end of your code:
System.setProperty("http.nonProxyHosts", "localhost|127.0.0.1");
That indicates the hosts that should be accessed without going through the proxy. Typically this defines internal hosts. The value of this property is a list of hosts, separated by the '|' character.
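One way to check what these properties actually do is to ask the JVM's default ProxySelector which proxy it would pick for a given URI. The sketch below assumes the standard JDK selector is in place; proxy.example.com:8080 is a placeholder, and ProxyCheck and proxiesFor are names invented for the example.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ProxyCheck {

    // Asks the JVM's default ProxySelector which proxies it would use for the URI.
    public static List<Proxy> proxiesFor(String uri) {
        return ProxySelector.getDefault().select(URI.create(uri));
    }

    public static void main(String[] args) {
        // Placeholder proxy address (assumption, not a real proxy).
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");
        // Hosts listed here bypass the proxy; entries are separated by '|'.
        System.setProperty("http.nonProxyHosts", "localhost|127.0.0.1");

        // A normal host should be routed through the HTTP proxy...
        System.out.println("example.com -> " + proxiesFor("http://example.com/"));
        // ...while a host on the nonProxyHosts list should be reached directly.
        System.out.println("localhost   -> " + proxiesFor("http://localhost/"));
    }
}
```

This only tells you what java.net would do for URLs opened in the same JVM. If the selector reports the proxy and the remote site still sees your real IP, the proxy itself may be forwarding it (for example in a header), which no client-side setting will hide.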
I was expecting this code to return a 404; however, it produces the output:
"Response code is 200"
How can I differentiate between existent and non-existent web pages? Thanks so much.
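try
{
    // create the HttpURLConnection
    URL url = new URL("http://www.thisurldoesnotexist");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    // getResponseCode() connects and sends the request if that has not happened yet
    System.out.println("Response code is " + connection.getResponseCode());
}
catch (IOException e)
{
    // a host name that does not resolve ends up here as java.net.UnknownHostException
    e.printStackTrace();
}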
EDIT: I see you've called openConnection() but not connect() - could that be the problem? I would expect getResponseCode() to actually make the request if it hasn't already been made, but it's worth just trying that...
That suggests you've possibly got some DNS resolver which redirects to a "helper" (spam) page, or something like that.
The easiest way to see exactly what's going on here is to use Wireshark - have that up and capturing traffic (HTTP-only, to make life easier) and then run your code. You should be able to see what's going on that way.
Note that I wouldn't have expected a 404, because that would involve being able to find a web server to talk to in the first place. If you're trying to go to a host which doesn't exist, there shouldn't be an HTTP response at all. I'd expect connect() to throw an exception.
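To make the distinction concrete, here is a small sketch that separates "the host does not resolve" from "the host answered with 404" from "the page exists". PageProbe, describe and probe are names invented for this example, and the 5-second timeouts are arbitrary. Redirects are not followed, on the assumption that a redirect to a parking or search page is exactly what you want to see rather than hide.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.UnknownHostException;

public class PageProbe {

    // Turns an HTTP status code into a short description.
    public static String describe(int code) {
        if (code == HttpURLConnection.HTTP_NOT_FOUND) {
            return "host answered, page not found (404)";
        }
        if (code >= 200 && code < 300) {
            return "page exists (" + code + ")";
        }
        return "host answered with status " + code;
    }

    // Probes a URL and reports whether the host resolves and what it answers.
    public static String probe(String pageUrl) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(pageUrl).openConnection();
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            con.setInstanceFollowRedirects(false); // keep redirects visible
            return describe(con.getResponseCode());
        } catch (UnknownHostException e) {
            // No web server could be found at all, so there is no HTTP status.
            return "host name does not resolve: " + e.getMessage();
        } catch (IOException e) {
            return "no HTTP response: " + e;
        }
    }

    public static void main(String[] args) {
        // Network access happens only when a URL is passed on the command line.
        if (args.length > 0) {
            System.out.println(probe(args[0]));
        } else {
            System.out.println(describe(404));
        }
    }
}
```

If probe reports a 200 or a redirect for a host that should not exist, that points at the DNS resolver (or a proxy) answering on the host's behalf.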
Try adding a "connection.connect();" or look at the contents returned...
It could be a DNS issue, i.e. your DNS lookup is being sent to a parking page... for example, freedns does this.
You could:
Resolve the IP from the host of the page
Try to connect to port 80 on the resolved IP using plain sockets
This is a bit low-level, however, and adds complexity, since you will need to make a simple GET request through the socket and then validate the response to be sure that it's actually an HTTP server running on port 80.
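The two steps could be sketched roughly like this. RawHttpCheck and its methods are made-up names, the timeouts are arbitrary, and the validation is deliberately crude: it only checks that the first line of the reply looks like an HTTP/1.x status line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RawHttpCheck {

    // True if the line looks like an HTTP/1.x status line, e.g. "HTTP/1.1 200 OK".
    public static boolean isHttpStatusLine(String line) {
        return line != null && line.matches("HTTP/1\\.[01] \\d{3}( .*)?");
    }

    // Resolves the host, connects to port 80 and returns the first response line.
    public static String firstResponseLine(String host) throws IOException {
        // Step 1: resolve the IP (throws UnknownHostException if there is none).
        InetAddress address = InetAddress.getByName(host);
        // Step 2: connect to port 80 on the resolved IP with a plain socket.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address, 80), 5000);
            socket.setSoTimeout(5000);
            Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII);
            // Minimal GET request; "Connection: close" asks the server to hang up afterwards.
            out.write("GET / HTTP/1.1\r\nHost: " + host + "\r\nConnection: close\r\n\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Network access happens only when a host is passed on the command line.
        if (args.length > 0) {
            String line = firstResponseLine(args[0]);
            System.out.println(line + " -> HTTP server: " + isHttpStatusLine(line));
        }
    }
}
```

Note that this confirms only that something speaking HTTP answers on port 80; a DNS parking page would pass the check too, so the response body still needs a look.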
NMap might be able to help you here.
Ideally you should be getting this error:
java.net.UnknownHostException: www.thisurldoesnotexist
But it looks like your URL is resolved by your DNS provider.
For instance, on my company's network, running your code with the URI "http://profile/" displays the employee profile.
Please also check the etc/hosts file if you are on Windows to see whether any settings have been changed.
Like @spgennard, I think this is most likely a DNS issue. The possibilities are:
The URL you have chosen is owned by a DNS speculator.
The URL you have chosen is "parked" by a DNS provider.
Your ISP is messing with your DNS results to send your browser to some search page.
It is also possible that you are accessing the web via a proxy, and the proxy is doing something strange.
The way to diagnose this is to look at the other information in the HTTP responses you are getting, particularly the response body.
I've had to update a previous java application that requests a SOAP response from an external web service. This service is outside our firewall which now requires us to go through a proxy instead of hitting the URL directly.
Currently the Java app uses URLEndpoint, which takes a string for the URL. Usually, when I access a URL through a proxy, I create a URL like so:
URL url = new URL("http", "theproxy.com", 5555, finalUrl);
The problem is that URLEndpoint only takes a string for the URL. I tried to convert the URL to a string using toExternalForm(), but it produced a malformed URL.
Any ideas as to a way around this?
EDIT: I can't use System.setProperty, as this runs alongside a whole heap of other Java applications in Tomcat.
Second edit: I can't set system properties, as that would override the settings for all other applications running on the server, and I can't use jsocks, as the proxy we go through is a Squid proxy, which does not support SOCKS4/5.
Any help appreciated.
That's not how proxies work. The way a proxy works is that you take your normal URL:
http://example.com/service
and instead of looking up "example.com" and connecting to port 80, the message is sent to your proxy host (http://theproxy.com:5555).
Java has built-in proxy handling using the http.proxyHost and http.proxyPort system properties.
So in your case you would need to do:
System.setProperty("http.proxyHost", "theproxy.com");
System.setProperty("http.proxyPort", "5555");
Then your code should, ideally, "Just Work".
Here is a page documenting the proxy properties.
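Since the question rules out JVM-wide system properties, it may be worth noting that java.net also lets you pass a proxy to a single connection via URL.openConnection(Proxy). The sketch below uses the proxy host and port from the question; PerConnectionProxy and its methods are invented names. Whether URLEndpoint or the SOAP library can be made to use a connection opened this way is not something this thread establishes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public class PerConnectionProxy {

    // Describes an HTTP proxy without touching any system property.
    public static Proxy httpProxy(String host, int port) {
        // createUnresolved defers the DNS lookup of the proxy host until connect time.
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Prepares a connection to targetUrl that will go through the given proxy.
    // Nothing is sent until the connection is actually used.
    public static URLConnection openVia(Proxy proxy, String targetUrl) throws IOException {
        return new URL(targetUrl).openConnection(proxy);
    }

    public static void main(String[] args) throws IOException {
        Proxy proxy = httpProxy("theproxy.com", 5555);
        URLConnection con = openVia(proxy, "http://example.com/service");
        // Only this connection uses the proxy; other code in the JVM is unaffected.
        System.out.println("Prepared " + con.getURL() + " via " + proxy.address());
    }
}
```

The target URL stays the normal service URL; only the transport is redirected, which matches the description of how a proxy works above.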
Use Apache HttpClient and do as shown in this example.
About the URL constructor with individual proxy setting:
http://edn.embarcadero.com/article/29783
(sorry don't have privileges to comment)
Hi,
Is there any way I can find out whether cookies are disabled in the client browser? I have seen some posts saying to do this using a redirect URL, but there is no code showing how. Can anyone please help me with sample code to check this?
Please note that I want this to be done using Java only (no JavaScript, please).
Thanks!
Srinivas
You could set a cookie on the start page and try to read it on the following pages; if your cookie can't be read, the user has either disabled cookies or deleted your cookie.
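A server-side sketch of that idea, using the small HTTP server that ships with the JDK (com.sun.net.httpserver) so that it needs no servlet container. In a real web application you would do the same thing with your framework's request and response objects. The cookie name, the /start and /check paths and the class name are all assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class CookieCheck {

    // Hypothetical probe cookie; any name/value pair would do.
    static final String PROBE = "cookie_probe=1";

    // True if any Cookie request header carries the probe cookie.
    public static boolean hasProbeCookie(List<String> cookieHeaders) {
        if (cookieHeaders == null) {
            return false; // the browser sent no Cookie header at all
        }
        for (String header : cookieHeaders) {
            for (String pair : header.split(";")) {
                if (pair.trim().equals(PROBE)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Starts a local server with a start page and a check page.
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);

        // Start page: set the cookie, then redirect to the page that looks for it.
        server.createContext("/start", exchange -> {
            exchange.getResponseHeaders().add("Set-Cookie", PROBE + "; Path=/");
            exchange.getResponseHeaders().add("Location", "/check");
            exchange.sendResponseHeaders(302, -1); // -1: no response body
            exchange.close();
        });

        // Following page: the cookie comes back only if the browser accepted it.
        server.createContext("/check", exchange -> {
            boolean enabled = hasProbeCookie(exchange.getRequestHeaders().get("Cookie"));
            byte[] body = (enabled ? "cookies enabled" : "cookies disabled or deleted")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        server.start();
        return server;
    }
}
```

To try it, call CookieCheck.start(8080) (the port is arbitrary) and open http://127.0.0.1:8080/start in the browser; the redirect lands on /check, which reports what it found.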