Jsoup.connect returns a 403 error for some requests - java

I am trying to fetch the HTML source of this page with Jsoup.connect: https://bitskins.com/?market_hash_name=SSG+08+%7C+DARK+WATER+%28Field-Tested%29&is_stattrak=0&has_stickers=0&sort_by=bumped_at&order=desc
but I get this error: Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://bitskins.com/?market_hash_name=SSG+08+%7C+DARK+WATER+%28Field-Tested%29&is_stattrak=0&has_stickers=0&sort_by=bumped_at&order=desc
My code is:
Document doc = Jsoup.connect("https://bitskins.com/?market_hash_name=SSG+08+%7C+DARK+WATER+%28Field-Tested%29&is_stattrak=0&has_stickers=0&sort_by=bumped_at&order=desc")
.data(":authority", "bitskins.com")
.data(":method", "GET")
.data(":path", "/?market_hash_name=SSG+08+%7C+DARK+WATER+%28Field-Tested%29&is_stattrak=0&has_stickers=0&sort_by=bumped_at&order=desc")
.data(":scheme", "https")
.data("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
.data("accept-encoding", "gzip, deflate, sdch, br")
.data("accept-language:", "ru,en-US;q=0.8,en;q=0.6")
.data("cache-control", "max-age=0")
.data("cookie", "__cfduid=d76231c8cccdbd5303a7d4feeb3f3a11f1466541718; _gat=1; _ga=GA1.2.1292204706.1466541721; request_method=POST; _session_id=5dc49c7814d5087ac51f9d9da20b2680")
.data("dnt", "1")
.data("upgrade-insecure-requests", "1")
.data("user-agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
.post();
What is the problem?

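The problem is that .data() adds form data (request parameters), not headers, so none of the values above reach the server as headers. Use the method that matches each kind of information:
To set a header:
.header("key", "value")
To set form data:
.data("key", "value")
To set the user agent:
.userAgent("Mozilla...")
To set a cookie:
.cookie("name", "value")
The names starting with a colon (:authority, :method, :path, :scheme) are HTTP/2 pseudo-headers that the browser derives from the URL and the request method; they should not be set by hand at all. Also note that the captured request is a GET, while the code calls .post().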
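To see the difference on the wire, here is a minimal sketch that uses only JDK classes (java.net.http.HttpClient and the built-in com.sun.net.httpserver.HttpServer) instead of Jsoup, so it runs without extra dependencies. The class name, the local echo server, and the sample values are illustrative assumptions. The request header plays the role of what Jsoup's .header() or .userAgent() would set, and the POST body plays the role of what .data() would send.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class HeaderVsFormData {

    /** Sends one POST to a throwaway local server and returns what the server saw. */
    static String roundTrip() throws Exception {
        // Illustrative echo server: reports the User-Agent header and the raw request body.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            byte[] reply = ("header User-Agent=" + ua + "\nbody=" + body).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            exchange.getResponseBody().write(reply);
            exchange.close();
        });
        server.start();
        try {
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/"))
                    // A real request header: travels in the header section.
                    .header("User-Agent", "Mozilla/5.0 (demo)")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    // Form data: travels in the body, where the server does not look for headers.
                    .POST(HttpRequest.BodyPublishers.ofString("user-agent=sent-as-form-field"))
                    .build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

The server reports the value passed as a header under User-Agent, while the value sent as a form field only shows up in the body. That is why a site checking the User-Agent or Cookie header never sees values passed through .data().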

Related

Java Htmlunit Cloudflare protection stuck on redirecting to the final page

I am trying to access a website using Java HtmlUnit (version 2.20; I can't update it because of company policies). It is a government website, https://www2.aneel.gov.br:443/aplicacoes_liferay/tarifa/, served through Cloudflare.
When I access it with a normal browser, everything works. When I access it via HtmlUnit, Cloudflare starts its "Checking if the site connection is secure" process.
I can get to the next step, "Connection is secure - Proceeding...",
but it just gets stuck there.
How can I get past this check correctly?
P.S.: There is no way to have my IP whitelisted for this. I must go through this verification and redirection on my own.
Thanks in advance!
Some code sample:
BrowserVersion chrome = new BrowserVersion(
"Netscape",
"5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
108);
chrome.setApplicationCodeName("Mozilla");
chrome.setVendor("Google Inc.");
chrome.setHtmlAcceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
chrome.setImgAcceptHeader("image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8");
chrome.setCssAcceptHeader("text/css,*/*;q=0.1");
chrome.setScriptAcceptHeader("*/*");
chrome.setBrowserLanguage("en-US,en;q=0.9,pt;q=0.8,mt;q=0.7");
chrome.setPlatform("Windows");
chrome.setUserLanguage("pt-BR");
chrome.setSystemLanguage("pt-BR");
try (WebClient webClient = new WebClient(chrome)) {
String url = "https://www2.aneel.gov.br:443/aplicacoes_liferay/tarifa/";
// WebClient settings
webClient.getOptions().setCssEnabled(true);
webClient.setJavaScriptTimeout(0);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setUseInsecureSSL(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setTimeout(0);
CookieManager cookies = new CookieManager();
cookies.setCookiesEnabled(true);
webClient.setCookieManager(cookies);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.waitForBackgroundJavaScript(10000);
webClient.waitForBackgroundJavaScriptStartingBefore(10000);
webClient.getCache().setMaxSize(0);
HtmlPage page = webClient.getPage(url);
webClient.getRefreshHandler().handleRefresh(page, new URL(url), 10);
synchronized(page) {
page.wait(10000);
}
URL _url = new URL(url);
for(Cookie c : webClient.getCookies(_url)) {
if (c.getName().contains("cf_chl_2")) {
Cookie cook = new Cookie(c.getDomain(), c.getName(), c.getValue(), c.getPath(), -1, false);
webClient.getCookieManager().removeCookie(c);
webClient.getCookieManager().addCookie(cook);
}
}
if (page.asXml().contains("Checking if the site connection is secure")) {
log.info(page.asXml());
page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
}
(I took some parts of this code from this question/answer.)
Here are some parts of the log I have so far.
As you can see, "Checking if the site connection is secure" and "Proceeding..." appear, but that is all I get (Error 1020, firewall related).
P.S.: The manually added cookie replaces a cookie that was causing a negative max-age error:
[com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl] (JS executor for com.gargoylesoftware.htmlunit.WebClient#724c8784) set-cookie http-equiv meta tag: invalid cookie 'cf_chl_2=; Max-Age=-99999999;'; reason: 'Negative 'max-age' attribute: -99999999'.
So please, in the name of the Gods of Programming Languages, how can I proceed to the final web page?
Thanks!
There was a fix to the cookie processing in HtmlUnit. The latest 2.68.0-SNAPSHOT and all later releases should include it.
See https://github.com/HtmlUnit/htmlunit/issues/524 for more.
This is also related to the question "HtmlUnit returning empty list of DomElements".

Error when fetching URL: One of the sub-directories is removed on get request

When I send a GET request to the URL https://student.utm.utoronto.ca/timetable/api/courses/20199/, I get this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://student.utm.utoronto.ca/api/courses/20199
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:760)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:757)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:705)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:295)
at com.github.Bb0lt.utmtimetablebuilder.model.CourseFinder.main(CourseFinder.java:30)
I noticed that the URL in the error is not the same as the one in the request: "/timetable" has been removed from the path.
This probably has to do with the website blocking web scraping, since it publishes a robots.txt file at https://student.utm.utoronto.ca/robots.txt. I tried setting a user agent, but I still get this error.
Here is the code I've tried:
Map<String, String> headers = new HashMap<>();
headers.put("Host", "student.utm.utoronto.ca");
headers.put("Sec-Fetch-Mode", "navigate");
headers.put("Sec-Fetch-Site", "none");
headers.put("Upgrade-Insecure-Requests", "1");
Connection.Response loginForm = Jsoup.connect("https://student.utm.utoronto.ca/timetable/api/courses/20199/")
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36")
.headers(headers)
.method(Connection.Method.GET)
.execute();
What can I do to get access to the page? Do I need to use a proxy?

How do I recreate an HTTP request header?

I'm currently writing a Java application that remotely controls my Roku. I found this website and used it to control my Roku. Using Chrome's developer tools, I watched its traffic and found the HTTP request that controlled the Roku. The request headers were:
POST /keydown/Play HTTP/1.1
Host: 192.xxx.x.82:8060
Connection: keep-alive
Content-Length: 0
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Origin: http://remoku.tv
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: http://remoku.tv/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
I then tried to recreate this POST request within Java and it ended up looking like this:
HttpURLConnection urlConn;
URL url = new URL("html://192.xxx.x.82:8060/keydown/Play");
urlConn = (HttpURLConnection) url.openConnection();
urlConn.setRequestProperty("Connection", "keep-alive");
urlConn.setRequestProperty("Content-Length", "0");
urlConn.setRequestProperty("Cache-Control", "max-age=0");
urlConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
urlConn.setRequestProperty("Origin", "http://192.xxx.x.254");
urlConn.setRequestProperty("Upgrade-Insecure-Requests", "1");
urlConn.setRequestProperty("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
urlConn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
urlConn.setRequestProperty("Referer", "http://192.xxx.x.254");
urlConn.setRequestProperty("Accept-Encoding", "gzip, deflate");
urlConn.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
I'm not 100% sure this is the correct way to recreate the request, because it does not have the same effect as the original (working) one. However, this may be because I changed a few minor details that may actually have been important. So my question is: is this the correct way to recreate a request, and if so, why is it not working? If not, what is? Any help is appreciated.
Thanks to tgkprog's comment, I edited my code to this:
HttpURLConnection urlConn;
URL url = new URL("http://192.xxx.x.82:8060/keypress/Right");
urlConn = (HttpURLConnection) url.openConnection();
urlConn.setRequestMethod("POST");
urlConn.setDoOutput(true);
try(DataOutputStream wr = new DataOutputStream(urlConn.getOutputStream())) {
wr.writeChars("");
}
System.out.println(urlConn.getResponseCode());
Now it works perfectly and I can control my Roku. The problem was that I was not using the correct keys in the header, as they were not capitalized in Chrome (edit: they are not needed).
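The working version boils down to a POST with an empty body and no hand-written headers. Below is a minimal sketch of that idea, assuming a stand-in local server in place of the Roku so it can be run anywhere. The class name, the sendKey helper, and the stand-in server are hypothetical; only the /keypress/ path comes from the post.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicReference;

class EmptyPostSketch {

    /** Sends a POST with an empty body to baseUrl + "/keypress/" + key and returns the status code. */
    static int sendKey(String baseUrl, String key) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(baseUrl + "/keypress/" + key).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Opening and closing the stream without writing anything sends an empty body.
        conn.getOutputStream().close();
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a stand-in server on a free local port, sends one key, and reports what arrived. */
    static String demo() throws IOException {
        AtomicReference<String> seen = new AtomicReference<>();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            seen.set(exchange.getRequestMethod() + " " + exchange.getRequestURI().getPath());
            exchange.sendResponseHeaders(200, -1); // 200 with no response body
            exchange.close();
        });
        server.start();
        try {
            int status = sendKey("http://127.0.0.1:" + server.getAddress().getPort(), "Right");
            return status + " " + seen.get();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

Against a real device, sendKey would be called with the device's own base URL (for example the http://192.xxx.x.82:8060 address from the post) instead of the local stand-in.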

Cannot get 'location' header in response using HttpClient

The Location header is there; I can see it in the browser.
I'm using org.apache.httpcomponents.httpclient to send an HTTP request with a cookie:
```
URI uri = new URIBuilder().setScheme("https").setHost("api.weibo.com").setPath("/oauth2/authorize").setParameter("client_id","3099336849").setParameter("redirect_uri","http://45.78.24.83/authorize").setParameter("response_type", "code").build();
HttpGet req1 = new HttpGet(uri);
RequestConfig config = RequestConfig.custom().setRedirectsEnabled(false).build();
req1.setConfig(config);
req1.setHeader("Connection", "keep-alive");
req1.setHeader("Cookie", cookie);
req1.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36");
response = httpclient.execute(req1);
```
I searched a lot and tried enabling and disabling automatic redirects, but it doesn't seem to work for me. Can somebody tell me how to get the Location header in the response, just as the browser does?
You cannot see the Location header because HttpClient follows the redirect immediately, before it hands you the response.
Try disabling redirect handling when you set up your HttpClient:
HttpClient instance = HttpClientBuilder.create().disableRedirectHandling().build();
Check this URL and you'll see the Location Header:
URI uri = new URIBuilder().setScheme("https").setHost("www.googlemail.com").setPath("/").build();
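The same idea can be tried without a network dependency. The sketch below swaps Apache HttpClient for the JDK's built-in java.net.http.HttpClient (a different library, used here only so the example is self-contained) and points it at a stand-in local server that always answers with a 302. The class name, the server, and the /start and /target paths are illustrative assumptions.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class LocationHeaderSketch {

    /** Returns "status location" for a request whose redirect is deliberately not followed. */
    static String fetchLocation() throws Exception {
        // Stand-in server: every request is answered with a 302 and a Location header.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders().add("Location", "/target");
            exchange.sendResponseHeaders(302, -1);
            exchange.close();
        });
        server.start();
        try {
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .followRedirects(HttpClient.Redirect.NEVER) // hand the 3xx back to the caller
                    .build();
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/start"))
                    .GET()
                    .build();
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            // Header lookup is case-insensitive, so "location" would work as well.
            String location = response.headers().firstValue("Location").orElse("(none)");
            return response.statusCode() + " " + location;
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchLocation());
    }
}
```

If the response you get back is a 200 rather than a 3xx, the server never issued a redirect for that request, and no client setting will produce a Location header.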
I found my real problem: my code never got through the auth process, so I kept getting the OAuth2 page. After I set all the headers in my request just as the browser did, I finally got the right response.

Jsoup authentication failed

I'm trying to connect to this website: https://ent.enteduc.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=1 with the following code:
Connection.Response response = Jsoup.connect("https://ent.enteduc.fr/CookieAuth.dll?GetLogon?curl=Z2F&flags=0&forcedownlevel=0&formdir=1&username=XXX&password=XXX&trusted=4&SubmitCreds.x=36&SubmitCreds.y=7&SubmitCreds=Ouvrir+une+session")
.method(Connection.Method.GET)
.execute();
Document Doc = Jsoup.connect("https://ent.enteduc.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=1")
.data("username","myusername")
.data("password","mypassword")
.data("curl","Z2F")
.data("flags","0")
.data("forcedownlevel","0")
.data("formdir","1")
.data("trusted","4")
.data("SubmitCreds.x","40") //Seems to send the coordinates of the cursor
.data("SubmitCreds.y","12") //Seems to send the coordinates of the cursor
.data("SubmitCreds","Ouvrir une session")
.cookies(response.cookies())
.post();
Log.e("Body", Doc.body().toString());
But the displayed "Body" is still the authentication page (no error in Logcat).
What's wrong?
Here are the details of the connection, taken from Chrome's console:
Remote Address:85.90.60.205:443
Request URL:https://ent.enteduc.fr/CookieAuth.dll?Logon
Request Method:POST
Status Code:302 Moved Temporarily
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:165
Content-Type:application/x-www-form-urlencoded
Cookie:ISAWPLB{FE9B5C07-18E7-4D86-BC7C-2F0AFE4F36BF}={8A3F320B-C8EB-40F9-A11E-D036A91F953F}; __utma=136247269.742318163.1408441429.1408445338.1408450626.3; __utmb=136247269.5.10.1408450626; __utmc=136247269; __utmz=136247269.1408441429.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); WSS_KeepSessionAuthenticated=; logondata=acc=0&lgn=*********
Host:ent.enteduc.fr
Origin:https://ent.enteduc.fr
Referer:https://ent.enteduc.fr/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=1
User-Agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36
Query String Parameters
Logon:
Form Data
curl:Z2F
flags:0
forcedownlevel:0
formdir:1
username:myusername
password:mypass
trusted:4
SubmitCreds.x:53
SubmitCreds.y:12
SubmitCreds:Ouvrir une session
Response Headers
Connection:close
Content-Length:0
Location:https://ent.enteduc.fr/
Set-Cookie:cadata6A45CD714D774496A399F96AC521E21E....
There is nothing wrong with your code; it works. I tried it with the user name and password you supplied. This is what the site does:
You log in successfully, and the site sends an HTTP 302 redirect to the path / together with a cookie that identifies you.
HTTP/1.1 302 Moved Temporarily
Location: https://ent.enteduc.fr/
Set-Cookie: XXX
The browser then requests / and the server responds with another HTTP 302:
HTTP/1.1 302 Found
Connection: Keep-Alive
Location: /etabs/0680001F/Pages/Accueil.aspx
Requesting /etabs/0680001F/Pages/Accueil.aspx results in a 200 OK with HTML content written in French. Sorry, I don't speak French.
Change your code to follow the redirects and send the cookies at each step, and you should be fine.
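Following the redirects by hand while resending the cookies at each step can be sketched as follows. This uses the JDK's java.net.http.HttpClient rather than Jsoup so that it runs without extra dependencies, and it talks to a stand-in local server that mimics the login, /, and final page chain described above. The class name, the loginAndFollow helper, the /login and /home paths, the session cookie, and the demo credentials are all hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class RedirectWithCookiesSketch {

    /** Posts the login form, then follows each redirect by hand, resending the cookies collected so far. */
    static String loginAndFollow(URI loginUri, String form) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .followRedirects(HttpClient.Redirect.NEVER) // every 3xx is handled below
                .build();
        Map<String, String> cookies = new LinkedHashMap<>();
        HttpRequest request = HttpRequest.newBuilder(loginUri)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        for (int hop = 0; hop < 10; hop++) { // cap the chain to avoid redirect loops
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Keep the name=value part of every Set-Cookie header (attributes after ';' are ignored).
            for (String setCookie : response.headers().allValues("Set-Cookie")) {
                String pair = setCookie.split(";", 2)[0];
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
                }
            }
            int status = response.statusCode();
            String location = response.headers().firstValue("Location").orElse(null);
            if (status < 300 || status >= 400 || location == null) {
                return status + " " + response.body(); // final page reached
            }
            // Resolve a relative Location against the current URL and continue with a GET.
            HttpRequest.Builder next = HttpRequest.newBuilder(request.uri().resolve(location)).GET();
            if (!cookies.isEmpty()) {
                next.header("Cookie", cookies.entrySet().stream()
                        .map(e -> e.getKey() + "=" + e.getValue())
                        .collect(Collectors.joining("; ")));
            }
            request = next.build();
        }
        throw new IllegalStateException("too many redirects");
    }

    /** Runs the flow against a stand-in server that mimics the chain /login -> / -> /home. */
    static String demo() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.getRequestBody().readAllBytes(); // drain the request body
            String path = exchange.getRequestURI().getPath();
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            boolean loggedIn = cookie != null && cookie.contains("session=abc");
            String redirect = null;
            if (path.equals("/login")) {
                exchange.getResponseHeaders().add("Set-Cookie", "session=abc; Path=/");
                redirect = "/";
            } else if (loggedIn && path.equals("/")) {
                redirect = "/home";
            }
            if (redirect != null) {
                exchange.getResponseHeaders().add("Location", redirect);
                exchange.sendResponseHeaders(302, -1);
            } else {
                byte[] body = (loggedIn ? "welcome" : "login page").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
        try {
            URI login = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/login");
            return loginAndFollow(login, "username=demo&password=demo");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

In this stand-in setup, dropping the Cookie header on the second hop makes the server return the login page again, which matches the symptom described in the question.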
[EDIT]
When you're done please remove the authentication info you supplied on this post.
