I'm trying to access the page http://www.betbrain.com with jsoup, but it gives me error 307. Does anyone know how I can fix this?
String sURL = "http://www.betbrain.com";
Connection.Response res = Jsoup.connect(sURL).timeout(5000).ignoreHttpErrors(true).followRedirects(true).execute();
HTTP status code 307 is not an error; it is information telling you that the server is issuing a temporary redirect to another page.
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for info about HTTP Status codes.
The response returned from your request holds the value for the redirect inside the headers.
To get the header values you could do something like this:
// Print every response header key/value pair to the console
for (Map.Entry<String, String> header : res.headers().entrySet()) {
    System.out.println("Key: " + header.getKey() + " - value: " + header.getValue());
}
You will of course need to adapt this to your own code, since it operates on your response object.
Now, when you look at the headers written to the console, you will see a key Location with the value http://www.betbrain.com/?attempt=1.
This is your URL to redirect to, so you would do something like:
String newRedirectedUrl = res.header("Location");
Connection.Response newResponse = Jsoup.connect(newRedirectedUrl).execute();
// Parse the response accordingly.
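Putting it all together, here is a minimal sketch that follows the redirect chain by hand. It assumes, as in this case, that the server sends an absolute Location URL, and it caps the number of hops to avoid redirect loops:
String url = "http://www.betbrain.com";
Connection.Response res = null;
for (int hops = 0; hops < 5; hops++) {
    res = Jsoup.connect(url)
            .timeout(5000)
            .followRedirects(false)  // we handle redirects ourselves
            .ignoreHttpErrors(true)  // a 3xx status would otherwise throw
            .execute();
    String location = res.header("Location");
    if (location == null) {
        break;                       // no redirect: this is the final page
    }
    url = location;                  // assumes an absolute URL, as betbrain sends
}
Document doc = res.parse();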
I am not sure why jsoup isn't following this redirect correctly, but it seems like it could have something to do with the standard Java implementation of HTTP redirects.
I'm using Katalon Studio to send an API request. The response contains information in the HTTP headers that I want to use. I know I can use Groovy or Java to extract it, but I'm not sure how.
I've tried create_game_response.getHeaderFields(GameCode) to get the GameCode, but it doesn't work.
Here is the code I use:
def create_game_response = WS.sendRequest(findTestObject('UserRestService/Create Game'))
WS.verifyResponseStatusCode(create_game_response, 201)
def header_text = create_game_response.getHeaderFields()
println(header_text)
def game_code = create_game_response.getHeaderFields();
String game_code_list = game_code.toString()
println(game_code_list)
And this is the response:
{GameCode=[1jwoz2qy0js], Transfer-Encoding=[chunked], null=[HTTP/1.1 201 Created]}
I'm trying to extract "1jwoz2qy0js" from the game code and use it as a string, how can I do this?
getHeaderFields() returns a Map of the headers where each header is a List. Rather than converting that to a String and attempting to parse it, just get the field you want:
Map headers = create_game_response.getHeaderFields()
List gameCodes = headers["GameCode"]
And then select the first one, if that's all there is:
assert gameCodes[0] == "1jwoz2qy0js"
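Putting those two steps together (a sketch assuming create_game_response is the response object returned by WS.sendRequest, as in the question):
// Headers come back as a Map where each value is a List; take the first GameCode
List<String> gameCodes = create_game_response.getHeaderFields().get("GameCode");
String gameCode = gameCodes.get(0);
System.out.println(gameCode); // 1jwoz2qy0js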
Alternatively, you can parse the string representation directly with Groovy:
str = '{GameCode=[1jwoz2qy0js], Transfer-Encoding=[chunked], null=[HTTP/1.1 201 Created]}'
left_idx = str.indexOf('[') + 1
right_idx = str.indexOf(']')
print str.substring(left_idx,right_idx)
Output:
1jwoz2qy0js
I am trying to read data from Reddit using Java. I am using JRAW.
Here is my code:
public class Main {
    public static void main(String[] args) {
        System.out.println('a');
        String username = "dummyName";
        UserAgent userAgent = new UserAgent("crawl", "com.example.crawl", "v0.1", username);
        Credentials credentials = Credentials.script(username, <password>, <clientID>, <client-secret>);
        NetworkAdapter adapter = new OkHttpNetworkAdapter(userAgent);
        RedditClient reddit = OAuthHelper.automatic(adapter, credentials);

        Account me = reddit.me().about();
        System.out.println(me.getName());

        SubmissionReference submission = reddit.submission("https://www.reddit.com/r/diabetes/comments/9rlkdm/shady_insurance_work_around_to_pay_for_my_dexcom/");
        RootCommentNode rcn = submission.comments();
        System.out.println(rcn.getDepth());
        System.out.println();

        // Submission submission1 = submission.inspect();
        // System.out.println(submission1.getSelfText());
        // System.out.println(submission1.getUrl());
        // System.out.println(submission1.getTitle());
        // System.out.println(submission1.getAuthor());
        // System.out.println(submission1.getCreated());
        System.out.println("-----------------------------------------------------------------");
    }
}
I am making two requests as of now: the first is reddit.me().about() and the second is reddit.submission("https://www.reddit.com/r/diabetes/comments/9rlkdm/shady_insurance_work_around_to_pay_for_my_dexcom/").
The output is:
a
[1 ->] GET https://oauth.reddit.com/api/v1/me?raw_json=1
[<- 1] 200 application/json: '{"is_employee": false, "seen_layout_switch": true, "has_visited_new_profile": false, "pref_no_profanity": true, "has_external_account": false, "pref_geopopular": "GL(...)
dummyName
[2 ->] GET https://oauth.reddit.com/comments/https%3A%2F%2Fwww.reddit.com%2Fr%2Fdiabetes%2Fcomments%2F9rlkdm%2Fshady_insurance_work_around_to_pay_for_my_dexcom%2F?sort=confidence&sr_detail=false&(...)
[<- 2] 400 application/json: '{"message": "Bad Request", "error": 400}'
Exception in thread "main" net.dean.jraw.ApiException: API returned error: 400 (Bad Request), relevant parameters: []
at net.dean.jraw.models.internal.ObjectBasedApiExceptionStub.create(ObjectBasedApiExceptionStub.java:57)
at net.dean.jraw.models.internal.ObjectBasedApiExceptionStub.create(ObjectBasedApiExceptionStub.java:33)
at net.dean.jraw.RedditClient.request(RedditClient.kt:186)
at net.dean.jraw.RedditClient.request(RedditClient.kt:219)
at net.dean.jraw.RedditClient.request(RedditClient.kt:255)
at net.dean.jraw.references.SubmissionReference.comments(SubmissionReference.kt:50)
at net.dean.jraw.references.SubmissionReference.comments(SubmissionReference.kt:28)
at Main.main(Main.java:36)
Caused by: net.dean.jraw.http.NetworkException: HTTP request created unsuccessful response: GET https://oauth.reddit.com/comments/https%3A%2F%2Fwww.reddit.com%2Fr%2Fdiabetes%2Fcomments%2F9rlkdm%2Fshady_insurance_work_around_to_pay_for_my_dexcom%2F?sort=confidence&sr_detail=false&raw_json=1 -> 400
... 6 more
As can be seen, my first request returns my username, but for the second request I am getting a 400 Bad Request error.
To check whether my client ID and client secret were working correctly, I made the same request using the Python PRAW library.
import praw
from praw.models import MoreComments
reddit = praw.Reddit(client_id=<same-as-in-java>, client_secret=<same-as-in-java>,
password=<same-as-in-java>, user_agent='crawl',
username="dummyName")
submission = reddit.submission(
url='https://www.reddit.com/r/redditdev/comments/1x70wl/how_to_get_all_replies_to_a_comment/')
print(submission.selftext)
print(submission.url)
print(submission.title)
print(submission.author)
print(submission.created_utc)
print('-----------------------------------------------------------------')
This gives the desired result without any errors, so the credentials must be correct.
My only doubt is about the user-agent creation in Java: UserAgent userAgent = new UserAgent("crawl", "com.example.crawl", "v0.1", username);.
I followed the following link.
What exactly do the target platform, the unique ID, and the version mean? I tried to keep the same format as in the link, and I am using the same username as everywhere else. The user_agent in Python, on the other hand, was just the string crawl.
Please tell me if I am missing anything and what could be the issue.
Thank you
P.S. I want to do this in Java, not Python.
Since your first query works, the credentials are correct. In JRAW, don't pass the whole URL to the submission function, only the id.
Change this
SubmissionReference submission = reddit.submission("https://www.reddit.com/r/diabetes/comments/9rlkdm/shady_insurance_work_around_to_pay_for_my_dexcom/");
to this
SubmissionReference submission = reddit.submission("9rlkdm");
where the id is the random string after /comments/ in the URL.
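If you would rather keep passing full URLs around, a small helper like this could pull the id out first (extractSubmissionId is a hypothetical name, not part of JRAW, and the parsing assumes reddit's usual /comments/<id>/ URL layout):
// Hypothetical helper: ".../comments/9rlkdm/shady_insurance..." -> "9rlkdm"
static String extractSubmissionId(String url) {
    int marker = url.indexOf("/comments/");
    if (marker == -1) {
        throw new IllegalArgumentException("No /comments/ segment in: " + url);
    }
    int start = marker + "/comments/".length();
    int end = url.indexOf('/', start);
    return end == -1 ? url.substring(start) : url.substring(start, end);
}
You could then call reddit.submission(extractSubmissionId(fullUrl)) as before.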
Hope this helps.
According to Facebook Docs
If your app is making enough calls to be considered for rate limiting by our system, we return an X-App-Usage HTTP header. [...] When any of these metrics exceed 100 the app will be rate limited.
I am using Facebook4J to connect my application to the Facebook API, but I could not find any documentation on how to get the X-App-Usage HTTP header after a Facebook call, in order to avoid being rate limited. I want to use this header to decide dynamically whether to increase or decrease the time between API calls.
So, my question is: using Facebook4J, is possible to check if Facebook returned the X-App-Usage HTTP header and get it? How?
There is a getResponseHeader method on the response of a BatchRequest in Facebook4J; see the Facebook4J code examples.
You could try getResponseHeader("X-App-Usage"):
// Executing "me" and "me/friends?limit=50" endpoints
BatchRequests<BatchRequest> batch = new BatchRequests<BatchRequest>();
batch.add(new BatchRequest(RequestMethod.GET, "me"));
batch.add(new BatchRequest(RequestMethod.GET, "me/friends?limit=50"));
List<BatchResponse> results = facebook.executeBatch(batch);
BatchResponse result1 = results.get(0);
BatchResponse result2 = results.get(1);
// You can get http status code or headers
int statusCode1 = result1.getStatusCode();
String contentType = result1.getResponseHeader("Content-Type");
// You can get body content via as****() method
String jsonString = result1.asString();
JSONObject jsonObject = result1.asJSONObject();
ResponseList<JSONObject> responseList = result2.asResponseList();
// You can map json to java object using DataObjectFactory#create****()
User user = DataObjectFactory.createUser(jsonString);
Friend friend1 = DataObjectFactory.createFriend(responseList.get(0).toString());
Friend friend2 = DataObjectFactory.createFriend(responseList.get(1).toString());
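Building on that example, here is a minimal sketch of reading and reacting to the header. The call_count field name comes from Facebook's documentation of X-App-Usage, and the 90-percent threshold is an arbitrary assumption:
// Read X-App-Usage from a batch response and back off when close to the limit
String appUsage = result1.getResponseHeader("X-App-Usage");
if (appUsage != null) {
    try {
        JSONObject usage = new JSONObject(appUsage);
        int callCount = usage.getInt("call_count"); // percentage of the call budget used
        if (callCount > 90) {                       // arbitrary threshold, tune as needed
            // increase the delay between subsequent API calls here
        }
    } catch (JSONException e) {
        // header present but unparseable; safest to slow down anyway
    }
}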
I want to check the last-modified time of a PDF file on a particular page.
The pdf link is http://www.nfib.com/Portals/0/PDF/sbet/sbet201402.pdf
I am trying to do this :
Connection.Response rs2 = Jsoup.connect("http://www.nfib.com/Portals/0/PDF/sbet/sbet201402.pdf").execute();
System.out.println("Header = " + rs2.header("Last-Modified"));
I get this error
UnsupportedMimeTypeException
If it doesn't have to be done with Jsoup, you can just use the standard URL and URLConnection classes:
URL url = new URL("http://www.nfib.com/Portals/0/PDF/sbet/sbet201402.pdf");
URLConnection connection = url.openConnection();
System.out.println("Header = " + connection.getHeaderField("Last-Modified"));
You need to remember that Jsoup was designed to parse HTML/XML, so by default it requires a content type of text/*, application/xml, or application/xhtml+xml, not application/pdf.
If you take a look at the code that handles this, it looks like:
if (contentType != null && !req.ignoreContentType() && (!(contentType.startsWith("text/") || contentType.startsWith("application/xml") || contentType.startsWith("application/xhtml+xml"))))
throw new UnsupportedMimeTypeException("Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml",
contentType, req.url().toString());
But the !req.ignoreContentType() test gives us a hint that we can turn off this requirement for purely XML/HTML-type input. To do so, you can just add
ignoreContentType(true)
to your connection settings, like
Connection.Response rs2 = Jsoup.connect("http://www.nfib.com/Portals/0/PDF/sbet/sbet201402.pdf")
.ignoreContentType(true)
.execute();
and you should be able to read returned headers
System.out.println("Header = " + rs2.header("Last-Modified"));
output:
Header = Mon, 10 Feb 2014 22:54:15 GMT
I have this problem where I need to queue a page link with TaskQueue:
Queue queue = QueueFactory.getDefaultQueue();
for (String href : hrefs) {
    href = baseUrl + href;
    pageLinks = pageLinks + "\n" + href;
    queue.add(TaskOptions.Builder
            .withUrl("/crawler")
            .param("url", href));
    l("Added to queue url=[" + href + "]");
}
The problem here, I think, is that the URL that gets passed into the queue contains ?'s in place of the Arabic characters, since the task keeps rescheduling.
The String pageLinks, however, is output in the browser through Spring MVC, and I can see the Arabic characters displayed properly, so I'm pretty sure the links are OK.
If I copy one of the links output in the browser and paste it into the browser's address bar, it works fine. So I'm fairly sure the queue keeps rescheduling because it receives a mangled URL.
What could I be missing here? Do I need to convert the String href before passing it into the queue?
The crawl service looks like this:
@RequestMapping(method = RequestMethod.GET, value = "/crawl",
        produces = "application/json; charset=iso-8859-6")
public @ResponseBody String crawl(HttpServletRequest req, HttpServletResponse res,
        @RequestParam(value = "url", required = false) String url) {
    l("Process url:" + url);
    // ... crawling logic elided in the question ...
    return null;
}
Also, do I need to convert the url request parameter here back to Arabic or not?
You must URL-encode the parameters. See this question: Java URL encoding of query string parameters
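A minimal sketch of the producer side (UTF-8 as the charset is an assumption, but it is the usual choice for URLs):
try {
    // Percent-encode the href so the Arabic characters survive the task payload
    String encodedHref = URLEncoder.encode(href, "UTF-8");
    queue.add(TaskOptions.Builder
            .withUrl("/crawler")
            .param("url", encodedHref));
} catch (UnsupportedEncodingException e) {
    throw new RuntimeException(e); // "UTF-8" is always supported
}
In the handler you would then decode it back with URLDecoder.decode(url, "UTF-8") before fetching the page.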