Jsoup, Reddit, OAuth2, and 429 HTTP Errors - java

So I'm trying to write an executable JAR for a small subreddit I run.
I have a post that Jsoup connects to and reads all the URLs on that page. In another method, it then connects to all those URLs (that are just comments on the post) and gets the HTML from the comments and saves them to a HashMap.
This works, but I am getting an HTTP 429 error. To resolve this, I added a short 5 second wait between requests. Now I'm getting a SocketTimeoutException ("Read timed out") instead. Once I lowered the wait to 3 seconds, I was bouncing between the two errors.
Now I run a few Reddit bots with Python and I'm able to make a lot more requests than what I'm doing here. I actually have a single bot that makes thousands of requests every minute. So I know it's possible to make these requests.
My question essentially is, how am I able to make multiple requests to Reddit and avoid the 429 HTTP Error? I'm using Jsoup to connect and read the HTML.
While I'm sure connecting to Reddit via their OAuth2 API will fix the issues, I have no idea how to actually use OAuth2 in Java (I use a wrapper in Python, so it's fair to say I don't know it at all) and I don't know how to then use that with Jsoup.

"My question essentially is, how am I able to make multiple requests to Reddit and avoid the 429 HTTP Error?"
You answer this yourself:
"While I'm sure connecting to Reddit via their OAuth2 API will fix the issues,"
As specified in the API documentation, you get twice as many requests per second if authenticated using OAuth.
Have you looked around for examples on how to handle OAuth flows in Java?
You might also find it easier to use one of the wrapper libraries for Java, instead of handling all this yourself.
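For a rough idea of what the OAuth route involves without a wrapper, here is a minimal sketch using only the JDK's java.net.http classes (Java 11+). It assumes an app registered with Reddit that has a client ID and secret, the app-only grant type client_credentials, and the endpoints shown in the constants; all of these should be checked against Reddit's current OAuth2 documentation. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class RedditOAuthSketch {
    // Assumed endpoints; verify against Reddit's OAuth2 documentation.
    static final String TOKEN_URL = "https://www.reddit.com/api/v1/access_token";
    static final String API_BASE = "https://oauth.reddit.com";

    /** Builds the POST that exchanges the app's credentials for a bearer token. */
    static HttpRequest tokenRequest(String clientId, String clientSecret, String userAgent) {
        // HTTP Basic auth: base64("clientId:clientSecret")
        String basic = Base64.getEncoder().encodeToString(
                (clientId + ":" + clientSecret).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(TOKEN_URL))
                .header("Authorization", "Basic " + basic)
                .header("User-Agent", userAgent)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("grant_type=client_credentials"))
                .build();
    }

    /** Builds an authenticated GET for an API path, using a token obtained earlier. */
    static HttpRequest apiRequest(String path, String accessToken, String userAgent) {
        return HttpRequest.newBuilder(URI.create(API_BASE + path))
                .header("Authorization", "bearer " + accessToken)
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }
}
```

Sending these with HttpClient.send(...) and reading the access_token field out of the JSON reply is left out here. If you prefer to stay with Jsoup for the fetches, its Connection.header(name, value) method should be able to carry the same Authorization header.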

Just set a descriptive User-Agent header on your requests and they should get through, for example:
User-Agent: super happy flair bot by /u/spladug
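As a sketch of how that could look in Java, the snippet below sets the header on every request and, if a 429 still comes back, waits with an increasing delay before retrying. It uses the JDK's java.net.http client rather than Jsoup so that it has no dependencies; the class name, the 30 second timeout and the backoff numbers are arbitrary choices for illustration, not anything Reddit prescribes.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class PoliteFetcher {
    // Example value from the answer above; replace it with one describing your own bot.
    static final String USER_AGENT = "super happy flair bot by /u/spladug";

    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(30)) // generous, to avoid "Read timed out"
                .GET()
                .build();
    }

    /** Delay before retry number `attempt` (0-based): 2s, 4s, 8s, ... capped at 60s. */
    static long backoffMillis(int attempt) {
        long delay = 2000L << Math.min(attempt, 5);
        return Math.min(delay, 60_000L);
    }

    /** Returns the body of the first response that is not a 429 (other statuses are not checked). */
    static String fetch(HttpClient client, String url, int maxRetries)
            throws IOException, InterruptedException {
        for (int attempt = 0; ; attempt++) {
            HttpResponse<String> response =
                    client.send(request(url), HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                return response.body();
            }
            if (attempt >= maxRetries) {
                throw new IOException("Still rate limited after " + maxRetries + " retries: " + url);
            }
            Thread.sleep(backoffMillis(attempt));
        }
    }
}
```

With Jsoup itself, the equivalent settings would be Jsoup.connect(url).userAgent(...).timeout(...) before calling get(), and the returned HTML string from the sketch above can be handed to Jsoup.parse(...).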

Related

HTTP GET Request - what data is actually sent?

I'm currently building a web spider with Java and Apache Commons. I'm crawling basic Google search queries like https://google.com/search?q=word&hl=en
Somehow, after about 60 queries I get blocked. It seems they recognize me as a bot, and I get a 503 Service Unavailable response.
Now the important part:
If I visit the same site with Firefox/Chrome I get the desired result.
If I make a GET request with my application using the same HTTP headers (user-agent, cookies, cache etc.) I am still blocked.
How does Google know whether I'm connecting via my application or the Chrome browser, when the only information is the IP and the HTTP headers? (Maybe I'm wrong?)
Are there more parameters that identify my app? Something that Google sees and I don't?
(Maybe important: I'm using Chrome Developer Tools and httpbin.org to compare the headers of the browser and the application.)
Thanks a lot
Since you have not specified how quickly you send the 60 queries, I am assuming it is at a high rate. This is why Google is blocking you. Several times I have rapidly done Google searches from Chrome, and it asks for a captcha after a while and then blocks soon after.
Please see the API on Custom Search, and this post about the terms of service: Replacement for Google API
FAQ on blocked searches: Google FAQ
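If the block really is rate-related, one thing a client can do is enforce a minimum gap between its own requests. The following is only a sketch of that idea with a hypothetical class name and an interval you would have to choose yourself; it does not make automated querying acceptable under the terms of service, for which the Custom Search API mentioned above is the supported route.

```java
final class MinIntervalThrottle {
    private final long intervalMillis;
    private long nextAllowedAt = 0L; // earliest time the next request may start

    MinIntervalThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Reserves the next request slot and returns how many milliseconds the
     * caller should wait before sending, given the current time.
     */
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedAt);
        nextAllowedAt = start + intervalMillis;
        return start - nowMillis;
    }

    /** Blocks until the reserved slot is reached. */
    void acquire() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A crawler would create one instance, for example new MinIntervalThrottle(5000), and call acquire() immediately before each request.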

Managing Cookies from external backend-service

Let's say I've created a mobile application named 'Foo'(iOS). This app talks to a Java-running backend at 'java.com' and works perfectly. Now, I'm trying to create the website 'Foo.com' to let users enjoy the 'same' service on a browser/computer. So far, I've found that almost all calls needed to the API from the website can be done in JavaScript directly to the backend at 'java.com', including a login-function.
On the backend, I've implemented the standard 'doPost'-method to handle the login, and I create a Cookie to attach to the request.
The problem, I think, is that the users get the JavaScript from 'Foo.com', and the JavaScript tries to log in with an AJAX call to 'java.com'. The cookie will therefore be 'stamped' by 'www.java.com', not by 'www.foo.com', and the user will never receive the cookie. (At least, I don't receive a cookie now.)
I've been trying to find a way to accept cookies from 'api.com' into the application, but it doesn't look good. Honestly, I'm not even sure this is the actual problem causing me to not receive a cookie, but I've read several places that cross-domain-cookies aren't allowed. So I ask the general question, how should I proceed?
I've been toying with the idea to add a .php-page to the server-side of the website 'foo.com', and from there handle the requests from client to API, hopefully causing the cookies to be 'stamped' as 'foo.com' instead of 'java.com'. (In that case, I'd also wonder if the .php can forward the information in the cookie or something similar).
But I really want to avoid as much traffic on the webhost as possible. An all-script-website would be optimal, but I don't really see how cookies can work with that.
Is there anything else I can do to handle this? If I simply want a persistent login-function from a client of 'foo.com' handled at 'java.com', are there any options, with or without the use of cookies?

Google App Engine authentication using OAuth

There are a lot of forums and samples out there, but all of them are either outdated or just not understandable.
I understand that to authenticate requests to AppEngine I need to log in to a Google account using AccountManager, get a token using GoogleAuthUtil.getToken, get an AuthCookie, and then do whatever I want on the AppEngine using my token.
Now, the last 2 parts are the ones I don't understand:
What is the AuthCookie? Do I need to get a new one on every launch? Is it a temporary "permission" to make authenticated requests to AppEngine? Is the first token I received permanent, or should I get a new one on every launch too?
My current request is "endpoint.list().execute()". Where does the authentication come in here? I've seen a lot of weird HTTP request samples but none of them used the AppEngine endpoints.
I'm sorry if it's too basic stuff but I really just started using the AppEngine and I couldn't find any clear explanation on how it works from beginning to end.
Thank you.
Since you are using Endpoints, have you read this?
https://developers.google.com/appengine/docs/java/endpoints/consume_android#making-authenticated-calls
It is up-to-date and I think it is reasonably clear (and it includes a sample).
I believe it is in the nature of OAuth that you need to get a new token for every session.

What's the best way to let the Ajax app know of the errors back at server?

Hi
I'm working on an application with Java as its server-side language, and for the client side I'm using Ajax.
But I'm fairly new to Ajax applications, so I need some opinions on an issue I've run into.
I'm using Spring Security for my authentication and authorization services, and by reading the Spring forums I've managed to integrate Spring Security with the Ajax application in such a way that Ajax requests can be intercepted and the relevant action taken.
Here's the issue: what is the best way to let the Ajax application know that an error has occurred back at the server? What I've been doing so far is, by convention, returning made-up HTTP 500+ codes, e.g. 550 to prompt for login, 551 for another issue, and so forth. But I think this is not the right approach. What is the best approach for dealing with this situation?
If standard HTTP error codes (eg 401 Unauthorized) are rich enough, use them. Best not to make up your own HTTP error codes, they're meant to be fixed. If you need more info to be returned, you should return a richer object in the response body (serialized as eg JSON or XML) and parse the object on the client side.
In my experience, making up your own HTTP error codes is not the best approach.
I've known client and server-side HTTP protocol stacks to treat non-standard HTTP status codes as protocol errors.
A non-standard code is likely to lead to confusing error messages if they end up being handled as non-AJAX responses.
Similarly, using the "reason phrase" part of the response can be problematic. Some server-side stacks won't let you set it, and some client-side stacks discard it.
My preferred way of reporting errors in response to an AJAX request is to send a standard code (e.g. 400 - BAD REQUEST) with an XML, JSON or plain text response body that gives details of the error. (Be sure to set the response content type header ...)
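A minimal sketch of that approach: a small value class that pairs a standard status code with a machine-readable error code and a message, and serializes itself to JSON by hand so it needs no library. The class name, field names and the LOGIN_REQUIRED code are invented for illustration; in a real application a JSON library would normally do the serialization.

```java
final class AjaxError {
    final int status;      // standard HTTP status, e.g. 401 or 400
    final String code;     // application-level code the client can switch on
    final String message;  // human-readable detail

    AjaxError(int status, String code, String message) {
        this.status = status;
        this.code = code;
        this.message = message;
    }

    /** Serializes to a compact JSON object. */
    String toJson() {
        return "{\"status\":" + status
                + ",\"code\":\"" + escape(code) + "\""
                + ",\"message\":\"" + escape(message) + "\"}";
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

On the server this would go out with the matching status and content type, for instance response.setStatus(error.status) and response.setContentType("application/json") in a servlet, so that the client can check the status first and then parse the body for details.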
If this is a bug in your application or a hack that you want to protect against, just return a generic access error. Don't give details of the error to the client, as a hacker could use them to better understand how to abuse your API. They would only confuse normal users anyway.
If this is normal application behaviour, it is better to fail gracefully by allowing the user to retry later (if that makes sense), reconnect or reauthenticate. You should at least distinguish a disconnection error from an insufficient rights error, and display a clear explanation to the user.

Possible to Authenticate with a website with POST / Download CAPTCHA

I've often wanted to create applications that provide a simpler front-end to other websites that require users to login before the pages I want to use can be accessed. I was wondering, if
(1) any website with a POST to an HTTP page can be authenticated against by POSTing
postField1name=pf1Value&postField2name=pf2Value
to the website. If that's true, how can you inspect the HTML to POST correctly?
(2) I wanted to know if you could parse HTML, say for a sign-up form, display all the fields in an application UI (including downloading a captcha and displaying it to the user), let the user type the values in, send them back to the website, and process the response.
Also if anyone knows how I might accomplish (2) using Apache HTTP Client in java, I'd greatly appreciate it!
http://hc.apache.org/httpcomponents-client/httpclient/index.html
(1) An easy way to find out what's actually being POSTed is to look at the actual HTTP requests. You can do that with a tool like LiveHTTPHeaders. Then have your script simulate that.
(2) Yes. You can use cURL, which is excellent for things like this.
(1) Try FireBug. There's actually a lot of options for authentication.
(2) Try JTidy
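Once the field names are known from inspecting the request, the body in (1) is just the fields URL-encoded and joined with "&". A small sketch of building it with the JDK alone (Java 10+ for this URLEncoder overload); the class name is hypothetical, and the field names are the placeholders from the question. Real login forms often contain hidden fields as well, which would have to be included.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

final class FormBody {
    /** Encodes fields as application/x-www-form-urlencoded, in the map's iteration order. */
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

Passing a LinkedHashMap keeps the fields in insertion order. The resulting string is sent as the POST body with the header Content-Type: application/x-www-form-urlencoded, whichever HTTP client is used.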
