I have built a multi-threaded web crawler which makes requests to fetch web pages from the corresponding servers. As it is multi-threaded, it can overburden a server, and the server may then block the crawler (politeness).
I just want to add a minimum delay between consecutive requests to the same server. Would storing the minimum delay from each server's (domain's) robots.txt in a HashMap and comparing it against the time of the last request made to that server be all right?
What if no delay is specified in robots.txt?
The de facto standard robots.txt file format doesn't specify a delay between requests; "Crawl-delay" is a non-standard extension.
The absence of a "Crawl-delay" directive does not mean that you are free to hammer the server as hard as you like.
Would storing the minimum delay from each server's (domain's) robots.txt in a HashMap and comparing it against the time of the last request made to that server be all right?
That is not sufficient. You also need to implement a minimum time between requests for cases where the robots.txt doesn't use the non-standard directive. And you should also respect "Retry-After" headers in 503 responses.
Ideally you should also pay attention to the time taken to respond to a request. A slow response is a potential indication of congestion or server overload, and a site admin is more likely to block your crawler if it is perceived to be the cause of congestion.
I use 0.5 seconds as the delay in my web crawler. Use something like that as the default, and if a Crawl-delay is specified, use that instead.
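A minimal sketch of the HashMap idea combined with the 0.5-second default (the class and method names here are purely illustrative, not from any library):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-host politeness tracker for a multi-threaded crawler.
public class PolitenessManager {

    private static final long DEFAULT_DELAY_MS = 500; // fallback when robots.txt has no Crawl-delay

    // host -> minimum delay parsed from that host's robots.txt (milliseconds)
    private final Map<String, Long> minDelayByHost = new ConcurrentHashMap<>();
    // host -> timestamp of the last request we sent to it
    private final Map<String, Long> lastRequestByHost = new ConcurrentHashMap<>();

    public void setCrawlDelay(String host, long delayMs) {
        minDelayByHost.put(host, delayMs);
    }

    /** Blocks the calling thread until it is polite to hit this host again. */
    public void awaitTurn(String host) throws InterruptedException {
        long delay = minDelayByHost.getOrDefault(host, DEFAULT_DELAY_MS);
        synchronized (this) { // coarse single lock; one lock per host would scale better
            long now = System.currentTimeMillis();
            long last = lastRequestByHost.getOrDefault(host, 0L);
            long wait = last + delay - now;
            if (wait > 0) {
                Thread.sleep(wait);
            }
            lastRequestByHost.put(host, System.currentTimeMillis());
        }
    }
}
```

To honor a "Retry-After" header on a 503 response, you could additionally bump the stored timestamp for that host forward by the indicated number of seconds.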
OWASP suggests that one possible countermeasure for preventing brute force password guessing is adding an artificial delay when checking passwords.
Setting aside any questions about the effectiveness of such an approach vs. temporarily locking accounts etc., how would I implement it inside a Spring Security AuthenticationProvider without creating a situation where it becomes easy for an attacker to quickly consume every web server thread? (Which is what I imagine would happen if I just added a Thread.sleep() inside the authenticate method.)
EDIT 2.0
With the advent of Servlet 3.0 and its async processing support, adding an artificial delay became simple.
PROBLEM: intermediate filters that don't support async processing.
SOLUTION: an additional AttemptFilter at the base of the filter chain, before any incompatible filter sneaks in, which applies our delay logic.
In our previous approach we relied on just a timeout; that didn't help because it terminates the request abnormally. In the revised approach I keep a map of AsyncContext objects keyed by the username and host address combination, and a scheduled job watches for requests that have waited the desired time and dispatches them to the appropriate destination, as sketched below.
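A minimal sketch of that mechanism as a plain Servlet 3.0 filter (names like AttemptDelayFilter and needsDelay are illustrative; the real code is in the repository linked below):

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

// Parks penalised requests with AsyncContext instead of sleeping a thread.
// The filter (and everything after it) must be registered with asyncSupported = true.
public class AttemptDelayFilter implements Filter {

    private final Map<String, AsyncContext> parked = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        String key = req.getParameter("username") + "@" + req.getRemoteAddr();

        if (request.getDispatcherType() == DispatcherType.ASYNC || !needsDelay(key)) {
            chain.doFilter(request, response); // re-dispatched or unpenalised: continue normally
            return;
        }

        AsyncContext ctx = req.startAsync(); // frees the container thread
        ctx.setTimeout(0);                   // we manage the release ourselves
        parked.put(key, ctx);
        // after the penalty, dispatch() re-runs the chain (now with DispatcherType.ASYNC)
        scheduler.schedule(() -> parked.remove(key).dispatch(), 3, TimeUnit.SECONDS);
    }

    private boolean needsDelay(String key) {
        return true; // placeholder: consult the failed-attempt service described below
    }

    @Override public void init(FilterConfig filterConfig) { }
    @Override public void destroy() { scheduler.shutdownNow(); }
}
```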
The approach we are currently using is an additional filter where we check the number of failed attempts for a principal in combination with the remote host address, once a threshold is met.
The number of failed attempts is managed by a dedicated service, which listens to each AbstractAuthenticationEvent and updates the attempt count according to the result.
Attempts for a specific principal and host address combination are blocked by the outermost filter with the minimal possible checks (without even reaching Spring Security) for a specified time, and the blocking time increases with the number of failed attempts.
Because only the remote host and principal combination is blocked, genuine users' web experience is not impacted the way it would be by account locking and similar solutions.
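A sketch of such a dedicated service, listening to Spring Security's authentication events (names are illustrative, not the actual sample code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.context.ApplicationListener;
import org.springframework.security.authentication.event.AbstractAuthenticationEvent;
import org.springframework.security.authentication.event.AbstractAuthenticationFailureEvent;
import org.springframework.security.authentication.event.AuthenticationSuccessEvent;
import org.springframework.stereotype.Component;

// Counts failed login attempts per principal; Spring publishes these events automatically.
@Component
public class AttemptTrackerService implements ApplicationListener<AbstractAuthenticationEvent> {

    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    @Override
    public void onApplicationEvent(AbstractAuthenticationEvent event) {
        String key = event.getAuthentication().getName(); // combine with remote address in real code
        if (event instanceof AbstractAuthenticationFailureEvent) {
            failures.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        } else if (event instanceof AuthenticationSuccessEvent) {
            failures.remove(key); // reset on successful login
        }
    }

    public int failedAttempts(String key) {
        AtomicInteger n = failures.get(key);
        return n == null ? 0 : n.get();
    }
}
```

The outermost filter can then consult failedAttempts() to decide whether a principal/host combination should be delayed or blocked.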
I have made a crude sample application to make the idea clearer for you.
Check out the code on GitHub:
https://github.com/yourarj/spring-security-prevent-brute-force
[Crude architectural overview diagram]
I'm using Java's OkHttp3 to send multiple POST requests to the same REST endpoint, which is on a third-party AWS server in the same region as mine. I need those requests to be processed as fast as possible (even 1 ms counts).
Right now the only performance tips I'm applying are very basic: I'm using HTTP/2 so the connection socket is reused, and I'm sending the requests asynchronously so the client doesn't wait for any response until all requests are sent.
What other tips should I consider to improve performance?
EDIT: In case this is important for any reason, I'm currently passing all params through the URL, the body of the requests is empty. I may pass them as part of the body but I arbitrarily decided not to.
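For context, a minimal sketch of the setup described above: one shared OkHttpClient firing asynchronous empty-body POSTs (the URL and loop count are placeholders):

```java
import java.io.IOException;
import okhttp3.*;

public class Burst {
    public static void main(String[] args) {
        // One shared client: the connection pool (and the single HTTP/2 socket) is per client.
        OkHttpClient client = new OkHttpClient();

        for (int i = 0; i < 100; i++) {
            Request request = new Request.Builder()
                    .url("https://example.com/endpoint?item=" + i) // placeholder; params in the query string
                    .post(RequestBody.create(null, new byte[0]))   // empty body, as described above
                    .build();
            // enqueue() returns immediately; responses arrive on OkHttp's dispatcher threads
            client.newCall(request).enqueue(new Callback() {
                @Override public void onFailure(Call call, IOException e) { e.printStackTrace(); }
                @Override public void onResponse(Call call, Response response) {
                    response.close(); // always release the response body
                }
            });
        }
    }
}
```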
OkHttp is a good choice for low-latency. Netty may be a better choice for high-concurrency, but that's not your use-case.
You may want to measure what happens when you disable gzip. You'll need to remove the Accept-Encoding request header in a network interceptor. That might make things faster, but only because you're on a fast link.
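A sketch of that measurement setup. Network interceptors see the request after OkHttp has added its default "Accept-Encoding: gzip" header, so stripping it here keeps it off the wire:

```java
import okhttp3.OkHttpClient;

public class NoGzipClient {
    // Builds a client whose outgoing requests carry no Accept-Encoding header,
    // so the server (typically) replies uncompressed. Useful only for measuring.
    public static OkHttpClient build() {
        return new OkHttpClient.Builder()
                .addNetworkInterceptor(chain -> chain.proceed(
                        chain.request().newBuilder()
                                .removeHeader("Accept-Encoding")
                                .build()))
                .build();
    }
}
```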
One other thing to research is disabling Nagle's algorithm. You'll need to call Socket.setTcpNoDelay(), which you can do with a custom SocketFactory.
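A sketch of such a factory, delegating to the default one and flipping the flag on every socket it creates; hook it in with OkHttpClient.Builder.socketFactory():

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;
import javax.net.SocketFactory;

// Every socket handed to OkHttp comes out with Nagle's algorithm disabled.
public class NoDelaySocketFactory extends SocketFactory {
    private final SocketFactory delegate = SocketFactory.getDefault();

    private Socket configure(Socket s) throws IOException {
        s.setTcpNoDelay(true); // disable Nagle's algorithm
        return s;
    }

    @Override public Socket createSocket() throws IOException {
        return configure(delegate.createSocket()); // the variant OkHttp actually uses
    }
    @Override public Socket createSocket(String host, int port) throws IOException {
        return configure(delegate.createSocket(host, port));
    }
    @Override public Socket createSocket(String host, int port, InetAddress localAddr, int localPort) throws IOException {
        return configure(delegate.createSocket(host, port, localAddr, localPort));
    }
    @Override public Socket createSocket(InetAddress address, int port) throws IOException {
        return configure(delegate.createSocket(address, port));
    }
    @Override public Socket createSocket(InetAddress address, int port, InetAddress localAddr, int localPort) throws IOException {
        return configure(delegate.createSocket(address, port, localAddr, localPort));
    }
}
```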
The next release of OkHttp will support unencrypted HTTP/2. If you're okay with that (it is almost always a bad idea), removing TLS might buy you a (small) gain. Be very careful here; plaintext communication is bad news.
Over the last few days I have built a proof-of-concept demo using the GSS-API and SPNEGO. The aim is to give users single-sign-on access to services offered by our custom application server via Http RESTful web-services.
A user holding a valid Kerberos Ticket Granting Ticket (TGT) can call the SPNEGO enabled web-service, the Client and Server will negotiate,
the user will be authenticated (both by Kerberos and on application level), and will (on successful authentication) have a Service Ticket for my Service Principal in his Ticket Cache.
This works well using CURL with the --negotiate flag as a test client.
On a first pass CURL makes a normal HTTP request with no special headers. This request is rejected by the Server,
which adds "WWW-Authenticate: Negotiate" to the response headers, suggesting negotiation.
CURL then gets a Service Ticket from the KDC, and makes a second request, this time with Negotiate + the wrapped Service Ticket in the request header (NegTokenInit)
The Server then unwraps the ticket, authenticates the user, and (if authentication was successful) executes the requested service (e.g. login).
The question is: what should happen on subsequent service calls from the client to the server? The client now has a valid Kerberos Service Ticket, yet additional calls via CURL using SPNEGO make the same two passes described above.
As I see it, I have a number of options:
1) Each service call repeats the full 2 pass SPNEGO negotiation (as CURL does).
While this may be the simplest option, at least in theory there will be some overhead: both the client and the server are creating and tearing down GSS contexts, and the request is being sent twice over the network. That is probably OK for GETs, less so for POSTs, as discussed in the questions below:
Why does the Authorization line change for every firefox request?
Why Firefox keeps negotiating kerberos service tickets?
But is the overhead* noticeable in real-life? I guess only performance testing will tell.
2) The first call uses SPNEGO negotiation. Once successfully authenticated, subsequent calls use application level authentication.
This would seem to be the approach taken by Websphere Application Server, which uses Lightweight Third Party Authentication (LTPA) security tokens for subsequent calls.
https://www.ibm.com/support/knowledgecenter/SS7JFU_8.5.5/com.ibm.websphere.express.doc/ae/csec_SPNEGO_explain.html
Initially this seems to be a bit weird. Having successfully negotiated that Kerberos can be used, why fall back to something else? On the other hand, this approach might be valid if GSS-API / SPNEGO can be shown to cause noticeable overhead / performance loss.
3) The first call uses SPNEGO negotiation. Once successfully authenticated and trusted, subsequent calls use GSS-API Kerberos.
In which case, I might as well do option 4) below.
4) Dump SPNEGO, use "pure" GSS-API Kerberos.
I could exchange the Kerberos tickets via custom Http headers or cookies.
Is there a best practice?
As background: Both the client and server applications are under my control, both are implemented in Java, and I know both "speak" Kerberos.
I chose SPNEGO as "a place to start" for the proof-of-concept, and for worming my way into the world of Kerberos and Single Sign On, but it is not a hard requirement.
The proof-of-concept runs on OEL Linux servers with FreeIPA (because that is what I have in our dungeons), but the likely application will be Windows / Active Directory.
* or significant compared to other performance factors such as the database, use of XML vs JSON for the message bodies etc.
** If in the future we wanted to allow browser based access to the web services, then SPNEGO would be the way to go.
To answer your first question: GSS-SPNEGO may involve multiple round trips; it is not limited to just two. You should implement session handling and, upon successful authentication, issue a session cookie that the client presents on each request. When this cookie is invalid, the server forces re-authentication. This way you only incur the negotiation cost when it is really needed.
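A minimal sketch of that "negotiate once, then use a session" pattern as a servlet filter; spnegoAuthenticate() is a hypothetical stand-in for your existing GSS-API/SPNEGO handshake code:

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class SpnegoSessionFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        HttpSession session = request.getSession(false);
        if (session != null && session.getAttribute("principal") != null) {
            chain.doFilter(req, res); // valid session cookie: no new negotiation
            return;
        }

        String principal = spnegoAuthenticate(request, response); // hypothetical helper
        if (principal == null) {
            response.setHeader("WWW-Authenticate", "Negotiate"); // suggest negotiation
            response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
            return;
        }
        // The container issues a JSESSIONID cookie; the client replays it on each call.
        request.getSession(true).setAttribute("principal", principal);
        chain.doFilter(req, res);
    }

    private String spnegoAuthenticate(HttpServletRequest req, HttpServletResponse res) {
        return null; // placeholder for the GSS-API negotiation described earlier
    }

    @Override public void init(FilterConfig filterConfig) { }
    @Override public void destroy() { }
}
```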
Depending on your application design you can choose different approaches to authentication. In FreeIPA we have been recommending front-end authentication, allowing applications to re-use the fact that the front-end authenticated the user. See http://www.freeipa.org/page/Web_App_Authentication for a detailed description of the different approaches.
I would recommend reading the link referenced above and also checking the materials by my colleague: https://www.adelton.com/ He is the author of a number of Apache (and nginx) modules that help decouple authentication from the actual web applications when used with FreeIPA.
On re-reading my question, the question I was really asking was:
a) Is the overhead of SPNEGO significant enough that it makes sense to use it for authentication only, with "something else" used for subsequent service calls?
or
b) Is the overhead of SPNEGO NOT significant in the greater scheme of things, so it can be used for all service calls?
The answer is: It depends on the case; and key to finding out, is to measure the performance of service calls with and without SPNEGO.
Today I ran 4 basic test cases using:
a simple "ping" type web-service, without SPNEGO
a simple "ping" type web-sevice, using SPNEGO
a call to the application's login service, without SPNEGO
a call to the application's login service, using SPNEGO
Each test was driven by a ksh script looping for 60 seconds, making as many calls via CURL as possible in that time.
In the first test runs I measured:
Without SPNEGO
15 pings
11 logins
With SPNEGO
8 pings
8 logins
This initially indicated that with SPNEGO I could only make half as many calls. However, on reflection, the overall volume of calls measured seemed low, even given the low-spec virtual machines being used.
After some googling and tuning, I reran all the tests calling CURL with the -4 flag for IPv4 only (apparently avoiding slow or failing IPv6 lookups). This gave the following:
Without SPNEGO
300+ pings
100+ logins
With SPNEGO
19 pings
14 logins
This demonstrates a significant difference with and without SPNEGO!
While I need to do further testing with real-world services that do some backend processing, this suggests that there is a strong case for "using something else" for subsequent service calls after authentication via SPNEGO.
In his comments Samson documents a precedent in the Hadoop world, and adds the additional architectural consideration of highly distributed / highly available service principals.
I am working on a web-based application.
Server : Tomcat
Technology : Simple JSP/Servlet/jQuery
I need to restart my server whenever there are new updates, and these changes are frequent, almost every 1 or 2 days. I am thinking of creating a mechanism to tell every logged-in user to save their changes because the server will restart in a few minutes (a timer will be shown). The popup should open even if the user is idle.
Is there any direct way to do this? Or do I need to implement an AJAX call to the server every few seconds on every JSP page to check whether any message is there on the server?
Any idea will be appreciated.
Thanks in advance.
For the approach you are taking, I would suggest using async servlets (requires Servlet API 3.0 or newer) or Apache Tomcat's Comet technology (a kind of async servlet).
You will make an AJAX call from every page when it loads (e.g. in onload()) to the async servlet, and that request will idle until a response comes from the server. The async servlet should send a server-restart notification to all connected clients whenever you trigger the notification manually. Once the AJAX client gets the notification, it will display the warning (or a user-friendly message).
This removes the need to poll the server at a fixed interval, a big plus resource-wise.
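A minimal sketch of such an async servlet, parking each page's long-poll request until you trigger the notification (the path, class, and method names are illustrative):

```java
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Each page opens one request here and it stays parked until a broadcast.
@WebServlet(urlPatterns = "/notifications", asyncSupported = true)
public class RestartNotificationServlet extends HttpServlet {

    private static final Queue<AsyncContext> waiters = new ConcurrentLinkedQueue<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync(); // frees the container thread
        ctx.setTimeout(5 * 60 * 1000);       // client re-polls if nothing happens in 5 minutes
        waiters.add(ctx);
    }

    /** Call this from your admin/deployment hook to warn all connected clients. */
    public static void broadcastRestart(String message) throws IOException {
        AsyncContext ctx;
        while ((ctx = waiters.poll()) != null) {
            ctx.getResponse().getWriter().write(message); // the page's AJAX callback shows the popup
            ctx.complete();
        }
    }
}
```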
Personally I wouldn't suggest this approach; better to agree with your clients on a specific deployment timeframe (every day or every two days) and perform deployments in that window.
If you are from India, you must be aware of the IRCTC website, which is unavailable for train reservations for one hour every night.
I am designing my first GAE app and obviously need to use HTTPS for the login functionality (can't be sending my users' UIDs and passwords in cleartext!).
But I'm confused/nervous about how to handle requests after the initial login. The way I see it, I have 2 strategies:
Use HTTPS for everything
Switch back from HTTPS (for login) to plain ole' HTTP
The first option is more secure, but might introduce performance overhead (?) and possibly send my service bill through the roof. The second option is quicker and easier, but less secure.
The other factor here is that this would be a "single-page app" (using GWT), and certain sections of the UI will be able to accept payment and will require the secure transmission of financial data. So some AJAX requests could be HTTP, but others must be HTTPS.
So I ask:
GAE has a nifty table explaining incoming/outgoing bandwidth resources, but never concretely defines how much I/O bandwidth can be dedicated for HTTPS. Does anybody know the restrictions here? I'm planning on using "Billing Enabled" and paying a little bit for the app (and for higher resource limits).
Is it possible to have a GWT/single-page app where some portions of the UI use HTTP while others utilize HTTPS? Or is it "all or nothing"?
Is there any real performance overhead to utilizing an all-HTTPS strategy?
Understanding these will help me decide between a HTTP/S hybrid solution, or a pure HTTPS solution. Thanks in advance!
If you start mixing HTTP and HTTPS requests, you are only as secure as you would be using plain HTTP, because any HTTP request can be intercepted and can introduce possible XSS attacks.
If you are serious about your security, read up on it; assuming that you only require HTTPS for sensitive data and transmitting the rest over HTTP will get you into a lot of trouble.
You pay the same for incoming bandwidth whether it is HTTP or HTTPS, and you shouldn't see any difference in instance hours. The only difference is the monthly fee for SNI or VIP.