I want to implement a chat website on App Engine, but I found that App Engine won't let me use server push (it kills the response after 30 seconds).
So what other method can be used? Will polling cause a bad user experience, i.e. will the user have to wait some time to retrieve new messages from the server? What would be the ideal polling interval?
If I use a very small polling interval, will my bandwidth get exhausted? Will I suffer performance problems?
This is quite an old question now, but I was looking for a similar answer. I think the Channel API (http://code.google.com/appengine/docs/java/channel/) is much better suited to the task. From what I understand, XMPP is good for interacting with the app, but not with other users. The Channel API implements push notifications over HTTP requests. I just found an example of a chat room here: https://bitbucket.org/keakon/channelchat
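To give a sense of the server side, here is a minimal sketch using the Java Channel API; clientId and messageJson are hypothetical variables standing in for your own user/channel identifier and chat payload:

import com.google.appengine.api.channel.ChannelMessage;
import com.google.appengine.api.channel.ChannelService;
import com.google.appengine.api.channel.ChannelServiceFactory;

ChannelService channelService = ChannelServiceFactory.getChannelService();

// One channel per connected user; the token is handed to the page,
// where the JavaScript client uses it to open the channel.
String token = channelService.createChannel(clientId);

// When someone posts a chat message, push it to each participant's channel.
channelService.sendMessage(new ChannelMessage(clientId, messageJson));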
Can't you just use XMPP instead of a website? It would be a much better approach. Polling certainly isn't going to scale very well and will definitely not give a good user experience.
XMPP with App Engine
I've heard of people working around that by holding the connection open (i.e. sending no response) until it dies, then reestablishing it. 30 seconds is not that much, though.
Done this way, it would still feel more responsive to the user than polling every 30 seconds.
About the bandwidth usage: depending on the payload, "typical" HTTP requests can range from a few hundred bytes to several kilobytes, especially with cookies.
With a (pessimistic) average size of, say, 5 kB every 30 seconds, that sums to around 14 MB per 24 hours (2 requests/minute × 1440 minutes × 5 kB ≈ 14.4 MB). Maybe you can cut down the size by setting a path on your cookies so they don't get sent with these connections, and maybe you don't need to send the whole payload again every 30 seconds.
Yeah, the Channel API is the best solution; with GWT it's even better.
http://www.dev-articles.com/article/Google-App-Engine-sending-messages-with-XMPP-393002
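For reference, sending a message with App Engine's Java XMPP service looks roughly like this; the recipient address is of course a placeholder:

import com.google.appengine.api.xmpp.JID;
import com.google.appengine.api.xmpp.Message;
import com.google.appengine.api.xmpp.MessageBuilder;
import com.google.appengine.api.xmpp.XMPPService;
import com.google.appengine.api.xmpp.XMPPServiceFactory;

XMPPService xmpp = XMPPServiceFactory.getXMPPService();
JID recipient = new JID("someuser@gmail.com"); // placeholder address

Message msg = new MessageBuilder()
        .withRecipientJids(recipient)
        .withBody("Hello from the chat app")
        .build();

xmpp.sendMessage(msg);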
I'm trying to gather data by scraping webpages using Java with Jsoup. Ideally, I'd like about 8000 lines of data, but I was wondering what the etiquette is when it comes to accessing a site that many times. For each one, my code has to navigate to a different part of the site, so I would have to load 8000 (or more) webpages. Would it be a good idea to put delays between each request so I don't overload the website? They don't offer an API from what I can see.
Additionally, I tried running my code to get just 80 lines of data without any delay, and my internet went out. Could running that code have caused it? When I called the company, the automated message made it sound like service was out in the area, so maybe I just didn't notice it until I tried to run the code. Any help is appreciated; I'm very new to network coding. Thanks!
Here are a couple of things to consider, which I learned while writing a super-fast web scraper with Java and Jsoup:
The most important one is the legal aspect: whether the website allows crawling at all, and to what extent it allows its data to be used.
Adding delays is fine, but setting a custom user agent and complying with robots.txt is preferable. I saw response times improve significantly after changing the user agent from the default.
If the site allows it and you need to crawl a large number of pages (which was permitted on one of my previous projects), you can use a thread executor to load N pages simultaneously; see the sketch after these notes. It turns an hours-long data-gathering job with a single-threaded Java scraper into just a couple of minutes.
Many ISPs blacklist users who run repetitive programmatic tasks such as web crawling or setting up email servers. It varies from ISP to ISP. I previously avoided this by using proxies.
Against a website with a response time of 500 ms per request, my scraper pulled data from 200k pages in 3 minutes using 50 threads and 1000 proxies over a 100 Mbps connection.
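Here is a minimal sketch of that thread-pool approach; `urls` is assumed to be a collection of page URLs you gathered elsewhere, and the user-agent string and delay are illustrative values:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

ExecutorService pool = Executors.newFixedThreadPool(8); // N pages in flight at once
for (String url : urls) { // `urls` is assumed to be a List<String> built elsewhere
    pool.submit(() -> {
        try {
            Document doc = Jsoup.connect(url)
                    .userAgent("MyScraper/1.0 (+http://example.com/bot)") // identify yourself
                    .timeout(10_000)
                    .get();
            // ... extract the rows you need from `doc` here ...
            Thread.sleep(10); // small courtesy delay per worker
        } catch (Exception e) {
            // log and move on; one failed page shouldn't kill the worker
        }
    });
}
pool.shutdown();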
Should there be delays between requests?
Answer: It depends. If the website tolerates you hitting it constantly, you don't strictly need one, but it's better to have one anyway. I used a delay of 10 ms between requests.
I tried running my code to get just 80 lines of data without any delay, and my internet went out. Could that be the cause?
Answer: Most probably. Your ISP may assume you are running a DoS attack against the website and may have temporarily or permanently limited your connection.
We are developing a site that will allow users to send semi-real-time events to other users. The UI will display an icon when there is a new event for a user (pretty standard stuff).
I have read that periodic short polling does not scale as well as WebSockets because it puts more pressure on the web server, but I am not quite sure why that would be the case.
We are using tomcat NIO (which does not have a one-to-one connection per thread ratio). As I understand it, Tomcat NIO is pretty good at handling longer HTTP connection timeouts with a small number of threads.
So, if the periodic polling interval is less than the connection timeout, polling should not have to create another TCP handshake, as it will just reuse an existing HTTP/1.1 keep-alive connection.
Thus, the above does not seem like it would put too much pressure on the server. It may not be as real-time as long polling or WebSockets, but I do not see why it should not scale (assuming the server can quickly respond with whether there is a new event or not; we use an in-memory ConcurrentHashMap, so this should be fast with no DB access needed).
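For concreteness, the kind of endpoint I have in mind is roughly this (class, map, and parameter names are illustrative):

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class EventPollServlet extends HttpServlet {
    // userId -> whether that user has an unseen event pending
    private static final ConcurrentHashMap<String, Boolean> PENDING = new ConcurrentHashMap<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String userId = req.getParameter("user"); // illustrative parameter name
        boolean hasEvent = Boolean.TRUE.equals(PENDING.remove(userId));
        resp.setContentType("application/json");
        resp.getWriter().write("{\"newEvent\":" + hasEvent + "}");
    }
}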
Am I missing anything?
Thanks,
-Adam
Short polling may not be as trendy as long polling and web sockets but it works and works everywhere.
Trello (backed by some of the same people as SO) normally uses WebSockets, but when they encountered a crippling bug in their WebSocket implementation on launch day, they were saved by short polling:
We hit a problem right after launch. Our WebSocket server implementation started behaving very strangely under the sudden and heavy real-world usage of launching at TechCrunch disrupt, and we were glad to be able to revert to plain polling and tune server performance by adjusting the active and idle polling intervals. It allowed us to degrade gracefully as we increased from 300 to 50,000 users in under a week. We’re back on WebSockets now, but having a working short-polling system still seems like a very prudent fallback.
The full story is well worth a read.
I'd particularly highlight:
The use of HAProxy to terminate the client connection, meaning that the internal web servers are shielded from slow and misbehaving clients, and the overhead of repeatedly creating connections becomes less of an issue thanks to HAProxy's scalability and efficiency;
Trello's polling frequency was adjustable, meaning that under heavy load they could tell all clients to poll less frequently, trading responsiveness for increased capacity.
In Brazil, at least, there are many retail trading platforms that use short polling with very short intervals to publish stock prices rapidly, and they regularly support thousands of concurrent users.
Unlike long polling and WebSockets, short polling doesn't require a persistent connection, so with something like HAProxy in the middle your maximum number of "connections" can actually exceed the number of concurrent sockets supported by your hardware (although at that point you'd probably see some degradation in responsiveness).
A while back I tried implementing a crawler in Java, then left the project for a while (and have made a lot of progress since). Basically I implemented a crawler with circa 200-400 threads, where each thread connects and downloads the content of one page (simplified for clarity, but that's basically it):
// We're in the run() method of a truly generic Runnable;
// _url is a field set on the Runnable beforehand.
// (Requires: import org.jsoup.Connection; import org.jsoup.Jsoup;
//  import org.jsoup.nodes.Document;)
Connection c = Jsoup.connect(_url).timeout(10000); // 10 s timeout
Connection.Response res = c.execute();             // performs the request
Document d = res.parse();                          // parse the body into a DOM
// Use Jsoup to extract the links, add them to the backbone of the crawler
// to be checked and maybe later passed to the crawling queue.
This works. The problem is that I only use a very small fraction of my internet bandwidth. Although I can download at over 6 MB/s, I've determined (using NetLimiter and my own calculations) that I only reach about 1 MB/s at best when downloading page sources.
I've done a lot of statistics and analysis, and it is somewhat reasonable: if the computer cannot efficiently support more than ~400 threads (I'm not sure about that either, but a larger number of threads seems to be ineffective) and each connection takes about 4 seconds to complete, then I should be downloading 100 pages per second, which is indeed what happens. The bizarre thing is that often while I run this program, the internet connection becomes completely clogged: neither I nor anyone else on my Wi-Fi can access the web normally, even though I'm only using 16% of the bandwidth (which doesn't happen when downloading other files, say movies).
I've spent literally weeks calculating, analyzing, and collecting statistics (making sure all threads are operating with the VM monitor, calculating mean run times for threads, Excel charts...) before coming here, but I've run out of answers. I wonder if this behavior can be explained. I realize there are a lot of "ifs" in this question, but it's the best I can do without it turning into an essay.
My computer specs are an i5 4460 with 8 GB of DDR3-1600 and a 100 Mb/s (effectively around 8 MB/s) internet connection, connected directly via LAN to the crawler. I'm looking for general directions (I mean obvious things that are clear to experienced developers but not to me) on where else to look in order to either:
Improve the download speed (maybe not Jsoup? a different number of threads? I've already tried using selectors instead of threads and it was slower), or:
Free up the internet when I'm running this program.
I've thought about the router itself (a Netgear N600) limiting the number of outgoing connections (which seems odd), meaning I'd be saturating the number of connections rather than the bandwidth, but I couldn't figure out whether that's even possible.
Any general direction or advice would be warmly welcomed :) Feel free to point out newbie mistakes; that's how I learn.
Amir.
The issue was not DNS resolution: creating the connections with an IP address (I stored all the addresses in advance and then used those) resulted in exactly the same response times and bandwidth use. Nor was it a threads issue.
I now suspect it was the NetLimiter program's "fault". I measured the number of bytes received directly and wrote them to disk (I had done this before, but apparently I had made some changes to the program since). It seems I really am saturating the bandwidth. Also, when switching to HttpURLConnection objects instead of Jsoup, NetLimiter does show much larger bandwidth usage, so perhaps it has some issue with Jsoup.
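For anyone wanting to reproduce that measurement, here is a rough sketch of counting received bytes directly with HttpURLConnection; `url` and `totalBytes` are hypothetical (a page URL and a shared counter summed across threads):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

// Count the bytes actually received for one page, bypassing Jsoup entirely.
HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
conn.setConnectTimeout(10_000);
conn.setReadTimeout(10_000);

long bytes = 0;
try (InputStream in = conn.getInputStream()) {
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
        bytes += n;
    }
}
totalBytes.addAndGet(bytes); // `totalBytes` is a hypothetical shared AtomicLong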
I'm not sure this was the whole problem, but empirically the program really does download a lot of data. I hope this helps anyone who encounters a similar issue in the future.
I've written a web service (DropWizard) that accepts requests via POST to perform operations that may take considerable time, meaning anywhere from 1 to 5 minutes to complete.
That said, the caller doesn't need a response; a simple 200 to acknowledge receipt of the message is enough. (It's actually a PayPal IPN webhook, for anybody who is curious.)
I only want to perform one of these operations at a time (with the option to increase that in the future) so that my system doesn't get overloaded.
What kind of queue mechanism should I consider using? This probably goes without saying, but I must assume the API instance can be killed at any time, clearing its memory, so I need a durable place to store the queue so processing can resume where the server left off after a restart.
Thank you.
You could use an Apache Kafka queue. The documentation is pretty clear and should help you out.
http://kafka.apache.org/
Hope that helps!
You can use ActiveMQ with persistence. It's very lightweight and easy to use. Have a look at http://activemq.apache.org/persistence.html; it will guide you through the process step by step.
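As a rough sketch of what the producer side might look like with ActiveMQ's JMS API (the broker URL, queue name, and `ipnPayload` variable are placeholders); running a single consumer on the same queue gives you the one-at-a-time processing you want:

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

// Producer side: persist the IPN payload and return 200 immediately.
ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
Connection connection = factory.createConnection();
connection.start();

Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
Queue queue = session.createQueue("ipn.tasks"); // placeholder queue name

MessageProducer producer = session.createProducer(queue);
producer.setDeliveryMode(DeliveryMode.PERSISTENT); // survives broker and app restarts
producer.send(session.createTextMessage(ipnPayload)); // `ipnPayload` from the POST body

// A single MessageConsumer draining "ipn.tasks" gives you one-at-a-time
// processing; add more consumers later if you need to scale out.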
I'd like to implement a web-based dashboard with a variety of metrics, where one changes every minute and others change around twice a day. Via AJAX, the metrics should be updated as quickly as possible after a change occurs. This means the same page would be running for at least several hours.
What would be the most efficient way (technology-/implementation-wise) of dealing with this in the Java world?
Well, there are two obvious options here:
Comet, aka long polling: the AJAX request is held open by the server until it times out after a few minutes or until a change occurs, whichever happens first (there is a sketch of this below). The downside of this is that handling many connections can be tricky; aside from anything else, you won't want the typical "one thread per request, handling it synchronously" model which is common.
Frequent polling from the AJAX page, where each request returns quickly. This would probably be simpler to implement, but is less efficient in network terms (far more requests) and will be less immediate; you could send a request every 5 seconds for example, but if you have a lot of users you're going to end up with a lot of traffic.
The best solution will depend on how many users you've got. If there are only going to be a few clients, you may well want to go for the "poll every 5 seconds" approach - or even possibly long polling with a thread per request (although that will probably be slightly harder to implement). If you've got a lot of clients I'd definitely go with long polling, but you'll need to look at how to detach the thread from the connection in your particular server environment.
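To illustrate the detached-thread idea for option 1, here is a minimal long-polling sketch using Servlet 3.0 async support; MetricStore is a hypothetical component that holds the AsyncContext and completes it when a metric changes:

import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/metrics", asyncSupported = true)
public class MetricsLongPollServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();  // frees the request thread immediately
        ctx.setTimeout(120_000);              // give up after 2 minutes; the client re-polls
        MetricStore.register(ctx);            // hypothetical: on a metric change, it writes
                                              // the JSON delta to ctx's response and calls
                                              // ctx.complete()
    }
}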
I think Comet's time has passed; the Socket.IO protocol is gaining popularity. I suggest using netty-socketio: it supports both the long-polling and WebSocket transports, and JavaScript, iOS, and Android client libraries are also available.
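A minimal netty-socketio server, just to give a sense of the API (the bind address, port, and event name are illustrative):

import com.corundumstudio.socketio.Configuration;
import com.corundumstudio.socketio.SocketIOServer;

public class DashboardPushServer {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.setHostname("0.0.0.0"); // illustrative bind address
        config.setPort(9092);          // illustrative port

        SocketIOServer server = new SocketIOServer(config);

        // When a metric changes, broadcast it to all connected clients; Socket.IO
        // falls back to long-polling automatically for clients without WebSockets.
        server.addEventListener("metricUpdate", String.class,
                (client, data, ackSender) ->
                        server.getBroadcastOperations().sendEvent("metricUpdate", data));

        server.start();
    }
}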