I need to prevent duplicate form submissions for my customer's website.
We need some form data from the user for the order confirmation page.
We use load balancing across our web servers.
Approach 1 : Post/Redirect/Get
(PRG pattern : http://en.wikipedia.org/wiki/Post/Redirect/Get)
I tried the PRG pattern first.
In this case, I think I need to share the session (or Spring FlashMap) across multiple web servers.
Approach 2 : Disable refresh on client.
One of my colleagues suggested this approach.
Approach 3 : Post/Post
Another colleague suggested this approach.
I think approaches 2 and 3 are not good choices,
but I do not know the specific drawbacks or security risks of these approaches.
I tried to google it, but I could not find an answer.
Thank you in advance.
I would like to update the pros and cons.
Approach 1 : Post/Redirect/Get
If you need some form data from the user to show on the confirmation page, you need to keep it in a session, a database, or something similar.
If you use a session and have more than one server, you have to make the session available across all of the servers.
Approach 2 : Disable refresh on client.
Users will get upset if you limit the browser standard features, like refresh.
You need to consider F5, Ctrl+F5, ⌘ + F5, etc., and the various refresh icons.
On mobile, many web browsers automatically refresh the page when the user reopens the browser.
Approach 3 : Post/Post
You don't have to worry about session sharing issue across multiple servers.
Second form submit can fail.
Approach 1 is a pretty straightforward method that solves some duplicate post issues. It won't cope with server lag, though, which is another reason for duplicate submissions.
Approach 2 is nothing but wrong. Users will get upset if you limit the browser standard features, like refresh. That is, if you are even able to do so technically cross browser. You need to consider F5, Ctrl+F5, ⌘ + F5 etc, various refresh icons.
I must admit that I don't fully understand the intent of Approach 3, however, it feels a bit wrong to bounce the user to an empty page.
Another standard approach is to use a nonce with form posts. This will also help you avoid a security risk called Cross-Site Request Forgery (CSRF). It's pretty simple.
Generate a "unique" random string on the server, called nonce.
Insert the nonce into the database.
Attach the nonce to the form as a hidden field (or pass by URL or similar).
Make sure the nonce is sent along in the form post to server.
At server side, validate the nonce, remove nonce, "save form data".
Display confirmation page.
If you get another request with a non existing nonce, then you know it's either a duplicate post or some more evil CSRF attack.
You can probably find some support library that does this for you.
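The nonce steps above can be sketched roughly as follows. This is only a minimal illustration, not a finished implementation: the class name NonceStore and its methods are made up for this example, and the in-memory set stands in for the database table mentioned in the steps. With load-balanced servers the nonces would have to live in a store that all servers share.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: an in-memory stand-in for the nonce table.
// Behind a load balancer this state must be kept in a shared database instead.
final class NonceStore {
    private final SecureRandom random = new SecureRandom();
    private final Set<String> issued = ConcurrentHashMap.newKeySet();

    /** Generates a random nonce, remembers it, and returns it for the hidden form field. */
    String issue() {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        String nonce = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        issued.add(nonce);
        return nonce;
    }

    /**
     * Validates and removes the nonce in one atomic step.
     * Returns true only for the first post that carries a given nonce.
     */
    boolean consume(String nonce) {
        return nonce != null && issued.remove(nonce);
    }
}
```

A form handler would call consume() with the posted value and save the form data only when it returns true; a false result means a duplicate post or a request that never received a nonce from this server.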
I'm building my own HTTP server in Java, but I'm facing a problem: I would like to build a page dynamically by creating every HTML object at runtime. The question is: how can I determine the screen dimensions of the client's browser?
This information is not present in the HTTP headers, so I was thinking about writing a "fake" web page that runs some JavaScript that tells the server about the screen (it should redirect to something like www.website.com/w:1920,h:1080), but I don't know anything about cookies (which I suppose are essential for storing that information).
Do you think that I should learn something about cookies, or is there another way?
BTW I'm not using servlets, just Socket, because that's what I know... should I use servlets?
Thanks for your time!
The server knows nothing about the client's screen until the client sends this information. JavaScript is the easiest way to determine the screen size: the screen.width and screen.height properties hold it.
An AJAX request can be used to send the information to the server, where it can be stored in session data and backed by a database, for example if the user is logged in or otherwise identified. In that case you don't need cookies. However, the solution with cookies is easier; check how to set them via JavaScript. But I'm afraid such a solution would be a bit non-standard. If your site is going to depend on JavaScript anyway, why not use it extensively and generate all the objects on the client side? Get that lazy computer working and save your server's resources :) Just feed it data by sending the simplest HTML containing a script that does the work.
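On the server side, reading the reported size is a small parsing job. The sketch below assumes the client sends the dimensions in the w:1920,h:1080 form proposed in the question; the ScreenSize class and its parse method are invented for illustration and are not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for the dimensions reported by the client script.
final class ScreenSize {
    // Matches the "w:<digits>,h:<digits>" format suggested in the question.
    private static final Pattern FORMAT = Pattern.compile("w:(\\d{1,5}),h:(\\d{1,5})");

    final int width;
    final int height;

    private ScreenSize(int width, int height) {
        this.width = width;
        this.height = height;
    }

    /** Returns the parsed size, or null if the segment is missing or malformed. */
    static ScreenSize parse(String segment) {
        if (segment == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(segment);
        if (!m.matches()) {
            return null; // never trust client input; reject anything unexpected
        }
        return new ScreenSize(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }
}
```

Because the value comes from the client, it should be treated as a hint only and validated like any other user input.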
Servlets? They can be really lightweight and can be done with minimal knowledge. If you have time, go for it.
I am using Facebook login on my site. When I test locally I need to use local.mysite.com, so Facebook thinks the request is coming from my site. This works great except when I upload images to the blobstore. When uploading images, App Engine always switches to localhost:8888. This makes the browser think cross-site scripting is happening and prevents my uploads. How can I force App Engine to use local.mysite.com instead of localhost:8888?
This is the error I am getting:
XMLHttpRequest cannot load http://localhost:8888/_ah/upload/agpidWJwcm9qZWN0chsLEhVfX0Jsb2JVcGxvYWRTZXNzaW9uX18YBQw. Origin http://local.mysite.com:8888 is not allowed by Access-Control-Allow-Origin.
I'm not sure you can actually change that URL.
What you can do, though, is use localhost:8888 for your local tests and create another Facebook application that points to localhost. After that, there are two approaches that let you use these two (or possibly even more in the future) Facebook applications in your app.
Decide based on the requested URL which key to use.
Store all the keys in some kind of configuration in the Datastore that only admins can change.
With the first approach you have to store all the keys somehow in your code, or even worse in the datastore, and then decide based on the URL which one to use. This approach is not good and it doesn't scale very well. The second approach is preferable since you don't have to store your keys in the code, it is more secure, and it scales much better since you don't need to know up front how many different Facebook applications you have.
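Wherever the keys end up being stored, picking the right one for a request comes down to a lookup by host name. The following is only a sketch under assumptions: FacebookAppConfig is a made-up class, the app IDs in the usage note are placeholders, and in the second approach the register calls would be replaced by loading the entries from the admin-editable configuration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry that maps a request host to a Facebook application ID.
final class FacebookAppConfig {
    private final Map<String, String> appIdByHost = new HashMap<>();

    /** Registers the app ID to use for a host name (without port). */
    void register(String host, String appId) {
        appIdByHost.put(host.toLowerCase(Locale.ROOT), appId);
    }

    /**
     * Looks up the app ID for the value of a Host header such as "localhost:8888".
     * Returns null if no application is registered for that host.
     * IPv6 literals are not handled in this sketch.
     */
    String appIdFor(String hostHeader) {
        if (hostHeader == null) {
            return null;
        }
        int colon = hostHeader.indexOf(':');
        String host = colon >= 0 ? hostHeader.substring(0, colon) : hostHeader;
        return appIdByHost.get(host.toLowerCase(Locale.ROOT));
    }
}
```

For example, after register("localhost", "LOCAL_APP_ID") and register("www.mysite.com", "PROD_APP_ID"), a request with the Host header localhost:8888 would resolve to the local test application.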
You can read Nick Johnson's answer on how to solve that in Python; the idea is the same in Java, so it shouldn't be that hard.
I have 20+ pages from different web applications that a user can access once he logs in. I have a 'Recent Activity' section on my home page where I have to show the last 10 pages visited by the user (if possible, along with the date and time of each visit). The pages are JSP pages. I don't know how I can achieve this. I am more of a frontend developer, so can I do this with jQuery, JSP, JS, etc., or any other technologies? We also use Java. Please let me know of any sample code or approach for doing it.
I am a PHP developer and not too familiar with JSP, but I am sure the same logic applies.
You have 2 options here:
Option 1:
Create a database table and record the user's flow whenever the user accesses an application.
Option 2:
Save the flow in a cookie so that whenever the user logs in you can pull all of his info out of the cookie.
Personally I would rather use option 1, because if the user clears the cookie/session data you will lose all the information.
Since I am not a JSP developer I can't provide sample code. Hope this gets you started at least.
This information may be stored in cookies or in the user session.
If it's stored in cookies, you can access and manipulate it using JavaScript or any server-side language.
Do you want an example of how to do it using JSP and Servlets?
There are pros and cons to each approach.
Cookies: The user can clear private browser data, and the cookies go away with it.
Sessions: You can store the data in a database or log file, for later loading and/or analysis.
The con is having to manage this data in whichever layer you choose. But it's not a big problem.
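Whichever storage is chosen, the bookkeeping for "the last 10 visited pages" is small. Below is a minimal sketch assuming a per-user object that could, for instance, be kept in the session or rebuilt from a database table; the RecentPages and Visit names are invented for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical per-user tracker that keeps only the most recent visits.
final class RecentPages {
    static final int MAX_ENTRIES = 10;

    /** One visited page together with the time of the visit. */
    static final class Visit {
        final String url;
        final long visitedAtMillis;

        Visit(String url, long visitedAtMillis) {
            this.url = url;
            this.visitedAtMillis = visitedAtMillis;
        }
    }

    private final Deque<Visit> visits = new ArrayDeque<>();

    /** Records a visit and drops the oldest entries beyond the limit. */
    synchronized void record(String url, long visitedAtMillis) {
        visits.addFirst(new Visit(url, visitedAtMillis));
        while (visits.size() > MAX_ENTRIES) {
            visits.removeLast();
        }
    }

    /** Returns a copy of the visits, newest first. */
    synchronized List<Visit> latest() {
        return new ArrayList<>(visits);
    }
}
```

A servlet filter or a common include on every JSP could call record() with the request URI and System.currentTimeMillis(), and the home page would render latest().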
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering if someone would be able to figure out that the code was being page scraped, whether or not they had written code for that purpose?
I wrote the code in java, and it's pretty much just checking for one line of the html code.
I thought I'd get some insight on that before I add any more code to this program. I mean, it's useful and all, but it's almost like a hack.
Seems like the worst case scenario as a result of this page scraper isn't too bad as I can just use another device later and the IP will be different. Also it might not matter in a month. The website seems to be getting quite a lot of web traffic anyways at the moment. Whoever edits the page is probably asleep now, and it really hasn't accomplished anything at this point so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.
It is a hack. :)
There's no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
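Rate limiting yourself can be as simple as sleeping for a randomized interval between requests. The sketch below is one possible way to do it; the PoliteDelay class and the delay bounds in the usage note are arbitrary choices for illustration, not recommended values.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper that spaces out requests with a random pause.
final class PoliteDelay {
    /** Returns a random delay in milliseconds between min and max, both inclusive. */
    static long nextDelayMillis(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis || maxMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 <= min <= max < Long.MAX_VALUE");
        }
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Blocks the calling thread for a random delay within the given bounds. */
    static void pause(long minMillis, long maxMillis) throws InterruptedException {
        Thread.sleep(nextDelayMillis(minMillis, maxMillis));
    }
}
```

Calling PoliteDelay.pause(2000, 5000) before each fetch would keep the scraper to well under one request per second with irregular spacing.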
As a sysadmin myself, yes I'd probably notice but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or in very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, various linked in CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.
It depends on how you have implemented this and how smart the detection tools are.
First, take care of the User-Agent. If you do not set it explicitly it will be something like "Java-1.6". Browsers send their own "unique" user agents, so you can just mimic the browser behavior and send the User-Agent of MSIE or Firefox (for example).
Second, check the other HTTP headers. Some browsers probably send their own specific headers. Take one browser as an example and follow it, i.e. try to add the same headers to your requests (even if you do not need them).
A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then "click" a link, i.e. perform yet another HTTP GET. Put a random sleep between these operations.
A browser retrieves not only the main HTML. It then downloads images and other resources. If you really do not want to be detected you have to parse the HTML and download these resources too, i.e. actually be a "browser".
And the last point: it is obviously not your case, but it is almost impossible to implement a robot that passes a CAPTCHA. This is yet another way to detect a robot.
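For the User-Agent point, setting the header explicitly with the standard HttpURLConnection looks roughly like this. It is a sketch only: the ScraperRequest class is invented for the example, and the URL, user-agent string, and timeouts are placeholders chosen by the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper that prepares a request with an explicit User-Agent.
final class ScraperRequest {
    /**
     * Creates a connection object with the given User-Agent header.
     * Nothing is sent over the network until the caller reads the response.
     */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000); // placeholder timeouts, in milliseconds
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller would then read conn.getInputStream() to fetch the page. Further headers can be added with the same setRequestProperty call.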
Happy hacking!
If your scraper acts like a human then there is hardly any chance of it being detected as a scraper. But if your scraper acts like a robot then it's not difficult to detect.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests when accessing the page and request the same resources with the scraper.
Time your scraper to access at the speed of a normal user
Send requests at random intervals of time instead of at fixed intervals
If possible make requests from a dynamic IP rather than a static one
Assuming you wrote the page scraper in a normal manner, i.e., it fetches the whole page and then does pattern recognition to extract what you want from the page, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.
To the server serving the page, there's no difference between downloading a page into the browser and downloading a page to screen scrape it. Both actions just require an HTTP request; whatever you do with the resulting HTML on your end is none of the server's business.
Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things or not depends on the server; most don't.
I'd like to put my two cents in for others who may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.
You should definitely abide by the robots.txt, but also look at the website's T&Cs to make sure you are not in violation. There are definitely ways that people can identify that you are web scraping, and there could be consequences for doing so. If web scraping is not disallowed by the website's Terms and Conditions, then have fun, but make sure to still be considerate. Don't destroy a web server with an out-of-control bot; throttle yourself to make sure you don't impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
I want to write an application using Java 6 that can check a user's Hotmail inbox for the 'unread message count'!
There is a JavaScript API, but I will not have a browser instance, and it seems that I need one to use it (see Stack Overflow question 964392).
I can use POP3, but since it does not support flags, I can only tell how many 'new' messages there are in the user's inbox since the last time I checked, not how many unread messages there are. (This is my current implementation. It's not what is required, but it is currently all I can achieve.)
There is IMAP access, but only for 'premium users'(Hotmail users who pay).
There's also HttpMail access, but this is poorly documented, and from testing, seems it's also only for premium users.
Unfortunately, this similar question on msdn suggests this is impossible
All I can offer is a half-solution. You could create the html page containing the script suggested by the people on MSDN but instead of setting the value of an input box to the number of unread messages - you could use Ajax to post this number back to your application. This is, of course, not a very robust solution since it depends on the browser and may very well not be cross platform. Another thing you can do is read up on running Javascript on the JVM. I don't know how good that solution is, either, but I think it's more robust once (or rather if) you can get it to work.
One potential option could be to use the HTMLUnit Java headless web browser to make the requests. HTMLUnit has very good, but not perfect, JavaScript support to handle creating the dynamic content.