We have a Tapestry-Spring-Hibernate webapp running on Tomcat 6, handling a few thousand requests per second. Randomly, for no apparent reason, a page displays a bunch of random characters in the browser. However, when the page is refreshed, it displays fine. Here is a screen-shot of the source of the garbled page in Chrome:
Here is what I have found so far:
It doesn't seem to be browser specific. I have witnessed this on Chrome and Firefox, but users have also reported this on IE 7 and up.
Load on the server seems to have no correlation to when this happens.
Refreshing the page displays the page normally, as if nothing ever happened.
I don't see anything relevant in the server or the application logs.
The content-type tag for the page is <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
There are a couple other webapps deployed on the same container, one being Alfresco, but they don't seem to exhibit this at all.
My question is, has someone run into this before, and if so, can they point me to where I should start looking? Is this a problem with the page having something like the incorrect content-type or the server not being able to handle it for some reason? Or could this be a framework bug in Tapestry or in the application itself? Any pointers are welcome. At this point, I am not sure where the problem is, so I wasn't sure if this goes on ServerFault or stays here.
It is most likely a bug in the application. (Most bugs are ... despite the natural tendency of programmers to blame something else.)
However, this problem could be a bit tricky to track down. I suggest that you start with the standard things:
Look at the server error logs to see if anything strange shows up at the time when one of these "events" occurs.
Look at the server access logs to see if you can identify the request that produced the garbage data.
Enable your browser's debugger and see if you can track down the bad request that way.
If you can figure out what the request that produced the bad response was, you'll have more traction in finding the cause.
FWIW - that doesn't look like the result of a character encoding problem. That looks more like binary or compressed data.
Here's one situation that has led me to see a garbled page. On its error page, Tapestry sets a response header called X-Tapestry-ErrorMessage. Evidently newlines aren't allowed in header values (at least on some browsers), so if that header contains a newline, you get the gibberish. One error message we were setting happened to have a trailing newline. I changed it to remove any newlines before setting that header, and then the error page showed correctly.
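A minimal sketch of the kind of clean-up described above, assuming you have access to the message before it goes into the header (the method and variable names are illustrative):

// Newlines are not valid inside HTTP header values, so strip CR/LF before setting the header.
private void setErrorHeader(javax.servlet.http.HttpServletResponse response, String errorMessage) {
    String safeMessage = errorMessage.replaceAll("[\\r\\n]+", " ").trim();
    response.setHeader("X-Tapestry-ErrorMessage", safeMessage);
}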
This seems to be related to gzip compression issues in the Tapestry framework (as suggested by #barnyr) and is possibly a regression bug in Tapestry 5.3. To quote Howard from a mailing list thread:
I believe this was a bug where under certain circumstances, a corrupt GZIP stream of page content would be streamed to the client; this is fixed in 5.2.6 for sure, but I thought it was fixed in 5.2.5 as well.
The quick fix is to add the following configuration symbol in the contributeApplicationDefaults method of the app's module class:
configuration.add(SymbolConstants.GZIP_COMPRESSION_ENABLED, "false");
This of course disables gzip compression, but that might be a trade-off worth making.
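For reference, here is a minimal sketch of where that line usually goes, assuming the conventional Tapestry AppModule class (the package and class name are whatever your application already uses):

package com.example.services; // illustrative package

import org.apache.tapestry5.SymbolConstants;
import org.apache.tapestry5.ioc.MappedConfiguration;

public class AppModule {
    public static void contributeApplicationDefaults(MappedConfiguration<String, String> configuration) {
        // Work around the corrupt GZIP stream bug by disabling compression entirely
        configuration.add(SymbolConstants.GZIP_COMPRESSION_ENABLED, "false");
    }
}

Once you move to a Tapestry release where the bug is fixed, the symbol can simply be removed again.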
Possibly related issues:
GZip compression should be disabled if the request is over http 1.0
Is GZIP compression compatible with XmlHttpRequest?
I get a blank screen when I execute http://localhost:9001/forms/frmservlet.
I am using Java 1.8.0_321, WebLogic 12 and Forms 12.
I have lost several days trying to fix it.
Any recommendation?
A blank page is a 'successful' response from the code running behind /forms/frmservlet.
There are a few things you can check:
Which code is sending the blank page? Is it your code or is your webapp on a different path? Is your servlet mapped correctly?
Look at the HTML source in the browser to better understand the blank page
Check WebLogic's log files to see if something unusual is in there
All this may sound pretty generic, but you did not give many details either.
I'm doing a file upload with Jersey, but I only need the filename. Internet Explorer sends the entire path, and based on what's in FormDataContentDisposition, Jersey strips out the slashes, so I can't even parse it out myself. Thanks.
Sounds like a difficult issue. The ideal case, of course, is to grab the string containing the slashes and just use String.split!
Failing that, the only strategy I can begin to think of is to try iterating through the string, checking whether folders exist for various lengths of the first part of the string. This can cause problems too, though, if you intend to find a folder "MyFolder (2)" and there's also a "MyFolder". I don't know a lot about Jersey, but I would recommend trying to find a different way to grab the string you need.
#kombat found this solution and posted it as a comment; for better visibility it is reposted here as a community wiki answer:
Add a change event to the file input tag.
Parse out the filename there, since the value still has its slashes at that point.
Stick that value into a hidden input.
That hidden input then gets submitted along with the rest of the form.
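On the server side, the hidden input then arrives as an ordinary form field next to the file part. A rough sketch of a Jersey 2.x resource reading both, where the path, field names and class are made up for illustration (this assumes the jersey-media-multipart module with MultiPartFeature registered):

import java.io.InputStream;

import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

import org.glassfish.jersey.media.multipart.FormDataParam;

@Path("/upload")
public class UploadResource {

    @POST
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    public Response upload(@FormDataParam("file") InputStream fileStream,
                           @FormDataParam("filename") String clientFileName) {
        // clientFileName comes from the hidden input populated on the client,
        // so it no longer depends on how the browser (or Jersey) mangles the path.
        // ... store fileStream under clientFileName ...
        return Response.ok(clientFileName).build();
    }
}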
It is a bug in Jersey. In the Nabble discussion http://jersey.576304.n2.nabble.com/Jersey-truncating-the-slashes-from-the-uploaded-file-name-td5984041.html the author of the bug reveals himself and acknowledges 'reusing the code' for parsing HTTP headers to parse Content-Disposition. However, the cited RFC 2616 does not specify that Content-Disposition fields are to be escaped according to the rules for HTTP headers. Quite the opposite, it states that:
Content-Disposition is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementors.
This bug already has an ugly workaround in the class org.glassfish.jersey.media.multipart.internal.MultiPartReaderClientSide in the current version of Jersey, but it doesn't work with IE 11 and Edge because it checks the User-Agent header, which has changed. There is a pull request with a fix: https://github.com/jersey/jersey/pull/233/files, but for almost 2 years nobody has cared to merge it.
You have three solutions:
1) Apply a 'fix' on the client side, which is IMHO the wrong approach, because there is no bug on the client side; the bug is in Jersey!
2) Switch from Jersey to another framework whose developers take compatibility issues more seriously instead of concentrating on maximizing code reuse.
3) Patch Jersey manually: download the sources, apply the pull request, compile and release with a modified version number.
I was getting that error when I tried the Eclipse browser. When I tried my code in Chrome, FormDataContentDisposition.getFileName() was fine.
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering whether someone would be able to figure out that their page was being scraped, whether or not they had written code for that purpose.
I wrote the code in Java, and it's pretty much just checking for one line of the HTML.
I thought I'd get some insight on that before I add any more code to this program. I mean, it's useful and all, but it's almost like a hack.
It seems like the worst-case scenario resulting from this page scraper isn't too bad, as I can just use another device later and the IP will be different. Also, it might not matter in a month. The website seems to be getting quite a lot of web traffic at the moment anyway. Whoever edits the page is probably asleep now, and it really hasn't accomplished anything at this point, so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.
It is a hack. :)
There's no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
As a sysadmin myself, yes, I'd probably notice, but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or in very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, the various linked CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.
It depends on how you have implemented it and how smart the detection tools are.
First, take care with the User-Agent. If you do not set it explicitly, it will be something like "Java/1.6". Browsers send their "unique" user agents, so you can just mimic browser behavior and send the User-Agent of MSIE or Firefox (for example).
Second, check the other HTTP headers. Some browsers send their own specific headers; take one example and follow it, i.e. add the same headers to your requests (even if you do not need them).
A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then immediately "click" a link, i.e. perform yet another HTTP GET. Put a random sleep between these operations (see the sketch below).
A browser retrieves not only the main HTML; it then downloads images and other resources. If you really do not want to be detected, you have to parse the HTML and download this stuff too, i.e. actually behave like a "browser".
And the last point: it is obviously not your case, but it is almost impossible to implement a robot that passes a Captcha. This is yet another way to detect a robot.
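A small sketch pulling the User-Agent and random-sleep points together, assuming a plain HttpURLConnection based fetch (the User-Agent string and delays are only examples):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Random;

public class PoliteFetcher {
    private static final Random RANDOM = new Random();

    public static String fetch(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        // Mimic a browser instead of the default "Java/1.x" agent
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0 Safari/537.36");
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        // Pause 2-10 seconds so the next request doesn't arrive on a machine-like schedule
        Thread.sleep(2000 + RANDOM.nextInt(8000));
        return body.toString();
    }
}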
Happy hacking!
If your scraper acts like a human, there is hardly any chance for it to be detected as a scraper. But if your scraper acts like a robot, it's not difficult to detect.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests when accessing the page and request the same things with the scraper
Time your scraper to access at the speed of a normal user
Send requests at random intervals of time instead of at fixed intervals
If possible make requests from a dynamic IP rather than a static one
Assuming you wrote the page scraper in a normal manner, i.e. it fetches the whole page and then does pattern recognition to extract what you want from the page, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.
To the server serving the page, there's no difference whether you download a page into the browser or download a page and screen scrape it. Both actions just require an HTTP request, whatever you do with the resulting HTML on your end is none of the server's business.
Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things or not depends on the server; most don't.
I'd like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.
You should definitely abide by the robots.txt, but also look at the website's T&Cs to make sure you are not in violation. There are definitely ways people can identify that you are web scraping, and there could be potential consequences for doing so. In the event that web scraping is not disallowed by the website's terms and conditions, then have fun, but make sure to still be conscientious. Don't destroy a web server with an out-of-control bot; throttle yourself to make sure you don't impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
I am trying to display dynamic data in a JSP. For that I am calling a Java method inside the JSP using a JSP expression. The Java method takes a long time to execute, but it does return a value. (I can't reduce the method's execution time.)
But my JSP shows up blank.
Can anybody explain what the reason might be and how to resolve it?
This code was not written by me, but I need to find out the root cause.
My JSP code looks like this:
display.jsp
..... hello......start...
<%= obj.getDynamicData() %>
.....completed .... end
It's likely because you're (ab)using JSP to execute some raw Java code. When an exception is thrown halfway through sending the JSP's output, the remainder of the JSP won't be sent to the browser anymore. At that point the webserver can no longer change the response into an error page with exception details either, and the webbrowser ends up with half-baked HTML output, which is often displayed as a blank page.
Any uncaught exception is usually logged in the server's log file. You need to dig through the server's logs for the exception and stack trace so that you can fix the root cause of the problem. Exceptions contain valuable information about the cause of the problem.
A half-baked HTML page is just an incomplete HTML page which the webbrowser doesn't know how to display properly. Right-click the page in the webbrowser and choose View Source. Verify whether it is as expected, if necessary with the help of the W3C validator.
Further, it may be worth the effort to try different (better) webbrowsers like Firefox and Chrome. MSIE 6/7 in particular is known to choke like that when it receives an enormous HTML <table>; it has a poor table rendering engine.
To save yourself from future trouble like this, I suggest moving all that Java code out into a servlet class, so that you get a more friendly error page in case of an exception in the Java code (at the least, it's better than digging in the server's log files). See also How to avoid Java code in JSP?
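For illustration only, a rough sketch of what that could look like, assuming a Servlet 3.0 container; the servlet class, URL pattern, attribute name and DataService are all hypothetical stand-ins, and the JSP would then merely print ${dynamicData}:

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/display")
public class DisplayServlet extends HttpServlet {

    // Hypothetical stand-in for whatever object exposes getDynamicData() in the JSP.
    private final DataService dataService = new DataService();

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Any exception thrown here surfaces as a proper container error page
        // instead of a half-rendered blank JSP.
        request.setAttribute("dynamicData", dataService.getDynamicData());
        request.getRequestDispatcher("/WEB-INF/display.jsp").forward(request, response);
    }
}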
Based on the comments made in BalusC's answer:
When I comment out the call to obj.getDynamicData(), the JSP page renders properly.
Either one of two things could be happening:
obj.getDynamicData() is throwing an exception which is not caught and handled
Your servlet container/server may be configured with some sort of "request timeout" that closes the HTTP connection if it takes more than a certain amount of time to process the request, and obj.getDynamicData() takes so long to execute that this timeout is being triggered.
Do you have any sort of logging in your code or JSP that tells you what happens server-side after this method finishes processing? A strong hint that #2 is occurring would be if you continue to see log activity from the thread processing the JSP request (and obj.getDynamicData()) after the browser has stopped waiting for the request / received the blank page.
And to rule out the simple things: are you sure that the server is actually returning an empty response, and not that your browser is showing a blank page because the server returned half an HTML page? Make sure to check View Source, use a tool like Firebug, and/or make the same HTTP request that you make in the browser from a command-line tool like curl or wget.
I am trying to handle a file upload, and I'm using the com.oreilly.servlet.multipart.MultipartParser class (from cos.jar) to parse the posted data. However, when I call the constructor for MultipartParser, I get this exception:
java.io.IOException: Corrupt form data: premature ending
at com.oreilly.servlet.multipart.MultipartParser.<init>(MultipartParser.java:166)
at com.oreilly.servlet.multipart.MultipartParser.<init>(MultipartParser.java:94)
Has anyone seen this before? From what I read, this means that the data ended before it found the boundary it was looking for. How can I fix this?
I am using cos.jar version 1.0.
Thanks!
http://www.servlets.com/cos/faq.html
This indicates there was a problem parsing the POST request submitted by the client. There can be many causes for the problem:
The client hit the STOP button (not really a problem, but it does cause a premature ending)
A bug in the web form
A bug in the servlet
A bug in the web server
A bug in the browser
A bug in the com.oreilly.servlet library itself
History has shown the web server to be the most frequent cause of problems, probably because there are so many different servers and few vendors appear to test their binary upload capability.
First, make sure your client isn't hitting the STOP button. Then, check if your problem is already posted on the "Servlet bugs you need to know about" resource on this site. If it's not well known, then you get to be among the first to learn about it! And you can share your discovery with us here!
Second, see if the upload works using the provided upload.html form and DemoRequestUploadServlet.java class. Some people have found bugs in their form that caused problems. Testing this combination will see if that's the case. One user, Duke Takle, found this exception was caused by a redirect:
I was experiencing the same "premature ending" as Albert Smith. What I've found is that the problem was isolated to I.E. 5.0. The application that troubled me was doing a redirect after the construction of a MultipartRequest. It looks like this construction went well except on I.E. 5.0 the browser attempted to make the request again and by that time the ServletInputStream was empty. I've modified the application to simply write the needed response instead of redirecting. This problem was observed and fixed as described in Tomcat 4.0 and Weblogic 6.1.
Other users have found bugs in their handling servlet where they call request.getParameter() instead of multipartRequest.getParameter(), and some servers falsely read the input stream when their getParameter() is called, causing an "unexpected end of part".
So, the problem was caused by me calling the MultipartParser constructor twice, by accident. It failed the second time, since the request had already been processed.
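For anyone hitting the same thing: the request can only be parsed once, because the parser consumes the ServletInputStream. A minimal sketch of the once-per-request pattern with cos.jar, where the size limit and the handling of the parts are only examples:

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.oreilly.servlet.multipart.FilePart;
import com.oreilly.servlet.multipart.MultipartParser;
import com.oreilly.servlet.multipart.ParamPart;
import com.oreilly.servlet.multipart.Part;

public class UploadServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Create the parser exactly once; a second MultipartParser over the same
        // request fails with "premature ending" because the stream is already consumed.
        MultipartParser parser = new MultipartParser(request, 10 * 1024 * 1024); // 10 MB limit (example)
        Part part;
        while ((part = parser.readNextPart()) != null) {
            if (part.isFile()) {
                FilePart filePart = (FilePart) part;
                String fileName = filePart.getFileName();
                // ... consume filePart.getInputStream() here, before reading the next part ...
            } else if (part.isParam()) {
                ParamPart paramPart = (ParamPart) part;
                String value = paramPart.getStringValue();
                // ... use the ordinary form field value ...
            }
        }
    }
}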