Hi Stack Overflow users.
While doing some web scraping, I ran into a problem: I'm scraping through a series of webpages on a particular site, with URLs like
http://www.somewebsites.com/abc.php?number=0001
http://www.somewebsites.com/abc.php?number=0002
http://www.somewebsites.com/abc.php?number=0003
..
..
http://www.somewebsites.com/abc.php?number=1234
Something like that. Some of the pages may occasionally be down, and the server may handle this by redirecting to a different page, say the homepage. When that happens, my scraping program runs into various exceptions caused by the change in page structure (as it is a different page).
I'm wondering if there is a way to check whether the webpage I'm scraping actually exists, so that my program is not terminated in this case.
I'm using
Jsoup.connect()
to connect to that page. However, when I visit a failed (redirected) webpage, I am sent to another page. In my program, the connect call does not throw any exception. Instead, the exception I get is just an index-out-of-bounds exception, because the unexpected redirect target has a totally different structure.
Since some of the pages may be occasionally down and the server may handle it by redirecting to a different page, say the homepage
In general, when a page on a website is temporarily unavailable and the server redirects, the client gets a 3xx response code such as 301 (Moved Permanently), 302 (Found/Moved Temporarily) or 307 (Temporary Redirect), along with a "Location" header that points to the redirect target. You can configure the Connection not to follow redirects in such cases by setting followRedirects to false. Then you can verify the HTTP response code before converting the response to a Document for further processing.
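Here is a minimal sketch of that approach, assuming a recent Jsoup version; the helper name fetchIfPresent and the "only 200 counts" rule are my own choices, not something from the question:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PageChecker {
        // Returns the parsed page, or null if the server redirected or reported an error.
        static Document fetchIfPresent(String url) throws java.io.IOException {
            Connection.Response res = Jsoup.connect(url)
                    .followRedirects(false)   // do not silently follow the redirect to the homepage
                    .ignoreHttpErrors(true)   // inspect 4xx/5xx codes instead of getting an exception
                    .execute();
            if (res.statusCode() == 200) {
                return res.parse();           // genuine page: safe to scrape
            }
            return null;                      // 301/302/307 or an error code: skip this number
        }
    }

With that in place, the scraping loop can simply skip a number whenever fetchIfPresent returns null, instead of failing on the homepage's structure.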
Related
During my crawl, there is a page that redirects to a 404 error, but when I use the "readdb" command, the status of the page is still 302 instead of 404.
Then I looked in the configuration file and found the option "http.redirect.max". I have already set "http.redirect.max" to 3 and recrawled the page, but its status is still 302.
After reading the source code, I found something like:
Response response = getResponse(u, datum, false);
in the method "getProtocolOutput" of HttpBase.java. After I changed "false" to "true" and recompiled Nutch, it works.
So I wonder: is this the correct way to make Nutch follow redirects? Will this modification lead to other errors while crawling?
I think that in this case Nutch is working properly: http.redirect.max controls whether the redirect is followed immediately or queued for a later round.
If you crawl a URL (A) that redirects to a 404, the first URL still exists with a 30x status; it's the second URL (B) that has the 404 response. From Nutch's side, these are two different URLs (which makes sense).
I haven't tested your change (against other scenarios), but take a similar case: say page A redirects to a different page C (which is not a 404). Would you expect the content of page C to be attached to the URL of A, and the URL of C to be ignored entirely?
In the browser, this is usually how we perceive it, but underneath there are two different requests/URLs.
Is it possible to block redirecting to a specific site using servlets? For example, when yahoo.com is typed in the URL box, the connection should be cut off, whereas when you type some other website's URL, say google.com, the connection should remain intact. (Maybe by working with IP filters?)
Servlets would not be the right choice for this kind of requirement.
Servlets are deployed on the server and only run when a request is submitted to them from the browser. Typing a URL into the browser does not give a servlet a chance to react.
You can install a proxy on your local network and block the websites there instead.
If, however, you are talking about a browser request that calls your servlet, and your servlet redirects the request to a website like Yahoo, Google etc., then the following procedure might work for you (a rough sketch follows the list):
Maintain a list of blocked websites
Check for the requested website in the request object
If the requested website is not in the list, allow the redirect
Otherwise, forward to a page that displays a "website blocked" message
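A rough sketch of those four steps, assuming the target site arrives as a request parameter named "target" and that the blocked list and the /blocked.jsp page are placeholders for your own:

    import java.io.IOException;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class RedirectGateServlet extends HttpServlet {
        // Step 1: maintain the list of blocked websites
        private static final Set<String> BLOCKED = new HashSet<>(Arrays.asList("yahoo.com"));

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // Step 2: read the requested website from the request object
            String target = req.getParameter("target");        // e.g. http://www.yahoo.com
            String host = new URL(target).getHost().replaceFirst("^www\\.", "");
            if (BLOCKED.contains(host)) {
                // Step 4: forward to a "website blocked" page
                req.getRequestDispatcher("/blocked.jsp").forward(req, resp);
            } else {
                // Step 3: the site is not blocked, so allow the redirect
                resp.sendRedirect(target);
            }
        }
    }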
Hope this answers your question.
Please let me know if you have any further questions.
I am attempting to log into a website using Java's HttpURLConnection. I have figured out how to use a POST request to post to the website and log in, but I have no way of knowing if the login was successful or not.
Looking at some tutorials, I discerned that reloading the page usually works. The problem with this specific implementation is that upon entering credentials, the website opens a pop up window, with the same URL as the parent site.
This can be solved in either of two ways. Looking at Chrome's Developer Tools, I realized that the response to the POST request indicates whether the login was successful, as seen here
Is it possible to get the popup window, or to inspect the response to the POST request? I'd rather use native Java if possible.
Reloading will work if you keep the same HTTP session. Actually, the website cannot open a popup - the web browser does that based on the login response. You should do the same thing: check the response. Luckily you don't have to parse the response content; just check the response code. For a login, HTTP 200 typically means a successful login and HTTP 401 means failure.
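A minimal sketch of that check with plain HttpURLConnection; the URL, the form field names and the 200/401 convention are only assumptions about this particular site:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class LoginCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://example.com/login");          // hypothetical login URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

            String body = "username=me&password=secret";             // hypothetical form fields
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }

            int code = conn.getResponseCode();                        // the login response code
            if (code == HttpURLConnection.HTTP_OK) {
                System.out.println("Login succeeded");
            } else if (code == HttpURLConnection.HTTP_UNAUTHORIZED) {
                System.out.println("Login failed");
            } else {
                System.out.println("Unexpected response: " + code);
            }
        }
    }

If the site keeps the login in a session cookie, reuse the cookies from this response (for example via CookieHandler.setDefault(new CookieManager())) before reloading the page.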
Let's say someone enters the following URL in their browser:
http://www.mywebsite.com/<script>alert(1)</script>
The page is displayed as normal with the alert popup as well. I think this should result in a 404, but how do I best achieve that?
My webapp is running on a Tomcat 7 server. Modern browsers will automatically protect against this, but older ones (I am looking at you, IE6) won't.
It sounds like you are actually getting a 404 page, but that page echoes the requested resource (in this case a piece of JavaScript) back into the page without converting < and > to their respective HTML entities. I've seen this happen on several websites.
The solution would be to create a custom 404 page which doesn't echo the resource back to the page, or which does proper HTML entity conversion first. There are plenty of tutorials you can find through Google which should help you do this.
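For example, a custom 404 handler along these lines (registered as the error-page for 404 in web.xml; the markup and the tiny escaping helper are just placeholders) would stop the script from being echoed back verbatim:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class NotFoundServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // The container exposes the original request URI on this standard attribute.
            String uri = (String) req.getAttribute("javax.servlet.error.request_uri");
            String safe = uri == null ? "" : uri.replace("&", "&amp;")
                                                .replace("<", "&lt;")
                                                .replace(">", "&gt;");
            resp.setStatus(HttpServletResponse.SC_NOT_FOUND);
            resp.setContentType("text/html;charset=UTF-8");
            resp.getWriter().println("<html><body><h1>404 Not Found</h1>"
                    + "<p>No resource at " + safe + "</p></body></html>");
        }
    }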
Here's what I did:
I created a high-level servlet filter which uses OWASP's HTML sanitizer to check for dodgy characters. If there are any, I redirect to our 404 page.
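That filter might look roughly like this, assuming the OWASP Java HTML Sanitizer is on the classpath and that /404 is the app's error page (both are assumptions); an empty policy is used so that anything the sanitizer would strip counts as dodgy:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.owasp.html.HtmlPolicyBuilder;
    import org.owasp.html.PolicyFactory;

    public class XssDetectionFilter implements Filter {
        // A policy that allows no markup at all: sanitizing a clean value leaves it unchanged.
        private static final PolicyFactory STRIP_ALL = new HtmlPolicyBuilder().toFactory();

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest req = (HttpServletRequest) request;
            for (String[] values : req.getParameterMap().values()) {
                for (String value : values) {
                    // If sanitizing changes the value, it contained markup (or characters that
                    // needed escaping), so send the user to the 404 page instead of the real page.
                    // The same check could be applied to the decoded request path if attacks like
                    // the one in the question need to be caught there too.
                    if (!STRIP_ALL.sanitize(value).equals(value)) {
                        ((HttpServletResponse) response).sendRedirect(req.getContextPath() + "/404");
                        return;
                    }
                }
            }
            chain.doFilter(request, response);
        }

        @Override public void init(FilterConfig config) { }
        @Override public void destroy() { }
    }

Note that this also flags harmless values containing characters such as & or <, which is usually acceptable for a strict "redirect to 404" policy.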
You should put a filter in your webapp to protect against an XSS attack.
In the filter code, get all the parameters from the HttpServletRequest object and replace any parameter value that starts with a <script> tag with spaces.
This way any harmful JS script won't reach your server side components.
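One way to do that is to wrap the request so that the rest of the webapp only ever sees the cleaned values; the regex below, which blanks out script tags, is just an illustration of the idea:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletRequestWrapper;

    public class ScriptStrippingFilter implements Filter {

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest wrapped = new HttpServletRequestWrapper((HttpServletRequest) request) {
                @Override
                public String getParameter(String name) {
                    return clean(super.getParameter(name));
                }

                @Override
                public String[] getParameterValues(String name) {
                    String[] values = super.getParameterValues(name);
                    if (values == null) return null;
                    String[] cleaned = new String[values.length];
                    for (int i = 0; i < values.length; i++) {
                        cleaned[i] = clean(values[i]);
                    }
                    return cleaned;
                }
            };
            chain.doFilter(wrapped, response);
        }

        // Replace anything that looks like a script tag with a space.
        private static String clean(String value) {
            return value == null ? null : value.replaceAll("(?i)</?script[^>]*>", " ");
        }

        @Override public void init(FilterConfig config) { }
        @Override public void destroy() { }
    }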
In my Spring MVC application I submit a form to a page using the POST method to perform a search with some parameters. The result of the search is a list and, for every entry, it is possible to navigate to a details page. It works fine, but when a user tries to come back to the results page with the browser back button (or via JavaScript), I get a "page expired" error. Refreshing the page re-executes the POST submit and then it works fine.
To prevent this error I added
response.setHeader("Cache-Control", "no-cache");
to the search controller, and everything works fine with Safari and Firefox, but I get the same "page expired" error when I run the web application in IE (8 and 9).
What's the right way to go to a detail page and come back without getting any error?
Thanks for your time! Andrea
The right way is to use GET instead of POST: searching is an idempotent operation which doesn't cause any change on the server, and such operations should be done with a GET.
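In Spring MVC that just means mapping the search to GET and binding the criteria as request parameters. A minimal sketch; the controller name, the "q" parameter, the view name and the SearchService are placeholders for your own names:

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Controller;
    import org.springframework.ui.Model;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RequestParam;

    @Controller
    public class SearchController {

        // Hypothetical service that actually runs the search.
        public interface SearchService {
            java.util.List<?> find(String query);
        }

        @Autowired
        private SearchService searchService;

        // GET is safe to repeat, so the browser can re-issue it when the user hits Back.
        @RequestMapping(value = "/search", method = RequestMethod.GET)
        public String search(@RequestParam("q") String query, Model model) {
            model.addAttribute("results", searchService.find(query));
            return "results";   // logical view name of the results page
        }
    }

Because the search criteria now live in the URL, the results page can also be bookmarked and shared, and coming back from a detail page no longer triggers a re-POST.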
Not expiring a POST request seems to undermine the very idea of a POST:
Per RFC 2616, the POST method should be used for any context in which a request is non-idempotent: that is, it causes a change in server state each time it is performed, such as submitting a comment to a blog post or voting in an online poll.
Instead, I would restructure how you view the detail records, maybe with an overlay DIV that doesn't refresh the underlying page that shows the results. With many modern web apps, using the BACK button can cause issues with page flow.