During my crawling, one page redirects to a 404 error, but when I use the "readdb" command, the status of the page is still 302 instead of 404.
I then looked at the configuration file and found the option "http.redirect.max". I have already set "http.redirect.max" to 3 and recrawled the page, but its status is still 302.
After reading the source code, I found something like:
Response response = getResponse(u, datum, false);
in the method "getProtocolOutput" of HttpBase.java. After I changed "false" to "true" and recompiled Nutch, it works as I expected.
So I wonder: is this a correct way of enabling Nutch to follow redirects? Will this modification lead to other errors while crawling?
I think that in this case Nutch is working properly: http.redirect.max controls whether the redirect is followed immediately or queued for a later fetch round.
If you crawl one URL (A) which redirects to a 404, the first URL still exists with a 30x status; it is the second URL (B) that has the 404 response. From Nutch's side, these are two different URLs (which makes sense).
I haven't tested your change (with other scenarios), but consider a similar case: say page A redirects to a different page C (which is not a 404). Would you expect the content of page C to be linked to the URL of A, and the URL of C to be ignored entirely?
In the browser, this is usually how we perceive it, but underneath there are two different requests/URLs.
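For reference, this is roughly how that property is set in nutch-site.xml (a sketch; the value 3 matches the question above):

    <property>
      <name>http.redirect.max</name>
      <value>3</value>
      <description>Maximum number of redirects the fetcher follows immediately;
      0 or a negative value means redirected URLs are recorded and fetched in a
      later round instead.</description>
    </property>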
I'm using Java (Maven), Angular (8.3.20), and a Tomcat server.
In the Java code I have a sendRedirect on an HttpServletResponse with a URL that contains a hashtag.
So for example: https://localhost:4200/api/hello#world.
But the string after the # (the "world" part) does not appear in the frontend.
In the console/network tab the header I receive is https://localhost:4200/api/hello, so the #world is gone.
I have tried changing the hashtag to an encoded value (%23), but that does not work either.
How can I get the part after the hashtag sent to the frontend? That is, from the backend (a URL with a pound sign/hashtag) to the frontend.
I looked into this problem; the solution I found is to encode # as %23,
but it seems that this does not work in all browsers (e.g. IE, Safari), so try using Google Chrome, for example.
More information:
https://support.google.com/richmedia/answer/190941?hl=en
Routing strategy
What Angular exploits with the HashLocationStrategy is the fact that any content after the # symbol isn't sent to the server, which makes it ideal for storing application state.
Why is it useful
With hash routes, a page reload (or a revisit through a bookmark) of a subpage like
http://localhost:4200/#/articles/35
doesn't query the server for the subpage, but instead returns the main application page.
http://localhost:4200/
This way the server implementation only needs to know about the root page (which is the only thing that will ever be queried).
Using the PathLocationStrategy (default) the server needs to be set up to handle requests for every single URL your application implements.
So what you can do is: disable the HashLocationStrategy in your Angular routes file. Or you can append the data to the URL without the #, send it to the server, and handle the URL on the server side.
I had the same problem; what I did was add the part that comes after the # as a query parameter, so that the server can access it.
E.g. https://localhost:4200/api/test?remainingDataFromHash=requiredData
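On the backend, the workaround is a one-liner; a minimal sketch, assuming the hypothetical parameter name above ("world" stands in for the data that used to follow the #):

    import java.io.IOException;
    import java.net.URLEncoder;
    import javax.servlet.http.HttpServletResponse;

    // Inside the servlet or controller method that currently calls sendRedirect:
    void redirectWithFragmentData(HttpServletResponse response) throws IOException {
        String fragmentData = "world"; // the part that used to sit after the '#'
        // Fragments never reach the server, so carry the data in the query string instead.
        response.sendRedirect("https://localhost:4200/api/hello?remainingDataFromHash="
                + URLEncoder.encode(fragmentData, "UTF-8"));
    }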
Hi Stack Overflow users.
When I was doing web scraping, I encountered a problem when scraping through a series of webpages of a particular site, with their URLs being
http://www.somewebsites.com/abc.php?number=0001
http://www.somewebsites.com/abc.php?number=0002
http://www.somewebsites.com/abc.php?number=0003
..
..
http://www.somewebsites.com/abc.php?number=1234
Something like this. Since some of the pages may occasionally be down and the server may handle that by redirecting to a different page, say the homepage, my scraping program will encounter various exceptions related to the change in page structure (as it is a different page).
I'm wondering if there is a way to check whether a webpage I'm scraping exists or not, to prevent my program from being terminated in this case.
I'm using
Jsoup.connect()
to connect to each page. However, when I visit a failed (redirected) webpage, I am sent to another page. My program does not throw any exception about the connection; instead, the exception is an index-out-of-bounds exception, because the unexpected redirect target has a totally different structure.
Since some of the pages may occasionally be down and the server may handle that by redirecting to a different page, say the homepage
In general, when a page on a website is temporarily unavailable and gets redirected, the client gets a response code of 301 (Moved Permanently), 302 (Found), or 307 (Temporary Redirect), with a "Location" header that points to the redirect target. You can configure the Connection not to redirect in such cases by setting followRedirects to false. Then you can verify the HTTP response code before converting the response to a Document for further processing.
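A minimal sketch of that check with Jsoup (the URL is a placeholder; ignoreHttpErrors is enabled so 4xx/5xx responses can be inspected instead of throwing):

    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PageChecker {
        public static void main(String[] args) throws IOException {
            String url = "http://www.somewebsites.com/abc.php?number=0001";
            Connection.Response response = Jsoup.connect(url)
                    .followRedirects(false) // report the 30x instead of silently following it
                    .ignoreHttpErrors(true) // return 4xx/5xx responses instead of throwing
                    .execute();
            if (response.statusCode() == 200) {
                Document doc = response.parse(); // no redirect: safe to scrape as usual
                System.out.println(doc.title());
            } else {
                System.out.println("Skipping " + url + " (status " + response.statusCode() + ")");
            }
        }
    }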
Let's say someone enters the following URL in their browser:
http://www.mywebsite.com/<script>alert(1)</script>
The page is displayed as normal with the alert popup as well. I think this should result in a 404, but how do I best achieve that?
My webapp is running on a Tomcat 7 server. Modern browsers will automatically protect against this, but older ones (I am looking at you, IE6) won't.
It sounds like you are actually getting a 404 page, but that page echoes the requested resource (in this case a piece of JavaScript code) back without converting < and > to their respective HTML entities. I've seen this happen on several websites.
The solution would be to create a custom 404 page which doesn't echo back the resource to the page, or that does proper HTML entity conversion beforehand. There are plenty of tutorials you can find through Google which should help you do this.
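As a rough sketch, such a page could be a servlet wired up via an <error-page> entry in web.xml (names here are hypothetical):

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class NotFoundServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            resp.setStatus(HttpServletResponse.SC_NOT_FOUND);
            resp.setContentType("text/html;charset=UTF-8");
            // The container stores the original URI here when dispatching to an error page.
            String uri = (String) req.getAttribute("javax.servlet.error.request_uri");
            if (uri == null) uri = req.getRequestURI();
            // Convert markup characters to entities so nothing is echoed back as live HTML.
            String safe = uri.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
            PrintWriter out = resp.getWriter();
            out.println("<html><body><h1>404 - Not Found</h1>");
            out.println("<p>No resource at " + safe + "</p></body></html>");
        }
    }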
Here's what I did:
I created a high-level servlet filter which uses OWASP's HTML sanitizer to check for dodgy characters. If there are any, I redirect to our 404 page.
You should put a filter in your webapp to protect against XSS attacks.
In the filter code, get all the parameters from the HttpServletRequest object and replace any parameter whose value contains markup such as a <script> tag with spaces.
This way no harmful JS script will reach your server-side components.
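A rough sketch of such a filter, using a plain character test in place of a full OWASP sanitizer policy (the class name is made up):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class XssGuardFilter implements Filter {
        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;
            String query = request.getQueryString();
            String raw = request.getRequestURI() + (query == null ? "" : "?" + query);
            // Markup characters have no business in this app's URLs: answer with a 404.
            // getRequestURI() is undecoded on Tomcat, so check the encoded forms too.
            if (raw.contains("<") || raw.contains(">")
                    || raw.contains("%3C") || raw.contains("%3c")) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            chain.doFilter(req, res);
        }

        @Override
        public void init(FilterConfig filterConfig) {
        }

        @Override
        public void destroy() {
        }
    }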
In my Spring MVC application I send a form to a page using the POST method to perform a search with some parameters. The result of the search is a list, and for every entry it is possible to navigate to a details page. It works fine, but when a user tries to come back to the results page with the browser's back button (or via JavaScript), I get an "expired page" error. Refreshing the page re-executes the POST submit and it works fine.
To prevent this error I added
response.setHeader("Cache-Control", "no-cache");
to the search controller, and everything works fine with Safari and Firefox, but I get the same "expired page" error when I run the web application in IE (8 and 9).
What's the right way to go to a detail page and come back without getting any error?
Thanks for your time! Andrea
The right way is to use GET instead of POST: searching is an idempotent operation which doesn't cause any change on the server, and such operations should be done with GET.
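A minimal sketch of what that looks like in Spring MVC (mapping, parameter, and view names are made up for illustration):

    import org.springframework.stereotype.Controller;
    import org.springframework.ui.Model;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RequestParam;

    @Controller
    public class SearchController {

        @RequestMapping(value = "/search", method = RequestMethod.GET)
        public String search(@RequestParam("q") String query, Model model) {
            // Parameters travel in the query string, so the results URL is
            // bookmarkable and the back button reloads it without re-posting.
            model.addAttribute("query", query);
            // Run the actual search here and add the result list to the model.
            return "searchResults"; // hypothetical view name
        }
    }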
Not expiring a POST request seems to undermine the very idea of a POST:
Per RFC 2616, the POST method should be used for any context in which a request is non-idempotent: that is, it causes a change in server state each time it is performed, such as submitting a comment to a blog post or voting in an online poll.
Instead, I would restructure how you view the detail records, maybe with an overlay DIV that doesn't refresh the underlying page that shows the results. With many modern web apps, using the BACK button can cause issues with page flow.
I want to know how I can identify whether a page redirects to some other web page or not.
I'm using the URLConnection class of Java.
One way that I know of is to check the 'Location' header field of the established connection.
But I'm not getting to the solution :-(
Please assist.
Start using HttpClient and stop worrying about redirects.
Read here for more details.
Java's HttpURLConnection follows redirects automatically, unless you call
HttpURLConnection.setFollowRedirects(false);
Try calling it before you open the URL (you only need to do it once; it's a static method).
I am not sure if the Location header is respected by Java (it may be a browser-only thing), but you should see an HTTP response code of 301 or 302 (if I am not mistaken), which indicates a redirect at the HTTP level.
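Putting the two answers together, a minimal sketch (the URL is a placeholder):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RedirectCheck {
        public static void main(String[] args) throws IOException {
            HttpURLConnection.setFollowRedirects(false); // global: keep redirects visible
            URL url = new URL("http://www.example.com/some-page");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            int code = conn.getResponseCode();
            if (code >= 300 && code < 400) {
                // A 3xx status means a redirect; the Location header names the target.
                System.out.println("Redirects to: " + conn.getHeaderField("Location"));
            } else {
                System.out.println("No redirect, status " + code);
            }
            conn.disconnect();
        }
    }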