I have a URL.
How can I find all the existing sub-URLs of this page?
For example,
http://tut.by/car/12324 - exists
................/car/66666 - doesn't exist
Preferably in Java.
I have already experimented with almost all of the crawlers listed at java-source.net/open-source/crawlers. None of them can do that; they can only follow hrefs.
Thanks in advance!
That's going to be nearly impossible, if there's no index page. While many web servers will create an HTML index page for you if one isn't provided by the site creator, it's a very common practice to disable directory listing, for security reasons.
What you seek is not possible. The server defines the actual meaning of the path in a URL, and it's not possible to 'guess' unless you know a great deal about the server and how it processes the URLs.
I agree; the information you are seeking would be in an index page. E.g., sometimes you go to a website and delete the "page.html" part of the URL, and voilà, you see all the pages and folders in that directory.
But as mentioned, this is often disabled for security reasons, so users cannot wander around.
Therefore, your other choices are to either:
A) Guess: keep trying different combinations to brute-force the page URLs (00001, 00002, 00003, etc.).
B) Crawl the website starting at its root, following links from each page to other pages on the website, until all links have been exhausted. Obviously, pages on the site with no links to them will never be found.
C) Ask the owner of the website for the information you require.
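Option A can be sketched roughly as follows. This is only an illustration under assumptions: the class name UrlProber, the zero-padded numeric ID scheme, and the idea that "exists" means an HTTP 200 answer to a HEAD request are all made up for the example, and a real site may answer differently (redirects, soft 404 pages, blocked HEAD requests).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for brute-force probing of numbered pages.
final class UrlProber {

    /** Builds base + id for every id in [from, to], zero-padded to `width` digits. */
    static List<String> candidates(String base, int from, int to, int width) {
        List<String> urls = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            urls.add(base + String.format(Locale.ROOT, "%0" + width + "d", id));
        }
        return urls;
    }

    /** Sends a HEAD request; treats only a 200 response as "the page exists" (an assumption). */
    static boolean exists(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5_000);
        conn.setReadTimeout(5_000);
        try {
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would loop over candidates(...) and call exists(...) on each one, pausing between requests so the probing does not hammer the server.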
Currently my java code uses
response.sendRedirect(request.getRequestURL().toString());
which is an open redirect.
I have to fix this, but I cannot white-list it since there are too many URLs associated with it.
I have tried the following solution with ESAPI, but it won't work for me.
ESAPI.httpUtilities().setCurrentHTTP(req, resp);
ESAPI.httpUtilities().sendRedirect(location);
ESAPI.httpUtilities().clearCurrent();
I am new to ESAPI.
[Disclaimer]
I'm project co-lead on ESAPI.
"I have to fix this, but I cannot white-list it since there are too many URLs associated with it."
Essentially, "I have to fix the problem, but I am restricting myself from the easiest solution."
Here are the best practices enumerated by @jww:
Simply avoid using redirects and forwards.
If used, do not allow the URL as user input for the destination. This can usually be done. In this case, you should have a method to validate the URL.
If user input can’t be avoided, ensure that the supplied value is valid, appropriate for the application, and is authorized for the user.
It is recommended that any such destination input be mapped to a value, rather than the actual URL or portion of the URL, and that server side code translate this value to the target URL.
Sanitize input by creating a list of trusted URLs (lists of hosts or a regex).
Force all redirects to first go through a page notifying users that they are going off of your site, and have them click a link to confirm.
These are literally all the solutions available to you. Some web frameworks make this easy for you, like Spring MVC with Spring Security.
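The "map the destination to a value" practice from the list above could look something like this minimal sketch. The class name RedirectTargets, the keys, and the target paths are hypothetical; the point is only that the client sends a short key and the server alone decides which URL it stands for.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup table: the request carries a key, never a URL.
final class RedirectTargets {

    // Assumed example keys and paths; a real application would load its own.
    private static final Map<String, String> TARGETS = Map.of(
            "home", "/index.jsp",
            "profile", "/account/profile.jsp");

    /** Returns the server-side URL for a key, or empty if the key is null or unknown. */
    static Optional<String> resolve(String key) {
        return Optional.ofNullable(key).map(TARGETS::get);
    }
}
```

In a servlet this might be used as response.sendRedirect(RedirectTargets.resolve(request.getParameter("dest")).orElse("/index.jsp")), where the parameter name "dest" and the fallback page are again only assumptions.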
These lines:
ESAPI.httpUtilities().setCurrentHTTP(req, resp);
ESAPI.httpUtilities().sendRedirect(location);
ESAPI.httpUtilities().clearCurrent();
Don't work because you have to inspect the user input before performing the redirect.
You are definitely going to want to white-list this, at a minimum based on domain names. Restrict it as much as possible. E.g., if your app is hosted at https://myApp.example.com/, redirecting to anywhere on your site is probably okay. (I write probably, because if it can be used as a way to bypass authorization checks, say on a multi-sequence page series, then it might not be okay. But as long as your regular authorization checks pick up and validate the redirect, you generally will be okay.) But what about redirects to https://anotherApp.example.com/? Would those be okay? What about anything in the "example.com" domain? Are there other third-party domains that you need to white-list? If so, be sure to list those URLs as well.
The one thing that you want to avoid is completely open redirects, and for that you need some type of white-listing. You could build some custom validators using ESAPI to do this, but it's probably easier to write it without ESAPI. If you have a bunch of URLs that you have to white-list, keep them in a configuration file that's not part of your .war / .ear file so you can easily update it without redeploying your application, and just (re)read the config file when it gets updated.
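A domain-based check written without ESAPI might look like the sketch below. It is an illustration, not a complete defense: the class name RedirectValidator is made up, the allowed host names are assumed to be stored in lower case, and the sketch deliberately accepts only absolute https URLs, so relative paths would need separate handling.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical validator: allows a redirect only to an explicitly listed host.
final class RedirectValidator {

    private final Set<String> allowedHosts; // assumed to hold lower-case host names

    RedirectValidator(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** True only for well-formed absolute https URLs whose host is on the list. */
    boolean isAllowed(String location) {
        if (location == null) {
            return false;
        }
        URI uri;
        try {
            uri = new URI(location);
        } catch (URISyntaxException e) {
            return false; // malformed input is never redirected to
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false; // rejects relative and scheme-relative ("//host/") values
        }
        return "https".equalsIgnoreCase(scheme)
                && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
    }
}
```

The set of hosts could be read from the external configuration file mentioned above, and response.sendRedirect(location) would be called only when isAllowed(location) returns true.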
Hope this helps.
-kevin
Thanks for all your suggestions and comments.
I found that the lines
ESAPI.httpUtilities().setCurrentHTTP(req, resp);
ESAPI.httpUtilities().sendRedirect(location);
ESAPI.httpUtilities().clearCurrent();
are now working fine for me. After a long struggle, I found that my code uses the latest version of commons-configuration.jar, but when I added ESAPI as a dependency, ESAPI pulled in an old version of the same jar, which was not compatible with my code. I excluded it from the ESAPI dependency using an exclusion in the pom, and it worked for me.
I use Wicket's AjaxFallbackLink in a number of places. This works fine for users, but it's giving us some SEO headaches.
When Google crawls one of our pages, it might be hours or days before they return and try crawling the AjaxFallbackLinks on that page. Of course since the links look like this:
http://example.com/?wicket:interface=:1869:mediaPanel:permissionsLink::IBehaviorListener:0:2
... the session is no longer valid by the time the crawler returns. This results in a ton of 404 errors on our site, which presumably harms our SEO.
My question: how can I make the Ajax links "stable" (like a BookmarkablePageLink) for search engines, but still retain the Ajax behavior for interactive users?
You can tell Google to ignore certain URL parameters by using the URL Parameter options in Google Webmaster Tools. As of July 2011, you can even tell Google what to do in the case where changing the URL parameters has an effect on the page content (e.g. paging or sorting).
To access the feature, log into your Google webmaster tools account,
click on the site you want to configure, and then choose Site
configuration > URL parameters. You’ll see a list of parameters Google
has found on the site, along with the number of URLs Google is
“monitoring” that contain this parameter.
The default behavior is “Let Googlebot decide”. This results in Google
figuring out duplicates and clustering them.
http://searchengineland.com/google-adds-url-parameter-options-to-google-webmaster-tools-86769
The question for you is whether the content of the page does change when you ignore the wicket:interface params. If it does, maybe you need to explore moving to a stateless Ajax fallback, such as the one described here:
https://github.com/jolira/wicket-stateless
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering whether someone would be able to figure out that the page was being scraped, whether or not they had written code for that purpose.
I wrote the code in Java, and it's pretty much just checking for one line of the HTML code.
I thought I'd get some insight on that before I add any more code to this program. I mean, it's useful and all, but it's almost like a hack.
Seems like the worst case scenario as a result of this page scraper isn't too bad as I can just use another device later and the IP will be different. Also it might not matter in a month. The website seems to be getting quite a lot of web traffic anyways at the moment. Whoever edits the page is probably asleep now, and it really hasn't accomplished anything at this point so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.
It is a hack. :)
There's no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
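As a rough idea of what obeying robots.txt involves, here is a deliberately simplified sketch. The class name RobotsRules is made up, and the parser only understands "User-agent: *" groups and plain Disallow prefixes; it ignores Allow lines, wildcards, and crawl-delay hints, so an existing crawler library would be a better choice for real use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified, hypothetical robots.txt reader (Disallow prefixes for "*" only).
final class RobotsRules {

    /** Collects the Disallow prefixes from groups that apply to every user agent. */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unrecognized line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    inWildcardGroup = false; // a new group starts here
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```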
As a sysadmin myself, yes I'd probably notice but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or in very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, various linked in CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.
It depends on how you have implemented this and how smart the detection tools are.
First, take care of the User-Agent header. If you do not set it explicitly, it will be something like "Java/1.6". Browsers send their own "unique" user agents, so you can mimic browser behavior and send the User-Agent of MSIE or Firefox, for example.
Second, check the other HTTP headers. Some browsers probably send their own specific headers. Take one browser as an example and follow it, i.e. add the same headers to your requests (even if you do not need them).
A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then "click" a link, i.e. perform yet another HTTP GET. Put a random sleep between these operations.
A browser retrieves not only the main HTML. It then downloads images and other resources. If you really do not want to be detected, you have to parse the HTML and download these resources too, i.e. actually behave like a browser.
And the last point: it is obviously not your case, but it is almost impossible to implement a robot that passes a CAPTCHA. This is yet another way to detect robots.
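Setting the request headers explicitly, as in the first two points, can be sketched with the standard HttpURLConnection. The class name RequestHeaders and the particular Accept values are assumptions for illustration; the User-Agent string is whatever the caller decides to pass in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper that prepares (but does not yet send) a GET request.
final class RequestHeaders {

    /** Opens a connection object and sets the headers; no network traffic happens here. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            // Assumed example values for headers a browser typically sends.
            conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
            conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The request is only sent once the caller reads from the connection, for example with conn.getInputStream().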
Happy hacking!
If your scraper acts like a human, there is hardly any chance of it being detected as a scraper. But if your scraper acts like a robot, it's not difficult to detect.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests when accessing the page, and request the same resources with the scraper
Time your scraper to access at the speed of a normal user
Send requests at random intervals of time instead of at fixed intervals
If possible make requests from a dynamic IP rather than a static one
Assuming you wrote the page scraper in a normal manner, i.e., it fetches the whole page and then does pattern recognition to extract what you want from the page, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.
To the server serving the page, there's no difference whether you download a page into the browser or download a page and screen scrape it. Both actions just require an HTTP request; whatever you do with the resulting HTML on your end is none of the server's business.
Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things or not depends on the server; most don't.
I'd like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.
You should definitely abide by robots.txt, but also look at the website's T&Cs to make sure you are not in violation. There are definitely ways that people can identify that you are web scraping, and there could be potential consequences for doing so. In the event that web scraping is not disallowed by the website's Terms and Conditions, then have fun, but make sure to still be conscientious. Don't destroy a web server with an out-of-control bot; throttle yourself to make sure you don't impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
I am currently in the process of developing an application that will request some information from Websites. What I'm looking to do is parse the HTML files through a connection online. I was just wondering, by parsing the Website will it put any strain on the server, will it have to download any excess information or will it simply connect to the site as I would do through my browser and then scan the source?
If this is putting extra strain on the Website then I'm going to have to make a special request to some of the companies I'm scanning. However if not then I have the permission to do this.
I hope this made some sort of sense.
Kind regards,
Jamie.
No extra strain on other people's servers. The server will get your simple HTTP GET request; it won't even be aware that you're then parsing the page/HTML.
Have you checked this: JSoup?
Consider doing the parsing and the crawling/scraping in separate steps. If you do that, you can probably use an existing open-source crawler such as crawler4j that already has support for politeness delays, robots.txt, etc. If you just blindly go grabbing content from somebody's site with a bot, the odds are good that you're going to get banned (or worse, if the admin is feeling particularly vindictive or creative that day).
Depends on the website. If you do this to Google, then most likely you will be put on hold for a day. If you parse Wikipedia (which I have done myself), it won't be a problem because it's already a huge, huge website.
If you want to do it the right way, first respect robots.txt, then try to scatter your requests. Also try to do it when the traffic is low, like around midnight and not at 8 AM or 6 PM when people get to their computers.
Besides Hank Gay's recommendation, I can only suggest that you can also re-use some open-source HTML parser, such as Jsoup, for parsing/processing the downloaded HTML files.
You could use HtmlUnit. It gives you a virtual, GUI-less browser.
Your Java program hitting other people's server to download the content of a URL won't put any more strain on the server than a web browser doing so-- essentially they're precisely the same operation. In fact, you probably put less strain on them, because your program probably won't be bothered about downloading images, scripts etc that a web browser would.
BUT:
if you start bombarding a server of a company with moderate resources with downloads or start exhibiting obvious "robot" patterns (e.g. downloading precisely every second), they'll probably block you; so put some sensible constraints on what you do (e.g. every consecutive download to the same server happens at random intervals of between 10 and 20 seconds);
when you make your request, you probably want to set the "User-Agent" request header either to mimic an actual browser, or to be open about what it is (invent a name for your "robot", create a page explaining what it does and include a URL to that page in the header)-- many server owners will let through legitimate, well-behaved robots, but block "suspicious" ones where it's not clear what they're doing;
on a similar note, if you're doing things "legally", don't fetch pages that the site's "robots.txt" file prohibits you from fetching.
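The random spacing between downloads suggested above (somewhere between 10 and 20 seconds) can be sketched like this. The class name PoliteDelay is made up for the example, and the bounds are simply the figures from the text.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper that spaces out consecutive requests to the same server.
final class PoliteDelay {

    /** Picks a delay in milliseconds, uniformly from [minMillis, maxMillis] inclusive. */
    static long nextDelayMillis(long minMillis, long maxMillis) {
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Sleeps for a random 10 to 20 seconds; call this between two downloads. */
    static void pause() throws InterruptedException {
        Thread.sleep(nextDelayMillis(10_000, 20_000));
    }
}
```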
Of course, within some bounds of "non-malicious activity", in general it's perfectly legal for you to make whatever request you want whenever you want to whatever server. But equally, that server has a right to serve or deny you that page. So to prevent yourself from being blocked, one way or another, you need to either get approval from the server owners, or "keep a low profile" in your requests.
I am looking to develop an app that will take login details from the user, go to a website, login, return values on the web page and then display them to the user on the phone.
Does Java have this functionality? Will I need to use JavaScript instead, maybe? Do these answers depend on the website that I am trying to access?
In my head I figure that I could just read in the parameters as strings or chars, parse the webpage for the appropriate form and "paste" the appropriate value into the form "box". However, I have never attempted anything like this with coding, so I am completely new to the idea and don't really know where to start. I tried googling around, but any information that I found was either irrelevant or conflicting.
I'm not looking for the code to do it, because I will not really learn anything from that, but a finger in the right direction would be great. I really do want to try to get better at programming, so that's why I've started to give myself these little side projects.
Any help that can be offered would be great
Ian,
You can try using the HttpClient library (http://hc.apache.org/httpclient-3.x/) from Apache. It lets you programmatically access a website from Java code. You will need to do the following things:
Use the http-client lib to POST the data to the web site.
Receive the html response.
Use some html parser or xpath to retrieve the values from the response html.
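The first of these steps can be sketched as below. Note the substitutions: this uses the JDK's built-in java.net.http API (Java 11+) rather than the Apache library named above, and the class name LoginRequest and the form field names "username" and "password" are assumptions; the real names have to be read from the login form of the target site.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical builder for a form-encoded login POST.
final class LoginRequest {

    /** Encodes fields as application/x-www-form-urlencoded, keeping map order. */
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) the POST request carrying the credentials. */
    static HttpRequest build(String loginUrl, String user, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", user);     // assumed field name
        fields.put("password", password); // assumed field name
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }
}
```

Sending would then be HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the returned body is the HTML to hand to a parser for the last step.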
You would need a script which accesses the webpage and enters the data, but in my opinion this is illegal, because you are accessing a secured area and are able to look at sensitive data. Also, accessing the page via a script is "botting"; most pages have safety precautions to prevent the execution of scripts, because most of them are harmful.
In my opinion there is no legal and easy solution to this.