I want to write an application using Java6 that can check a users Hotmail inbox for the 'unread message count'!
There is a Javascript API but I will not have a browser instance, and it seems that I need one to use it. (see stakoverflow question: 964392 )
I can use POP3, but since it does not support flags, I can only tell how many 'new' messages there are in the users Inbox since the last time I checked, not how many unread messages there are. ( This is my current implementation, it's not what is required, but is currently all I can achieve )
There is IMAP access, but only for 'premium users'(Hotmail users who pay).
There's also HttpMail access, but this is poorly documented, and from testing, seems it's also only for premium users.
Unfortunately, this similar question on msdn suggests this is impossible
EDIT:
All I can offer is a half-solution. You could create the html page containing the script suggested by the people on MSDN but instead of setting the value of an input box to the number of unread messages - you could use Ajax to post this number back to your application. This is, of course, not a very robust solution since it depends on the browser and may very well not be cross platform. Another thing you can do is read up on running Javascript on the JVM. I don't know how good that solution is, either, but I think it's more robust once (or rather if) you can get it to work.
One potential option could be to use the HTMLUnit Java headless web browser to make the requests. HTMLUnit has very good, but not perfect, JavaScript support to handle creating the dynamic content.
Related
I've to get all active browsers that are opened in the local machine along with corresponding session ids.
I've made little research and i do find the below link.
It's giving the active browser name which is not enough though
I know there is some plugin's available to get the sessions but i need to do it programmatically.
Thanks in advance!!
Short answer: it's impossible.
Long answer: Nothing distinguish browser from any other program. In fact many programs have browser functionality embedded in it.
If you define browser as one of known program such as "Firefox", "Chrome", "IE" etc. than you can do something to achieve some result.
First of all: forget about Java, you need to work with Win-API and Java is not best solution to do it.
You can enumerate process and find browsers by executable name.
Then you can check each browser documentation to find if there any integration API. Each browser will have its own API (or do not have).
Then using this API you can extract what is available.
So it will be hard if even possible to get what you want.
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering if someone would be able to figure out that the code was being page scraped, whether or not they had written code for that purpose?
I wrote the code in java, and it's pretty much just checking for one line of the html code.
I thought I'ld get some insight on that before I add anymore code to this program. I mean it's useful, and all, but it's almost like a hack.
Seems like the worst case scenario as a result of this page scraper isn't too bad as I can just use another device later and the IP will be different. Also it might not matter in a month. The website seems to be getting quite a lot of web traffic anyways at the moment. Whoever edits the page is probably asleep now, and it really hasn't accomplished anything at this point so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.
It is a hack. :)
There's no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
As a sysadmin myself, yes I'd probably notice but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or in very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, various linked in CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.
It depends on how have you implemented this and how smart are the detection tools.
First take care about User-Agent. If you do not set it explicitly it will be something like "Java-1.6". Browsers send their "unique" user agents, so you can just mimic the browser behavior and send User-Agent of MSIE, or FireFox (for example).
Second, check other HTTP headers. Probably some browsers send their specific headers. Take one example and follow it, i.e. try to add the headers to your requests (even if you do not need them).
Human user acts relatively slowly. Robot may act very quickly, i.e. retrieve the page and then "click" link, i.e. perform yet another HTTP GET. Put random sleep between these operations.
Browser retrieves not only the main HTML. Then it downloads images and other stuff. If you really do not want to be detected you have to parse HTML and download this stuff, i.e. actually be "browser".
And the last point. It is obviously not your case but it is almost impossible to implement robot that passes Capcha. This is yet another way to detect robot.
Happy hacking!
If your scraper acts like a human then there is a hardly any chance for it to be detected as a scraper. But if your scraper acts like a robot then its not difficult to be detected.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests for when accessing the page and access the same with the scraper
Time your scraper to access at the speed of a normal user
Send requests at random intervals of time instead of at fixed intervals
If possible make requests from a dynamic IP rather than a static one
assuming you wrote the page scraper in a normal manner, i.e., it fetches the whole page and then does pattern recognition to extract what you want from the page, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. all their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.
To the server serving the page, there's no difference whether you download a page into the browser or download a page and screen scrape it. Both actions just require an HTTP request, whatever you do with the resulting HTML on your end is none of the server's business.
Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things or not depends on the server, most don't.
I'd like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.
You should definitely abide the robots.txt but also look at the websites T&C's to make sure you are not in violation. There are definitely ways that people can identify you are web scraping and there could be potential consequences for doing so. In the event that web scraping is not disallowed by the website's Terms and Conditions, then have fun but make sure to still be conscionable. Dont destroy a webserver with an out of control bot, throttle yourself to make sure you dont impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
I am currently in the process of developing an application that will request some information from Websites. What I'm looking to do is parse the HTML files through a connection online. I was just wondering, by parsing the Website will it put any strain on the server, will it have to download any excess information or will it simply connect to the site as I would do through my browser and then scan the source?
If this is putting extra strain on the Website then I'm going to have to make a special request to some of the companies I'm scanning. However if not then I have the permission to do this.
I hope this made some sort of sense.
Kind regards,
Jamie.
No extra strain on other people servers. The server will get your simple HTML GET request, it won't even be aware that you're then parsing the page/html.
Have you checked this: JSoup?
Consider doing the parsing and the crawling/scraping in separate steps. If you do that, you can probably use an existing open-source crawler such as crawler4j that already has support for politeness delays, robots.txt, etc. If you just blindly go grabbing content from somebody's site with a bot, the odds are good that you're going to get banned (or worse, if the admin is feeling particularly vindictive or creative that day).
Depends on the website. If you do this to Google then most likely you will be on a hold for a day. If you parse Wikipedia, (which I have done myself) it won't be a problem because its already a huge, huge website.
If you want to do it the right way, first respect robots.txt, then try to scatter your requests. Also try to do it when the traffic is low. Like around midnight and not at 8AM or 6PM when people get to computers.
Besides Hank Gay's recommendation, I can only suggest that you can also re-use some open-source HTML parser, such as Jsoup, for parsing/processing the downloaded HTML files.
You could use htmlunit. It gives you virtual gui less browser.
Your Java program hitting other people's server to download the content of a URL won't put any more strain on the server than a web browser doing so-- essentially they're precisely the same operation. In fact, you probably put less strain on them, because your program probably won't be bothered about downloading images, scripts etc that a web browser would.
BUT:
if you start bombarding a server of a company with moderate resources with downloads or start exhibiting obvious "robot" patterns (e.g. downloading precisely every second), they'll probably block you; so put some sensible constraints on what you do (e.g. every consecutive download to the same server happens at random intervals of between 10 and 20 seconds);
when you make your request, you probably want to set the "referer" request header either to mimic an actual browser, or to be open about what it is (invent a name for your "robot", create a page explaining what it does and include a URL to that page in the referer header)-- many server owners will let through legitimate, well-behaved robots, but block "suspicious" ones where it's not clear what they're doing;
on a similar note, if you're doing things "legally", don't fetch pages that the site's "robot.txt" files prohibits you from fetching.
Of course, within some bounds of "non-malicious activity", in general it's perfectly legal for you to make whatever request you want whenever you want to whatever server. But equally, that server has a right to serve or deny you that page. So to prevent yourself from being blocked, one way or another, you need to either get approval from the server owners, or "keep a low profile" in your requests.
I need to understand the directions in need to look into to Writing a program that figures out what all websites have been hit by a user using his browser. I want to write a standalone program. Can anybody direct me to some API which may help me figure this out.
Well, first of all that depends on which browser do you need to check. I'm guessing that you need to check the currently set default system browser. Anyway, that will require a lot of browser research and few JNI calls.
To find a default browser you would need to check HKEY_CLASSES_ROOT\http\shell\open\command (for Windows) and various configuration files under different linux for different window managers.
Then you would need to read the history of the specific browser from the format of that browser. For example, Firefox stores it's history in sqlite format in the profile directory in places.sqlite file. Chrome on other hand stores it in %home%/User Data/Default/history. So you would need a separate parser for each browser.
Basically, if you need a universal browser history reader - it's a load of work and research.
As it was clarified by the author in his comments - he needs to check what is user currently browsing.
The only truly browser and OS independent way is through proxy. You need to create a HTTP(S) proxy with Java (there are some implementations out there already) and then reconfigure the desired browser to use the proxy running at localhost. When your proxy is used - it will be able to track every bit of traffic the user tries to load.
This information is stored in a SQLite database in firefox:
The file "places.sqlite" stores the annotations, bookmarks, favorite
icons, input history, keywords, and browsing history (a record of
visited pages).
http://kb.mozillazine.org/Places.sqlite
Other browsers probably have similar approaches.
Any language with drivers for SQLite, and that includes Java, C, C#, C++, ruby, and, yes, even javascript, should be equally capable of accessing this database.
Speaking for myself, I would be interested collaborating on such a stand-alone program in Java should the OP put his code on github.
I'd like to store then later display user-entered content securely with minimal effort (my goal is a web app not writing a bunch of security-related code).
EDIT: Google App Engine for Java
I'm working with the same issue myself; but I haven't had the chance to get it out into the real world yet; so please just keep in mind that MY ANSWER IS NOT BATTLE TESTED. USE AT YOUR OWN RISK.
First, you need to ask yourself if you're going to be allowing the user to use ANY html markup. So, for example, can the user enter a link? What about make bold text?
If the answer is NO, then it is fairly simple. Here is the idea of how to set the filter up:
http://greatwebguy.com/programming/java/simple-cross-site-scripting-xss-servlet-filter/
But personally, I don't like the filter being used in the first example; I just put it there to show you how to set the filter up.
I would recommend using this filter:
http://xss-html-filter.sourceforge.net/
So basically:
Setup the example from first link, get it working
Download the example from the second link, put it in your project in such a way you can access it from your code.
Rewrite the cleanXSS method to use what you downloaded from the second link. So probably something like:
private String cleanXSS(String value) {
return new HTMLInputFilter().filter( input );
}
If you do want to allow HTML (such as an anchor tag/etc) then it looks like the HTMLInputFilter has mechanisms to allow this; but it isn't documented so you'll have to figure it out by looking at the code yourself or provide your own way of filtering.
user-entered content securely with minimal effort (my goal is a web app not writing a bunch of security-related code).
How much security-related code you need to write depends on how much you are at risk (how likely is it someone would want to attack your site, which it self is related to how popular your site is).
For example if your writing a public notepad, which will have a total of 3 users, you can get away with the bare minimum, if however your writing a we hate China, Iran and all hackers/crackers app dealing with $1,000,000 worth of transactions an hour and 3 billion users, you may be a bit more of a target.
Simply put you shouldn't trust any data that comes from outside your app including from the datastore. All this data should be checked that it's what you expect.
I've not validated incoming Java Strings against XSS however removing HTML is normally good enough, and Jsoup looks interesting for this (See Remove HTML tags from a String )
Also to be sure you should ensure your outputting what you expect to be outputting and not the some JavaScript.
Most templating engines, including django's (which is bundled with App Engine), provide facilities to escape output to make it safe to print in HTML. In newer versions of Django, this is done automatically unless you tell it not to; in 0.9.6 (still the default in webapp), you pass your output values to |escape in the template.
Escaping on output is universally the best way to do this, because it means you have the original unmodified text; if you modify your escaping or output formatting later, you can still format text entered before that.
You can also use a service that will proxy all connections and block any XSS attempts. I know only one service like that - CloudFlare (but it doesn't mean there aren't others like that). Unfortunately security features goes in with Pro plan which is paid :(