Is it possible to take a screenshot of a website with Flash (or Java)? If it is, could someone please provide some basic information on how to achieve it?
The reason it needs to be Flash or Java (or even Canvas) is that the screenshot has to be taken on the client side.
I did some research but found no definitive answer to my question.
From Flash you cannot take a screenshot beyond the actual view of the Flash rendering area, for security reasons. Just ask the user to press PrintScreen.
I did something like this before, although my solution was simply to have JavaScript send back the actual HTML rendered on the client side. I had a servlet that accepted the HTML code and then called an executable (I can't remember which one; it was freeware that added a watermark and took an HTML file as a command-line argument) to produce an image from the HTML, which the servlet saved to a directory.
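Purely as an illustration of that flow, here is a rough sketch of such a servlet. "html2image", the "/var/screenshots" directory and the "html" form field name are placeholders, not the actual tool or names that were used:

    // Hypothetical sketch of the servlet approach described above.
    // "html2image" is a placeholder for whatever HTML-to-image executable you have;
    // adjust the command-line arguments to match your tool.
    import java.io.*;
    import java.nio.file.*;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    public class ScreenshotServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // The client-side JavaScript POSTs the rendered HTML in a form field
            // named "html" (an assumed name for this sketch).
            String html = req.getParameter("html");

            // Write the HTML to a temporary file the converter can read.
            Path htmlFile = Files.createTempFile("page", ".html");
            Files.write(htmlFile, html.getBytes("UTF-8"));

            // Target image path (placeholder directory).
            Path imageFile = Paths.get("/var/screenshots", htmlFile.getFileName() + ".png");

            // Call the external converter; waitFor() blocks until it finishes.
            try {
                Process p = new ProcessBuilder("html2image",
                        htmlFile.toString(), imageFile.toString()).start();
                p.waitFor();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new ServletException(e);
            }

            resp.setContentType("text/plain");
            resp.getWriter().println("Saved " + imageFile);
        }
    }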
The business user's requirements also included making sure the code could not be used for spying or snooping on the client side, but they agreed with the outcome of the program in the end, since the screenshot is indeed not taken on the client side.
Good evening. I'm working on a project with a team; we have to make a browser without using JEditorPane or any other class that reads HTML.
How can we do that? Do we need to write a new class that does what JEditorPane does? Can I find JEditorPane's source code somewhere? Thanks!
Well, this is an answer:
If you need to display web content without using any pre-existing engine (such as JEditorPane or an embedded Chrome binding), you need to read the HTML as an XML file and construct your native view from it (without CSS and JS this is a fairly easy task), building the screen from a one-to-one mapping of HTML tags to Java JComponents.
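A minimal sketch of that one-to-one mapping, assuming the input is well-formed XHTML so a standard XML parser can read it (the tag-to-component choices are only examples):

    // Minimal sketch: map a few (X)HTML tags one-to-one onto Swing components.
    // Assumes well-formed XHTML so the standard XML parser can read it.
    import java.io.StringReader;
    import javax.swing.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import org.xml.sax.InputSource;

    public class TinyRenderer {
        static JComponent render(Element el) {
            switch (el.getTagName()) {
                case "p":
                case "h1":
                    return new JLabel(el.getTextContent());
                case "input":
                    return new JTextField(20);
                case "button":
                    return new JButton(el.getTextContent());
                default: // container tags: render children into a vertical panel
                    JPanel panel = new JPanel();
                    panel.setLayout(new BoxLayout(panel, BoxLayout.Y_AXIS));
                    NodeList children = el.getChildNodes();
                    for (int i = 0; i < children.getLength(); i++) {
                        if (children.item(i) instanceof Element) {
                            panel.add(render((Element) children.item(i)));
                        }
                    }
                    return panel;
            }
        }

        public static void main(String[] args) throws Exception {
            String xhtml = "<html><body><h1>Hello</h1><p>A paragraph</p>"
                    + "<input/><button>Go</button></body></html>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            JFrame frame = new JFrame("Tiny browser");
            frame.add(render(doc.getDocumentElement()));
            frame.pack();
            frame.setVisible(true);
        }
    }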
Modern Web Browsers are pretty complicated, so there are a lot of different pieces that come together to display a web page. In order to build a browser, you need to first understand what a browser is. For that, I recommend reading this tutorial.
Once you have an understanding of how a browser actually works, you need to determine which pieces you can reuse and which pieces you have to write from scratch. Do you have to write the entire rendering engine? Good luck! Can you use an existing engine like Gecko or WebKit? Or maybe you can get a little closer to done and use the Java port of WebKit?
Once you have a better understanding of the problem, come back and ask more direct questions when you get stuck on a specific piece. As it is, your first step is to gain an understanding of the problem you are trying to solve.
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering whether someone would be able to figure out that their page was being scraped, whether or not they had written code to detect it?
I wrote the code in Java, and it's pretty much just checking for one line of the HTML code.
I thought I'd get some insight on that before I add any more code to this program. I mean, it's useful and all, but it's almost like a hack.
Seems like the worst-case scenario as a result of this page scraper isn't too bad, as I can just use another device later and the IP will be different. Also, it might not matter in a month. The website seems to be getting quite a lot of web traffic at the moment anyway. Whoever edits the page is probably asleep now, and it really hasn't accomplished anything at this point, so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.
It is a hack. :)
There's no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
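To make the "be polite" part concrete, a deliberately naive sketch (real robots.txt handling has user-agent sections, wildcards and Allow rules, and the delay value is just an example):

    // Naive politeness sketch: check robots.txt Disallow rules and throttle requests.
    // Real robots.txt handling is more involved, so treat this purely as an illustration.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class PoliteFetcher {
        private final List<String> disallowed = new ArrayList<>();
        private static final long DELAY_MS = 5000; // example: one request every 5 seconds

        public PoliteFetcher(String site) throws Exception {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(site + "/robots.txt").openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.toLowerCase().startsWith("disallow:")) {
                        disallowed.add(line.substring("disallow:".length()).trim());
                    }
                }
            }
        }

        public String fetch(String url) throws Exception {
            String path = new URL(url).getPath();
            for (String rule : disallowed) {
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    throw new IllegalStateException("Disallowed by robots.txt: " + path);
                }
            }
            Thread.sleep(DELAY_MS); // be polite: rate limit yourself
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }

        public static void main(String[] args) throws Exception {
            PoliteFetcher fetcher = new PoliteFetcher("http://example.com");
            System.out.println(fetcher.fetch("http://example.com/some/page").length() + " bytes");
        }
    }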
As a sysadmin myself, yes, I'd probably notice, but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or at very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, various linked CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.
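To show what that kind of log-watching could look like, here is a toy sketch that flags client IPs whose request timing is suspiciously regular. The log format and thresholds are assumptions for illustration, not anything a real admin tool ships with:

    // Sketch of the statistical angle: scan an Apache-style access log and flag
    // client IPs whose request intervals are suspiciously regular (low variance).
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.text.SimpleDateFormat;
    import java.util.*;
    import java.util.regex.*;

    public class RegularityDetector {
        public static void main(String[] args) throws Exception {
            Pattern line = Pattern.compile("^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\]");
            SimpleDateFormat fmt = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
            Map<String, List<Long>> timesByIp = new HashMap<>();

            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String row;
                while ((row = in.readLine()) != null) {
                    Matcher m = line.matcher(row);
                    if (m.find()) {
                        timesByIp.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                                 .add(fmt.parse(m.group(2)).getTime());
                    }
                }
            }

            for (Map.Entry<String, List<Long>> e : timesByIp.entrySet()) {
                List<Long> t = e.getValue();
                if (t.size() < 20) continue; // not enough data to judge
                Collections.sort(t);
                double mean = 0, var = 0;
                int n = t.size() - 1;
                for (int i = 1; i < t.size(); i++) mean += (t.get(i) - t.get(i - 1)) / (double) n;
                for (int i = 1; i < t.size(); i++) {
                    double d = (t.get(i) - t.get(i - 1)) - mean;
                    var += d * d / n;
                }
                // Very regular intervals (tiny spread relative to the mean) smell like a robot.
                if (Math.sqrt(var) < 0.1 * mean) {
                    System.out.println(e.getKey() + " looks automated (" + t.size() + " requests)");
                }
            }
        }
    }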
It depends on how you have implemented it and how smart the detection tools are.
First, take care with the User-Agent header. If you do not set it explicitly it will be something like "Java/1.6". Browsers send their own "unique" user agents, so you can just mimic browser behavior and send the User-Agent of MSIE or Firefox (for example).
Second, check the other HTTP headers. Browsers send a number of specific headers; take one example request and follow it, i.e. try to add the same headers to your requests (even if you do not need them).
A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then immediately "click" a link, i.e. perform yet another HTTP GET. Put a random sleep between these operations.
A browser retrieves not only the main HTML; it then downloads images and other resources. If you really do not want to be detected, you have to parse the HTML and download that stuff too, i.e. actually behave like a browser.
And one last point: it is obviously not your case, but it is almost impossible to implement a robot that passes a CAPTCHA, which is yet another way to detect robots.
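For illustration, a small sketch of the first three points: a browser-like User-Agent and headers plus a random pause between two requests. The header values and URLs are just examples:

    // Sketch: send browser-like headers and pause a random amount of time
    // between requests. Header values and URLs are examples only.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Random;

    public class BrowserLikeClient {
        private static final Random RANDOM = new Random();

        static String get(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Without this, the default looks like "Java/1.6.0_x" and is easy to spot.
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1) Gecko/20100101 Firefox/10.0");
            conn.setRequestProperty("Accept",
                    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }

        public static void main(String[] args) throws Exception {
            String page = get("http://example.com/");
            // A human pauses before clicking the next link; sleep 2-10 seconds.
            Thread.sleep(2000 + RANDOM.nextInt(8000));
            String nextPage = get("http://example.com/next");
            System.out.println(page.length() + " / " + nextPage.length() + " bytes fetched");
        }
    }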
Happy hacking!
If your scraper acts like a human, there is hardly any chance of it being detected as a scraper. But if your scraper acts like a robot, it is not difficult to detect.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests when accessing the page and request the same things with the scraper (see the sketch after this list).
Time your scraper to access the site at the speed of a normal user.
Send requests at random intervals of time instead of at fixed intervals.
If possible, make requests from a dynamic IP rather than a static one.
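A crude sketch of the second point: pull src/href links out of the page with a regular expression and fetch them too, so the request pattern looks more like a browser's. A real implementation would use an HTML parser; this only shows the idea:

    // Crude sketch: after fetching a page, also fetch the resources a browser would
    // (images, CSS, scripts). A real implementation should use an HTML parser
    // instead of a regular expression.
    import java.io.InputStream;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SubresourceFetcher {
        private static final Pattern LINKS =
                Pattern.compile("(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

        static void fetchLikeABrowser(String pageUrl, String html) {
            Matcher m = LINKS.matcher(html);
            while (m.find()) {
                try {
                    URL resource = new URL(new URL(pageUrl), m.group(1)); // resolve relative URLs
                    try (InputStream in = resource.openStream()) {
                        while (in.read() != -1) { /* discard; we only want the request to happen */ }
                    }
                } catch (Exception e) {
                    // Broken or external links are not fatal for this purpose.
                }
            }
        }

        public static void main(String[] args) {
            String html = "<html><head><link href=\"/style.css\"/></head>"
                    + "<body><img src=\"logo.png\"/></body></html>";
            fetchLikeABrowser("http://example.com/", html);
        }
    }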
Assuming you wrote the page scraper in a normal manner, i.e., it fetches the whole page and then does pattern recognition to extract what you want from it, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.
To the server serving the page, there's no difference whether you download a page into the browser or download a page and screen scrape it. Both actions just require an HTTP request, whatever you do with the resulting HTML on your end is none of the server's business.
Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things or not depends on the server; most don't.
I'd like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.
You should definitely abide by robots.txt, but also look at the website's T&Cs to make sure you are not in violation. There are definitely ways people can identify that you are web scraping, and there could be potential consequences for doing so. In the event that web scraping is not disallowed by the website's terms and conditions, have fun, but make sure to still be conscientious. Don't destroy a web server with an out-of-control bot; throttle yourself to make sure you don't impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
I am looking to develop an app that will take login details from the user, go to a website, login, return values on the web page and then display them to the user on the phone.
Does Java have this functionality? Will I need to use JavaScript instead, maybe? Do these answers depend on the website I am trying to access?
In my head I figure that I could just read in the parameters as strings or chars, parse the web page for the appropriate form and "paste" the appropriate value into the form box. However, I have never attempted anything like this in code, so I am completely new to the idea and don't really know where to start. I tried googling around, but any information I found was either irrelevant or conflicting.
I'm not looking for the code to do it, because I would not really learn anything from that, but a finger pointed in the right direction would be great. I really do want to get better at programming, so that's why I've started to give myself these little side projects.
Any help that can be offered would be great
Ian,
You can try the HttpClient library (http://hc.apache.org/httpclient-3.x/) from Apache. It lets you programmatically access a website from Java code. You will need to do the following:
Use the HttpClient lib to POST the data to the web site.
Receive the HTML response.
Use an HTML parser or XPath to retrieve the values from the response HTML.
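A minimal sketch with that HttpClient 3.x API; the URL and form field names are placeholders for whatever the real login form uses:

    // Sketch with Commons HttpClient 3.x: POST login form data and read the HTML back.
    // URL and field names are placeholders; inspect the real form to find them.
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.PostMethod;

    public class LoginExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            PostMethod post = new PostMethod("http://example.com/login");
            post.addParameter("username", args[0]);
            post.addParameter("password", args[1]);
            try {
                int status = client.executeMethod(post);
                String html = post.getResponseBodyAsString();
                System.out.println("HTTP " + status);
                // Next step: feed 'html' to an HTML parser or XPath to pull out the values.
                System.out.println(html);
            } finally {
                post.releaseConnection();
            }
        }
    }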
You would need a script which accesses the web page and enters the data, but in my opinion this is illegal, because you are accessing a secured area and are able to look into sensitive data. Also, accessing the page via a script is "botting"; most pages have safety precautions to prevent the execution of scripts, because most of them are harmful.
In my opinion there is no legal and easy solution to this.
I'm writing a Perl program that was doing a simple GET request to retrieve results and process them. But the site has been updated and now has a JavaScript component that handles the results (so the actual data is not in the source code anymore).
This is the site:
http://wro.westchesterclerk.com/legalsearch.aspx
Try putting in:
Index Number: 11103
Year: 2009
I want to be able to programmatically enter the "index number" and "year" at the bottom of the form where it says "search by number" and then retrieve the results listed next to it.
I've written many programs in Perl that simply pass variables via the URL and the results are listed in the source code, so it's easy to parse. (Using LWP::Simple)
Like:
my $html = get("http://www.url.com?id=$somenum&year=$someyear");
But this is totally new to me and I don't know where to begin.
I'm somewhat familiar with LWP::UserAgent and Mechanize.
I'd really appreciate any help.
Thanks!
That sort of question gets asked a lot. The standard answer is Wireshark.
I was just using it on that website with the test data you gave and extracted the single POST request responsible for the results. This lets you bypass the JavaScript altogether.
It might be more logical for you to use one of the modules which drives a browser. Something like Mozilla::Mechanize or the Selenium tools.
A browser knows best how to interact with the server using AJAX and re-render the DOM and so on, so build your script on top of that ability.
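Selenium also has Java bindings, so a rough sketch of driving the page that way might look like this. The element IDs come from the page (the same ones used in the javascript: snippet in another answer below), and the fixed sleep is a crude stand-in for a proper wait:

    // Rough Selenium (Java bindings) sketch: drive a real browser, fill in the two
    // fields and trigger the search. The page wires the search button to a
    // searchClick() function, per the snippet in another answer.
    import org.openqa.selenium.By;
    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class LegalSearch {
        public static void main(String[] args) throws Exception {
            WebDriver driver = new FirefoxDriver();
            driver.get("http://wro.westchesterclerk.com/legalsearch.aspx");
            driver.findElement(By.id(
                "ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtIndexNo"))
                .sendKeys("11103");
            driver.findElement(By.id(
                "ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtYear"))
                .sendKeys("2009");
            ((JavascriptExecutor) driver).executeScript("searchClick();");
            Thread.sleep(5000); // crude wait for the AJAX results to render
            System.out.println(driver.getPageSource());
            driver.quit();
        }
    }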
What you're asking to do in this case is hard. Not impossible, but hard.
Method A:
You can sift through their JavaScript code. What their "AJAX" is doing is making a GET/POST request to another web page and dynamically loading the results. If you can decipher what that link is and the proper arguments, you can continue to use GET. I would recommend getting the Firebug plugin and any other tool that will help you de-obfuscate their JavaScript.
Another Method:
If your program can drive a web browser (with javascript: URL support, like Firefox), you could programmatically go to these addresses, then wait a moment and grab your data.
http://wro.westchesterclerk.com/legalsearch.aspx
javascript: function go() { document.getElementById('ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtIndexNo').value=11109; document.getElementById('ctl00_tbSearchArea__ctl1_cphLegalSearch_splMain_tmpl0_tbLegalSearchType__ctl0_txtYear').value='09'; searchClick(); } go();
This is a method we have used, along with mozembed, to programmatically get around this stuff. Recently we switched to WebKit. And to keep this from taking up a video display, we have used Xvfb/Xvnc to create a virtual desktop to load the browser in.
Those are the methods I have come up with so far. Let me know if you come up with another. Also, I hope I helped.
I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', that's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code using them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
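A minimal sketch of the HtmlUnit API, just to show its shape (the URL is a placeholder, and the exact method names differ slightly between HtmlUnit versions):

    // Minimal HtmlUnit sketch: fetch a page, let its JavaScript run, and read it back.
    // The URL is a placeholder.
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitExample {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            HtmlPage page = webClient.getPage("http://example.com/");
            // asText() gives the page text after any JavaScript has run.
            System.out.println(page.getTitleText());
            System.out.println(page.asText());
        }
    }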
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements where you don't much care about JavaScript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic (rather than XPath) access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages.
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
I would say I personally like HtmlUnit and Selenium as my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).