I'm making an Android program that tries to download something from www.wupload.com. What I want isn't a browser, but a way to interact with the web page without actually showing it, like how HtmlUnit is supposed to work.
I'm using the Apache HTTP client for the requests. What I've done so far is send a POST that simulates clicking "slow download" on the web page, then read the response to get the variables needed for the next POST, and then execute that POST. In theory the page should now be showing the captcha, because the response I get says "please enter the captcha", but there is no image URL.
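Concretely, here's roughly what the two requests look like (just a sketch: the URL, the form field names, and the regex are placeholders, not the real ones):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class SlowDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();

        // First POST: simulates clicking "slow download" (placeholder URL and field).
        HttpPost first = new HttpPost("http://www.wupload.com/file/12345");
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("download", "slow"));
        first.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));
        String html = EntityUtils.toString(client.execute(first).getEntity());

        // Read the variables needed for the next POST out of the response HTML.
        Matcher m = Pattern.compile("name=\"tm\" value=\"(\\d+)\"").matcher(html);
        String tm = m.find() ? m.group(1) : "";

        // Second POST: the response says "please enter the captcha",
        // but there is no captcha image URL anywhere in its HTML.
        HttpPost second = new HttpPost("http://www.wupload.com/file/12345");
        List<NameValuePair> form2 = new ArrayList<NameValuePair>();
        form2.add(new BasicNameValuePair("tm", tm));
        second.setEntity(new UrlEncodedFormEntity(form2, "UTF-8"));
        System.out.println(EntityUtils.toString(client.execute(second).getEntity()));
    }
}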
The next step would be to enter the captcha and finally download the file. The problem I'm having is that I don't know how to show the captcha image to the user. Do I have to capture it somehow? I know how to make the POST to send what the user types, but the captcha's image URL isn't in the source code.
I thought of inspecting the web page so I could get the URL from the DOM tree, like what Inspect Element in Google Chrome does, but I have no idea if that's even possible. Any ideas would be great.
Thanks!
The captcha is probably generated using JavaScript, so when you fetch the source of the website the captcha hasn't been generated yet and you won't see the image in the source HTML. You would need to run the JavaScript somehow. You could try using a WebView, because it has built-in JavaScript support, or find a JavaScript engine for Java and wire it in, though I think that would be a lot of work.
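For example, a WebView will run the page's scripts for you, even if it takes you back toward browser territory (a rough, untested sketch; the URL is a placeholder):

import android.app.Activity;
import android.os.Bundle;
import android.webkit.WebView;

public class CaptchaActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        WebView webView = new WebView(this);
        webView.getSettings().setJavaScriptEnabled(true); // off by default
        setContentView(webView);
        // After the page's scripts run, the captcha <img> exists in the live DOM.
        webView.loadUrl("http://www.wupload.com/file/12345"); // placeholder URL
    }
}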
Edit:
Actually, if they are using a third-party captcha library, I'm sure it uses some sort of HTTP request system, so you might be able to inspect it with this plugin for Firefox.
Related
I have created a Java Swing application with a button that should open the user manual (an HTML file) in a browser. I can successfully open the entire web page, but I want to link to specific anchors in the document. For example, I am trying to use this code:
URI uri = new URI("c:/Giggafriggin/user_manual/user_manual.html#h1_3");
Desktop.getDesktop().browse(uri);
But this causes an error claiming the file cannot be found. If I leave off "#h1_3", it opens the page in a browser without a problem, and the anchors work when I enter them into the browser manually. Any ideas?
You could have that link point to another HTML page which redirects to the final URI. Unfortunately, Java is not a web browser.
Looks like this is a known issue that you wouldn't run into if you were using HTTP instead of a local file.
One easy fix is simply to point to a version of the manual that's already online instead of on disk.
If you can't assume the content is available online, you can always spin up an embedded HTTP server like Jetty inside your application and point to that instead.
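A minimal sketch of that idea with embedded Jetty (the port is arbitrary, and the paths are taken from the question):

import java.awt.Desktop;
import java.net.URI;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.ResourceHandler;

public class ManualServer {
    public static void main(String[] args) throws Exception {
        // Serve the manual's directory over HTTP so fragment links work.
        Server server = new Server(8080);
        ResourceHandler handler = new ResourceHandler();
        handler.setResourceBase("c:/Giggafriggin/user_manual");
        server.setHandler(handler);
        server.start();

        // The #h1_3 fragment is honored now that the scheme is http, not file.
        Desktop.getDesktop().browse(new URI("http://localhost:8080/user_manual.html#h1_3"));
    }
}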
I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page: the website calls some JavaScript to navigate to it. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, getting the computer to do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify it.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit with its headless, JavaScript-enabled browser to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
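Even plain HtmlUnit can often follow a JavaScript pager by clicking the element that triggers it, since the same WebClient keeps the session alive (a sketch; the URL and the link text are guesses):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class NextPageScraper {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage firstPage = webClient.getPage("http://example.com/results"); // placeholder

        // click() runs the page's own JavaScript and returns the resulting page.
        HtmlAnchor next = firstPage.getAnchorByText("Next");
        HtmlPage secondPage = next.click();

        System.out.println(secondPage.asXml());
        webClient.closeAllWindows();
    }
}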
Right now I'm working on a web crawler. It should parse some specific sites and write the output to an XML file. Up to this point it's no problem: the crawler works, and you can customize it really quickly via a cfg file. I use Jsoup to parse the HTML content.
I just added a few more sites and noticed that I have a huge problem with HTML content that is created via JavaScript. Isn't there a way to make Jsoup support JavaScript? Or at least to get the full HTML content I can see in my browser?
I already tried HtmlUnit, but it didn't do well: it did not give me the content I would get in my browser.
Sincerely,
Ogofo
Jsoup does not support JavaScript and it does not emulate a browser, so just forget about it if you're planning to execute JavaScript. In my experience HtmlUnit, which is a headless browser, has given me the best results (speaking of Java frameworks).
One thing worth trying in HtmlUnit is changing the BrowserVersion (Chrome / Internet Explorer / Firefox) when creating the WebClient instance. Some sites react differently, and sometimes just changing that value gives you the results you expect.
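For example (which BrowserVersion constants are available depends on your HtmlUnit release; FIREFOX_3_6 and INTERNET_EXPLORER_8 exist in the older ones):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class BrowserSwitch {
    public static void main(String[] args) throws Exception {
        // Some sites serve different markup to each emulated browser.
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        HtmlPage page = webClient.getPage("http://example.com"); // placeholder
        System.out.println(page.asXml()); // the DOM after JavaScript has run
    }
}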
I need to scrape French court cases for a project, but I can't figure out how to get Java to navigate the Court's search engine.
Here's the search page I need to manipulate. I want to start scraping the results page, but I can't get to that page from Java with just the URL. I need some way to have Java tell the server to execute a search based on my date parameters (01/01/2003 - 30/06/2003); after that I can run the show by simply manipulating the URL I'm connecting to.
Any suggestions?
First make sure the terms of service for the site allow this.
I would use HttpClient POSTs to send the data and get the results. Look at the form on the page, figure out which variables you need to emulate, and submit them with HttpClient; you should get back the results you are looking for. Also, this page has a lot of JavaScript, so you need to figure out what it is doing. Maybe it never submits the form at all and instead makes AJAX calls to update the page, but you may still be able to get the same results.
You can always install something like Fiddler, watch the HTTP traffic the page sends, and then emulate that with HttpClient.
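Putting that together, the HttpClient side might look roughly like this (the URL and parameter names are invented; take the real ones from the form or from the Fiddler capture):

import java.util.ArrayList;
import java.util.List;

import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class CourtSearch {
    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();

        // Post the same fields the search form would submit (names invented here).
        HttpPost post = new HttpPost("http://example.com/search");
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("dateFrom", "01/01/2003"));
        form.add(new BasicNameValuePair("dateTo", "30/06/2003"));
        post.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));

        // The response body should be the results page you want to scrape.
        System.out.println(EntityUtils.toString(client.execute(post).getEntity()));
    }
}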
Here's what I want to do: I want to upload a file that will be processed by a servlet, using Apache Commons FileUpload to handle the uploaded file.
I've seen the Gmail-like AJAX file upload, where a hidden iframe is later populated with JavaScript that hides the upload spinner or displays a message that the upload was successful. However, that example uses PHP: the PHP file that handles the upload emits the JavaScript inside the iframe.
My question is, how would I do this in Java using servlets, without resorting to JSP, imitating the PHP implementation above? I don't even know if this is possible, so please guide me to a good implementation (with no external libraries except Commons FileUpload).
Note: I am aware that there are libraries out there that could do this easily, but I first want to know how this happens and how it's possible, and to get my hands dirty and learn it.
Edit: Just to add, I would use the streaming API of Apache Commons FileUpload.
It is exactly the same.
The client makes an HTTP request to the server (by submitting a form).
The server responds with some HTML (which links to or embeds some JavaScript).
Switching from PHP to Java is just a drop-in replacement: you don't need to change any of the JavaScript. The user guide tells you how to set it up.
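To make it concrete, here is a minimal sketch of the servlet side using the Commons FileUpload streaming API; the uploadFinished callback is a name I made up for the example:

import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.servlet.ServletFileUpload;

public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        try {
            // Streaming API: walk the multipart parts without buffering them first.
            ServletFileUpload upload = new ServletFileUpload();
            FileItemIterator iter = upload.getItemIterator(request);
            while (iter.hasNext()) {
                FileItemStream item = iter.next();
                if (!item.isFormField()) {
                    InputStream stream = item.openStream();
                    // ... read the file from the stream and store it ...
                    stream.close();
                }
            }
        } catch (FileUploadException e) {
            throw new ServletException(e);
        }

        // This HTML loads inside the hidden iframe (the form's target), so the
        // script runs there and can notify the parent page, just like the PHP version.
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body><script type=\"text/javascript\">");
        out.println("window.parent.uploadFinished();"); // your own callback, defined on the page
        out.println("</script></body></html>");
    }
}

On the page itself, the upload form just needs its target attribute set to the hidden iframe's name, with uploadFinished() defined in your own JavaScript.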
http://oreilly.com/pub/a/javascript/2002/02/08/iframe.html describes the hidden-iframe approach to file upload well. I've done file upload with a hidden iframe myself; please consult the linked article.