I'm trying to use HtmlUnit to visit a URL, but that URL doesn't return a normal page: it serves a JSON file containing the data I want. Put simply, how can I read the contents of that downloaded file in HtmlUnit?
Here's what I'm doing. I'm trying to check username availability with this:
private static final String URL = "https://twitter.com/users/username_available?username=";
...
HtmlPage page = webClient.getPage(URL + users[finalUsersIndex]);
That creates a new page for each username. The thing is, URL + username returns a JSON file describing the user's availability. I know how to read the JSON file, but the problem is this:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage
cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
I get that on this line
HtmlPage page = webClient.getPage(URL + users[finalUsersIndex]);
I suppose I need to create a page of a different type for the response, but how would I do that, since the file is downloaded automatically rather than, say, by clicking a button that downloads it? (Correct me if I'm wrong.)
Sorry 4AM
As its name indicates, an HtmlPage is a page containing HTML. JSON is not HTML.
As the documentation indicates:
The DefaultPageCreator will create a Page depending on the content type of the HTTP response, basically HtmlPage for HTML content, XmlPage for XML content, TextPage for other text content and UnexpectedPage for anything else.
(emphasis mine).
So, as the exception you're getting indicates, the behavior you observed is the documented behavior: you're getting a page that is neither HTML, nor XML, nor text, so you get an UnexpectedPage.
Your code should thus be:
UnexpectedPage page = webClient.getPage(URL + users[finalUsersIndex]);
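From there, if you just need the JSON payload, a minimal follow-up sketch is to read it straight from the page's web response (this assumes the endpoint keeps returning a JSON body):
// read the raw JSON body from the UnexpectedPage's response
String json = page.getWebResponse().getContentAsString();
// hand the string to whatever JSON parsing you already use
System.out.println(json);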
I need to take an input in my HTML file (I think via JavaScript) and parse it in my Java file.
Getting user input in HTML is easy, but I don't know how to read that value in my Java code.
Any suggestions or tutorials? Thank you!
You can use jsoup like this.
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String url = "https://stackoverflow.com/questions/50083471/input-in-html-to-java-file";
Document doc = Jsoup.parse(new URL(url), 3000); // parse with a 3-second timeout
System.out.println("title: " + doc.select("title").text());
System.out.println("users:");
for (Element e : doc.select("div.user-details a"))
    System.out.println(" " + e.text());
This code parses that page and prints the title and the user names.
result:
title: Input in html to java file - Stack Overflow
users:
l1den
truekiller
Ethan Three
saka1029
Create a form in HTML and, when you submit it, call a servlet to handle the request. In the servlet, read the value you want with request.getParameter(name). There are many web MVC frameworks, such as Struts 2 and Spring MVC.
Yes, there are many ways to do it.
The simplest approach is an HTML form: your HTML (or its JavaScript) submits the form data from the user to the server as a POST request to a URL.
The server then needs a program that handles requests to that URL.
For that you pick a framework (a servlet is a simple one, but if you have an existing project, check how it already processes form data); see the sketch below.
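As an illustration of the form-plus-servlet flow both answers describe, here is a minimal sketch; the field name "username" and the /check mapping are made up for the example:
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical HTML form this pairs with:
//   <form action="check" method="post">
//     <input type="text" name="username">
//     <input type="submit" value="Send">
//   </form>
@WebServlet("/check")
public class CheckServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // the submitted value reaches Java code via request.getParameter(name)
        String username = request.getParameter("username");
        response.setContentType("text/plain");
        response.getWriter().println("Received: " + username);
    }
}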
I'm using Jsoup to scrape some online data from different stores, but I'm having trouble figuring out how to programmatically replicate what I do as a user. To get the data manually (after logging in), a user must select a store from a tree that pops up.
As best I can tell, the tree is not hard-coded into the site but is built interactively when your computer interacts with the server. When you look for the table in "view page source," there are no entries. When I inspect the tree, I do see the HTML and it seems to come from the "FancyTree" plugin.
As best as I can tell from tracking my activity on Developer Tools -- Network, the next step is a "GET" request which doesn't change the URL, so I'm not sure how my store selection is being transferred.
Any advice on how to get Jsoup or Java generally to programmatically interact with this table would be extremely helpful, thank you!
Jsoup can only parse the original source file, not the DOM. In order to parse the DOM, you'll need to render the page with something like HtmlUnit. Then you can parse the html content with Jsoup.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// load page using HtmlUnit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);
// convert the rendered page to HTML and parse it into a Jsoup document
Document doc = Jsoup.parse(myPage.asXml());
// do something with the html content
System.out.println(doc.html());
// clean up resources
webClient.close();
See Parsing Javascript Generated Page with Jsoup.
I am struggling with the last part of my project, which involves HtmlUnit. I have successfully managed to fill out the form details and click the submit button, but this returns a Page object:
Page submitted = button.click();
The API for the Page interface can be found here: http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/Page.html. I have spent a while trawling through the API trying to see how, from the page returned after clicking the button, I can access the HTML table on the resulting page.
Would anyone be able to help me with the appropriate method calls I would need to use to complete this?
Thanks
If the page returned is truly HTML (and not, for instance, a zip file) you can do this:
HtmlPage htmlPage = (HtmlPage) button.click();
DomNodeList<HtmlElement> nodes = htmlPage.getElementsByTagName("table");
...
HtmlTable table = getTheTableIWant(nodes);
doSomethingWith(table);
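If it helps, here is a rough sketch of what getTheTableIWant / doSomethingWith could look like in a simple case. getFirstByXPath, getRows(), getCells() and asText() are existing HtmlUnit methods; the XPath //table just assumes the table you want is the first one on the page:
// grab the first <table> on the result page (adjust the XPath to your markup)
HtmlTable table = htmlPage.getFirstByXPath("//table");
for (HtmlTableRow row : table.getRows()) {
    for (HtmlTableCell cell : row.getCells()) {
        System.out.print(cell.asText() + "\t"); // asNormalizedText() in newer HtmlUnit
    }
    System.out.println();
}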
I know you may think this question is stupid, but I need to use HtmlUnit. However, it returns a page either as XML or as text.
I don't know how to get the pure HTML (the same as the source code that browsers show).
I need this, because I need to use some written modules. Any ideas?
You can use the following piece of code to achieve your goal:
WebClient webClient = new WebClient();
Page page = webClient.getPage("http://example.com");
WebResponse response = page.getWebResponse();
String content = response.getContentAsString();
See the javadocs of the WebResponse.getContentAsString() method.
I have a snippet of code like this:
webUrl = new URL(url);
reader = new BufferedReader(new InputStreamReader(webUrl.openStream()));
When I try to get the HTML content of some pages, I get a response saying that my browser doesn't support frames, so I don't get the real HTML of the page.
Is there a workaround?
Maybe I can tell the program to identify itself as some browser?
For me the critical part is just getting the HTML; then I want to parse it.
EDIT: I cannot get the src of the frame from the HTML in the browser. It is hidden in JS.
The "You don't support frames and we haven't put sensible alternative content here" message will be in the <noframes> element. You need to access the appropriate <frame> element, access its src attribute, resolve the URI in it, and then fetch data from there.
You can set a user-agent string in your HTTP request so that the server thinks you are a browser that supports frames. I suggest something like HtmlUnit or HttpClient for this.
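Whether this helps depends on the server actually checking the User-Agent header, but here is a minimal sketch of setting one on the plain-Java connection from the question (the UA string is just an example):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// open the connection explicitly so a browser-like User-Agent can be set
URL webUrl = new URL(url);
URLConnection connection = webUrl.openConnection();
connection.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"); // example value, any common browser UA works
BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream()));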