I have a snippet of code like this:
webUrl = new URL(url);
reader = new BufferedReader(new InputStreamReader(webUrl.openStream()));
When I try to fetch the HTML of some pages, I get a response saying that my browser doesn't support frames, so I never get the page's real HTML.
Is there a workaround?
Maybe by making the program identify itself as a browser?
All I need is the HTML, which I then want to parse.
EDIT: I cannot get the frame's src from the HTML in the browser. It is hidden in JavaScript.
The "You don't support frames and we haven't put sensible alternative content here" message will be in the <noframes> element. You need to access the appropriate <frame> element, access its src attribute, resolve the URI in it, and then fetch data from there.
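The steps in this answer (find each <frame>, read its src, resolve it against the page's address) can be sketched with the standard library alone. This is only a rough illustration: the class name FrameSrcResolver, the sample HTML, and the example.com address are made up, and the regular expression is a crude stand-in for a real HTML parser. It also will not help when the src is set from JavaScript, as in the question's EDIT.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the absolute URIs of all <frame src="..."> elements.
class FrameSrcResolver {
    // Crude pattern for <frame ... src="...">; it does not match <frameset> or <noframes>.
    private static final Pattern FRAME_SRC = Pattern.compile(
            "<frame\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Resolves every frame src against the URI the page was loaded from.
    // URI.resolve throws IllegalArgumentException if a src is not a valid URI.
    static List<URI> frameUris(String html, URI pageUri) {
        List<URI> result = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            result.add(pageUri.resolve(m.group(1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up frameset page used only for illustration.
        String html = "<frameset cols=\"20%,80%\">"
                + "<frame name=\"menu\" src=\"menu.html\">"
                + "<frame name=\"main\" src=\"/content/main.html\">"
                + "<noframes>Your browser does not support frames.</noframes>"
                + "</frameset>";
        URI page = URI.create("http://example.com/site/index.html");
        for (URI frame : frameUris(html, page)) {
            System.out.println(frame);
        }
    }
}
```

Each returned URI can then be fetched the same way as the original page, for example with the URL.openStream() call from the question.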
Set a user-agent string in your HTTP request so that the server thinks your client supports frames. I suggest something like HtmlClient or HttpClient for this.
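As a rough sketch of setting the header, here is a version that uses the JDK's own HttpURLConnection instead of the libraries named above (a swapped-in choice to keep the example dependency-free). The class name UserAgentRequest and the user-agent value are illustrative assumptions, and whether a given server changes its response based on this header depends on the site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: prepares an HTTP request that carries a custom User-Agent header.
class UserAgentRequest {
    // Illustrative browser-like value; any string can be sent here.
    static final String BROWSER_UA =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Creates the connection object and sets the header. No network traffic happens
    // until the caller reads from the connection (e.g. via getInputStream()).
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = open("http://example.com/", BROWSER_UA);
        // Prints the header that will be sent with the request.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

To read the page, wrap conn.getInputStream() in the same BufferedReader used in the question.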
Related
I'm using Jsoup to scrape some online data from different stores, but I'm having trouble figuring out how to programmatically replicate what I do as a user. To get the data manually (after logging in), a user must select a store from a tree that pops up.
As best I can tell, the tree is not hard-coded into the site but is built interactively when your computer interacts with the server. When you look for the table in "view page source," there are no entries. When I inspect the tree, I do see the HTML and it seems to come from the "FancyTree" plugin.
As best as I can tell from tracking my activity on Developer Tools -- Network, the next step is a "GET" request which doesn't change the URL, so I'm not sure how my store selection is being transferred.
Any advice on how to get Jsoup or Java generally to programmatically interact with this table would be extremely helpful, thank you!
Jsoup can only parse the original source file, not the DOM. In order to parse the DOM, you'll need to render the page with something like HtmlUnit. Then you can parse the html content with Jsoup.
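import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// load page using HtmlUnit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);
// serialize the rendered DOM and parse it into a Jsoup document
Document doc = Jsoup.parse(myPage.asXml());
// do something with html content
System.out.println(doc.html());
// clean up resources
webClient.close();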
See Parsing Javascript Generated Page with Jsoup.
I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup seems to get only the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();
The site you are trying to parse loads most of its content asynchronously via AJAX calls. Jsoup does not interpret JavaScript and therefore does not behave like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the data you want. Note that the response is JSON, not HTML, so the Jsoup HTML parser is not much help here. But there are great JSON libraries available. I use JSON-Simple.
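If you call the API directly, the query string has to be percent-encoded the same way as in the URL above. A minimal sketch with the JDK's URLEncoder follows; the class name ApiUrlBuilder is made up for illustration, and the parameter names are simply copied from the URL above, so there is no guarantee the API still accepts them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a GET URL with form-encoded query parameters.
class ApiUrlBuilder {
    // Appends the parameters to the base URL in insertion order.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    // URLEncoder uses form encoding: spaces become '+', '[' becomes %5B, ']' becomes %5D.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
    }

    public static void main(String[] args) {
        // Parameter names and values taken from the API URL quoted in the answer.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("attribution[app]", "web");
        params.put("item_collection_id", "");
        params.put("q", "john hardy");
        params.put("page", "1");
        params.put("page_size", "30");
        System.out.println(buildUrl("http://www.threadflip.com/api/v3/items", params));
    }
}
```

Changing the q and page values is then enough to request other searches or result pages, assuming the API pages its results the way the parameter names suggest.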
Alternatively, you may switch to Selenium webdriver, which actually remote controls a real browser. This should have no trouble accessing all items from the page.
I'm trying to use HtmlUnit to go to a URL, but when you visit that URL it downloads a JSON file containing the data you want. I'm not sure how to word this, but basically: in HtmlUnit, how can I get the contents of a downloaded file?
I'm bad at explaining, so here, look.
I'm trying to check username availability with this:
private static final String URL = "https://twitter.com/users/username_available?username=";
...
HtmlPage page = webClient.getPage(URL + users[finalUsersIndex]);
So that creates a new page for each username. The thing is, URL + username returns a JSON file with the username's availability. I know how to read the JSON file, but the problem is this:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage
cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
I get that on this line
HtmlPage page = webClient.getPage(URL + users[finalUsersIndex]);
I suppose I need to create a new page for the response, but how would I do that, since the file is downloaded automatically rather than by, say, clicking a button that downloads it? (Correct me if I'm wrong.)
Sorry 4AM
As its name indicates, an HtmlPage is a page containing HTML. JSON is not HTML.
As the documentation indicates:
The DefaultPageCreator will create a Page depending on the content type of the HTTP response, basically HtmlPage for HTML content, XmlPage for XML content, TextPage for other text content and UnexpectedPage for anything else.
(emphasis mine).
So, as the exception you're getting indicates, the behavior you observed is the documented behavior: you're getting a page that is neither HTML, nor XML, nor text, so you get an UnexpectedPage.
Your code should thus be:
UnexpectedPage page = webClient.getPage(URL + users[finalUsersIndex]);
I need an idea to solve this problem:
I have a portlet that submits a form to an external URL (outside the portal). The submission is done with HttpClient and HttpPost.
The result is an HTML page (obviously).
Now...
I need to put this content into an iframe (and use it afterwards).
I cannot use the URI from the HttpPost in the iframe's src attribute, because the page would be reloaded and all the previous information would be lost.
How can I do this?
Thanks in advance
Call the following JavaScript function, passing the resulting HTML page as a parameter. (Suppose the id of the iframe is "iframeId".)
function setIFrame(result_html){
$("#iframeId").contents().find('html').html(result_html);
}
I need to know if there is a way in Java/servlets to make documents (DOC, PDF) stored in a database available for download in the way described below.
For example, there is a web page with a link to a document.
Right now it works this way:
If the user clicks the link, a new blank window opens and shows the download dialog. The user can download the document, but the blank window stays open
and the user has to close it manually.
I would like it to work this way:
If the user clicks the link, a download dialog should appear asking them to save the file, while they stay on the same page.
A servlet URL handles the download: it extracts the document from the database and makes it available to the user.
Thank you for your time and effort.
You need to set the following headers in your servlet to mark the response as a download, so that browsers don't try to display it:
String value = "attachment;filename=\"" + URLEncoder.encode(filename, "UTF-8") +'"';
response.setHeader("Content-Disposition", value);
response.setHeader("Content-Transfer-Encoding", "binary");
The filename is only a suggested name; the user can change it.
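One caveat with the snippet above: URLEncoder turns spaces into '+', which may then show up in the saved file name. A possible variant, sketched below, uses the filename*=UTF-8''... form defined in RFC 6266 / RFC 5987 with percent-encoded spaces. This is a swapped-in technique rather than what the answer proposes, the class name ContentDispositionHeader is made up, and very old browsers may ignore the filename* parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Content-Disposition value using the filename* parameter.
class ContentDispositionHeader {
    static String attachment(String filename) {
        String encoded;
        try {
            // URLEncoder does form encoding, so adjust it to plain percent-encoding:
            // a '+' in the output always stands for a space (a literal '+' is emitted as %2B),
            // and '*' is left as-is by URLEncoder but is not allowed unescaped in filename*.
            encoded = URLEncoder.encode(filename, "UTF-8")
                    .replace("+", "%20")
                    .replace("*", "%2A");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(attachment("my report.pdf"));
        System.out.println(attachment("r\u00e9sum\u00e9.pdf"));
    }
}
```

In the servlet this would be used as response.setHeader("Content-Disposition", ContentDispositionHeader.attachment(filename)).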
I wonder if your link html doesn't have something like:
<a href="/foo" target="_blank" ....>download</a>
Otherwise, it should work as you want.
This is a bug in IE which depends on several things, the content type is one of them. We had the same problem a few years ago but I don't remember the correct solution anymore, only that we struggled with this for quite some time. Try this:
Use the correct content type (application/pdf)
If that doesn't work, use a wrong file type (like application/octet-stream) which should tell IE to leave the file alone. You may have problems with the file extension, though.
Send or don't send the correct file size
Check which chunking mode you're using.
One of these things made IE behave. Good luck.
You need to remove target="_blank" from your <a> element.
Edit: as per your comments: you need to set Content-Disposition header to attachment. You can find here examples of a simple fileservlet and an advanced fileservlet to get some insights.