How to convert HtmlPage into Html string in HtmlUnit - java

I want to convert a page into a real HTML string, with <html>, <body>, etc..., not XML. I only see the asXml() function, which often changes many things in the structure.
Also note that I've performed modifications to the page after fetching it and I want those modifications to be present in the output as well.
How can I do that? Thank you so much.

So let me check if I got it right:
You fetched a page
You performed modifications to the page (EG: modifying nodes in it)
You want a valid HTML page containing the previous modifications as a String
page.asXml() will not help. This will return a valid XML file as a String rather than a valid HTML file.
page.getWebResponse().getContentAsString() will not help either. This will return the response exactly as the server sent it (without any of the modifications you have made).
There is no other method that would return a string with a valid HTML String.
However, you could try using page.save(file). That would save the modified page to a file as HTML. Sadly, I don't think there is a method that receives an OutputStream, so you will most likely have to save the page to the file system and then read it back.
You could also take a look at the HtmlUnit source and see how that method is implemented. Maybe adding your own save method is not that complex :)
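The save-then-read-back idea could be sketched roughly as below. This is only a sketch under assumptions: SaveRoundTrip and FileSaver are hypothetical names, the saved file is assumed to be UTF-8 unless you pass another charset, and the helper uses a fresh temporary directory on the assumption that save may refuse to overwrite an existing file and may write extra resources next to the target. Only the standard library is used here; the HtmlUnit call is passed in as a lambda.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class SaveRoundTrip {

    /** Hypothetical callback: anything that can write itself to a file, e.g. target -> page.save(target). */
    @FunctionalInterface
    interface FileSaver {
        void saveTo(File target) throws IOException;
    }

    /** Lets the saver write to a not-yet-existing temp file, reads that file back, then cleans up. */
    static String saveAndReadBack(FileSaver saver, Charset charset) throws IOException {
        Path dir = Files.createTempDirectory("page-save");
        Path target = dir.resolve("page.html"); // does not exist yet
        try {
            saver.saveTo(target.toFile());
            return new String(Files.readAllBytes(target), charset);
        } finally {
            deleteRecursively(dir); // also removes anything saved alongside the page
        }
    }

    private static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order so that children are deleted before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}
```

With an HtmlPage in hand, the call would look something like SaveRoundTrip.saveAndReadBack(target -> page.save(target), StandardCharsets.UTF_8).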

Related

Java Tag update META

I have a basic Java Tag (called PluginTag), which extends TagSupport. This tag adds some behaviour to the calling JSP using the JspWriter instance, e.g.
this.pageContext.setAttribute("plugins", someBehaviour);
I would like to extend this tag, so that it injects HTML meta data into HEAD of the html document. So as explained, the tag has a JspWriter, and not much else...
Also, by the time PluginTag is invoked, another tag will have written the HEAD and any META data out. The trick is I can't update that tag to do my work, and in any case I would like the PluginTag to handle my META data, if possible.
I have seen a few things like Apache HtmlElement, but I don't think they are applicable in the context of a Tag.
Thanks.
It's impossible for a custom tag to access HTML that was produced outside of it. The reason is that the HTML produced earlier may already have been flushed to the user agent, while the rest has not been produced yet.
An alternative is to change the HTML document after it has been delivered to the client and rendered: have a custom tag emit some JavaScript that modifies the required elements of the document.
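As a rough illustration of the JavaScript approach, the tag could write out a small inline script that appends a META element to the HEAD on the client side. The helper below only builds that script as a string; MetaInjectionScript and its methods are hypothetical names, and the escaping is a minimal sketch rather than a complete JavaScript encoder.

```java
class MetaInjectionScript {

    /** Escapes a value for use inside a single-quoted JavaScript string in an inline script. */
    static String escapeJs(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\'"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '<': out.append("\\x3C"); break; // keeps "</script>" from ending the block early
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an inline script that appends a meta element with the given name and content to the head. */
    static String build(String name, String content) {
        return "<script type=\"text/javascript\">(function () {"
             + "var m = document.createElement('meta');"
             + "m.setAttribute('name', '" + escapeJs(name) + "');"
             + "m.setAttribute('content', '" + escapeJs(content) + "');"
             + "document.getElementsByTagName('head')[0].appendChild(m);"
             + "})();</script>";
    }
}
```

Inside the tag this would be written with something like pageContext.getOut().write(MetaInjectionScript.build("description", "...")). Keep in mind that the META element only exists for clients that actually run the script.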

Parse javascript generated content using Java

http://support.xbox.com/en-us/contact-us uses javascript to create some lists. I want to be able to parse these lists for their text. So for the above page I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I was trying to use JSoup for a while before noticing it was generated using javascript. I have no idea how to go about parsing a page for its javascript generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
You could always import the whole page, split it into pieces (on line breaks, etc.), look for the piece containing the information, and then pull the parts you want out of that string. That is the dirty way of doing it; I'm not sure if there is a clean way.
I don't think that text is generated by javascript... If I disable javascript those options can be found inside the html at this location (a jquery selector just because it was easier to hand-write than figuring out the xpath without javascript enabled :))
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless in direct answer to your query, you'd have to evaluate the javascript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the javascript file generating the relevant content and just parsing that directly.

How can I send a newsletter with xPages content?

I have some content displayed using computed fields inside a repeat in my xpage.
I now need to be able to send out a newsletter (by email) every week with the content of this repeat. The content can be both plain text and html
My site is also translated into different languages so I need the code to be able to specify the language and return the content in that language.
I am thinking about creating a scheduled LotusScript or Java agent that somehow reads the content of the repeat. Is this possible? If so, some sample code to get me started would be great.
edit: the content is only available to logged in users
thanks
Thomas
Use a java agent, and instead of going to the content natively, do a web page open and open the page as if in a browser, then process the result. (you could make a special version of the web page that hides all extraneous content as well if you wanted)
How is the data for the repeat evaluated? Can it be translated into a LotusScript database.search?
If so then it would be best to forget about the actual xPage and concentrate on working out how to get the same data via LotusScript and then write your scheduled agent to loop through the document collection and generate the email that way.
Going through the XPage would generate a lot of extra work: you need to be authenticated as the user (if the data in the repeat is different from one user to the next) to get the exact same data that this particular user would see, and then you have to parse the page to extract the data.
If you have a complicated enough newsletter that you want to do an Xpage and not build the html yourself in the agent, what you could do is build a single xpage that changes what's rendered based on a special query string, then in your agent get the html from a URLConnection and pass the html into the body of your email.
You could build the URL based on a view that shows documents with today's date.
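A minimal sketch of the URLConnection idea, using only the standard library. Everything specific here is an assumption for illustration: the class name NewsletterFetcher, the query parameters mode=newsletter and lang, the UTF-8 response encoding, and passing the login as a ready-made Cookie header (how you obtain that session depends on your server setup).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class NewsletterFetcher {

    /** Appends the (hypothetical) newsletter switch and language to the page URL. */
    static String buildUrl(String pageUrl, String language) throws UnsupportedEncodingException {
        String separator = pageUrl.contains("?") ? "&" : "?";
        return pageUrl + separator + "mode=newsletter&lang=" + URLEncoder.encode(language, "UTF-8");
    }

    /** Reads a stream fully and decodes it as UTF-8 (assumed encoding). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Fetches the rendered page; cookieHeader may be null when no login is needed. */
    static String fetchHtml(String url, String cookieHeader) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (cookieHeader != null) {
            conn.setRequestProperty("Cookie", cookieHeader);
        }
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

The agent would call buildUrl once per language, fetch the HTML, and use the returned string as the body of the mail.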
I would solve this by giving the user a teaser on what to read and give them a link to the full content.
You should check out my colleague Weihang Chen's article about rendering an XPage as MIME and sending it as mail.
http://www.bleedyellow.com/blogs/weihang/entry/render_a_xpages_programmtically_and_send_it_as_a_mail?lang=en_us
We got this working in house and it is very convenient.
He describes 3 different approaches to the problem.

How is web browser search implemented?

I want to implement, in a Java desktop application, searching for and highlighting multiple phrases in HTML files, like it is done in web browsers. HTML tags (within < and >) are not searched, but they are not all treated alike: an inline tag like <b> does not break a phrase, while a tag like <p> does. For example, when searching for each table, the text ...each <b>table</b> has name... will be highlighted, but the text ...has each</p><p> Table is... will not be highlighted, because the <p> tag interrupts the text.
This is somehow implemented in web browsers; how can I get at that implementation? Or is there some source on the net? I tried Google, but without success :(
Instead of searching inside the actual HTML file the browsers search on the rendered output of that HTML.
Get a suitable HTML renderer and get its output as text. Then search on that text output using appropriate string searching algorithms.
The example that you highlighted in your question would result in a newline character in the rendered HTML output and hence a normal string searching algorithm will behave as you expect.
As Faisal stated, browsers search in rendered content only. For doing so you'll need to remove the HTML tags before doing the actual search:
This code might help you:
http://www.dotnetperls.com/remove-html-tags
Of course you'll need to add some checks/exclusions like script tags and other things that are not rendered into the browser.
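A rough sketch of that strip-then-search idea in Java, using regular expressions rather than a real renderer. The class name and the list of block-level tags are assumptions for illustration: block-level tags and script/style content become line breaks, every other tag is simply dropped, and the phrase is then matched case-insensitively against the resulting text.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class RenderedTextSearch {

    // Content that is never rendered as text.
    private static final Pattern SKIPPED =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Tags assumed to interrupt the text flow (illustrative list, not exhaustive).
    private static final Pattern BLOCK_TAG =
        Pattern.compile("(?i)</?(p|div|br|li|tr|td|th|table|ul|ol|h[1-6])\\b[^>]*>");
    // Everything else, e.g. <b>, <i>, <span>, <a>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Approximates the rendered text: block tags become newlines, inline tags vanish. */
    static String toSearchableText(String html) {
        String text = SKIPPED.matcher(html).replaceAll("\n");
        text = BLOCK_TAG.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        return text.replaceAll("[ \\t]+", " "); // collapse spaces, keep line breaks
    }

    /** Case-insensitive phrase search on the approximated rendered text. */
    static boolean containsPhrase(String html, String phrase) {
        String haystack = toSearchableText(html).toLowerCase(Locale.ROOT);
        return haystack.contains(phrase.toLowerCase(Locale.ROOT));
    }
}
```

With the examples from the question, containsPhrase("...each <b>table</b> has name...", "each table") matches, while the text with </p><p> between the two words does not. Character entities such as &amp; are not decoded here, and mapping a hit back to a position in the original HTML for highlighting would need extra bookkeeping.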
This seems pretty easy.
1) Search for the last word in the string.
2) Look at what's before the last word.
3) Decide if what's before the last word constitutes an interruption (<p>, <br />, <div>).
4) If interruption, continue
5) Else evaluate previous word against the search query.
I don't know if this is how browsers perform this operation, but this approach should work.
Try using javax.swing.text.html package in java.

How to get AJAX generated HTML text?

AJAX is a very powerful tool so I am struggling with it :-).
Is there any way or API(in java) so that I can get the HTML code which is generated by AJAX?
Generally, AJAX pages insert content via innerHTML, so that HTML is missing when I view the page source.
e.g click here
Just see the section OTHER NEWS. The content is populated by AJAX. When I look into the page source the code is not there.
I need this HTML code through a java program. How can I get it?
To have a Java application use the content received via AJAX, you first need to find the URLs the content is being loaded from. In this case it would be http://itm2083.com/get_wwo_content.php?featureGroupId=8355&featureDisplayLimit=1&sponsorName=vortalx&wwoDivCounter=5&domainUrlForWWo=http://item2083.com/&featureImgDisplay=FLAG_TRUE&featureGroupImageWidthLimit=200&featureGroupDefaultImageUrl1=http://wwo.itmftp.com/75x75.gif&featureGroupDefaultImageUrl2=http://wwo.itmftp.com/75x75.gif&featureGroupDefaultImageUrl3=http://wwo.itmftp.com/75x75.gif
The featureGroupId= parameter has 5 IDs: 8355, 8359, 8367, 8369, 8429. Use these to pull the content from the Other News box.
The featureDisplayLimit= parameter determines how much content is pulled from the server.
If you want the nice HTML as well, the Java app will have to recreate it, as the HTML rendered on the site is created by JavaScript code.
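Building one request URL per ID could look roughly like this. It is only a sketch: FeatureGroupUrls is a hypothetical helper, and it assumes the URL found above can be reused as a template in which only the featureGroupId value changes.

```java
import java.util.ArrayList;
import java.util.List;

class FeatureGroupUrls {

    /** Replaces the featureGroupId value in the template URL with the given ID. */
    static String withFeatureGroupId(String templateUrl, int id) {
        return templateUrl.replaceFirst("featureGroupId=\\d+", "featureGroupId=" + id);
    }

    /** Returns one URL per ID, in the same order as the IDs. */
    static List<String> forAll(String templateUrl, int[] ids) {
        List<String> urls = new ArrayList<>();
        for (int id : ids) {
            urls.add(withFeatureGroupId(templateUrl, id));
        }
        return urls;
    }
}
```

Called with the full URL from above and the IDs 8355, 8359, 8367, 8369 and 8429, this gives five URLs whose responses can each be read with, for example, new URL(u).openStream().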
