Get the HTML page using htmlunit


I am trying to get the HTML page of a website (e.g. http://htmlunit.sourceforge.net), but I get the error IllegalArgumentException: Cannot locate declared field class org.apache.http.impl.client.HttpClientBuilder.dnsResolver. My code is as follows:
public class Main1 {
public static void main(String[] args) {
try {
homePage();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static void homePage() throws Exception {
try (final WebClient webClient = new WebClient()) {
final HtmlPage page = webClient.getPage("http://www.google.com");
String text = page.asText();
System.out.println(text);
}
}
}
Is there something wrong with the code? Thanks

It's counter-intuitive, but you can call asXml() on an HtmlPage or HtmlElement to get its HTML/XML representation.
page.asXml()
The way you wrote the code, asText() returns a text representation of what would be shown to a user in a browser.
You may also need to add this to enable JavaScript:
webClient.getOptions().setJavaScriptEnabled(true);

IllegalArgumentException: Cannot locate declared field class org.apache.http.impl.client.HttpClientBuilder.dnsResolver
This looks like the wrong version of the HttpClient dependency. Check that your classpath contains only one version (the correct one) of every dependency.
For the current version, you can find a list of dependencies here: http://htmlunit.sourceforge.net/dependencies.html
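One way to check the classpath from inside the application is to ask the JVM where a class comes from. The following is only a diagnostic sketch using JDK calls; the class name ClasspathProbe and its methods are made up for illustration and are not part of HtmlUnit or HttpClient.

```java
import java.io.IOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

// Diagnostic sketch (hypothetical helper, not part of HtmlUnit).
final class ClasspathProbe {

    /** Describes the jar or directory a class was actually loaded from. */
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader have no code source.
            if (source == null || source.getLocation() == null) {
                return className + " -> (no code source, probably a JDK class)";
            }
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not loadable: " + e;
        }
    }

    /** Lists every copy of the class file visible to the class loader. */
    static List<URL> copies(String className) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        // The class named in the error message from the question.
        String name = "org.apache.http.impl.client.HttpClientBuilder";
        System.out.println(locate(name));
        // More than one entry here suggests conflicting HttpClient jars.
        for (URL copy : copies(name)) {
            System.out.println("copy: " + copy);
        }
    }
}
```

If copies() reports more than one location, the extra jar is a likely candidate for removal.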

You can use jsoup parser.
A short code sample:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Advanced Usage
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
Helpful URLs
Dom Navigation
Extracting
Working with URLs

Related

Unable to login into pages using Jsoup

I've looked everywhere and hit a dead end. I can't seem to log in using jsoup even though everything looks right. I tried using Document instead of Connection.Response for the second connection, but it still failed. And I'm pretty sure this is all the data needed to log in. Any help is appreciated.
public class Test {
public static void main(String[] args) {
try {
Connection.Response initial = Jsoup.connect("https://mytedata.net/wps/portal/ssp/WSLogin/")
.method(Connection.Method.GET)
.execute();
Connection.Response login = Jsoup.connect("https://mytedata.net/wps/portal/ssp/WSLogin/")
.data("javax.faces.encodedURL", "%2Fwps%2Fportal%2Fssp%2FWSLogin%2F%21ut%2Fp%2Fa1%2F04_Sj9CPykssy0xPLMnMz0vMAfGjzOKd3R09TMx9DAwMTCyMDDxdnDxczC19DQx8TYEKIoEKDHAARwNC-sP1o8BKPANMDDwsLAy83P3dLQw8zcMMnPwcDQwNTAygCvBYUZAbYZDpqKgIAHDzbiY%21%2Fdl5%2Fd5%2FL2dBISEvZ0FBIS9nQSEh%2Fpw%2FZ7_49L81I02J0VS50AFQUGD3C00G4%2Fres%2Fid%3DLoginPortletView.xhtml%2Fc%3DcacheLevelPage%2F%3D%2F")
.data("javax.faces.ViewState", "QFGkXvcilKfcbPXW1qJENp5sG6jAAt89s5%2BIY9vyKom3S72E9rizdFrPeA%2FYjD3Ja1sMNvMQtrSOkF2TUnKrEGYnxx918q5QVq2XkGqCVqm3iMQ1JFMuqBOe%2FJhaiOIueXg6Fw%3D%3D")
.data("viewns_Z7_49L81I02J0VS50AFQUGD3C00G4_%3Aform1%3AcommandButtonSignIn", "Log+In")
.data("viewns_Z7_49L81I02J0VS50AFQUGD3C00G4_%3Aform1%3AinputSecretPassword", "mypassword")
.data("viewns_Z7_49L81I02J0VS50AFQUGD3C00G4_%3Aform1%3AinputTextEmail", "myemail")
.data("viewns_Z7_49L81I02J0VS50AFQUGD3C00G4_%3Aform1_SUBMIT", "1")
.cookies(initial.cookies())
.method(Connection.Method.POST)
.execute();
Document doc = Jsoup.connect("https://mytedata.net/wps/myportal/ssp/Home")
.cookies(login.cookies())
.get();
Elements ele = doc.select("span.grayItem");
System.out.println(ele.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

Return directory listing 2 levels down if applicable using Jsoup

I am using Jsoup to return a list of files found in a specified directory. I am accomplishing this with following code:
public List<String> getDirectoryListing(String urlPath) {
InitParams ip = new InitParams();
Elements links;
List<String> directoryListing = new ArrayList<>();
try{
Document doc = Jsoup.connect("http://" + urlPath).get();
links = doc.select("body a");
for (Element link : links){
directoryListing.add(link.text());
}
} catch (Exception ex) {
ex.printStackTrace();
}
return directoryListing;
}
However, I have a different case, where there could be another folder inside this one with the file in it.
I need to check whether each returned entry is a directory and, if so, go inside it and return the file.
Does anyone know how to do this?

You need some recursive logic in there, in which the method calls itself to list files in subfolders. It will then go as many levels deep as you need. You'll need a more complex object than String, one that can hold children. I'd make your own class.
Pseudocode, something like this. It is not compilable, but it conveys the algorithm:
public List<WebFile> getFiles(urlPath) {
List webFiles = new web files list;
List urlFiles = methodToGetWebFilesList(urlPath);
foreach urlFile in urlFiles {
//constructor has logic to parse whatever is in URL file and
//determine if it is a directory
WebFile webFile = new WebFile(urlFile);
if "webFile" is a directory {
//recursive call to self, drill down into this file
children = getFiles(urlFile);
webFile.children.addAll(children);
}
//collect the entry, whether it is a file or a directory
webFiles.add(webFile);
}
return webFiles;
}
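The pseudocode can be turned into a small runnable sketch along these lines. WebFile and DirectoryWalker are illustrative names, the "ends with a slash" directory test is an assumption about how the index page labels folders, and the lister function is a stand-in for whatever fetches the links (for example the jsoup call from the question). A maxDepth parameter is added here to cover the "2 levels down" requirement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative node type: a path plus any children found below it.
final class WebFile {
    final String path;
    final boolean directory;
    final List<WebFile> children = new ArrayList<>();

    WebFile(String path) {
        this.path = path;
        // Assumption: directory entries end with "/".
        this.directory = path.endsWith("/");
    }
}

final class DirectoryWalker {
    // Hypothetical hook: returns the entry names listed at a path.
    private final Function<String, List<String>> lister;

    DirectoryWalker(Function<String, List<String>> lister) {
        this.lister = lister;
    }

    /** Lists urlPath and descends at most maxDepth levels into subdirectories. */
    List<WebFile> getFiles(String urlPath, int maxDepth) {
        List<WebFile> webFiles = new ArrayList<>();
        for (String name : lister.apply(urlPath)) {
            WebFile webFile = new WebFile(urlPath + name);
            if (webFile.directory && maxDepth > 0) {
                // Recursive call: drill down into the subdirectory.
                webFile.children.addAll(getFiles(webFile.path, maxDepth - 1));
            }
            webFiles.add(webFile);
        }
        return webFiles;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a remote listing, so the sketch runs offline.
        Map<String, List<String>> fake = Map.of(
                "host/", List.of("a.txt", "sub/"),
                "host/sub/", List.of("b.txt"));
        DirectoryWalker walker = new DirectoryWalker(p -> fake.getOrDefault(p, List.of()));
        for (WebFile f : walker.getFiles("host/", 2)) {
            System.out.println(f.path + " children=" + f.children.size());
        }
    }
}
```

Passing the lister in as a function keeps the recursion separate from the network code, which makes it easy to try out with fake data first.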

Great idea @slambeth
public List<String> getDirectoryListing(String urlPath)throws Exception
{
return getDirectoryListing(urlPath, new ArrayList<>());
}
public List<String> getDirectoryListing(String urlPath, List<String> directoryListing)throws Exception
{
InitParams ip = new InitParams();
Elements links;
try
{
Document doc = Jsoup.connect("http://" + urlPath).get();
links = doc.select("body a");
for (Element link : links)
{
if(link.text().lastIndexOf("/")>0) {
getDirectoryListing(urlPath + link.text(), directoryListing);
}
else
directoryListing.add(link.text());
}
} catch (Exception ex) {
ex.printStackTrace();
}
return directoryListing;
}
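The check lastIndexOf("/") > 0 in the code above treats any link text with a slash after the first character as a directory. Index pages often also contain links that should not be followed, such as a parent-directory link or column-sort links; that is an assumption about typical server-generated listings, not something stated in the question. A hedged sketch of a stricter classifier, with made-up names, might look like this:

```java
// Hypothetical helper for deciding what to do with a link from an index page.
final class ListingLinks {

    enum Kind { DIRECTORY, FILE, SKIP }

    static Kind classify(String linkText) {
        String text = linkText == null ? "" : linkText.trim();
        // Assumed noise entries: empty text, root or parent links, sort links.
        if (text.isEmpty() || text.equals("/") || text.startsWith("..")
                || text.startsWith("?") || text.equalsIgnoreCase("Parent Directory")) {
            return Kind.SKIP;
        }
        // Assumption: subdirectories are shown with a trailing slash.
        return text.endsWith("/") ? Kind.DIRECTORY : Kind.FILE;
    }

    public static void main(String[] args) {
        System.out.println(classify("reports/"));
        System.out.println(classify("readme.txt"));
        System.out.println(classify("../"));
    }
}
```

Skipping parent links matters for the recursive version, because following them could send the method back up the tree.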

How to add HTML headers and footers to a page?

How to add header to pdf from an html source using itext?
Currently, we have extended PdfPageEventHelper and overridden these methods. It works fine, but it throws a RuntimeWorkerException when I get to 2+ pages.
@Override
void onStartPage(PdfWriter writer, Document document) {
InputStream is = new ByteArrayInputStream(header?.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
}
@Override
void onEndPage(PdfWriter writer, Document document) {
InputStream is = new ByteArrayInputStream(footer?.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
}

It is forbidden to add content in the onStartPage() event in general. It is forbidden to add content to the document object in the onEndPage(). You should add your header and your footer in the onEndPage() method using PdfWriter, NOT document. Also: you are wasting plenty of CPU by parsing the HTML over and over again.
Please take a look at the HtmlHeaderFooter example.
It has two snippets of HTML, one for the header, one for the footer.
public static final String HEADER =
"<table width=\"100%\" border=\"0\"><tr><td>Header</td><td align=\"right\">Some title</td></tr></table>";
public static final String FOOTER =
"<table width=\"100%\" border=\"0\"><tr><td>Footer</td><td align=\"right\">Some title</td></tr></table>";
Note that there are better ways to describe the header and footer than by using HTML, but maybe it's one of your requirements, so I won't ask you why you don't use any of the methods that are explained in the official documentation. By the way: all the information you need to solve your problem can also be found in that free ebook, so you may want to download it.
We will read these HTML snippets only once in our page event and then we'll render the elements over and over again on every page:
public class HeaderFooter extends PdfPageEventHelper {
protected ElementList header;
protected ElementList footer;
public HeaderFooter() throws IOException {
header = XMLWorkerHelper.parseToElementList(HEADER, null);
footer = XMLWorkerHelper.parseToElementList(FOOTER, null);
}
@Override
public void onEndPage(PdfWriter writer, Document document) {
try {
ColumnText ct = new ColumnText(writer.getDirectContent());
ct.setSimpleColumn(new Rectangle(36, 832, 559, 810));
for (Element e : header) {
ct.addElement(e);
}
ct.go();
ct.setSimpleColumn(new Rectangle(36, 10, 559, 32));
for (Element e : footer) {
ct.addElement(e);
}
ct.go();
} catch (DocumentException de) {
throw new ExceptionConverter(de);
}
}
}
Do you see the mechanism we use to add the Element objects obtained from XML Worker? We create a ColumnText object that will write to the direct content of the writer (using the document is forbidden). We define a Rectangle and we use go() to render the elements.
The result is shown in html_header_footer.pdf.

Bruno's answer is correct, but it didn't work completely for me: XMLWorkerHelper.parseToElementList was not able to parse some system fonts, whereas XMLWorkerHelper.getInstance().parseXHtml(writer, document, is) parsed them correctly. So I had to go down the route of an element handler, which worked a treat. Here's the code in C#:
/// <summary>
/// returns pdf in bytes.
/// </summary>
/// <param name="contentsHtml">contents.</param>
/// <param name="headerHtml">header contents.</param>
/// <param name="footerHtml">footer contents.</param>
/// <returns></returns>
public Byte[] GetPDF(string contentsHtml, string headerHtml, string footerHtml)
{
// Create a byte array that will eventually hold our final PDF
Byte[] bytes;
// Boilerplate iTextSharp setup here
// Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream())
{
// Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
using (var document = new Document(PageSize.A4, 40, 40, 120, 120))
{
// Create a writer that's bound to our PDF abstraction and our stream
using (var writer = PdfWriter.GetInstance(document, ms))
{
// Open the document for writing
document.Open();
var headerElements = new HtmlElementHandler();
var footerElements = new HtmlElementHandler();
XMLWorkerHelper.GetInstance().ParseXHtml(headerElements, new StringReader(headerHtml));
XMLWorkerHelper.GetInstance().ParseXHtml(footerElements, new StringReader(footerHtml));
writer.PageEvent = new HeaderFooter(headerElements.GetElements(), footerElements.GetElements());
// Read your html by database or file here and store it into finalHtml e.g. a string
// XMLWorker also reads from a TextReader and not directly from a string
using (var srHtml = new StringReader(contentsHtml))
{
// Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, srHtml);
}
document.Close();
}
}
// After all of the PDF "stuff" above is done and closed but **before** we
// close the MemoryStream, grab all of the active bytes from the stream
bytes = ms.ToArray();
}
return bytes;
}
}
The page event and element handler code is here:
public partial class HeaderFooter : PdfPageEventHelper
{
private ElementList HeaderElements { get; set; }
private ElementList FooterElements { get; set; }
public HeaderFooter(ElementList headerElements, ElementList footerElements)
{
HeaderElements = headerElements;
FooterElements = footerElements;
}
public override void OnEndPage(PdfWriter writer, Document document)
{
base.OnEndPage(writer, document);
try
{
ColumnText headerText = new ColumnText(writer.DirectContent);
foreach (IElement e in HeaderElements)
{
headerText.AddElement(e);
}
headerText.SetSimpleColumn(document.Left, document.Top, document.Right, document.GetTop(-100), 10, Element.ALIGN_MIDDLE);
headerText.Go();
ColumnText footerText = new ColumnText(writer.DirectContent);
foreach (IElement e in FooterElements)
{
footerText.AddElement(e);
}
footerText.SetSimpleColumn(document.Left, document.GetBottom(-100), document.Right, document.GetBottom(-40), 10, Element.ALIGN_MIDDLE);
footerText.Go();
}
catch (DocumentException de)
{
throw new Exception(de.Message);
}
}
}
public class HtmlElementHandler : IElementHandler
{
public ElementList Elements { get; set; }
public HtmlElementHandler()
{
Elements = new ElementList();
}
public ElementList GetElements()
{
return Elements;
}
public void Add(IWritable w)
{
if (w is WritableElement)
{
foreach (IElement e in ((WritableElement)w).Elements())
{
Elements.Add(e);
}
}
}
}

Intercept rendered HTML

I am writing a web application using Play 1.2.3. One of the features is to export a rendered HTML page as PDF. I already have the HTML template rendered dynamically based on the parameters sent by the server.
I am planning to use wkhtmltopdf to convert the rendered HTML to PDF. Is there a way in which I can intercept the final HTML (processed by the framework by replacing all template tags) for this purpose? Or is there a better way to achieve this?

There is already a module for that: http://www.playframework.org/modules/pdf
If you want to do it yourself, you can look at how the Controller class loads a template and adapt that part to get the rendered template as a String:
protected static String renderTemplate(String templateName, Map<String,Object> args) {
try {
Template template = TemplateLoader.load(template(templateName));
// Get the template into a String
return template.render(args);
} catch (TemplateNotFoundException ex) {
if (ex.isSourceAvailable()) {
throw ex;
}
StackTraceElement element = PlayException.getInterestingStrackTraceElement(ex);
if (element != null) {
throw new TemplateNotFoundException(templateName, Play.classes.getApplicationClass(element.getClassName()), element.getLineNumber());
} else {
throw ex;
}
}
}
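Once the rendered template is available as a String, handing it to wkhtmltopdf could look roughly like the sketch below. It assumes the wkhtmltopdf binary is installed and on the PATH, and uses its basic "wkhtmltopdf input output" form; PdfExport and its method names are invented for illustration and are not part of Play.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: not part of Play or of the PDF module.
final class PdfExport {

    /** Writes the rendered HTML to a temp file and returns the command to convert it. */
    static List<String> buildCommand(String renderedHtml, Path pdfTarget) throws IOException {
        Path htmlFile = Files.createTempFile("rendered-", ".html");
        Files.write(htmlFile, renderedHtml.getBytes(StandardCharsets.UTF_8));
        // Assumption: wkhtmltopdf is on the PATH.
        return List.of("wkhtmltopdf", htmlFile.toString(), pdfTarget.toString());
    }

    /** Runs the command and returns the process exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A caller would pass the result of renderTemplate(...) to buildCommand and then check that run(...) returns 0 before serving the PDF.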

How do I edit a XML node in a file object, using Java

There are a lot of examples on the internet of "reading" files but I can't find anything on "editing" a node value and writing it back out to the original file.
I have a non-working xml writer class that looks like this:
import org.w3c.dom.Document;
public class RunIt {
public static Document xmlDocument;
public static void main(String[] args)
throws TransformerException, IOException {
try {
xmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().parse("thor.xml");
} catch (IOException ex) {
ex.printStackTrace();
} catch (SAXException ex) {
ex.printStackTrace();
} catch (ParserConfigurationException ex) {
ex.printStackTrace();
}
addElement("A", "New");
writeDoc();
}
public static void addElement(String path, String val){
Element e = xmlDocument.createElement(path);
e.appendChild(xmlDocument.createTextNode(val));
xmlDocument.getDocumentElement().appendChild(e);
}
public static void writeDoc() throws TransformerException, IOException {
StringWriter writer = new StringWriter();
Transformer tf;
try {
tf = TransformerFactory.newInstance().newTransformer();
tf.transform(new DOMSource(xmlDocument), new StreamResult(writer));
writer.close();
} catch (TransformerConfigurationException e) {
e.printStackTrace();
} catch (TransformerFactoryConfigurationError e) {
e.printStackTrace();
}
}
}
For this example, let's say this is the XML and I want to add a "C" node (inside the A node) that contains the value "New":
<A>
<B>Original</B>
</A>

You use the Document object to create new nodes. Adding nodes as you suggest involves creating a node, setting its content and then appending it to the root element. In this case your code would look something like this:
Element e = xmlDocument.createElement("C");
e.appendChild(xmlDocument.createTextNode("new"));
xmlDocument.getDocumentElement().appendChild(e);
This will add the C node as a new child of A right after the B node.
Additionally, Element has some convenience functions that reduce the amount of required code. The second line above could have been replaced with
e.setTextContent("new");
More complicated efforts involving non-root elements will require XPath to fetch the target node to be edited. If you do start to use XPath to target nodes, bear in mind that the JDK XPath performance is abysmal. Avoid using an XPath of "@foo" in favor of constructs like e.getAttribute("foo") whenever you can.
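As a rough illustration of the XPath route, the sketch below locates a parent node with the JDK's javax.xml.xpath API and appends a child to it. XPathEdit, parse and addChild are names chosen for this example, and the XML is parsed from a string so that it runs without thor.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative helper, assuming the target parent can be addressed by XPath.
final class XPathEdit {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Appends <name>text</name> under the first node matched by parentXPath. */
    static Element addChild(Document doc, String parentXPath, String name, String text)
            throws Exception {
        Node parent = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(parentXPath, doc, XPathConstants.NODE);
        if (parent == null) {
            throw new IllegalArgumentException("No node matches " + parentXPath);
        }
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
        return child;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<A><B>Original</B></A>");
        addChild(doc, "/A", "C", "New");
        // Editing an existing node works the same way once it is located.
        Node b = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("/A/B", doc, XPathConstants.NODE);
        b.setTextContent("Edited");
        System.out.println(doc.getElementsByTagName("C").item(0).getTextContent());
    }
}
```

For a one-off edit the XPath cost is unlikely to matter; the performance warning applies mainly to evaluating expressions in tight loops.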
EDIT: Formatting the document back to a string which can be written to a file can be done with the following code.
Document xmlDocument;
StringWriter writer = new StringWriter();
TransformerFactory.newInstance().newTransformer().transform(new DOMSource(xmlDocument), new StreamResult(writer));
writer.close();
String xmlString = writer.toString();
EDIT: Re: updated question with code.
Your code doesn't work because you're conflating 'path' and 'element name'. The parameter to Document.createElement() is the name of the new node, not the location in which to place it. In the example code I wrote I didn't get into locating the appropriate node because you were asking specifically about adding a node to the document parent element. If you want your addElement() to behave the way I think you're expecting it to behave, you'd have to add another parameter for the xpath of the target parent node.
The other problem with your code is that your writeDoc() function doesn't have any output. My example shows writing the XML to a String value. You can write it to any writer you want by adapting the code, but in your example code you use a StringWriter but never extract the written string out of it.
I would suggest rewriting your code something like this
public static void main(String[] args) throws Exception {
File xmlFile = new File("thor.xml");
Document xmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().parse(xmlFile);
// this is effective because we know we're adding to the
// document root element
// if you want to write to an arbitrary node, you must
// include code to find that node
addTextElement(xmlDocument.getDocumentElement(), "C", "New");
writeDoc(xmlDocument, new FileWriter(xmlFile));
}
public static Element addTextElement(Node parent, String element, String val) {
Element e = addElement(parent, element);
// new nodes are created by the document that owns the parent
e.appendChild(parent.getOwnerDocument().createTextNode(val));
return e;
}
public static Element addElement(Node parent, String element) {
// "element" is the name of the new node, not a path
Element e = parent.getOwnerDocument().createElement(element);
parent.appendChild(e);
return e;
}
public static void writeDoc(Document doc, Writer writer)
throws TransformerException, IOException {
try {
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.transform(new DOMSource(doc), new StreamResult(writer));
} finally {
writer.close();
}
}
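To try the same add-then-write sequence without touching thor.xml, the whole round trip can be done in memory. This is a sketch with invented names (XmlRoundTrip, addToRoot); it parses a string, appends a child to the root, and serializes back to a string with the XML declaration omitted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// In-memory variant for experimenting; names are illustrative only.
final class XmlRoundTrip {

    /** Parses xml, appends <name>text</name> to the root, and returns the new XML. */
    static String addToRoot(String xml, String name, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element e = doc.createElement(name);
        e.setTextContent(text);
        doc.getDocumentElement().appendChild(e);

        Transformer tf = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header so only the elements are returned.
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(addToRoot("<A><B>Original</B></A>", "C", "New"));
    }
}
```

Swapping the StringWriter for a FileWriter gives the file-based behaviour described in the answer.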

In order to write your document back to a file, you'll need an XML serializer or write your own. If you are using the Xerces library, check out XMLSerializer. For sample usage, you can also check out the DOMWriter samples page.
For more information on Xerces, read this
