I'm trying to parse a page with Jsoup, but the html doesn't seem to be parsing correctly.
The general structure is:
<html>
<head> ... </head>
<frameset ...>
<frame ...>
#document
<html> ... </html>
</frame>
</frameset>
</html>
When I parse the HTML and print it with Document doc = Jsoup.parse(html); System.out.println(doc.html()); it prints the outer HTML (including #document), but not the frames or the inner HTML.
Does anyone know how to get the inner html with Jsoup, or should I consider using a different library?
Thanks.
Edit: Here's the site I'm parsing. I have a subscription to it; don't know if it'll let any of you in.
http://database.asahi.com/library2/login/login.php
After authentication, it will take you to: http://database.asahi.com/library2/main/start.php
Edit 2:
<html>
<head></head>
<frameset rows="58,*" border="0">
<frame name="Header"> </frame>
<frame name="Introduce">
#document
<html>
<head>hello</head>
<body>hello again</body>
</html>
</frame>
</frameset>
</html>
Then I run:
Document doc = Jsoup.parse(html);
Elements elems = doc.select("frameset > frame:last-child");
// print(elems);
switch(elems.size()) {
case 0: break;
case 1: doc = Jsoup.connect(elems.first().attr("src")).get(); break;
default: break;
}
System.out.println(doc.html());
The parsed html (doc.html()):
<html>
<head></head>
<body>
 #document hello hello again
</body>
</html>
So it's not even finding <frameset>
Any ideas?
Here is how to parse the nested html:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the page with the frameset
Document doc = Jsoup
        .connect("http://database.asahi.com/library2/login/login.php")
        .get(); // Add login, password etc
// Determine the frame url you want to parse...
// Note: I assume you want to parse the content of the first frame
Elements elts = doc.select("frameset > frame:first-child");
switch (elts.size()) {
    case 0:
        // No frame found ...
        break;
    case 1:
        Element frameElt = elts.first();
        // absUrl resolves a relative src against the page's base URI
        Document frameDoc = Jsoup
                .connect(frameElt.absUrl("src"))
                .get();
        // Add the frameDoc nodes to doc (via frameElt#insertChildren)
        frameElt.insertChildren(0, frameDoc.childNodes());
        break;
    default:
        // Strange result...
        break;
}
System.out.println(doc.html());
I'm working on extracting ISO-8859-2 encoded text from an XML file. It works fine; however, some special characters are written as their corresponding HTML character codes.
The XML file:
<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE TEI.2 SYSTEM "http://mek.oszk.hu/mekdtd/prose/TEI-MEK-prose.dtd">
<!-- ?xml-stylesheet type="text/xsl" href="http://mek.oszk.hu/mekdtd/xsl/boszorkany_txt.xsl"? -->
<TEI.2 id="MEK-00798">
<text type="novel">
<front>
<titlePage>
<docAuthor>Jókai Mór</docAuthor>
<docTitle>
<titlePart>Az arany ember</titlePart>
</docTitle>
</titlePage>
</front>
<body>
<div type="part">
<head>
<title>A Szent Borbála</title>
</head>
<div type="chapter">
<head>
<title>I. A VASKAPU</title>
</head>
<p text-align="justify">A kitartó hetes vihar. Ez járhatlanná teszi a Dunát a Vaskapu
között.
</p>
</div>
</div>
</body>
</text>
</TEI.2>
A snippet of the code I use:
SAXReader reader = new SAXReader();
reader.setEncoding("ISO-8859-2");
Document document = reader.read(file);
Node node = document.selectSingleNode("//*[@type='chapter']/p");
String text = node.getStringValue();
// String text = org.jsoup.parser.Parser.unescapeEntities(node.getStringValue(), true);
// String text = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(node.getStringValue());
I also included in comments some libraries I tried, without any success.
What I want to see is:
A kitartó hetes vihar. - Ez járhatlanná teszi a Dunát a Vaskapu között.
What I see when I debug is:
A kitartó hetes vihar . Ez járhatlanná teszi a Dunát a Vaskapu között.
I am using Jsoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected, except that the closing </base> tag is omitted from the modified HTML.
Is there any way to have Jsoup return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
Document document = Jsoup.parse(html);
document.outputSettings().escapeMode(EscapeMode.xhtml);
Elements baseElements = document.select("base");
if (!baseElements.isEmpty())
{
Element base = baseElements.get(0);
base.attr("href", getBaseUrl(request));
}
return document.html();
}
In addition to (or instead of) the escape mode, you want to set the syntax:
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
I made 2 simple html pages
page1:
<html>
<head>
</head>
<body>
enter page 2
<p>
some data
</p>
</body>
</html>
page2:
<html>
<head>
</head>
<body>
enter page 1
enter page 3
<p>
some other data
</p>
</body>
</html>
I want to get the links using jsoup library
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String absHref = link.attr("href"); // "page2.html/"
Now what I want to do is go from page 1 to page 2 (both are stored locally on my computer) and parse it.
I tried to do this:
Document doc2 = Jsoup.connect(absHref).get();
But it doesn't work; it gives me a 404 error.
EDIT:
Following a short reply by @JonasCz, I tried this and it works, but I think there must be a better, smarter way.
File file = new File(args[0]);
String path = file.getParent() + "\\";
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String href = link.attr("href"); // "page2.html/"
File file2 = new File(path + href);
Document doc2 = Jsoup.parse(file2, "UTF-8", "http://example.com/");
Thank you
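A possibly tidier take on the edit above, as a minimal sketch: java.nio.file can resolve the link relative to the page that contains it, so the separator does not have to be appended by hand. The class name LocalLinkResolver, the method resolveLocalLink, and the "site" directory are hypothetical names chosen for illustration, and the sketch assumes the href is a plain relative file path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalLinkResolver {

    // Resolves a relative href against the directory of the page that contains it.
    public static Path resolveLocalLink(Path currentPage, String href) {
        // resolveSibling replaces the file name of currentPage with href;
        // normalize() collapses any "../" segments in the result.
        return currentPage.resolveSibling(href).normalize();
    }

    public static void main(String[] args) {
        Path page1 = Paths.get("site", "page1.html"); // hypothetical location of page1.html
        Path page2 = resolveLocalLink(page1, "page2.html");
        System.out.println(page2);
    }
}
```

The resulting path can then be handed to Jsoup.parse(page2.toFile(), "UTF-8", "http://example.com/"), just as in the edit above.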
You are on the right track, but you are not creating an absolute URL.
Instead of:
String absHref = link.attr("href"); // "page2.html/"
use:
String absHref = link.absUrl("href"); // this will give you http://example.com/page2.html
The rest is just as you are doing.
http://jsoup.org/apidocs/org/jsoup/nodes/Node.html
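To make the idea of resolving against a base URI concrete, here is a small sketch that uses only java.net.URI from the JDK. It is an illustration of standard URI resolution, not Jsoup's own implementation, and the class name AbsoluteUrlSketch and the docs/ paths are made up for the example.

```java
import java.net.URI;

public class AbsoluteUrlSketch {

    // Resolves href against baseUri using the JDK's standard URI resolution.
    public static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Relative link against a directory-style base
        System.out.println(resolve("http://example.com/", "page2.html"));
        // Relative link against a page URL: the last path segment is replaced
        System.out.println(resolve("http://example.com/docs/page1.html", "page2.html"));
        // Root-relative link: only scheme and host are kept from the base
        System.out.println(resolve("http://example.com/docs/page1.html", "/page3.html"));
    }
}
```

Note that the absolute URL is only useful for Jsoup.connect if the pages are really served from that base URI. For files that exist only on the local disk, parsing them with Jsoup.parse(File, ...) remains the way to go.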
Unfortunately, Jsoup is not a web crawler; it is only a parser with the ability to connect to and fetch pages directly. The crawling logic (e.g. what to fetch or visit next) is your responsibility to implement. You could search for Java web crawlers; one of them may be more suitable.
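If you do want to write the crawling logic yourself, a minimal sketch could look like the following. It is only one possible design (a breadth-first queue plus a visited set). CrawlFrontierSketch, crawl, and linkExtractor are hypothetical names, and the link extraction is passed in as a function so the example runs without any network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlFrontierSketch {

    // Visits pages breadth-first, starting at start, and returns them in visit order.
    // linkExtractor maps a page URL to the URLs it links to.
    public static List<String> crawl(String start, int maxPages,
                                     Function<String, List<String>> linkExtractor) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already handled, skip to avoid loops
            }
            for (String next : linkExtractor.apply(url)) {
                if (!visited.contains(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // In-memory stand-in for the three pages: page1 -> page2, page2 -> page1 and page3
        Map<String, List<String>> site = new HashMap<>();
        site.put("page1", Arrays.asList("page2"));
        site.put("page2", Arrays.asList("page1", "page3"));
        site.put("page3", Collections.<String>emptyList());

        List<String> order = crawl("page1", 10,
                url -> site.getOrDefault(url, Collections.<String>emptyList()));
        System.out.println(order); // prints [page1, page2, page3]
    }
}
```

In a real crawler, the linkExtractor would typically parse each page with Jsoup and return something like doc.select("a[href]").eachAttr("abs:href").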
I have the following html (sized down for literary content) that is passed into a java method.
However, I want to take this passed in html string and add a <pre> tag that contains some text passed in and add a section of <script type="text/javascript"> to the head.
String buildHTML(String htmlString, String textToInject)
{
// Inject textToInject into a pre tag and add the javascript section
String newHTMLString = <build new html sections>
}
-- htmlString --
<html>
<head>
</head>
<body>
</body>
</html>
-- newHTMLString
<html>
<head>
<script type="text/javascript">
window.onload=function(){alert("hello?");}
</script>
</head>
<body>
<div id="1">
<pre>
<!-- Inject textToInject here into a newly created pre tag-->
</pre>
</div>
</body>
</html>
What is the best tool to do this from within java other than a regex?
Here's how to do this with Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public String buildHTML(String htmlString, String textToInject)
{
    // Create a document from the string
    Document doc = Jsoup.parse(htmlString);
    // Create the script tag in head
    doc.head().appendElement("script")
            .attr("type", "text/javascript")
            .text("window.onload=function(){alert('hello?');}");
    // Create the div tag
    Element div = doc.body().appendElement("div").attr("id", "1");
    // Create the pre tag and put the injected text into it
    Element pre = div.appendElement("pre");
    pre.text(textToInject);
    // Return as string
    return doc.toString();
}
I've used method chaining a lot, which means that:
doc.body().appendElement(...).attr(...).text(...)
is exactly the same as
Element example = doc.body().appendElement(...);
example.attr(...);
example.text(...);
Example:
final String html = "<html>\n"
        + " <head>\n"
        + " </head>\n"
        + " <body>\n"
        + " </body>\n"
        + "</html>";
String result = buildHTML(html, "This is a test.");
System.out.println(result);
Result:
<html>
<head>
<script type="text/javascript">window.onload=function(){alert('hello?');}</script>
</head>
<body>
<div id="1">
<pre>This is a test.</pre>
</div>
</body>
</html>
I am trying to fetch data that sits inside an iframe tag. I want the data under its body tag, but I am not getting the proper output. Here is what my HTML page looks like:
<iframe id="sharetools-iframe" width="100%" height="100%" frameborder="0" style="z-index: 1000; position: relative; visibility: visible; " scrolling="no" allowtransparency="true" src="some url here"></iframe>
#document
<html><head><script type="text/javascript">window.WIDGET_ID = 'sharepopup';</script>
<base href="href here">
<link rel="stylesheet" href="href here">
<script src="url here"></script>
<script type="text/javascript" src="url here"></script></head>
<body>
body contents are here........
..............................
</div></div></body></html>
I have used the following code:
WebDriver webDriver = new FirefoxDriver();
webDriver.get("url to open");
String htmlPage = webDriver.getPageSource();
Tidy tidy = new Tidy();
InputStream inputStream = new ByteArrayInputStream(htmlPage.getBytes());
Document doc = tidy.parseDOM(inputStream, null);
Element element = doc.getElementById("sharetools-iframe");
Here element is null. I have also tried casting it as follows:
HTMLIFrameElement iframeElement = (HTMLIFrameElement) doc.getElementById("sharetools-iframe");
But iframeElement is still null.
I have also used the jsoup parser as follows:
Document doc = Jsoup.parse(htmlPage);
org.jsoup.nodes.Element iframeElement= doc.getElementById("sharetools-iframe");
Here I get <iframe id="sharetools-iframe" width="100%" height="100%" frameborder="0" style="z-index: 1000; position: relative; visibility: visible; " scrolling="no" allowtransparency="true" src="some url here"></iframe> as iframeElement, but not the body content.
Please advise how to fetch the body content of the iframe.
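You need to switch to the iframe first; then you can interact with its content.
For example, to get the body:
WebElement body = webDriver.switchTo().frame(webDriver.findElement(By.id("sharetools-iframe"))).findElement(By.tagName("body"));
String bodyText = body.getText(); // visible text of the iframe's body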