I have the following code:
Document mainContent = new Document();
Element rootElement = new Element("html");
mainContent.setContent(rootElement);
Element headElement = new Element("head");
Element metaElement = new Element("meta");
metaElement.setAttribute("content", "text/html; charset=utf-8");
headElement.addContent(metaElement);
rootElement.addContent(headElement);
org.jdom2.output.Format format = org.jdom2.output.Format.getPrettyFormat().setOmitDeclaration(true);
XMLOutputter outputter = new XMLOutputter(format);
System.out.println(outputter.outputString(mainContent));
This produces the following output:
<html>
<head>
<meta content="text/html; charset=utf-8" />
</head>
</html>
Now, I have the following string:
String links = "<link src=\"mysrc1\" /><link src=\"mysrc2\" />";
How can I add it to the HTML element so the output will be:
<html>
<head>
<meta content="text/html; charset=utf-8" />
<link src="mysrc1" />
<link src="mysrc2" />
</head>
</html>
Please note that the string as a whole is NOT a valid XML element, but each link on its own is a valid XML element.
I don't mind using another XML parser if needed. I am already using HTMLCleaner elsewhere in my code, if that helps.
You can wrap your XML snippet in a root element, parse the result as a valid XML document, and then extract the nodes you want (basically anything other than the root node):
links = "<root>" + links + "</root>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(false);
DocumentBuilder builder = factory.newDocumentBuilder();
// Document, Node and NodeList here are the org.w3c.dom types, not the JDOM ones
Document doc = builder.parse(new ByteArrayInputStream(links.getBytes(StandardCharsets.UTF_8)));
NodeList nl = doc.getDocumentElement().getChildNodes();
for (int temp = 0; temp < nl.getLength(); temp++) {
    Node nNode = nl.item(temp);
    // Here you create your new Element based on the Node nNode, and then add it to the new DOM you're building
}
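The wrap-and-parse idea above can be sketched end to end using only the JDK's built-in org.w3c.dom API (not JDOM2, which the question uses). The class and method names (FragmentAppender, appendFragment, buildPage) are made up for illustration, and the output method is forced to "xml" on the assumption that XML-style serialization is what is wanted.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class FragmentAppender {

    /** Parses a multi-root XML fragment and appends a copy of each element to target. */
    static void appendFragment(Element target, String fragment)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The temporary <root> wrapper turns the fragment into a well-formed document.
        Document wrapped = builder.parse(
                new InputSource(new StringReader("<root>" + fragment + "</root>")));
        NodeList children = wrapped.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // importNode creates a deep copy owned by the target's document.
                target.appendChild(target.getOwnerDocument().importNode(child, true));
            }
        }
    }

    /** Builds the html/head/meta tree from the question, appends the links, serializes it. */
    static String buildPage(String links) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            Element head = doc.createElement("head");
            html.appendChild(head);
            Element meta = doc.createElement("meta");
            meta.setAttribute("content", "text/html; charset=utf-8");
            head.appendChild(meta);
            appendFragment(head, links);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            // Force XML output so the serializer does not switch to HTML rules.
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not build the page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildPage("<link src=\"mysrc1\" /><link src=\"mysrc2\" />"));
    }
}
```

The copy via importNode is needed because a DOM node belongs to the document that created it; appending it directly to a different document would fail.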
I made two simple HTML pages.
page1:
<html>
<head>
</head>
<body>
enter page 2
<p>
some data
</p>
</body>
</html>
page2:
<html>
<head>
</head>
<body>
enter page 1
enter page 3
<p>
some other data
</p>
</body>
</html>
I want to get the links using the jsoup library:
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String absHref = link.attr("href"); // "page2.html/"
Now what I want to do is follow the link from page 1 to page 2 (both are stored locally on my computer) and parse page 2.
I tried this:
Document doc2 = Jsoup.connect(absHref).get();
But it doesn't work; I get a 404 error.
EDIT:
Following a short reply by @JonasCz, I tried this, and it works. I just think there must be a better and smarter way.
File file = new File(args[0]);
String path = file.getParent() + "\\";
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String href = link.attr("href"); // "page2.html/"
File file2 = new File(path+href);
Document doc2 = Jsoup.parse(file2, "UTF-8", "http://example.com/");
Thank you
You are on the right track, but you are not creating an absolute URL.
Instead of:
String absHref = link.attr("href"); // "page2.html/"
Use:
String absHref = link.absUrl("href"); // this will give you http://example.com/page2.html
The rest is just as you are doing.
http://jsoup.org/apidocs/org/jsoup/nodes/Node.html
Unfortunately, jsoup is not a web crawler; it is only a parser with the ability to connect to and fetch pages directly. The crawling logic, e.g. what to fetch or visit next, is your responsibility to implement. You could search for Java web crawlers; something else might be more suitable.
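When the pages exist only on disk, as in the question's edit, the relative href can also be resolved against the location of the page that contains it, using plain java.io and java.net. This is only a sketch: LocalLinkResolver and resolve are hypothetical names, and it assumes the href is a simple relative path with no spaces, query string, or #fragment.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper, not part of jsoup.
class LocalLinkResolver {

    /** Resolves a relative href against the directory of the page that contains it. */
    static File resolve(File page, String href) {
        // toURI() yields an absolute file: URI; resolve() applies the usual
        // relative-URL rules, including "../" segments.
        URI target = page.getAbsoluteFile().toURI().resolve(href);
        return new File(target);
    }

    public static void main(String[] args) {
        File page1 = new File("pages/page1.html");
        // The returned File can then be passed to Jsoup.parse(File, String, String).
        System.out.println(resolve(page1, "page2.html"));
    }
}
```

Compared with concatenating file.getParent() and "\\", this avoids hard-coding a platform-specific separator.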
I have
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
and I want to select the value of its content attribute. I tried this, without success:
Document doc = Jsoup.connect("http://www.somesite.com/index.html").get();
Element link = doc.select("meta").first();
String content = link.attr("content");
But in my html I have:
<div style="overflow: visible;" itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="url" content="http://www.somesite.com/index.html">
<meta itemprop="headline" content="some text">
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
<meta itemprop="dateModified" content="2015-01-26 14:03:16">
As you can see, the one I'm looking for is the third meta tag, and I can't select it.
Element link= doc.select("meta").first();
This selects only the first meta element found; since your second HTML snippet contains more than one, you get the wrong result.
But here's an example:
final String html = "<div style=\"overflow: visible;\" itemscope=\"\" itemtype=\"http://schema.org/Article\">\n"
+ "<meta itemprop=\"url\" content=\"http://www.somesite.com/index.html\">\n"
+ "<meta itemprop=\"headline\" content=\"some text\">\n"
+ "<meta itemprop=\"datePublished\" content=\"2015-01-26 12:37:00\">\n"
+ "<meta itemprop=\"dateModified\" content=\"2015-01-26 14:03:16\">";
Document doc = Jsoup.parse(html);
Element meta = doc.select("meta[itemprop=datePublished]").first();
String content = meta.attr("content");
System.out.println(content);
Output: 2015-01-26 12:37:00
This selects all meta elements whose itemprop attribute has the value datePublished. Of those, only the first is taken. Finally, you read the value of the content attribute from that single element.
I'm trying to parse a page with Jsoup, but the html doesn't seem to be parsing correctly.
The general structure is:
<html>
<head> ... </head>
<frameset ...>
<frame ...>
#document
<html> ... </html>
</frame>
</frameset>
</html>
When I parse the html and print it with Document doc = Jsoup.parse(html); System.out.println(doc.html()); it prints out the outer html (including #document, but not the frames or inner html).
Does anyone know how to get the inner html with Jsoup, or should I consider using a different library?
Thanks.
Edit: Here's the site I'm parsing. I have a subscription to it; don't know if it'll let any of you in.
http://database.asahi.com/library2/login/login.php
After authentication, it will take you to: http://database.asahi.com/library2/main/start.php
Edit 2:
<html>
<head></head>
<frameset rows="58,*" border="0">
<frame name="Header"> </frame>
<frame name="Introduce">
#document
<html>
<head>hello</head>
<body>hello again</body>
</html>
</frame>
</frameset>
</html>
Then I run:
Document doc = Jsoup.parse(html);
Elements elems = doc.select("frameset > frame:last-child");
// print(elems);
switch (elems.size()) {
    case 0: break;
    case 1: doc = Jsoup.connect(elems.first().attr("src")).get(); break;
    default: break;
}
System.out.println(doc.html());
The parsed html (doc.html()):
<html>
<head></head>
<body>
 #document hello hello again
</body>
</html>
So it's not even finding <frameset>
Any ideas?
Here is how to parse the nested html:
// Fetch the page with frameset
Document doc = Jsoup
        .connect("http://database.asahi.com/library2/login/login.php")
        .get(); // Add login, password etc
// Determine the frame url you want to parse...
// Note: I assume you want to parse the content of the first frame
Elements elts = doc.select("frameset > frame:first-child");
switch (elts.size()) {
    case 0:
        // No frame found ...
        break;
    case 1:
        Element frameElt = elts.first();
        Document frameDoc = Jsoup
                .connect(frameElt.attr("src"))
                .get();
        // Add the frameDoc nodes to doc (via frameElt#insertChildren)
        frameElt.insertChildren(0, frameDoc.childNodes());
        break;
    default:
        // Strange result...
}
System.out.println(doc.html());
Can anyone help with extracting CSS styles from HTML using jsoup in Java?
For example, in the HTML below I want to extract .ft00 and .ft01:
<HTML>
<HEAD>
<TITLE>Page 1</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<DIV style="position:relative;width:931;height:1243;">
<STYLE type="text/css">
<!--
.ft00{font-size:11px;font-family:Times;color:#ffffff;}
.ft01{font-size:11px;font-family:Times;color:#ffffff;}
-->
</STYLE>
</HEAD>
</HTML>
If the style is set inline on your element, you just have to use .attr("style").
jsoup is not an HTML renderer, just an HTML parser, so you will have to parse the content of the retrieved <style> tag yourself. You can use a simple regex for this, but it won't work in all cases; you may want to use a CSS parser for this task.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {
    public static void main(String[] args) throws Exception {
        String html = "<HTML>\n" +
                "<HEAD>\n" +
                "<TITLE>Page 1</TITLE>\n" +
                "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
                "<DIV style=\"position:relative;width:931;height:1243;\">\n" +
                "<STYLE type=\"text/css\">\n" +
                "<!--\n" +
                " .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n" +
                " .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n" +
                "-->\n" +
                "</STYLE>\n" +
                "</HEAD>\n" +
                "</HTML>";
        Document doc = Jsoup.parse(html);
        Element style = doc.select("style").first();
        // group(1) is the class name, group(2) the declarations between the braces
        Matcher cssMatcher = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]").matcher(style.html());
        while (cssMatcher.find()) {
            System.out.println("Style `" + cssMatcher.group(1) + "`: " + cssMatcher.group(2));
        }
    }
}
Will output:
Style `ft00`: font-size:11px;font-family:Times;color:#ffffff;
Style `ft01`: font-size:11px;font-family:Times;color:#ffffff;
Try this:
Document document = Jsoup.parse(html);
String style = document.select("style").first().data();
You can then use a CSS parser to fetch the details you are interested in.
http://www.w3.org/Style/CSS/SAC
http://cssparser.sourceforge.net
https://github.com/corgrath/osbcp-css-parser#readme
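If a full CSS parser is more than you need, the regex approach can be taken one step further by splitting each rule into property/value pairs. This is a rough sketch using only the JDK; SimpleCssRules and parse are made-up names, and it assumes simple ".class { prop: value; }" rules with no nested blocks, @media rules, comments inside rules, or semicolons inside values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a real CSS parser handles far more syntax than this.
class SimpleCssRules {

    // Matches ".className { declarations }" for simple class selectors only.
    private static final Pattern RULE = Pattern.compile("[.]([\\w-]+)\\s*[{]([^}]*)[}]");

    /** Returns class name -> (property -> value), keeping source order. */
    static Map<String, Map<String, String>> parse(String css) {
        Map<String, Map<String, String>> rules = new LinkedHashMap<>();
        Matcher m = RULE.matcher(css);
        while (m.find()) {
            Map<String, String> props = new LinkedHashMap<>();
            for (String decl : m.group(2).split(";")) {
                int colon = decl.indexOf(':');
                if (colon > 0) {
                    // Text before the first colon is the property, the rest is the value.
                    props.put(decl.substring(0, colon).trim(), decl.substring(colon + 1).trim());
                }
            }
            rules.put(m.group(1), props);
        }
        return rules;
    }

    public static void main(String[] args) {
        // In practice this string would come from the <style> element's data.
        String css = "<!--\n.ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + ".ft01{font-size:11px;font-family:Times;color:#ffffff;}\n-->";
        System.out.println(parse(css));
    }
}
```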
How do you quickly locate one or more elements via an XPath string on a given org.w3c.dom.Document? There seems to be no FindElementsByXpath() method. For example:
/html/body/p/div[3]/a
I found that recursively iterating through all the child node levels is quite slow when there are a lot of elements with the same name. Any suggestions?
I cannot use any parser or library, must work with w3c dom document only.
Try this:
//obtain Document somehow, doesn't matter how
DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream("page.html"));
//Evaluate XPath against Document itself
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate("/html/body/p/div[3]/a",
        doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    Element e = (Element) nodes.item(i);
}
With the following page.html file:
<html>
<head>
</head>
<body>
<p>
<div></div>
<div></div>
<div><a>link</a></div>
</p>
</body>
</html>
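If the same path is looked up repeatedly, the expression can be compiled once with XPath.compile and the resulting XPathExpression evaluated as often as needed. The sketch below uses only the JDK; XPathLookup and select are hypothetical names, and the input is assumed to be well-formed XML, as page.html above is.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper built only on JDK classes.
class XPathLookup {

    /** Parses well-formed markup and returns the text content of every node the path selects. */
    static List<String> select(String xml, String path) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Compile once; the expression could be kept and evaluated against other documents.
            XPathExpression expr = XPathFactory.newInstance().newXPath().compile(path);
            NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + path, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head/><body><p><div/><div/><div><a>link</a></div></p></body></html>";
        System.out.println(select(page, "/html/body/p/div[3]/a"));
    }
}
```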