How to get attribute content Jsoup? - java

I have
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
and I want to select the value of its content attribute. I tried this without success:
Document doc = Jsoup.connect("http://www.somesite.com/index.html").get();
Element link= doc.select("meta").first();
String content = link.attr("content");
But in my html I have:
<div style="overflow: visible;" itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="url" content="http://www.somesite.com/index.html">
<meta itemprop="headline" content="some text">
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
<meta itemprop="dateModified" content="2015-01-26 14:03:16">
As you can see, I am looking for the third meta tag, and I can't select it.

Element link= doc.select("meta").first();
This will select only the first meta-element found; since you have more than one in your second html, you'll get the wrong result.
But here's an example:
final String html = "<div style=\"overflow: visible;\" itemscope=\"\" itemtype=\"http://schema.org/Article\">\n"
+ "<meta itemprop=\"url\" content=\"http://www.somesite.com/index.html\">\n"
+ "<meta itemprop=\"headline\" content=\"some text\">\n"
+ "<meta itemprop=\"datePublished\" content=\"2015-01-26 12:37:00\">\n"
+ "<meta itemprop=\"dateModified\" content=\"2015-01-26 14:03:16\">";
Document doc = Jsoup.parse(html);
Element meta = doc.select("meta[itemprop=datePublished]").first();
String content = meta.attr("content");
System.out.println(content);
Output: 2015-01-26 12:37:00
This will select all meta-elements with attribute itemprop and attribute value datePublished. From all found, just the first is taken. Finally from the single element you can get the value of the content-attribute.

Related

JSoup Not Producing Valid XHTML

I am using JSoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected apart from the fact the closing </base> tag is omitted from the modified HTML.
Is there any way to have JSOUP return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
Document document = Jsoup.parse(html);
document.outputSettings().escapeMode(EscapeMode.xhtml);
Elements baseElements = document.select("base");
if (!baseElements.isEmpty())
{
Element base = baseElements.get(0);
base.attr("href", getBaseUrl(request));
}
return document.html();
}
In addition to (or instead of) the escape mode, you want to set the syntax:
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

How do I enter a url link using jsoup

I made 2 simple html pages
page1:
<html>
<head>
</head>
<body>
enter page 2
<p>
some data
</p>
</body>
</html>
page2:
<html>
<head>
</head>
<body>
enter page 1
enter page 3
<p>
some other data
</p>
</body>
</html>
I want to get the links using the jsoup library:
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String absHref = link.attr("href"); // "page2.html/"
Now what I want to do is follow the link from page 1 to page 2 (both are stored locally on my computer) and parse it.
I tried to do this:
Document doc2 = Jsoup.connect(absHref).get();
But it doesn't work; I get a 404 error.
EDIT:
Following a short reply by @JonasCz I tried the code below. It works, but I think there must be a better and smarter way.
File file = new File(args[0]);
String path = file.getParent() + "\\";
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String href = link.attr("href"); // "page2.html/"
File file2 = new File(path+href);
Document doc2 = Jsoup.parse(file2, "UTF-8", "http://example.com/");
Thank you
You are on the right track, but you are not creating an absolute URL.
Instead of:
String absHref = link.attr("href"); // "page2.html/"
Use:
String absHref = link.absUrl("href"); // this will give you http://example.com/page2.html
The rest is just as you are doing.
http://jsoup.org/apidocs/org/jsoup/nodes/Node.html
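To make the difference between a relative href and an absolute one concrete, here is a minimal sketch that uses only the JDK. The class and method names (LinkResolver, resolveAgainstBase, resolveSibling) are made up for illustration. java.net.URI.resolve performs the same kind of resolution against a base URI. Because the pages in the question live on disk rather than on a server, the second method resolves the href against the directory of the current file, which is essentially what the EDIT in the question does by hand.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LinkResolver {

    // Resolve a (possibly relative) href against a base URI.
    public static String resolveAgainstBase(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    // Resolve a relative href against the directory that holds the current local file.
    public static Path resolveSibling(Path currentFile, String href) {
        return currentFile.resolveSibling(href).normalize();
    }

    public static void main(String[] args) {
        // Absolute URL built from the base URI given to the parser
        System.out.println(resolveAgainstBase("http://example.com/", "page2.html"));

        // Path of the linked page next to page1.html on disk
        System.out.println(resolveSibling(Paths.get("site", "page1.html"), "page2.html"));
    }
}
```

An absolute http URL is only useful with Jsoup.connect if a server actually serves the page there. For files on disk, the resolved Path can be turned into a File and handed to Jsoup.parse as in the question.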
Unfortunately, Jsoup is not a web crawler; it is only a parser with the ability to connect to and fetch pages directly. The crawling logic, e.g. what to fetch or visit next, is your responsibility to implement. You could search for web crawlers for Java; maybe one of those would be more suitable.
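As a rough idea of what such crawling logic can look like, here is a small breadth-first sketch in plain Java. Everything in it is hypothetical: the CrawlFrontier class, the crawl method, and the linksOf function, which stands in for whatever code parses a page and returns its links (with Jsoup that would be a select("a") over the parsed document). It only shows the bookkeeping: a queue of pages still to visit and a set of pages already seen, so that pages linking back to each other are not visited twice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlFrontier {

    // Visit pages breadth-first, starting at 'start', and stop after maxPages pages.
    // 'linksOf' is a placeholder for the code that extracts links from a page.
    public static List<String> crawl(String start,
                                     Function<String, List<String>> linksOf,
                                     int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();

        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String page = queue.poll();
            visited.add(page);
            for (String link : linksOf.apply(page)) {
                // Set.add returns false for links that were already queued
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the link structure of three local pages
        Map<String, List<String>> site = Map.of(
                "page1.html", List.of("page2.html"),
                "page2.html", List.of("page1.html", "page3.html"),
                "page3.html", List.of());

        System.out.println(crawl("page1.html", p -> site.getOrDefault(p, List.of()), 10));
    }
}
```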

How to inject snippets of html into an string containing valid html?

I have the following HTML (trimmed down for brevity) that is passed into a Java method.
I want to take this HTML string, add a <pre> tag containing some passed-in text to the body, and add a <script type="text/javascript"> section to the head.
String buildHTML(String htmlString, String textToInject)
{
// Inject textToInject into a pre tag and add the javascript section
String newHTMLString = <build new html sections>
}
-- htmlString --
<html>
<head>
</head>
<body>
</body>
</html>
-- newHTMLString
<html>
<head>
<script type="text/javascript">
window.onload=function(){alert("hello?");}
</script>
</head>
<body>
<div id="1">
<pre>
<!-- Inject textToInject here into a newly created pre tag-->
</pre>
</div>
</body>
</html>
What is the best tool to do this from within java other than a regex?
Here's how to do this with Jsoup:
public String buildHTML(String htmlString, String textToInject)
{
// Create a document from string
Document doc = Jsoup.parse(htmlString);
// create the script tag in head
doc.head().appendElement("script")
.attr("type", "text/javascript")
.text("window.onload=function(){alert('hello?');}");
// Create div tag
Element div = doc.body().appendElement("div").attr("id", "1");
// Create pre tag
Element pre = div.appendElement("pre");
pre.text(textToInject);
// Return as string
return doc.toString();
}
I've used method chaining a lot, which means that
doc.body().appendElement(...).attr(...).text(...)
is exactly the same as
Element example = doc.body().appendElement(...);
example.attr(...);
example.text(...);
Example:
final String html = "<html>\n"
+ " <head>\n"
+ " </head>\n"
+ " <body>\n"
+ " </body>\n"
+ "</html>";
String result = buildHTML(html, "This is a test.");
System.out.println(result);
Result:
<html>
<head>
<script type="text/javascript">window.onload=function(){alert('hello?');}</script>
</head>
<body>
<div id="1">
<pre>This is a test.</pre>
</div>
</body>
</html>

Add invalid xml elements to xml document java

I have the following code:
Document mainContent = new Document();
Element rootElement = new Element("html");
mainContent.setContent(rootElement);
Element headElement = new Element("head");
Element metaElement = new Element("meta");
metaElement.setAttribute("content", "text/html; charset=utf-8");
headElement.addContent(metaElement);
rootElement.addContent(headElement);
org.jdom2.output.Format format = org.jdom2.output.Format.getPrettyFormat().setOmitDeclaration(true);
XMLOutputter outputter = new XMLOutputter(format);
System.out.println(outputter.outputString(mainContent));
This will produce the output :
<html>
<head>
<meta content="text/html; charset=utf-8" />
</head>
</html>
Now, I have the following string:
String links = "<link src=\"mysrc1\" /><link src=\"mysrc2\" />";
How can I add it to the HTML element so the output will be:
<html>
<head>
<meta content="text/html; charset=utf-8" />
<link src="mysrc1" />
<link src="mysrc2" />
</head>
</html>
Please note that it's NOT a valid XML element altogether, but each link is a valid XML Element.
I don't mind using another XML parser if needed. I am already using somewhere else in my code HTMLCleaner if it helps.
You can do something like they mention here. Basically, wrap your XML snippet in a root element so that it becomes a well-formed document:
links = "<root>" + links + "</root>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(false);
DocumentBuilder builder = factory.newDocumentBuilder();
// parse() expects an InputStream, so wrap the bytes of the string
Document doc = builder.parse(new ByteArrayInputStream(links.getBytes()));
NodeList nl = doc.getDocumentElement().getChildNodes();
for (int temp = 0; temp < nl.getLength(); temp++) {
Node nNode = nl.item(temp);
// Here you create your new Element based on the Node nNode, and then add it to the new DOM you're building
}
This parses links as a valid XML document and lets you extract the nodes you want (basically, everything other than the root node).
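The wrapping approach can be sketched end to end with the DOM classes that ship with the JDK. Note the assumptions: the target document here is a W3C DOM document rather than the JDOM2 document from the question, and the names FragmentImporter, buildHead and appendFragment are invented for this example. The fragment is wrapped in a temporary root element, parsed, and each child element is copied into the target with importNode.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FragmentImporter {

    // Parse a multi-element fragment and append its elements to 'target'.
    // Returns the number of elements that were appended.
    public static int appendFragment(Element target, String fragment) throws Exception {
        // A document needs exactly one root, so wrap the fragment in a temporary one
        String wrapped = "<root>" + fragment + "</root>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document parsed = builder.parse(
                new ByteArrayInputStream(wrapped.getBytes(StandardCharsets.UTF_8)));

        NodeList children = parsed.getDocumentElement().getChildNodes();
        int added = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // importNode creates a deep copy owned by the target document
                target.appendChild(target.getOwnerDocument().importNode(child, true));
                added++;
            }
        }
        return added;
    }

    // Build <html><head><meta .../></head></html> and return the head element.
    public static Element buildHead() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        doc.appendChild(html);
        Element head = doc.createElement("head");
        html.appendChild(head);
        Element meta = doc.createElement("meta");
        meta.setAttribute("content", "text/html; charset=utf-8");
        head.appendChild(meta);
        return head;
    }

    public static void main(String[] args) throws Exception {
        Element head = buildHead();
        int added = appendFragment(head, "<link src=\"mysrc1\" /><link src=\"mysrc2\" />");
        System.out.println(added + " elements added, head now has "
                + head.getChildNodes().getLength() + " children");
    }
}
```

With JDOM2 the same idea should carry over: parse the wrapped string, detach the children of the temporary root, and add them to the head element.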

Extract CSS Styles from HTML using JSOUP in JAVA

Can anyone help with extracting CSS styles from HTML using Jsoup in Java?
For example, in the HTML below I want to extract .ft00 and .ft01:
<HTML>
<HEAD>
<TITLE>Page 1</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<DIV style="position:relative;width:931;height:1243;">
<STYLE type="text/css">
<!--
.ft00{font-size:11px;font-family:Times;color:#ffffff;}
.ft01{font-size:11px;font-family:Times;color:#ffffff;}
-->
</STYLE>
</HEAD>
</HTML>
If the style is set inline on your Element, you just have to use .attr("style").
Jsoup is not an HTML renderer, only an HTML parser, so you will have to parse the CSS yourself from the content of the retrieved <style> tag. You can use a simple regex for this, but it won't work in all cases. You may want to use a CSS parser for this task.
public class Test {
public static void main(String[] args) throws Exception {
String html = "<HTML>\n" +
"<HEAD>\n"+
"<TITLE>Page 1</TITLE>\n"+
"<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"+
"<DIV style=\"position:relative;width:931;height:1243;\">\n"+
"<STYLE type=\"text/css\">\n"+
"<!--\n"+
" .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"+
" .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"+
"-->\n"+
"</STYLE>\n"+
"</HEAD>\n"+
"</HTML>";
Document doc = Jsoup.parse(html);
Element style = doc.select("style").first();
Matcher cssMatcher = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]").matcher(style.html());
while (cssMatcher.find()) {
System.out.println("Style `" + cssMatcher.group(1) + "`: " + cssMatcher.group(2));
}
}
}
Will output:
Style `ft00`: font-size:11px;font-family:Times;color:#ffffff;
Style `ft01`: font-size:11px;font-family:Times;color:#ffffff;
Try this:
Document document = Jsoup.parse(html);
String style = document.select("style").first().data();
You can then use a CSS parser to fetch the details you are interested in.
http://www.w3.org/Style/CSS/SAC
http://cssparser.sourceforge.net
https://github.com/corgrath/osbcp-css-parser#readme
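If a full CSS parser is more than you need, the string returned by data() can be split further with the JDK alone. The following is only a sketch under narrow assumptions: it handles simple class rules of the form .name{prop:value;...} like the ones in the question, and it will not cope with nested blocks, comments inside rules, or other selector types. SimpleCssRules and parse are names invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCssRules {

    // Matches ".className { declarations }" for simple class selectors only
    private static final Pattern RULE = Pattern.compile("\\.([\\w-]+)\\s*\\{([^}]*)\\}");

    // Returns class name -> (property -> value), in document order.
    public static Map<String, Map<String, String>> parse(String css) {
        Map<String, Map<String, String>> rules = new LinkedHashMap<>();
        Matcher m = RULE.matcher(css);
        while (m.find()) {
            Map<String, String> props = new LinkedHashMap<>();
            for (String decl : m.group(2).split(";")) {
                int colon = decl.indexOf(':');
                if (colon > 0) {
                    props.put(decl.substring(0, colon).trim(),
                              decl.substring(colon + 1).trim());
                }
            }
            rules.put(m.group(1), props);
        }
        return rules;
    }

    public static void main(String[] args) {
        // Same text as the <style> content in the question
        String css = "<!--\n"
                + " .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + " .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + "-->";
        parse(css).forEach((name, props) -> System.out.println(name + " -> " + props));
    }
}
```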
