JSoup Not Producing Valid XHTML

I am using JSoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected, except that the <base> tag is no longer closed in the modified HTML.
Is there any way to have JSoup return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
    Document document = Jsoup.parse(html);
    document.outputSettings().escapeMode(EscapeMode.xhtml);
    Elements baseElements = document.select("base");
    if (!baseElements.isEmpty())
    {
        Element base = baseElements.get(0);
        base.attr("href", getBaseUrl(request));
    }
    return document.html();
}

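The escape mode only controls how entities are escaped. In addition to (or instead of) setting it, set the output syntax to XML so that empty elements such as <base> are written self-closed: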
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
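To check that the serialized output really is well-formed, one option is to run it through the JDK's built-in XML parser. This is only a minimal sketch and not part of jsoup: the class name WellFormedCheck and the helper isWellFormedXml are made up for illustration, and it tests well-formedness only, not validity against the XHTML DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Hypothetical helper: true if the markup parses as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // for example, an element that is never closed
        } catch (Exception e) {
            throw new IllegalStateException("XML parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        // The <base> element as written with the default HTML syntax (not closed).
        String htmlSyntax = "<html><head><base href=\"https://myhost:8080/myapp/\"></head><body></body></html>";
        // The same element written self-closed.
        String xmlSyntax = "<html><head><base href=\"https://myhost:8080/myapp/\" /></head><body></body></html>";
        System.out.println(isWellFormedXml(htmlSyntax));
        System.out.println(isWellFormedXml(xmlSyntax));
    }
}
```

The first string fails because <base> is opened but never closed before </head>; the second parses cleanly.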

Related

Validity of html

I want to pass complete HTML as a string and then check whether the given string is valid HTML or not.
public boolean isValidHTML(String htmlData)
Description: checks whether the given HTML data is valid HTML or not.
htmlData: an HTML document in the form of a string which contains tags and data.
Returns: true if the given htmlData contains only valid tags with their allowed attributes and their possible values, otherwise false.
A valid HTML:
<html>
<head>
<title>Page Title</title>
</head>
<body>
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
<b>This text is bold</b>
</body>
</html>
The Java code should look like this:
import java.util.Scanner;
class HtmlValidator {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String html = "pass the html here";
        isValidHtml(html);
    }
    public static boolean isValidHtml(String html) {
        // write code here
        // method returns true if the given html is valid
        // please help
        return false; // placeholder so that the stub compiles
    }
}

Rather than writing regex to parse and check (which is generally A Bad Idea), you're better off using something like jsoup to parse it and check for errors.
From https://jsoup.org/cookbook/input/parse-document-from-string:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

How do I enter a url link using jsoup

I made 2 simple html pages
page1:
<html>
<head>
</head>
<body>
enter page 2
<p>
some data
</p>
</body>
</html>
page2:
<html>
<head>
</head>
<body>
enter page 1
enter page 3
<p>
some other data
</p>
</body>
</html>
I want to get the links using the jsoup library:
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String absHref = link.attr("href"); // "page2.html/"
Now what I want to do is follow the link from page 1 to page 2 (it is stored locally on my computer) and parse it.
I tried to do this:
Document doc2 = Jsoup.connect(absHref).get();
But it doesn't work; it gives me a 404 error.
EDIT:
Following a short reply by @JonasCz, I tried the following. It works, but I think there is a better and smarter way.
File file = new File(args[0]);
String path = file.getParent() + "\\";
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String href = link.attr("href"); // "page2.html/"
File file2 = new File(path+href);
Document doc2 = Jsoup.parse(file2, "UTF-8", "http://example.com/");
Thank you

You are on the right track, but you are not creating an absolute URL.
Instead of:
String absHref = link.attr("href"); // "page2.html/"
use:
String absHref = link.absUrl("href"); // this will give you http://example.com/page2.html
The rest is just as you are doing.
http://jsoup.org/apidocs/org/jsoup/nodes/Node.html
Unfortunately, jsoup is not a web crawler, only a parser with the ability to connect directly and fetch pages. The crawling logic, e.g. what to fetch or visit next, is your responsibility to implement. You could search for web crawlers for Java; maybe one of them would be more suitable.
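For what it's worth, resolving an href against a base URI can be sketched with the JDK's java.net.URI alone. This is only an illustration of the resolution rule, not jsoup's own code; the class name LinkResolution and the "site" directory are made up. It also shows why the base matters: against http://example.com/ the link points at that host, while against the URI of the local page1.html it points at the sibling file on disk.

```java
import java.io.File;
import java.net.URI;

class LinkResolution {

    // Resolves an href relative to the given base URI.
    static URI resolve(URI base, String href) {
        return base.resolve(href);
    }

    public static void main(String[] args) {
        // Against the base URI passed to Jsoup.parse in the question.
        URI web = resolve(URI.create("http://example.com/"), "page2.html");
        System.out.println(web); // http://example.com/page2.html

        // Against a local file: the link resolves to a sibling of page1.html.
        // "site" is a hypothetical directory; the files do not need to exist here.
        File page1 = new File("site", "page1.html").getAbsoluteFile();
        File page2 = new File(resolve(page1.toURI(), "page2.html"));
        System.out.println(page2);
    }
}
```

If the pages only exist locally, passing something like file.toURI().toString() as the base URI argument of Jsoup.parse should make absUrl return file: URLs, which can be turned back into a File as above. Treat that as an assumption to verify rather than a tested recipe.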

How to get attribute content Jsoup?

I have
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
and I want to select the value of its content attribute. I tried this, without success:
Document doc = Jsoup.connect("http://www.somesite.com/index.html").get();
Element link = doc.select("meta").first();
String content = link.attr("content");
But in my html I have:
<div style="overflow: visible;" itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="url" content="http://www.somesite.com/index.html">
<meta itemprop="headline" content="some text">
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
<meta itemprop="dateModified" content="2015-01-26 14:03:16">
You can see that I am looking for the third meta tag, and I can't select it.

Element link= doc.select("meta").first();
This will select only the first meta-element found; since you have more than one in your second html, you'll get the wrong result.
But here's an example:
final String html = "<div style=\"overflow: visible;\" itemscope=\"\" itemtype=\"http://schema.org/Article\">\n"
+ "<meta itemprop=\"url\" content=\"http://www.somesite.com/index.html\">\n"
+ "<meta itemprop=\"headline\" content=\"some text\">\n"
+ "<meta itemprop=\"datePublished\" content=\"2015-01-26 12:37:00\">\n"
+ "<meta itemprop=\"dateModified\" content=\"2015-01-26 14:03:16\">";
Document doc = Jsoup.parse(html);
Element meta = doc.select("meta[itemprop=datePublished]").first();
String content = meta.attr("content");
System.out.println(content);
Output: 2015-01-26 12:37:00
This selects all meta elements whose itemprop attribute has the value datePublished. Of those found, only the first is taken. Finally, from that single element you can read the value of the content attribute.

Nested html not being parsed by Jsoup

I'm trying to parse a page with Jsoup, but the html doesn't seem to be parsing correctly.
The general structure is:
<html>
<head> ... </head>
<frameset ...>
<frame ...>
#document
<html> ... </html>
</frame>
</frameset>
</html>
When I parse the html and print it with Document doc = Jsoup.parse(html); System.out.println(doc.html()); it prints out the outer html (including #document, but not the frames or inner html).
Does anyone know how to get the inner html with Jsoup, or should I consider using a different library?
Thanks.
Edit: Here's the site I'm parsing. I have a subscription to it; don't know if it'll let any of you in.
http://database.asahi.com/library2/login/login.php
After authentication, it will take you to: http://database.asahi.com/library2/main/start.php
Edit 2:
<html>
<head></head>
<frameset rows="58,*" border="0">
<frame name="Header"> </frame>
<frame name="Introduce">
#document
<html>
<head>hello</head>
<body>hello again</body>
</html>
</frame>
</frameset>
</html>
Then I run:
Document doc = Jsoup.parse(html);
Elements elems = doc.select("frameset > frame:last-child");
// print(elems);
switch(elems.size()) {
case 0: break;
case 1: doc = Jsoup.connect(elems.first().attr("src")).get(); break;
default: break;
}
System.out.println(doc.html());
The parsed html (doc.html()):
<html>
<head></head>
<body>
ï»¿ #document hello hello again
</body>
</html>
So it's not even finding <frameset>
Any ideas?

Here is how to parse the nested html:
// Fetch the page with the frameset
Document doc = Jsoup
        .connect("http://database.asahi.com/library2/login/login.php")
        .get(); // Add login, password etc
// Determine the frame url you want to parse...
// Note: I assume you want to parse the content of the first frame
Elements elts = doc.select("frameset > frame:first-child");
switch (elts.size()) {
    case 0:
        // No frame found ...
        break;
    case 1:
        Element frameElt = elts.first();
        // absUrl resolves a relative src against the page's base URI;
        // Jsoup.connect needs an absolute URL
        Document frameDoc = Jsoup
                .connect(frameElt.absUrl("src"))
                .get();
        // Add the frameDoc nodes to doc (via frameElt#insertChildren)
        frameElt.insertChildren(0, frameDoc.childNodes());
        break;
    default:
        // Strange result...
}
System.out.println(doc.html());

Extract CSS Styles from HTML using JSOUP in JAVA

Can anyone help with extracting CSS styles from HTML using Jsoup in Java?
For example, in the HTML below I want to extract .ft00 and .ft01:
<HTML>
<HEAD>
<TITLE>Page 1</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<DIV style="position:relative;width:931;height:1243;">
<STYLE type="text/css">
<!--
.ft00{font-size:11px;font-family:Times;color:#ffffff;}
.ft01{font-size:11px;font-family:Times;color:#ffffff;}
-->
</STYLE>
</HEAD>
</HTML>

If the style is set inline on an element, you just have to use .attr("style").
JSoup is not an HTML renderer, just an HTML parser, so you will have to parse the CSS yourself from the content of the retrieved <style> tag. You can use a simple regex for this, but it won't work in all cases; you may want to use a CSS parser for this task.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {
    public static void main(String[] args) throws Exception {
        String html = "<HTML>\n" +
                "<HEAD>\n" +
                "<TITLE>Page 1</TITLE>\n" +
                "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
                "<DIV style=\"position:relative;width:931;height:1243;\">\n" +
                "<STYLE type=\"text/css\">\n" +
                "<!--\n" +
                " .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n" +
                " .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n" +
                "-->\n" +
                "</STYLE>\n" +
                "</HEAD>\n" +
                "</HTML>";
        Document doc = Jsoup.parse(html);
        Element style = doc.select("style").first();
        // group(1) is the class name, group(2) the declarations between the braces
        Matcher cssMatcher = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]").matcher(style.html());
        while (cssMatcher.find()) {
            System.out.println("Style `" + cssMatcher.group(1) + "`: " + cssMatcher.group(2));
        }
    }
}
Will output:
Style `ft00`: font-size:11px;font-family:Times;color:#ffffff;
Style `ft01`: font-size:11px;font-family:Times;color:#ffffff;

Try this:
Document document = Jsoup.parse(html);
String style = document.select("style").first().data();
You can then use a CSS parser to fetch the details you are interested in.
http://www.w3.org/Style/CSS/SAC
http://cssparser.sourceforge.net
https://github.com/corgrath/osbcp-css-parser#readme
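If pulling in a CSS parser is too much for a stylesheet this simple, the text returned by data() can also be split by hand. This is a rough sketch using only the JDK: the class name SimpleCssRules and its parse method are made up for illustration, and it assumes plain ".class { prop: value; }" rules with no nesting, at-rules, or semicolons inside values. For anything more complex, a real CSS parser like the ones linked above is the safer choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleCssRules {

    // Matches ".className { declarations }"; nested blocks, at-rules and
    // selectors other than a single class are deliberately not handled.
    private static final Pattern RULE = Pattern.compile("\\.([\\w-]+)\\s*\\{([^}]*)\\}");

    // Hypothetical helper: class name -> (property -> value), in source order.
    static Map<String, Map<String, String>> parse(String css) {
        Map<String, Map<String, String>> rules = new LinkedHashMap<>();
        Matcher m = RULE.matcher(css);
        while (m.find()) {
            Map<String, String> declarations = new LinkedHashMap<>();
            for (String declaration : m.group(2).split(";")) {
                int colon = declaration.indexOf(':');
                if (colon > 0) {
                    declarations.put(declaration.substring(0, colon).trim(),
                                     declaration.substring(colon + 1).trim());
                }
            }
            rules.put(m.group(1), declarations);
        }
        return rules;
    }

    public static void main(String[] args) {
        // Same stylesheet text as in the question, as data() would return it.
        String css = "<!--\n"
                + " .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + " .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + "-->";
        System.out.println(parse(css).get("ft00").get("font-size")); // 11px
    }
}
```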
