Validity of html - java

I want to pass a complete HTML document as a string and then check whether the given string is valid HTML or not.
public boolean isValidHTML(String htmlData)
Description - Checks whether the given HTML data is valid HTML or not.
htmlData - An HTML document in the form of a string, which contains tags and data.
returns - true if the given htmlData contains only valid tags with their allowed attributes and their possible values, otherwise false.
A valid HTML document:
<html>
<head>
<title>Page Title</title>
</head>
<body>
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
<b>This text is bold</b>
</body>
</html>
The Java code should look like this:
import java.util.Scanner;

class HtmlValidator {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String html = "pass the html here";
        isValidHtml(html);
    }

    public static boolean isValidHtml(String html) {
        /* write code here */
        /* method returns true if the given html is valid */
        /* please help */
        return false; // placeholder so the stub compiles
    }
}

Rather than writing regex to parse and check (which is generally A Bad Idea), you're better off using something like jsoup to parse it and check for errors.
From https://jsoup.org/cookbook/input/parse-document-from-string:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
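As a rough sketch of the "check for errors" part: jsoup's parser can be asked to record parse errors with Parser.setTrackErrors, and an empty error list can then be treated as "valid". Note the assumptions: this only catches syntax-level problems that the HTML parser reports, it does not verify allowed attributes or attribute values, and the class name HtmlValidator and the limit of 100 tracked errors are arbitrary choices for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

class HtmlValidator {

    // Returns true when jsoup's HTML parser records no parse errors.
    // Assumption: "valid" here means "parses without reported errors",
    // which is weaker than validation against the HTML specification.
    public static boolean isValidHtml(String html) {
        // Error tracking is off by default; keep up to 100 errors.
        Parser parser = Parser.htmlParser().setTrackErrors(100);
        Jsoup.parse(html, "", parser);

        List<ParseError> errors = parser.getErrors();
        for (ParseError error : errors) {
            // Report the character position and the parser's message.
            System.err.println(error.getPosition() + ": " + error.getErrorMessage());
        }
        return errors.isEmpty();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(isValidHtml(html));
    }
}
```

If stricter validation is needed (for example checking attributes against a schema), a dedicated validator would be a better fit than a lenient parser like this.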

Related

How to unescape escaped special characters while reading XML in Java

I'm working on extracting ISO-8859-2 encoded text from an XML file. It works fine; however, some special characters are written using their corresponding HTML codes.
The XML file:
<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE TEI.2 SYSTEM "http://mek.oszk.hu/mekdtd/prose/TEI-MEK-prose.dtd">
<!-- ?xml-stylesheet type="text/xsl" href="http://mek.oszk.hu/mekdtd/xsl/boszorkany_txt.xsl"? -->
<TEI.2 id="MEK-00798">
<text type="novel">
<front>
<titlePage>
<docAuthor>Jókai Mór</docAuthor>
<docTitle>
<titlePart>Az arany ember</titlePart>
</docTitle>
</titlePage>
</front>
<body>
<div type="part">
<head>
<title>A Szent Borbála</title>
</head>
<div type="chapter">
<head>
<title>I. A VASKAPU</title>
</head>
<p text-align="justify">A kitartó hetes vihar. – Ez járhatlanná teszi a Dunát a Vaskapu
között.
</p>
</div>
</div>
</body>
</text>
</TEI.2>
A snippet of the code I use:
SAXReader reader = new SAXReader();
reader.setEncoding("ISO-8859-2");
Document document = reader.read(file);
Node node = document.selectSingleNode("//*[@type='chapter']/p");
String text = node.getStringValue();
// String text = org.jsoup.parser.Parser.unescapeEntities(node.getStringValue(), true);
// String text = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(node.getStringValue());
I also included in comments some libraries I tried, without any success.
What I want to see is:
A kitartó hetes vihar. - Ez járhatlanná teszi a Dunát a Vaskapu között.
What I see when I debug is:
A kitartó hetes vihar . Ez járhatlanná teszi a Dunát a Vaskapu között.

JSoup Not Producing Valid XHTML

I am using JSoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected apart from the fact the closing </base> tag is omitted from the modified HTML.
Is there any way to have JSOUP return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
Document document = Jsoup.parse(html);
document.outputSettings().escapeMode(EscapeMode.xhtml);
Elements baseElements = document.select("base");
if (!baseElements.isEmpty())
{
Element base = baseElements.get(0);
base.attr("href", getBaseUrl(request));
}
return document.html();
}
In addition to (or instead of) the escape mode, you want to set the syntax:
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
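Put together, the change might look like the sketch below. It is a simplified variant of the method from the question: the base URL is passed in as a plain string instead of being derived from the HttpServletRequest, and the class name XhtmlOutputSketch and method name setBaseHref are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;

class XhtmlOutputSketch {

    // Parses the HTML, rewrites the href of the first <base> element
    // and serializes the document using XML syntax.
    static String setBaseHref(String html, String baseUrl) {
        Document document = Jsoup.parse(html);
        document.outputSettings()
                .syntax(Document.OutputSettings.Syntax.xml)   // close empty elements
                .escapeMode(Entities.EscapeMode.xhtml);       // minimal entity set

        Element base = document.select("base").first();      // null if there is none
        if (base != null) {
            base.attr("href", baseUrl);
        }
        return document.html();
    }

    public static void main(String[] args) {
        String html = "<html><head><base href=\"xyz\"/></head><body></body></html>";
        System.out.println(setBaseHref(html, "https://myhost:8080/myapp/"));
    }
}
```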

how to edit all text values in html tags using jsoup

What I want: I am new to Jsoup. I want to parse my html string and search for each text value that appears inside tags (any tag). And then change that text value to something else.
What I have done: I am able to change the text value for single tag. Below is the code:
public static void main(String[] args) {
String html = "<div><p>Test Data</p> <p>HELLO World</p></div>";
Document doc1=Jsoup.parse(html);
Elements ps = doc1.getElementsByTag("p");
for (Element p : ps) {
String pText = p.text();
p.text(base64_Dummy(pText));
}
System.out.println("======================");
String changedHTML=doc1.html();
System.out.println(changedHTML);
}
public static String base64_Dummy(String abc){
return "This is changed text";
}
output:
======================
<html>
<head></head>
<body>
<div>
<p>This is changed text</p>
<p>This is changed text</p>
</div>
</body>
</html>
The code above changes the text of the p tags. In my case, however, the HTML string can contain any tag whose text I want to find and change.
How can I search all tags in an HTML string and change their text values one by one?
You can try with something similar to this code:
String html = "<html><body><div><p>Test Data</p> <div> <p>HELLO World</p></div></div> other text</body></html>";
Document doc = Jsoup.parse(html);
// We will search nodes in a breadth-first way
Queue<Node> nodes = new ArrayDeque<>();
nodes.addAll(doc.childNodes());
while (!nodes.isEmpty()) {
Node n = nodes.remove();
if (n instanceof TextNode && ((TextNode) n).text().trim().length() > 0) {
// Do whatever you want with n.
// Here we just print its text...
System.out.println(n.parent().nodeName()+" contains text: "+((TextNode) n).text().trim());
} else {
nodes.addAll(n.childNodes());
}
}
And you'll get the following output:
body contains text: other text
p contains text: Test Data
p contains text: HELLO World
You want to use the CSS selector * and the method textNodes to get the text of a given tag (Element in Jsoup world).
This line below
Elements ps = doc1.getElementsByTag("p");
becomes
Elements ps = doc1.select("*");
Now, with this new selector you'll be able to select any elements (tags) within your HTML code.
FULL CODE EXAMPLE
public static void main(String[] args) {
String html = "<html><body><div><p>Test Data</p> <div> <p>HELLO World</p></div></div> other text</body></html>";
Document doc1 = Jsoup.parse(html);
Elements tags = doc1.select("*");
for (Element tag : tags) {
for (TextNode tn : tag.textNodes()) {
String tagText = tn.text().trim();
if (tagText.length() > 0) {
tn.text(base64_Dummy(tagText));
}
}
}
System.out.println("======================");
String changedHTML = doc1.html();
System.out.println(changedHTML);
}
public static String base64_Dummy(String abc) {
return "This is changed text";
}
OUTPUT
======================
<html>
<head></head>
<body>
<div>
<p>This is changed text</p>
<div>
<p>This is changed text</p>
</div>
</div>This is changed text
</body>
</html>

how to get proper formatted text from html when tags don't have line breaks

I am trying to parse this sample html file with the help of Jsoup HTML parsing Library.
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option><option value="Chevy">Chevy</option><option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
And I am getting the following when I extract only text.
this is sample text this is heading sample FordChevySubaru this is second sample text
There are no spaces or line breaks between the option tag texts.
Whereas if the HTML had been like this:
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option>
<option value="Chevy">Chevy</option>
<option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
then the text comes out like this:
this is sample text this is heading sample Ford Chevy Subaru this is second sample text
with proper spaces between the option tag texts. How do I get the second output from the first HTML file? That is, when there are no line breaks between the tags, how can I keep the strings from being concatenated?
I am using the following code in Java.
public static String extractText(File file) throws IOException {
Document document = Jsoup.parse(file,null);
Element body=document.body();
String textOnly=body.text();
return textOnly;
}
I think the only solution that meets your requirements is traversing the DOM and collecting the text nodes:
public static String extractText(File file) throws IOException {
StringBuilder sb = new StringBuilder();
Document document = Jsoup.parse(file, null);
Elements body = document.getAllElements();
for (Element e : body) {
for (TextNode t : e.textNodes()) {
String s = t.text();
if (StringUtils.isNotBlank(s))
sb.append(t.text()).append(" ");
}
}
return sb.toString();
}
Hope it helps.

Getting data from a form using java

<div></div>
<div></div>
<div></div>
<div></div>
<ul>
<form id=the_main_form method="post">
<li>
<div></div>
<div> <h2>
<a onclick="xyz;" target="_blank" href="http://sample.com" style="text-decoration:underline;">This is sample</a>
</h2></div>
<div></div>
<div></div>
</li>
There are 50 <li> elements like that.
I have posted a snippet from a much larger HTML page.
<div></div> means there was data between the tags; I removed it because it is not necessary here.
How should the jsoup select statement look to extract the href and the link text?
I tried doc.select("div div div ul xxxx");
where xxxx is the form. Should I give the form id, or how should I do that?
Try this:
Elements allLis = doc.select("#the_main_form > li ");
for (Element li : allLis) {
Element a = li.select("div:eq(1) > h2 > a").first(); // select() returns Elements, so take the first match
String href = a.attr("href");
String text = a.text();
}
Hope it helps!
EDIT:
Elements allLis = doc.select("#the_main_form > li ");
This part of the code gets all <li> tags that are inside the <form> with id #the_main_form.
Element a = li.select("div:eq(1) > h2 > a").first();
Then we iterate over all the <li> tags and get <a> tags, by first getting <div> tags ( the second one inside all <li>s by using index=1 -> div:eq(1)) then getting <h2> tags, where our <a> tags are present.
Hope you understand now! :)
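If the exact position of the <div> inside each <li> does not matter, a looser descendant selector may be easier to maintain. The following is only a sketch under that assumption; the class name FormLinkSketch and the method extractLinks are made up for illustration, and each result is returned as a two-element array of href and text.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FormLinkSketch {

    // Collects {href, text} pairs for every link that sits directly inside
    // an <h2> somewhere below an <li> of the form with id "the_main_form".
    static List<String[]> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String[]> links = new ArrayList<>();
        for (Element a : doc.select("#the_main_form li h2 > a[href]")) {
            links.add(new String[] { a.attr("href"), a.text() });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<form id=\"the_main_form\"><ul><li><div></div><div><h2>"
                + "<a href=\"http://sample.com\">This is sample</a>"
                + "</h2></div></li></ul></form>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```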
Please try this:
package com.stackoverflow.works;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
/*
* @author sarath_sivan
*/
public class HtmlParserService {
public static void parseHtml(String html) {
Document document = Jsoup.parse(html);
Element linkElement = document.select("a").first();
String linkHref = linkElement.attr("href"); // "http://sample.com"
String linkText = linkElement.text(); // "This is sample"
System.out.println(linkHref);
System.out.println(linkText);
}
public static void main(String[] args) {
String html = "<a onclick=\"xyz;\" target=\"_blank\" href=\"http://sample.com\" style=\"text-decoration:underline;\">This is sample</a>";
parseHtml(html);
}
}
Hope you have the Jsoup Library in your classpath.
Thank you!
