How to unescape escaped special characters while reading XML in Java - java

I'm working on extracting ISO-8859-2 encoded text from an XML file. It works fine; however, some special characters are represented by their corresponding HTML entities.
The XML file:
<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE TEI.2 SYSTEM "http://mek.oszk.hu/mekdtd/prose/TEI-MEK-prose.dtd">
<!-- ?xml-stylesheet type="text/xsl" href="http://mek.oszk.hu/mekdtd/xsl/boszorkany_txt.xsl"? -->
<TEI.2 id="MEK-00798">
<text type="novel">
<front>
<titlePage>
<docAuthor>Jókai Mór</docAuthor>
<docTitle>
<titlePart>Az arany ember</titlePart>
</docTitle>
</titlePage>
</front>
<body>
<div type="part">
<head>
<title>A Szent Borbála</title>
</head>
<div type="chapter">
<head>
<title>I. A VASKAPU</title>
</head>
<p text-align="justify">A kitartó hetes vihar. – Ez járhatlanná teszi a Dunát a Vaskapu
között.
</p>
</div>
</div>
</body>
</text>
</TEI.2>
A snippet of the code I use:
SAXReader reader = new SAXReader();
reader.setEncoding("ISO-8859-2");
Document document = reader.read(file);
Node node = document.selectSingleNode("//*[@type='chapter']/p");
String text = node.getStringValue();
// String text = org.jsoup.parser.Parser.unescapeEntities(node.getStringValue(), true);
// String text = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(node.getStringValue());
I also included in comments some libraries I tried, without any success.
What I want to see is:
A kitartó hetes vihar. - Ez járhatlanná teszi a Dunát a Vaskapu között.
What I see when I debug is:
A kitartó hetes vihar . Ez járhatlanná teszi a Dunát a Vaskapu között.

Related

getting Russian input from web into java application

I obviously am missing something here. I have a web app where the input for a form may be in English or, after a keyboard switch, Russian. The meta tag for the page is specifying that the page is UTF-8. That does not seem to matter.
If I type in "вв", which is two of the Unicode character CYRILLIC SMALL LETTER VE, what do I get? A string.
I call codePoints().toArray() and I get:
[208, 178, 208, 178]
If I call chars().toArray(), I get the same.
What the heck?
I am completely in control of the web page, but of course there will be different browsers. But how can I get something back from the web page that will let me get the proper cyrillic characters?
This is on java 1.8.0_312. I can upgrade some, but not all the way to the latest java.
The page is this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Cards</title>
<link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity = "sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin = "anonymous" />
<link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap-theme.min.css" integrity = "sha384-rHyoN1iRsVXV4nD0JutlnGaslCJuC7uwjduW9SVrLvRYooPp2bWYgmgJQIXwl/Sp" crossorigin = "anonymous" />
<script src = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity = "sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin = "anonymous">
</script>
<meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" />
<style>.table-nonfluid { width: auto !important; }</style>
</head>
<body>
<div style = "padding: 25px 25px 25px 25px;">
<h2 align = "center">Cards</h2>
<div style = "white-space: nowrap;">
Home
<div>
<form name="f_3_1" method="post" action="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.3.1">
<table class = "table" border = "1" style = "max-width: 50%; font-size: 300%; text-align: center;">
<tr>
<td>to go</td>
</tr>
<tr>
<td><input size="25" type="text" name="3.1.5.3.3" /></td>
</tr>
<td>
<input type="submit" value="Submit" name="3.1.5.3.5" /> Skip
</td>
</table>
<input type="hidden" name="wosid" value="ee67KCNaHEiW1WdpdA8JIM" />
</form>
</div>
</div>
</div>
</body>
</html>
Hm. Well, here is at least part of the story.
I have this code:
System.out.println("start: " + start);
int[] points = start.chars().toArray();
byte[] next = new byte[points.length];
int idx = 0;
System.out.print("fixed: ");
for (int p : points) {
next[idx] = (byte)(p & 0xff);
System.out.print(Integer.toHexString(next[idx]) + " ");
idx++;
}
System.out.println("");
The output is:
start: вв
fixed: ffffffd0 ffffffb2 ffffffd0 ffffffb2
And the UTF-8 value for "в", in hex, is d0b2.
So, there it is. The question is, why is this not more easily accessible? Do I really have to put this together byte-pair by byte-pair?
If the string is already in UTF-8, as I think we can see it is, why does the codePoints() method not give us, you know, the codePoints?
Ok, so now I do:
new String(next, StandardCharsets.UTF_8);
and I get the proper string. But it still seems strange that codePoints() gives me an IntStream, but if you use these things as int values, it is broken.
It was a problem with the frameworks I was using. I thought I was setting the request and response content type to utf-8 but I was not.
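The byte-by-byte loop above is equivalent to re-encoding the string as ISO-8859-1 and decoding the result as UTF-8. Here is a minimal sketch of that, assuming the framework decoded the UTF-8 request body as ISO-8859-1 (which would explain the 208, 178 values); the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MojibakeRepair {
    // Assumption: 'garbled' holds UTF-8 bytes that were wrongly decoded as ISO-8859-1,
    // so every char is in the range 0..255 and maps back to exactly one byte.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The four chars 208, 178, 208, 178 from the question.
        String garbled = "\u00d0\u00b2\u00d0\u00b2";
        String fixed = repair(garbled);
        // prints [1074, 1074], i.e. U+0432 (CYRILLIC SMALL LETTER VE) twice
        System.out.println(Arrays.toString(fixed.codePoints().toArray()));
    }
}
```

This only repairs the data after the fact. As noted above, the actual fix is to make sure the request is decoded as UTF-8 in the first place, after which codePoints() returns the expected values.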

Validity of html

I want to pass a complete HTML document in as a string and then check whether the given string is valid HTML or not.
public boolean isValidHTML(String htmlData)
Description: checks whether the given HTML data is valid HTML or not.
htmlData: an HTML document in the form of a string, which contains tags and data.
Returns: true if the given htmlData contains only valid tags with their allowed attributes and possible values, otherwise false.
A valid HTML:
<html>
<head>
<title>Page Title</title>
</head>
<body>
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
<b>This text is bold</b>
</body>
</html>
The java code should look like
import java.util.Scanner;

class htmlValidator {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String html = "pass the html here";
        isValidHtml(html);
    }

    public static boolean isValidHtml(String html) {
        // write code here
        // method returns true if the given html is valid
        // please help
        return false; // placeholder so that the skeleton compiles
    }
}
Rather than writing regex to parse and check (which is generally A Bad Idea), you're better off using something like jsoup to parse it and check for errors.
From https://jsoup.org/cookbook/input/parse-document-from-string:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
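The snippet above parses the document but does not yet look at any errors. Here is a minimal sketch of one way to do that, assuming jsoup's optional error tracking on Parser (setTrackErrors and getErrors); the class and method names are hypothetical. It reports parse errors only, which is a weaker check than the "allowed attributes and their possible values" asked for in the question.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

class HtmlErrorCheck {
    // Returns true when jsoup's HTML parser records no parse errors for the input.
    // This is a syntax-level check, not validation against the HTML specification.
    static boolean parsesWithoutErrors(String html) {
        // Error tracking is off by default; keep up to 100 errors.
        Parser parser = Parser.htmlParser().setTrackErrors(100);
        Jsoup.parse(html, "", parser);
        List<ParseError> errors = parser.getErrors();
        for (ParseError error : errors) {
            System.out.println(error.getErrorMessage());
        }
        return errors.isEmpty();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(parsesWithoutErrors(html));
        // The end tag is cut off, so the tokenizer reaches end of input inside a tag.
        System.out.println(parsesWithoutErrors("<p>Hello</p"));
    }
}
```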

How to inject snippets of html into an string containing valid html?

I have the following html (trimmed for brevity) that is passed into a java method.
However, I want to take this passed in html string and add a <pre> tag that contains some text passed in and add a section of <script type="text/javascript"> to the head.
String buildHTML(String htmlString, String textToInject)
{
// Inject textToInject into pre tag and add javascript sections
String newHTMLString = <build new html sections>
}
-- htmlString --
<html>
<head>
</head>
<body>
</body>
</html>
-- newHTMLString
<html>
<head>
<script type="text/javascript">
window.onload=function(){alert("hello?");}
</script>
</head>
<body>
<div id="1">
<pre>
<!-- Inject textToInject here into a newly created pre tag-->
</pre>
</div>
</body>
</html>
What is the best tool to do this from within java other than a regex?
Here's how to do this with Jsoup:
public String buildHTML(String htmlString, String textToInject)
{
// Create a document from string
Document doc = Jsoup.parse(htmlString);
// create the script tag in head
doc.head().appendElement("script")
.attr("type", "text/javascript")
.text("window.onload=function(){alert(\'hello?\');}");
// Create div tag
Element div = doc.body().appendElement("div").attr("id", "1");
// Create pre tag
Element pre = div.appendElement("pre");
pre.text(textToInject);
// Return as string
return doc.toString();
}
I've used method chaining a lot, which means that:
doc.body().appendElement(...).attr(...).text(...)
is exactly the same as
Element example = doc.body().appendElement(...);
example.attr(...);
example.text(...);
Example:
final String html = "<html>\n"
+ " <head>\n"
+ " </head>\n"
+ " <body>\n"
+ " </body>\n"
+ "</html>";
String result = buildHTML(html, "This is a test.");
System.out.println(result);
Result:
<html>
<head>
<script type="text/javascript">window.onload=function(){alert('hello?');}</script>
</head>
<body>
<div id="1">
<pre>This is a test.</pre>
</div>
</body>
</html>

how to get proper formatted text from html when tags don't have line breaks

I am trying to parse this sample html file with the help of Jsoup HTML parsing Library.
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option><option value="Chevy">Chevy</option><option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
And I am getting the following when I extract only text.
this is sample text this is heading sample FordChevySubaru this is second sample text
There are no spaces or line breaks in the option tag text.
Whereas if the HTML had been like this:
<html>
<body>
<p> this is sample text</p>
<h1>this is heading sample</h1>
<select name="car" size="1">
<option value="Ford">Ford</option>
<option value="Chevy">Chevy</option>
<option selected value="Subaru">Subaru</option>
</select>
<p>this is second sample text</p>
</body>
</html>
Now, in this case, the text is like this:
this is sample text this is heading sample Ford Chevy Subaru this is second sample text
with proper spaces in the text of the option tags. How do I get the second output from the first HTML file? That is, if there are no line breaks between the tags, how can I keep the strings from being concatenated?
I am using the following code in Java.
public static String extractText(File file) throws IOException {
Document document = Jsoup.parse(file,null);
Element body=document.body();
String textOnly=body.text();
return textOnly;
}
I think the only solution that meets your requirements is to traverse the DOM and collect the text nodes:
public static String extractText(File file) throws IOException {
StringBuilder sb = new StringBuilder();
Document document = Jsoup.parse(file, null);
Elements body = document.getAllElements();
for (Element e : body) {
for (TextNode t : e.textNodes()) {
String s = t.text();
if (StringUtils.isNotBlank(s))
sb.append(t.text()).append(" ");
}
}
return sb.toString();
}
Hope it helps.

Java getElementById() or Alternative

So, I have an XHTML document report skeleton that I want to populate by getting elements with certain IDs and setting their contents.
I tried getElementById(), and had null returned (because, as I found out, id is not implicitly "id" and needs to be declared in a schema).
panel.setDocument(Main.class.getResource("/halreportview/DefaultSiteDetails.html").toString());
panel = populateDefaultReport(panel);
Element header1 = panel.getDocument().getElementById("header1");
header1.setTextContent("<span class=\"b\">Instruction Type:</span> Example<br/><span class=\"b\">Allocated To:</span> "+employee.toString()+"<br/><span class=\"b\">Scheduled Date:</span> "+dateFormat.format(scheduledDate));
So, I tried some work-arounds because I don't want to have to validate my XHTML documents. I tried adding a quick DTD to the top of the file in question like so;
<?xml version="1.0"?>
<!DOCTYPE foo [<!ATTLIST bar id ID #IMPLIED>]>
But getElementById() still returned null. So I tried using xml:id instead of id in the XHTML document in the hope that it was supported, but again no luck. So instead I tried to use getElementsByTagName() and loop through the results, checking ids. This worked and found the correct element (as confirmed by the output printing "Found it"), but when I try to call setTextContent on this element I still get a NullPointerException. Code below:
Element header1;
NodeList sections = panel.getDocument().getElementsByTagName("p");
for (int i = 0; i < sections.getLength(); ++i) {
if (((Element)sections.item(i)).getAttribute("id").equals("header1")) {
System.out.println("Found it");
header1 = (Element) sections.item(i);
header1.setTextContent("<span class=\"b\">Instruction Type:</span> Example<br/><span class=\"b\">Allocated To:</span> "+employee.toString()+"<br/><span class=\"b\">Scheduled Date:</span> "+dateFormat.format(scheduledDate));
}
}
I'm losing my mind on this one. I must be suffering from some kind of fundamental misunderstanding of how this is supposed to work. Any ideas?
Edit; Excerpt from my XHTML file below with CSS removed.
<html>
<head>
<title>Site Details</title>
<style type="text/css">
</style>
</head>
<body>
<div class="header">
<p></p>
<img src="#" alt="Logo" height="81" width="69"/>
<p id="header1"><span class="b">Instruction Type:</span> Example<br/><span class="b">Allocated To:</span> Example<br/><span class="b">Scheduled Date:</span> Example</p>
</div>
</body>
</html>
I am not sure why it's not working, but I have put together an example for you, and it works.
Note: my example uses the following libraries:
Apache Commons IO
Jsoup HTML Parser
Apache Commons Lang
My input XHTML file:
<html>
<head>
<title>Site Details</title>
<style type="text/css">
</style>
</head>
<body>
<div class="header">
<p></p>
<img src="#" alt="Logo" height="81" width="69" />
<p id="header1">
<span class="b">Instruction Type:</span> Example<br />
<span class="b">Allocated To:</span> Example<br />
<span class="b">Scheduled Date:</span> Example
</p>
</div>
</body>
</html>
The Java code that works (read the comments):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Date;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Test {
/**
* @param args
* @throws IOException
* @throws InterruptedException
*/
public static void main(String[] args) throws IOException, InterruptedException {
//loading file from project
//When it is exported as JAR the files inside jar are not files but they are stream
InputStream stream = Test.class.getResourceAsStream("/test.xhtml");
//convert stream to file
File xhtmlfile = File.createTempFile("xhtmlFile", ".tmp");
FileOutputStream fileOutputStream = new FileOutputStream(xhtmlfile);
IOUtils.copy(stream, fileOutputStream);
xhtmlfile.deleteOnExit();
//get html string from file
String htmlString = FileUtils.readFileToString(xhtmlfile);
//parse using jsoup
Document doc = Jsoup.parse(htmlString);
//get all elements
Elements allElements = doc.getAllElements();
for (Element el : allElements) {
//if element id is header 1
if (el.id().equals("header1")) {
//dummy emp name
String employeeName = "dummyEmployee";
//update text
el.text("<span class=\"b\">Instruction Type:</span> Example<br/><span class=\"b\">Allocated To:</span> "
+ employeeName.toString() + "<br/><span class=\"b\">Scheduled Date:</span> " + new Date());
//dont loop further
break;
}
}
//now get html from the updated document
String html = doc.html();
//we need to unescape the html, because text() escaped the markup we passed in
String escapeHtml4 = StringEscapeUtils.unescapeHtml4(html);
//print html
System.out.println(escapeHtml4);
}
}
Output:
<html>
<head>
<title>Site Details</title>
<style type="text/css">
</style>
</head>
<body>
<div class="header">
<p></p>
<img src="#" alt="Logo" height="81" width="69" />
<p id="header1"><span class="b">Instruction Type:</span> Example<br/><span class="b">Allocated To:</span> dummyEmployee<br/><span class="b">Scheduled Date:</span> Sat Nov 02 07:37:12 GMT 2013</p>
</div>
</body>
</html>
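For the org.w3c.dom route tried in the question, another option is to look the element up with an XPath expression, which does not depend on id being declared as an ID attribute. This is a minimal sketch using only the JDK. The class and helper names are hypothetical, and it parses a string rather than using the panel.getDocument() object from the question:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class FindById {
    // Parses a well-formed XHTML string into a DOM document.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Finds the first element whose plain "id" attribute equals the given value,
    // or returns null if there is none. Assumes the id contains no quote characters.
    static Element findById(Document doc, String id) {
        try {
            String expression = "//*[@id='" + id + "']";
            return (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div class=\"header\">"
                + "<p id=\"header1\">Example</p></div></body></html>");
        Element header1 = findById(doc, "header1");
        // setTextContent stores plain text; any markup in the argument
        // would end up as literal characters, not as child elements.
        header1.setTextContent("Allocated To: dummyEmployee");
        System.out.println(header1.getTextContent());
    }
}
```

Because setTextContent replaces the element's children with a single text node, the span and br markup from the question would have to be built as child elements (for example with createElement and appendChild) rather than passed in as a string.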
