How to escape the special char in SAX parsing - java

I am parsing the xml file below:
<description>
<p>
<a href="http://news.yahoo.com/jessica-chastain-talks-princess-diana-biopic- 164102608.html">
<img src="http://l3.yimg.com/bt/api/res/1.2/zD3Iwxezk8JVGQwhow7y4Q--/YXBwaWQ9eW5ld3M7Zmk9ZmlsbDtoPTg2O3E9ODU7dz0xMzA-/http://media.zenfs.com/en_us/News/Reuters/2011-11-07T171906Z_01_BTRE7A61C3Y00_RTROPTP_2_FILM-US-JESSICACHASTAIN.JPG"
alt="photo"
align="left"
title="Actress Chastain poses for photographers as she arrives on the "Wilde Salome" red carpet at the 68th Venice Film Festival" border="0" />
</a>NEW YORK (TheWrap.com) - Jessica Chastain may not win Oscar gold this year, but it appears she will wear a crown.
</p>
<br clear="all"/>
</description>
I am using a SAX parser and trying to read the title attribute of the img tag. But because of the quoted text "Wilde Salome" inside the attribute value, I am getting an ExpatParser exception.
Could you please let me know how this can be solved?

The XML is invalid. An attribute value delimited by double quotes must not contain unescaped double quotes ("). The program that generated it should replace the inner " characters with &quot;.
If you print the &quot; to a web page, the browser will automatically show the " character in its place.
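A minimal sketch of both sides of that fix, assuming the standard javax.xml.parsers SAX API rather than the asker's exact parser setup. The class and method names (ImgTitleReader, escapeAttribute, readImgTitle) are made up for illustration: the first method is what the generating program would need to do, the second reads the title back once the XML is well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ImgTitleReader {
    // Hypothetical helper: escapes characters that may not appear raw
    // inside a double-quoted XML attribute value.
    static String escapeAttribute(String raw) {
        return raw.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Returns the title attribute of the last <img> element, or null if none.
    // Any parse failure is rethrown as IllegalStateException.
    static String readImgTitle(String xml) {
        final String[] title = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("img".equals(qName)) {
                    // The parser has already turned &quot; back into "
                    title[0] = attrs.getValue("title");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return title[0];
    }

    public static void main(String[] args) {
        String title = "arrives on the \"Wilde Salome\" red carpet";
        String xml = "<p><img title=\"" + escapeAttribute(title) + "\" border=\"0\"/></p>";
        System.out.println(readImgTitle(xml));
    }
}
```

If the feed cannot be fixed at the source, the parser will keep rejecting it, because no handler callback runs before the well-formedness error is raised.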

Related

Thymeleaf title attribute with html tags

I'm using the summernote plugin (https://summernote.org/) to style text and save it to a database. It produces a string such as <b style='color:#CCC'>Test</b>.
For normal text I use the th:utext attribute, but there is no equivalent for th:title. How can I do this in Thymeleaf? Thanks in advance.
In the first scenario, I want to show it as text, so I used <span th:utext="${text}"></span>, and this works as expected.
In the second scenario, I want to show it as the title of another tag, like
<a th:title="${text}">Some other text </a>. This gives a title with the tags shown as a literal string; the styles are not applied to the title. How can I get the title rendered with the text style provided by the string?
In both cases ${text} is <b style='color:#CCC'>Test</b>. How can I get unescaped text in the title attribute?
If you are getting the <b style='color:#CCC'>Test</b> string from the model (as with th:utext="${text}"), try it like this:
From server: model.addAttribute("text", "<b style='color:#CCC'>Test</b>");
Html #1: <span th:utext="${text}"></span>
Html #2: <a th:title="${text}">Some other text </a>
I tried this on my server and it worked.

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to convert a bunch of HTML documents to XML compliance (via a java method) and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not address the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?
This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101
You probably want <br\b[^>]*> to match all tags that
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
You have to use .* instead of a bare * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
because:
* matches the preceding character or subexpression zero or more times
and
.* matches any character zero or more times
So for your case:
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
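To compare the patterns suggested so far, here is a small sketch (the class name BrNormalizer and both method names are made up for illustration). It runs the <br\b[^>]*> pattern and the <br .*> pattern on the same inputs. Two differences show up: .* is greedy, so on a line with several tags it matches through the last > on that line, and the mandatory space means a plain <br> is not matched at all.

```java
class BrNormalizer {
    // <br, then a word boundary, then any run of non-'>' characters, then '>'.
    static String normalize(String html) {
        return html.replaceAll("(?i)<br\\b[^>]*>", "<br/>");
    }

    // The ".*" variant, kept only for comparison.
    static String greedy(String html) {
        return html.replaceAll("(?i)<br .*>", "<br/>");
    }

    public static void main(String[] args) {
        String single = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
        String mixed = "<br clear=all>text<b>bold</b>";

        System.out.println(normalize(single)); // <br/>
        System.out.println(greedy(single));    // <br/>

        System.out.println(normalize(mixed));  // <br/>text<b>bold</b>
        System.out.println(greedy(mixed));     // <br/>

        System.out.println(normalize("a<br>b<brown>")); // a<br/>b<brown>
        System.out.println(greedy("a<br>b<brown>"));    // a<br>b<brown>
    }
}
```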
Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Java: enrich xml with elements through regex string parsing

I have a complex task: transforming a docx document to JATS XML. So far I have extracted everything possible from the docx with XSLT. The next step is to parse the XML file and update it by turning some XML strings (text in nodes) into XML elements. I have not found this covered in similar questions on this forum. My input XML looks like this:
<article dtd-version="3.0" article-type="other">
<body>
<sec>
<title>mySuperTitle</title>
<p>
This is some scientific stuff [1]. Here is more complicated info. This text is even more bizarre [2,3].
</p>
<p>
Einstein formulas [4]. String theory [5,6]. Really don`t know what to write here[7,8].
</p>
</sec>
<sec>
<title>AnotherBoringTitle</title>
<p>
Another one section and obviously here is even more citations [9,10,11]
</p>
</sec>
</body>
</article>
Ideally, I want to replace all [citations], which are simple numbers in [], with XML elements. For example:
<article dtd-version="3.0" article-type="other">
<body>
<sec>
<title>mySuperTitle</title>
<p>
This is some scientific stuff [<xref ref-type="bibr" rid="bib1">1</xref>]. Here is more complicated info. This text is even more bizarre [<xref ref-type="bibr" rid="bib2">2</xref>,<xref ref-type="bibr" rid="bib3">3</xref>].
</p>
<p>
Einstein formulas [<xref ref-type="bibr" rid ="bib4">4</xref>]. String theory [<xref ref-type="bibr" rid ="bib5">5</xref>,<xref ref-type="bibr" rid ="bib6">6</xref>]. Really don`t know what to write here [<xref ref-type="bibr" rid ="bib7">7</xref>,<xref ref-type="bibr" rid ="bib8">8</xref>].
</p>
</sec>
<sec>
<title>AnotherBoringTitle</title>
<p>
Another one section and obviously here is even more citations [<xref ref-type="bibr" rid ="bib9">9</xref>,<xref ref-type="bibr" rid ="bib10">10</xref>,<xref ref-type="bibr" rid ="bib11">11</xref>]
</p>
</sec>
</body>
</article>
I don't have much experience in Java, but I have already tried to use DOM, XPath and regex for this task. The problem is that when I parse the document and get a node, I must take it from the DOM, convert it to a string, replace the numbers in the string, convert the result back to an element, and write the output. I find it difficult to convert this string back to an element (this seems to require creating a new DocumentBuilder) and to replace the proper element in the DOM so that a new XML document is output.
Is there an easy solution? Or must I write many lines of code for this?
This works using DOM and regexes:
I assume you know how to find the right Text node.
You then need to:
//get the split point:
int prevSplitOffset = 0;
Matcher m = Pattern.compile("\\[(\\d+)\\]").matcher(textNode.getData());
while (m.find()) {
// get the text and split it:
Text number = textNode.splitText(m.start(1) - prevSplitOffset);
textNode = number.splitText(m.group(1).length());
// Replace the number with a new DOM node:
Element xref = document.createElement("xref");
xref.setAttribute("rid", "bib" + m.group(1));
xref.setAttribute("ref-type", "bibr");
number.getParentNode().replaceChild(xref, number);
xref.appendChild(number);
prevSplitOffset = m.end(1);
}
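The pattern above only matches brackets holding a single number, such as [1], so [2,3] is left alone. Here is a self-contained sketch of the same splitText approach that also covers comma-separated lists. It assumes citations contain no spaces, as in the sample input, and the lookaround pattern could misfire on text like 1,000,000. The class and method names (CitationLinker, parse, linkAll) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class CitationLinker {
    // Digits directly after '[' or ',' and directly before ',' or ']'.
    private static final Pattern NUMBER =
            Pattern.compile("(?<=[\\[,])\\d+(?=[,\\]])");

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Wraps every citation number in the document in an <xref> element.
    static void linkAll(Document document) {
        // Collect the text nodes first, because linking changes the tree.
        List<Text> texts = new ArrayList<>();
        collectText(document.getDocumentElement(), texts);
        for (Text text : texts) {
            link(document, text);
        }
    }

    private static void collectText(Node node, List<Text> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Text) {
                out.add((Text) c);
            } else {
                collectText(c, out);
            }
        }
    }

    private static void link(Document document, Text textNode) {
        // The matcher runs over the original string; 'consumed' converts its
        // offsets into offsets within the remaining (right-hand) text node.
        Matcher m = NUMBER.matcher(textNode.getData());
        int consumed = 0;
        while (m.find()) {
            Text number = textNode.splitText(m.start() - consumed);
            textNode = number.splitText(m.group().length());
            Element xref = document.createElement("xref");
            xref.setAttribute("ref-type", "bibr");
            xref.setAttribute("rid", "bib" + m.group());
            number.getParentNode().replaceChild(xref, number);
            xref.appendChild(number);
            consumed = m.end();
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Stuff [1]. More [2,3].</p>");
        linkAll(doc);
        System.out.println(doc.getElementsByTagName("xref").getLength()); // 3
    }
}
```

Because the nodes are split in place, nothing has to be serialized to a string and re-parsed, which is the step the question found awkward.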

Using Java 6 and Jsoup 1.7.3, how can I parse this HTML where sibling text is not inside an element?

Mainly, my question is how can I parse ...
<p>some text<br />
<br />
<strong>categorized: </strong>like this<br />
<br /></p>
... where I am ultimately interested in obtaining key value pairs like "categorized","like this" using Java and Jsoup? I am looking at the <strong> tag to be some kind of a delimiter I can use to indicate the key, then its following text which is inconveniently not enclosed in a tag I need to grab as the value.
I think the challenge for me is the "like this" part is not in an element. It is a sibling node but it is not selectable with CSS, so I can't find it with Jsoup. I am not clear on how the Node and Element relationship works in Jsoup in such a way that I can get both the element text "categorized" and its sibling "like this" in a single call.
In more detail, I do not have control over the HTML structure since I am trying to collect data from many Consumer Product Safety Commission web pages. The pages are formatted in a few different ways, but there is one format in particular that is causing me problems using Java and Jsoup to parse out data.
<div class="archived">
<p style="text-align: center;"><strong><span style="color: #ff0000;">Note: The hotline number and ...</span></strong></p>
<h2 style="text-align: left;">CPSC, Elkay Manufacturing Co. Announces ...</h2>
<p>WASHINGTON, D.C. - The U.S. Consumer Product Safety Commission ...<br />
<br />
<strong>Name of product:</strong> Elkay hot/cold bottled water coolers <br />
<br />
<br />
<strong>Units:</strong> 145,000<br />
<br />
<strong>Description:</strong> These 115 volt hot/cold bottled water coolers ... <br />
<p><img title="Picture of Recalled Water Cooler" src="/PageFiles/73998/04175.jpg" alt="Picture of Recalled Water Cooler" width="110" height="434" /></p>
</div>
That particular section of HTML is shortened, but it originates from http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/
String url = "http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/";
Document doc = Jsoup.connect(url).get();
Elements archived = doc.select("div.archived > *");
for(Element ele : archived) {
//what goes here to get those key/value pairs?
}
This isn't a complete answer but it'll get you 95% there.
String url="http://www.cpsc.gov/en/Recalls/2004/CPSC-NETGEAR-Inc-Announce-Recall-of-Wall-Plug-Ethernet-Bridges-/";
Document doc = Jsoup.connect(url).get();
Elements archived = doc.select("div.archived strong");
for (Element element: archived){
System.out.println("KEY: " + element.text());
System.out.println("VALUE: " + element.nextSibling());
}
Output:
KEY: Firm's Hotline: (800) 303-5507
VALUE: <br />
KEY: Name of product:
VALUE: Wall Plug Ethernet Bridge
KEY: Units:
VALUE: About 53,500 units
KEY: Manufacturer:
VALUE: NETGEAR Inc., of Santa Clara, Calif.
KEY: Hazard:
VALUE: The plastic housing on these units can detach, posing a shock hazard.
and so on...
As you can see, it'll require a little bit of work to disregard the unnecessary stuff, like the first element KEY/VALUE pair and whatnot, but otherwise it should work! Good luck.

<logic:iterate question

I need to pull all the messages from a table and display them on the JSP page.
The list of messages is stored as:
SimpleStringVO string value: Welcome to the XYZ Tool homepage. ,SimpleStringVO string value: Here you can enter your account number ,SimpleStringVO string value: ,SimpleStringVO string value: thank you
When I try to display this in the JSP page, the message is not formatted as it is stored: "thank you" comes immediately after the 2nd string. I need to display the string of spaces and then the "thank you" string (the 3rd string contains only spaces). My JSP code looks like this:
<td><logic:iterate name="AllNewsCashe" id="news" type="com.fw.valueobject.SimpleStringVO">
<bean:write name="news" property="stringValue"/>
</logic:iterate>
</td>
How can I display these messages exactly as stored, without reformatting?
If you need to force a space in HTML, you will need to store it as &nbsp; instead of " ". However, since you are using the bean:write tag, you can probably use the filter property to retain &nbsp;:-
<bean:write name="news" property="stringValue" filter="false" />
Here's the description of filter from Struts documentation:-
If this attribute is set to true, the
rendered property value will be
filtered for characters that are
sensitive in HTML, and any such
characters will be replaced by their
entity equivalents.
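To illustrate why filtering matters here, this is a rough stand-in for that kind of HTML filtering. It is not the Struts source, and the class and method names are made up. With filtering on, the & of a stored &nbsp; is itself escaped, so the browser shows the literal text instead of a non-breaking space.

```java
class FilterSketch {
    // Hypothetical approximation of HTML filtering: replaces characters
    // that are sensitive in HTML with their entity equivalents.
    static String filterHtml(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Filtered: the entity is escaped and shows up as literal text.
        System.out.println(filterHtml("&nbsp;&nbsp;")); // &amp;nbsp;&amp;nbsp;
        // Unfiltered (filter="false"): the value is written as stored.
        System.out.println("&nbsp;&nbsp;");
    }
}
```

With filter="false" the stored strings are written to the page as-is, so any markup they contain is rendered too.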
