Removing HTML entities while preserving line breaks with JSoup

I have been using JSoup to parse lyrics, and it has been great until now, but I have run into a problem.
I can use Node.html() to return the full HTML of the desired node, which retains line breaks as such:
Glóandi augu, silfurnátt
<br />Blóð alvöru, starir á
<br />Óður hundur er í vígamóð, í maga... mér
<br />
<br />Kolniður gref, kvik sem dreg hér
<br />Kolniður svart, hvergi bjart né
But this has the unfortunate side effect, as you can see, of retaining HTML entities and tags.
However, if I use Node.text(), I can get a better looking result, free of tags and entities:
Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,
This has another unfortunate side effect: it removes the line breaks and compresses everything onto a single line.
Simply replacing <br /> in the node with a newline before calling Node.text() yields the same result; it seems that the method itself compresses the text onto a single line, ignoring newlines.
Is it possible to have the best of both worlds, with tags and entities replaced correctly while preserving the line breaks, or is there another way of decoding entities and removing tags without having to replace them manually?

Disclaimer: I haven't used this API, but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Breaks could be inserted when special tags like <br> are encountered.
The TextNode.getWholeText() call also looks useful.

Based on another answer from Stack Overflow, I added a few fixes and came up with:
String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();
Hope this helps

Related

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to convert a bunch of HTML documents to XML compliance (via a java method) and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not address the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?

This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101

You probably want <br\b[^>]*> to match all tags that
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
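The rules above can be sketched in Java as follows. This is a minimal illustration, not a full HTML-aware solution: the class and method names (BrNormalizer, normalizeBr) are made up for this example, and the pattern assumes that no attribute value contains a literal > character.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
final class BrNormalizer {
    // (?i)   case-insensitive, so <br>, <BR> and <Br> all match
    // \b     word boundary, so <brown> is left alone
    // [^>]*  any attributes or a trailing slash, up to the closing >
    private static final Pattern BR_TAG = Pattern.compile("(?i)<br\\b[^>]*>");

    static String normalizeBr(String html) {
        return BR_TAG.matcher(html).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        // prints <br/>
        System.out.println(normalizeBr("<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>"));
        // prints a<br/>b<br/>c<brown>d
        System.out.println(normalizeBr("a<br>b<br />c<brown>d"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which matters when many documents are converted.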

You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
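because:
* matches the preceding character or subexpression zero or more times, so " *" only matches a run of spaces,
and
.* matches any character zero or more times.
Note that .* is greedy: if a line contains several tags, "<br .*>" matches from the first "<br " to the last ">" on that line, and because it requires a space after "br" it does not match a plain <br>. A pattern such as <br\b[^>]*> avoids both problems.
So for your case: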
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>

Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Data from database not same in java string

Here is my data when I view it in the SQL Developer tool:
introduction
topic 1
topic end
After I read it using a ResultSet,
ResultSet result = stmt.executeQuery();
result.getString("description")
and display it in a JSP page as
<bean:write name="data" property="description" />
it is displayed like this:
introduction topic 1 topic end
How can I keep the display the same as in SQL Developer?

Newlines aren't preserved in HTML. You need to either tell the browser it's preformatted:
<pre>
<bean:write name="data" property="description"/>
</pre>
Or replace the newlines with HTML line breaks. See this question for examples.
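A minimal sketch of the second option in plain Java. The class and method names (LineBreaks, escapeHtml, toHtmlLines) are made up for this example. The text is HTML-escaped first and only then are the line breaks turned into <br /> tags, so the result has to be written to the page without being escaped a second time (for bean:write that presumably means switching off its filtering, which is an assumption worth checking against the Struts documentation).

```java
// Hypothetical helper, named here only for illustration.
final class LineBreaks {
    /** Escapes the characters that are significant in HTML text. */
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;")   // must come first
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Escapes the text, then turns \r\n, \r and \n into <br /> tags. */
    static String toHtmlLines(String text) {
        return escapeHtml(text).replaceAll("\\r\\n|\\r|\\n", "<br />");
    }

    public static void main(String[] args) {
        // prints introduction<br />topic 1<br />topic end
        System.out.println(toHtmlLines("introduction\r\ntopic 1\ntopic end"));
    }
}
```

Escaping before inserting the tags keeps user-supplied markup in the database from being rendered as HTML.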

how can i keep the display same as in the SQL Developer?
The data presumably contains line breaks, e.g. "\r\n" or "\n". If you look at the source of your JSP, you'll probably see them there. However, HTML doesn't treat those as line breaks for display purposes - you'll need to either use the <br /> tag, or put each line in a separate paragraph, or something similar.
Basically, I don't think this is a database problem at all - I think it's an HTML problem. You can experiment with a static HTML file which you edit locally and display in your browser. Once you know the HTML you want to generate, then work on integrating it into your JSP.

Handle line breaks from backing bean in JavaScript

I'm trying to use a string from my backing bean which may contain line breaks as a parameter for my JavaScript method:
Snippet from xhtml:
<a4j:commandLink id="showEntry"
immediate="true"
styleClass="smallSpaceLeft"
action="#{bean1.method()}"
onclick="jsMethod('#{entry.text}')"
value="#{messages['general.click']}" />
Everything works fine unless the string contains line breaks.
E.g.: #{entry.text} = "First line.\nSecond line."
The html-output looks like:
<a class="smallSpaceLeft" href="#" id="j_id279:0:showEntry"
name="j_id279:0:showEntry" onclick="jsMethod('First line.
Second line.');A4J.AJAX.Submit('j_id272',event,
{'similarityGroupingId':'j_id279:0:showEntry','parameters':
{'j_id279:0:showEntry':'j_id279:0:showEntry'} } );return false;">Click me</a>
So the JavaScript is broken as a line break ends a command. How can I avoid this?

You cannot handle it in JavaScript; you must replace the line breaks on the server before you print the code.
Inside a string literal you may put a backslash before the line break. But as there may be other problematic characters (quotes, for instance), I would prefer to URL-encode the string and then decode it in JavaScript using decodeURIComponent().
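The server-side half of that could look roughly like this. It is a sketch under assumptions: the class and method names (JsParamEncoder, encodeForJs) are invented for the example. One detail to watch is that java.net.URLEncoder produces form encoding, where a space becomes +, and decodeURIComponent() does not turn + back into a space, so the sketch rewrites + as %20 (a literal plus sign in the input has already become %2B at that point).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, named here only for illustration.
final class JsParamEncoder {
    /** Encodes text so it can sit inside a quoted JavaScript string literal. */
    static String encodeForJs(String text) {
        try {
            // Line breaks become %0A / %0D and quotes become %27 / %22.
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints First%20line.%0ASecond%20line.
        System.out.println(encodeForJs("First line.\nSecond line."));
    }
}
```

The page would then call something like jsMethod(decodeURIComponent('...')) with the encoded value in place of the dots; how the encoded value is exposed to the view (for example through an extra bean property) is left open here.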

Using Both Tagged And Untagged Data With XPath

I'm trying to parse some HTML using XPath in Java. Consider this HTML:
<td class="postbody">
<img src="...""><br />
<br />
<b>What is Blah?</b><br />
<br />
Blah blah blah
<br />
Note that "What Is Blah" is helpfully contained within a b tag and is therefore easily parseable. But "Blah blah blah" is out in the open, and so I can only pick it up by calling text() on its parent node.
Thing is, I need to go through this in sequence, putting the img down, then the bolded text, then the body text. It's important it ends up in order (it needn't be processed in order, if you can suggest a way that takes two passes).
So are there any suggestions for how, if I've got the above contained within a Java XPath node, I can go through it in turn and get what I need?

I think a SAX-based parser would be a better tool for this problem. It's event-based, so you can process your XML document in order.
But it's an XML parser, so you'll need a valid XML document. I have never used JTidy, but it's a Java port of HTML Tidy, so hopefully it can help you transform your (invalid) HTML documents into valid XML.

Use this XPath expression evaluated with the parent of the provided XML fragment as the context node:
node()
This selects every child node of the context node: every element child, every text-node child, every comment child and every PI (processing instruction) child.
In case you want to exclude comments and PIs, use:
node()[not(self::comment() or self::processing-instruction())]
In case that, in addition to this, you don't want to select the whitespace-only text nodes, use:
node()
[not(self::comment() or self::processing-instruction())]
[not(self::text()[normalize-space() = ''])]
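A sketch of how such an expression might be evaluated with the JDK's built-in XPath API, walking the children in document order. The class and method names (OrderedChildren, childrenInOrder) are invented for the example, the input is assumed to be well-formed XML already, and whitespace-only text is detected with normalize-space().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, named here only for illustration.
final class OrderedChildren {
    /** Lists element and non-blank text children of the root, in document order. */
    static List<String> childrenInOrder(String xml) {
        try {
            Node parent = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "node()[not(self::comment() or self::processing-instruction())]"
                  + "[not(self::text()[normalize-space() = ''])]",
                    parent, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                if (node.getNodeType() == Node.TEXT_NODE) {
                    result.add(node.getTextContent().trim());
                } else {
                    String text = node.getTextContent();
                    result.add("<" + node.getNodeName() + ">" + (text.isEmpty() ? "" : " " + text));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<td class=\"postbody\"><img src=\"x.png\"/><br/>\n<br/>\n"
                + "<b>What is Blah?</b><br/>\n<br/>\nBlah blah blah\n<br/></td>";
        // prints [<img>, <br>, <br>, <b> What is Blah?, <br>, <br>, Blah blah blah, <br>]
        System.out.println(childrenInOrder(xml));
    }
}
```

Because the untagged body text comes back as an ordinary text node between the element nodes, one pass over the node list is enough to emit the image, the bolded heading and the body in their original order.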

Using PrintWriter, I am getting Chinese junk characters in browser

I am using PrintWriter as follows to get the output in the browser:
PrintWriter pw = response.getWriter();
StringBuffer sb = getTextFromDatabase();
pw.print(sb);
However, this prints the following Chinese junk characters:
格㸳潃浭湥獴⼼㍨‾琼扡敬㰾牴戠捧汯牯✽䔣䔷䔷❆㰾摴倾獯整⁤湏›〱㈭ⴷ〲〱ㄠ㨴㌰㔺਱‬祂›教桳慷瑮丠祡歡⠊湹祡歡捀獩潣挮浯਩硅散汬湥㱴琯㹤⼼牴㰾牴戠捧汯牯✽䔣䔷䔷❆㰾摴㰾琯㹤⼼牴㰾牴戠捧汯牯✽䔣䔷䔷❆㰾摴倾獯整⁤湏›〱㈭ⴷ〲〱ㄠ㨴㐰ㄺ਱‬祂›教桳慷瑮丠祡歡⠊湹祡歡捀獩潣挮浯਩敶祲朠潯㱤琯㹤⼼牴㰾牴戠捧汯牯✽䔣䔷䔷❆㰾摴㰾琯㹤⼼牴㰾牴戠捧汯牯✽䔣䔷䔷❆㰾摴倾獯整⁤湏›〱㈭ⴷ〲〱ㄠ㨴㜱㌺ਸ਼‬祂›教桳慷瑮丠祡歡⠊湹祡歡捀獩潣挮浯਩桔獩椠⁳潴琠獥㱴琯㹤⼼牴㰾琯扡敬㰾牢⼠‾格㸳潐瑳夠畯⁲潃浭湥㱴栯㸳㰠潦浲愠瑣潩㵮䌢浯敭瑮即牥汶瑥•敭桴摯∽敧≴渠浡㵥挢浯敭瑮潆浲•湯畳浢瑩∽爠瑥牵⁮慖楬慤整潆浲⤨∻‾琼扡敬†眠摩桴∽〳∰栠楥桧㵴㌢〰㸢ठ琼㹲琼㹤氼扡汥映牯∽慮敭㸢潃浭湥㩴猼慰⁮汣獡㵳洢湡呤汃獡≳⨾⼼灳湡㰾氯扡汥㰾牢㸯琼硥慴敲⁡慮敭∽潣瑮湥≴椠㵤挢浯敭瑮硔䅴敲≡挠慬獳∽整瑸牡慥氠牡敧•潣獬∽㠲•潲獷∽∶㸠⼼整瑸牡慥㰾琯㹤⼼牴㰾牴㰾摴㰾慬敢⁬潦㵲渢浡≥举浡㩥猼慰⁮汣獡㵳洢湡呤汃獡≳⨾⼼灳湡㰾氯扡汥㰾牢㸯椼灮瑵椠㵤渢浡≥琠灹㵥琢硥≴渠浡㵥渢浡≥挠慬獳∽慮敭•慶畬㵥∢洠硡敬杮桴∽㔲∵†楳敺∽㘳⼢㰾琯㹤⼼牴㰾牴㰾摴㰾慬敢⁬潦㵲攢慭汩㸢ⵅ慍汩㰺灳湡挠慬獳∽慭摮䍔慬獳㸢㰪猯慰㹮⼼慬敢㹬戼⽲㰾湩異⁴摩∽浥楡≬琠灹㵥琢硥≴渠浡㵥攢慭汩•汣獡㵳攢慭汩•慶畬㵥∢洠硡敬杮桴∽㔲∵†楳敺∽㘳⼢㰾琯㹤⼼牴㰾牴㰾摴㰾湩異⁴琠灹㵥猢扵業≴†慮敭∽潰瑳•慶畬㵥倢獯≴㸯⼼摴㰾琯㹲⼼慴汢㹥⼼潦浲
I tried to use String instead of StringBuffer, but that didn't help. I also tried to set the content type header as follows
response.setContentType("text/html;charset=UTF-8");
before getting the response writer, but that did not help either.
In the DB there are no issues with the data as I have used the same data for 2 different purposes. In one I get correct output, but in other I get the above junk. I have used the above code in JSP using scriptlets. I have also given content type for the JSP.

Getting Chinese characters as mojibake indicates that single-byte (ASCII/UTF-8) data is being interpreted as UTF-16LE somewhere along the way. UTF-16 stores each basic character in a 2-byte code unit, so every pair of ASCII bytes is read as one code unit, and those values usually fall in the CJK (Chinese/Japanese/Korean) ranges.
To fix this, the encoding used to read the data must match the encoding in which it was stored. Since you're attempting to display the data as UTF-8, I think your DB and the connection to it should be configured consistently for UTF-8 instead of UTF-16LE.
Unrelated to the concrete problem, storing HTML (that was what those characters originally represent) in a database is really a bad idea ;) This was the original content:
<h3>Comments</h3> <table><tr bgcolor='#E7E7EF'><td>Posted On: 10-27-2010 14:03:51
, By: Yeshwant Nayak
(ynayak#cisco.com)
Excellent</td></tr><tr bgcolor='#E7E7EF'><td></td></tr><tr bgcolor='#E7E7EF'><td>Posted On: 10-27-2010 14:04:11
, By: Yeshwant Nayak
(ynayak#cisco.com)
very good</td></tr><tr bgcolor='#E7E7EF'><td></td></tr><tr bgcolor='#E7E7EF'><td>Posted On: 10-27-2010 14:17:36
, By: Yeshwant Nayak
(ynayak#cisco.com)
This is to test</td></tr></table><br /> <h3>Post Your Comment</h3> <form action="CommentsServlet" method="get" name="commentForm" onsubmit=" return ValidateForm();"> <table width="300" height="300"> <tr><td><label for="name">Comment:<span class="mandTClass">*</span></label><br/><textarea name="content" id="commentTxtArea" class="textarea large" cols="28" rows="6" ></textarea></td></tr><tr><td><label for="name">Name:<span class="mandTClass">*</span></label><br/><input id="name" type="text" name="name" class="name" value="" maxlength="255" size="36"/></td></tr><tr><td><label for="email">E-Mail:<span class="mandTClass">*</span></label><br/><input id="email" type="text" name="email" class="email" value="" maxlength="255" size="36"/></td></tr><tr><td><input type="submit" name="post" value="Post"/></td></tr></table></form
Here's how you can turn this incorrectly encoded Chinese back to normal characters:
String incorrect = "格㸳潃浭湥獴⼼㍨‾琼扡敬㰾牴戠捧汯";
String original = new String(incorrect.getBytes("UTF-16LE"), "UTF-8");
Note that this should not be used as a solution! It is posted only as evidence of the root cause of the problem.
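The same effect can be reproduced in the other direction, which makes the cause easier to see. The sketch below (class and method names are invented for the example) takes the first characters of the original markup, decodes their UTF-8 bytes as UTF-16LE, and then reverses the mistake. It uses an input with an even number of bytes, because a trailing odd byte cannot form a UTF-16 code unit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named here only for illustration.
final class MojibakeDemo {
    /** Decodes the UTF-8 bytes of the text as if they were UTF-16LE. */
    static String misreadAsUtf16Le(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_16LE);
    }

    /** Reverses the mistake: back to bytes as UTF-16LE, then decode as UTF-8. */
    static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "<h3>Comments"; // 12 ASCII bytes, i.e. 6 UTF-16 code units
        String garbled = misreadAsUtf16Le(original);
        // '<' (0x3C) and 'h' (0x68) combine into the code unit 0x683C, and so on,
        // giving the same characters that start the garbled output in the question.
        System.out.println(garbled);
        System.out.println(recover(garbled)); // prints <h3>Comments
    }
}
```

As with the snippet above, this only demonstrates the root cause; the real fix is to use one encoding consistently.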

Clearly, you have some kind of encoding problem here, but my guess is it is on the server or database side, not in the browser.
In the DB there are no issues with the data as i have used the same data for 2 different options,but in one i get correct output n in other junk.
I don't find that argument convincing. In fact, I think you may be overlooking the real cause of the problem.
What I think you need to do is add some server-side logging to capture what is actually in that StringBuffer that you are sending to the PrintWriter.
Also, look at what is different about the way that the server side handles the "2 different options". (What do you mean by that phrase?).
Finally, please provide some REAL code, not just 3 line snippets that won't compile.
