I have an arbitrarily large HTML string with incorrectly escaped attribute values. I would like to get the full HTML string with properly escaped attribute values. I would like to do this in Java.
For example, given this incorrectly escaped HTML tag:
<p name="Chalupa "Batman" McArthur">Chalupa "Batman" McArthur</p>
I want this output:
<p name="Chalupa &quot;Batman&quot; McArthur">Chalupa "Batman" McArthur</p>
StringEscapeUtils.escapeHtml() or replaceAll() replaces all invalid HTML characters like this:
&lt;p name=&quot;Chalupa &quot;Batman&quot; McArthur&quot;&gt;Chalupa &quot;Batman&quot; McArthur&lt;/p&gt;
I want the characters within attribute values escaped properly, but the rest of the HTML left alone so that a browser can process it properly. Is there a Java library that can handle this?
Related
I have an HTML file that has tags containing binary data, like:
<HTML>
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">
<TEXT>
begin 644 image_002.jpg
M_]C_X 02D9)1# ! 0 0 ! #_VP!# #&!#<&!0#'!P<)"0#*#!0-# L+
M#!D2$P\4'1H?'AT:'!P#)"XG("(L(QP<*#<I+# Q-#0T'R<Y/3#R/"XS-#+_
MVP!# 0D)"0P+#!#-#1#R(1PA,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R
,Z4]1]: %HHHIB/_9
end
</TEXT>
<TEXT>losses occurring in the third quarter and from weather </TEXT>
</BODY>
</HTML>
So I am trying to remove all "TEXT" tags that contain binary data using a Java regex. I tried the Jsoup library, but it only removes HTML tags. I saw the same question here, but it does not use a Java regex.
Is there any standard way to remove this binary data from an HTML file?
It is well known that you shouldn't use a regex to handle XHTML.
I would use jsoup to remove the whole tag and then add it back empty.
But if you want to use a regex, you can use one like this (note that it empties every TEXT element, not only those with binary data):
"your html here".replaceAll("(?s)<TEXT>.*?<\\/TEXT>", "<TEXT></TEXT>")
Working demo
val regex = """<TEXT>\s*begin \d+ (?>[^e]+|e(?!nd\s*<\/TEXT>))*end\s*<\/TEXT>"""
Full example available here
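For readers who want the second pattern in Java rather than Scala, here is a minimal sketch. It assumes the binary payload is always a uuencode block that starts with "begin <mode> <filename>" and ends with "end" just before the closing tag, as in the sample above. The class and method names (UuencodedTextStripper, strip) are made up for illustration, and the pattern uses a tempered lazy match instead of the atomic group shown above.

```java
import java.util.regex.Pattern;

class UuencodedTextStripper {
    // (?s) lets '.' match newlines. The (?!</TEXT>) lookahead keeps the match
    // from running past the end of the current TEXT element.
    private static final Pattern UUENCODED_TEXT = Pattern.compile(
            "(?s)<TEXT>\\s*begin \\d+ (?:(?!</TEXT>).)*?end\\s*</TEXT>");

    // Empties only TEXT elements whose body looks like a uuencode block.
    static String strip(String html) {
        return UUENCODED_TEXT.matcher(html).replaceAll("<TEXT></TEXT>");
    }

    public static void main(String[] args) {
        String html = "<TEXT>\nbegin 644 image_002.jpg\nM_]C_X\nend\n</TEXT>\n"
                + "<TEXT>losses occurring in the third quarter</TEXT>";
        System.out.println(strip(html));
    }
}
```

Unlike the first pattern, this leaves ordinary TEXT elements untouched. It is still a regex heuristic, so it should be checked against real files before being relied on.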
I am attempting to convert a bunch of HTML documents to XML compliance (via a java method) and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not address the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?
This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101
You probably want <br\b[^>]*> to match all tags that:
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
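As a rough illustration, the pattern described above could be applied in Java as follows. The class and method names (BrNormalizer, normalize) are hypothetical. The sketch also assumes that no attribute value contains a literal > character, since [^>]* would stop there.

```java
import java.util.regex.Pattern;

class BrNormalizer {
    // \b requires a word boundary after "br", so <brown> is not matched.
    private static final Pattern BR_TAG =
            Pattern.compile("<br\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces every <br ...> variant with a self-closed <br/>.
    static String normalize(String html) {
        return BR_TAG.matcher(html).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        String html = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
        System.out.println(normalize(html));
    }
}
```

Compiling the pattern once with CASE_INSENSITIVE avoids both the (?i) prefix and recompiling the regex on every replaceAll call.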
You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
because :
* Match the preceding character or subexpression 0 or more times.
and
.* Matches any character zero or many times
So for your case :
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
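One caveat worth checking before relying on the snippet above: .* is greedy, so when other tags follow on the same line the match runs to the last > on that line, and the mandatory space means a bare <br> is not matched at all. The sketch below compares that pattern with a negated character class. The class and method names are invented for the comparison.

```java
class GreedyBrDemo {
    // The pattern from the answer above: greedy, and it requires a space after "br".
    static String greedy(String html) {
        return html.replaceAll("(?i)<br .*>", "<br/>");
    }

    // Alternative: stop at the first '>' and allow a bare <br>.
    static String bounded(String html) {
        return html.replaceAll("(?i)<br\\b[^>]*>", "<br/>");
    }

    public static void main(String[] args) {
        String html = "<br clear=all>text<b>bold</b>";
        System.out.println(greedy(html));   // swallows everything up to the last '>'
        System.out.println(bounded(html));  // replaces only the br tag
    }
}
```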
Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.
In a Java application I have HTML, as a String, that looks like this:
<DIV STYLE="font-family:&quot;Times New Roman&quot;">
And I wish to decode the encoded quotes so that it is correctly displayed on the page. The problem is that the conventional StringEscapeUtils unescape methods will decode each encoded quote as a double quote, resulting in HTML like this:
<DIV STYLE="font-family:"Times New Roman"">
Which will not correctly render on the page. The desired result is for the HTML to look like this:
<DIV STYLE='font-family:"Times New Roman"'>
I can algorithmically examine the string to replace the encoded quotes to what I want but is there a dedicated method to correctly decode quotes for such a String?
If the HTML is defined in your Java code, you can try adding \ before each ".
I assume you are expecting something like this:
String randomHtmlCode = " <DIV STYLE='font-family:\"Times New Roman\"'> ";
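If the HTML arrives at runtime rather than as a literal, the "algorithmic" route mentioned in the question might look roughly like the sketch below. It is only a heuristic under stated assumptions: attribute values are double-quoted, the encoded quotes appear as the entity &quot;, and the value contains no single quote. The class and method names (AttributeQuoteRewriter, rewrite) are hypothetical, and the regex does not check that a match actually sits inside a tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeQuoteRewriter {
    // name="value" where the value contains no raw double quote.
    private static final Pattern ATTR = Pattern.compile("([\\w:-]+)=\"([^\"]*)\"");

    static String rewrite(String html) {
        Matcher m = ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2);
            String replacement = m.group(); // default: leave the attribute as it is
            if (value.contains("&quot;") && !value.contains("'")) {
                // Switch to single-quote delimiters and decode the inner quotes.
                replacement = m.group(1) + "='" + value.replace("&quot;", "\"") + "'";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<DIV STYLE=\"font-family:&quot;Times New Roman&quot;\">"));
    }
}
```

Values that already contain a single quote are left alone, because switching delimiters there would produce a broken attribute.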
I'm using Thymeleaf to process HTML templates. I understand how to append inline strings from my controller, but now I want to append a fragment of HTML code to the page.
For example, let's say that I have this in my Java application:
String n="<span><i class=\"icon-leaf\"></i>"+str+"</span> \n";
final WebContext ctx = new WebContext(request, response,
servletContext, request.getLocale());
ctx.setVariable("n", n);
What do I need to write in the HTML page so that it would be replaced by the value of the n variable and be processed as HTML code instead of it being encoded as text?
You can use th:utext attribute that stands for unescaped text (see documentation). Use this with caution and avoid user input in th:utext as it can cause security problems.
<div th:remove="tag" th:utext="${n}"></div>
If you want the short-hand syntax, you can use the following:
[(${variable})]
Escaped short-hand syntax is
[[${variable}]]
but if you replace the inner square brackets [ ] with parentheses ( ), the HTML is not escaped.
Example within tags:
<div>
[(${variable})]
</div>
Starting with Thymeleaf 3.0, the HTML-friendly syntax would be:
<div class="mailbox-read-message" data-th-utext="*{body}">
We are using Jsoup to parse, manipulate and extend an HTML template. Everything works fine until it comes to single quotes used to delimit HTML attribute values:
<span data-attr='JSON'></span>
That HTML snippet is converted to
<span data-attr="JSON"></span>
which conflicts with the inner JSON data, which is valid only with double quotes:
{"param" : "value"} //valid
{'param' : 'value'} //invalid
so we need to force Jsoup to NOT change those single quotes to double quotes, but how? Currently that is our code to parse and produce html content.
pageTemplate = Jsoup.parse(new File(mainTemplateFilePath), "UTF-8");
pageTemplate.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
pageTemplate.outputSettings().charset("UTF-8");
... adding some html
pageTemplate.html(); // will output the double quoted attributes :(
You need to HTML encode the JSON value before putting it into the data-attr attribute. When you do so, you should end up with this:
<span data-attr="{&quot;param&quot;:&quot;value&quot;}"></span>
Although that looks fairly daunting, it is actually valid HTML. When your corresponding JavaScript executes someSpan.getAttribute("data-attr"), the &quot; entities will be transformed into " characters automatically, giving you access to the original valid JSON string.
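The encoding step described above can be sketched in plain Java without any library. The class and method names (AttributeEscaper, escapeAttribute) are invented for this example, and the sketch handles only the four characters that matter inside a double-quoted attribute value.

```java
class AttributeEscaper {
    // Encodes the characters that are unsafe inside a double-quoted attribute value.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String json = "{\"param\":\"value\"}";
        System.out.println("<span data-attr=\"" + escapeAttribute(json) + "\"></span>");
    }
}
```

With the value encoded this way, it no longer matters that the attribute is delimited with double quotes in the serialized output.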