Remove binary data from html file using Java Regex - java

I have an HTML file that contains tags with binary data, like:
<HTML>
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">
<TEXT>
begin 644 image_002.jpg
M_]C_X 02D9)1# ! 0 0 ! #_VP!# #&!#<&!0#'!P<)"0#*#!0-# L+
M#!D2$P\4'1H?'AT:'!P#)"XG("(L(QP<*#<I+# Q-#0T'R<Y/3#R/"XS-#+_
MVP!# 0D)"0P+#!#-#1#R(1PA,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R
,Z4]1]: %HHHIB/_9
end
</TEXT>
<TEXT>losses occurring in the third quarter and from weather </TEXT>
</BODY>
</HTML>
So I am trying to remove all "TEXT" tags that contain binary data using a Java regex. I tried the Jsoup library, but it only removes HTML tags. I saw the same question here, but it does not use a Java regex.
Is there a standard way to remove this binary data from an HTML file?

It is well known that you shouldn't use a regex to handle XHTML.
I would use jsoup to remove the whole tag and then add it back empty.
But if you want to use a regex, you can use one like this (note that it empties every TEXT element, not only the ones holding binary data):
"your html here".replaceAll("(?s)<TEXT>.*?<\\/TEXT>", "<TEXT></TEXT>")
Working demo
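To touch only the TEXT elements that hold binary data, the pattern can be anchored on the "begin <mode> <name>" and "end" lines that wrap the data in the question's sample. The following is a minimal sketch under that assumption; the class name UuencodedTextStripper and the method strip are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

class UuencodedTextStripper {
    // Assumption: a binary block is a <TEXT> element whose body starts with
    // "begin <mode> <name>" and finishes with a line reading "end".
    // (?s) lets '.' cross line breaks; '.*?' stops at the first closing "end".
    private static final Pattern UU_TEXT = Pattern.compile(
            "(?s)<TEXT>\\s*begin \\d{3,4} \\S+.*?\\nend\\s*</TEXT>\\s*");

    // Removes every matching <TEXT> element and leaves all other markup alone.
    static String strip(String html) {
        return UU_TEXT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<BODY>\n"
                + "<TEXT>\nbegin 644 image_002.jpg\nM_]C_X\n,Z4]1]: %HHHIB/_9\nend\n</TEXT>\n"
                + "<TEXT>losses occurring in the third quarter</TEXT>\n"
                + "</BODY>";
        System.out.println(strip(html));
    }
}
```

The lazy quantifier keeps the match from running into a later TEXT element, and a TEXT element without the "begin" header is never matched in the first place.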

val regex = """<TEXT>\s*begin \d+ (?>[^e]+|e(?!nd\s*<\/TEXT>))*end\s*<\/TEXT>"""
Full example available here

Related

Extract some data using Regex

I have been struggling for some time to extract JSON data from an HTML tag. To be more specific, it's a script tag, and using the JSOUP library I can get the data between the script tags. But inside there is some JSON data which I can't extract. Here is the tag:
<script type="text/javascript">jwplayer.key="WbtWzGvcRNi6Tk+gtKldIbx+nn6lXZFvKiaO2g==";jwplayer("tvplayer").setup({playlist:[{image: "http://img.canlitvlive.io/yayin/trt1_480.jpg?1509735585",title:"TRT 1 Canlı Yayın - CanliTVLive.io",file : "http://yayin.canlitvlive.io/trt1/live.m3u8?tkn=8JD95lXv9dOUXwtgOTBYfw&tms=1509749985"}],...</script>
I need the URL from the file property, which is inside the jwplayer setup call. I tried a regular expression, for example something like this:
"playlist[\":\\s\\{]+file[\":\\s\\{]+\"([^\"]+)\""
But I don't have much experience with regex and can't figure out the right pattern. Can someone help with this? Thanks.
I'm guessing you just need to allow for some whitespace:
file\s*:\s*"(.*?)"
https://regex101.com/r/4HldaP/3
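In Java the same idea needs a Matcher and a capture group. Here is a small sketch, assuming the script text has already been pulled out as a String (for example with JSOUP); the class and method names are hypothetical, and [^"]* is used in place of .*? as an equivalent way to stop at the closing quote.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaylistFileExtractor {
    // "file", optional whitespace, a colon, optional whitespace, then a quoted value.
    // Group 1 captures everything between the double quotes.
    private static final Pattern FILE_URL =
            Pattern.compile("\\bfile\\s*:\\s*\"([^\"]*)\"");

    // Returns the first quoted value following "file :", if there is one.
    static Optional<String> extractFileUrl(String script) {
        Matcher m = FILE_URL.matcher(script);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String script = "jwplayer(\"tvplayer\").setup({playlist:[{image: \"http://img.canlitvlive.io/yayin/trt1_480.jpg\","
                + "title:\"TRT 1\",file : \"http://yayin.canlitvlive.io/trt1/live.m3u8?tkn=abc\"}]});";
        System.out.println(extractFileUrl(script).orElse("not found"));
    }
}
```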

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to convert a bunch of HTML documents to XML compliance (via a java method) and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not address the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?
This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101
You probably want <br\b[^>]*> to match all tags that:
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
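Plugged into the replaceAll call from the question, that pattern could look like the sketch below. The (?i) flag is carried over from the question so that <BR> and <br> are both covered; the class and method names are only illustrative.

```java
class BrNormalizer {
    // (?i)   : case-insensitive, so <br>, <BR> and <Br> all match
    // \b     : word boundary, so <brown> is left alone
    // [^>]*  : any attributes (or a trailing slash) up to the closing '>'
    static String normalize(String html) {
        return html.replaceAll("(?i)<br\\b[^>]*>", "<br/>");
    }

    public static void main(String[] args) {
        String html = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>a<br>b<br />c<brown>";
        System.out.println(normalize(html));
    }
}
```

Because [^>]* cannot cross a >, each tag is matched on its own even when several of them share a line.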
You have to use .* instead of *:
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
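because:
* matches the preceding character or subexpression zero or more times (in your pattern, only the space)
and
.* matches any character zero or more times
Keep in mind that .* is greedy, so if several tags share a line it matches through the last > on that line, and the space in the pattern means a plain <br> is no longer matched.
So for your case: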
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Process Thymeleaf variable as HTML code and not text

I'm using Thymeleaf to process HTML templates. I understand how to append inline strings from my controller, but now I want to append a fragment of HTML code to the page.
For example, let's say that I have this in my Java application:
String n="<span><i class=\"icon-leaf\"></i>"+str+"</span> \n";
final WebContext ctx = new WebContext(request, response,
servletContext, request.getLocale());
ctx.setVariable("n", n);
What do I need to write in the HTML page so that it would be replaced by the value of the n variable and be processed as HTML code instead of it being encoded as text?
You can use th:utext attribute that stands for unescaped text (see documentation). Use this with caution and avoid user input in th:utext as it can cause security problems.
<div th:remove="tag" th:utext="${n}"></div>
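Since th:utext writes the value out verbatim, one way to follow the caution above is to escape the dynamic part before it is concatenated into the fragment. The sketch below assumes str from the question may come from user input; escapeHtml and leafSpan are hypothetical helpers written for this example, not Thymeleaf APIs.

```java
class SafeFragment {
    // Hypothetical helper: escapes the five characters that are significant
    // in HTML text and attribute values.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds the fragment from the question: the markup is trusted,
    // the inserted text is escaped.
    static String leafSpan(String str) {
        return "<span><i class=\"icon-leaf\"></i>" + escapeHtml(str) + "</span> \n";
    }

    public static void main(String[] args) {
        System.out.print(leafSpan("<script>alert(1)</script>"));
    }
}
```

The value returned by leafSpan would then be set as the n variable in place of the raw concatenation.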
If you want the short-hand syntax, you can use the following:
[(${variable})]
The escaped short-hand syntax is
[[${variable}]]
but if you replace the inner square brackets [ ] with parentheses ( ), the HTML is not escaped.
Example within tags:
<div>
[(${variable})]
</div>
Starting with Thymeleaf 3.0 the HTML-friendly tag would be:
<div class="mailbox-read-message" data-th-utext="*{body}"></div>

Test if position/character is inside a HTML tag

What is the easiest way to find out if a position is inside an HTML tag in a string containing HTML-formatted text?
Example:
This could be my text:
This is a text and this is also <b>part</b> of the <b /> text.
Given the position x, how can I test if I am currently in a HTML tag or not? I suppose I'll have to test if I am in one of these situations (* is my position):
- < * > ... </>
- <...> * </>
- < * />
But what is an efficient approach to handle this?
There are some answers about this at the following link:
Java HTML Parsing
Basically, use a library to do the HTML parsing. I personally used JSoup some months ago and it worked perfectly.
Next time search first ;)
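If pulling in a parser is not an option, a plain scan can answer the narrower question of whether a position falls on tag markup (anything from a < through its matching >). This is a simplified sketch, not a parser: it assumes no > appears inside attribute values, comments or scripts, and the names below are invented for the example.

```java
class TagPosition {
    // True when the character at pos lies between a '<' and the next '>',
    // both ends included. Unterminated tags are treated as "not inside".
    static boolean isInsideTag(String html, int pos) {
        int open = html.lastIndexOf('<', pos);   // nearest '<' at or before pos
        if (open < 0) {
            return false;                        // no tag starts before pos
        }
        int close = html.indexOf('>', open);     // where that tag ends
        return close >= pos;
    }

    public static void main(String[] args) {
        String text = "This is a text and this is also <b>part</b> of the <b /> text.";
        System.out.println(isInsideTag(text, text.indexOf("<b>") + 1)); // on the 'b' of <b>
        System.out.println(isInsideTag(text, text.indexOf("part")));    // on plain text
    }
}
```

Telling apart the second situation from the question (text between an opening and a closing tag) needs knowledge of nesting, which is where a real parser earns its keep.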

How to keep the HTML tags specified

I am using this pattern to remove all HTML tags (Java code):
String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);
Now, I want to keep tag <a ...> (with </a>) and tag <img ...>
I want the result to be:
text <a href=#>link</a> b pic<img src=#>
How to do this?
I don't need an HTML parser for this,
because I need this regex pattern to filter a lot of HTML fragments,
so I want a solution that uses regex.
You could do this using a negative lookahead:
"<(?!(?:a|/a|img)\\b).*?>"
Rubular
However this has a number of problems and I would recommend instead that you use an HTML parser if you want a robust solution.
For more information see this question:
What HTML parsing libraries do you recommend in Java
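Applied to the input from the question, the lookahead pattern can be tried out as in this sketch. The class and method names are illustrative only, and the caveats above about regex-based tag stripping still hold.

```java
class KeepSelectedTags {
    // Removes every tag except <a ...>, </a> and <img ...>.
    // (?!(?:a|/a|img)\b) rejects a '<' that is followed by one of the kept names;
    // the \b keeps tags such as <abbr> from being mistaken for <a>.
    static String strip(String html) {
        return html.replaceAll("<(?!(?:a|/a|img)\\b).*?>", "");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        System.out.println(strip(html));
    }
}
```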
Check out http://sourceforge.net/projects/regexcreator/. It is a very handy GUI regex editor.
Hey! Here is your answer:
You can’t parse [X]HTML with regex.
Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.
If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
See also this answer.
I recommend you use strip_tags (a PHP function)
string strip_tags ( string $str [, string $allowable_tags ] )
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
OUTPUT
Test paragraph. Other text
<p>Test paragraph.</p> Other text
