Remove special characters java


Hi, I'm trying to figure out a way to remove the tags from the results returned by the Google Feed API. A result looks like this:
Breaking \u003cb\u003eNews\u003c/b\u003e Updates
How can I remove these characters?
I'm not sure whether a regex would be better or worse. Does anyone have an idea how to remove these? Google does not supply an option to remove tags from the results in Java.

I routinely strip control characters with:
str = str.replaceAll("\\p{Cntrl}", "");

You can use the regex below:
String str = "Breaking \u003cb\u003eNews\u003c/b\u003e Updates";
str = str.replaceAll("\\<(.*)?\\>(.*)\\</\\1\\>", "$2");
System.out.println(str);
Output:
Breaking News Updates
\\<(.*)?\\> matches the opening tag, <b>.
\\</\\1\\> matches the corresponding closing tag, </b>.
\\1 is a backreference to the tag name captured by the first group, so only a correctly matched pair of tags is removed.
So for <b>news <update></b>, the <update> tag will not be removed.
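To see the backreference behaviour in isolation, here is a minimal sketch of the same pattern wrapped in a helper. The class and method names are made up for illustration, and the backslashes before < and > are dropped because they are optional in Java regexes.

```java
public class PairedTagStripper {

    // Removes one matched pair of tags and keeps the text between them.
    // Group 1 captures the tag name, group 2 the enclosed text, and \1
    // requires the closing tag to repeat the captured name.
    static String stripPairedTag(String input) {
        return input.replaceAll("<(.*)?>(.*)</\\1>", "$2");
    }

    public static void main(String[] args) {
        // The paired <b>...</b> is removed.
        System.out.println(stripPairedTag("Breaking <b>News</b> Updates"));
        // The unpaired <update> has no matching closing tag, so it stays.
        System.out.println(stripPairedTag("<b>news <update></b>"));
    }
}
```

This only handles simple, well-formed pairs on a single line; anything more irregular is better left to an HTML parser.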

The best solution would be to let a JSON parser convert the data. In JavaScript, for example:
JSON.parse(JSON.stringify({a : '<put your string here>'}));
This is appropriate because the data you get from the Google API is in JSON form.

This is HTML. \u003cb\u003e translates to <b>.
You'll want to use an HTML parser as HTML is not fully parse-able by a regular expression.
With a library like Jsoup you could do this as follows:
String data = Jsoup.parse(html).body().text();
This will get you "Breaking News Updates".
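If the escapes reach your code as literal text (a backslash, the letter u and four hex digits) rather than as decoded characters, they have to be decoded before an HTML parser can see the tags. A JSON parser normally does this for you. The following is only a sketch of the decoding step using the standard library, with a hypothetical class name, assuming the input contains nothing but simple four-digit escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeedTextCleaner {

    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each literal escape sequence with the character it encodes.
    static String decodeUnicodeEscapes(String raw) {
        Matcher m = UNICODE_ESCAPE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslashes from being
            // interpreted as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes keep the escapes literal in this Java source.
        String raw = "Breaking \\u003cb\\u003eNews\\u003c/b\\u003e Updates";
        System.out.println(decodeUnicodeEscapes(raw));
    }
}
```

The decoded string, Breaking <b>News</b> Updates, can then be handed to a parser such as Jsoup.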

Related

Extract some data using Regex

I've been struggling for some time to extract JSON data from an HTML tag. To be more specific, it's a script tag, and using the JSOUP library I can get the data between the script tags. But inside there is some JSON data which I can't extract. Here is the tag:
<script type="text/javascript">jwplayer.key="WbtWzGvcRNi6Tk+gtKldIbx+nn6lXZFvKiaO2g==";jwplayer("tvplayer").setup({playlist:[{image: "http://img.canlitvlive.io/yayin/trt1_480.jpg?1509735585",title:"TRT 1 Canlı Yayın - CanliTVLive.io",file : "http://yayin.canlitvlive.io/trt1/live.m3u8?tkn=8JD95lXv9dOUXwtgOTBYfw&tms=1509749985"}],...</script>
I need the URL from the file property inside the jwplayer setup call. I tried using a regular expression, for example something like this:
"playlist[\":\\s\\{]+file[\":\\s\\{]+\"([^\"]+)\""
But I don't have much experience with regex and can't figure out the right pattern. Can someone help with this? Thanks.

I'm guessing you just need to allow for some whitespace:
file\s*:\s*"(.*?)"
https://regex101.com/r/4HldaP/3
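In Java, the pattern above could be applied with Pattern and Matcher roughly as follows. This is a sketch only: the class and method names are hypothetical, and the sample script text is a shortened stand-in for the real tag content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaylistFileExtractor {

    // file, optional whitespace, a colon, optional whitespace, then the
    // quoted value captured reluctantly up to the next double quote.
    private static final Pattern FILE_URL =
            Pattern.compile("file\\s*:\\s*\"(.*?)\"");

    // Returns the first file URL found in the script text, or null if none.
    static String extractFileUrl(String script) {
        Matcher m = FILE_URL.matcher(script);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened sample of the script content, for illustration only.
        String script = "jwplayer(\"tvplayer\").setup({playlist:[{"
                + "image: \"http://img.canlitvlive.io/yayin/trt1_480.jpg\","
                + "title:\"TRT 1\","
                + "file : \"http://yayin.canlitvlive.io/trt1/live.m3u8?tkn=abc&tms=123\"}]})";
        System.out.println(extractFileUrl(script));
    }
}
```

The script text itself can still be obtained with JSOUP as in the question; only the extraction step changes.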

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to make a bunch of HTML documents XML-compliant (via a Java method), and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not handle the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?

This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101

You probably want <br\b[^>]*> to match all tags that:
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
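Combined with the (?i) flag from the question, that pattern could be used like this. The class and method names are invented for the example, and it assumes no attribute value contains a literal > character.

```java
public class BrNormalizer {

    // Rewrites every br tag, with or without attributes and in any case,
    // as a self-closing <br/>. The word boundary after "br" stops the
    // pattern from matching longer tag names such as <brown>.
    static String normalizeBr(String html) {
        return html.replaceAll("(?i)<br\\b[^>]*>", "<br/>");
    }

    public static void main(String[] args) {
        System.out.println(normalizeBr(
                "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>"));
        System.out.println(normalizeBr("a<br>b<br />c<brown>d"));
    }
}
```

Because [^>]* stops at the first >, several tags on one line are replaced one by one instead of being swallowed by a single match.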

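You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
because:
* matches the preceding character or subexpression zero or more times (in your pattern, the space),
and
.* matches any character zero or more times.
So for your case:
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output:
<br/>
Note that .* is greedy: if a line contains several tags, it matches through the last > on that line, and because the pattern requires a space after br it no longer matches a plain <br>.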

Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Remove the long leading whitespace before JSON data in an HTML pre tag

I use the Struts framework and I want to display a JSON string held in a Java variable:
String test = "{\n \"fileName\": \"\",\n \"fileUrl\": \"\",\n \"accountId\": ,\n \"totalRow\": \n}";
Displayed in the browser:
And the page source in Chrome:
How can I remove the whitespace at the start and end of the pre tag so that the JSON data is displayed neatly?

I found the problem. It was caused by the whitespace between the pre tag and the Struts s:property tag, which the pre element preserves:
Error:
Solution: move s:property right next to the pre tag:
<pre><s:property value="test" /></pre>
It then displays pretty JSON data like:

I know this isn't a proper answer, but as a hint: you can use regular expressions to solve your problem. I'll gladly help you.

Thymeleaf string substitution and escaping

I have a string which contains raw data, which I want escaped. The string also contains markers which I want to replace with span tags.
For example my string is
"blah {0}something to span{1} < random chars <"
I would like the above to be rendered within a div, with {0} replaced by an opening span tag and {1} by the closing span tag.
I have tried a number of things, including doing the substitution in my controller, and trying to use the th:utext attribute, however I then get SAX exceptions.
Any ideas?

Could you do this using i18n?
Something like:
resource.properties:
string.pattern=my name is {0} {1}
thymeleaf view:
<label th:text="#{__${#string.pattern('john', 'doe')}__}"></label>
The result should be:
my name is john doe
I'm not sure this is a good way, but I hope it helps.

It looks like using message parameters is the right approach to output formatted strings. See http://www.thymeleaf.org/doc/usingthymeleaf.html#messages
I suspect you need to pass character entity references in order to avoid SAX exceptions:
<span th:utext = "#{string.pattern(${'&lt;span&gt;john&lt;/span&gt;'}, ${'&lt;span&gt;doe&lt;/span&gt;'})}"/>
Alternatively place the markup in your .properties file:
string.pattern=my name is <span>{0}</span> <span>{1}</span>
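Another option is to do the work in the controller, which the question mentions trying: escape the raw text first, then substitute the markers, so that the only markup left in the result is the span tags. The sketch below is plain Java with hypothetical names, and it assumes the markers map to a bare <span> and </span>, since the question does not show the exact tags.

```java
public class MarkerSpanRenderer {

    // Escapes the characters that would otherwise be read as markup.
    static String escapeHtml(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escape first, then swap the markers for tags, so the span tags are
    // the only unescaped markup. The plain <span> is an assumption.
    static String render(String raw) {
        return escapeHtml(raw)
                .replace("{0}", "<span>")
                .replace("{1}", "</span>");
    }

    public static void main(String[] args) {
        System.out.println(render("blah {0}something to span{1} < random chars <"));
    }
}
```

The returned string is well-formed markup, so it should be safe to output with th:utext, provided the {0} and {1} markers come in matching pairs.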

How to keep the HTML tags specified

I am using this pattern to remove all HTML tags (Java code):
String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);
Now I want to keep the <a ...> tag (with </a>) and the <img ...> tag.
I want the result to be:
text <a href=#>link</a> b pic<img src=#>
How can I do this?
I don't want to use an HTML parser for this,
because I need this regex pattern to filter a lot of HTML fragments,
so I want a regex solution.

You could do this using a negative lookahead:
"<(?!(?:a|/a|img)\\b).*?>"
Rubular
However this has a number of problems and I would recommend instead that you use an HTML parser if you want a robust solution.
For more information see this question:
What HTML parsing libraries do you recommend in Java
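Applied to the string from the question, the negative lookahead pattern above behaves as follows. The class and method names are made up for the example, and the usual caveats about regex and HTML still apply.

```java
public class TagWhitelistFilter {

    // Removes every tag except <a ...>, </a> and <img ...>. The negative
    // lookahead rejects a match when the tag name right after '<' is one
    // of the allowed names followed by a word boundary.
    static String stripAllExceptLinksAndImages(String html) {
        return html.replaceAll("<(?!(?:a|/a|img)\\b).*?>", "");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        System.out.println(stripAllExceptLinksAndImages(html));
    }
}
```

This prints text <a href=#>link</a> b pic<img src=#>, which is the result the question asks for.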

Check this out: http://sourceforge.net/projects/regexcreator/. It is a very handy GUI regex editor.

Hey! Here is your answer:
You can’t parse [X]HTML with regex.

Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.
If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
See also this answer.

I recommend you use strip_tags (a PHP function)
string strip_tags ( string $str [, string $allowable_tags ] )
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
OUTPUT
Test paragraph. Other text
<p>Test paragraph.</p> Other text
