Want to replace special characters with equivalent UTF-8 symbols

As part of my application I have written a custom method to extract data from the DB and return it as a string. My string contains special characters such as the pound sign, which looks like this when extracted:
"MyMobile Blue &#163;54.99 [12 month term]"
I want the &#163; to be replaced with the actual pound symbol (£). Below is my method:
public String getOfferName(String offerId) {
log(Level.DEBUG, "Entered getSupOfferName");
OfferClient client = (OfferClient) ApplicationContext
.get(OfferClient.class);
OfferObject offerElement = getOfferElement(client, offerId);
if (offerElement == null) {
return "";
} else {
return offerElement.getDisplayValue();
}
}
Can someone help with this?

The string contains XML/HTML entities.
You can use the StringEscapeUtils.unescapeXml() method from commons-lang to convert these back to their Unicode equivalents.
If this is HTML rather than XML, use the corresponding HTML methods instead (for example unescapeHtml4() in commons-lang3), as the two sets of entities differ.

I voted for the StringEscapeUtils.unescapeXml() solution. Anyway, here is a custom solution:
String s = "MyMobile Blue &#163;54.99 [12 month term]";
Pattern p = Pattern.compile("&#(\\d+?);");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while(m.find()) {
int c = Integer.parseInt(m.group(1));
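// quoteReplacement keeps a decoded '$' or '\' from being read as a group reference
m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf((char) c)));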
}
m.appendTail(sb);
System.out.println(sb);
Output:
MyMobile Blue £54.99 [12 month term]
Note that it does not accept hexadecimal entity references (such as &#xA3;).
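The limitation noted above can be lifted with a slightly wider pattern. The following is only a sketch: the class name NumericEntityDecoder is made up for illustration, and it handles numeric references only (decimal and hexadecimal), not named entities such as &pound;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NumericEntityDecoder {
    // Group 1: hexadecimal digits (&#xA3;), group 2: decimal digits (&#163;).
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // References outside the Unicode range are left untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // Quote the replacement so '$' and '\' are inserted literally.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("MyMobile Blue &#163;54.99 or &#xA3;10 [12 month term]"));
    }
}
```

Character.toChars is used instead of a (char) cast so that code points above U+FFFF are emitted as surrogate pairs rather than truncated.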

Related

Java %u20AC conversion to euro €

How can I decode a string like "promo desc %u20AC" into "promo desc €"? I tried:
URLDecoder.decode("promo desc %u20AC", "UTF-16");
This doesn't work, because % must be followed by a hex value, whilst u20AC is not a valid hex string.
The string to decode is generated by JavaScript like this:
var string = escape("{€ć"); // "%7B%u20AC%u0107"
I didn't want to use URLDecoder because, semantically, it's not a URL I'm trying to decode but a very long text. In Java, % introduces a hex sequence and %u is illegal. I think that converting % to \ is a bit naive, since there may be other sequences of % in the text.
What I am after is this function here:
unescape("%7B%u20AC%u0107")
that exists in Javascript but not in Java to my knowledge. How can I achieve this in Java?
Thanks

I was curious, because I've not seen the %u escapes before, but it turns out unescaping them is fairly easy:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern JAVASCRIPT_ESCAPE_SEQUENCE = Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");

/**
 * Unescapes a JavaScript-escaped string.
 * Undoes the effect of calling
 * <a href="https://developer.mozilla.org/de/docs/Web/JavaScript/Reference/Global_Objects/escape">the JavaScript escape method</a>.
 */
static String unescape(String input) {
    Matcher matcher = JAVASCRIPT_ESCAPE_SEQUENCE.matcher(input);
    // appendReplacement(StringBuilder, ...) requires Java 9+; use StringBuffer on older versions
    StringBuilder sb = new StringBuilder(input.length());
    while (matcher.find()) {
        String escapeSequence = matcher.group(1);
        if (escapeSequence.startsWith("u")) {
            // Drop the leading 'u' of a %uXXXX sequence
            escapeSequence = escapeSequence.substring(1);
        }
        char c = (char) Integer.parseInt(escapeSequence, 16);
        // Quote the replacement so that a decoded '$' or '\' is inserted literally
        matcher.appendReplacement(sb, Matcher.quoteReplacement(Character.toString(c)));
    }
    matcher.appendTail(sb);
    return sb.toString();
}
Given this method, unescape("%7B%u20AC%u0107") produces the desired output {€ć.

Remove elements from Date Format String using a Regular Expression

I want to remove elements from a supplied date format string, for example converting the format "dd/MM/yyyy" to "MM/yyyy" by removing any non-M/y element.
What I'm trying to do is create a localised month/year format based on the existing day/month/year format provided for the Locale.
I've done this using regular expressions, but the solution seems longer than I'd expect.
An example is below:
public static void main(final String[] args) {
System.out.println(filterDateFormat("dd/MM/yyyy HH:mm:ss", 'M', 'y'));
System.out.println(filterDateFormat("MM/yyyy/dd", 'M', 'y'));
System.out.println(filterDateFormat("yyyy-MMM-dd", 'M', 'y'));
}
/**
 * Retains only {@code charsToRetain} in {@code format}, removing all other
 * pattern letters and any redundant separators.
 */
private static String filterDateFormat(final String format, final char...charsToRetain) {
// Match e.g. "MMM-"
final Pattern pattern = Pattern.compile("[" + new String(charsToRetain) + "]+\\p{Punct}?");
final Matcher matcher = pattern.matcher(format);
final StringBuilder builder = new StringBuilder();
while (matcher.find()) {
// Append each match
builder.append(matcher.group());
}
// If the last match is e.g. "MMM-", remove the trailing punctuation symbol
return builder.toString().replaceFirst("\\p{Punct}$", "");
}

Let's try a solution for the following date format strings:
String[] formatStrings = { "dd/MM/yyyy HH:mm:ss",
"MM/yyyy/dd",
"yyyy-MMM-dd",
"MM/yy - yy/dd",
"yyabbadabbadooMM" };
The following will analyze strings for a match, then print the first group of the match.
Pattern p = Pattern.compile(REGEX);
for(String formatStr : formatStrings) {
Matcher m = p.matcher(formatStr);
if(m.matches()) {
System.out.println(m.group(1));
}
else {
System.out.println("Didn't match!");
}
}
Now, there are two separate regular expressions I've tried. First:
final String REGEX = "(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
Didn't match!
Didn't match!
Second:
final String REGEX = "(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
MM/yy - yy
Didn't match!
Now, let's see what the first regex actually matches to:
First regex = (?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)
(?:[^My]*)    any number of non-Ms and non-ys (non-capturing)
([My]+        followed by one or more Ms and ys
[^\\w]*       optionally separated by non-word characters (implying they are also not Ms or ys)
[My]+)        followed by one or more Ms and ys
(?:[^My]*)    finished by any number of non-Ms and non-ys (non-capturing)
What this means is that at least 2 M/ys are required to match the regex, although you should be careful that something like MM-dd or yy-DD will match as well, because they have two M-or-y regions 1 character long. You can avoid getting into trouble here by just keeping a sanity check on your date format string, such as:
if(formatStr.contains("y") && formatStr.contains("M") && m.matches())
{
String yMString = m.group(1);
... // other logic
}
As for the second regex, here's what it means:
Second regex = (?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)
(?:[^My]*)                any number of non-Ms and non-ys (non-capturing)
(                    )    followed by a capturing group containing
(?:[My]+[^\\w]*)+[My]+    at least two text segments consisting of one or more Ms or ys,
                          where the segments are optionally separated by non-word characters ([^\\w]*)
(?:[^My]*)                finished by any number of non-Ms and non-ys (non-capturing)
This regex will match a slightly broader series of strings, but it still requires that any separations between Ms and ys be non-words ([^a-zA-Z_0-9]). Additionally, keep in mind that this regex will still match "yy", "MM", or similar strings like "yyy", "yyyy"..., so it would be useful to have a sanity check as described for the previous regular expression.
Additionally, here's a quick example of how one might use the above to manipulate a single date format string:
LocalDateTime date = LocalDateTime.now();
String dateFormatString = "dd/MM/yyyy H:m:s";
System.out.println("Old Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
Pattern p = Pattern.compile("(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)");
Matcher m = p.matcher(dateFormatString);
if(dateFormatString.contains("y") && dateFormatString.contains("M") && m.matches())
{
dateFormatString = m.group(1);
System.out.println("New Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
}
else
{
throw new IllegalArgumentException("Couldn't shorten date format string!");
}
Output:
Old Format: "dd/MM/yyyy H:m:s" = 14/08/2019 16:55:45
New Format: "MM/yyyy" = 08/2019
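The regex and the sanity check described above can be folded into one small helper. This is a sketch rather than a definitive implementation: the class name MonthYearPattern and the choice to return an Optional are assumptions made for illustration, and it uses the second (broader) regex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper wrapping the second regex from this answer.
final class MonthYearPattern {
    private static final Pattern MONTH_YEAR =
            Pattern.compile("(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)");

    /** Returns the month/year part of a date format string, if there is one. */
    static Optional<String> extract(String format) {
        Matcher m = MONTH_YEAR.matcher(format);
        // Sanity check: require both a month and a year letter, so that
        // formats such as "MM-dd" are rejected even though the regex matches.
        if (format.contains("y") && format.contains("M") && m.matches()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("dd/MM/yyyy HH:mm:ss").orElse("Didn't match!"));
        System.out.println(extract("HH:mm:ss").orElse("Didn't match!"));
    }
}
```

Returning Optional.empty() instead of throwing leaves it to the caller to decide whether a format without a month/year part is an error.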

I'll answer based on my understanding of the question: how do I remove, from a list/array of Strings, the elements that do not exactly follow a month/year pattern such as "MM/yyyy"?
So I'm looking for a function that looks like
public List<String> removeUnWantedDateFormat(List<String> input)
From what I know of date formats, there are only four possibilities you would want (hopefully I haven't missed any): "MM/yyyy", "MMM/yyyy", "MM/yy" and "MMM/yy". Since we know what we are looking for, we can write a simple function.
public List<String> removeUnWantedDateFormat(List<String> input) {
    String s1 = "MM/yyyy";
    String s2 = "MMM/yyyy";
    String s3 = "MM/yy";
    String s4 = "MMM/yy";
    // removeIf avoids the ConcurrentModificationException that calling
    // input.remove() inside a for-each loop would cause
    input.removeIf(format -> !s1.equals(format) && !s2.equals(format)
            && !s3.equals(format) && !s4.equals(format));
    return input;
}
It is better to avoid regex when a plain comparison will do, since regex is comparatively expensive. A great improvement would be to use an enum of the date formats you accept; that way you have better control over them and can even replace them.
Hope this helps, cheers.
Edit: after seeing the comment, I think it would be better to use contains instead of equals, and, instead of removing the element, to replace it with the expected string.
So it would look more like:
public List<String> removeUnWantedDateFormat(List<String> input) {
    // Longer patterns come first so that "MMM/yyyy" is not cut down to "MM/yy"
    List<String> comparaisons = new ArrayList<>();
    comparaisons.add("MMM/yyyy");
    comparaisons.add("MMM/yy");
    comparaisons.add("MM/yyyy");
    comparaisons.add("MM/yy");
    List<String> result = new ArrayList<>();
    for (String format : input) {
        for (String comparaison : comparaisons) {
            if (format.contains(comparaison)) {
                // Keep only the month/year part of the format
                result.add(comparaison);
                break;
            }
        }
    }
    return result;
}

Parse string value from URL

I have a string (which is a URL) in this pattern: https://xxx.kflslfsk.com/kjjfkskfjksf/v1/files/media/93939393hhs8.jpeg
Now I want to clip it to this:
media/93939393hhs8.jpeg
I want to remove all the characters before the second-to-last slash (/).
I'm a newbie in Java, but in Swift (iOS) this is how we do it:
if let url = NSURL(string:"https://xxx.kflslfsk.com/kjjfkskfjksf/v1/files/media/93939393hhs8.jpeg"), pathComponents = url.pathComponents {
let trimmedString = pathComponents.suffix(2).joinWithSeparator("/")
print(trimmedString) // "output = media/93939393hhs8.jpeg"
}
Basically, I'm removing everything from this URL except the last two items, and then joining those two items with /.

String ret = url.substring(url.indexOf("media")); // assumes the URL contains "media"

Are you familiar with regex? Try this regex, which captures the last two items separated by /:
.*?\/([^\/]+?\/[^\/]+?$)
Here is the example in Java (don't forget to escape the backslashes with \\):
Pattern p = Pattern.compile("^.*?\\/([^\\/]+?\\/[^\\/]+?$)");
Matcher m = p.matcher(string);
if (m.find()) {
System.out.println(m.group(1));
}
Alternatively, there is the split(..) function; however, I recommend the approach above. (Finally, concatenate the separated strings with StringBuilder.)
String[] part = string.split("/");
int l = part.length;
StringBuilder sb = new StringBuilder();
String result = sb.append(part[l-2]).append("/").append(part[l-1]).toString();
Both give the same result: media/93939393hhs8.jpeg

String result = url.substring(url.substring(0, url.lastIndexOf('/')).lastIndexOf('/') + 1);
or
use split and join the last two items:
String[] arr = url.split("/");
String result = arr[arr.length - 2] + "/" + arr[arr.length - 1];

public static String parseUrl(String str) {
return (str.lastIndexOf("/") > 0) ? str.substring(1+(str.substring(0,str.lastIndexOf("/")).lastIndexOf("/"))) : str;
}
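For something closer to the Swift pathComponents approach in the question, the URL can be parsed with java.net.URI first, so that any query string or fragment is dropped before splitting. This is a sketch only; the class and method names (UrlTail, lastTwoSegments) are invented for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class UrlTail {
    /** Returns the last two path segments of the URL, joined by "/". */
    static String lastTwoSegments(String url) {
        // URI.create throws IllegalArgumentException for malformed input.
        String path = URI.create(url).getPath();
        if (path == null) {
            return ""; // opaque URIs such as mailto: have no path
        }
        List<String> segments = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        // Fall back to fewer segments when the path is shorter than two.
        int from = Math.max(0, segments.size() - 2);
        return String.join("/", segments.subList(from, segments.size()));
    }

    public static void main(String[] args) {
        System.out.println(lastTwoSegments(
                "https://xxx.kflslfsk.com/kjjfkskfjksf/v1/files/media/93939393hhs8.jpeg"));
    }
}
```

Unlike the plain substring and split variants, this does not pick up the host name when the path has fewer than two segments.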

Encode only specific characters in String

I have to encode only some special characters in a string as numeric values.
Say,
String name = "test $#";
I want to encode only the characters $ and # in the above string. I tried the code below, but it did not work.
String encode = URLEncoder.encode(StringEscapeUtils.escapeJava(name), "UTF-8");
The encoded value will be like, for white space the encoded value is &#160

How about splitting the string (using String#split with a space as the regex)? It returns an array, and the last item contains the symbols you need:
String name = "test $#";
String[] nameSplittedArr = name.split(" ");
String yourChars = nameSplittedArr[nameSplittedArr.length - 1]; // indexes start at zero
That should work.

As per the comments, I think you are after a customized encoding function. Something like:
public static String EncodeString(String text) {
StringBuffer sb = new StringBuffer();
for (char c : text.toCharArray()) {
if (Character.isLetterOrDigit(c)) {
sb.append(c);
} else {
sb.append("&#" + (int)c + ";");
}
}
return sb.toString();
}
An example of this is here.
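The function above encodes every character that is not a letter or digit, spaces included. If only particular characters such as $ and # should be encoded, as the question asks, a variant along these lines might do. It is a sketch: the class name SelectiveEncoder and the charsToEncode parameter are assumptions for illustration.

```java
// Hypothetical helper, not part of any library.
final class SelectiveEncoder {
    /** Replaces only the characters listed in charsToEncode with &#NNN; references. */
    static String encode(String text, String charsToEncode) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (charsToEncode.indexOf(c) >= 0) {
                // Emit the decimal character code, e.g. '$' becomes &#36;
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("test $#", "$#")); // prints test &#36;&#35;
    }
}
```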

Is there a Java function which parses escaped characters?

I'm looking for a built-in Java function which can, for example, convert "\\n" into "\n".
Something like this:
assert parseFunc("\\n").equals("\n");
Or do I have to manually search-and-replace all the escaped characters?

You can use StringEscapeUtils.unescapeJava(s) from Apache Commons Lang. It works for all escape sequences, including Unicode characters (e.g. \u1234).
https://commons.apache.org/lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#unescapeJava-java.lang.String-

Anthony is 99% right -- since backslash is also a reserved character in regular expressions, it needs to be escaped a second time:
result = myString.replaceAll("\\\\n", "\n");

Just use the string's own replaceAll method.
result = myString.replaceAll("\\n", "\n");
However if you want match all escape sequences then you could use a Matcher. See http://www.regular-expressions.info/java.html for a very basic example of using Matcher.
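// The regex \\(.) matches a backslash followed by any character; as a Java literal it needs four backslashes
Pattern p = Pattern.compile("\\\\(.)");
Matcher m = p.matcher("This is tab \\t and \\n this is on a new line");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String s = m.group(1);
    // Compare strings with equals(), not ==
    if (s.equals("n")) { s = "\n"; }
    else if (s.equals("t")) { s = "\t"; }
    // Quote the replacement so '$' and '\' are inserted literally
    m.appendReplacement(sb, Matcher.quoteReplacement(s));
}
m.appendTail(sb);
System.out.println(sb.toString());
You just need to make the assignment to s more sophisticated depending on the number and type of escapes you want to handle. (Warning: this is air code; I'm not a Java developer.)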
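One way to make that assignment "more sophisticated" without regex is a single pass over the characters with a switch. The following is a sketch under the assumption that only \n, \t, \r and \\ need handling; the class name BackslashUnescaper is invented for illustration, and unknown escapes are deliberately left as they are.

```java
// Hypothetical helper, not part of any library.
final class BackslashUnescaper {
    static String unescape(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Copy ordinary characters, and a lone trailing backslash, unchanged.
            if (c != '\\' || i + 1 >= input.length()) {
                sb.append(c);
                continue;
            }
            char next = input.charAt(++i);
            switch (next) {
                case 'n':  sb.append('\n'); break;
                case 't':  sb.append('\t'); break;
                case 'r':  sb.append('\r'); break;
                case '\\': sb.append('\\'); break;
                default:   sb.append('\\').append(next); // unknown escape: keep as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("This is tab \\t and \\n this is on a new line"));
    }
}
```

Further escapes (for example \uXXXX) would need additional cases, which is where a library method such as StringEscapeUtils.unescapeJava becomes the simpler option.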

If you don't want to list all possible escaped characters, you can delegate this to the behaviour of Properties:
String escapedText="This is tab \\t and \\rthis is on a new line";
Properties prop = new Properties();
prop.load(new StringReader("x=" + escapedText + "\n"));
String decoded = prop.getProperty("x");
System.out.println(decoded);
This handles the escape sequences that the Properties format understands (such as \t, \n, \r, \f and \uXXXX).
