How can I find out whether my String contains diacritics? - java

For example:
String text = "Československá obchodní banka";
The string text contains diacritics such as Č and á.
I want to write a function that takes this string, "Československá obchodní banka", and returns true if it contains diacritics and false otherwise.
I have to handle two cases separately: strings that contain diacritics, and strings that contain characters outside the A-Z or a-z range.
1) If the String contains diacritics, I have to do some operation XXXXXX on it.
2) If the String contains characters other than A-Z or a-z but no diacritics, I have to do some other operation YYYYY.
I have no idea how to do it.

One piece of background: Unicode has a single code point for á, but the same character can also be written as an a followed by a combining acute accent.
You can use java.text.Normalizer, as follows:
public static boolean hasDiacritics(String s) {
    // Decompose any á into a plus a combining acute accent.
    String s2 = Normalizer.normalize(s, Normalizer.Form.NFD);
    // (?s) lets . match line terminators, so multi-line input is handled too.
    return s2.matches("(?s).*\\p{InCombiningDiacriticalMarks}.*");
    // Alternative: return !s2.equals(s);
}

The Normalizer class seems to be able to accomplish this. Some limited testing indicates that
Normalizer.isNormalized(text, Normalizer.Form.NFD)
might be what you need (it returns false when the text contains precomposed accented characters such as á).
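The two cases from the question can be told apart with a small helper. This is only a sketch: the class name DiacriticClassifier and the three labels are hypothetical stand-ins for the XXXXXX and YYYYY operations, and it assumes the diacritics of interest are the combining marks in the U+0300..U+036F block. It uses the regex check rather than isNormalized, because a string that is already decomposed (e followed by U+0301) is reported as normalized even though it carries a diacritic.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticClassifier {
    // Combining marks in the U+0300..U+036F block.
    private static final Pattern COMBINING_MARK =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}");
    // Anything that is not a plain ASCII letter.
    private static final Pattern NOT_ASCII_LETTER = Pattern.compile("[^A-Za-z]");

    static boolean hasDiacritics(String s) {
        // Decompose first so precomposed letters expose their combining marks.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return COMBINING_MARK.matcher(decomposed).find();
    }

    // Hypothetical labels; replace the return values with the real operations.
    static String classify(String s) {
        if (hasDiacritics(s)) {
            return "DIACRITICS";
        }
        if (NOT_ASCII_LETTER.matcher(s).find()) {
            return "OTHER_NON_LETTERS";
        }
        return "ASCII_LETTERS_ONLY";
    }

    public static void main(String[] args) {
        // Escapes spell the example bank name from the question.
        System.out.println(classify("\u010Ceskoslovensk\u00E1 obchodn\u00ED banka"));
        System.out.println(classify("CSOB 2024"));
        System.out.println(classify("banka"));
    }
}
```

The diacritics test runs first, so a string with both accents and digits lands in case 1, which matches the "and not contains diacritics" wording of case 2.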

Related

Normalize a string except ñ

I have the following example code:
String n = "Péña";
n = Normalizer.normalize(n, Normalizer.Form.NFC);
How do I normalize the string n except for the ñ?
And not only that string: I'm building a form and I want to keep just the ñ's, with everything else stripped of diacritics.
Replace all occurrences of "ñ" with a non-printable character "\001", so "Péña" becomes "Pé\001a". Then call Normalizer.normalize() to decompose the "é" into "e" and a separate diacritical mark. Finally remove the diacritical marks, and convert the non-printable character back to "ñ".
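String partiallyNormalize(String string)
{
    // Shield ñ behind a placeholder so that decomposition leaves it alone.
    string = string.replace('ñ', '\001');
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Drop the combining marks that NFD split off from the other letters.
    string = string.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    string = string.replace('\001', 'ñ');
    return string;
}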
You might also want to upvote the preferred answer to Easy way to remove UTF-8 accents from a string?, where I learned how to remove the diacritical marks.
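A variant of the same placeholder idea that also keeps the upper-case Ñ and copes with input where ñ arrives as n plus a combining tilde might look like the sketch below. The class and method names are hypothetical, and it assumes the private-use code points U+E000 and U+E001 never occur in the input.

```java
import java.text.Normalizer;

final class EnyePreservingNormalizer {
    // Private-use placeholders, assumed to be absent from real input.
    private static final char LOWER_PLACEHOLDER = '\uE000';
    private static final char UPPER_PLACEHOLDER = '\uE001';

    static String stripAccentsExceptEnye(String s) {
        // Compose first so that n + U+0303 becomes the single character U+00F1.
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        String shielded = composed
                .replace('\u00F1', LOWER_PLACEHOLDER)   // small n with tilde
                .replace('\u00D1', UPPER_PLACEHOLDER);  // capital N with tilde
        // Decompose the remaining accented letters and drop their marks.
        String decomposed = Normalizer.normalize(shielded, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Put the shielded letters back.
        return stripped
                .replace(LOWER_PLACEHOLDER, '\u00F1')
                .replace(UPPER_PLACEHOLDER, '\u00D1');
    }
}
```

Private-use characters have no decomposition, so they pass through NFD untouched, which is what makes them usable as placeholders here.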

Check that string contains non-latin letters

I have the following method to check that a string contains only Latin symbols.
private boolean containsNonLatin(String val) {
    return val.matches("\\w+");
}
But it returns false if I pass the string "my string", because it contains a space.
I need a method that returns false if the string contains letters that are not in the Latin alphabet, and true in all other cases.
Please help me improve my method.
examples of valid strings:
w123.
w, 12
w#123
dsf%&#
You can use the \p{IsLatin} class:
return !(val.matches("[\\p{Punct}\\p{Space}\\p{IsLatin}]+$"));
Java Regex Reference
I need something like "not \p{IsLatin}".
If you need to match all letters but Latin ASCII letters, you can use
"[\\p{L}\\p{M}&&[^\\p{Alpha}]]+"
The \p{Alpha} POSIX class matches [A-Za-z]. The \p{L} matches any Unicode base letter, \p{M} matches diacritics. When we add &&[^\p{Alpha}] we subtract these [A-Za-z] from all the Unicode letters.
The whole expression means match one or more Unicode letters other than ASCII letters.
To add a space, just add \s:
"[\\s\\p{L}\\p{M}&&[^\\p{Alpha}]]+"
See IDEONE demo:
List<String> strs = Arrays.asList("w123.", "w, 12", "w#123", "dsf%&#", "Двв");
for (String str : strs) {
    System.out.println(!str.matches("[\\s\\p{L}\\p{M}&&[^\\p{Alpha}]]+")); // => 4 true, 1 false
}
Just add a space to your matcher:
private boolean isLatin(String val) {
    return val.matches("[ \\w]+");
}
Use this:
public static boolean isNoAlphaNumeric(String s) {
    return s.matches("[\\p{L}\\s]+");
}
\p{L} matches any Unicode letter.
\s matches a whitespace character.
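For the requirement as stated in the question (false only when some letter is outside the Latin alphabet, true otherwise), a regex-free sketch is possible with Character.UnicodeScript. The class and method names are hypothetical, and it assumes "Latin" means the Unicode Latin script, so accented letters such as é count as Latin.

```java
final class LatinCheck {
    // True when every letter in the string belongs to the Latin script;
    // digits, punctuation and whitespace are ignored.
    static boolean hasOnlyLatinLetters(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }

    public static void main(String[] args) {
        System.out.println(hasOnlyLatinLetters("w, 12"));
        System.out.println(hasOnlyLatinLetters("\u0414\u0432\u0432")); // Cyrillic letters
    }
}
```

Working on code points rather than chars keeps the check correct for letters outside the Basic Multilingual Plane.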

match all characters in a string independent of their order in the sequence

I want to match a certain group of characters in a String, independent of their order in the String, using a regex function. The only requirement is that they must all be present.
I have tried
String elD = "15672";
String t = "12";
if (elD.matches(".*[" + t + "].*")) {
    System.out.println(elD);
}
This one checks whether any of the characters is present, but I want all of them to be there.
I also tried
String elD = "15672";
String t = "12";
if (elD.matches(".*(" + t + ").*")) {
    System.out.println(elD);
}
This does not work either. I have searched for quite a while but could not find an example where all of the characters from the pattern must be present in the String, independent of their order.
Thanks
You can write a regex for this, but it would not look nice. To check whether your string contains both x and y anywhere, you need one look-ahead per character, like
^(?=.*x)(?=.*y).*$
and use it like
yourString.matches(regex);
But this way you would need to write your own method that generates the regex dynamically, adding (?=.*X) for each character you want to check. You would also need to make sure that the character is not a regex metacharacter such as ? or +.
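Such a generator could be sketched as follows. The names UnorderedRegex and buildRegex are hypothetical; Pattern.quote takes care of metacharacters, and the (?s) flag is an added assumption so that the input may contain line breaks.

```java
import java.util.regex.Pattern;

final class UnorderedRegex {
    // Builds (?s)^(?=.*a)(?=.*b)...*$ with one look-ahead per required character.
    static String buildRegex(String required) {
        StringBuilder regex = new StringBuilder("(?s)^");
        for (char c : required.toCharArray()) {
            // Pattern.quote makes characters such as ? or + match literally.
            regex.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(')');
        }
        return regex.append(".*$").toString();
    }

    public static void main(String[] args) {
        System.out.println("15672".matches(buildRegex("12")));
        System.out.println("15673".matches(buildRegex("12")));
    }
}
```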
A simpler and no less effective solution would be to write your own method that checks whether your string contains all of the searched characters, something like
public static boolean containsUnordered(String input, String searchFor) {
    char[] characters = searchFor.toCharArray();
    for (char c : characters) {
        if (!input.contains(String.valueOf(c))) {
            return false;
        }
    }
    return true;
}
You can build a pattern from the search string using the replaceAll method:
String s = "12";
String pattern = s.replaceAll("(.)", "(?=[^$1]*$1)");
The pattern consists only of look-aheads, so append .* before using it with matches().
Note: You can't test for the same character several times (i.e. 112 gives (?=[^1]*1)(?=[^1]*1)(?=[^2]*2), which is exactly the same as (?=[^1]*1)(?=[^2]*2)).
But in my opinion Pshemo's method is probably more efficient.
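If repeated characters do matter (so that 112 requires two 1s), neither the look-ahead pattern nor a plain contains() loop is enough. One possible approach, sketched here with hypothetical names, is to count how many of each character are still outstanding.

```java
import java.util.HashMap;
import java.util.Map;

final class UnorderedMatch {
    // True when input contains every character of searchFor at least as
    // many times as it occurs in searchFor, in any order.
    static boolean containsWithCounts(String input, String searchFor) {
        Map<Character, Integer> needed = new HashMap<>();
        for (char c : searchFor.toCharArray()) {
            needed.merge(c, 1, Integer::sum);
        }
        for (char c : input.toCharArray()) {
            // Decrement the outstanding count; returning null removes the entry.
            needed.computeIfPresent(c, (key, count) -> count == 1 ? null : count - 1);
        }
        return needed.isEmpty();
    }
}
```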

How to get an alphanumeric String from any string in Java? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n or Remove diacritical marks from unicode chars
How to replace special characters in a string?
I would like to format a String such as "I>Télé" into something like "itele".
The idea is that I want my String to be lower case (done), without whitespace (done), and without accents or special characters (like >, <, /, %, ~, é, #, ï etc).
It is okay to delete occurrences of special characters, but I want to keep letters while removing their accents (as in my example). Here is what I did, but I don't think a good solution is to replace every é, è, ê, ë with "e", then do it again for "i", "a" etc., and then remove every special character...
String name = "I>télé"; // example
String result = name.toLowerCase().replace(" ", "").replace("é","e").........;
The purpose is to produce a valid filename for resources in an Android app, so if you have any other idea, I'll take it!
You can use the java.text.Normalizer class to convert your text into normal Latin characters followed by diacritic marks (accents), where possible. So for example, the single-character string "é" would become the two character string ['e', {COMBINING ACUTE ACCENT}].
After you've done this, your String would be a combination of unaccented characters, accent modifiers, and the other special characters you've mentioned. At this point you could filter the characters in your string using only a whitelist to keep what you want (which could be as simple as [A-Za-z0-9] for a regex, depending on what you're after).
An approach might look like:
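String name = "I>télé"; // example
// Split each accented letter into a base letter plus combining marks.
String normalized = Normalizer.normalize(name, Normalizer.Form.NFD);
// Keep only ASCII letters and digits; this drops the marks and the '>'.
String result = normalized.replaceAll("[^A-Za-z0-9]", "");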
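Combined with the lower-casing the question already does, the whole transformation fits in one helper. This is a sketch with hypothetical names; Locale.ROOT is an added assumption to keep the lower-casing independent of the device locale.

```java
import java.text.Normalizer;
import java.util.Locale;

final class ResourceNames {
    // Decomposes accents, lower-cases, then keeps only a-z and 0-9.
    static String toResourceName(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        return decomposed.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        // Escapes spell the example string from the question.
        System.out.println(toResourceName("I>T\u00E9l\u00E9")); // prints itele
    }
}
```

Letters without a canonical decomposition, such as ø or ß, are removed entirely by the whitelist rather than mapped to an ASCII letter.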
You can do something like
String res = "";
for (char c : name.toCharArray()) {
    if (Character.isLetter(c) || Character.isDigit(c)) {
        res += c;
    }
}
// Normalize using the method below
http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-6/
public static String stripAccents(String s) {
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    return s;
}
Try using ASCII codes. Maybe this link will help.

How to replace special characters in a string?

I have a string with lots of special characters. I want to remove all those, but keep alphabetical characters.
How can I do this?
That depends on what you mean. If you just want to get rid of them, do this:
(Update: Apparently you want to keep digits as well; use the second line in that case)
String alphaOnly = input.replaceAll("[^a-zA-Z]+","");
String alphaAndDigits = input.replaceAll("[^a-zA-Z0-9]+","");
or the equivalent:
String alphaOnly = input.replaceAll("[^\\p{Alpha}]+","");
String alphaAndDigits = input.replaceAll("[^\\p{Alpha}\\p{Digit}]+","");
(All of these can be significantly improved by precompiling the regex pattern and storing it in a constant)
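Precompiling could look like the sketch below; the class and constant names are hypothetical. String.replaceAll compiles its regex on every call, whereas a static final Pattern is compiled once and reused.

```java
import java.util.regex.Pattern;

final class Sanitizer {
    // Compiled once, reused for every call.
    private static final Pattern NOT_ALPHA = Pattern.compile("[^a-zA-Z]+");
    private static final Pattern NOT_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    static String alphaOnly(String input) {
        return NOT_ALPHA.matcher(input).replaceAll("");
    }

    static String alphaAndDigits(String input) {
        return NOT_ALNUM.matcher(input).replaceAll("");
    }
}
```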
Or, with Guava:
private static final CharMatcher ALNUM =
    CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z'))
        .or(CharMatcher.inRange('0', '9')).precomputed();
// ...
String alphaAndDigits = ALNUM.retainFrom(input);
But if you want to turn accented characters into something sensible that's still ascii, look at these questions:
Converting Java String to ASCII
Java change áéőűú to aeouu
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n or Remove diacritical marks from unicode chars
I am using this:
s = s.replaceAll("\\W", "");
It removes all special characters from the string (underscores are kept, because _ counts as a word character).
Here
\w : A word character, short for [a-zA-Z_0-9]
\W : A non-word character
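Because \w is ASCII-only by default, \W also removes accented letters. The sketch below (hypothetical class and method names) contrasts the default behaviour with the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w and \W follow Unicode properties.

```java
import java.util.regex.Pattern;

final class WordCharDemo {
    // With this flag, \W treats Unicode letters and digits as word characters.
    private static final Pattern UNICODE_NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    // Default mode: everything outside [a-zA-Z_0-9] is removed.
    static String stripAscii(String s) {
        return s.replaceAll("\\W", "");
    }

    // Unicode mode: accented letters survive.
    static String stripUnicode(String s) {
        return UNICODE_NON_WORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "caf\u00E9_1!";               // "cafe" with an acute accent
        System.out.println(stripAscii(sample));        // prints caf_1
        System.out.println(stripUnicode(sample).length()); // prints 6
    }
}
```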
You can use the following method to keep alphanumeric characters.
replaceAll("[^a-zA-Z0-9]", "");
And if you want to keep only alphabetical characters use this
replaceAll("[^a-zA-Z]", "");
Replace any special character with
replaceAll("\\your special character", "new character");
For example, to remove every occurrence of *:
replaceAll("\\*", "");
Note: this statement can only replace one type of special character at a time.
Following the example of Andrzej Doyle's answer, I think the better solution is to use org.apache.commons.lang3.StringUtils.stripAccents():
package bla.bla.utility;
import org.apache.commons.lang3.StringUtils;
public class UriUtility {
    public static String normalizeUri(String s) {
        // Strip accents, turn spaces into underscores, then drop everything
        // except dots, ASCII letters, digits and underscores.
        String r = StringUtils.stripAccents(s);
        r = r.replace(" ", "_");
        r = r.replaceAll("[^\\.A-Za-z0-9_]", "");
        return r;
    }
}
string Output = Regex.Replace(Input, @"[^ a-zA-Z0-9&,_]", "");
Here all the special characters except space, comma, and ampersand are removed (this example is C#). You can also remove commas and ampersands by using the following regular expression.
string Output = Regex.Replace(Input, @"[^ a-zA-Z0-9_]", "");
Where Input is the string in which we need to replace the characters.
Here is a function I used to remove all possible special characters from the string (JavaScript):
const cleaned = name.replace(/[&\/\\#,+()$~%!.„'":*‚^_¤?<>|#ª{«»§}©®™ ]/g, '').toLowerCase();
You can use basic regular expressions on strings to find all special characters or use pattern and matcher classes to search/modify/delete user defined strings. This link has some simple and easy to understand examples for regular expressions: http://www.vogella.de/articles/JavaRegularExpressions/article.html
You can get the Unicode code point for the junk character from the Character Map tool on a Windows PC and prefix it with \u, e.g. \u00a9 for the copyright symbol.
You can then use the string with that particular junk character: don't remove the junk character, but replace it with the proper Unicode escape.
To keep spaces as well, use the pattern "[^a-z A-Z 0-9]".
