Pattern matching for Japanese strings has issues in Java - java

I have a strange issue when pattern matching Japanese characters in Java.
Let me explain with code.
private static final Pattern ADDRESS_STRING_PATTERN =
Pattern.compile("^[\\p{L}\\d\\s\\p{Punct}]{1,200}$");
private static boolean isValidInput(final String input, Pattern pattern) {
return pattern.matcher(input).matches();
}
System.out.println(isValidInput("こんにちは、元気ですか", ADDRESS_STRING_PATTERN));
Here I am matching any letter, space, digit or punctuation character, 1 to 200 times.
This always returns false. After some debugging I found that the issue is with one character, "、". If I add that character to the regular expression, it works fine.
Has anyone come across this issue? Or is this a bug in Java?

The thing is that 、 (U+3001 IDEOGRAPHIC COMMA) belongs to the "Punctuation, other" Unicode category, and \\p{Punct} only matches ASCII punctuation by default. If you use the Pattern.UNICODE_CHARACTER_CLASS option or the (?U) embedded flag, it will match (i.e. the pattern might look like "(?U)^[\\p{L}\\d\\s\\p{Punct}]{1,200}$"). However, this also changes \d and \s, and I am not sure you want to match all Unicode digits and whitespace.
An alternative is to use \p{P}\p{S} inside the character class (to match Unicode punctuation and symbols) instead of \p{Punct} (in ASCII mode, that POSIX class covers characters from both the punctuation and symbol categories).
See a Java demo printing true:
private static final Pattern ADDRESS_STRING_PATTERN = Pattern.compile("^[\\p{L}\\d\\s\\p{P}\\p{S}]{1,200}$");
private static boolean isValidInput(final String input, Pattern pattern) {
return pattern.matcher(input).matches();
}
public static void main (String[] args) throws java.lang.Exception
{
System.out.println(isValidInput("こんにちは、元気ですか",ADDRESS_STRING_PATTERN));
}
// => true
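To see the difference between the three options side by side, here is a minimal sketch. The class and field names are made up for illustration, and the characters are written as Unicode escapes so the file encoding does not matter.

```java
import java.util.regex.Pattern;

class PunctuationClassDemo {
    // Default behaviour: the POSIX class only covers ASCII punctuation.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");
    // Same class with UNICODE_CHARACTER_CLASS switched on.
    static final Pattern UNICODE_PUNCT =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
    // Unicode punctuation or symbol categories, no flag needed.
    static final Pattern P_OR_S = Pattern.compile("[\\p{P}\\p{S}]");
    // The digit class with and without the flag, to show the side effect.
    static final Pattern ASCII_DIGIT = Pattern.compile("\\d");
    static final Pattern UNICODE_DIGIT =
            Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String ideographicComma = "\u3001"; // IDEOGRAPHIC COMMA
        String fullwidthThree = "\uFF13";   // FULLWIDTH DIGIT THREE

        System.out.println(matches(ASCII_PUNCT, ideographicComma));   // false
        System.out.println(matches(UNICODE_PUNCT, ideographicComma)); // true
        System.out.println(matches(P_OR_S, ideographicComma));        // true
        System.out.println(matches(ASCII_DIGIT, fullwidthThree));     // false
        System.out.println(matches(UNICODE_DIGIT, fullwidthThree));   // true
    }
}
```

The last two lines show the side effect mentioned above: with the flag on, \d starts accepting non-ASCII digits such as the fullwidth three.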

Related

Java regular expression for French names

I need to modify a regular expression to allow all standard characters, French characters, spaces AND dashes (hyphens), but only one dash at a time.
What I have right now is:
import java.util.regex.Pattern;
public class FrenchRegEx {
static final String NAME_PATTERN = "[\u00C0-\u017Fa-zA-Z-' ]+";
public static void main(String[] args) {
String name;
//name = "Jean Luc"; // allowed
//name = "Jean-Luc"; // allowed
//name = "Jean-Luc-Marie"; // allowed
name = "Jean--Luc"; // NOT allowed
if (!Pattern.matches(NAME_PATTERN, name)) {
System.out.println("ERROR!");
} else System.out.println("OK!");
}
}
but it allows 'Jean--Luc' as a name, which should not be allowed.
Any help with this?
Thanks.
So, you want a pattern with zero or more hyphens, each separated from the next by one or more other characters. It's just a matter of writing the pattern that way:
"[\u00C0-\u017Fa-zA-Z']+([- ][\u00C0-\u017Fa-zA-Z']+)*"
This also assumes you don't want names to start or end with a hyphen or space, that you don't want more than one space in a row, and that you don't want a space to follow or precede a hyphen.
You need to disallow consecutive hyphens. You may do it with a negative lookahead, (?!.*--), placed at the start of the pattern:
static final String NAME_PATTERN = "(?!.*--)[\u00C0-\u017Fa-zA-Z-' ]+";
To disallow any of the special chars to be consecutive, use
static final String NAME_PATTERN = "(?!.*([-' ])\\1)[\u00C0-\u017Fa-zA-Z-' ]+";
Another way is to unroll the pattern a bit to match strings where the special char(s) can appear in between letters, but cannot appear consecutively (i.e. if you need to match Abc-def'here like strings):
static final String NAME_PATTERN = "[\u00C0-\u017Fa-zA-Z]+(?:[-' ][\u00C0-\u017Fa-zA-Z]+)*";
or to only allow 1 special char that can only appear in between letters (i.e. if you need to only allow strings like abc-def, or abc'def):
static final String NAME_PATTERN = "[\u00C0-\u017Fa-zA-Z]+(?:[-' ][\u00C0-\u017Fa-zA-Z]+)?";
Note that you do not need anchors here because you are using the pattern inside a .matches() method that requires a full string match.
NOTE: you may further tune the patterns by moving special chars that may appear anywhere in the string from the [-' ] character class to the [\u00C0-\u017Fa-zA-Z] character classes, like [\u00C0-\u017Fa-zA-Z'], but watch out for -. It should be placed at the end, near ].
Try using ([\u00C0-\u017Fa-zA-Z']+[- ]?)+. This matches one or more names, each followed by at most one dash or space, so it also accepts a trailing dash or space.
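A quick way to compare the suggestions is to run them against the names from the question. This is only a sketch: the class and constant names are invented here, and "Jean-" is an extra test input added to show how the patterns treat a trailing hyphen.

```java
import java.util.regex.Pattern;

class FrenchNamePatterns {
    // Unrolled form: runs of name characters joined by single separators.
    static final String UNROLLED =
            "[\u00C0-\u017Fa-zA-Z']+(?:[- ][\u00C0-\u017Fa-zA-Z']+)*";
    // Lookahead form: original character class, but no "--" anywhere.
    static final String LOOKAHEAD = "(?!.*--)[\u00C0-\u017Fa-zA-Z-' ]+";
    // Optional-separator form: each name may be followed by one separator.
    static final String OPTIONAL_SEPARATOR = "([\u00C0-\u017Fa-zA-Z']+[- ]?)+";

    static boolean ok(String pattern, String name) {
        return Pattern.matches(pattern, name);
    }

    public static void main(String[] args) {
        String[] names = {"Jean Luc", "Jean-Luc", "Jean-Luc-Marie", "Jean--Luc", "Jean-"};
        for (String name : names) {
            System.out.println(name
                    + " unrolled=" + ok(UNROLLED, name)
                    + " lookahead=" + ok(LOOKAHEAD, name)
                    + " optional=" + ok(OPTIONAL_SEPARATOR, name));
        }
    }
}
```

All three reject "Jean--Luc", but only the unrolled form rejects "Jean-", so it is the one to pick if names must not end with a separator.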

Java split regex non-greedy match not working

Why is the non-greedy match not working for me? Take the following example:
public String nonGreedy(){
String str2 = "abc|s:0:\"gef\";s:2:\"ced\"";
return str2.split(":.*?ced")[0];
}
In my eyes the result should be: abc|s:0:\"gef\";s:2 but it is: abc|s
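The .*? in your regex matches any character except \n (0 or more times, matching the least amount possible), and that includes colons, so starting at the first colon it can still run across the later colons until it reaches ced.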
You can try the regular expression:
:[^:]*?ced
On another note, you should use a constant Pattern to avoid recompiling the expression every time, something like:
private static final Pattern REGEX_PATTERN =
Pattern.compile(":[^:]*?ced");
public static void main(String[] args) {
String input = "abc|s:0:\"gef\";s:2:\"ced\"";
System.out.println(java.util.Arrays.toString(
REGEX_PATTERN.split(input)
)); // prints "[abc|s:0:"gef";s:2, "]"
}
It is behaving as expected. The non-greedy match will match as little as it has to, and with your input, the minimum characters to match is the first colon to the next ced.
You could try limiting the number of characters consumed. For example, to limit the term to "up to 3 characters":
:.{0,3}ced
To make it split as close to ced as possible, use a negative look-ahead, with this regex:
:(?!.*:.*ced).*ced
This makes sure there isn't a closer colon to ced.
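For comparison, here is a small sketch that runs each of the suggested expressions against the input from the question and prints the first piece of the split. The class and method names are just for illustration.

```java
import java.util.regex.Pattern;

class SplitLeftmostDemo {
    static final String INPUT = "abc|s:0:\"gef\";s:2:\"ced\"";

    // Returns the part of INPUT before the first match of the given regex.
    static String firstPart(String regex) {
        return Pattern.compile(regex).split(INPUT)[0];
    }

    public static void main(String[] args) {
        // Lazy dot: the match still starts at the first colon.
        System.out.println(firstPart(":.*?ced"));            // abc|s
        // No colon allowed between the colon and ced.
        System.out.println(firstPart(":[^:]*?ced"));         // abc|s:0:"gef";s:2
        // At most three characters before ced: here that still reaches
        // back to the third colon, because 2:" is exactly three characters.
        System.out.println(firstPart(":.{0,3}ced"));         // abc|s:0:"gef";s
        // Negative lookahead: only the colon closest to ced can match.
        System.out.println(firstPart(":(?!.*:.*ced).*ced")); // abc|s:0:"gef";s:2
    }
}
```

With this particular input, only the [^:] and lookahead variants give the result the question asks for; the length limit depends on how far apart the colons happen to be.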

Check that all lines match regex pattern in Java

How can I check that all lines match a regex pattern in Java?
I mean, I am able to split the lines myself in a while loop, but is there any library or standard API that implements this functionality?
UPDATE: This is the Ruby solution:
if text =~ /PATTERN/
Here's a utility method using Guava that returns true if every line in the supplied text matches the supplied pattern:
public static boolean matchEachLine(String text, Pattern pattern){
return FluentIterable.from(Splitter.on('\n').split(text))
.filter(Predicates.not(Predicates.contains(pattern)))
.isEmpty();
}
This is the one I use:
public static boolean multilineMatches(final String regex, final String text) {
final Matcher m = Pattern.compile("^(.*)$", Pattern.MULTILINE).matcher(text);
final Pattern p = Pattern.compile(regex);
while(m.find()) {
if (!p.matcher(m.group()).find()) {
return false;
}
}
return true;
}
There is no standard API functionality I know of to do this, however, something like this is easy enough:
string.matches("(What you want to match(\r?\n|$))*+")
Usage:
String string = "This is a string\nThis is a string\nThis is a string";
System.out.println(string.matches("(This is a string(\r?\n|$))*+"));
\r?\n covers the most common new-lines.
$ is end of string.
(\r?\n|$) is a new-line or the end of string.
*+ is zero or more - but this is a possessive quantifier.
So the whole thing basically checks that every line matches This is a string.
If you want it in a function:
boolean allLinesMatch(String string, String regex)
{
return string.matches("(" + regex + "(\r?\n|$))*+");
}
Java regex reference.
Prime example of why you need a possessive quantifier:
If you take the string This is a string. repeated a few times (34 times to be exact) but have the last string be This is a string.s (won't match the regex) and have What you want to match be .* .* .*\\., you end up waiting quite a while with *.
* example - runtime on my machine - more than a few hours, after which I stopped it.
*+ example - runtime on my machine - much less than a second.
See Catastrophic Backtracking for more information.
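Two variations on the approaches above, as a rough sketch. The first keeps the possessive-quantifier idea but wraps the caller's regex in a non-capturing group; that wrapper is my addition, so that an alternation such as cat|dog stays grouped (without it, the line-break part would only attach to the last alternative). The second is a different technique: it splits the text with Pattern.splitAsStream (Java 8+) and checks each line separately. All names are hypothetical.

```java
import java.util.regex.Pattern;

class AllLinesMatch {
    private static final Pattern LINE_BREAK = Pattern.compile("\r?\n");

    // Possessive-quantifier variant; (?:...) keeps alternations in regex grouped.
    static boolean allLinesMatchRegex(String text, String regex) {
        return text.matches("(?:(?:" + regex + ")(?:\r?\n|$))*+");
    }

    // Stream variant: split into lines, then require a full match on each line.
    static boolean allLinesMatchStream(String text, Pattern linePattern) {
        return LINE_BREAK.splitAsStream(text)
                .allMatch(line -> linePattern.matcher(line).matches());
    }

    public static void main(String[] args) {
        Pattern animal = Pattern.compile("cat|dog");
        System.out.println(allLinesMatchRegex("cat\ndog\ncat", "cat|dog")); // true
        System.out.println(allLinesMatchRegex("cat\nbird", "cat|dog"));     // false
        System.out.println(allLinesMatchStream("cat\r\ndog", animal));      // true
        System.out.println(allLinesMatchStream("cat\nbird", animal));       // false
    }
}
```

The stream variant matches each line on its own, so a slow pattern can only backtrack within a single line rather than across the whole text.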

How to remove special characters from a string?

I want to remove special characters like:
- + ^ . : ,
from a String using Java.
That depends on what you define as special characters, but try replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since you'd then either have to escape it or it would mean "any but these characters".
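Another note: the - character needs to be the first or last one in the list, otherwise you'd have to escape it or it would define a range (e.g. :-, would be read as "all characters in the range : to ,", which Java rejects as an illegal character range because , comes before : in code-point order).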
So, in order to keep consistency and not depend on character positioning, you might want to escape all those characters that have a special meaning in regular expressions (the following list is not complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex: [\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
This means: replace everything that is not a word character (a-z in any case, 0-9 or _) or whitespace.
Edit: please note that there are a couple of other patterns that might prove helpful. However, I can't explain them all, so have a look at the reference section of regular-expressions.info.
Here's a less restrictive alternative to the "define allowed characters" approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and not a separator (whitespace, linebreak etc.). Note that you can't use [\P{L}\P{Z}] (upper case P means not having that property), since that would mean "everything that is not a letter or not whitespace", which almost matches everything, since letters are not whitespace and vice versa.
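To make the differences between these variants concrete, here is a small sketch that applies each of them to a sample string. The class, method names and sample inputs are invented for illustration, and the punctuation/symbol properties are wrapped in a character class so that each one matches on its own.

```java
class RemoveSpecialChars {
    // Removes only the characters listed in the question.
    static String removeListed(String s) {
        return s.replaceAll("[-+.^:,]", "");
    }

    // Removes every Unicode punctuation or symbol character.
    static String removePunctAndSymbols(String s) {
        return s.replaceAll("[\\p{P}\\p{S}]", "");
    }

    // Keeps ASCII word characters and whitespace only.
    static String keepWordAndSpace(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    // Keeps letters of any language and separators only.
    static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    public static void main(String[] args) {
        System.out.println(removeListed("a-b+c^d.e:f,g"));                // abcdefg
        System.out.println(removePunctAndSymbols("price: $5 (approx.)")); // price 5 approx
        // \u00E9 is e with an acute accent; \w does not cover it by default.
        System.out.println(keepWordAndSpace("caf\u00E9, 42!"));           // caf 42
        System.out.println(keepLettersAndSeparators("caf\u00E9, 42!"));
    }
}
```

The last call keeps the accented letter and the space but drops the digits, so the result is "café" followed by a single space.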
Additional information on Unicode
Some Unicode characters seem to cause problems due to the different possible ways to encode them (as a single code point or as a combination of code points). Please refer to regular-expressions.info for more information.
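As an illustration of that problem, the sketch below feeds the same visible character to the letters-and-separators regex in two encodings: once as a single code point and once as a base letter plus a combining accent. Normalizing to NFC with java.text.Normalizer first is one possible workaround, not the only one; the class and method names are made up.

```java
import java.text.Normalizer;

class CombiningMarksDemo {
    // Keeps letters and separators; a combining accent is a mark, not a letter.
    static String lettersOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    // Composes base letter + accent into one code point before filtering.
    static String lettersOnlyNfc(String s) {
        return lettersOnly(Normalizer.normalize(s, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // e with acute as one code point
        String decomposed = "e\u0301"; // plain e followed by COMBINING ACUTE ACCENT

        System.out.println(lettersOnly(precomposed).length());   // 1, accent kept
        System.out.println(lettersOnly(decomposed));             // e, accent stripped
        System.out.println(lettersOnlyNfc(decomposed).length()); // 1, accent kept
    }
}
```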
This will remove all characters except alphanumeric ones:
replaceAll("[^A-Za-z0-9]","");
As described here
http://developer.android.com/reference/java/util/regex/Pattern.html
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.
public class RegularExpressionTest {
public static void main(String[] args) {
System.out.println("String is = "+getOnlyStrings("!&(*^*(^(+one(&(^()(*)(*&^%$##!#$%^&*()("));
System.out.println("Number is = "+getOnlyDigits("&(*^*(^(+91-&*9hi-639-0097(&(^("));
}
public static String getOnlyDigits(String s) {
Pattern pattern = Pattern.compile("[^0-9]");
Matcher matcher = pattern.matcher(s);
String number = matcher.replaceAll("");
return number;
}
public static String getOnlyStrings(String s) {
Pattern pattern = Pattern.compile("[^a-z A-Z]");
Matcher matcher = pattern.matcher(s);
String letters = matcher.replaceAll("");
return letters;
}
}
Result
String is = one
Number is = 9196390097
Try replaceAll() method of the String class.
BTW here is the method, return type and parameters.
public String replaceAll(String regex,
String replacement)
Example:
String str = "Hello +-^ my + - friends ^ ^^-- ^^^ +!";
str = str.replaceAll("[-+^]+", "");
It should remove all the {'^', '+', '-'} chars that you wanted to remove!
To remove special characters:
String t2 = "!##$%^&*()-';,./?><+abdd";
t2 = t2.replaceAll("\\W+","");
Output will be : abdd.
This works perfectly.
Use the String.replaceAll() method in Java.
replaceAll should be good enough for your problem.
You can remove a single character as follows (the + has to be escaped because it is a regex quantifier):
String str="+919595354336";
String result = str.replaceAll("\\+","");
System.out.println(result);
OUTPUT:
919595354336
If you just want to do a literal replace in Java, use Pattern.quote(string) to escape any string to a literal.
myString.replaceAll(Pattern.quote(matchingStr), replacementStr)
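A minimal sketch of the literal-replace idea. Pattern.quote only protects the search string; the replacement string has its own special characters ($ and \), so this example also passes it through Matcher.quoteReplacement. That second step, and the helper name replaceLiteral, are my additions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplaceDemo {
    // Treats both the target and the replacement as plain text.
    static String replaceLiteral(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target),
                Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // "+" alone is not a valid regex, but quoted it is just a character.
        System.out.println(replaceLiteral("1+1=2", "+", " plus ")); // 1 plus 1=2
        // "$5" would otherwise be read as a reference to group 5.
        System.out.println(replaceLiteral("cost", "cost", "$5"));   // $5
        // String.replace does a literal replace without any regex at all.
        System.out.println("1+1=2".replace("+", ""));               // 11=2
    }
}
```

When no regex features are needed at all, String.replace(CharSequence, CharSequence) is usually the simpler choice.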

Check String whether it contains only Latin characters?

Greetings,
I am developing a GWT application where users can enter their details in Japanese.
But the 'userid' and 'password' should only contain English characters (Latin alphabet).
How can I validate Strings for this?
You can use String#matches() with a bit of regex for this. Latin characters are covered by \w.
So this should do:
boolean valid = input.matches("\\w+");
This by the way also covers numbers and the underscore _. Not sure if that harms. Else you can just use [A-Za-z]+ instead.
If you want to cover diacritical characters as well (ä, é, ò, and so on, those are per definition also Latin characters), then you need to normalize them first and get rid of the diacritical marks before matching, simply because there's no (documented) regex which covers diacriticals.
String clean = Normalizer.normalize(input, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
boolean valid = clean.matches("\\w+");
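Update: Java's regex engine also supports \p{L}, which covers letters with diacriticals as well. Note, however, that \p{L} matches letters of every script, Japanese included, so on its own it does not restrict the input to the Latin alphabet.
boolean valid = input.matches("\\p{L}+");
The above works on Java 1.6.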
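To check how these classes behave on Japanese input, here is a small comparison sketch. It also tries \p{IsLatin}, the Unicode script class available since Java 7, which is not mentioned above and is my suggestion. The class and method names are hypothetical, and the sample strings use Unicode escapes to avoid source-encoding issues.

```java
class LatinOnlyCheck {
    // ASCII letters, digits and underscore.
    static boolean asciiWord(String s) {
        return s.matches("\\w+");
    }

    // Letters of any script.
    static boolean anyLetter(String s) {
        return s.matches("\\p{L}+");
    }

    // Characters belonging to the Latin script (Java 7+).
    static boolean latinScript(String s) {
        return s.matches("\\p{IsLatin}+");
    }

    public static void main(String[] args) {
        String plain = "Jose";
        String accented = "Jos\u00E9";                      // with e acute
        String japanese = "\u3053\u3093\u306B\u3061\u306F"; // hiragana greeting

        for (String s : new String[] {plain, accented, japanese}) {
            System.out.println(asciiWord(s) + " " + anyLetter(s) + " " + latinScript(s));
        }
        // Prints:
        // true true true
        // false true true
        // false true false
    }
}
```

Only \w and \p{IsLatin} reject the Japanese string; \p{L} accepts it.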
public static boolean isValidISOLatin1 (String s) {
return Charset.forName("US-ASCII").newEncoder().canEncode(s);
} // or "ISO-8859-1" for ISO Latin 1
For reference, see the documentation on Charset.
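A short sketch of the encoder approach with both charsets, using the StandardCharsets constants (Java 7+) instead of Charset.forName. The method names are made up. A fresh encoder is created on each call because CharsetEncoder instances are not safe to share between threads.

```java
import java.nio.charset.StandardCharsets;

class CharsetCheck {
    // True if every character fits into 7-bit US-ASCII.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    // True if every character fits into ISO-8859-1 (ISO Latin 1).
    static boolean isLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("password123!")); // true
        System.out.println(isAscii("Jos\u00E9"));    // false, e acute is not ASCII
        System.out.println(isLatin1("Jos\u00E9"));   // true
        System.out.println(isLatin1("\u3053\u3093\u306B\u3061\u306F")); // false
    }
}
```

Keep in mind that this check accepts anything the charset can represent, including digits, punctuation and spaces, not just letters.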
Here is my solution, and it works well:
public static boolean isStringContainsLatinCharactersOnly(final String iStringToCheck)
{
return iStringToCheck.matches("^[a-zA-Z0-9.]+$");
}
There might be a better approach, but you could load a collection with whatever you deem to be acceptable characters, and then check each character in the username/password field against that collection.
Pseudo:
foreach (character in username)
{
if !allowedCharacters.contains(character)
{
throw exception
}
}
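In Java, the pseudocode above could look roughly like this. The set contents (ASCII letters and digits) are only an assumption about what counts as acceptable, and the class and method names are invented.

```java
import java.util.HashSet;
import java.util.Set;

class AllowedCharacters {
    // Assumed whitelist: ASCII letters and digits. Adjust as needed.
    private static final Set<Character> ALLOWED = new HashSet<>();
    static {
        for (char c = 'a'; c <= 'z'; c++) ALLOWED.add(c);
        for (char c = 'A'; c <= 'Z'; c++) ALLOWED.add(c);
        for (char c = '0'; c <= '9'; c++) ALLOWED.add(c);
    }

    // Returns false as soon as one character is outside the whitelist.
    static boolean isAllowed(String value) {
        for (char c : value.toCharArray()) {
            if (!ALLOWED.contains(c)) {
                return false;
            }
        }
        return true;
    }

    // Throwing variant, closer to the pseudocode.
    static void requireAllowed(String value) {
        if (!isAllowed(value)) {
            throw new IllegalArgumentException("Value contains a character that is not allowed");
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("abc123"));   // true
        System.out.println(isAllowed("john doe")); // false, space is not in the set
    }
}
```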
For something this simple, I'd use a regular expression.
private static final Pattern p = Pattern.compile("\\p{Alpha}+");
static boolean isValid(String input) {
Matcher m = p.matcher(input);
return m.matches();
}
There are other pre-defined classes like \w that might work better.
I successfully used a combination of the answers of user232624, Joachim Sauer and Tvaroh:
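static CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder(); // or "ISO-8859-1" for ISO Latin 1
boolean isValid(String input) {
// every character must be a letter that the chosen charset can encode
for (char ch : input.toCharArray()) {
if (!Character.isLetter(ch) || !asciiEncoder.canEncode(ch)) return false;
}
return true;
}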
