Java regular expression for French names

I need to modify a regular expression to allow all standard characters, French characters, spaces and dashes (hyphens), but only one dash at a time.
What I have right now is:
import java.util.regex.Pattern;
public class FrenchRegEx {
static final String NAME_PATTERN = "[\u00C0-\u017Fa-zA-Z-' ]+";
public static void main(String[] args) {
String name;
//name = "Jean Luc"; // allowed
//name = "Jean-Luc"; // allowed
//name = "Jean-Luc-Marie"; // allowed
name = "Jean--Luc"; // NOT allowed
if (!Pattern.matches(NAME_PATTERN, name)) {
System.out.println("ERROR!");
} else System.out.println("OK!");
}
}
It accepts 'Jean--Luc' as a name, which should not be allowed.
Any help with this?
Thanks.

So you want one or more runs of letters, separated by single hyphens or spaces. It's just a matter of writing the pattern that way:
"[\u00C0-\u017Fa-zA-Z']+([- ][\u00C0-\u017Fa-zA-Z']+)*"
This also assumes that you don't want names to start or end with a hyphen or space, that you don't want more than one space in a row, and that a space must not follow or precede a hyphen.
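A quick way to check the pattern above against the names from the question is a small harness like the one below. It is only a sketch: the class name NameCheck and the helper isValidName are made up for illustration, and the two extra samples ("Jean- Luc" and "-Jean") are assumptions about what should be rejected.

```java
import java.util.regex.Pattern;

final class NameCheck {
    // Runs of letters (ASCII plus U+00C0..U+017F) and apostrophes,
    // separated by exactly one hyphen or one space.
    private static final Pattern NAME = Pattern.compile(
            "[\u00C0-\u017Fa-zA-Z']+([- ][\u00C0-\u017Fa-zA-Z']+)*");

    // Hypothetical helper: true when the whole input is a valid name.
    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Jean Luc", "Jean-Luc", "Jean-Luc-Marie", "Jean--Luc", "Jean- Luc", "-Jean"
        };
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isValidName(s));
        }
    }
}
```

The first three samples should be accepted and the last three rejected, because every separator has to sit between two runs of letters.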

You need to disallow consecutive hyphens. You may do it with a negative lookahead:
static final String NAME_PATTERN = "(?!.*--)[\u00C0-\u017Fa-zA-Z-' ]+";
The added part is the (?!.*--) lookahead at the start of the pattern.
To disallow any of the special chars to be consecutive, use
static final String NAME_PATTERN = "(?!.*([-' ])\\1)[\u00C0-\u017Fa-zA-Z-' ]+";
Another way is to unroll the pattern a bit to match strings where the special char(s) can appear in between letters, but cannot appear consecutively (i.e. if you need to match Abc-def'here like strings):
static final String NAME_PATTERN = "[\u00C0-\u017Fa-zA-Z]+(?:[-' ][\u00C0-\u017Fa-zA-Z]+)*";
or to only allow 1 special char that can only appear in between letters (i.e. if you need to only allow strings like abc-def, or abc'def):
static final String NAME_PATTERN = "[\u00C0-\u017Fa-zA-Z]+(?:[-' ][\u00C0-\u017Fa-zA-Z]+)?";
Note that you do not need anchors here because you are using the pattern inside a .matches() method that requires a full string match.
NOTE: you may further tune the patterns by moving special chars that may appear anywhere in the string from the [-' ] character class to the [\u00C0-\u017Fa-zA-Z] character classes, e.g. [\u00C0-\u017Fa-zA-Z'] for the apostrophe, but watch out for -. It should be placed at the end, near ].
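To see how the lookahead variant with the backreference behaves, here is a minimal sketch. The class and method names are hypothetical, and the sample inputs are assumptions chosen to show where this variant differs from the unrolled patterns.

```java
import java.util.regex.Pattern;

final class LookaheadNameCheck {
    // (?!.*([-' ])\1) rejects any input in which a hyphen, apostrophe or
    // space is immediately followed by the same character.
    private static final Pattern NAME = Pattern.compile(
            "(?!.*([-' ])\\1)[\u00C0-\u017Fa-zA-Z-' ]+");

    // Hypothetical helper: true when the whole input passes the check.
    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Jean-Luc", "Jean--Luc", "Jean  Luc", "Jean- Luc", "-Jean"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isValidName(s));
        }
    }
}
```

Note that this variant still accepts "Jean- Luc" and "-Jean": the backreference only catches the same special character twice in a row, and nothing restricts where a special character may appear. The unrolled patterns reject both.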

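Try using ([\u00C0-\u017Fa-zA-Z']+[- ]?)+. This matches one or more name parts, each optionally followed by a single dash or space, so consecutive separators are rejected. Note that it also accepts a trailing dash or space.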

Related

Pattern matching for Japanese strings has issues in Java

I have a strange issue while pattern matching only Japanese characters in Java.
Let me explain by code.
private static final Pattern ADDRESS_STRING_PATTERN =
Pattern.compile("^[\\p{L}\\d\\s\\p{Punct}]{1,200}$");
private static boolean isValidInput(final String input, Pattern pattern) {
return pattern.matcher(input).matches();
}
System.out.println(isValidInput("こんにちは、元気ですか", ADDRESS_STRING_PATTERN));
Here I am matching 1 to 200 characters, each of which is a letter, whitespace, digit or punctuation character.
This always prints false. After some debugging I found that the issue is with one character, "、". If I add that character to the regular expression, it works fine.
Has anyone come across this issue? Or is this a bug in Java?
The thing is that 、 (U+3001 IDEOGRAPHIC COMMA) belongs to "Punctuation, other" Unicode category and \\p{Punct} only matches ASCII punctuation by default. If you use a Pattern.UNICODE_CHARACTER_CLASS option or (?U) embedded flag option, it will match (i.e. the pattern might look like "(?U)^[\\p{L}\\d\\s\\p{Punct}]{1,200}$"). However, this may impact \d and \s, and I am not sure you want to match all Unicode digits and whitespace.
An alternative is to use \p{P}\p{S} (to match Unicode punctuation and symbols) instead of \p{Punct} (the POSIX character class matches both punctuation and symbols).
See a Java demo printing true:
private static final Pattern ADDRESS_STRING_PATTERN = Pattern.compile("^[\\p{L}\\d\\s\\p{P}\\p{S}]{1,200}$");
private static boolean isValidInput(final String input, Pattern pattern) {
return pattern.matcher(input).matches();
}
public static void main (String[] args) throws java.lang.Exception
{
System.out.println(isValidInput("こんにちは、元気ですか",ADDRESS_STRING_PATTERN));
}
// => true
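The three options discussed above can be compared side by side. The sketch below is an illustration only: the class and field names are invented, the Japanese sentence from the question is written with Unicode escapes so the file compiles under any source encoding, and the Arabic-Indic digit U+0663 is an assumed extra input that shows the side effect on \d.

```java
import java.util.regex.Pattern;

final class PunctCheck {
    static final String WITH_PUNCT = "^[\\p{L}\\d\\s\\p{Punct}]{1,200}$";

    // Default mode: \p{Punct} covers ASCII punctuation only.
    static final Pattern ASCII_PUNCT = Pattern.compile(WITH_PUNCT);

    // Unicode mode: \p{Punct}, \d and \s all switch to their Unicode definitions.
    static final Pattern UNICODE_PUNCT =
            Pattern.compile(WITH_PUNCT, Pattern.UNICODE_CHARACTER_CLASS);

    // Explicit Unicode punctuation and symbols; \d and \s stay ASCII-based.
    static final Pattern P_AND_S =
            Pattern.compile("^[\\p{L}\\d\\s\\p{P}\\p{S}]{1,200}$");

    public static void main(String[] args) {
        // The sentence from the question, including U+3001 IDEOGRAPHIC COMMA.
        String japanese =
                "\u3053\u3093\u306B\u3061\u306F\u3001\u5143\u6C17\u3067\u3059\u304B";
        String arabicDigit = "\u0663"; // ARABIC-INDIC DIGIT THREE

        System.out.println(ASCII_PUNCT.matcher(japanese).matches());
        System.out.println(UNICODE_PUNCT.matcher(japanese).matches());
        System.out.println(P_AND_S.matcher(japanese).matches());

        // Side effect of the flag: non-ASCII digits are accepted by \d.
        System.out.println(UNICODE_PUNCT.matcher(arabicDigit).matches());
        System.out.println(P_AND_S.matcher(arabicDigit).matches());
    }
}
```

If only the punctuation needs to become Unicode-aware, the \p{P}\p{S} version is the narrower change, since it leaves \d and \s untouched.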

Find string that does not contain some substring

I have a one liner string that looks like this:
My db objects are db.main_flow_tbl, 'main_flow_audit_tbl',
main_request_seq and MAIN_SUBFLOW_TBL.
I want to use regular expressions to return database tables that start with main but do not contain words audit or seq, and irrespective of the case. So in the above example strings main_flow_tbl and MAIN_SUBFLOW_TBL shall return. Can someone help me with this please?
Here is a fully regex based solution:
public static void main(String[] args) throws Exception {
final String in = "My db objects are db.main_flow_tbl, 'main_flow_audit_tbl', main_request_seq and MAIN_SUBFLOW_TBL.";
final Pattern pat = Pattern.compile("main_(?!\\w*?(?:audit|seq))\\w++", Pattern.CASE_INSENSITIVE);
final Matcher m = pat.matcher(in);
while(m.find()) {
System.out.println(m.group());
}
}
Output:
main_flow_tbl
MAIN_SUBFLOW_TBL
This assumes that table names can only contain the characters A-Z, a-z, 0-9 and _, which \w is the shorthand for.
Pattern breakdown:
main_ is the literal "main_" that you want tables to start with.
(?!\\w*?(?:audit|seq)) is a negative lookahead (not followed by) which takes any number of \w characters (lazily) followed by either "audit" or "seq". This excludes table names that contain those sequences.
\\w++ consumes the remaining table-name characters possessively.
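EDIT
The OP commented that table names may contain numbers as well.
\w already matches digits, so the pattern above handles this case unchanged. Writing the class out explicitly gives an equivalent pattern:
main_(?![\\d\\w]*?(?:audit|seq))[\\d\\w]++
i.e. [\\d\\w] matches exactly the same characters as \\w.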
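As a check that digits in table names need no special treatment, the pattern can be wrapped in a small helper and run on names that contain numbers. This is a sketch under assumptions: TableNameFinder and findTables are hypothetical names, and the numbered table names are made up for the test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableNameFinder {
    // \w covers [a-zA-Z_0-9], so digits inside table names are matched too.
    private static final Pattern TABLE = Pattern.compile(
            "main_(?!\\w*?(?:audit|seq))\\w++", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects every match found in the text.
    static List<String> findTables(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = TABLE.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findTables("db.main_flow2_tbl, main_2audit, MAIN_3"));
    }
}
```

Be aware that the pattern has no word boundary in front of main_, so it would also match inside a longer name such as remain_tbl; prefixing it with \b is one way to prevent that.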
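String str = "main_flow_tbl"; // one candidate name, e.g. a token from the input
String lower = str.toLowerCase(); // compare case-insensitively
if (lower.startsWith("main") && !lower.contains("audit") && !lower.contains("seq")) {
    // your code here
}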
If each candidate name is tested on its own, a case-insensitive match against
^main_(?!\w*(?:audit|seq))\w+$
should be what you want.

Java split regex non-greedy match not working

Why is non-greedy match not working for me? Take following example:
public String nonGreedy(){
String str2 = "abc|s:0:\"gef\";s:2:\"ced\"";
return str2.split(":.*?ced")[0];
}
In my eyes the result should be: abc|s:0:\"gef\";s:2 but it is: abc|s
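The .*? in your regex matches any character except \n (0 or more times, matching the least amount possible), and that includes further colons. The match still starts at the first colon in the input; laziness only limits how far it extends to the right.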
You can try the regular expression:
:[^:]*?ced
On another note, you should use a constant Pattern to avoid recompiling the expression every time, something like:
private static final Pattern REGEX_PATTERN =
Pattern.compile(":[^:]*?ced");
public static void main(String[] args) {
String input = "abc|s:0:\"gef\";s:2:\"ced\"";
System.out.println(java.util.Arrays.toString(
REGEX_PATTERN.split(input)
)); // prints "[abc|s:0:"gef";s:2, "]"
}
It is behaving as expected. The non-greedy match will match as little as it has to, and with your input, the minimum characters to match is the first colon to the next ced.
You could try limiting the number of characters consumed. For example, to limit the term to "up to 3 characters":
:.{0,3}ced
To make it split as close to ced as possible, use a negative look-ahead, with this regex:
:(?!.*:.*ced).*ced
This makes sure there isn't a closer colon to ced.
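The suggested expressions can be compared directly on the string from the question. The following is a small sketch; SplitDemo and firstPart are names invented for the illustration.

```java
import java.util.regex.Pattern;

final class SplitDemo {
    static final String INPUT = "abc|s:0:\"gef\";s:2:\"ced\"";

    // Hypothetical helper: splits INPUT on the regex and returns the first piece.
    static String firstPart(String regex) {
        return Pattern.compile(regex).split(INPUT)[0];
    }

    public static void main(String[] args) {
        String[] regexes = {
            ":.*?ced",            // lazy, but still starts at the first colon
            ":[^:]*?ced",         // cannot cross another colon
            ":.{0,3}ced",         // at most 3 characters between ':' and "ced"
            ":(?!.*:.*ced).*ced"  // only the last colon before "ced" qualifies
        };
        for (String regex : regexes) {
            System.out.println(regex + " -> " + firstPart(regex));
        }
    }
}
```

Only the first expression returns abc|s; the other three return abc|s:0:"gef";s:2, which is the result the question expected.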

Java using regex to verify an input string

E.g.:
String string="Marc Louie, Garduque Bautista";
I want to check if a string contains only words, a comma and spaces. I have tried to use regex and the closest I got is this:
String pattern = "[a-zA-Z]+(\\s[a-zA-Z]+)+";
but it doesn't check if there is a comma in there or not. Any suggestions?
You need to use the pattern
^[A-Za-z, ]++$
For example
public static void main(String[] args) throws IOException {
final String input = "Marc Louie, Garduque Bautista";
final Pattern pattern = Pattern.compile("^[A-Za-z, ]++$");
if (!pattern.matcher(input).matches()) {
throw new IllegalArgumentException("Invalid String");
}
}
EDIT
As per Michael's astute comment the OP might mean a single comma, in which case
^[A-Za-z ]++,[A-Za-z ]++$
Ought to work.
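A few inputs make the difference between the two patterns visible. This is only a sketch with made-up names (CommaCheck, anyCommas, singleComma), and the sample strings other than the one from the question are assumptions.

```java
import java.util.regex.Pattern;

final class CommaCheck {
    // Letters, spaces and any number of commas, in any order.
    private static final Pattern ANY_COMMAS = Pattern.compile("^[A-Za-z, ]++$");

    // Letters and spaces on both sides of exactly one comma.
    private static final Pattern SINGLE_COMMA =
            Pattern.compile("^[A-Za-z ]++,[A-Za-z ]++$");

    static boolean anyCommas(String s) {
        return ANY_COMMAS.matcher(s).matches();
    }

    static boolean singleComma(String s) {
        return SINGLE_COMMA.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Marc Louie, Garduque Bautista", // one comma
            "Marc Louie Garduque Bautista",  // no comma
            "Marc, Louie, Garduque",         // two commas
            " , "                            // no words at all
        };
        for (String s : samples) {
            System.out.println("[" + s + "] any=" + anyCommas(s) + " single=" + singleComma(s));
        }
    }
}
```

Both patterns accept " , ", because a space satisfies the character class on its own. If at least one word is required on each side of the comma, the classes have to be tightened further.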
Why not just simply:
"[a-zA-Z\\s,]+"
Use this; it will work best:
"(?i)[a-z,\\s]+"
If you mean "some words, any spaces and one single comma, wherever it happens to be", then I would suggest this approach:
"^[^,]* *, *[^,]*$"
This means "Start with zero or more characters which are NOT (^) a comma, then you could find zero or more spaces, then a comma, then again zero or more spaces, then finally again zero or more characters which are NOT (^) a comma".
To validate a String in Java that has no special character at the beginning or end but may have some special characters in between:
String strRGEX = "^[a-zA-Z0-9]+([a-zA-Z0-9-/?:.,\'+_\\s])+([a-zA-Z0-9])$";
String toBeTested= "TesADAD2-3t?S+s/fs:fds'f.324,ffs";
boolean testResult= Pattern.matches(strRGEX, toBeTested);
System.out.println("Test="+testResult);

How to remove special characters from a string?

I want to remove special characters like:
- + ^ . : ,
from an String using Java.
That depends on what you define as special characters, but try replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since you'd then either have to escape it or it would mean "any but these characters".
Another note: the - character needs to be the first or last one in the list, otherwise you'd have to escape it or it would define a range (e.g. :-, would mean "all characters in the range : to ,", which Java rejects as an illegal range because : comes after , in character order).
So, in order to keep consistency and not depend on character positioning, you might want to escape all those characters that have a special meaning in regular expressions (the following list is not complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
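The effect of the hyphen's position can be tried out with a few character classes. The sketch below uses invented names (HyphenPlacement, compiles) and sample strings chosen for the illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class HyphenPlacement {
    // Hypothetical helper: reports whether Java accepts the regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String input = "a-b+c^d.e:f,g";

        // Hyphen first, caret not first: every character is a literal.
        System.out.println(input.replaceAll("[-+.^:,]", ""));

        // Everything escaped: same result, independent of position.
        System.out.println(input.replaceAll("[\\-\\+\\.\\^:,]", ""));

        // Hyphen between ':' and ',' is read as a reversed range and rejected.
        System.out.println(compiles("[:-,]"));

        // Hyphen between '+' and '.' silently becomes the range + , - .
        // so the comma is removed although it was never listed.
        System.out.println("1,2".replaceAll("[+-.]", ""));
    }
}
```

The last case is the more dangerous one in practice: the class compiles without complaint but matches more characters than the author listed.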
If you want to get rid of all punctuation and symbols, try this regex: [\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
This means: replace everything that is not a word character (a-z in any case, 0-9 or _) or whitespace.
Edit: please note that there are a couple of other patterns that might prove helpful. However, I can't explain them all, so have a look at the reference section of regular-expressions.info.
Here's less restrictive alternative to the "define allowed characters" approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and not a separator (space, line separator, paragraph separator etc.; tabs and \n are control characters rather than separators, so they are removed too). Note that you can't use [\P{L}\P{Z}] (upper case P means not having that property), since that would mean "everything that is not a letter or not a separator", which matches almost everything, since letters are not separators and vice versa.
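The difference between the \w-based and the \p{L}-based approach shows up as soon as the input contains accented letters. Here is a minimal sketch; the class and method names are hypothetical, and the French sample sentence is an assumption, written with Unicode escapes so that it compiles under any source encoding.

```java
final class StripDemo {
    // Keeps ASCII word characters and whitespace only.
    static String asciiOnly(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    // Keeps letters of any script plus Unicode separators (category Z).
    static String unicodeAware(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    public static void main(String[] args) {
        // The escapes below are e-grave, u-circumflex, e-acute and i-circumflex.
        String s = "Cr\u00E8me br\u00FBl\u00E9e, s'il vous pla\u00EEt!";
        System.out.println(asciiOnly(s));    // accented letters are dropped
        System.out.println(unicodeAware(s)); // accented letters are kept

        // A tab is a control character, not a \p{Z} separator.
        System.out.println(unicodeAware("a\tb").length());
    }
}
```

With default flags \w is limited to [a-zA-Z_0-9], which is why the first call strips the accented letters along with the punctuation.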
Additional information on Unicode
Some Unicode characters seem to cause problems due to different possible ways to encode them (as a single code point or a combination of code points). Please refer to regular-expressions.info for more information.
This will replace all the characters except alphanumeric
replaceAll("[^A-Za-z0-9]","");
As described here
http://developer.android.com/reference/java/util/regex/Pattern.html
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionTest {
    public static void main(String[] args) {
        System.out.println("String is = "+getOnlyStrings("!&(*^*(^(+one(&(^()(*)(*&^%$##!#$%^&*()("));
        System.out.println("Number is = "+getOnlyDigits("&(*^*(^(+91-&*9hi-639-0097(&(^("));
    }

    // Removes every character that is not an ASCII digit.
    public static String getOnlyDigits(String s) {
        Pattern pattern = Pattern.compile("[^0-9]");
        Matcher matcher = pattern.matcher(s);
        return matcher.replaceAll("");
    }

    // Removes every character that is not an ASCII letter or a space.
    public static String getOnlyStrings(String s) {
        Pattern pattern = Pattern.compile("[^a-z A-Z]");
        Matcher matcher = pattern.matcher(s);
        return matcher.replaceAll("");
    }
}
Result
String is = one
Number is = 9196390097
Try replaceAll() method of the String class.
BTW here is the method, return type and parameters.
public String replaceAll(String regex,
String replacement)
Example:
String str = "Hello +-^ my + - friends ^ ^^-- ^^^ +!";
str = str.replaceAll("[-+^]+", "");
It should remove all the {'^', '+', '-'} chars that you wanted to remove!
To Remove Special character
String t2 = "!##$%^&*()-';,./?><+abdd";
t2 = t2.replaceAll("\\W+","");
Output will be : abdd.
This works perfectly.
Use the String.replaceAll() method in Java.
replaceAll should be good enough for your problem.
You can remove single char as follows:
String str="+919595354336";
String result = str.replaceAll("\\+", ""); // "\\+" is the regex \+, i.e. a literal plus sign
System.out.println(result);
OUTPUT:
919595354336
If you just want to do a literal replace in java, use Pattern.quote(string) to escape any string to a literal.
myString.replaceAll(Pattern.quote(matchingStr), replacementStr)
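A short sketch of the literal-replace idea follows. The class and method names are hypothetical. It also passes the replacement through Matcher.quoteReplacement, which is an addition to the line above: without it, $ and \ in the replacement string are interpreted as group references and escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    // Treats both the search text and the replacement as plain text.
    static String replaceLiteral(String input, String target, String replacement) {
        return input.replaceAll(
                Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // '+' and '.' would be regex metacharacters without Pattern.quote.
        System.out.println(replaceLiteral("1+1=2", "1+1", "two"));
        System.out.println(replaceLiteral("a.b.c", ".", "-"));

        // '$' would be read as a group reference without quoteReplacement.
        System.out.println(replaceLiteral("cost", "cost", "$5"));

        // String.replace does literal replacement without any regex.
        System.out.println("a.b.c".replace(".", "-"));
    }
}
```

For plain literal replacement, String.replace(CharSequence, CharSequence) is usually the simpler choice; the quoting helpers are mainly useful when a literal has to be embedded in a larger regular expression.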
