Validate with REGEX alpha with support in many languages - java

Hi guys, I am working on a Java validator library. My question is: how can I validate that inputs are alphabetic (letters only, not alphanumeric) in many languages? I have the following regex:
public AlphaValidator() {
super();
this.rule = "^[a-zA-Z[*]]+$"; // its fine with : angel, world, bottle, etc.
}
It works, but if the library is used for Spanish or French input, words such as "vi un ñandu" or "árbol" do not match the regex.
I started writing out the special characters like this:
private String getSpanishFilter() {
return "-ñ-Ñ-á-Á-é-É-í-Í-ó-Ó-ú-Ú-ü-Ü";
}
private String getFrenchFilter() {
return "â-à-ç-é-ê-ë-è-ï-î-ô-û-ù-Â-À-Ç-É-Ê-Ë-È-Ï-Î-Ô-Û-Ù";
}
But I think this is not the best solution. Any help?

I don't understand why you have enclosed * inside a nested character class. A nested class is just a union, so it is equivalent to writing * on its own. To match Unicode letters, you can use \p{L}.
If you're already on Java 7, you can use the Pattern.UNICODE_CHARACTER_CLASS flag, or the embedded flag (?U), with your given pattern:
Pattern p = Pattern.compile("^[*\\w&&[^\\d_]]+$", Pattern.UNICODE_CHARACTER_CLASS);
And if you're keeping regex as string, then use embedded flag as:
rule = "(?U)^[*\\w&&[^\\d_]]+$";
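A minimal sketch of how the two suggestions above could be tried out, assuming Java 7 or later. The class and method names are made up for illustration, and the literal * from the original character class is left out on the assumption that it was not meant to be accepted. The test words are written with \u escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class UnicodeAlphaDemo {
    // One or more Unicode letters (general category L), in any script.
    static final Pattern LETTERS = Pattern.compile("^\\p{L}+$");

    // Word characters minus digits and underscore, with \w and \d made
    // Unicode-aware by the embedded (?U) flag.
    static final Pattern WORD_MINUS_DIGITS = Pattern.compile("(?U)^[\\w&&[^\\d_]]+$");

    static boolean isAlpha(String input) {
        return LETTERS.matcher(input).matches();
    }

    static boolean isAlphaViaWord(String input) {
        return WORD_MINUS_DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "angel",
            "\u00e1rbol",        // árbol
            "\u00f1andu",        // ñandu
            "abc1",              // contains a digit
            "vi un \u00f1andu"   // contains spaces
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isAlpha(s) + " / " + isAlphaViaWord(s));
        }
    }
}
```

Note that a phrase with spaces is rejected by both patterns; if spaces should be allowed, they would have to be added to the character class.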

Have you looked at the docs for Pattern?
Under "Classes for Unicode scripts, blocks, categories and binary properties":
\p{IsAlphabetic} An alphabetic character (binary property)
So your pattern could be:
"\\p{IsAlphabetic}+"

The shortest way with matches():
\\pL+ # no need to add anchors with matches() method
The shortest way with find():
\\PL # stop at the first non letter character
note: you can write \\p{L} and \\P{L} too, \\pL and \\PL are shortcuts.
But if you need to match only Latin characters, it's better to use the script property (note the capital I in the Is prefix):
\\p{IsLatin}+
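To make the difference between the matches() and find() styles concrete, here is a small sketch. The class and method names are hypothetical, and the script property is written with the documented Is prefix (\p{IsLatin}). The find() variant treats an empty string as invalid, which is an assumption on my part.

```java
import java.util.regex.Pattern;

class LetterCheckDemo {
    static final Pattern ALL_LETTERS = Pattern.compile("\\pL+");
    static final Pattern NON_LETTER = Pattern.compile("\\PL");
    static final Pattern LATIN_ONLY = Pattern.compile("\\p{IsLatin}+");

    // matches() must consume the whole input, so no anchors are needed.
    static boolean viaMatches(String s) {
        return ALL_LETTERS.matcher(s).matches();
    }

    // find() looks for the first non-letter; the input is valid if none exists.
    static boolean viaFind(String s) {
        return !s.isEmpty() && !NON_LETTER.matcher(s).find();
    }

    // Accepts only characters that belong to the Latin script.
    static boolean latinOnly(String s) {
        return LATIN_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2"; // two Greek letters, alpha and beta
        System.out.println(viaMatches(greek)); // letters in any script are accepted
        System.out.println(latinOnly(greek));  // Greek is not Latin script
        System.out.println(viaFind("abc1"));   // the digit is a non-letter
    }
}
```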

Related

Extract specific data from string with regex

I want to capture multiple substrings that match a specific pattern.
For example, my string is:
String textData = "#1_Label for UK#2_Label for US#4_Label for FR#";
I want to get the text between two # characters that contains a given search string, such as UK.
The output should look like this:
If the search string is UK, then
the output should be 1_Label for UK
If the search string is label, then
the output should be 1_Label for UK, 2_Label for US and 4_Label for FR
If the search string is 1_, then
the output should be 1_Label for UK
I don't want to extract the data via an array list, and the extraction should be case insensitive.
Can you please help me with this problem?
Regards,
Ashish Mishra
You can use this regex for search:
#([^#]*?Label[^#]*)(?=#)
Replace Label with your search keyword.
Java Pattern:
Pattern p = Pattern.compile( "#([^#]*?" + Pattern.quote(keyword) + "[^#]*)(?=#)" );
If the data always is between two hashes, try a regex like this: (?i)#.*your_match.*# where your_match would be UK, label, 1_ etc.
Then use this expression in conjunction with the Pattern and Matcher classes.
If you want to match multiple strings, you'd need to exclude the hashes from the match by using look-around methods as well as reluctant modifiers, e.g. (?i)(?<=#).*?label.*?(?=#).
Short breakdown:
(?i) will make the expression case insensitive
(?<=#) is a positive look-behind, i.e. the match must be preceded by a hash (but doesn't include the hash)
.*? matches any sequence of characters but is reluctant, i.e. it tries to match as few characters as possible
(?=#) is a positive look-ahead, which means the match must be followed by a hash (also not included in the match)
Without the look-around methods the hashes would be included in the match and thus using Matcher.find() you'd skip every other label in your test string, i.e. you'd get the matches #1_Label for UK# and #4_Label for FR# but not #2_Label for US#.
Without the reluctant modifiers the expression would match everything between the first and the last hash.
A better alternative is to replace .*? with [^#]*, which means that the match cannot contain any hash. That removes the need for reluctant modifiers and also fixes the problem that looking for US would match 1_Label for UK#2_Label for US.
So most probably the final regex you're after looks like this: (?i)(?<=#)[^#]*your_match[^#]*(?=#).
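Here is a rough sketch of that final expression used with Pattern and Matcher. The class name, the method name and the use of Pattern.quote for the keyword are my own choices for illustration; the List only serves to show the results and could be replaced by whatever processing is needed inside the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelExtractDemo {
    // Returns every hash-delimited segment that contains the keyword, ignoring case.
    static List<String> findLabels(String text, String keyword) {
        Pattern p = Pattern.compile(
                "(?<=#)[^#]*" + Pattern.quote(keyword) + "[^#]*(?=#)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group()); // the hashes are not part of the match
        }
        return result;
    }

    public static void main(String[] args) {
        String textData = "#1_Label for UK#2_Label for US#4_Label for FR#";
        System.out.println(findLabels(textData, "UK"));
        System.out.println(findLabels(textData, "label"));
        System.out.println(findLabels(textData, "1_"));
    }
}
```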
([^#]*UK[^#]*) for UK
([^#]*Label[^#]*) for Label
([^#]*1_[^#]*) for 1_
Try this. Grab the captures. See the demos:
http://regex101.com/r/kQ0zR5/3
http://regex101.com/r/kQ0zR5/4
http://regex101.com/r/kQ0zR5/5
I have solved this problem with the pattern below:
(?i)([^#]*?us[^#]*)(?=#)
Thank you so much Anubhava, VKS and Thomas for your replies.
Regards,
Ashish Mishra

Regular expression not working despite testing

I'm trying to validate an ID whose first two characters are letters and whose next four are digits. Individual zeros are allowed (e.g. 0333), but the digits can never be all zeros, so something like ID0000 is not allowed. The expression I came up with checks out when I test it online, but it doesn't seem to work when I try to enforce it in the program:
\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\b
and here's the code I'm currently using to implement it:
String pattern = "/\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\b/";
Pattern regEx = Pattern.compile(pattern);
String ingID = ingredID.getText().toString();
Matcher m = regEx.matcher(ingID);
if (m.matches()) {
ingredID.setError("Please enter a valid Ingrediant ID");
}
For some reason it doesn't validate correctly: it accepts IDs like ID0000 when it shouldn't. Any thoughts, folks?
Change your regex pattern to "\\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\\b"
Your problem is essentially that Java isn't all that Regex-friendly; you need to deal with the limitations of Java strings in order to create a string that can be used as a Regex pattern. Since \ is the escape character in Regex and the escape character in Java strings (and since there's no such thing as a raw string literal in Java), you must double-escape anything that must be escaped in the Regex in order to create a literal \ character within the Java string, which, when parsed as a Regex pattern, will be correctly treated as the escape character.
So, for instance, the Regex pattern /\b/ (where /, as mentioned in my comment, delimits the pattern itself) would be represented in Java as the string "\\b".
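A minimal sketch of the corrected pattern, without the Android-specific parts. The class and method names are hypothetical. One assumption to flag: the snippet in the question calls setError when matches() returns true, which looks inverted, so this sketch treats a successful match as a valid ID.

```java
import java.util.regex.Pattern;

class IngredientIdDemo {
    // Two uppercase letters followed by four digits that are not all zero.
    // No / delimiters, and every regex backslash is doubled in the Java string.
    static final Pattern ID =
            Pattern.compile("\\b(?![A-Z]{2}0{4})[A-Z]{2}[0-9]{4}\\b");

    static boolean isValid(String id) {
        return ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        // In a Java string literal, "\b" is a single backspace character,
        // while "\\b" is the two characters the regex engine needs.
        System.out.println("\b".length());   // 1
        System.out.println("\\b".length());  // 2

        for (String id : new String[] {"ID0333", "ID0000", "ID12345", "id1234"}) {
            if (!isValid(id)) {
                System.out.println(id + ": Please enter a valid ingredient ID");
            }
        }
    }
}
```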

String split, words including accented characters

I'm using this regex:
x.split("[^a-zA-Z0-9']+");
This returns an array of strings with letters and/or numbers.
If I use this:
String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I got:
CEN01
Automated
TestCase
java
But if I use this:
String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");
I got:
CEN01
Automa
o
Caso
Teste
java
How can I modify this regex to include accented characters? (á,ã,õ, etc...)
From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
Since Character class contains isAlphabetic method you can use
name.split("[^\\p{IsAlphabetic}0-9']+");
You can also use
name.split("(?U)[^\\p{Alpha}0-9']+");
but this needs the UNICODE_CHARACTER_CLASS flag, which you can enable by adding (?U) to the regex.
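For reference, a small sketch that runs both variants on the file name from the question. The class and method names are invented for the example, and the accented characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.util.Arrays;

class AccentSplitDemo {
    // Splits on runs of anything that is not alphabetic, an ASCII digit or an apostrophe.
    static String[] splitAlphabetic(String name) {
        return name.split("[^\\p{IsAlphabetic}0-9']+");
    }

    // Same idea with the POSIX class \p{Alpha}, made Unicode-aware by (?U).
    static String[] splitUnicodeAlpha(String name) {
        return name.split("(?U)[^\\p{Alpha}0-9']+");
    }

    public static void main(String[] args) {
        // "CEN01_Automação_Caso_Teste.java"
        String name = "CEN01_Automa\u00e7\u00e3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitAlphabetic(name)));
        System.out.println(Arrays.toString(splitUnicodeAlpha(name)));
    }
}
```

Both calls should keep "Automação" in one piece instead of breaking it at the accented letters.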
I would check out the Java Documentation on Regular Expressions. There is a unicode section which I believe is what you may be looking for.
EDIT: Example
Another way would be to match on the character code you are looking for. For example
\uFFFF where FFFF is the hexadecimal number of the character you are trying to match.
Example: \u00E0 matches à
Realize that the backslash will need to be escaped in Java if you are using it as a string literal.
Read more about it here.
You can use this:
String[] names = name.split("[^a-zA-Z0-9'\\p{L}]+");
System.out.println(Arrays.toString(names));
will output:
[CEN01, Automação, Caso, Teste, java]
See this for more information.
Why not split on the separator characters?
String[] names = name.split("[_.]");
Instead of whitelisting all the characters you want, you could always blacklist the characters you don't want, like this:
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
But that is a personal opinion.

Regex: Match a string between two tags in a string

I am new to regex and I am stuck writing one for the scenario below. Can someone please help me solve this?
If I have a string like the following:
<Tag1 attr="test"/>
<Tag2>
<Tag4 attr="test"/>
<Tag5 attr="test"/>
</Tag2>
<Tag3 attr="test"/>
What's the regex to match 'test' between the <Tag2> and </Tag2> tags?
Output should match 'test' in both Tag4 and Tag5...
Any help would be highly appreciated..
Why are you using a regex for this? I am not familiar with the Java libraries, but I would imagine there is a library that would allow you to do XQueries using XPaths. That would be the simpler approach.
Here is a website that shows examples
Here is a SO question on XPath in Java
XPath is really more appropriate for this. This looks like duplicate post. Original
Perl has a couple of good xpath parsers on CPAN. But here's a good page on multiline regex parsing if you absolutely must use it.
Everything said above is totally true. However, if you still want to practice some regex, here's an alternative:
Doing it in one match is not possible since one of the inner groups will always be discarded (see this) , so you'll have to extract the inner passage first.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTagParse {
    static String html = "<Tag1 attr=\"test\"/><Tag2> <Tag4 attr=\"test_one\"/> <Tag5 attr=\"test_two\"/></Tag2><Tag3 attr=\"test\"/>";
    public static void main(String[] args) {
        // Step 1: isolate everything between <Tag2> and </Tag2> (DOTALL so line breaks are allowed).
        Matcher mat1 = Pattern.compile("<Tag2>(.*?)</Tag2>", Pattern.DOTALL).matcher(html);
        if (!mat1.find()) {
            return; // no <Tag2> element, nothing to extract
        }
        // Step 2: pull each attr value out of the inner passage.
        Matcher mat2 = Pattern.compile("<[^<>]*attr=\"([^\"]+)\"[^<>]*>").matcher(mat1.group(1));
        while (mat2.find()) {
            System.out.println(mat2.group(1));
        }
    }
}
anyways, you'd be much better off using XPath :)
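Since XPath keeps coming up, here is a minimal sketch of what it could look like with the javax.xml.xpath API that ships with the JDK. The class name, method name and the synthetic <root> wrapper are assumptions for illustration: the snippet in the question has several top-level elements, so it is wrapped in a root element to make it well-formed XML before querying.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTagDemo {
    // Returns the attr values of all direct children of <Tag2>.
    static List<String> attrsInsideTag2(String fragment) {
        // Hypothetical wrapper element so the fragment has a single root.
        String xml = "<root>" + fragment + "</root>";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/root/Tag2/*/@attr",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<Tag1 attr=\"test\"/><Tag2><Tag4 attr=\"test_one\"/>"
                + "<Tag5 attr=\"test_two\"/></Tag2><Tag3 attr=\"test\"/>";
        System.out.println(attrsInsideTag2(fragment));
    }
}
```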
I'm out of practice with Java, but I hope I can offer some guidance on the regular expression. If you know the specific attribute and value you're looking for, you can use something like the following:
Pattern pattern = Pattern.compile("<tag[45][^>]*attr\\s*=\\s*[\"']test['\"][^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("<Tag1 attr='test'/><Tag2><Tag4 attr='test'/><Tag5 attr='test'/></Tag2><Tag3 attr='test'/>");
while (matcher.find()) {
    System.out.println(matcher.group());
}
The regex is made up of the following components:
match the literal string: <tag
followed by either a 4 or a 5 (the [45] designation)
followed by any number of characters other than > preceding the literal string: attr
followed by any number of spaces
followed by the literal character: =
followed by any number of spaces
followed by either the ' or " character
followed by the string literal: test
followed by either the ' or " character
followed by any number of characters that are not >
followed by >
Note that the backslashes are doubled in the Java string, and that find() is used in a loop because matches() would require the whole input to be a single tag.
The point of adding some of these extra bits is simply to highlight that you may want to account for different coding styles. Note: I took the easy way out by setting the pattern as case-insensitive, but you can omit that flag and change your expression to check for the appropriate case (for example, if your attribute value is case-sensitive, you can change the 'tag' literal to [tT][aA][gG] so that only the tag name is matched case-insensitively).
I'm apparently too slow to type, since jvataman has already answered your question, but perhaps there is some value in my writeup, so I'll post anyway.

How to exclude occurrence of a substring from a string using regex?

I have a string input in the following two forms.
1.
<!--XYZdfdjf., 15456, hdfv.4002-->
<!DOCTYPE
2.
<!--XYZdfdjf., 15456, hdfv.4002
<!DOCTYPE
I want to return a match if the form 2 is encountered and no match for the form 1.
So basically I want a regex that accepts any characters between <!-- and <!DOCTYPE, except when there is an occurrence of --> in between.
I am using Pattern, Matcher and Java regex.
Help is sought in terms of a regex specifically usable with Pattern.compile()
Thanks in advance.
Pattern p = Pattern.compile("(?s)<!--(?:(?!-->).)*<!DOCTYPE");
(?:(?!-->).)* matches one character at a time, after checking that it's not the first character of -->.
(?s) sets DOTALL mode (a.k.a. single-line mode), allowing the . to match newline characters.
If there's a possibility of two or more matches and you want to find them individually, you can replace the * with a non-greedy *?, like so:
"(?s)<!--(?:(?!-->).)*?<!DOCTYPE"
For example, applying that regex to the text of your question will find two matches, while the original regex will find one, longer match.
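A short sketch that runs the first pattern against the two forms from the question. The class name, method name and variable names are hypothetical, and the input strings are copied from the question with a \n between the two lines.

```java
import java.util.regex.Pattern;

class CommentGapDemo {
    // <!-- followed by any characters, none of which starts a -->, up to <!DOCTYPE.
    static final Pattern UNCLOSED =
            Pattern.compile("(?s)<!--(?:(?!-->).)*<!DOCTYPE");

    static boolean hasUnclosedComment(String text) {
        return UNCLOSED.matcher(text).find();
    }

    public static void main(String[] args) {
        String form1 = "<!--XYZdfdjf., 15456, hdfv.4002-->\n<!DOCTYPE";
        String form2 = "<!--XYZdfdjf., 15456, hdfv.4002\n<!DOCTYPE";
        System.out.println(hasUnclosedComment(form1)); // false: --> appears before <!DOCTYPE
        System.out.println(hasUnclosedComment(form2)); // true: no --> in between
    }
}
```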
This seems like it is easily solved by using String.contains():
if (yourHtml.contains("-->")) {
// exclude
} else {
// extract the content you need
String content =
yourHtml.substring("<!--".length(), yourHtml.indexOf("<!DOCTYPE"));
}
I think you are looking too far into it.
\<!--([\s\S](?!--\>))*?(?=\<\!DOCTYPE)
this uses a negative lookahead to prevent the --> and a positive lookahead to find the <!DOCTYPE
Here's a good reference for atomic assertions (lookahead and behind).
I don't have a testing system handy, so I can't give you the regex, but you should look in the Pattern documentation for something called a negative lookahead assertion. This allows you to express rules of the form: match this if not followed by that.
It should help you :)
A regular expression might not be the best answer to your problem. Have you tried splitting the first line away from everything else and seeing if it contains the -->?
Specifically, something like:
String htmlString; // the input text, assigned before use
String firstLine = htmlString.split("\r?\n")[0];
if (firstLine.contains("-->")) {
    // no match
} else {
    // match
}
