I have address information mixed with some junk in my DB, and I only need to process the strings that contain a zip code. Can you explain how to check whether a string contains 5 consecutive digits? For example,
String addr = "10100 Trinity Parkway, 5th Floor Stockton, CA 95219";
This string should match because it contains a 5-digit zip code. Is there a way to check this using a Java regular expression?
Update:
String addr = "10100 Trinity Parkway, 5th Floor Stockton, CA 95219";
String addressMatcher = "\\d{5}";
if (addr.matches(addressMatcher)) {
    System.out.println(addr);
}
Above is the code I am using after reading the answers, but none of the regexes match, so addr is never printed. Am I doing anything wrong?
Regards,
Karthik
The simple expression is ".*\\d{5}$", which says that you want any character 0 or more times, then any digit 5 times, and then the end of the string. Note that the backslash is doubled because it has to be escaped in a Java string literal.
If there may be more characters at the end of the string, you can replace the trailing $ with .* to match those. However, that may end up matching other numbers in the address as well (such as a house number), so make sure your data is in a consistent, expected format.
Regular expressions may not be sufficient in this case, since not all 5-digit numbers are actually zip codes. As well, some zip codes may include additional numbers (usually following a - after the first five digits.)
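The code in the question's update prints nothing because String.matches only succeeds when the whole string matches the pattern, and "\\d{5}" on its own describes a string of exactly five digits. A minimal sketch of the difference, using Matcher.find to search anywhere in the string (the class and method names are made up for illustration, and as noted above this finds any 5-digit number, including the house number):

```java
import java.util.regex.Pattern;

class ZipCheck {
    // Five digits that are not part of a longer run of word characters.
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\b\\d{5}\\b");

    // Hypothetical helper: true if a standalone 5-digit number occurs anywhere in s.
    static boolean hasFiveDigitNumber(String s) {
        // find() searches for the pattern anywhere in the input,
        // whereas matches() requires the entire input to match.
        return FIVE_DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        String addr = "10100 Trinity Parkway, 5th Floor Stockton, CA 95219";
        System.out.println(addr.matches("\\d{5}"));     // false: whole string is not 5 digits
        System.out.println(addr.matches(".*\\d{5}$"));  // true: string ends with 5 digits
        System.out.println(hasFiveDigitNumber(addr));   // true
    }
}
```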
I am not quite sure if this is what you are looking for:
(\w\s*)+,(\s*(\w\s*)+,)*\s*[A-Z]{2}\s*[1-9]{5}
This expression will match:
1 - any words with any spaces, followed by a comma,
2 - then optional repetitions of words separated by spaces, each followed by a comma,
3 - and finally a two-letter word (the state code) followed by a 5-digit zip code. Note that [1-9] accepts only non-zero digits; use \d{5} if your zip codes can contain a 0.
I hope this helps.
[Edit]
This expression will filter out numbers in the address that are not in the right place to be a zip code, such as 10100 in your example.
Can anyone tell me how to write a regex for a string that consists of one or more alphanumeric characters followed by an even number of digits?
Valid:
a11a1121
bbbb11a1121
Invalid:
a11a1
I have tried ^[a-zA-Z*20-9]*$ but it always returns true.
Can you please help in this regard?
The regex you mentioned matches any number of characters from the set [a-z, A-Z, *, 2, or 0-9].
You can break your requirement down into groups and handle each one in turn.
You require at least one letter, so you start with ^([a-zA-Z]+)$
Then you need digits in multiples of 2, so you extend it to ^([a-zA-Z]+(\d\d)+)$
Now you need any number of repetitions of this combination, so the expression becomes: ^([a-zA-Z]+(\d\d)+)*$
You can use online tools like regex101 for this purpose.
You can achieve it with this regexp: ^[a-z0-9]*[a-z]+([0-9]{2})*$
Explanation :
[a-z0-9]*[a-z]+: a string of at least one character that ends with a letter
([0-9]{2})*: an even-length sequence of digits (0 or 2*n digits). If the digit sequence cannot be empty, use ([0-9]{2})+ instead.
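To check the pattern from this answer against the examples in the question, a short sketch could look like this (the class and method names are only for illustration):

```java
import java.util.regex.Pattern;

class EvenDigitsCheck {
    // Any letters/digits, then at least one letter, then zero or more digit pairs.
    private static final Pattern LETTER_THEN_DIGIT_PAIRS =
            Pattern.compile("^[a-z0-9]*[a-z]+([0-9]{2})*$");

    // Hypothetical helper: true if the whole string satisfies the pattern.
    static boolean isValid(String s) {
        return LETTER_THEN_DIGIT_PAIRS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("a11a1121"));    // true: "1121" is two digit pairs
        System.out.println(isValid("bbbb11a1121")); // true
        System.out.println(isValid("a11a1"));       // false: a single trailing digit
    }
}
```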
I want to replace numbers in a string if they have more than 3 digits (phone numbers should be replaced), but a number should not be replaced if it is followed by $ or if it has a decimal point. I used the expression below.
"\d{3,}+(?!\$/\.)"
The issues I face: it is not replacing numbers with more than ten digits, which I do want to replace because some of them are IDs with more than 10 digits. Also, if a number has more than 3 digits after the decimal point, those digits get replaced. I don't want a number to be replaced if it has a decimal point. Can somebody help?
For example, take the number string "3452678916381914". It should be replaced, but the above regex does not replace it. Numbers like $1234,45.567 should not be replaced, but the above regex replaces 45.567.
Use a lookbehind and a lookahead: first assert that the starting word boundary is not preceded by a $ or ., then assert that the ending word boundary is not followed by a $ or .
It works for both examples you provided; you might need to tweak it a little to handle some corner cases.
(?<![\$\.])\b\d{3,}\b(?![\$\.])
In the demo below, it matches the first two lines but not the rest:
3452678916381914 # match
1234 56789 # match
$1234,45.567
$1234
12.345
12345.6678
123$
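In Java the same pattern can be used with replaceAll. This is only a sketch: the class name, the method name and the "###" placeholder are assumptions, since the question does not say what the replacement should be.

```java
import java.util.regex.Pattern;

class NumberMasker {
    // Runs of 3+ digits delimited by word boundaries, not preceded
    // or followed by '$' or '.'.
    private static final Pattern LONG_NUMBER =
            Pattern.compile("(?<![$.])\\b\\d{3,}\\b(?![$.])");

    // Hypothetical helper: replaces every qualifying number with "###".
    static String mask(String text) {
        return LONG_NUMBER.matcher(text).replaceAll("###");
    }

    public static void main(String[] args) {
        System.out.println(mask("id 3452678916381914")); // id ###
        System.out.println(mask("1234 56789"));          // ### ###
        System.out.println(mask("$1234,45.567"));        // $1234,45.567
        System.out.println(mask("12345.6678"));          // 12345.6678
    }
}
```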
I'm writing a Java code using regex to parse a content-page extracted from a PDF document.
In each string the regex must match a number (up to three digits) followed by one or more spaces, followed by one or more words (a word being any sequence of characters), or vice versa: word(s), space(s), digit(s). All three parts must be present in the string. It should also allow leading spaces and be case-insensitive.
The extracted content-page could look something like this:
Directors’ responsibilities 8
Corporate governance 9
Remuneration report 10
The numbering style is not consistent and the number of spaces between the digits and the text varies, so it could also look like:
01 Contents
02 Strategy and highlights
04 Chairman’s statement
The regex I'm using matches any number of words followed by any number of spaces and then a number of no more than 3 digits:
(?i)([a-z\\s])*[0-9]{1,3}(?i)
It works, but not quite well, and I can't tell what I'm doing wrong. I also wish there were a way to detect both numbering styles (page numbers to the left or to the right of the text) without repeating the regex and flipping the order.
Cheers
If you want to match phrases you should include any punctuation you want to match in your regex. AFAIK there is no way in regex to say if a phrase is "before or after", so you should flip one and append it with a |. Something along the lines of:
[a-zA-Z'".,!\s]+\d{1,3}|\d{1,3}[a-zA-Z'".,!\s]+
Also, you don't need two instances of (?i), as the regex will apply the case insensitivity until the end of the string or if it encounters a (?-i).
You can use this pattern in multiline mode, if there is always a number before or after each item:
"^(?:(?<nb1>\\d{1,3}) +)?(?<item>\\S+(?: +\\S+)*?)(?: +(?<nb2>\\d{1,3})|$)"
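Then you can use m.group("nb1") or, when it is null, m.group("nb2") to obtain the number for each match (in Java, a group that did not take part in the match returns null, so the two cannot simply be concatenated).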
But if you must check that there is at least one number, you must repeat the whole pattern:
"^(?:(?<nb1>\\d{1,3}) +(?<item1>\\S+(?: +\\S+)*)|(?<item2>\\S+(?: +\\S+)*) +(?<nb2>\\d{1,3}))$"
Then:
item = m.group("item1") != null ? m.group("item1") : m.group("item2");
nb = m.group("nb1") != null ? m.group("nb1") : m.group("nb2");
Note: since the patterns are anchored at the beginning and at the end, you may have to add optional spaces to make them work: ^\\s* and \\s*$
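Put together, the second pattern could be applied line by line roughly as follows. This is a sketch under a few assumptions: the input has already been split into lines, [ \\t]* is used for the optional leading and trailing spaces so that the match cannot run across line breaks, the unbalanced (?: group is closed before the final anchor, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TocLineParser {
    // Either "<number> <title>" or "<title> <number>", with optional
    // surrounding spaces or tabs.
    private static final Pattern TOC_LINE = Pattern.compile(
            "^[ \\t]*(?:(?<nb1>\\d{1,3}) +(?<item1>\\S+(?: +\\S+)*)"
            + "|(?<item2>\\S+(?: +\\S+)*) +(?<nb2>\\d{1,3}))[ \\t]*$");

    // Hypothetical helper: returns {pageNumber, title}, or null when the
    // line does not contain both parts.
    static String[] parse(String line) {
        Matcher m = TOC_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // A named group that did not participate in the match is null.
        String nb = m.group("nb1") != null ? m.group("nb1") : m.group("nb2");
        String item = m.group("item1") != null ? m.group("item1") : m.group("item2");
        return new String[] { nb, item };
    }

    public static void main(String[] args) {
        String[] left = parse("  02 Strategy and highlights");
        String[] right = parse("Corporate governance 9");
        System.out.println(left[0] + " | " + left[1]);   // 02 | Strategy and highlights
        System.out.println(right[0] + " | " + right[1]); // 9 | Corporate governance
        System.out.println(parse("Contents"));           // null
    }
}
```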
Java is hanging with 100% CPU usage when I use the below string as input for a regular expression.
RegEx Used:
Here is the regular expression used for the description field in my application.
^([A-Za-z0-9\\-\\_\\.\\&\\,]+[\\s]*)+
String used for testing:
SaaS Service VLAN from Provider_One
2nd attempt with Didier SPT because the first one he gave me was wrong :-(
It works properly when I split the same string in different combinations. Like "SaaS Service VLAN from Provider_One", "first one he gave me was wrong :-(", etc. Java is hanging only for the above given string.
I also tried optimizing the regex as below.
^([\\w\\-\\.\\&\\,]+[\\s]*)+
Even this does not work.
Another classic case of catastrophic backtracking.
You have nested quantifiers that cause a gigantic number of permutations to be checked when the regex arrives at the : in your input string which is not part of your character class (assuming you're using the .matches() method).
Let's simplify the problem to this regex:
^([^:]+)+$
and this string:
1234:
The regex engine needs to check
1234 # no repetition of the capturing group
123 4 # first repetition of the group: 123; second repetition: 4
12 34 # etc.
12 3 4
1 234
1 23 4
1 2 34
1 2 3 4
...and that's just for four characters. On your sample string, RegexBuddy aborts after 1 million attempts. Java will happily keep on chugging... before finally admitting that none of these combinations allows the following : to match.
How can you solve this?
You can forbid the regex from backtracking by using possessive quantifiers:
^([A-Za-z0-9_.&,-]++\\s*+)+
will allow the regex to fail faster. Incidentally, I removed all those unnecessary backslashes.
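A quick way to try the possessive version is sketched below. It assumes the test input is the visible text of the two lines from the question joined by a space, and the class and method names are made up for the example. The original, non-possessive pattern is deliberately not run here because it may not return in reasonable time.

```java
import java.util.regex.Pattern;

class DescriptionCheck {
    // Possessive quantifiers (++ and *+) never give back characters once
    // matched, so the engine cannot re-split the words in different ways.
    private static final Pattern DESCRIPTION =
            Pattern.compile("^([A-Za-z0-9_.&,-]++\\s*+)+");

    // Hypothetical helper: true if the whole string matches.
    static boolean isValid(String s) {
        return DESCRIPTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        String longInput = "SaaS Service VLAN from Provider_One "
                + "2nd attempt with Didier SPT because the first one "
                + "he gave me was wrong :-(";
        System.out.println(isValid("SaaS Service VLAN from Provider_One")); // true
        System.out.println(isValid(longInput)); // false, and it fails quickly
    }
}
```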
Edit:
A few measurements:
On the string "was wrong :-)", it takes RegexBuddy 862 steps to figure out a non-match.
For "me was wrong :-)", it's 1,742 steps.
For "gave me was wrong :-)", 14,014 steps.
For "he gave me was wrong :-)", 28,046 steps.
For "one he gave me was wrong :-)", 112,222 steps.
For "first one he gave me was wrong :-)", >1,000,000 steps.
First, you need to realize that your regexes CANNOT match the supplied input string. The strings contain a number of characters ('<' '>' '/' ':' and ')') that are not "word" characters.
So why is it taking so long?
Basically "catastrophic backtracking". More specifically, the repeating structures of your regex give an exponential number of alternatives for the regex backtracking algorithm to try!
Here's what your regex says:
One or more word characters
Followed by zero or more space characters
Repeat the previous 2 patterns as many times as you like.
The problem is with the "zero or more space characters" part. The first time, the matcher will match everything up to the first unexpected character (i.e. the '<'). Then it will back off a bit and try again with a different alternative ... that involves "zero spaces" before the last letter, then when that fails, it will move the "zero spaces" back one position.
The problem is that for a string with N non-space characters, there are N different places where "zero spaces" can be matched, and that makes 2^N different combinations. That rapidly turns into a HUGE number as N grows, and the end result is hard to distinguish from an infinite loop.
Why are you matching whitespace separately from the other characters? And why are you anchoring the match at the beginning, but not at the end? If you want to make sure the string doesn't start or end with whitespace, you should do something like this:
^[A-Za-z0-9_.&,-]+(?:\s+[A-Za-z0-9_.&,-]+)*$
Now there's only one "path" the regex engine can take through the string. If it runs out of characters that match [A-Za-z0-9_.&,-] before reaching the end, and the next character doesn't match \s, it fails immediately. If it reaches the end while still matching whitespace characters, it fails because it's required to match at least one non-whitespace character after each run of whitespace.
If you want to make sure there's exactly one whitespace character separating the runs of non-whitespace, just remove the quantifier from \s+:
^[A-Za-z0-9_.&,-]+(?:\s[A-Za-z0-9_.&,-]+)*$
If you don't care where the whitespace is in relation to the non-whitespace, just match them all with the same character class:
^[A-Za-z0-9_.&,\s-]+$
I'm assuming you know that your regex won't match the given input because of the : and ( in the smiley, and you just want to know why it takes so long to fail.
And of course, since you're creating the regex in the form of a Java string literal, you would write:
"^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$"
or
"^[A-Za-z0-9_.&,-]+(?:\\s[A-Za-z0-9_.&,-]+)*$"
or
"^[A-Za-z0-9_.&,\\s-]+$"
(I know you had double backslashes in the original question, but that was probably just to get them to display properly, since you weren't using SO's excellent code formatting feature.)
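As a rough check of the first rewritten pattern, the following sketch runs it against a few inputs, including the tail of the problematic string. The class and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class UnrolledDescriptionCheck {
    // Runs of allowed characters separated by whitespace; each position in
    // the input can be consumed by only one part of the pattern.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$");

    // Hypothetical helper: true if the whole string matches.
    static boolean isValid(String s) {
        return DESCRIPTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("SaaS Service VLAN from Provider_One")); // true
        System.out.println(isValid("he gave me was wrong :-("));            // false
        System.out.println(isValid(" leading space"));                      // false
        System.out.println(isValid("trailing space "));                     // false
    }
}
```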
I need to check that a file contains some amounts that match a specific format:
between 1 and 15 characters (numbers or ",")
may contain at most one "," separator for decimals
must have at least one digit before the separator
this amount is supposed to be in the middle of a string, bounded by alphabetical characters (but we have to exclude the malformed files).
I currently have this:
\d{1,15}(,\d{1,14})?
But it does not match with the requirement as I might catch up to 30 characters here.
Unfortunately, for some reasons that are too long to explain here, I cannot simply pick a substring or use any other java call. The match has to be in a single, java-compatible, regular expression.
^(?=.{1,15}$)\d+(,\d+)?$
^ start of the string
(?=.{1,15}$) positive lookahead to make sure that the total length of string is between 1 and 15
\d+ one or more digit(s)
(,\d+)? optionally followed by a comma and more digits
$ end of the string (still required: the lookahead only checks the total length, not that the remaining characters are digits or a comma).
You might have to escape backslashes for Java: ^(?=.{1,15}$)\\d+(,\\d+)?$
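Here is a small sketch of the anchored pattern in Java, checking a few amounts against the stated rules (the class and method names are invented for illustration):

```java
import java.util.regex.Pattern;

class AmountCheck {
    // The lookahead limits the total length to 1..15 characters; the rest
    // requires digits with at most one comma followed by more digits.
    private static final Pattern AMOUNT =
            Pattern.compile("^(?=.{1,15}$)\\d+(,\\d+)?$");

    // Hypothetical helper: true if the whole string is a well-formed amount.
    static boolean isAmount(String s) {
        return AMOUNT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAmount("123,45"));           // true
        System.out.println(isAmount("1234567890123456")); // false: 16 characters
        System.out.println(isAmount(",45"));              // false: no digit before the comma
        System.out.println(isAmount("12,34,56"));         // false: two separators
    }
}
```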
update: If you're looking for this in the middle of another string, use word boundaries \b instead of string boundaries (^ and $).
\b(?=[\d,]{1,15}\b)\d+(,\d+)?\b
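One caveat with \b: letters and digits are both word characters, so there is no word boundary between them, and an amount directly adjacent to letters (as in "abc123,45def") would not be found. The sketch below is a variant, not the pattern from this answer: it swaps the word boundaries for lookarounds that only require that the amount is not preceded or followed by another digit or comma. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedAmountFinder {
    // (?<![0-9,])  the amount must not be preceded by a digit or comma
    // (?=...)      the whole run of digits/commas is 1..15 characters long
    // (?![0-9,])   the amount must not be followed by a digit or comma
    private static final Pattern AMOUNT = Pattern.compile(
            "(?<![0-9,])(?=[0-9,]{1,15}(?![0-9,]))[0-9]+(?:,[0-9]+)?(?![0-9,])");

    // Hypothetical helper: returns the first well-formed amount, or null.
    static String firstAmount(String s) {
        Matcher m = AMOUNT.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount("abc123,45def"));           // 123,45
        System.out.println(firstAmount("abc1234567890123456def")); // null: 16 digits
        System.out.println(firstAmount("abc1,2,3def"));            // null: two separators
    }
}
```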
For java:
"\\b(?=[\\d,]{1,15}\\b)\\d+(,\\d+)?\\b"
More readable version:
"\\b(?=[0-9,]{1,15}\\b)[0-9]+(,[0-9]+)?\\b"