matching empty row with regex and skipping - java

I am trying to implement a regex match for an empty string coming from a CSV file whose last column holds the row number,
e.g.: "","","","","","","","","",5
The regex pattern I am using is (\W*\d\W). It works for now, but I am not sure whether in the long run it will reliably detect an empty row whose last column is a digit.
Could a better pattern be suggested? I am still new to regex.

You do not need a regex to match an empty string. Java has a dedicated method for that: isEmpty. Just call it on the string you want to check:
String str = ...;
if (str.isEmpty()) {
    // do something if the string is empty
} else {
    // do something if the string is not empty
}
The reason your regex does not do what you want is that it matches:
\W* - zero or more non-word characters
\d - any digit (0-9)
\W - one more non-word character
This will match something completely different. Your regex will match sequences like:
[[[[[9]
You can read more about what these regex symbols mean here.
If you want to match an empty String with regex you can try the following regex:
^$
Which means:
^ - match beginning of a line
$ - match end of a line
This matches only a line that has nothing between its beginning and its end, i.e. an empty line.
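As a minimal sketch of the two checks above, the following compares isEmpty() with the ^$ pattern. It also includes a hypothetical EMPTY_ROW pattern for the row format shown in the question. That pattern is an assumption, not part of the answer: it presumes every empty column is written exactly as "" and that the row number is an unsigned integer in the last column.

```java
import java.util.regex.Pattern;

class EmptyRowCheck {
    // Assumption: an "empty row" is one or more "" columns followed by a row number.
    static final Pattern EMPTY_ROW = Pattern.compile("^(?:\"\",)+\\d+$");

    // True when the line has nothing between its beginning and its end.
    static boolean isEmptyLine(String line) {
        return line.matches("^$");
    }

    // True when every column except the trailing row number is an empty quoted string.
    static boolean isEmptyRow(String row) {
        return EMPTY_ROW.matcher(row).matches();
    }

    public static void main(String[] args) {
        String row = "\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",5";
        System.out.println("".isEmpty());      // true
        System.out.println(isEmptyLine(""));   // true
        System.out.println(isEmptyRow(row));   // true
    }
}
```

If the real file quotes the row number or pads columns with spaces, the hypothetical pattern would need to be adjusted accordingly.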

Related

How to match all combinations of numbers in a string that do not start with an English letter in regular matching in Java

I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I just want to match the tokens in this string that start with digits, optionally ending with English letters,
and then prefix each matched token with two "?" characters ("??").
I wrote a Pattern like this:
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    String matchStr = matcher.group();
    System.err.println(matchStr);
}
But it only matches the first token, "305556710S".
If I change the Pattern to
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
it matches "305556710S", "100596269C" and "111111111". But "111111111" is preceded by the English letters "CN", so matching it is not my goal.
I only want to match "305556710S" and "100596269C" and add "??" before each match. Can somebody help me?
First, you should avoid the ^ in this particular regexp. As you noticed, you can't return more than one result, as "^" is an instruction for "match the beginning of the string"
Using \b can be a solution, but you may get invalid results. For example
305556710S or -100596269C OR CN111111111
The regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will match 100596269C (because the hyphen is not word character, so there is a word boundary between - and 1)
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind actual location. Note that lookbehinds don't add matching chars to the result: the space won't be part of the result
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
(?= |$) is a lookahead. It makes sure that either a space or $ (end of string) follows the match. As with lookbehinds, these chars aren't added to the result and the position stays the same: the space read here, for example, can also be read by the lookbehind of the next match.
Examples: 305556710S or 100596269C OR CN111111111 matches at index 0 [305556710S] and at index 14 [100596269C]; 100596269C123 does not match.
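A minimal sketch of how the lookaround pattern could be used to add the "??" prefix the question asks for. The class and method names are illustrative only; the replacement string uses $0, which Matcher.replaceAll expands to the whole match.

```java
import java.util.regex.Pattern;

class PrefixNumbers {
    // Lookaround pattern from the answer: a token delimited by string start/end or spaces.
    static final Pattern TOKEN = Pattern.compile("(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)");

    // Puts "??" in front of every matching token and leaves everything else untouched.
    static String addPrefix(String input) {
        return TOKEN.matcher(input).replaceAll("??$0");
    }

    public static void main(String[] args) {
        // prints ??305556710S or ??100596269C OR CN111111111
        System.out.println(addPrefix("305556710S or 100596269C OR CN111111111"));
    }
}
```

Because the pattern is compiled without Pattern.CASE_INSENSITIVE, only uppercase trailing letters are accepted here; add the flag if lowercase letters should count too.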
I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference here is that it will check only those character sequences that are within a pair of word boundaries. In the earlier pattern you used, a character sequence even from the middle of a word may be used to match against the pattern due to which even 11111... from CN1111... was matched against the pattern and it passed.
A word boundary also matches the end of the string input. So, even if a candidate word appears at the end of the line, it will get picked up.
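The behaviour of the word-boundary pattern can be checked with a short sketch like the one below. The helper name findAll is hypothetical. The second input is the hyphen example from the other answer, included to show that \b also accepts a token that follows a hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    static final Pattern BOUNDED =
            Pattern.compile("\\b[0-9]{1,10}[A-Z]{0,1}\\b", Pattern.CASE_INSENSITIVE);

    // Collects every substring that matches between two word boundaries.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = BOUNDED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [305556710S, 100596269C]
        System.out.println(findAll("305556710S or 100596269C OR CN111111111"));
        // also prints [305556710S, 100596269C]: the hyphen creates a word boundary
        System.out.println(findAll("305556710S or -100596269C OR CN111111111"));
    }
}
```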
If more than one English letter can appear at the end, remove the maximum-occurrence bound (1 in this case):
"\\b[0-9]{1,10}[A-Z]{0,}\\b"

Replace all non numeric characters by only one word

I want to do the following replacement:
WORD1234 -> W1234
So, I'm using the regex:
([^\d]*)([0-9]+)([^\d]*)
Replacement: W$2
If the word is WORD1234AAAAA, using the previous regex I have the same result: W1234, which is what I want.
But if the word is WO12RD34 the result I have is: W12W34
What I want basically in all the cases is to remove all non-numeric characters and add the letter W at the beginning.
Update:
The input string does not always start with a W. It can be for example ABC12DE34 and the desired result is: FA1234. Meaning, remove all non-numeric characters and add a word at the beginning.
Try this:
String regex = "(?<start>^W)|(\\D)";
String replacement = "${start}";
System.out.println("WO12RD34".replaceAll(regex, replacement)); //prints W1234
System.out.println("WORD1234AAAAA".replaceAll(regex, replacement)); //prints W1234
With this regex, the "start" capturing group is only set when a W at the very start of the string is matched. Otherwise, it is empty.
The idea is that, when the W at the start of the string is matched, the named group "start" captures that W, and the replacement ${start} simply puts it back.
Otherwise, when any other non-digit character is matched, the "start" group is not set (it is empty), so the non-digit character is replaced with nothing.
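For the updated requirement, where the input does not necessarily start with W, a simpler alternative (not the approach of the answer above) is to strip every non-digit and prepend the desired word. The sketch below assumes the prefix is known in advance; withPrefix is a hypothetical helper name.

```java
class DigitsWithPrefix {
    // Removes every non-digit character and prepends the given prefix.
    static String withPrefix(String prefix, String input) {
        return prefix + input.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        System.out.println(withPrefix("W", "WO12RD34"));   // prints W1234
        System.out.println(withPrefix("FA", "ABC12DE34")); // prints FA1234
    }
}
```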

regular expression retrieve a portion of a string that not contain a string

I have some strings like the following:
it.mycompany.db.beans.str1.PD_T_CLASS
it.mycompany.db.beans.join.PD_T_CLASS
it.mycompany.db.beans.str2.PD_T_CLASS_1
it.mycompany.db.beans.join.PD_T_CLASS_1
PD_T_CLASS myVar = new PD_T_CLASS();
myVar.setPD_T_CLASS(something);
and I want to select the "PD_" part and replace it with "" (the empty string), but only if the entire line does not contain the string ".join."
what I want to achieve is:
it.mycompany.db.beans.str1.T_CLASS
it.mycompany.db.beans.join.PD_T_CLASS
it.mycompany.db.beans.str2.T_CLASS_1
it.mycompany.db.beans.join.PD_T_CLASS_1
T_CLASS myVar = new T_CLASS();
myVar.setT_CLASS(something);
The substitution is not a problem since I'm using eclipse search tool and will hit replace as soon as it show me the right result.
I have tried:
^((?!\.join\.).)*(PD_)*$ // whole string selected
^((?!\.join\.).)*(\bPD_\b)*$ // whole string selected
I am starting to get frustrated, since I have searched around a bit (the ^((?!join bla bla part comes from those searches).
Can you help me?
You may use the following regex:
(?m)(?:\G(?!\A)|^(?!.*\.join\.))(.*?)PD_
and replace with
$1
See the regex demo
Details:
(?m) - a Pattern.MULTILINE inline modifier flag that will force ^ to match the beginning of a line rather than a whole string
(?:\G(?!\A)|^(?!.*\.join\.)) - either of the two alternatives:
\G(?!\A) - the end of the previous successful match
| - or
^(?!.*\.join\.) - start of a line that has no .join. text in it (as the (?!.*\.join\.) is a negative lookahead that will fail the match if it matches any 0+ chars other than line break chars (.*) and then .join.)
(.*?) - Capturing group #1 (referred to with the $1 backreference in the replacement pattern): any 0+ chars other than line breaks, as few as possible, up to the first occurrence of ...
PD_ - a literal PD_
The replacement is a $1 backreference to the first capturing group that will restore any text matched before PD_s.
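The question uses the Eclipse search tool, but the same pattern can be tried in plain Java. This is a minimal sketch using the sample lines from the question; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class StripPdPrefix {
    // Pattern from the answer; backslashes are doubled for the Java string literal.
    static final Pattern PD =
            Pattern.compile("(?m)(?:\\G(?!\\A)|^(?!.*\\.join\\.))(.*?)PD_");

    // Removes every PD_ on lines that do not contain ".join.".
    static String strip(String text) {
        return PD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "it.mycompany.db.beans.str1.PD_T_CLASS",
                "it.mycompany.db.beans.join.PD_T_CLASS",
                "PD_T_CLASS myVar = new PD_T_CLASS();");
        System.out.println(strip(text));
    }
}
```

Lines containing .join. are left untouched because neither alternative can start a match on them: the lookahead rejects the line start, and \G only matches where a previous match ended.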

capture all characters between match character (single or repeated) on string

I'm trying to extract the string preceding a specific character, even when that character is repeated (e.g. underscore '_'):
this_is_my_example_line_0
this_is_my_example_line_1_
this_is_my_example_line_2___
_this_is_my_ _example_line_3_
__this_is_my___example_line_4__
and after running my regex I should get this (the regex should ignore the any instances of the matching character in the middle of the string):
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4
In other words I'm trying to 'trim' the matched character(s) at the beginning and end of string.
I'm trying to use a Regex in Java to accomplish this, my idea is to capture the group of characters between the special character(s) at the end or beginning of the line.
So far I can only do this successfully for example 3 with this regexp:
/[^_]+|_+(.*)[_$]+|_$+/
[^_]+ not 'underscore' once or more
| OR
_+ underscore once or more
(.*) capture all characters
[_$]+ not 'underscore' once or more followed by end of line
|_$+ OR 'underscore' once or more followed by end of line
I just realized that this excludes the first word of the message in examples 0, 1 and 2, since the string doesn't start with an underscore and the regex only starts matching after finding one.
Is there an easier way not involving regex?
I don't really care about the first character (although it would be nice); I only need to ignore the repeated character at the end. According to this regex tester, would just /()_+$/ work? The empty parentheses match anything before a single or repeated match at the end of the line. Would that be correct?
Thank you!
There are a couple of options here: you could either replace matches of ^_+|_+$ with an empty string, or extract the contents of the first capture group from the match of ^_*(.*?)_*$. Note that if your strings may span multiple lines and you want to perform the replacement on each line, you will need the Pattern.MULTILINE flag for either approach. If your strings may span multiple lines and you only want the replacement to occur at the very beginning and end, don't use Pattern.MULTILINE, but use Pattern.DOTALL for the second approach.
For example: http://regexr.com?355ff
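Both options can be sketched in Java as follows, assuming single-line input strings (so neither Pattern.MULTILINE nor Pattern.DOTALL is needed). The method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrimUnderscores {
    static final Pattern CORE = Pattern.compile("^_*(.*?)_*$");

    // Option 1: delete leading and trailing runs of underscores.
    static String trimByReplace(String s) {
        return s.replaceAll("^_+|_+$", "");
    }

    // Option 2: capture everything between the leading and trailing underscores.
    static String trimByGroup(String s) {
        Matcher m = CORE.matcher(s);
        return m.matches() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        System.out.println(trimByReplace("__this_is_my___example_line_4__")); // this_is_my___example_line_4
        System.out.println(trimByGroup("_this_is_my_ _example_line_3_"));     // this_is_my_ _example_line_3
    }
}
```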
How about this pattern: [^_\n\r](.*[^_\n\r])?
Demo
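import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrimDemo {
    public static void main(String[] args) {
        String data =
                "this_is_my_example_line_0\n" +
                "this_is_my_example_line_1_\n" +
                "this_is_my_example_line_2___\n" +
                "_this_is_my_ _example_line_3_\n" +
                "__this_is_my___example_line_4__";
        // Each match starts and ends with a character that is neither '_' nor a line break.
        Pattern p = Pattern.compile("[^_\n\r](.*[^_\n\r])?");
        Matcher m = p.matcher(data);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}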
output:
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4

Check string contains whitespace along with some other char sequence using regex in java

I am using a regular expression to check whether a string contains whitespace.
My regex is: ^\\s+$
For example, if my string is "my name", the match should return true,
but it returns true only if the string consists entirely of spaces, with no other characters.
How can I check whether a string contains a space, tab or carriage return anywhere in it (in between, at the start or at the end)?
^(.*\s+.*)+$ seems to work for me. Accepts anything as long as there is at least one space in the string. This will match the entire string.
If you only want to check for the presence of a space, you can just use \s without any begin or end markers in the string. The difference is that this will only match the individual spaces.
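A minimal sketch of the difference, assuming the goal is a boolean "contains whitespace" check. An unanchored pattern must be used with Matcher.find(), because String.matches() and Matcher.matches() always require the whole string to match. The method names are illustrative.

```java
import java.util.regex.Pattern;

class WhitespaceCheck {
    static final Pattern WHITESPACE = Pattern.compile("\\s");

    // True if the string contains a space, tab, line break, etc. anywhere.
    static boolean containsWhitespace(String s) {
        return WHITESPACE.matcher(s).find();
    }

    // The pattern from the question: true only if the whole string is whitespace.
    static boolean isOnlyWhitespace(String s) {
        return s.matches("^\\s+$");
    }

    public static void main(String[] args) {
        System.out.println(containsWhitespace("my name")); // true
        System.out.println(isOnlyWhitespace("my name"));   // false
        System.out.println(containsWhitespace("myname"));  // false
    }
}
```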
Your regex is not correct.
That's a string representing a regular expression. (as tchrist pointed out correctly)
The corresponding pattern that you get when using Pattern.compile() matches only strings containing one or more whitespace characters, starting from the beginning until the end. Thus, the matching string only consists of whitespace characters.
Try this string instead for Pattern.compile():
"\\s+"
The difference is that without the anchors "^" and "$" there may be other characters around the whitespace. The whitespace character(s) may be anywhere in the string.
Using this pattern-string the whitespace character(s) must be at the beginning:
"^\\s+"
And here the sequence of whitespace characters has to be at the end:
"\\s+$"
Use org.apache.commons.lang.StringUtils.containsAny(). See http://commons.apache.org/lang/api-3.1/org/apache/commons/lang3/StringUtils.html.
