How to strip a string with a regular expression

I want to find all the negative floating point numbers in a string and store them in an array. I think my regex is correct, but something is wrong with my method.
Pattern pattern = Pattern.compile("[-]?[0-9]*[.][0-9]+$");
String[] results = pattern.split("|AAA--A A05_#A| |-999.999| |-55.7|");

Your regex anchors the match to the end of the string with $, which isn't what you want.
Likewise, Pattern.split doesn't do what you want: it returns the pieces of the input between matches, not the matches themselves. Here's some sample code to get you going:
Pattern pattern = Pattern.compile("[-]?[0-9]*[.][0-9]+");
String text = "|AAA--A A05_#A| |-999.999| |-55.7|";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}
This prints:
-999.999
-55.7
Obviously you could add to a list or something similar within the while loop. I don't know of any method which returns a collection of all the matches, but you could easily write a utility method to do that yourself, based on code like the above.
EDIT: As noted in comments, if you only want to find negative values, the - shouldn't be optional (and it doesn't need to be in a set, either - just -? would have been fine).
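The utility method mentioned above could look roughly like the following. This is a minimal sketch, not a library API: the class name NegativeFloats, the constant NEGATIVE_FLOAT and the helper findAll are names invented for illustration, and the pattern makes the minus sign mandatory as suggested in the edit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegativeFloats {
    // Mandatory minus sign, optional integer part, mandatory fractional part.
    static final Pattern NEGATIVE_FLOAT = Pattern.compile("-[0-9]*[.][0-9]+");

    // Hypothetical helper: collects every match of the pattern, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "|AAA--A A05_#A| |-999.999| |-55.7|";
        // Convert to an array, since the question asked for one.
        String[] results = findAll(NEGATIVE_FLOAT, text).toArray(new String[0]);
        System.out.println(String.join(", ", results)); // prints: -999.999, -55.7
    }
}
```

The results are kept as strings here; parsing them with Double.parseDouble would be a separate step if numeric values are needed.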

Related

Regex For All String Except Certain Characters

I am trying to write a regular expression that matches a certain word that is not preceded by 2 dashes (--) or a slash and a star (/*). I tried several expressions, but none seem to work.
Below is the text I am testing on:
a_func(some_param);
/* a comment initialization */
init;
I am trying to write a regex that matches only the word init on the last line. What I've tried so far matches init both in initialization and on the last line. I looked for existing answers and found some that used negative lookahead, but they still matched init in initialization. Below are the expressions I tried:
(?!\/\*).*(init)
[^(\-\-|\/\*)].*(init)
(?<!\/\*).*(init)
While reading regex101's quick reference, I found the negative lookbehind, which I believe had an example similar to what I need, but I still could not get what I want. Should I look into negative lookbehind more, or is this not how to achieve what I want?
My knowledge of regular expressions is not that extensive, so I don't know whether what I want is possible. Is it doable?

Assuming the -- or /* are on the same line as the init, there are some options. As the commenters said, multiline comments will likely require stronger techniques.
The simplest way I know is to actually preprocess the strings to remove the --.*$ and /\*.*$, then look for init (or init\b if you don't want to match initialization):
String input = "if init then foo";
String checkme = input.replaceAll("--.*$", "").replaceAll("/\\*.*$", "");
Pattern pattern = Pattern.compile("init"); // or "init\\b"
Matcher matcher = pattern.matcher(checkme);
System.out.println(matcher.find());
You can also use a negative lookahead, as in @olsli's answer.
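Applied to the multi-line sample from the question, the preprocessing idea might be sketched as follows. The class name InitFinder and the method containsInit are hypothetical. The sketch assumes each comment ends on the line where it starts, and it uses the (?m) flag so that $ matches at the end of every line.

```java
import java.util.regex.Pattern;

public class InitFinder {
    // Whole word "init" only, so "initialization" does not count.
    static final Pattern INIT_WORD = Pattern.compile("\\binit\\b");

    // Hypothetical helper: drop everything from "--" or "/*" to the end of each line,
    // then search what is left for the word "init".
    static boolean containsInit(String source) {
        String stripped = source.replaceAll("(?m)(--|/\\*).*$", "");
        return INIT_WORD.matcher(stripped).find();
    }

    public static void main(String[] args) {
        String sample = "a_func(some_param);\n/* a comment initialization */\ninit;";
        System.out.println(containsInit(sample));       // prints: true
        System.out.println(containsInit("/* init */")); // prints: false
    }
}
```

Comments that span several lines would need a different approach, as noted above.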

You can start with:
String input = "/*init";
Pattern pattern = Pattern.compile("^((?!--|\\/\\*).)*init");
Matcher matcher = pattern.matcher(input);
System.out.println(matcher.find());

I have added more parentheses to separate things out. This should also work; I tested it in Regexr and IDEONE:
// The negative lookahead is (?!...), and . must be outside brackets to mean "any character"
Pattern p = Pattern.compile("^(?!((\\/\\*)|(--)))(.*?)init.*$", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
String s = "/* Initialisation";
Matcher m = p.matcher(s);
System.out.println(m.find()); // find() returns a boolean; false here because the line starts with /*

How can I correct my regular expression in Java?

I ran into a problem when trying to extract a segment from a string using Java. The original string looks like test/data/20/0000893220-97-000850.txt, and I want to extract the segment that comes after the third /.
My regular expression is like
String m_str = "test/data/20/0000893220-97-000850.txt";
Pattern reg = Pattern.compile("[.*?].txt");
Matcher matcher = reg.matcher(m_str);
System.out.println(matcher.group(0));
The expected result is 0000893220-97-000850, but obviously, I failed. How can I correct this?

[^\/]+$
https://regex101.com/r/tS4nS2/2
This will extract the last segment of the string, i.e. everything after the final slash. That works well if the last segment is what you want, as opposed to specifically the one after the third slash.
To find and extract the match, you don't need a capturing group (hence, no ()). However, you need to tell the matcher to search for the pattern within the string, since .matches() attempts to match the entire string. Here is the relevant bit:
matcher.find(); // finds the next occurrence of the pattern in the string
System.out.println(matcher.group()); // returns the entire occurrence
Note the lack of an index inside the call to .group().
On a separate note, in Java you don't necessarily need a regex; extracting a segment can be done with plain String.split (which takes a String, not a char):
String matched = m_str.split("/")[3];
This would capture the segment after the third /, while
String[] matches = m_str.split("/");
String matched = matches[matches.length - 1];
would give you the last part.
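Both approaches above return the segment including its .txt extension, whereas the question expects 0000893220-97-000850. One possible way to drop the extension is sketched below. It departs from the answer by using a capturing group, and the names LastSegment, NAME and baseName are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastSegment {
    // Group 1 captures the run of non-slash characters just before a trailing ".txt".
    static final Pattern NAME = Pattern.compile("([^/]+)\\.txt$");

    // Hypothetical helper: returns the file name without ".txt", or null if nothing matches.
    static String baseName(String path) {
        Matcher matcher = NAME.matcher(path);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String m_str = "test/data/20/0000893220-97-000850.txt";
        System.out.println(baseName(m_str)); // prints: 0000893220-97-000850
    }
}
```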

Can I use regex to match every third occurrence of a specific character?

I have a string containing some delimited values:
1.95;1.99;1.78;10.9;11.45;10.5;25.95;26;45;21.2
What I'd like to achieve is a split at every third occurrence of a semicolon, so my resulting String[] should contain this:
result[0] = "1.95;1.99;1.78";
result[1] = "10.9;11.45;10.5";
result[2] = "25.95;26;45";
result[3] = "21.2";
So far I've tried several regex solutions, but all I could get to was finding any patterns that are between the semi colons. For example:
(?<=^|;)[^;]*;?[^;]*;?[^;]*
Which matches the values I want, so that makes it impossible to use split() or am I missing something?
Unfortunately I can only supply the pattern used and have no possibility to add some looping through results of the above pattern.

String re = "(?<=\\G[^;]*;[^;]*;[^;]*);";
String text = "1.95;1.99;1.78;10.9;11.45;10.5;25.95;26;45;21.2";
String[] result = Pattern.compile(re).split(text);
Now the result is what you want.
Hint: \G in Java's regex is a boundary matcher like ^; it means "the end of the previous match".

You can try something like this instead:
String s = "1.95;1.99;1.78;10.9;11.45;10.5;25.95;26;45;21.2";
Pattern p = Pattern.compile(".*?;.*?;.*?;");
Matcher m = p.matcher(s);
int lastEnd = 0; // stays 0 if nothing matches, so substring() below cannot throw
while (m.find()) {
    System.out.println(m.group());
    lastEnd = m.end();
}
System.out.println(s.substring(lastEnd));
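If looping over matches is an option, the same idea can be written so that each match is a group of up to three values without the trailing semicolon. This is only a sketch: the names EveryThird, GROUP_OF_THREE and chunks are illustrative, and it assumes that no value is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EveryThird {
    // One value, optionally followed by up to two more ";value" pairs.
    static final Pattern GROUP_OF_THREE = Pattern.compile("[^;]+(?:;[^;]+){0,2}");

    // Hypothetical helper: collects the groups in order; the last one may be shorter.
    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP_OF_THREE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "1.95;1.99;1.78;10.9;11.45;10.5;25.95;26;45;21.2";
        // Prints the four groups from the question, one per line.
        for (String chunk : chunks(s)) {
            System.out.println(chunk);
        }
    }
}
```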

You are correct. Since Java doesn't support indefinite-length lookbehind assertions (which you need if you want to check whether there are 3, 6, 9 or 3*n values before the current semicolon), you can't use split() for this. Your regex works perfectly with a "find all" approach, but if you can't apply that in your situation, you're out of luck.
In other languages (.NET-based ones, for example), the following regex would work:
;(?<=^(?:[^;]*;[^;]*;[^;]*;)*)

Would something like:
([0-9.]*;){3}
not work for your needs? The caveat is that there will be a trailing ; at the end of the group. You might be able to tweak the expression to trim that off however.
I just reread your question, and although this simple expression will work for matching groups, if you need to supply it to the split() method, it will unfortunately not do the job.

Using backreference to refer to a pattern rather than actual match

I am trying to write a regex which would match a (not necessarily repeating) sequence of text blocks, e.g.:
foo,bar,foo,bar
My initial thought was to use backreferences, something like
(foo|bar)(,\1)*
But it turns out that this regex only matches foo,foo or bar,bar but not foo,bar or bar,foo (and so on).
Is there any other way to refer to a part of a pattern?
In the real world, foo and bar are 50+ character long regexes and I simply want to avoid copy pasting them to define a sequence.

With a decent regex flavor you could use (foo|bar)(?:,(?-1))* or the like.
But Java does not support subpattern calls.
So you are left with a choice: build the pattern with String replace/format as in ajx's answer, or make the comma conditional if you know when it should be present and when not. For example:
(?:(?:foo|bar)(?:,(?!$|\s)|))+

Perhaps you could build your regex bit by bit in Java, as in:
String subRegex = "foo|bar";
String fullRegex = String.format("(%1$s)(,(%1$s))*", subRegex);
The second line could be factored out into a function. The function would take a subexpression and return a full regex that would match a comma-separated list of subexpressions.
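Factored out into a function, that might look like the sketch below. The names ListRegex and commaSeparated are hypothetical. As a small deviation from the snippet above, it uses non-capturing groups so that the alternation inside the sub-expression stays contained and no extra capture groups are added.

```java
import java.util.regex.Pattern;

public class ListRegex {
    // Hypothetical helper: builds a pattern for "item(,item)*" from the regex of one item.
    static Pattern commaSeparated(String subRegex) {
        return Pattern.compile(String.format("(?:%1$s)(?:,(?:%1$s))*", subRegex));
    }

    public static void main(String[] args) {
        Pattern list = commaSeparated("foo|bar");
        // matches() requires the whole input to be a comma-separated list of items.
        System.out.println(list.matcher("foo,bar,foo,bar").matches()); // prints: true
        System.out.println(list.matcher("foo,baz").matches());         // prints: false
    }
}
```

The long sub-expression is then written only once, which was the goal stated in the question.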

The point of a backreference is to match the actual text that was matched, not the pattern, so I don't think you can use it here.
Can you use quantifiers instead, like this:
String s = "foo,bar,foo,bar";
String externalPattern = "(foo|bar)"; // comes from somewhere else
// Repeat ",<pattern>" as a group so that each further item needs its own comma
Pattern p = Pattern.compile(externalPattern + "(," + externalPattern + ")+");
Matcher m = p.matcher(s);
boolean b = m.find();
which would match two or more instances of foo or bar separated by commas.

How to make regex matching fail if checked string still has leftover characters?

I'm trying to check a string with a regular expression, and this check should only pass if the string contains only *h, *d, *w and/or *m where * can be any number.
So far I've got this:
Pattern p = Pattern.compile("([0-9]h)|([0-9]d)|([0-9]w)|([0-9]m)");
Matcher m = p.matcher(strToCheck);
if (m.find()) {
    // matching successful code
}
And it works to detect if there are any of the number-letter combinations present in the checked string, but it also works if the input is, for instance, "12x5d", because it has "5d" in it. I don't know if this is a code problem or a regex problem. Is there a way to achieve what I want?
EDIT:
Thank you for your answers so far, but as requested, I'll try to clarify a bit. A string like "1w 2d 3h" or "1w 1w" is valid and should pass, but something like "1w X 2d 3h", "1wX 2d" or "w d h" should fail.

Use m.matches(), or add ^ and $ to the beginning and end of the regex, respectively.
Edit: if you want sequences of these delimited by whitespace (as mentioned in the comments), you can use
Pattern p = Pattern.compile("\\b\\d[hdwm]\\b");
Matcher m = p.matcher(strToCheck);
while (m.find()) {
    // matching successful code
}

Firstly, I think you should use matches() instead of find(). The former matches the entire string against the regex, whereas the latter searches within the string.
Secondly, you can simplify the regex like so: "[0-9][hdwm]".
Finally, if the number can contain multiple digits, use the + operator: "[0-9]+[hdwm]"
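Putting these points together with the clarification in the question's edit, a whole-string check might be sketched as follows. The names DurationCheck, DURATIONS and isValid are invented for this example, and the sketch assumes that the units are separated by whitespace.

```java
import java.util.regex.Pattern;

public class DurationCheck {
    // One "<digits><unit>" token, followed by any number of whitespace-separated tokens.
    static final Pattern DURATIONS = Pattern.compile("\\d+[hdwm](?:\\s+\\d+[hdwm])*");

    // Hypothetical helper: matches() succeeds only if the entire string fits the pattern.
    static boolean isValid(String input) {
        return DURATIONS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1w 2d 3h"));   // prints: true
        System.out.println(isValid("1w X 2d 3h")); // prints: false
        System.out.println(isValid("12x5d"));      // prints: false
    }
}
```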

Try this:
Pattern p = Pattern.compile("[0-9][hdwm]");
Matcher m = p.matcher(strToCheck);
if (m.matches()) {
    // matching successful code
}

If you want to accept things like 5d only as a complete word, rather than as part of one, you can use the \b "word boundary" marker in the regex:
Pattern p = Pattern.compile("\\b[0-9][hdwm]\\b");
This will let you match a string like "Dimension: 5h" while rejecting a string like "Dimension: 12wx5h". The character class [hdwm] keeps both \b anchors attached to every unit letter, which a top-level | between alternatives would not.
(If, on the other hand, you only want to match if the entire string is just 5d or the like, then use matches() as others have suggested.)

You can write it like this: "^\\d+[hdwm]$". This should match only the desired strings.
