Java regex pattern matcher

I have a string of the following format:
String name = "A|DescA+B|DescB+C|DescC+...X|DescX+"
So the repeating pattern is ?|?+, and I don't know how many repetitions there will be. The part I want to extract is the part before each |, so for my example I want to extract a list (an ArrayList, for example) that will contain:
[A, B, C, ... X]
I have tried the following pattern:
(.+)\\|.*\\+
but that doesn't work the way I want it to. Any suggestions?

To convert this into a list, you can do the following:
String name = "A|DescA+B|DescB+C|DescC+X|DescX+";
Matcher m = Pattern.compile("([^|]+)\\|.*?\\+").matcher(name);
List<String> matches = new ArrayList<String>();
while (m.find()) {
matches.add(m.group(1));
}
This gives you the list:
[A, B, C, X]
Note the ? in the middle: it prevents the second part of the regex from consuming the entire string, since it makes the * lazy instead of greedy.
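To see the difference the ? makes, here is a small sketch that runs the greedy and the lazy variant over the same input. The class and method names (GreedyVsLazy, firstGroups) are made up for this illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsLazy {
    // Collects group 1 of every match of regex in input.
    static List<String> firstGroups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String name = "A|DescA+B|DescB+C|DescC+X|DescX+";
        // Greedy .* runs on to the last +, so the whole string is one match.
        System.out.println(firstGroups("([^|]+)\\|.*\\+", name));  // [A]
        // Lazy .*? stops at the first +, so every item gets its own match.
        System.out.println(firstGroups("([^|]+)\\|.*?\\+", name)); // [A, B, C, X]
    }
}
```

The only difference between the two calls is the ? after .*, which is what lets find() be called once per item.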

You are consuming any character (.), and that includes the |. So the greedy .+ first munches the whole string and then backtracks only as far as the last |, which means the group captures everything up to that last | instead of just the first part.
So, try to match any character but | like this:
"([^|]+)\\|.*\\+"
And if it fits, make sure your all-but-| is at the beginning of the string using ^ and that there's a + at the end of the string with $:
"^([^|]+)\\|.*\\+$"
UPDATE: Tim Pietzcker makes a good point: since you are already matching until you find a |, you could just as well match the rest of the string and be done with it:
"^([^|]+).*\\+$"
UPDATE2: By the way, if you want to simply get the first part of the string, you can simplify things with:
myString.split("\\|")[0]
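As a rough check of the points above, this sketch prints what the first group captures for the original pattern and for the negated-class version. BacktrackDemo and firstGroup are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BacktrackDemo {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String name = "A|DescA+B|DescB+C|DescC+X|DescX+";
        // Greedy .+ backtracks only to the last |.
        System.out.println(firstGroup("(.+)\\|.*\\+", name));      // A|DescA+B|DescB+C|DescC+X
        // [^|]+ cannot cross a |, so it stops after the first part.
        System.out.println(firstGroup("^([^|]+)\\|.*\\+$", name)); // A
        // split gives the same first part without a capture group.
        System.out.println(name.split("\\|")[0]);                  // A
    }
}
```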

Another idea: Find all characters between + (or start of string) and |:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?<=^|[+])[^|]+");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}

I think the easiest solution would be to split by \\+, then for each part apply the (.+?)\\|.* pattern to extract the group you need.
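A minimal sketch of that idea, assuming every part has the form key|description. The names SplitThenMatch and extract are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitThenMatch {
    private static final Pattern PART = Pattern.compile("(.+?)\\|.*");

    // Splits on + and keeps the text before the first | of each part.
    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        for (String part : input.split("\\+")) {
            Matcher m = PART.matcher(part);
            if (m.matches()) {          // parts without a | are skipped
                out.add(m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("A|DescA+B|DescB+C|DescC+X|DescX+")); // [A, B, C, X]
    }
}
```

The trailing + does not produce an extra element, because String.split drops trailing empty strings by default.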

Related

java regexp get more than need

I have the following regexp
http://[a-z./].*(js)
and the string
efwefewfhttp://assets.main.com/zepto-1.1.3.min.js fffhttp://assets.main.com/zepto-1.1.3.min.js
Code:
List<String> kk = new ArrayList<String>();
while (urlMatcher.find()){
kk.add(urlMatcher.group());
}
The output of this regexp is
http://assets.main.com/zepto-1.1.3.min.js fffhttp://assets.main.com/zepto-1.1.3.min.js
but there should be 2 strings in the result.
How do I change the regexp to get two strings as the result?
Use the following regex with a lazy dot matching pattern (note the ? after .*):
http://[a-z./].*?js
With this, you will match http://assets.main.com/zepto-1.1.3.min.js and http://assets.main.com/zepto-1.1.3.min.js.
The thing is that .* first matches the whole line and then backtracks until the right-hand pattern can match. Thus it finds the longest possible substring (from the leftmost http:// up to the rightmost js). Lazy matching goes from the leftmost position to the first occurrence of the next subpattern, yielding 2 matches.
See Watch Out for The Greediness! section.
Also, since these are links, and there should be no spaces, you can use the \S (non-whitespace) shorthand char class:
http://[a-z./]\S*\.js
Also, the literal dot can be matched with \.. See another demo.
Lazy or greedy dot matching is best avoided wherever a more specific class will do, because of the heavy backtracking it can involve.
Sample code:
String str = "efwefewfhttp://assets.main.com/zepto-1.1.3.min.js fffhttp://assets.main.com/zepto-1.1.3.min.js";
Pattern ptrn = Pattern.compile("http://[a-z./]\\S*\\.js");
Matcher urlMatcher = ptrn.matcher(str);
List<String> kk = new ArrayList<String>();
while (urlMatcher.find()){
kk.add(urlMatcher.group());
}
System.out.println(kk);
// [http://assets.main.com/zepto-1.1.3.min.js, http://assets.main.com/zepto-1.1.3.min.js]
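For comparison, here is a sketch that runs the three patterns discussed above over the same input and collects the matches. UrlMatchDemo and findAll are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlMatchDemo {
    // Returns every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "efwefewfhttp://assets.main.com/zepto-1.1.3.min.js "
                   + "fffhttp://assets.main.com/zepto-1.1.3.min.js";
        System.out.println(findAll("http://[a-z./].*(js)", str).size());    // 1 (greedy)
        System.out.println(findAll("http://[a-z./].*?js", str).size());     // 2 (lazy)
        System.out.println(findAll("http://[a-z./]\\S*\\.js", str).size()); // 2 (\S class)
    }
}
```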

Regex matching up to a character if it occurs

I need to match a string as follows:
match everything up to ;
If - occurs, match only up to -, excluding -
E.g.:
abc; should return abc
abc-xyz; should return abc
Pattern.compile("^(?<string>.*?);$");
Using the above I can achieve half of it, but I don't know how to change this pattern to meet the second requirement. How do I change .*? so that it stops at the first occurrence of -?
I am not good with regex. Any help would be great.
EDIT
I need to capture it as a group. I can't change that, since there are many other patterns to match and capture. I have posted only part of it.
The code looks something like this:
public static final Pattern findString = Pattern.compile("^(?<string>.*?);$");
if(findString.find())
{
return findString.group("string"); //cant change anything here.
}
Just use a negated char class.
^[^-;]*
ie.
Pattern p = Pattern.compile("^[^-;]*");
Matcher m = p.matcher(str);
while(m.find()) {
System.out.println(m.group());
}
This matches zero or more characters at the start of the string that are neither - nor ;.
This should do what you are looking for:
[^-;]*
It matches characters that are not - or ;.
Tip: If you are not confident with regular expressions, there are great online tools for testing your input, e.g. https://regex101.com/
UPDATE
I see you have an issue in the code since you try to access .group in the Pattern object, while you need to use the .group method of the Matcher object:
public static String GetTheGroup(String str) {
Pattern findString = Pattern.compile("(?s)^(?<string>.*?)[;-]");
Matcher matcher = findString.matcher(str);
if (matcher.find())
{
return matcher.group("string"); //you have to change something here.
}
else
return "";
}
And call it as
System.out.println(GetTheGroup("abc-xyz;"));
OLD ANSWER
Your ^(?<string>.*?);$ regex only matches 0 or more characters other than a newline from the beginning up to the first ; that is the last character in the string. I guess it is not what you expect.
You should learn more about using character classes in regex, as you can match 1 symbol from a specified character set that is defined with [...].
You can achieve this with a String.split taking the first element only and a [;-] regex that matches a ; or - literally:
String res = "abc-xyz;".split("[;-]")[0];
System.out.println(res);
Or with replaceAll and the (?s)[;-].*$ regex (which matches the first ; or - and then anything up to the end of the string):
res = "abc-xyz;".replaceAll("(?s)[;-].*$", "");
System.out.println(res);
I have found a solution that keeps the named group.
(?<string>.*?) matches everything up to the next part of the pattern
(?:-.*?)? is a non-capturing group that starts with - and occurs zero or one time
; is the end character.
So putting it all together:
public static final Pattern findString = Pattern.compile("^(?<string>.*?)(?:-.*?)?;$");
Matcher m = findString.matcher(str); // str is the input; find() and group() live on Matcher, not Pattern
if(m.find())
{
return m.group("string"); // group name is unchanged
}
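The pattern with the optional non-capturing group can be tried on both inputs from the question with a small sketch like this one. StopAtDash and extract are names invented for the example, and returning null on no match is an assumption about what the caller wants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StopAtDash {
    private static final Pattern FIND_STRING =
            Pattern.compile("^(?<string>.*?)(?:-.*?)?;$");

    // Returns the named group "string", or null when the input does not match.
    static String extract(String input) {
        Matcher m = FIND_STRING.matcher(input);
        return m.find() ? m.group("string") : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("abc;"));     // abc
        System.out.println(extract("abc-xyz;")); // abc
    }
}
```

Because .*? is lazy, the engine tries the shortest prefix first and only extends it until "optional -... part, then ; at the end" can match, which is why the group stops at the first -.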

Split to ArrayList using pattern matcher in Java

String s = "A..?-B^&';(,,,)G56.6C,,,M4788C..,,A1''";
String[] result = s.split("(?=[ABC])");
System.out.println(Arrays.toString(result));
Output:
[A..?-, B^&';(,,,)G56.6, C,,,M4788, C..,,, A1'']
Please refer to the split in the above case. I am trying to separate the string at each A, B or C. How can I get the same split strings into an ArrayList using Pattern and Matcher? I could not figure out how to group in the code below.
Pattern p = Pattern.compile("(?=[ABC])");
Matcher m = p.matcher(s);
List<String> matches = new ArrayList<>();
while (m.find()) {
matches.add(m.group());
}
Also, suppose I have a few characters before the first occurrence of A, B or C, and I want to combine them with the first element in the ArrayList, e.g. ,,A..
I'd appreciate the help.
[ABC][^ABC]*
If I didn't omit any edge case, that should work with the code you provided.
For the extra question, you could possibly add (^[^ABC]*)* to the beginning, but that makes it slower and less readable, not to mention it will only work for single-line strings. I would recommend just parsing the beginning characters manually, treating it as the special case it is.
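Here is one way this could look with the loop from the question. Note that the sketch uses an optional non-capturing prefix, (?:^[^ABC]*)?, as a variant of the prefix suggested above, and the names AbcSplit and split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AbcSplit {
    // Optional leading run of non-ABC characters (only possible at the start),
    // then one of A, B, C, then everything up to the next A, B or C.
    private static final Pattern CHUNK =
            Pattern.compile("(?:^[^ABC]*)?[ABC][^ABC]*");

    static List<String> split(String s) {
        List<String> matches = new ArrayList<>();
        Matcher m = CHUNK.matcher(s);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "A..?-B^&';(,,,)G56.6C,,,M4788C..,,A1''";
        System.out.println(split(s));        // [A..?-, B^&';(,,,)G56.6, C,,,M4788, C..,,, A1'']
        System.out.println(split(",," + s)); // first element becomes ,,A..?-
    }
}
```

A string that contains no A, B or C at all yields an empty list here, so that case would still need separate handling.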

Regular expression matching "dictionary words"

I'm a Java user but I'm new to regular expressions.
I just want to have a tiny expression that, given a word (we assume that the string is only one word), answers with a boolean telling whether the word is valid or not.
An example... I want to catch all words that could plausibly be in a dictionary. So I just want words with chars from a-z and A-Z, a hyphen (for example: man-in-the-middle) and an apostrophe (like I'll or Tiffany's).
Valid words:
"food"
"RocKet"
"man-in-the-middle"
"kahsdkjhsakdhakjsd"
"JESUS", etc.
Non-valid words:
"gipsy76"
"www.google.com"
"me#gmail.com"
"745474"
"+-x/", etc.
I use this code, but it doesn't give the correct answer:
Pattern p = Pattern.compile("[A-Za-z&-&']");
Matcher m = p.matcher(s);
System.out.println(m.matches());
What's wrong with my regex?
Add a + after the expression to say "one or more of those characters".
Escape the hyphen with \ (or put it last).
Remove those & characters.
Here's the code:
Pattern p = Pattern.compile("[A-Za-z'-]+");
Matcher m = p.matcher(s);
System.out.println(m.matches());
Complete test:
String[] ok = {"food","RocKet","man-in-the-middle","kahsdkjhsakdhakjsd","JESUS"};
String[] notOk = {"gipsy76", "www.google.com", "me#gmail.com", "745474","+-x/" };
Pattern p = Pattern.compile("[A-Za-z'-]+");
for (String shouldMatch : ok)
if (!p.matcher(shouldMatch).matches())
System.out.println("Error on: " + shouldMatch);
for (String shouldNotMatch : notOk)
if (p.matcher(shouldNotMatch).matches())
System.out.println("Error on: " + shouldNotMatch);
(Produces no output.)
This should work:
"[A-Za-z'-]+"
But "-word" and "word-" are not valid. So you can use this pattern (note that it no longer allows apostrophes):
WORD_EXP = "^[A-Za-z]+(-[A-Za-z]+)*$"
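If apostrophes should be accepted as well, one possible variant, offered as a sketch and not taken from the answers here, is to allow a single ' or - only between runs of letters. WordCheck and isWord are hypothetical names.

```java
import java.util.regex.Pattern;

final class WordCheck {
    // Letters, optionally followed by groups of one ' or - plus more letters.
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    // matches() requires the whole string to fit, so no ^ or $ is needed.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("man-in-the-middle")); // true
        System.out.println(isWord("Tiffany's"));         // true
        System.out.println(isWord("word-"));             // false
        System.out.println(isWord("gipsy76"));           // false
    }
}
```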
Regex - /^([a-zA-Z]*('|-)?[a-zA-Z]+)*/
You can use the above regex if you don't want successive "'" or "-" (use it with matches(), or add a trailing $, so that the whole word is checked).
It will match your text accurately.
It accepts
man-in-the-middle
asd'asdasd'asd
It rejects the following strings
man--in--midle
asdasd''asd
Hi Aloob, please check this one. It is a bit lengthy and there may be a shorter version, but still...
[A-z]*||[[A-z]*[-]*]*||[[A-z]*[-]*[']*]*

Escape comma when using String.split

I'm trying to perform some super simple parsing of log files, so I'm using the String.split method like this:
String [] parts = input.split(",");
And it works great for input like:
a,b,c
Or
type=simple, output=Hello, repeat=true
Just to say something.
How can I escape the comma, so it doesn't match intermediate commas?
For instance, if I want to include a comma in one of the parts:
type=simple, output=Hello, world, repeate=true
I was thinking of something like:
type=simple, output=Hello\, world, repeate=true
But I don't know how to create the split to avoid matching the comma.
I've tried:
String [] parts = input.split("[^\,],");
But, well, it is not working.
You can solve it using a negative lookbehind.
String[] parts = str.split("(?<!\\\\), ");
Basically it says: split on each ", " that is not preceded by a backslash.
String str = "type=simple, output=Hello\\, world, repeate=true";
String[] parts = str.split("(?<!\\\\), ");
for (String s : parts)
System.out.println(s);
Output:
type=simple
output=Hello\, world
repeate=true
If you happen to be stuck with the non-escaped comma-separated values, you could do the following (similar) hack:
String[] parts = str.split(", (?=\\w+=)");
This says: split on each ", " that is followed by some word characters and an =.
I'm afraid there's no perfect solution for String.split. Using a matcher for the three parts would work. In case the number of parts is not constant, I'd recommend a loop with matcher.find. Something like this, maybe:
final String s = "type=simple, output=Hello, world, repeat=true";
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,|$)");
final Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group(1));
You'll probably want to skip the spaces after the comma as well:
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,\\s*|$)");
It's not really complicated; just note that you need four backslashes in the Java string literal in order to match one literal backslash.
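A slightly fuller sketch of the matcher loop, under a few assumptions of mine: commas are escaped as \, in the input, the escape is removed from the result, and the input has no dangling backslash at the end. The separator is captured as a second group so the loop can stop at the end of input, otherwise find() would report one more empty match there. EscapedCommaSplit and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedCommaSplit {
    // Group 1: a field made of ordinary characters or backslash-escaped ones.
    // Group 2: the separator, a comma plus optional whitespace, or end of input.
    private static final Pattern FIELD =
            Pattern.compile("((?:[^\\\\,]|\\\\.)*)(,\\s*|$)");

    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            fields.add(m.group(1).replace("\\,", ",")); // unescape \, to ,
            if (m.group(2).isEmpty()) {
                break; // separator was end of input, so this was the last field
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String s = "type=simple, output=Hello\\, world, repeate=true";
        for (String field : split(s)) {
            System.out.println("'" + field + "'");
        }
    }
}
```

In the regex source, \\\\ is the Java literal for the two-character regex \\, which in turn matches a single backslash in the input.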
Escaping works with the opposite of aioobe's answer (updated: aioobe now uses the same construct, but I didn't know that when I wrote this): negative lookbehind.
final String s = "type=simple, output=Hello\\, world, repeate=true";
final String[] tokens = s.split("(?<!\\\\),\\s*");
for(final String item : tokens){
System.out.println("'" + item.replace("\\,", ",") + "'");
}
Output:
'type=simple'
'output=Hello, world'
'repeate=true'
Reference:
Pattern: Special Constructs
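I think
input.split("(?<!\\\\),");
should work. It will split at all commas that are not preceded by a backslash. (A character class such as [^\\\\], would not do, because it also consumes the character in front of the comma.)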
BTW if you are working with Eclipse, I can recommend the QuickRex Plugin to test and debug Regexes.
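A quick way to check a split regex like this is to run it on a small input. The sketch below compares a pattern that consumes the character before the comma with a negative lookbehind, which only inspects it. The class and method names are made up for the illustration.

```java
import java.util.Arrays;

final class ClassVsLookbehind {
    // The character class matches (and removes) the character before the comma.
    static String[] splitWithClass(String s) {
        return s.split("[^\\\\],");
    }

    // The lookbehind only checks that the previous character is not a backslash.
    static String[] splitWithLookbehind(String s) {
        return s.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        String input = "ab,cd\\,ef,gh"; // the characters ab,cd\,ef,gh
        System.out.println(Arrays.toString(splitWithClass(input)));      // [a, cd\,e, gh]
        System.out.println(Arrays.toString(splitWithLookbehind(input))); // [ab, cd\,ef, gh]
    }
}
```

With the character class, the b and the f in front of the split commas are lost, which is why the lookbehind form is the safer choice.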
