What would be the regex for this pattern? - java

My Java program, at a certain point, receives a string containing a couple of key-value properties like this example:
param1=value Param2=values can have spaces PARAM3=values cant have equal characters
Each parameter's name (key) is a single word (a-z, A-Z, _ and 0-9) followed by an = character (not separated by spaces) and its value. The value is text that can contain spaces and lasts until the end of the string or the beginning of another parameter (which is a word followed by an equals sign and its value, and so on).
I need to extract a Properties object (string-to-string map) from this string. I was trying to use regex to find each key-value set. The code is like this:
public static Properties createProperties(String str) {
    Properties prop = new Properties();
    Matcher matcher = Pattern.compile(some regex).matcher(str);
    while (matcher.find()) {
        String match = matcher.group();
        String param = ...; // What comes before '='
        String value = ...; // What comes after '='
        prop.setProperty(param, value);
    }
    return prop;
}
But the regex I wrote is not working correctly.
String regex = "(\\w+=.*)+";
Since .* tells the regex to match "anything", it will match the entire string. I want to tell the regex to search until it finds another \\w+=.* (a word followed by an equals sign and something after it).
How could I write this regex? Or what would be another solution for the problem using regex?

You can use a Negative Lookahead here.
(\\w+)=((?:(?!\\s*\\w+=).)*)
The key is placed inside capturing group #1 and the value is in capturing group #2. Note that I used \s inside the lookaround in order to prevent the value from having trailing whitespace.
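As a minimal sketch of how this pattern could be plugged into the method from the question: the class name KeyValueParser is made up for illustration, and the main method only prints whatever was parsed.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex comes from the answer above.
final class KeyValueParser {
    // Group 1 = key, group 2 = value. The value stops before optional
    // whitespace followed by the next "word=" sequence.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=((?:(?!\\s*\\w+=).)*)");

    static Properties createProperties(String str) {
        Properties prop = new Properties();
        Matcher matcher = PAIR.matcher(str);
        while (matcher.find()) {
            prop.setProperty(matcher.group(1), matcher.group(2));
        }
        return prop;
    }

    public static void main(String[] args) {
        Properties p = createProperties(
                "param1=value Param2=values can have spaces PARAM3=values cant have equal characters");
        // Brackets make any stray whitespace around a value visible.
        p.forEach((k, v) -> System.out.println(k + " -> [" + v + "]"));
    }
}
```

Because the lookahead includes \s*, the values should come back without the space that separates them from the next key.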

One way among several:
List<String> paramNames = new ArrayList<String>();
List<String> paramValues = new ArrayList<String>();
Pattern regex = Pattern.compile("([^\\s=]+)=([^\\s=]+)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    paramNames.add(regexMatcher.group(1));
    paramValues.add(regexMatcher.group(2));
}
The regex:
([^\\s=]+)=([^\\s=]+)
The code retrieves keys as Group 1, values as Group 2.
Explanation
([^\\s=]+) captures one or more characters that are neither whitespace nor an equals sign into Group 1
= matches the literal =
([^\\s=]+) captures one or more characters that are neither whitespace nor an equals sign into Group 2

Your regex would be,
(\\w+=(?:(?!\\w+=).)*)
It captures each param=value pair up to the next param=. For the example string, it finds three matches, one per param=value pair.
Explanation:
\\w+= Matches one or more word characters followed by an = symbol.
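(?:(?!\\w+=).)* A non-capturing group with a negative lookahead matches any character that does not start a \\w+= sequence, zero or more times. So the match extends up to the next param=. Because the lookahead does not account for whitespace, a match may end with the space that precedes the next key.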
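Since this pattern returns the whole param=value text as one match, the key and value still have to be separated afterwards. A rough sketch, assuming it is acceptable to cut at the first = and trim the value; the class name WholePairParser and the choice of a LinkedHashMap are illustrative, not part of the answer:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class built around the answer's regex.
final class WholePairParser {
    // Each match is one complete "key=value" chunk.
    private static final Pattern PAIR = Pattern.compile("\\w+=(?:(?!\\w+=).)*");

    static Map<String, String> parse(String str) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(str);
        while (m.find()) {
            String pair = m.group();
            int eq = pair.indexOf('=');            // first '=' ends the key
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1).trim(); // drop the trailing space
            result.put(key, value);
        }
        return result;
    }
}
```

A LinkedHashMap is used here only so that the keys keep the order in which they appear in the input.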


Regex to match a list of exact strings with some variable characters

I'm looking for a way to match a list of parameters that include some predefined characters and some variable characters using Java's String#matches method. For instance:
Possible Parameter 1: abc;[variable lowercase letters with maybe an underscore]
Possible Parameter 2: cde;[variable lowercase letters with maybe an underscore]
Possible Parameter 3: g;4
Example 1: abc;erga_sd,cde;dfgef,g;4
Example 2: g;4,abc;dsfaweg
Example 3: cde;df_ger
Each of the parameters would be comma-separated, but they can come in any order and include 1, 2, and/or 3 (no duplicates).
This is the regex I have so far that partially works:
(abc;[a-z_,]+){0,1}|(cde;[a-z,]+){0,1}|(g;4,){0,1}
The problem is that it also treats something like this as valid: abc;dsfg,dfvser, where the text after the comma does not start with a valid abc; or cde; or g;4
As you said:
The problem is that it also finds something like this valid:
abc;dsfg,dfvser where the beginning of the string after the comma does
not start with a valid abc; or cde; or g;4
Therefore, in valid entries, one of the patterns always follows each comma. You can split each input on the delimiter "," and apply the regex to every split element, then combine the per-element results to get the result for the whole input line.
Your regex should be:
(abc;[a-z_]+)|(cde;[a-z_]+)|(g;4)
Each valid element produced by splitting the input line will match one of these three patterns, as described in your post.
Here's the code:
String regex = "(abc;[a-z_]+)|(cde;[a-z_]+)|(g;4)";
boolean finalResult = true;
for (String input : inputList.split(",")) {
    finalResult = finalResult && Pattern.matches(regex, input);
}
System.out.println(finalResult);
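The same split-then-validate idea can be wrapped in a small method. This is only a sketch: the class name SplitValidator is invented, and unlike the snippet above it passes a limit of -1 to split, an assumption made here so that a trailing comma (as in "g;4,") produces an empty element and is rejected.

```java
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the split-then-validate approach.
final class SplitValidator {
    // One element must be abc;<letters/underscores>, cde;<letters/underscores> or g;4.
    private static final Pattern ELEMENT =
            Pattern.compile("abc;[a-z_]+|cde;[a-z_]+|g;4");

    static boolean isValid(String line) {
        // Limit -1 keeps trailing empty strings, so "g;4," fails validation.
        for (String element : line.split(",", -1)) {
            if (!ELEMENT.matcher(element).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this check does not detect duplicates; it only verifies that every comma-separated element has a valid shape.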
If you want to use matches, then the whole string has to match.
^(?:(?:abc|cde);[a-z_]+|g;4)(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$
Explanation
^ Start of string
(?: Non capture group
(?:abc|cde);[a-z_]+ match either abc; or cde; and 1+ chars a-z or _
| Or
g;4 Match literally
) Close non capture group
(?: Non capture group
,(?:(?:abc|cde);[a-z_]+|g;4) Match a comma, and repeat the first pattern
)* Close non capture group and optionally repeat
$ End of string
Example code
String[] strings = {
    "abc;erga_sd,cde;dfgef,g;4",
    "g;4,abc;dsfaweg",
    "cde;df_ger",
    "g;4",
    "abc;dsfg,dfvser"
};
String regex = "^(?:(?:abc|cde);[a-z_]+|g;4)(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$";
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
    Matcher matcher = pattern.matcher(s);
    if (matcher.matches()) {
        System.out.printf("Match for %s%n", s);
    } else {
        System.out.printf("No match for %s%n", s);
    }
}
Output
Match for abc;erga_sd,cde;dfgef,g;4
Match for g;4,abc;dsfaweg
Match for cde;df_ger
Match for g;4
No match for abc;dsfg,dfvser
If there should not be any duplicate abc;, cde; or g;4, you can rule that out by adding a negative lookahead with a backreference at the start of the pattern; it fails the match when the same prefix occurs twice.
^(?!.*(abc;|cde;|g;4).*\1)(?:(?:abc|cde);[a-z_]+|g;4)(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$
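A short sketch of how the duplicate-rejecting pattern might be used from Java; the class and method names are hypothetical, and the regex is the one given above with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the duplicate-rejecting pattern.
final class NoDuplicateParams {
    private static final Pattern PARAMS = Pattern.compile(
            "^(?!.*(abc;|cde;|g;4).*\\1)"                // no prefix may occur twice
            + "(?:(?:abc|cde);[a-z_]+|g;4)"              // first parameter
            + "(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$");     // further comma-separated parameters

    static boolean isValid(String s) {
        return PARAMS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abc;erga_sd,cde;dfgef,g;4", "abc;x,abc;y", "g;4,g;4"};
        for (String s : samples) {
            System.out.println(s + " valid? " + isValid(s));
        }
    }
}
```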

Java regexp matches more than needed

I have the following regexp
http://[a-z./].*(js)
and the string
efwefewfhttp://assets.main.com/zepto-1.1.3.min.js fffhttp://assets.main.com/zepto-1.1.3.min.js
Code:
List<String> kk = new ArrayList<String>();
while (urlMatcher.find()) {
    kk.add(urlMatcher.group());
}
The output of this regexp is
http://assets.main.com/zepto-1.1.3.min.js fffhttp://assets.main.com/zepto-1.1.3.min.js
but the result should be 2 strings.
How do I change the regexp to get two strings as the result?
Use the following regex with lazy dot matching pattern:
http://[a-z./].*?js
                ^
With this, you will match http://assets.main.com/zepto-1.1.3.min.js and http://assets.main.com/zepto-1.1.3.min.js.
The thing is that .* first matches the whole line and then backtracks until the right-hand pattern can match. Thus it matches the longest possible substring (from the left-most http:// up to the right-most js). Lazy matching will match from the left-most http:// to the first occurrence of the next subpattern, yielding 2 matches.
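To see the difference side by side, here is a small sketch that runs both the greedy and the lazy pattern over the string from the question. The class name GreedyVsLazy and the helper findAll are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy and lazy dot matching.
final class GreedyVsLazy {
    // Collects every match of the given regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String str = "efwefewfhttp://assets.main.com/zepto-1.1.3.min.js "
                + "fffhttp://assets.main.com/zepto-1.1.3.min.js";
        // Greedy: .* runs to the end of the line and backtracks to the last "js".
        System.out.println(findAll("http://[a-z./].*js", str).size());
        // Lazy: .*? stops at the first "js" it can reach.
        System.out.println(findAll("http://[a-z./].*?js", str).size());
    }
}
```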
Also, since these are links, and there should be no spaces, you can use \S (non-whitespace) shorthand char class:
http://[a-z./]\S*\.js
The literal dot before js is matched with \. (written \\. in a Java string literal).
Lazy or greedy dot matching should be avoided where possible because of the heavy backtracking it might involve!
Sample code:
String str = "efwefewfhttp://assets.main.com/zepto-1.1.3.min.js fffhttp://assets.main.com/zepto-1.1.3.min.js";
Pattern ptrn = Pattern.compile("http://[a-z./]\\S*\\.js");
Matcher urlMatcher = ptrn.matcher(str);
List<String> kk = new ArrayList<String>();
while (urlMatcher.find()) {
    kk.add(urlMatcher.group());
}
System.out.println(kk);
// [http://assets.main.com/zepto-1.1.3.min.js, http://assets.main.com/zepto-1.1.3.min.js]

Regex: extracting a value in a string <Name_id = bob>?

What would be the correct regular expression (that I can use in Java) if I want to extract a value from the string below?
<Name_id = bob>
I know that \<(.*?)\> will extract everything between the angle brackets but I only need to extract "bob".
The only part of the string that will change will be "bob". I also want to make sure that if someone enters =bob as the Name_id, the string that pulled out will be just that and doesn't mess up the regular expression.
Use capturing groups to capture the characters you want.
"<Name_id\\s+=\\s+([^>]+)>"
OR
"<Name_id\\s+=\\s+(\\w+)>"
Then print group 1 at the end. \s+ matches one or more whitespace characters and \w+ matches one or more word characters. Note that \w+ will not match a value such as =bob, while [^>]+ will.
String i = "<Name_id = bob>";
Matcher m = Pattern.compile("<Name_id\\s+=\\s+([^>]+)>").matcher(i);
while (m.find()) {
    System.out.println(m.group(1));
}
Output:
bob
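For the case raised in the question, where someone enters =bob as the Name_id, the [^>]+ variant can be checked with a small sketch like the one below. The class name NameIdExtractor and the use of Optional are illustrative choices, not part of the answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class around the answer's first pattern.
final class NameIdExtractor {
    // Group 1 is everything between "= " and the closing ">".
    private static final Pattern NAME_ID =
            Pattern.compile("<Name_id\\s+=\\s+([^>]+)>");

    static Optional<String> extract(String input) {
        Matcher m = NAME_ID.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("<Name_id = bob>"));
        System.out.println(extract("<Name_id = =bob>"));
    }
}
```

Only the first = after Name_id is treated as the separator, so an = inside the value stays part of the captured group.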

Splitting a string with a certain pattern in Java

I am writing a parser for a file containing the following string pattern:
Key : value
Key : value
Key : value
etc...
I am able to retrieve those lines one by one into a list. What I would like to do is to separate the key from the value for each one of those strings. I know there is the split() method that can take a Regex and do this for me, but I am very unfamiliar with them so I don't know what Regex to give as a parameter to the split() function.
Also, while not in the specifications of the file I am parsing, I would like for that Regex to be able to recognize the following patterns as well (if possible):
Key: value
Key :value
Key:value
etc...
So basically, whether there's a space or not after/before/after AND before the : character, I would like for that Regex to be able to detect it. What is the Regex that can achieve this?
In other words, the split method should look for a : with zero or more whitespace characters before or after it.
Key: value
   ^^
Key :value
   ^^
Key:value
   ^
Key : value
   ^^^
In that case split("\\s*:\\s*") should do the trick.
Explanation:
\\s represents any whitespace
* means zero or more occurrences of the element described before it
\\s* means zero or more whitespaces.
On the other hand you may want also to find entire key:value pair and place parts matching key and value in separate groups (you can even name groups as you like using (?<groupName>regex)). In that case you may use
Pattern p = Pattern.compile("(?<key>\\w+)\\s*:\\s*(?<value>\\w+)");
Matcher m = p.matcher(yourData);
while (m.find()) {
    System.out.println("key = " + m.group("key"));
    System.out.println("value = " + m.group("value"));
    System.out.println("--------");
}
If you want to use String.split(), you could use this:
String input = "key : value";
String[] s = input.split("\\s*:\\s*");
String key = s[0];
String value = s[1];
This splits the String at the ":" and treats any whitespace around the ":" as part of the delimiter, so the key and value come back already trimmed.
Explanation:
\\s* matches zero or more whitespace characters; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : in between the two \\s* means that your : needs to be there
Note that this solution will cause an ArrayIndexOutOfBoundsException if your input line does not contain the key-value-format as you defined it.
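One way to avoid that exception while still using split is to check the length of the resulting array before indexing into it. The sketch below assumes a line without a colon should simply be skipped; the class name KeyValueLine, the Optional return type, and the limit of 2 (so that any later : stays in the value) are choices made for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper class: split-based parsing with a length guard.
final class KeyValueLine {
    static Optional<Map.Entry<String, String>> parse(String line) {
        // Limit 2: split only at the first ':' so the value may contain colons.
        String[] parts = line.trim().split("\\s*:\\s*", 2);
        if (parts.length < 2 || parts[0].isEmpty()) {
            return Optional.empty(); // no colon, or nothing before it
        }
        Map.Entry<String, String> entry = new SimpleEntry<>(parts[0], parts[1]);
        return Optional.of(entry);
    }

    public static void main(String[] args) {
        System.out.println(parse("Key : value"));
        System.out.println(parse(""));
    }
}
```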
If you are not sure that the line really contains a key-value string, for example because the file ends with an empty line as files normally do, you could do it like this:
String input = "key : value";
Matcher m = Pattern.compile("(\\S+)\\s*:\\s*(.+)").matcher(input);
if (m.matches()) {
    String key = m.group(1); // note that group numbering starts at 1
    String value = m.group(2);
}
Explanation:
(\\S+) matches a run of non-whitespace characters (the key); whitespace after the key is consumed by the next part of the regex. The () around it form a capturing group, so you can get its value with m.group(1).
\\s* matches zero or more whitespace characters; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : in between the two \\s* means that your : need to be there
The last group, .+, will match any string, containing whitespaces and so on.
You can use the split method and pass ":" as the delimiter.
This splits the string when it sees ':', then you can trim the values to get the key and value.
String s = " keys : value ";
String keyValuePairs[] = s.split(":");
String key = keyValuePairs[0].trim();
String value = keyValuePairs[1].trim();
You can also make use of regex to simplify it.
String keyValuePairs[] = s.trim().split("[ ]*:[ ]*");
s.trim() removes the spaces before and after the string (if there are any), so the string becomes "keys : value", and
[ ]*:[ ]*
splits the string using "zero or more spaces, a colon, zero or more spaces" as the delimiter.
For a pure regex solution, you can use the following pattern (shown in quotes because it begins with a space):
" ?: ?"
See http://regexr.com/39evh
String[] tokensVal = str.split(":");
String key = tokensVal[0].trim();
String value = tokensVal[1].trim();

Excluding markup on lowercased parentheses letters

A string can contain one or more lowercase letters in parentheses, like String content = "This is (a) nightmare"; I want to transform the string to "<centamp>This is </centamp>(a) <centamp>nightmare</centamp>"; So basically, add centamp markup around the string, but exclude any lowercase letter in parentheses from the markup.
This is what I have tried so far, but it doesn't achieve the desired result. A string can contain zero or more such parenthesized letters, and every one of them should be excluded from the markup.
Pattern pattern = Pattern.compile("^(.*)?(\\([a-z]*\\))?(.*)?$", Pattern.MULTILINE);
String content = "This is (a) nightmare";
System.out.println(content.matches("^(.*)?(\\([a-z]*\\))?(.*)?$"));
System.out.println(pattern.matcher(content).replaceAll("<centamp>$1$3</centamp>$2"));
This can be done in one replaceAll:
String outputString =
inputString.replaceAll("(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
"$1<centamp>$2</centamp>");
It allows a non-empty sequence of lowercase English letters inside the parentheses: \\([a-z]+\\).
Features:
Whitespace only sequences are tagged.
There will be no tag surrounding empty string.
Explanation:
\G asserts the match boundary, i.e. the next match can only start where the last match ended. It also matches at the beginning of the string (when no match has been found yet).
Each match of the regex contains 0 or more consecutive \\([a-z]+\\) sequences (with nothing in between), followed by at least 1 character that does not start a \\([a-z]+\\) sequence.
The "0 or more consecutive \\([a-z]+\\)" part covers the case where the string does not start with \\([a-z]+\\), and the case where the string does not contain \\([a-z]+\\) at all.
In the pattern for this portion, (?:\\([a-z]+\\))*+, note that the + after * makes the quantifier possessive; in other words, it disallows backtracking. Simply put, it is an optimization.
The "at least 1 character" restriction is necessary to prevent adding a tag that encloses an empty string.
In the pattern for this portion, (?:(?!\\([a-z]+\\)).)+, note that before matching each character, the lookahead (?!\\([a-z]+\\)) checks that it does not start a \\([a-z]+\\) sequence.
The (?s) flag causes . to match any character, including a newline. This allows a tag to enclose text that spans multiple lines.
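The behaviour described above can be tried out with a small wrapper. This is only a sketch: the class name CentampTagger and the sample inputs are assumptions, while the regex and the replacement string are the ones from the answer.

```java
// Hypothetical wrapper around the single-replaceAll solution.
final class CentampTagger {
    static String tag(String input) {
        // $1 = leading run of "(letters)" groups, kept outside the tag.
        // $2 = following text that contains no "(letters)" group, wrapped in the tag.
        return input.replaceAll(
                "(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
                "$1<centamp>$2</centamp>");
    }

    public static void main(String[] args) {
        System.out.println(tag("This is (a) nightmare"));
        System.out.println(tag("x (c)"));
    }
}
```

A trailing "(c)" with nothing after it is left untouched, because the second group requires at least one character and so no further match is made.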
Just replace every occurrence of \\([a-z]\\) with </centamp>$1<centamp>, then prepend <centamp> and append </centamp>.
String content = "Test (a) test (b) (c)";
Pattern pattern = Pattern.compile("(\\([a-z]\\))");
Matcher matcher = pattern.matcher(content);
String result = "<centamp>" + matcher.replaceAll("</centamp>$1<centamp>") + "</centamp>";
Note: I wrote the above in the browser, so there may be syntax errors.
EDIT Here's a full example with the simplest RegEx possible.
import java.util.*;
import java.lang.*;
import java.util.regex.*;
class Main
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String content = "test (a) (b) and (c)";
        String result = "<centamp>" +
            content.replaceAll("(\\([a-z]\\))", "</centamp>$1<centamp>") +
            "</centamp>";
        result = result.replaceAll("<centamp></centamp>", "");
        System.out.print(result);
    }
}
This is another solution which uses cleaner regex. The solution is longer, but it allows more flexibility in adjusting the condition to add tag.
The idea here is to match the parenthesis containing lower case characters (the part we don't want to tag), then use the indices from the matches to identify the portion we want to enclose in tag.
// Regex for the parenthesis containing only lowercase English
// alphabet characters
static Pattern REGEX_IN_PARENTHESIS = Pattern.compile("\\([a-z]+\\)");
private static String addTag(String str) {
    Matcher matcher = REGEX_IN_PARENTHESIS.matcher(str);
    StringBuilder sb = new StringBuilder();
    // Index that we have processed up to last append into StringBuilder
    int lastAppend = 0;
    while (matcher.find()) {
        String bracket = matcher.group();
        // The string from lastAppend to start of a match is the part
        // we want to tag
        // If you want to, you can easily add extra logic to process
        // the string
        if (lastAppend < matcher.start()) { // will not tag if empty string
            sb.append("<centamp>")
              .append(str, lastAppend, matcher.start())
              .append("</centamp>");
        }
        // Append the parenthesis with lowercase English alphabet as it is
        sb.append(bracket);
        lastAppend = matcher.end();
    }
    // The string from lastAppend to end of string (no more match)
    // is the part we want to tag
    if (lastAppend < str.length()) {
        sb.append("<centamp>")
          .append(str, lastAppend, str.length())
          .append("</centamp>");
    }
    return sb.toString();
}
