Regular expression matching issue with the following scenario - java

I am developing an application. The user enters some setting values on the server. When I request those values through the built-in API, I get them back as a single string,
for example:
name={abc};display={xyz};addressname={123}
Here the properties are name, display and addressname, and their respective values are abc, xyz and 123.
I split on ; as the first delimiter and = as the second delimiter.
String[] propertyValues=iPropertiesStrings.split(";");
for(int i=0;i<propertyValues.length;i++)
{
if(isNullEmpty(propertyValues[i]))
continue;
String[] propertyValue=propertyValues[i].split("=");
if(propertyValue.length!=2)
mPropertyValues.put(propertyValue[0], "");
else
mPropertyValues.put(propertyValue[0], propertyValue[1]);
}
Here mPropertyValues is a HashMap that stores each property name and its value.
The problem is that the string can also look like this:
case 1: name={abc};display={ xyz=deno; demo2=pol };addressname={123}
case 2: name=;display={ xyz=deno; demo2=pol };addressname={123}
I want the HashMap to be filled with:
case 1:
name = "abc"
display = "xyz=deno; demo2=pol"
addressname = "123"
case 2:
name = ""
display = "xyz=deno; demo2=pol"
addressname = "123"
I am looking for a regular expression to split these strings.

Assuming that there can't be nested {}, this should do what you need:
String data = "name=;display={ xyz=deno; demo2=pol };addressname={123}";
Pattern p = Pattern.compile("(?<name>\\w+)=(\\{(?<value>[^}]*)\\})?(;|$)");
Matcher m = p.matcher(data);
while (m.find()){
System.out.println(m.group("name")+"->"+(m.group("value")==null?"":m.group("value").trim()));
}
Output:
name->
display->xyz=deno; demo2=pol
addressname->123
Explanation
(?<name>\\w+)=(\\{(?<value>[^}]*)\\})?(;|$) can be split into parts, where
(?<name>\\w+)= matches XXXX= and places XXXX in the group named name (the property name).
(\\{(?<value>[^}]*)\\})? is the optional {XXXX} part, where X can't be }. It places the XXXX part in the group named value.
(;|$) matches ; OR the end of the data (the $ anchor), since each pair is written as name=value; or, for the last pair in the data, name=value.
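To tie this back to the question, here is a minimal sketch that uses the same pattern to fill a map instead of printing. The class name PropertyParser and the method parse are made up for illustration, and it still assumes there are no nested {}.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PropertyParser {
    // name, then an optional {value} block, then ';' or the end of the input
    private static final Pattern PAIR =
            Pattern.compile("(?<name>\\w+)=(\\{(?<value>[^}]*)\\})?(;|$)");

    // Hypothetical helper: returns the properties in the order they appear
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(data);
        while (m.find()) {
            // the named group is null when the optional {value} part is absent
            String value = m.group("value");
            result.put(m.group("name"), value == null ? "" : value.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("name={abc};display={ xyz=deno; demo2=pol };addressname={123}"));
        System.out.println(parse("name=;display={ xyz=deno; demo2=pol };addressname={123}"));
    }
}
```

A LinkedHashMap is used only so the entries print in input order; a plain HashMap works the same way otherwise.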

The following regex should match your criteria, and uses named capturing groups to get the three values you need.
name=\{(?<name>[^}]+)\};display=\{(?<display>[^}]+)\};addressname=\{(?<address>[^}]+)\}

Assuming your dataset can change, a better parser may be more dynamic, building a Map from whatever is found in that return type.
The regex for this is pretty simple, given the cases you list above (and no nesting of {}, as others have mentioned):
Matcher m = Pattern.compile("(\\w+)=(?:\\{(.*?)\\})?").matcher(source_string);
while (m.find()) {
if (m.groupCount() > 1) {
hashMap.put(m.group(1), m.group(2));
}
}
There are, however, considerations to this:
If the optional {...} part is missing, m.group(2) returns null, so null will be stored as the value (you can adjust that to whatever you want with a tiny amount of logic).
This will account for varying data sets, in case your data changes in the future.
What that regex does:
(\\w+) - This looks for one or more word characters in a row (a-z, A-Z, 0-9 and _) and puts them into a "capture group" (group(1))
= - The literal equals
(?:...)? - This makes the grouping a non-capturing group (it will not be a .group(n)), and the trailing ? makes it an optional grouping.
\\{(.*?)\\} - This looks for anything between the literals { and } (note: if a stray } is in there, this will break). If this section exists, the contents between {} will be in the second "capture group" (.group(2)).
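The "tiny amount of logic" for the missing-value case could look roughly like the sketch below. The class name LenientPropertyParser and the choice of an empty string as the default are assumptions, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientPropertyParser {
    // key, '=', then an optional non-capturing {...} block whose content is group 2
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(?:\\{(.*?)\\})?");

    // Hypothetical helper: maps each key to its value, or to "" when no {...} follows
    static Map<String, String> parse(String source) {
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(source);
        while (m.find()) {
            String value = m.group(2);   // null when the optional block did not match
            map.put(m.group(1), value == null ? "" : value.trim());
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parse("name=;display={ xyz=deno; demo2=pol };addressname={123}"));
    }
}
```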

Related

Filtering string between double or single quotations with varying spaces

I have these two variations of this string
name='Anything can go here'
name="Anything can go here"
where name= can have spaces like so
name=(text)
name =(text)
name = (text)
I need to extract the text between the quotes. I'm not sure of the best way to approach this: should I cut the string off at the quotes (and if so, do you have an example that avoids a lot of case handling), or should I use a regex?
I'm not sure I understand the question exactly but I'll give it my best shot:
If you want to just assign a variable name2 to the string inside the quotation marks then you can easily do :
String name = "'Anything can go here'";
String name2= name.replace("'","");
name2 = name2.replace("\"","");
You're wanting to get Anything can go here whether it's in between single quotes or double quotes. Regex has the capabilities of doing this regardless of the spaces before or after the "=" by using the following pattern:
"[\"'](.+)[\"']"
Breakdown:
[\"'] - Character class consisting of a double or single quote
(.+) - One or more of any character (may or may not match line terminators), stored in capture group 1
[\"'] - Character class consisting of a double or single quote
In short, we are trying to capture anything between single or double quotes.
Example:
public static void main(String[] args) {
List<String> data = new ArrayList<>(Arrays.asList(
"name='Anything can go here'",
"name = \"Really! Anything can go here\""
));
for (String d : data) {
Matcher matcher = Pattern.compile("[\"'](.+)[\"']").matcher(d);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
}
}
Results:
Anything can go here
Really! Anything can go here
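One possible refinement, offered as a sketch and not as part of the answer above: if the closing quote has to be the same kind as the opening one, a backreference can enforce that, and \\s* can absorb the optional spaces around =. The class name QuotedValue and the method extract are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedValue {
    // name, optional spaces around '=', an opening quote captured in group 1,
    // the text in group 2, then the same quote character again via \1
    private static final Pattern QUOTED =
            Pattern.compile("name\\s*=\\s*([\"'])(.*?)\\1");

    // Hypothetical helper: returns the quoted text, or null if nothing matches
    static String extract(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("name='Anything can go here'")); // Anything can go here
        System.out.println(extract("name = \"It's fine\""));        // It's fine
    }
}
```

Because the closing quote must equal the opening one, an apostrophe inside a double-quoted value no longer ends the match early.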

Subtracting characters in a back reference from a character class in java.util.regex.Pattern

Is it possible to subtract the characters in a Java regex back reference from a character class?
e.g., I want to use String#matches(regex) to match either:
any group of characters that are [a-z'] that are enclosed by "
Matches: "abc'abc"
Doesn't match: "1abc'abc"
Doesn't match: 'abc"abc'
any group of characters that are [a-z"] that are enclosed by '
Matches: 'abc"abc'
Doesn't match: '1abc"abc'
Doesn't match: "abc'abc"
The following regex won't compile because [^\1] isn't supported:
(['"])[a-z'"&&[^\1]]*\1
Obviously, the following will work:
'[a-z"]*'|"[a-z']*"
But, this style isn't particularly legible when a-z is replaced by a much more complex character class that must be kept the same in each side of the "or" condition.
I know that, in Java, I can just use String concatenation like the following:
String charClass = "a-z";
String regex = "'[" + charClass + "\"]*'|\"[" + charClass + "']*\"";
But, sometimes, I need to specify the regex in a config file, like XML, or JSON, etc., where java code is not available.
I assume that what I'm asking is almost definitely not possible, but I figured it wouldn't hurt to ask...
One approach is to use a negative look-ahead to make sure that every character in between the quotes is not the quotes:
(['"])(?:(?!\1)[a-z'"])*+\1
^^^^^^
(I also make the quantifier possessive, since there is no use for backtracking here)
This approach is, however, rather inefficient, since the pattern checks for the quote character at every single character, on top of checking that the character is one of the allowed characters.
The alternative with 2 branches in the question '[a-z"]*'|"[a-z']*" is better, since the engine only checks for the quote character once and goes through the rest by checking that the current character is in the character class.
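As a quick sanity check, the sketch below runs both patterns against the strings from the question with Matcher#matches and shows that they accept the same inputs. The class and method names are hypothetical; only the two regexes come from the discussion above.

```java
import java.util.regex.Pattern;

public class QuoteClassCheck {
    // Lookahead variant: every inner character must differ from the opening quote (\1)
    static final Pattern LOOKAHEAD =
            Pattern.compile("(['\"])(?:(?!\\1)[a-z'\"])*+\\1");
    // Two-branch variant from the question
    static final Pattern BRANCHES =
            Pattern.compile("'[a-z\"]*'|\"[a-z']*\"");

    static boolean matchesLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    static boolean matchesBranches(String s) {
        return BRANCHES.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\"abc'abc\"",   // [a-z'] enclosed by "
            "'abc\"abc'",    // [a-z"] enclosed by '
            "\"1abc'abc\"",  // contains a digit
            "\"abc\"abc\""   // enclosing quote also appears inside
        };
        for (String s : samples) {
            System.out.println(s + " -> " + matchesLookahead(s) + " / " + matchesBranches(s));
        }
    }
}
```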
You could use two patterns in one OR-separated pattern, expressing both your cases:
// | case 1: [a-z'] enclosed by "
// | | OR
// | | case 2: [a-z"] enclosed by '
Pattern p = Pattern.compile("(?<=\")([a-z']+)(?=\")|(?<=')([a-z\"]+)(?=')");
String[] test = {
// will match group 1 (for case 1)
"abcd\"efg'h\"ijkl",
// will match group 2 (for case 2)
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
Output
efg'h
null
null
efg"h
Note
There is nothing stopping you from specifying the enclosing characters or the character class itself somewhere else, then building your Pattern with components unknown at compile-time.
Something along the lines of:
// both strings are emulating unknown-value arguments
String unknownEnclosingCharacter = "\"";
String unknownCharacterClass = "a-z'";
// probably want to catch a PatternSyntaxException here for potential
// issues with the given arguments
Pattern p = Pattern.compile(
String.format(
"(?<=%1$s)([%2$s]+)(?=%1$s)",
unknownEnclosingCharacter,
unknownCharacterClass
)
);
String[] test = {
"abcd\"efg'h\"ijkl",
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
// note: only main group here
System.out.println(m.group());
}
}
Output
efg'h

Java Regex is including new line in match

I'm trying to match a regular expression to textbook definitions that I get from a website.
The definition always has the word with a new line followed by the definition. For example:
Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern
In my attempts to get just the word (in this case "Zither") I keep getting the newline character.
I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to successfully match the word at all. I've been testing with rubular, http://rubular.com/r/LPEHCnS0ri; which seems to successfully match all my attempts the way I want, despite the fact that Java doesn't.
Here's my snippet
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group();
terms.add(new SearchTerm(result, System.nanoTime()));
}
This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.
All help is greatly appreciated. Thanks in advance!
Try using the Pattern.MULTILINE option
Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);
This causes the regex to recognise line delimiters in your string, otherwise ^ and $ just match the start and end of the string.
Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) based on the number you specify. Your pattern specifies one capture group which is what you want captured. If you'd included \s in your Pattern as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.
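To make the effect of the flag visible, here is a small sketch that runs the same pattern with and without Pattern.MULTILINE. The class name FirstLineWord and the shortened sample text are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstLineWord {
    // Hypothetical helper: returns capture group 1 of ^(\S+)$ or null if there is no match
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("^(\\S+)$", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Zither\nDefinition: An instrument of music";
        // Without the flag, $ cannot match before the newline in the middle of the input
        System.out.println(firstWord(text, 0));                 // null
        // With MULTILINE, ^ and $ also match at line boundaries
        System.out.println(firstWord(text, Pattern.MULTILINE)); // Zither
    }
}
```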
With regular expressions, group 0 is always the complete match. In your case you want group 1, not group 0.
So changing mtch.group() to mtch.group(1) should do the trick:
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group(1);
terms.add(new SearchTerm(result, System.nanoTime()));
}
A late response, but if you are not using Pattern and Matcher, you can use this alternative of DOTALL in your regex string
(?s)[Your Expression]
Basically (?s) also tells dot to match all characters, including line breaks
Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
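A minimal sketch of the embedded flag with String#matches, which takes only a regex string and so has no place to pass Pattern.DOTALL. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class InlineDotAll {
    // Hypothetical helper: true if text starts with word and anything (or nothing) follows
    static boolean startsWith(String text, String word, boolean dotAll) {
        // (?s) switches on DOTALL from inside the regex, so '.' also matches line breaks
        String regex = (dotAll ? "(?s)" : "") + Pattern.quote(word) + ".*";
        return text.matches(regex);
    }

    public static void main(String[] args) {
        String text = "Zither\nDefinition: An instrument of music";
        System.out.println(startsWith(text, "Zither", false)); // false: '.' stops at the newline
        System.out.println(startsWith(text, "Zither", true));  // true
    }
}
```

The same idea applies to the other flags: (?m) is the embedded form of Pattern.MULTILINE.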
Just replace:
String result = mtch.group();
By:
String result = mtch.group(1);
This will limit your output to the contents of the capturing group (e.g. (\\w+)) .
Try the following:
/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final Pattern REGEX_PATTERN =
Pattern.compile("^(\\w+)\\r?\\n(.*)$");
public static void main(String[] args) {
String input = "Zither\nDefinition: An instrument of music";
System.out.println(
REGEX_PATTERN.matcher(input).matches()
); // prints "true"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
); // prints "Zither = Definition: An instrument of music"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1")
); // prints "Zither"
}

Stuck in regular expression

I have 3 strings that contain 2 fields and 2 values per string. I need a regular expression for the strings so I can get the data. Here are the 3 strings:
TTextRecordByLanguage{Text=Enter here the amount to transfer from your compulsory book saving account to your compulsory checking account; Id=55; }
TTextRecordByLanguage{Text=Hello World, CaribPayActivity!; Id=2; }
TTextRecordByLanguage{Text=(iphone); Id=4; }
The 2 fields are Text and Id, so I need an expression that gets the data between the Text field and the semi-colon (;). Make sure special symbols and any data are included.
Update: what I have tried:
Pattern pinPattern = Pattern.compile("Text=([a-zA-Z0-9 \\E]*);");
ArrayList<String> pins = new ArrayList<String>();
Matcher m = pinPattern.matcher(soapObject.toString());
while (m.find()) {
pins.add(m.group(1));
s[i] = m.group(1);
}
Log.i("TAG", "ARRAY=>"+ s[i]);
I suggest a RE like this:
Text=.*?;
e.g. the match returned for the last string would be
Text=(iphone);
Then you can strip Text= and ; from the match, since you only want the content.
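A capture group can do the stripping in one step. The sketch below is one way to write that, assuming the text itself never contains a semicolon; the class name TextFieldExtractor is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextFieldExtractor {
    // Lazily capture everything between "Text=" and the next ';'
    private static final Pattern TEXT = Pattern.compile("Text=(.*?);");

    // Hypothetical helper: collects the Text value of every record in the input
    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = TEXT.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // group 1 excludes "Text=" and the ';'
        }
        return values;
    }

    public static void main(String[] args) {
        String input =
            "TTextRecordByLanguage{Text=Hello World, CaribPayActivity!; Id=2; }\n"
          + "TTextRecordByLanguage{Text=(iphone); Id=4; }";
        System.out.println(extract(input));
    }
}
```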

Regular expression, match content of specific XML tag, but without the tag itself

I have been banging my head against this regular expression all day.
The task looks simple, I have a number of XML tag names and I must replace (mask) their content.
For example
<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>
Must become
<Exony_Credit_Card_ID>filtered</Exony_Credit_Card_ID>
There are multiple such tags with different names.
How do I match any text inside but without matching the tag itself?
EDIT: I should clarify again. Grouping and then using the group to avoid replacing the text inside does not work in my case, because when I add the other tags to the expression, the group number is different for the subsequent matches. For example:
"(<Exony_Credit_Card_ID>).+(</Exony_Credit_Card_ID>)|(<Billing_Postcode>).+(</Billing_Postcode>)"
replaceAll with the string "$1filtered$2" does not work, because when the regex matches Billing_Postcode its groups are 3 and 4 instead of 1 and 2.
String resultString = subjectString.replaceAll(
"(?x) # (multiline regex): Match...\n" +
"<(Exony_Credit_Card_ID|Billing_Postcode)> # one of these opening tags\n" +
"[^<>]* # Match whatever is contained within\n" +
"</\\1> # Match corresponding closing tag",
"<$1>filtered</$1>");
In your situation, I'd use this:
(?<=<(Exony_Credit_Card_ID|tag1|tag2)>)(\\d+)(?=</(Exony_Credit_Card_ID|tag1|tag2)>)
And then replace the matches with filtered, as the tags are excluded from the returned match. As your goal is to hide sensitive data, it's better to be safe and use "aggressive" matching, trying to match as much potentially sensitive data as possible, even if some of it turns out not to be sensitive.
You may need to adjust the tag content matcher ( \\d+ ) if the data contains other characters, like whitespaces, slashes, dashes and such.
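A rough sketch of this lookaround approach follows. The tag list, the class name LookaroundMask, and the use of [^<]+ in place of \\d+ for the content are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

public class LookaroundMask {
    // Assumed list of sensitive tag names, joined as a regex alternation
    private static final String TAGS = "Exony_Credit_Card_ID|Billing_Postcode";

    // Only the text between the tags is matched; the tags themselves are
    // checked by the lookbehind and lookahead but are not part of the match.
    private static final Pattern CONTENT = Pattern.compile(
            "(?<=<(?:" + TAGS + ")>)[^<]+(?=</(?:" + TAGS + ")>)");

    static String mask(String xml) {
        return CONTENT.matcher(xml).replaceAll("filtered");
    }

    public static void main(String[] args) {
        System.out.println(mask(
            "<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>"
          + "<Billing_Postcode>AB1 2CD</Billing_Postcode>"
          + "<Name>Bob</Name>"));
    }
}
```

Note that this version does not verify that the closing tag has the same name as the opening tag; it only requires both to be in the list.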
I have not debugged this code but you should use something like this:
Pattern p = Pattern.compile("<\\w+>([^<]*)</\\w+>");
Matcher m = p.matcher(str);
if (m.find()) {
String tagContent = m.group(1);
}
I hope it is a good start.
I would use something like this :
private static final Pattern PAT = Pattern.compile("<(\\w+)>(.*?)</\\1>");
private static String replace(String s, Set<String> toReplace) {
Matcher m = PAT.matcher(s);
if (m.matches() && toReplace.contains(m.group(1))) {
return '<' + m.group(1) + '>' + "filtered" + "</" + m.group(1) + '>';
}
return s;
}
I know you said that relying on group numbers does not do in your case ... but I can't really see how. Could you not use something of the sort :
xmlString.replaceAll("<(Exony_Credit_Card_ID|tag2|tag3)>([^<]+)</(\\1)>", "<$1>filtered</$1>");
? This works on the basic samples I used as a test.
edit: just to decompose :
"<(Exony_Credit_Card_ID|tag2|tag3)>" + // matches the tag itself
"([^<]+)" + // then anything in between the opening and closing of the tag
"</(\\1)>" // and finally the end tag corresponding to what we matched as the first group (Exony_Credit_Card_ID, tag2 or tag3)
"<$1>" + // Replace using the first captured group (tag name)
"filtered" + // the "filtered" text
"</$1>" // and the closing tag corresponding to the first captured group
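Put together as a runnable sketch, with Billing_Postcode standing in for the placeholder tag names and a made-up class name TagMasker. Because the tag name is always group 1, the group numbering problem from the question's EDIT does not arise.

```java
public class TagMasker {
    // One alternation for all sensitive tag names; \1 forces the closing tag
    // to repeat whichever name the opening tag matched.
    private static final String REGEX =
            "<(Exony_Credit_Card_ID|Billing_Postcode)>[^<]*</\\1>";

    static String mask(String xml) {
        // $1 in the replacement is the tag name captured by group 1
        return xml.replaceAll(REGEX, "<$1>filtered</$1>");
    }

    public static void main(String[] args) {
        System.out.println(mask(
            "<Exony_Credit_Card_ID>242394798</Exony_Credit_Card_ID>"
          + "<Billing_Postcode>AB1 2CD</Billing_Postcode>"
          + "<Name>Bob</Name>"));
    }
}
```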
