Canonical equivalence in Pattern - java

I am referring to the test harness listed here http://docs.oracle.com/javase/tutorial/essential/regex/test_harness.html
The only change I made to the class is that the pattern is created as below:
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex(Pattern.CANON_EQ set): "),Pattern.CANON_EQ);
As the tutorial at http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html suggests, I entered a\u030A as the regex and \u00E5 as the string to match, but the harness reports that no match was found. As far as I can see, both strings represent a lowercase 'a' with a ring on top.
Have I not understood the use case correctly?

The behavior you're seeing has nothing to do with the Pattern.CANON_EQ flag.
Input read from the console is not the same as a Java string literal. When the user (presumably you, testing out this flag) types \u00E5 into the console, the resultant string read by console.readLine is equivalent to "\\u00E5", not "å". See for yourself: http://ideone.com/lF7D1
As for Pattern.CANON_EQ, it behaves exactly as described:
Pattern withCE = Pattern.compile("^a\u030A$",Pattern.CANON_EQ);
Pattern withoutCE = Pattern.compile("^a\u030A$");
String input = "\u00E5";
System.out.println("Matches with canon eq: "
+ withCE.matcher(input).matches()); // true
System.out.println("Matches without canon eq: "
+ withoutCE.matcher(input).matches()); // false
http://ideone.com/nEV1V
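To make the console pitfall concrete, here is a minimal sketch that converts typed backslash-u sequences into the characters they name before matching. The class name and the unescapeUnicode helper are made up for illustration and are not part of the tutorial's test harness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsoleEscapeDemo {
    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern TYPED_ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces each typed escape with the character it names.
    static String unescapeUnicode(String typed) {
        Matcher m = TYPED_ESCAPE.matcher(typed);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslash from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulates what console.readLine returns when the six characters are typed.
        String typed = "\\u00E5";
        String converted = unescapeUnicode(typed);
        System.out.println(typed.length());     // 6
        System.out.println(converted.length()); // 1

        // Same pattern as in the answer above, tried against both strings.
        Pattern withCE = Pattern.compile("^a\u030A$", Pattern.CANON_EQ);
        System.out.println("Typed text matches: " + withCE.matcher(typed).matches());
        System.out.println("Converted text matches: " + withCE.matcher(converted).matches());
    }
}
```

Applying the same conversion to the input string read by the harness would let you test CANON_EQ interactively.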

Related

Replace repeated xml tags value using regex

Input -
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
I tried the following to mask the values inside the <accntNo> tags:
String op = ipXmlString .replaceAll("<accntNo>(.+?)</accntNo>", "######");
But the above code replaces the <accntNo> tags along with the whole value:
<root><accntNoGrp>######</accntNoGrp><accntNoGrp>######</accntNoGrp></root>
Expected Output:
<root><accntNoGrp><accntNo>#####67</accntNo></accntNoGrp><accntNoGrp><accntNo>#####23</accntNo></accntNoGrp></root>
How can I achieve this using Java regex? Could someone help?
Your replacement is wrong, you need to include the <accntNo> tag in the actual replacement. Also, it appears that you want to show the last two characters/numbers of the account number. In this case, we can capture this information during the match and use it in the replacement.
Code:
String op = ipXmlString.replaceAll("<accntNo>(?:.+?)(.{2})</accntNo>", "<accntNo>######$1</accntNo>");
Explanation:
<accntNo> match an opening tag
(?:.+?) match, but do not capture, anything up until the last two characters
(.{2}) two characters before closing tag (and capture this)
</accntNo> match a closing tag
Note here that by using ?: inside a parenthesis in the pattern, we tell the regex engine to not capture it. There is no point in capturing anything before the last two characters of the account number because we don't want to use it.
The $1 quantity in the replacement refers to the first capture group. In this case, it is the last two characters of the account number. Hence, we build the replacement string you want this way.
Demo here:
Rextester
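The replacement above always emits a fixed number of # characters, while the expected output in the question shows one # per hidden digit. A minimal sketch of a length-preserving variant follows; the class and method names are hypothetical, and it assumes account numbers never contain a '<' character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccountMasker {
    // Group 1: everything except the last two characters; group 2: the last two.
    private static final Pattern ACCOUNT =
            Pattern.compile("<accntNo>([^<]*)([^<]{2})</accntNo>");

    // Hypothetical helper: masks each hidden character with one '#'.
    static String mask(String xml) {
        Matcher m = ACCOUNT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder hashes = new StringBuilder();
            for (int i = 0; i < m.group(1).length(); i++) {
                hashes.append('#');
            }
            String replacement = "<accntNo>" + hashes + m.group(2) + "</accntNo>";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String ipXmlString = "<root>"
                + "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
                + "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
                + "</root>";
        System.out.println(mask(ipXmlString));
    }
}
```

For anything beyond a quick mask, an XML parser is usually a safer tool than a regex.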
Try this code:
public static void main(String[] args) {
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
String replaceAll = ipXmlString.replaceAll("\\d+", "######");
System.out.println(replaceAll);
}
Prints:
<root><accntNoGrp><accntNo>######</accntNo></accntNoGrp><accntNoGrp><accntNo>######</accntNo></accntNoGrp></root>

Escape special characters using Regex in java [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
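A small sketch comparing the options mentioned above on the "$5" example from the question. The class name and the containsLiteral helper are made up for illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: true if haystack contains needle as literal text.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        String text = "costs $5 today";

        // Unquoted, '$' is an end-of-input anchor, so a following 5 can never match.
        System.out.println(Pattern.compile("$5").matcher(text).find()); // false

        // Pattern.quote wraps the text in a quoted section.
        System.out.println(Pattern.quote("$5"));         // \Q$5\E
        System.out.println(containsLiteral(text, "$5")); // true

        // The LITERAL flag treats the whole pattern as plain text.
        System.out.println(Pattern.compile("$5", Pattern.LITERAL).matcher(text).find()); // true
    }
}
```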
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, then it won't put a literal "$1" into the result. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
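The failure and the fix can be reproduced in a few lines. This is a sketch only: the message text, the sample input "$3.50", and the fill helper are invented for the example.

```java
import java.util.regex.Matcher;

class ReplacementDemo {
    // Hypothetical helper mirroring the fix above.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
    }

    public static void main(String[] args) {
        String msg = "Amount due: <userInput />";
        String userInput = "$3.50";

        try {
            // "$3" is read as a reference to capture group 3, which does not exist.
            msg.replaceAll("<userInput \\/>", userInput);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Unquoted replacement failed: " + e.getMessage());
        }

        System.out.println(fill(msg, userInput)); // Amount due: $3.50
    }
}
```

When the search text is also plain text rather than a pattern, String.replace(CharSequence, CharSequence) avoids both kinds of escaping.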
To build a protected pattern, you can put a backslash in front of every character except digits and letters. After that, you can insert your own special symbols into the protected pattern, so that it works like a real pattern of your own design rather than as plain quoted text, and none of the user's special symbols take effect.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
The Pattern.quote() works nicely. It encloses the text in "\Q" and "\E", and it also copes with a "\E" that appears inside the text.
However, if you need to do a real regular expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
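One way to sanity-check either escaping approach is to confirm that the escaped pattern still matches the original text in full. The sketch below does that; escapeMeta is a hypothetical helper, and its character class (regex metacharacters plus whitespace) is an assumption about what should be escaped.

```java
import java.util.regex.Pattern;

class EscapeCheck {
    // Hypothetical helper: puts a backslash before regex metacharacters and whitespace.
    static String escapeMeta(String text) {
        return text.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0");
    }

    public static void main(String[] args) {
        // The text itself contains a backslash followed by E.
        String someText = "Some\\E/s/wText*/,**";

        // Both escaped forms should match the original text exactly.
        System.out.println(Pattern.matches(Pattern.quote(someText), someText)); // true
        System.out.println(Pattern.matches(escapeMeta(someText), someText));    // true
    }
}
```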
^(Negation) symbol is used to match something that is not in the character group.

Java Regex - Pattern.match fails to deal with lazy match

This is the sample code
String m_testPattern = "AB.*?";
String m_testMatcherString = "ABCDCDCDCD";
final Pattern pattern = Pattern.compile(m_testPattern);
final Matcher matcher = pattern.matcher(m_testMatcherString);
if (matcher.matches()) {
// This means the regex matches
System.out.println("Successful comparison");
} else {
// match failed
System.out.println("Comparison failed !!!");
}
Ideally the match operation should result in a failure and give me output as "Comparison failed !!!"
But this code snippet gives me "Successful comparison" as output
I checked online regex tools with the same input and the result was different
I did the trial in this site http://regexr.com/v1/
Here when I put AB.*? in the regex and ABCDCDCDCD as the string to be compared, then the search stops at AB.
This means the comparison performed is a Lazy Comparison and not a greedy one
Can anyone please explain why the same use case fails in case of Java Pattern.match function ?
My test case is something like
1. regex AB\wCD should match with ABZCD plus fail at AB2CD
2. AB\w{2}CD would match ABZZCD
3. AB\d{1,3}CD should match AB555CD or AB6CD or AB77CD plus fail at ABCD or AB9999CD etc
4. AB.* should match AB(followed by anything)
5. AB.*? should fail if input like ABCDCDCD is given for comparison
The first four cases pass when using the matcher.matches() function.
Only the fifth one gives a wrong answer: it reports success even though it should fail.
Thanks in advance
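matches() returns true only if the whole string matches the given pattern. A lazy quantifier merely prefers the shortest match; with matches(), .*? still expands until the entire input is consumed, so AB.*? matches ABCDCDCDCD.
find() tries to find a substring that matches the pattern. With find(), the same pattern stops at AB, which is what the online tool shows.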
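A short sketch of the difference, using the pattern and input from the question. The class name and the firstMatch helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    // Hypothetical helper: returns the first substring that find() locates, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = "AB.*?";
        String input = "ABCDCDCDCD";

        // matches() must consume the whole input, so the lazy part is forced to expand.
        System.out.println(Pattern.compile(regex).matcher(input).matches()); // true

        // find() is satisfied with the shortest substring, here just "AB".
        System.out.println(firstMatch(regex, input)); // AB
    }
}
```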

Java(Apex) RegEx not working?

I am having trouble with a regex in Salesforce Apex. As I saw that Apex uses the same syntax and logic as Java, I aimed this at Java developers as well.
I debugged the String and it is correct. street equals 'str 3 B'.
When using http://www.regexr.com/, the regex works('\d \w$').
The code:
Matcher hasString = Pattern.compile('\\d \\w$').matcher(street);
if(hasString.matches())
My problem is that hasString.matches() resolves to false. Can anyone tell me if I did something wrong somewhere? I tried it without the $, with different casing, etc., and I just can't get it to work.
Thanks in advance!
You need to use find instead of matches for partial input match as matches attempts to match complete input text.
Matcher hasString = Pattern.compile("\\d \\w$").matcher(street);
if(hasString.find()) {
// matched
System.out.println("Start position: " + hasString.start());
}
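If you would rather keep matches(), another option is to let the pattern itself cover the leading text. The sketch below is plain Java, not Apex, and the class and method names are hypothetical; it assumes the street value contains no line breaks.

```java
import java.util.regex.Pattern;

class StreetCheck {
    // The leading .* lets matches() accept any text before the final "digit space letter".
    private static final Pattern ENDS_WITH_NUMBER_AND_LETTER = Pattern.compile(".*\\d \\w");

    // Hypothetical helper for illustration.
    static boolean endsWithDigitSpaceLetter(String street) {
        return ENDS_WITH_NUMBER_AND_LETTER.matcher(street).matches();
    }

    public static void main(String[] args) {
        String street = "str 3 B";
        // The original pattern only covers the tail, so matches() rejects it.
        System.out.println(Pattern.compile("\\d \\w$").matcher(street).matches()); // false
        // find() accepts a partial match.
        System.out.println(Pattern.compile("\\d \\w$").matcher(street).find());    // true
        System.out.println(endsWithDigitSpaceLetter(street));                      // true
    }
}
```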

Java Regex with Pattern and Matcher

I am using the Pattern and Matcher classes from Java.
I am reading a template text and I want to replace:
1. src="scripts/test.js" with src="scripts/test.js?Id=${Id}"
2. src="Servlet?Template=scripts/test.js" with src="Servlet?Id=${Id}&Template=scripts/test.js"
I'm using the code below to handle case 2:
//strTemplateText is the Template's text
Pattern p2 = Pattern.compile("(?i)(src\\s*=\\s*[\"'])(.*?\\?)");
Matcher m2 = p2.matcher(strTemplateText);
strTemplateText = m2.replaceAll("$1$2Id=" + CurrentESSession.getAttributeString("Id", "") + "&");
The above code works correctly for case 2. but how can I create a regex to combine both cases 1. and 2. ?
Thank you
You don't need a regular expression. If you change case 2 to
replace Servlet?Template=scripts/test.js with Servlet?Template=scripts/test.js&Id=${Id}
all you need to do is check whether the source string contains a ?. If it does not, add ?Id=${Id}; otherwise add &Id=${Id}.
After all
if (strTemplateText.contains("?")) {
strTemplateText += "&Id=${Id}";
}
else {
strTemplateText += "?Id=${Id}";
}
does the job.
Or even shorter
strTemplateText += strTemplateText.contains("?") ? "&Id=${Id}" : "?Id=${Id}";
Your actual question doesn't match up so well with your example code. The example code seems to handle a more general case, and it substitutes an actual session Id value instead of a reference to one. The code below takes the example code to be more indicative of what you really want, but the same approach could be adapted to what you asked in the question text (using a simpler regex, even).
With that said, I don't see any way to do this with a single replaceAll() because the replacement text for the two cases is too different. You could nevertheless do it with one regex, in one pass, if you used a different approach:
Pattern p2 = Pattern.compile("(src\\s*=\\s*)(['\"])([^?]*?)(\\?.*?)?\\2",
Pattern.CASE_INSENSITIVE);
Matcher m2 = p2.matcher(strTemplateText);
StringBuffer revisedText = new StringBuffer();
while (m2.find()) {
// Append the whole match except the closing quote
m2.appendReplacement(revisedText, "$1$2$3$4");
// group 4 is the optional query string; null if none was matched
revisedText.append((m2.group(4) == null) ? '?' : '&');
revisedText.append("Id=");
revisedText.append(CurrentESSession.getAttributeString("Id", ""));
// append a copy of the opening quote
revisedText.append(m2.group(2));
}
m2.appendTail(revisedText);
strTemplateText = revisedText.toString();
That relies on BetaRide's observation that query parameter order is not significant, although the same general approach could accommodate a requirement to make Id the first query parameter, as in the question. It also matches the end of the src attribute in the pattern to the correct closing delimiter, which your pattern does not address (though it needs to do to avoid matching text that spans more than one src attribute).
Do note that nothing in the above prevents a duplicate query parameter 'Id' being added; this is consistent with the regex presented in the question. If you want to avoid that with the above approach then in the loop you need to parse the query string (when there is one) to determine whether an 'Id' parameter is already present.
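For completeness, here is a sketch of how the same one-pass approach might place Id first in the query string, as the question asked. It is an adaptation under assumptions: the class and method names are invented, the id is passed in as a parameter instead of being read from CurrentESSession, and duplicate Id parameters are still not detected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcRewriter {
    // Groups: 1 = "src=", 2 = opening quote, 3 = path, 4 = query string without '?' (may be null).
    private static final Pattern SRC = Pattern.compile(
            "(src\\s*=\\s*)(['\"])([^?]*?)(?:\\?(.*?))?\\2",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: inserts Id as the first query parameter of every src attribute.
    static String addIdFirst(String template, String id) {
        Matcher m = SRC.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String rest = m.group(4);
            String url = m.group(3) + "?Id=" + id
                    + (rest == null || rest.isEmpty() ? "" : "&" + rest);
            String attr = m.group(1) + m.group(2) + url + m.group(2);
            // quoteReplacement guards against '$' or backslashes in the id or the URL.
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addIdFirst("src=\"scripts/test.js\"", "42"));
        System.out.println(addIdFirst("src=\"Servlet?Template=scripts/test.js\"", "42"));
    }
}
```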
You can do the following:
//strTemplateText is the Template's text
String strTemplateText = "src=\"scripts/test.js\"";
strTemplateText = "src=\"Servlet?Template=scripts/test.js\"";
java.util.regex.Pattern p2 = java.util.regex.Pattern.compile("(src\\s*=\\s*[\"'])(.*?)((?:[\\w\\s\\d.\\-\\#]+\\/?)+)(?:[?]?)(.*?\\=.*)*(['\"])");
java.util.regex.Matcher m2 = p2.matcher(strTemplateText);
System.out.println(m2.matches());
strTemplateText = m2.replaceAll("$1$2$3?Id=" + CurrentESSession.getAttributeString("Id", "") + (m2.group(4)==null? "":"&") + "$4$5");
System.out.println(strTemplateText);
It works on both cases.
If you are using Java 7 or later, you could use named capturing groups to make the regex more human-readable and easier to debug.
