Escape special characters using Regex in java [duplicate] - java

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.

Since Java 1.5, yes:
Pattern.quote("$5");
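As a minimal sketch of the difference quoting makes (the class and method names here are made up for illustration), the following compares a quoted and an unquoted "$5":

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Returns true if `text` contains `userInput` as a literal substring,
    // using a regex built from the quoted input.
    static boolean containsLiteral(String text, String userInput) {
        return Pattern.compile(Pattern.quote(userInput)).matcher(text).find();
    }

    public static void main(String[] args) {
        // Quoted: "$5" is matched character for character.
        System.out.println(containsLiteral("It costs $5 today", "$5")); // true
        // Unquoted: "$" means end of input, so a "5" after it can never match.
        System.out.println(Pattern.compile("$5").matcher("It costs $5 today").find()); // false
    }
}
```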

The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
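To make the two roles concrete, here is a small sketch (the helper name replaceFirstLiteral is hypothetical): Pattern.quote protects the search side, where characters such as ( and + are regex syntax, while Matcher.quoteReplacement protects the replacement side, where $ and \ are special.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBothSides {
    // Replaces the first literal occurrence of `target` with the literal `replacement`.
    static String replaceFirstLiteral(String s, String target, String replacement) {
        return s.replaceFirst(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // "(a+b)" would otherwise be a group with a quantifier,
        // and "$1.50" would otherwise start with a group reference.
        System.out.println(replaceFirstLiteral("Total: (a+b)", "(a+b)", "$1.50")); // Total: $1.50
    }
}
```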

It may be too late to respond, but you can also use Pattern.LITERAL, which makes the pattern treat all special characters as literal text:
Pattern.compile(textToFormat, Pattern.LITERAL);
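A short sketch of how the flag behaves, assuming a hypothetical helper named literal. According to the Pattern javadoc, CASE_INSENSITIVE still applies when combined with LITERAL, which Pattern.quote alone does not give you:

```java
import java.util.regex.Pattern;

class LiteralFlagDemo {
    // Compiles the user's text as a literal, optionally ignoring case.
    static Pattern literal(String text, boolean ignoreCase) {
        int flags = Pattern.LITERAL | (ignoreCase ? Pattern.CASE_INSENSITIVE : 0);
        return Pattern.compile(text, flags);
    }

    public static void main(String[] args) {
        Pattern p = literal("a.b*", false);
        System.out.println(p.matcher("xa.b*y").find()); // true: the exact text is present
        System.out.println(p.matcher("aXbbb").find());  // false: "." and "*" are not wildcards
        System.out.println(literal("$5 usd", true).matcher("Only $5 USD").find()); // true
    }
}
```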

I think what you're after is \Q$5\E. Also see Pattern.quote(s), introduced in Java 5.
See the Pattern javadoc for details.
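One reason to write \Q...\E by hand is to mix a literal segment with real regex syntax in the same pattern. A minimal sketch, with an invented price format as the example:

```java
import java.util.regex.Pattern;

class QuotedSegmentDemo {
    // A literal "$5", followed by a regex part: a dot and exactly two digits.
    static final Pattern PRICE = Pattern.compile("\\Q$5\\E\\.\\d{2}");

    static boolean isFiveDollarPrice(String s) {
        return PRICE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("$5"));         // \Q$5\E
        System.out.println(isFiveDollarPrice("$5.99"));  // true
        System.out.println(isFiveDollarPrice("$5x99"));  // false
    }
}
```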

First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, it won't put a literal $1 in the output. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
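The failure and the fix can be reproduced in a few lines. This is only a sketch; the class and method names are invented, and the message template stands in for the .properties string:

```java
import java.util.regex.Matcher;

class ReplacementEscaping {
    // Substitutes the user's text literally, whatever characters it contains.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // True if the unquoted replacement fails because of a missing group.
    static boolean failsUnquoted(String msg, String userInput) {
        try {
            msg.replaceAll("<userInput />", userInput);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true; // e.g. "No group 3"
        }
    }

    public static void main(String[] args) {
        String msg = "You entered: <userInput />";
        System.out.println(failsUnquoted(msg, "$3 off")); // true
        System.out.println(fill(msg, "$3 off"));          // You entered: $3 off
    }
}
```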

To build a protected pattern, you can put a backslash in front of every character that is not a letter or a digit. After that, you can insert your own special symbols into the protected pattern, so that it works not like plain quoted text but like a real pattern of your own design, with none of the user's characters treated as special.
public class Test {
    public static void main(String[] args) {
        String str = "y z (111)";
        String p1 = "x x (111)";
        String p2 = ".* .* \\(111\\)";
        // Escape everything first, then turn the placeholder "x" into ".*"
        p1 = escapeRE(p1);
        p1 = p1.replace("x", ".*");
        System.out.println( p1 + "-->" + str.matches(p1) );
        //.*\ .*\ \(111\)-->true
        System.out.println( p2 + "-->" + str.matches(p2) );
        //.* .* \(111\)-->true
    }
    public static String escapeRE(String str) {
        //Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
        //return escaper.matcher(str).replaceAll("\\\\$1");
        return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
    }
}

Pattern.quote("blabla") works nicely.
Pattern.quote() encloses the text in "\Q" and "\E", and it also takes care of any "\E" that already occurs in the text.
However, if you need to do a real regular expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
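A quick way to check any quoting or escaping scheme is a round trip: the escaped form, used as a regex, should match the original text exactly. The sketch below (names are illustrative) applies that check to Pattern.quote, including input that already contains \Q and \E:

```java
import java.util.regex.Pattern;

class QuoteRoundTrip {
    // True if the quoted form of `s`, compiled as a regex, matches `s` itself.
    static boolean roundTrips(String s) {
        return Pattern.matches(Pattern.quote(s), s);
    }

    public static void main(String[] args) {
        String tricky = "Some\\E/s/wText*/,**"; // contains a literal \E
        System.out.println(Pattern.quote(tricky));
        System.out.println(roundTrips(tricky));        // true
        System.out.println(roundTrips("\\Q$5\\E"));    // true
    }
}
```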

The ^ (negation) symbol, placed at the start of a character class, matches any character that is not in the class.
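For illustration, a small sketch (method names invented) showing that ^ negates only inside square brackets; outside them it is the start-of-input anchor:

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // [^0-9] matches every character that is NOT a digit, so removing
    // those matches keeps only the digits.
    static String digitsOnly(String s) {
        return s.replaceAll("[^0-9]", "");
    }

    // Outside a character class, ^ anchors the match to the start of input.
    static boolean startsWithDigit(String s) {
        return Pattern.compile("^[0-9]").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("(555) 123-4567"));   // 5551234567
        System.out.println(startsWithDigit("5 apples"));     // true
        System.out.println(startsWithDigit("apples 5"));     // false
    }
}
```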

Related

RegEx for matching between any two HTML tags

I have the following content :
<div class="TEST-TEXT">hi</span>
first young CEO's TEST-TEXT
<span class="test">hello</span>
I am trying to match the TEST-TEXT string to replace its value, but only when it appears as text and not within an attribute value.
I have looked at look-ahead and look-behind in regex, but the issue is that look-behind needs a fixed width for the match. The question regex-match-all-characters-between-two-html-tags shows a very similar case, except that there the span has a class to anchor the match.
I also checked the question regex-match-attribute-in-a-html-code.
Here are the two regular expressions I have tried:
\"([^"]*)\"
(?s)(?<=<([^{]*)>)(.+?)(?=</.>)
Neither works for me; you can try them at [https://regex101.com/r/ApbUEW/2].
Expected behavior: it matches the string only when it is text.
Current behavior: it matches both cases.
Edit: I want the text to be dynamic and not specific to TEST-TEXT.
Something like this should help:
\>([^"<]*)\<
EDIT:
Without open and close tags included:
(?<=\>)([^"<]*)(?=\<)
Try TEST-TEXT(?=<\/a>)
TEST-TEXT matches the literal text TEST-TEXT.
(?=<\/a>) is a look-ahead that checks for the closing tag </a>.
See it at regex101.
Here, we might add a soft boundary on the right of the desired output, which you are already doing, then a character list for the desired output, and then make the replacement using capturing groups (). Maybe something similar to this:
([A-Z-]+)(<\/)
Demo
This snippet is just to show that the expression might be valid:
const regex = /([A-Z-]+)(<\/)/gm;
const str = `<div class="TEST-TEXT">hi</span><a href=\\"https://en.wikipedia.org/wiki/TEST-TEXT\\">first young CEO's
TEST-TEXT</a><span class="test">hello</span><div class="TEST-TEXT">hi</span><a href=\\"https://en.wikipedia.org/wiki/TEST-TEXT\\">first young CEO's
TEST-TEXT</a><span class="test">hello</span>`;
const subst = `NEW-TEXT$2`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
RegEx Circuit
jex.im also helps to visualize the expressions.
Maybe this will help?
String html = "<div class=\"TEST-TEXT\">hi</span>\n" +
"first young CEO's TEST-TEXT\n" +
"<span class=\"test\">hello</span>";
Pattern pattern = Pattern.compile("(<)(.*)(>)(.*)(TEST-TEXT)(.*)</.*>");
Matcher matcher = pattern.matcher(html);
while (matcher.find()){
System.out.println(matcher.group(5));
}
A regex that matches a string only when it sits between HTML tags, not inside one:
(?![^<>]*>)(TEST\-TEXT)
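A sketch of how that look-ahead could be used from Java with a dynamic search text, as the question asks. The helper name replaceOutsideTags is hypothetical, and the approach assumes the markup has no stray < or > characters in text or attribute values; for anything beyond simple fragments an HTML parser is the safer tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextOnlyReplace {
    // Replaces `text` only where the next angle bracket is not a closing ">",
    // i.e. where the occurrence is not inside a tag such as an attribute value.
    static String replaceOutsideTags(String html, String text, String replacement) {
        Pattern p = Pattern.compile("(?![^<>]*>)" + Pattern.quote(text));
        return p.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<div class=\"TEST-TEXT\">hi</div>\n"
                + "first young CEO's TEST-TEXT\n"
                + "<span class=\"test\">hello</span>";
        // Only the occurrence in the text line changes; the class attribute is kept.
        System.out.println(replaceOutsideTags(html, "TEST-TEXT", "NEW-TEXT"));
    }
}
```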

Replace repeated xml tags value using regex

Input -
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
I tried the following to mask the values inside the tags:
String op = ipXmlString.replaceAll("<accntNo>(.+?)</accntNo>", "######");
But the above code replaces the whole element, tags included:
<root><accntNoGrp>######</accntNoGrp><accntNoGrp>######</accntNoGrp></root>
Expected Output:
<root><accntNoGrp><accntNo>#####67</accntNo></accntNoGrp><accntNoGrp><accntNo>#####23</accntNo></accntNoGrp></root>
How can I achieve this using Java regex? Could someone help?
Your replacement is wrong, you need to include the <accntNo> tag in the actual replacement. Also, it appears that you want to show the last two characters/numbers of the account number. In this case, we can capture this information during the match and use it in the replacement.
Code:
String op = ipXmlString.replaceAll("<accntNo>(?:.+?)(.{2})</accntNo>", "<accntNo>######$1</accntNo>");
Explanation:
<accntNo> match an opening tag
(?:.+?) match, but do not capture, anything up until the first
(.{2}) two characters before closing tag (and capture this)
</accntNo> match a closing tag
Note here that by using ?: inside a parenthesis in the pattern, we tell the regex engine to not capture it. There is no point in capturing anything before the last two characters of the account number because we don't want to use it.
The $1 quantity in the replacement refers to the first capture group. In this case, it is the last two characters of the account number. Hence, we build the replacement string you want this way.
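If the number of mask characters should follow the length of the hidden part (the expected output in the question shows five # for a seven-digit number), a fixed replacement string is not enough. One possible variant, sketched here with an invented class name, captures the hidden prefix as well and builds each replacement in a find() loop:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccountMasker {
    // Group 1: the part to hide; group 2: the last two characters to keep.
    private static final Pattern ACCOUNT =
            Pattern.compile("<accntNo>(.*?)(.{2})</accntNo>");

    static String mask(String xml) {
        Matcher m = ACCOUNT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder masked = new StringBuilder("<accntNo>");
            for (int i = 0; i < m.group(1).length(); i++) {
                masked.append('#'); // one '#' per hidden character
            }
            masked.append(m.group(2)).append("</accntNo>");
            m.appendReplacement(out, Matcher.quoteReplacement(masked.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
                + "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
                + "</root>";
        System.out.println(mask(xml));
    }
}
```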
Demo here:
Rextester
Try this code:
public static void main(String[] args) {
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
String replaceAll = ipXmlString.replaceAll("\\d+", "######");
System.out.println(replaceAll);
}
Prints:
<root><accntNoGrp><accntNo>######</accntNo></accntNoGrp><accntNoGrp><accntNo>######</accntNo></accntNoGrp></root>

make .replaceFirst() start after a specific character

Is there any way I can make .replaceFirst() start replacing only after a specific string? I know that regex doesn't do well with HTML, and I have HTML text consisting of one h2 heading and one paragraph.
The keywords I replace using my software work flawlessly; however, sometimes the keywords are also replaced within the title. Is there any way to make Java start replacing AFTER the very first
</h2>
string?
If you want a regex solution (so that it makes no difference if you use replaceFirst() or replaceAll()), I can suggest using capture groups:
(?s)(<\/h2.+)\b(keyword)\b(?=.*<\/h2>.*$)
String regex = "(?s)(<\\/h2.+)\\b(keyword)\\b(?=.*<\\/h2>.*$)";
Replace the "keyword" with your word, and use "$1[replacement_keyword]" as a replacement string.
Here is a code example:
String input = "<title>Replacing keywords with keyword</title>\n"+
"<body>\n"+
"<h2>Titles</h2>\n"+
"<p>Par with keywords and keyword</p>\n"+
"<h2>Titles</h2>\n"+
"<p>Par with keywords and keyword</p>\n"+
"</body>";
String regex = "(?s)(<\\/h2.+)\\b(keyword)\\b(?=.*<\\/h2>.*$)";
String keytoreplacewith = "NEW_COOL_KEYWORD";
String output = input.replaceFirst(regex, "$1"+keytoreplacewith);
System.out.println(output);
Output:
<title>Replacing keywords with keyword</title>
<body>
<h2>Titles</h2>
<p>Par with keywords and NEW_COOL_KEYWORD</p>
<h2>Titles</h2>
<p>Par with keywords and keyword</p>
</body>
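A different technique from the regex above, shown only as a sketch: split the input at the first </h2> with indexOf and run replaceFirst() on the tail alone. The helper name replaceFirstAfter and its parameters are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAfterMarker {
    // Replaces the first whole-word `keyword` that occurs after the first `marker`.
    static String replaceFirstAfter(String input, String marker, String keyword, String replacement) {
        int idx = input.indexOf(marker);
        if (idx < 0) {
            return input; // no marker: leave the text untouched
        }
        int split = idx + marker.length();
        String head = input.substring(0, split);
        String tail = input.substring(split);
        String regex = "\\b" + Pattern.quote(keyword) + "\\b";
        return head + tail.replaceFirst(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String input = "<h2>keyword title</h2>\n<p>Par with keywords and keyword</p>";
        // The heading is untouched; "keywords" is skipped by the word boundaries.
        System.out.println(replaceFirstAfter(input, "</h2>", "keyword", "NEW_COOL_KEYWORD"));
    }
}
```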

Regex for removing part of a line if it is preceded by some word in Java

There's a properties language bundle file:
label.username=Username:
label.tooltip_html=Please enter your username.</center></html>
label.password=Password:
label.tooltip_html=Please enter your password.</center></html>
How can I match all lines that have both "_html" and "</center></html>", in that order, and replace them with the same line minus the trailing "</center></html>"? For example, the line:
label.tooltip_html=Please enter your username.</center></html>
should become:
label.tooltip_html=Please enter your username.
Note: I would like to do this replacement using an IDE (IntelliJ IDEA, Eclipse, NetBeans...)
Since you clarified that this regex is to be used in the IDE, I tested this in Eclipse and it works:
FIND:
(_html.*)</center></html>
REPLACE WITH:
$1
Make sure you turn on the Regular expressions switch in the Find/Replace dialog. This will match any string that contains _html.* (where the .* greedily matches any string not containing newlines), followed by </center></html>. It uses (…) brackets to capture what was matched into group 1, and $1 in the replacement substitutes in what group 1 captured.
This effectively removes </center></html> if that string is preceded by _html in that line.
If there can be multiple </center></html> in a line, and they are all to be removed if there's a _html to their left, then the regex will be more complicated, but it can be done in one regex with the \G continuing anchor if absolutely necessary.
Variations
Speaking more generally, you can also match things like this:
(delete)this part only(please)
This now creates 2 capturing groups. You can match strings with this pattern and replace with $1$2, and it will effectively delete this part only, but only if it's preceded by delete and followed by please. These subpatterns can be more complicated, of course.
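In Java the same idea might look like the sketch below (class and method names are illustrative). The second method uses look-behind and look-ahead instead of capturing groups, which is an alternative not taken from the answer above; both give the same result for this fixed text.

```java
class DeleteBetween {
    // Keeps groups 1 and 2 and drops whatever was matched between them.
    static String withGroups(String s) {
        return s.replaceAll("(delete)this part only(please)", "$1$2");
    }

    // Alternative: assert the surrounding text without consuming it.
    static String withLookaround(String s) {
        return s.replaceAll("(?<=delete)this part only(?=please)", "");
    }

    public static void main(String[] args) {
        System.out.println(withGroups("deletethis part onlyplease"));     // deleteplease
        System.out.println(withLookaround("deletethis part onlyplease")); // deleteplease
        System.out.println(withGroups("keep this part only"));            // keep this part only
    }
}
```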
if (line.contains("_html=")) {
line = line.replace("</center></html>", "");
}
No regExp needed here ;) (edit) as long as all lines of the property file are well formed.
String s = "label.tooltip_html=Please enter your password.</center></html>";
Pattern p = Pattern.compile("(_html.*)</center></html>");
Matcher m = p.matcher(s);
System.out.println(m.replaceAll("$1"));
Try something like this:
Pattern p = Pattern.compile(".*(_html).*</center></html>");
Matcher m = p.matcher(input_line); // get a matcher object
String output = input_line;
if (m.matches()) {
output = input_line.replace("</center></html>", "");
}
/^(.*)<\/center><\/html>/
finds you the
label.tooltip_html=Please enter your username.
part. then you can just put the string together correctly.

java email extraction regular expression?

I would like a regular expression that will extract email addresses from a String (using Java regular expressions).
That really works.
Here's the regular expression that really works.
I've spent an hour surfing on the web and testing different approaches,
and most of them didn't work although Google top-ranked those pages.
I want to share with you a working regular expression:
[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})
Here's the original link:
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
I had to add some dashes to allow for them. So a final result in Javanese:
final String MAIL_REGEX = "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})";
Install this regex tester plugin into Eclipse, and you'd have a whale of a time testing regex:
http://brosinski.com/regex/.
Points to note:
In the plugin, use only one backslash for character escape. But when you transcribe the regex into a Java/C# string you would have to double them as you would be performing two escapes, first escaping the backslash from Java/C# string mechanism, and then second for the actual regex character escape mechanism.
Surround the sections of the regex whose text you wish to capture with parentheses (round brackets). Then, you could use the group functions in Java or C# regex to find out the values of those sections.
([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
For example, using the above regex, the following string
abc.efg@asdf.cde
yields
start=0, end=16
Group(0) = abc.efg@asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde
Group 0 is always the capture of whole string matched.
If you do not enclose any section in parentheses, you would only be able to detect a match but not be able to capture the text.
It might be less confusing to create a few regex than one long catch-all regex, since you could programmatically test one by one, and then decide which regexes should be consolidated. Especially when you find a new email pattern that you had never considered before.
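The points above can be sketched in Java as follows. The class name is invented, and the pattern is written with @ as the separator between the local part and the domain. Note how each single backslash from the regex tester becomes a double backslash in the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailGroupsDemo {
    // Regex tester form: ([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
    static final Pattern EMAIL = Pattern.compile(
            "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\\.[A-Za-z0-9]+)");

    // Returns group 0 (whole match) followed by groups 1..n, or an empty array.
    static String[] groups(String input) {
        Matcher m = EMAIL.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("abc.efg@asdf.cde");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Group(" + i + ") = " + g[i]);
        }
    }
}
```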
a little late but ok.
Here is what I use. Just paste it into the Firebug console and run it. Then look on the web page for a textarea (most likely at the bottom of the page); it will contain a comma-separated list of all email addresses found in A tags.
var jquery = document.createElement('script');
jquery.setAttribute('src', 'http://code.jquery.com/jquery-1.10.1.min.js');
document.body.appendChild(jquery);
var list = document.createElement('textarea');
list.setAttribute('id', 'emaillist');
document.body.appendChild(list);
var lijst = "";
$("#emaillist").val("");
$("a").each(function(idx,el){
var mail = $(el).filter('[href*="@"]').attr("href");
if(mail){
lijst += mail.replace("mailto:", "")+",";
}
});
$("#emaillist").val(lijst);
On Android, the built-in email address pattern (android.util.Patterns.EMAIL_ADDRESS) works perfectly:
public static List<String> getEmails(@NonNull String input) {
List<String> emails = new ArrayList<>();
Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
while (matcher.find()) {
int matchStart = matcher.start(0);
int matchEnd = matcher.end(0);
emails.add(input.substring(matchStart, matchEnd));
}
return emails;
}
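Outside Android there is no Patterns class in the standard library, so a plain-Java version needs its own pattern. The sketch below reuses the MAIL_REGEX idea from the earlier answer (written with @); the class name is invented, and the pattern is a pragmatic approximation rather than a full implementation of the email address grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainJavaEmailExtractor {
    // Assumption: this simple pattern is good enough for the addresses expected.
    private static final Pattern EMAIL = Pattern.compile(
            "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})");

    // Collects every substring of `input` that matches the pattern.
    static List<String> getEmails(String input) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(input);
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        System.out.println(getEmails("Contact alice@example.com or bob.smith@mail.example.org."));
    }
}
```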
