Catching full string inside regex class - java

I'm currently trying to deal with "leetspeak" in regex. I have a class with a letter, and it will be filled with possible "leet" alternatives in it. However, some of those alternatives are multiple characters long, and I'm having a hard time figuring out how to include those in a class. For example
[kK"|<"]
Now I understand quotation marks don't work like that, but I can't find a way to have this match either k, K, or |< without it matching the | or < individually.
My question is: how can I include a string of characters within a class?
Also, I want to make sure it's treated literally, so I will need to include \Q and \E somewhere in the solution.

You could use a class for both k and K then match |< by itself.
"[kK]|\\|<"
If you want to include \Q and \E:
"[kK]|\\Q|<\\E"

"k|K|\\|<"
The pipe lets you "or" a multi-character string, and escaping a pipe with a backslash lets you include a literal pipe in such a string. If the regex is written as a Java string literal, the backslash must itself be escaped with a second backslash so that a single backslash reaches the regex engine.
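The patterns suggested so far can be checked against the three intended spellings and against a lone | or <. This is only a rough sketch; the class name LeetK and the helper isLeetK are made up for the example.

```java
import java.util.regex.Pattern;

class LeetK {
    // Regex [kK]|\Q|<\E : one k or K, or the literal two-character string |<
    static final Pattern QUOTED = Pattern.compile("[kK]|\\Q|<\\E");
    // Regex k|K|\|< : the same three alternatives with the pipe escaped by hand
    static final Pattern ESCAPED = Pattern.compile("k|K|\\|<");

    // True only if the whole input is one accepted spelling of the letter k
    static boolean isLeetK(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"k", "K", "|<", "|", "<"}) {
            System.out.println(s + " -> " + isLeetK(QUOTED, s) + " / " + isLeetK(ESCAPED, s));
        }
    }
}
```

Both patterns should accept k, K and |< and reject | or < on their own, because matches() requires the whole input to fit one alternative.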

Use this regex:
[kK]|\|<
In Java, you need to escape the backslash, so this becomes
[kK]|\\|<
Option 2: escape the leet
As you suggested yourself, using \\Q some leet \\E lets you match anything without worrying that you may need to escape a special regex character.
Explanation
The character class [kK] matches one char that is either a k or a K
OR |
\|< matches |<
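When the alternation is one letter of a longer word, it has to be wrapped in a group, otherwise the top-level | splits the whole expression in two. A minimal sketch, assuming a hypothetical helper alternatives that quotes each spelling with Pattern.quote (which wraps the text in \Q...\E) and the made-up target word "kick":

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LeetWord {
    // Hypothetical helper: joins literal spellings into a non-capturing group,
    // e.g. alternatives("k", "K", "|<") gives (?:\Qk\E|\QK\E|\Q|<\E)
    static String alternatives(String... spellings) {
        return Arrays.stream(spellings)
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(?:", ")"));
    }

    // Matches "kick" where each k may also be written K or |<
    static final Pattern KICK = Pattern.compile(
            alternatives("k", "K", "|<") + "ic" + alternatives("k", "K", "|<"));

    public static void main(String[] args) {
        System.out.println(KICK.matcher("|<ic|<").matches());
        System.out.println(KICK.matcher("|ick").matches());
    }
}
```

Quoting every alternative, even the single letters, keeps the helper safe for spellings that happen to contain regex metacharacters.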

Related

Matching sequence of unicode value in Java with regular expression

I have a text file that contains some sequence of unicode characters value like
"{"\u0985\u0982\u09b6\u0998\u099f\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u0982\u09b6\u09bf","\u0985\u0982\u09b6\u09be\u0999\u09cd\u0995\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u09a6\u09bf","\u0985\u0982\u09b6\u09be\u09a8\u09cb"}"
I am trying to match and group values inside the quotes using Pattern class in java like below but can not find any match.
Pattern p = Pattern.compile("\"(\\[u]{1}\\w+)+\"");
Example
I would like to find out where the technical error is in my regexp.
Try something more like this:
Pattern p = Pattern.compile("\"(\\\\u[0-9a-f]{4})+\"");
In order to match the string \u you need the regex \\u, and to express that regex as a Java string literal means \\\\u. Following the u there must be exactly four hex digits.
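A small sketch of that pattern in use follows. The class name EscapedUnicode and the method quotedValues are invented for the example. Two changes from the pattern above are assumptions about what is wanted: the capturing group surrounds the whole run of escapes (a repeated group only keeps its last repetition), and upper-case hex digits are accepted too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedUnicode {
    // Four backslashes in Java source give the regex for one literal backslash;
    // it is followed by u and exactly four hex digits, repeated one or more times
    static final Pattern QUOTED = Pattern.compile("\"((?:\\\\u[0-9a-fA-F]{4})+)\"");

    // Returns the content of every quoted run of escapes, without the quotes
    static List<String> quotedValues(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep these as literal six-character escapes,
        // the same text that would be read from the file
        String text = "{\"\\u0985\\u0982\",\"\\u0985\\u09b6\"}";
        System.out.println(quotedValues(text));
    }
}
```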
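First, [u]{1} matches exactly one character from the class [u], so it can be replaced with a plain u.
The real problem is the \\[ in front of it. In a Java string literal, \\[ produces the regex \[, which is an escaped, literal opening bracket rather than a backslash followed by a character class. Your regex therefore looks for a quote, the literal text [u], one or more word characters (the \\w is fine and does mean \w), and a closing quote. Nothing in it matches the backslash that starts each \u0985 in the file.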
Happy coding!
Edit
Try replacing the \\ before the u with \\\\. In a Java string literal, \\ is a single backslash, which the regex engine treats as an escape for the next character; to match a literal backslash the regex needs \\, which is written \\\\ in Java source.

Why do I need two slashes in Java Regex to find a "+" symbol?

Just something I don't understand the full meaning behind. I understand that I need to escape any special meaning characters if I want to find them using regex. And I also read somewhere that you need to escape the backslash in Java if it's inside a String literal. My question though is if I "escape" the backslash, doesn't it lose its meaning? So then it wouldn't be able to escape the following plus symbol?
Throws an error (but shouldn't it work since that's how you escape those special characters?):
replaceAll("\+\s", ""));
Works:
replaceAll("\\+\\s", ""));
Hopefully that makes sense. I'm just trying to understand the functionality behind why I need those extra slashes when the regex tutorials I've read don't mention them. And things like "\+" should find the plus symbol.
There are two "escapings" going on here. The first backslash is to escape the second backslash for the Java language, to create an actual backslash character. The backslash character is what escapes the + or the s for interpretation by the regular expression engine. That's why you need two backslashes -- one for Java, one for the regular expression engine. With only one backslash, Java reports \s and \+ as illegal escape characters -- not for regular expressions, but for an actual character in the Java language.
The idea behind the extra slashes is that the first backslash '\' is the escape for the Java string literal, and the second backslash '\' is the escape for the regex.
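The two layers can be made visible by printing the string that the regex engine actually receives. A minimal sketch; the class name TwoLevels and the method stripPlus are made up for illustration.

```java
class TwoLevels {
    // Removes every "+" that is followed by a whitespace character
    static String stripPlus(String s) {
        // Java source "\\+\\s" -> String value \+\s -> regex: literal plus, then whitespace
        return s.replaceAll("\\+\\s", "");
    }

    public static void main(String[] args) {
        String regex = "\\+\\s";
        System.out.println(regex);          // the text handed to the regex engine
        System.out.println(regex.length()); // number of characters in that text
        System.out.println(stripPlus("1 + 2 + 3"));
    }
}
```

The compiler consumes one backslash of each pair, so the string holds four characters, and the regex engine then uses the remaining backslashes as its own escapes.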

Java escaping regex meta characters and constructs

I am trying to build a regex pattern from a string containing non-metacharacters (%, &) and metacharacters ([, ], {, }, |).
The question is how to identify every character that is a potential metacharacter of Java's Pattern and escape it using "\\", so that I can then replace some of the non-metacharacters with the regex constructs .* or .+
e.g. input string = "%abc&xy[z,p)"
Step 1 output (where I need help identifying and escaping all metacharacters): "%abc&xy\\[z,p\\)"
Step 2 output (where I do the custom character replacement; no help needed here): ".*abc.+xy\\[z,p\\)"
P.S. I don't think Pattern.quote() or Pattern.LITERAL is the answer here. As of now the only option I see is to keep a map of those metacharacters and check each character against it.
The Java regexp patterns can be found here: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.
Take a close look at \Q and \E, which quote a whole stretch of text so that you don't have to escape each metacharacter individually.
If I understand your request correctly, you have a marker such as MYCODE that should turn into .*; the change could then be:
add \Q at the beginning
add \E at the end
replace MYCODE with \E.*\Q
I haven't tested this in Java myself, but it is the same principle as in Perl.
So Match all the {MYCODE open brackets becomes \QMatch all the {\E.*\Q open brackets\E.
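Backslashes inside the quoted block are taken literally, so they need no extra escaping. The one sequence that does need special handling is a literal \E in the input, which would end the quoting early.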
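The three steps above could look roughly like this in Java. The class name MarkerPattern and the method fromTemplate are hypothetical, and the sketch assumes the input never contains the two characters \E (quoting each literal piece with Pattern.quote instead would cope with that case).

```java
import java.util.regex.Pattern;

class MarkerPattern {
    // Quotes the whole input with \Q...\E and reopens the quoting around each
    // marker, so only the marker positions act as regex syntax (.* here).
    // Assumption: the template itself never contains the two characters \E.
    static Pattern fromTemplate(String template, String marker) {
        // String.replace works on literal text, so no regex escaping is needed here
        String regex = "\\Q" + template.replace(marker, "\\E.*\\Q") + "\\E";
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromTemplate("Match all the {MYCODE open brackets", "MYCODE");
        System.out.println(p.pattern());
        System.out.println(p.matcher("Match all the {curly and [square open brackets").matches());
    }
}
```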

Java Regex - Trying to create a specific pattern but I'm not quite sure what to do

In my Java application, I'm trying to specify a Pattern that would match anything that's not made up of either uppercase letters, lowercase letters, or dashes. So I want it to match anything that doesn't contain A-Z, a-z, or '-'. I'm new to using regular expressions so I just wanted to see if I was even close to getting this right. This is what I have:
Pattern.compile("[^A-Z]&&[^a-z]&&[^\\-]");
I'm not even sure if I need the escape characters for the dash or if I do, whether it should be two backslashes instead of one. I'm also not sure about the format overall. Thanks for any help.
Building off of @Joe's answer:
Pattern.compile("[^A-Za-z\\-]");
Note that you need a double backslash in the Java string literal, because the \ that escapes the - must itself be escaped.
You don't need to say "AND NOT", you can just lump them all in together:
Pattern.compile("[^A-Za-z\-]");
With regard to escaping: in the regex itself, a single backslash \ escapes the character immediately after it, so \- is a literal - character and \\ is a literal \ character. In Java source, each of those backslashes has to be doubled again inside the string literal, so the pattern above must be typed as "[^A-Za-z\\-]", and the \\- in your original post already is a correctly escaped hyphen.
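To see the difference between the combined class and the original attempt, both can be run side by side. This is a sketch only; the class name NotLettersOrDash and the method hasInvalidChar are invented for the example.

```java
import java.util.regex.Pattern;

class NotLettersOrDash {
    // One character that is not A-Z, not a-z and not a dash.
    // "\\-" in Java source is the regex \- (an escaped, literal hyphen).
    static final Pattern INVALID = Pattern.compile("[^A-Za-z\\-]");

    // True if the input contains at least one disallowed character
    static boolean hasInvalidChar(String s) {
        return INVALID.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasInvalidChar("Mary-Jane"));
        System.out.println(hasInvalidChar("Mary Jane"));
        // Outside a character class, && is just two literal ampersands
        Pattern original = Pattern.compile("[^A-Z]&&[^a-z]&&[^\\-]");
        System.out.println(original.matcher("1&&2&&3").matches());
    }
}
```

The last check shows why the original pattern misbehaves: the && operator only means intersection inside square brackets, so outside them the pattern expects the ampersands to appear in the input.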

How to undo replace performed by regex?

In java, I have the following regex ([\\(\\)\\/\\=\\:\\|,\\,\\\\]) which is compiled and then used to escape each of the special characters ()/=:|,\ with a backslash as follows escaper.matcher(value).replaceAll("\\\\$1")
So the string "A/C:D/C" would end up as "A\/C\:D\/C"
Later on in the process, I need to undo that replace. That means I need to match the combinations \(, \), \/ etc. and replace each with the character immediately following the backslash. A backslash followed by any other character should not be matched, and there could be cases where a special character appears without the preceding backslash, in which case it shouldn't match either.
Since I know all of the cases I could do something like
myString.replaceAll("\\(", "(").replaceAll("\\)", ")").replaceAll("\\/", "/")...
but I wonder if there is a simpler regex that would allow me to perform the replace for all the special characters in a single step.
That seems pretty straightforward. If this were your original code (excess escapes removed):
Pattern escaper = Pattern.compile("([()/=:|,\\\\])");
String escaped = escaper.matcher(original).replaceAll("\\\\$1");
...the opposite would be:
Pattern unescaper = Pattern.compile("\\\\([()/=:|,\\\\])");
String unescaped = unescaper.matcher(escaped).replaceAll("$1");
If you weren't escaping and unescaping backslashes themselves (as you're doing), you would have problems, but this should work fine.
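A quick round trip with the two patterns above shows the pair undoing each other, including for input that already contains a backslash. The wrapper class EscapeRoundTrip and its method names are made up for this sketch.

```java
import java.util.regex.Pattern;

class EscapeRoundTrip {
    static final Pattern ESCAPER = Pattern.compile("([()/=:|,\\\\])");
    static final Pattern UNESCAPER = Pattern.compile("\\\\([()/=:|,\\\\])");

    // Prefixes each special character with a backslash
    static String escape(String s) {
        return ESCAPER.matcher(s).replaceAll("\\\\$1");
    }

    // Drops the backslash in front of each special character
    static String unescape(String s) {
        return UNESCAPER.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String original = "A/C:D/C";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```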
I don't know the Java regex flavor, but this works with PCRE:
replace \\ followed by ([()/=:|,\\]) with $1
In Perl you can do:
$str =~ s#\\([()/=:|,\\])#$1#g;
