How to escape special characters in the regex ***(.*) - java

I am new to Java. Can somebody help me?
Is there any method available in Java which escapes the special characters in the below regex automatically?
Before escaping: ***(.*) and after escaping: \\*\\*\\*(.*)
I don't want to escape (.*) here.

On the face of it, Pattern.quote appears to do the job.
However, looking at the detail of your question, it appears that you expect to be able to escape some meta-characters and not others. Pattern.quote won't do that if you apply it to the whole string. Rather, it will quote each and every character. (For the record, it doesn't use backslashes. It wraps the string in "\Q" and "\E", which neatly avoids the cost of parsing the string to find characters that need escaping.)
But the real problem is that you haven't said how the quoter should decide which meta-characters to escape and which ones to leave intact. For instance, how does it know to escape the first three '*' characters, but not the "."?
Without a clearer specification, your question is pretty much unanswerable. And even with a specification, there is little chance of finding an easy way to do this.
IMO, a better approach would be to do the escaping before you assemble the pattern from its component parts ... assuming that's what is going on here.
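A minimal sketch of that approach, assuming the pattern is built from a literal prefix ("***") and a regex tail ("(.*)"). The class and method names are made up for illustration. Only the literal part goes through Pattern.quote; the group is appended unquoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteLiteralPart {
    // Quote only the literal prefix, then append the part that must stay a regex.
    static Pattern build(String literalPrefix) {
        return Pattern.compile(Pattern.quote(literalPrefix) + "(.*)");
    }

    // Returns what the (.*) group captured, or null if the input does not match.
    static String capture(String literalPrefix, String input) {
        Matcher m = build(literalPrefix).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(build("***").pattern());   // \Q***\E(.*)
        System.out.println(capture("***", "***abc")); // abc
    }
}
```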

Related

Java regex not matching German "Umlaut" OR underscore

I'm trying to "play around" with some REST APIs and Java code.
As I mainly use German, I have already managed to get the Apache HTTP Client to work with UTF-8 encoding to make sure "Umlaut" characters are handled the right way.
Still I can't get my regex to match my words correctly.
I try to find words/word combinations like "Büro_Licht" from string like ..."type":"Büro_Licht"....
Using the regex ".*?type\":\"(\\w+).*?" returns "B" for me, as it doesn't recognize the "ü" as a word character. That makes sense, as \w is documented as [a-zA-Z_0-9]. For strings with no special characters, such as "Office_Light", I get the full word.
So I tried another hint mentioned in nearly the same question here (on which I could not comment, because I lack the reputation points).
Using the regex ".*?type\":\"(\\p{L}+).*?" returns "Büro" for me. But here again it cuts off at the underscore, for a reason I don't understand.
Is there a nice way to combine both expressions to get the "full" word including underscores and special characters?
If you have to keep using regex, which is not a great tool for parsing JSON, try the character class [\p{L}_]. In your case it would be:
String regex = ".*?type\":\"[\\p{L}_]+\"";
With on-line example: https://regex101.com/r/57oFD5/2
\p{L} matches any kind of letter from any language
_ matches the character _ literally (case sensitive)
This will get hectic if you need to support other languages, whitespace and various other Unicode code points. For example, do you need to support an arbitrary amount of whitespace around the colon? Take a look at this answer on removing emojis; there are many corner cases.
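As a rough sketch of the above, the character class can be wrapped in a capturing group and applied with find(). The class and method names are hypothetical. The second pattern is an alternative not mentioned in the answer: the standard Pattern.UNICODE_CHARACTER_CLASS flag makes \w Unicode-aware, so it also accepts "ü" (and digits).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TypeValue {
    // Letters from any language plus the underscore, captured as group 1.
    static final Pattern LETTERS =
        Pattern.compile("\"type\":\"([\\p{L}_]+)\"");

    // Alternative (an assumption, not from the answer): Unicode-aware \w.
    static final Pattern UNICODE_WORD =
        Pattern.compile("\"type\":\"(\\w+)\"", Pattern.UNICODE_CHARACTER_CLASS);

    static String extract(Pattern p, String json) {
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // \u00FC is "ü"; the escape avoids source-encoding surprises.
        String json = "{\"id\":1,\"type\":\"B\u00FCro_Licht\",\"on\":true}";
        System.out.println(extract(LETTERS, json));
        System.out.println(extract(UNICODE_WORD, json));
    }
}
```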

How to use regex to remove punctuations in a sentence

I am trying to take from a file all the valid words. Valid words are defined as normal characters that can appear like so:
don't won't can't
and I have to ignore commas, periods and exclamation points.
I have gotten the expression to match just letters, but now it won't get words like don't, can't or won't.
This is the expression I am using: "[^A-Za-z]+". I have also tried "\'[^A-Za-z]+", but this breaks and allows all characters. Does anyone have any idea what I can use to get normal words, including don't, won't, can't and similar words?
Thank you very much
[^A-Za-z] would mean anything NOT matching those character ranges! Try this:
[A-Za-z']
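In some languages you may need to escape the single quote, in which case you'll probably need to escape the backslash that escapes it (in Java neither the string literal nor the regex requires this, but it is harmless):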
[A-Za-z\\']
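Another way (using shorthand classes) is: \b[\w']+ (note that \w also matches digits and the underscore).
The following matches letters from any language and excludes digits, though it also accepts '!' and '?':
\b[\p{L}\!\'\?]+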
Here is a very good resource for regular expressions.
http://www.regular-expressions.info/
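A small sketch of the [A-Za-z'] suggestion in Java, assuming the goal is to collect the words rather than split on the separators. The class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // One or more ASCII letters or apostrophes; everything else separates words.
    private static final Pattern WORD = Pattern.compile("[A-Za-z']+");

    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Don't, stop, you, won't, fail]
        System.out.println(words("Don't stop, you won't fail!"));
    }
}
```

Note that a lone apostrophe would also count as a "word" here; filter those out if the input can contain them.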

What is a regular expression for control characters?

I'm trying to match a control character in the form \^c where c is any valid character for control characters. I have this regular expression, but it's not currently working: \\[^][#-z]
I think the problem lies with the fact that the caret character (^) is part of the regular expressions parsing engine.
Match an ASCII text string of the form ^X using the pattern \^. and nothing more. For a string of the form \^X, the pattern is \\\^. instead. You may wish to constrain that dot to [A-Z?@_\[\]^\\], giving \\\^[A-Z?@_\[\]^\\]. The bracketed character class is easier to read as [?\x40-\x5F], hence \\\^[?\x40-\x5F] for a literal BACKSLASH, followed by a literal CIRCUMFLEX, followed by something that turns into one of the valid control characters.
Note that this is the pattern as you would see it printed out, or as you'd read it from a file. It's what you need to pass to the regex compiler. If you have it as a string literal, you must of course double each of those backslashes: "\\\\\\^[?\\x40-\\x5F]". Yes, it is insane looking, but that is because Java does not support regexes directly as Groovy and Scala (or Perl and Ruby) do. Regex work is always easier without the extra bbaacckksslllllaasshheesssssess. :)
If you had real control characters instead of indirect representations of them, you would use \pC for all literal code points with the property GC=Other, or \p{Cc} for just GC=Control.
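For illustration, here is a sketch of the string-literal form in use, matching the textual notation \^X (not real control characters). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class ControlNotation {
    // Regex \\\^[?\x40-\x5F] written as a Java string literal:
    // a literal backslash, a literal caret, then '?' or one of '@' through '_'.
    static final Pattern CTRL = Pattern.compile("\\\\\\^[?\\x40-\\x5F]");

    static boolean isControlNotation(String s) {
        return CTRL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isControlNotation("\\^A")); // true
        System.out.println(isControlNotation("\\^?")); // true
        System.out.println(isControlNotation("\\^a")); // false: lowercase is outside the range
        System.out.println(isControlNotation("^A"));   // false: no leading backslash
    }
}
```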
Check this out: http://www.regular-expressions.info/characters.html . You should be able to use \cA to \cZ to find the control characters.

Java regex help

A string must not include spaces or special characters. Only a-z, A-Z, 0-9, the underscore, and the period characters are allowed.
How do I achieve this?
Update:
All the solutions posted worked for me.
Thanks everyone for helping out.
if (!myString.matches("^[a-zA-Z0-9._]*$")) {
// fail ...
}
or you can use the \w character class (shorthand for [a-zA-Z_0-9])
if (!myString.matches("^[\\w.]*$")) {
// fail ...
}
I am certain that by the time I finish typing this, you will have received your answer. So here is some genuine advice to go with it: take the time (an hour or so) to learn the basics of regular expressions.
You will be surprised how often they show up in solutions to 'real world' problems.
Great testing resource -> http://gskinner.com/RegExr/
A different solution:
text = text.replaceAll("[^\\w.]", "");
It removes the unwanted characters instead of just detecting them.
From Sun's website:
\w A word character: [a-zA-Z_0-9]
"[\\w.]+" should do the trick
You could simply delete all the characters that don't match the set [a-zA-Z0-9_.]. Alternatively you could replace characters not in the set with a valid character (e.g. the underscore). Finally you could altogether reject any string that does not consist solely of characters in the permitted set.
You can either write an "all characters must be one of these" regular expression, or simply ask whether any of the characters you dislike are present at all and, if so, reject the string. I believe the latter will be the easiest to write and understand later.
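The options discussed above (reject versus strip) can be sketched side by side. The class and method names are made up for illustration. As in the answers, the `*` quantifier means an empty string is accepted; use `+` if it should not be.

```java
public class NameCheck {
    // True if every character is a-z, A-Z, 0-9, '_' or '.'.
    // String.matches already anchors at both ends, so ^ and $ are optional.
    static boolean isValid(String s) {
        return s.matches("[\\w.]*");
    }

    // Removes every character that is NOT in the permitted set.
    static String sanitize(String s) {
        return s.replaceAll("[^\\w.]", "");
    }

    public static void main(String[] args) {
        System.out.println(isValid("user_name.01")); // true
        System.out.println(isValid("bad name!"));    // false
        System.out.println(sanitize("bad name!"));   // badname
    }
}
```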

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.
XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.
As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
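A self-contained sketch that runs the pattern above against a made-up sample containing two editable regions. The class name, method name and sample text are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EditableRegion {
    static final Pattern P = Pattern.compile(
        "<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->"
        + "(.*?)"
        + "<!--\\s*</editable>\\s*-->",
        Pattern.DOTALL);

    // Returns "name:body" for each editable region, with the body trimmed.
    static List<String> regions(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(html);
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\n"
            + "<!-- <editable name=\"header\"> -->\n<h1>Title</h1>\n<!-- </editable> -->\n"
            + "<!-- <editable name=\"body\"> -->text<!-- </editable> -->";
        // prints [header:<h1>Title</h1>, body:text]
        System.out.println(regions(html));
    }
}
```

Because the body group is reluctant, each region ends at its own closing comment instead of running on to the last one in the input.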
I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in another post, you might consider going for a simple DOM traversal, or an XPath or XQuery based evaluation process, instead of a plain regular expression. But note that you will still need a regex in the filtering process, because you can find the target comments only by testing their body against a regular expression (as I doubt the body is constant, judging from the sample).
Edit 2:
It might be that leading, trailing or internal whitespace in the comment body makes your regexp fail. Consider putting \s* at the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.
The * quantifier is "greedy" by default, meaning it matches as much as possible while still letting the overall pattern match.
You can change this by using the reluctant form *?, so try:
(\".*?\")
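To make the difference concrete, here is a small comparison on a made-up input with two quoted values. The class and method names are hypothetical; the negated character class suggested in another answer is included for reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns the first substring matched by the regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "name=\"a\" id=\"b\"";
        System.out.println(firstMatch("\".*\"", input));     // "a" id="b"  (greedy)
        System.out.println(firstMatch("\".*?\"", input));    // "a"         (reluctant)
        System.out.println(firstMatch("\"[^\"]*\"", input)); // "a"         (negated class)
    }
}
```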
