unicode regex pattern not working - java

I am trying to match a sequence of Unicode characters:
Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);
but doing so gives this exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
\u05[dDeE][0-9a-fA-F]+
^
How can I still use regex metacharacters (like "[") when matching Unicode?
EDIT:
I'm trying to parse some text. Somewhere in the text there is a sequence of Unicode characters whose code point range I know.
Edit2:
I am now using ranges instead : [\\u05d0-\\u05ea]{2,} but still can't match the text above
Edit3:
ok, now it's working, the problem was I used two backslashes instead of one, both in the regex and text.
The solution for this is, assuming I know there will be two chars or more:
[\u05d0-\u05ea]{2,}

Here is what is causing the exception:
\\u05[dDeE][0-9a-fA-F]{2,}
^^^^
The Java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN, so it throws an exception: \u requires four hexadecimal digits after it and there are only two of them, namely 05. You would need to change it to \\u0005 if that is what you actually want.
On the other hand, if you want to match \\u in the target string, then you need to quadruple each backslash, writing \\\\ for every \, so to match \\u you need \\\\\\\\u:
\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}
Finally, if you want to match those Unicode code points literally in your target string then you need to modify our last expression a bit like this:
(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
Edit: Since there is only one backslash in your target string then your regular expression should be:
(?:\\\\u05[dDeE][0-9a-fA-F]){2,}
This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string
<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
Edit 2: If you want to match literal \u05db\u05d3\u05d5\u05e8\u05d2\u05dc then you can't use a range.
On the other hand, if you want to match Unicode code points between 05d0 and 05df then you can use:
(?:[\\u05d0-\\u05df]){2,}
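The difference between the two cases above can be sketched in Java. This is only an illustration: the class name EscapedVersusReal and the helper firstMatch are made up for the example, and the sample strings are shortened versions of the text in the question. One pattern looks for escape sequences written out as text (a backslash, a u, four hex digits); the other looks for the actual characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative, not from the original thread.
class EscapedVersusReal {
    // Runs of two or more *textual* escapes: a backslash, the letter u, then 05dX or 05eX.
    static final Pattern ESCAPED = Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    // Runs of two or more real characters in the range 05d0 to 05ea.
    static final Pattern REAL = Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Doubled backslashes in the source: the text contains the escapes spelled out.
        String escapedText = "\\n \\u05db\\u05d3\\u05d5\\n";
        // Single backslashes in the source: the text contains the characters themselves.
        String realText = "\n \u05db\u05d3\u05d5\n";

        System.out.println(firstMatch(ESCAPED, escapedText));
        System.out.println(firstMatch(REAL, realText));
        System.out.println(firstMatch(REAL, escapedText)); // no real characters in this text
    }
}
```

Which pattern you need depends only on what the text really contains, so it can help to print the text's length or its individual characters before choosing.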

It's not clear what you're trying to do. If your goal is to simplify matching a range of Unicode characters, then you need to realize that the hex digits are completely case insensitive, and so your a-fA-F is redundant, even if you could split character literals. Try this to match all Unicode characters in the range:
[\\u05d0-\\u05ef]
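A small sketch of the point about hex digits, assuming the intended range is 05d0 to 05ef (the span the asker's original pattern was aiming at). The class name HexCaseDemo is made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
class HexCaseDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Upper-case hex digits in the regex escape still name the same character.
        System.out.println(matches("\\u05DB", "\u05db"));
        // One range covers every code point from 05d0 to 05ef.
        System.out.println(matches("[\\u05d0-\\u05ef]+", "\u05db\u05d3\u05d5"));
        // Plain ASCII text is outside the range.
        System.out.println(matches("[\\u05d0-\\u05ef]+", "abc"));
    }
}
```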

It looks like you have unnecessary \\ in your input string. The following works, replacing your specified Unicode character range with a regex:
String text = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>\n<br style=\\";
System.out.println(text.replaceAll("[\u05d0-\u05ea]{2,}", "###"));
OUTPUT:
###
</span>
Note that in your input text you had \\n and \\u05db etc., which I have fixed.

Related

Regex-How to prevent repeated special characters?

I don't have any experience with regular expressions. I need a regular expression that doesn't allow special characters (+-*/& etc.) to be repeated.
The string can contain digits, alphanumerics, and special characters.
This should be valid : abc,df
This should be invalid : abc-,df
I would really appreciate it if you could help me! Thanks in advance.
The two solutions presented so far match a string that is not allowed.
But the title is "How to prevent...", so I assume that the regex should match the allowed string. That means the regex should:
match the whole string if it does not contain 2 consecutive special characters,
not match otherwise.
You can achieve this by putting together the following parts:
^ - start of string anchor,
(?!.*[...]{2}) - a negative lookahead for 2 consecutive special characters (marked here as ...), in any place,
.+ - a regex matching the whole (non-empty) string,
$ - end of string anchor.
So the whole regex should be:
^(?!.*[!@#$%^&*()\-_+={}[\]|\\;:'",<.>\/?]{2}).+$
Note that within a char class (between [ and ]) a backslash must be placed before each of these characters to escape it: - (if in the middle of the sequence), the closing square bracket, a backslash itself, and / (the regex terminator).
Or, if you want to apply the regex to individual words (not the whole string), the regex should be:
\b(?!\S*[!@#$%^&*()\-_+={}[\]|\\;:'",<.>\/?]{2})\S+
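Here is a minimal Java sketch of the whole-string version, under a few assumptions: the set of "special" characters is a guess at what the asker wants, and the class and method names (NoRepeatedSpecials, isValid) are made up for the example. One Java-specific detail: inside a Java character class an unescaped [ starts a nested class, so both square brackets are escaped here, and / needs no escaping because Java has no regex terminator.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the chosen set of special characters is an assumption.
class NoRepeatedSpecials {
    // Character class of special characters, with -, [, ] and the backslash escaped.
    private static final String SPECIALS = "[!@#$%^&*()\\-_+={}\\[\\]|\\\\;:'\",<.>/?]";

    // Non-empty string with no two consecutive special characters anywhere in it.
    private static final Pattern VALID = Pattern.compile("^(?!.*" + SPECIALS + "{2}).+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc,df"));  // single special character
        System.out.println(isValid("abc-,df")); // two special characters in a row
    }
}
```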
[\,\+\-\*\/\&]{2,} Add more characters in the square bracket if you want.
Demo https://regex101.com/r/CBrldL/2
Use the following regex to match the invalid string.
[^A-Za-z0-9]{2,}
[^\w!\s]{2,} This would be the shortest version to match any two consecutive special characters (ignoring spaces).
If you want to count spaces as well, please use [^\w]{2,}
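The "match the invalid string" approach can be sketched in Java with find() instead of matches(). This is a rough illustration using [^\w\s]{2,}; the names RepeatedSpecialFinder and hasRepeatedSpecials are made up. Keep in mind that \w includes the underscore, so a run of underscores is not reported by this version.

```java
import java.util.regex.Pattern;

// Hypothetical helper; reports whether a string contains a run of special characters.
class RepeatedSpecialFinder {
    // Two or more consecutive characters that are neither word characters nor whitespace.
    private static final Pattern REPEATED = Pattern.compile("[^\\w\\s]{2,}");

    static boolean hasRepeatedSpecials(String s) {
        // find() looks for the run anywhere in the string.
        return REPEATED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedSpecials("abc,df"));
        System.out.println(hasRepeatedSpecials("abc-,df"));
    }
}
```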

Matching sequence of unicode value in Java with regular expression

I have a text file that contains sequences of Unicode character values like
"{"\u0985\u0982\u09b6\u0998\u099f\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u0982\u09b6\u09bf","\u0985\u0982\u09b6\u09be\u0999\u09cd\u0995\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u09a6\u09bf","\u0985\u0982\u09b6\u09be\u09a8\u09cb"}"
I am trying to match and group values inside the quotes using Pattern class in java like below but can not find any match.
Pattern p = Pattern.compile("\"(\\[u]{1}\\w+)+\"");
Example
What I would really like to find out is where the technical error in my regexp is.
Try something more like this:
Pattern p = Pattern.compile("\"(\\\\u[0-9a-f]{4})+\"");
In order to match the string \u you need the regex \\u, and to express that regex as a Java string literal means \\\\u. Following the u there must be exactly four hex digits.
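As a rough sketch of how this pattern might be used, assuming the file really contains the escapes spelled out as text (a backslash, a u and four hex digits): the example below wraps the repeated part in an outer capturing group, because a repeated group on its own only keeps its last repetition. The class EscapedWordExtractor and its methods are made up for illustration, and the sample input is a shortened version of the text in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the escapes appear in the text as literal characters.
class EscapedWordExtractor {
    // A quoted run of textual escapes; group 1 captures the whole run between the quotes.
    private static final Pattern QUOTED =
            Pattern.compile("\"((?:\\\\u[0-9a-fA-F]{4})+)\"");

    // Finds every quoted run and decodes it into real characters.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            words.add(decode(m.group(1)));
        }
        return words;
    }

    // Each escape is 6 characters long: backslash, u, then 4 hex digits.
    static String decode(String escapes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < escapes.length(); i += 6) {
            sb.append((char) Integer.parseInt(escapes.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "{\"\\u0985\\u0982\",\"\\u0985\\u0982\\u09b6\"}";
        System.out.println(extract(text));
    }
}
```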
First, this bit [u]{1} means that you want to match one value from the list exactly once, so you can replace it with simply u. (As written, the \\ in front of it actually escapes the [, so your regex looks for a literal [u] in the text.)
Once that is done, your regex is meant to match a quote, a backslash, then a u, then one or more word characters, then a quote. The \\w part is fine, but there are too few backslashes before the u to match a literal backslash.
Happy coding!
Edit
Try replacing the \\ before the u with \\\\. The Java string \\u becomes the regex \u, which starts a Unicode escape instead of matching a literal backslash, and that breaks the regex.

Regex: Illegal hexadecimal escape sequence

I am new to regex. I wrote the following regex to check phone numbers in JavaScript: ^[0-9\+\-\s\(\)\[\]\x]*$
Now I am trying to do the same thing in Java using the following code:
public class testRegex {
    public static void main(String[] args) {
        String regex = "^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$";
        String phone = "98650056";
        System.out.println(phone.matches(regex));
    }
}
However, I always get the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal hexadecimal escape sequence near index 21
^[0-9\+\-\s\(\)\[\]\x]*$
Please advise.
Since you are trying to match what I assume is a literal x (as in a phone extension), it should not be escaped at all; otherwise \x is interpreted as a hexadecimal escape code. Because \x here is not followed by the hex digits it requires, it's an error. (Four backslashes would also compile, but then the class matches a literal backslash as well as x.)
Java string   What the regex sees
[\\x]         \x, the start of a hex escape (\xhh or \x{h...h})
[\\\\x]       a literal backslash or x
[x]           x
So the pattern becomes:
String regex="^[-0-9+()\\s\\[\\]x]*$";
Escaped Alternatives:
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]x]*$";
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]\\\\x]*$";
You have waaaay too many back slashes!
Firstly, to code a literal backslash in java, you must write two of them.
Secondly, most characters lose their special regex meaning when in a character class.
Thirdly, \x introduces a hex literal - you don't want that.
Write your regex like this:
String regex="^[0-9+\\s()\\[\\]x-]*$";
Note how you don't need to escape the hyphen in a character class when it appears first or last.
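A short Java sketch that puts the simplified pattern next to the original one. The class PhonePattern and its helper methods are made up for the example, and the second sample phone number is invented.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; names and sample numbers are illustrative only.
class PhonePattern {
    // Digits, plus, whitespace, parentheses, square brackets, x, and a trailing hyphen.
    static final Pattern PHONE = Pattern.compile("^[0-9+\\s()\\[\\]x-]*$");

    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    // True if the Java regex compiler rejects the given pattern source.
    static boolean failsToCompile(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPhone("98650056"));
        System.out.println(isPhone("+1 (555) 010-9999 x12"));
        // The pattern from the question: the escaped x is read as an incomplete hex escape.
        System.out.println(failsToCompile("^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$"));
    }
}
```

Note that the * quantifier also accepts the empty string; use + instead if at least one character is required.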

How to spot * in regular expressions?

I want to spot and delete all lines that have *** in them. How can I do this?
I tried to use regex but got
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 6
Here is my regular expression: (?m)^**.*.
.........text...........
***..........text....... //want to delete this line
........................
The * character in a regular expression has a special meaning. To tell Pattern that you don't intend this special meaning, you have to "escape" it. The easiest way to do that is to put your expression through Pattern.quote().
For example:
String searchFor = Pattern.quote("***");
Then use that string to search.
Note that * is a special character in regex, so you have to use \\*
Your expression will be: (?m)^\\*\\*\\*.*
This is not perfect, but it'll get you started:
// 4 lines, 2 of them containing "***" at different locations
String input = "abc***def\nghijkl\n***mnop\n**blah";
// replacing multiline pattern starting with any character 0 or more times,
// followed by 3 escaped "*"s,
// followed by any character 0 or more times
System.out.println(input.replaceAll("(?m).*\\*{3}.*", ""));
Output:
ghijkl
**blah
If the three asterisks are not always at the beginning of the line, you can use this pattern, which removes newlines too:
(\r?\n)?[^\r\n*]*\Q***\E.*((1)?|\r?\n?)
If all you're doing is looking for three specific characters together in a string, you don't need a regex at all:
if (line.contains("***")) {
...
}
(But if things get more complicated and you do need a regex, then use a backslash or Pattern.quote as the other answers say.)
(This is assuming you're reading lines one at a time, instead of having one big long buffer containing all the lines with newline characters. Some of the other answers handle the latter case.)
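For the case where all the lines sit in one big string, here is a minimal sketch that combines Pattern.quote with multiline mode and also removes the line break that follows each deleted line. The class StarLineRemover and its method are made up for the example; the sample input is the one used in an earlier answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper; deletes every line that contains three asterisks in a row.
class StarLineRemover {
    // Pattern.quote wraps the text in \Q...\E so the asterisks are taken literally.
    // The optional group at the end swallows the line break after the deleted line.
    private static final Pattern STAR_LINE =
            Pattern.compile("(?m)^.*" + Pattern.quote("***") + ".*(?:\\r?\\n)?");

    static String removeStarLines(String input) {
        return STAR_LINE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "abc***def\nghijkl\n***mnop\n**blah";
        System.out.println(removeStarLines(input));
    }
}
```

If the last line of the input is one of the deleted lines and has no trailing line break, the line break before it is left in place.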

Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows: each character is followed by the output of (int)char, which should be its Unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number, so you'd need \u2013 to match the en dash (decimal 8211) and \u2014 to match the em dash (decimal 8212), not \u8211 or \u8212 (and \x2D for the normal hyphen, etc.).
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"
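A small sketch of the \p{Pd} approach, using the sample title from the question. The class DashSplitter and the method splitTitle are made up for the example; Integer.toHexString is used only to show the decimal-to-hex conversion.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class DashSplitter {
    // \p{Pd} is the Unicode "dash punctuation" category: the ASCII hyphen-minus,
    // the en dash, the em dash and other dashes.
    private static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // Decimal 8211 is hexadecimal 2013, the en dash.
        System.out.println(Integer.toHexString(8211));
        System.out.println(Arrays.toString(splitTitle("Study Summary (1 of 10) \u2013 Competition")));
        System.out.println(Arrays.toString(splitTitle("foo - bar")));
    }
}
```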
