Matching a sequence of Unicode values in Java with a regular expression

I have a text file that contains sequences of Unicode escape values like
"{"\u0985\u0982\u09b6\u0998\u099f\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u0982\u09b6\u09bf","\u0985\u0982\u09b6\u09be\u0999\u09cd\u0995\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u09a6\u09bf","\u0985\u0982\u09b6\u09be\u09a8\u09cb"}"
I am trying to match and group the values inside the quotes using the Pattern class in Java, as shown below, but I cannot find any match.
Pattern p = Pattern.compile("\"(\\[u]{1}\\w+)+\"");
I would like to find out where the technical error in my regexp is.

Try something more like this:
Pattern p = Pattern.compile("\"(\\\\u[0-9a-f]{4})+\"");
In order to match the string \u you need the regex \\u, and to express that regex as a Java string literal means \\\\u. Following the u there must be exactly four hex digits.
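A minimal sketch of that pattern in use, assuming the file content has already been read into a String that holds literal backslash-u text (the class and method names are made up for illustration). A repeated capturing group only keeps its last repetition, so this sketch wraps the whole run in one capturing group and makes the repeated part non-capturing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class EscapedUnicodeFinder {
    // Regex seen by the engine: "((?:\\u[0-9a-fA-F]{4})+)"
    private static final Pattern QUOTED_ESCAPES =
            Pattern.compile("\"((?:\\\\u[0-9a-fA-F]{4})+)\"");

    // Returns the content of every quoted run of backslash-u escapes.
    static List<String> findQuotedEscapes(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED_ESCAPES.matcher(text);
        while (m.find()) {
            found.add(m.group(1)); // the run without the surrounding quotes
        }
        return found;
    }

    public static void main(String[] args) {
        // In Java source, "\\u0985" is six characters at run time:
        // a backslash, 'u', and four hex digits.
        String text = "{\"\\u0985\\u0982\",\"\\u0985\\u0982\\u09b6\"}";
        for (String run : findQuotedEscapes(text)) {
            System.out.println(run);
        }
    }
}
```

The character class here also accepts upper-case hex digits, which is an addition to the pattern above; drop A-F if the file is known to use lower case only.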

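First, this bit [u]{1} means that you want to match one character from the list exactly once, so you can replace it with simply u.
Once that is done, look at the backslashes. In a Java string literal, \\ becomes a single backslash in the regex, so \\[ reaches the regex engine as \[, which is an escaped literal [ and not a literal backslash. Your regex therefore looks for a quote, then the literal text [u], then word characters, and that sequence never appears in your text. The \\w+ part is fine: it reaches the engine as \w+.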
Happy coding!
Edit
Try replacing the \\ before the u with \\\\. In a Java string literal, \\ becomes a single \ in the regex, and the regex engine then treats that \ as an escape for the next character instead of matching a literal backslash.

Related

Regular expression not working despite testing

I'm trying to validate an ID whose first two characters are letters and whose next four are digits. Individual zeros are allowed (e.g. 0333), but the digits can never all be zero, so something like ID0000 is not allowed. The expression I came up with checks out when I test it online, but it doesn't seem to work when I try to enforce it in the program:
\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\b
and here's the code I'm currently using to implement it:
String pattern = "/\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\b/";
Pattern regEx = Pattern.compile(pattern);
String ingID = ingredID.getText().toString();
Matcher m = regEx.matcher(ingID);
if (m.matches()) {
ingredID.setError("Please enter a valid Ingrediant ID");
}
For some reason it doesn't validate correctly and accepts IDs like ID0000 when it shouldn't. Any thoughts, folks?

Change your regex pattern to "\\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\\b"

Your problem is essentially that Java isn't all that Regex-friendly; you need to deal with the limitations of Java strings in order to create a string that can be used as a Regex pattern. Since \ is the escape character in Regex and the escape character in Java strings (and since there's no such thing as a raw string literal in Java), you must double-escape anything that must be escaped in the Regex in order to create a literal \ character within the Java string, which, when parsed as a Regex pattern, will be correctly treated as the escape character.
So, for instance, the Regex pattern /\b/ (where /, as mentioned in my comment, delimits the pattern itself) would be represented in Java as the string "\\b".
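As a rough sketch of how the corrected pattern might be used (the class and method names are hypothetical, and the Android-style setError call from the question is left out so the example runs on its own):

```java
import java.util.regex.Pattern;

// Hypothetical validator, assuming the ID rules described in the question.
class IngredientIdValidator {
    // Java string "\\b" becomes the regex token for a word boundary.
    // No surrounding slashes: Java has no regex delimiters.
    private static final Pattern VALID_ID =
            Pattern.compile("\\b(?![A-Z]{2}[0]{4})[A-Z]{2}[0-9]{4}\\b");

    static boolean isValid(String id) {
        // matches() succeeds only if the whole input fits the pattern.
        return VALID_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ID0333")); // true
        System.out.println(isValid("ID0000")); // false
    }
}
```

Note that the code in the question sets the error when m.matches() is true. If the intent is to flag invalid IDs, the condition probably needs to be negated, for example if (!isValid(ingID)) { ... }.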

Split on non arabic characters

I have a String like this:
أصبح::ينال::أخذ::حصل (على)::أحضر
I want to split it on non-Arabic characters using Java.
Here's my code:
String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
String[] arr = s.split("^\\p{InArabic}+");
System.out.println(Arrays.toString(arr));
And the output was
[, ::ينال::أخذ::حصل (على)::أحضر]
But I expect the output to be
[ينال,أخذ,حصل,على,أحضر]
So I don't know what's wrong with this?

You need a negated class, and to do that, you need square brackets [ ... ]. Try to split with this:
"[^\\p{InArabic}]+"
If \\p{InArabic} matches any Arabic character, then [^\\p{InArabic}] will match any non-Arabic character.
Another option you can consider is an equivalent syntax, using P instead of p to indicate the opposite of the \\p{InArabic} character class, as @Pshemo mentioned:
"\\P{InArabic}+"
This works just like \\W is the opposite of \\w.
The only possible advantage of the first syntax over the second (again, as @Pshemo mentioned) is flexibility: if you want to add other characters to the list of characters that shouldn't be matched, for example to match all non-Arabic characters except periods, you only need to add them inside the brackets:
"[^\\p{InArabic}.]+"
                ^
Otherwise, if you really want to use \\P{InArabic}, you'll need subtraction within classes:
"[\\P{InArabic}&&[^.]]+"

The expression you want is "\\P{InArabic}+"
This means match any (non-zero) number of characters that are not Arabic.
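A small runnable sketch of the split, using the string from the question (the class and method names are only for illustration):

```java
import java.util.Arrays;

// Hypothetical demo class, not from the original post.
class ArabicSplitDemo {
    // Splits on every run of characters outside the Arabic Unicode block.
    static String[] splitOnNonArabic(String s) {
        return s.split("\\P{InArabic}+");
    }

    public static void main(String[] args) {
        String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
        System.out.println(Arrays.toString(splitOnNonArabic(s)));
    }
}
```

For this input the result has six elements, because the first word is kept as well; the expected output listed in the question omits it.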

Using regex to match beginning and end of string [Java]

I have a list of files in a folder:
maze1.in.txt
maze2.in.txt
maze3.in.txt
I've used substring to remove the .txt extensions.
How do I use regex to match the front and the back of the file name?
I need it to match "maze" at the front and ".in" at the back, and the middle must be a digit (can be single or double digit).
I've tried the following
if (name.matches("name\\din")) {
//dosomething
}
It doesn't match anything. What is the correct regex expression to use?

I'm a little confused about what you are asking for in particular.
^(maze[0-9]*\.in)$
This will match maze(any number).in. Note that * also allows no digits at all; use + if at least one digit is required.
^(maze[0-9]*\.in)\.txt$
This will match maze(any number).in.txt while leaving .txt out of the captured group, so there is no need to use substring.
The thing I would be wary of right now is the capture groups. I'm not sure what you are doing with this regex, but I believe an explanation of capture groups could benefit you.
A capture group is denoted by (). Whatever matches inside the parentheses is stored under that group's index in the match result, which makes groups a way to pull pieces out of the input.
Example: maze1.in.txt
So if you want to capture the entire line minus .txt, I would use ^(maze[0-9]*\.in)\.txt$
However, if I wanted to capture the pieces separately, I would use ^(maze)([0-9]*)(\.in)\.txt$ which excludes .txt but captures maze, the number, and .in in separate groups.
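To make the group indexes concrete, here is a short sketch that reads the three groups back through Matcher.group. The class and method names are invented for the example, and it uses [0-9]+ instead of [0-9]* on the assumption that at least one digit is required:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
class MazeFileGroups {
    private static final Pattern MAZE_FILE =
            Pattern.compile("^(maze)([0-9]+)(\\.in)\\.txt$");

    // Returns {prefix, number, suffix}, or null when the name does not match.
    static String[] parts(String fileName) {
        Matcher m = MAZE_FILE.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        // group(0) would be the whole match; groups 1..3 are the parentheses.
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("maze12.in.txt");
        System.out.println(p[0] + " | " + p[1] + " | " + p[2]); // maze | 12 | .in
    }
}
```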

Your original solution doesn't work because string "name" is not in your text. It is "maze".
You can try this
name.matches("maze\\d{1,2}\\.in")
\\d{1,2} is used to match one or two digits.

You need regex anchors that tell the regex to
start at the beginning: ^
and signal the end of the string: $
^maze[\d]{1,2}\.in$
or in Java:
name.matches("^maze[\\d]{1,2}\\.in$");
Also, your regex had nothing to match the dot (.), so it could not accept the examples you gave. You need to add \. to the regex, because an unescaped . is a special character that matches any character.
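One detail worth knowing here: String.matches already requires the whole string to match, so the anchors are harmless but optional with that method. They matter when using Matcher.find, which looks for a match anywhere in the input. A small sketch of the difference (class and method names are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
class MatchesVsFind {
    private static final Pattern MAZE = Pattern.compile("maze\\d{1,2}\\.in");

    // Whole-string match: behaves as if the pattern were wrapped in ^...$
    static boolean wholeNameMatches(String name) {
        return name.matches("maze\\d{1,2}\\.in");
    }

    // Substring match: succeeds if the pattern occurs anywhere in the name.
    static boolean containsMatch(String name) {
        return MAZE.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeNameMatches("maze1.in"));     // true
        System.out.println(wholeNameMatches("maze1.in.txt")); // false
        System.out.println(containsMatch("maze1.in.txt"));    // true
    }
}
```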

It is always good to think through what you are trying to do in English before you create a regular expression.
You want to match the word maze followed by a digit, followed by a literal period (.), followed by another word.
word `\w` matches a word character
digit `\d` matches a single digit
period `\.` matches a literal period
word `\w` matches a word character
Putting it all together into a single string, you get (keep in mind the double backslash for the Java escape and the pluses to repeat the previous match one or more times):
"\\w+\\d\\.\\w+"
The above is the generic case for any file name in the format xxx1.yyy. If you want to match maze and in specifically, you can just add those in as literal strings.
"maze\\d+\\.in"
example: http://ideone.com/rS7tw1

name.matches("^maze[0-9]+\\.in\\.txt$")

unicode regex pattern not working

I am trying to match a sequence of Unicode characters:
Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);
but doing so gives this exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
\u05[dDeE][0-9a-fA-F]+
    ^
How can I still use regex constructs (like "[") to match Unicode?
EDIT:
I'm trying to parse some text. Somewhere in the text there is a sequence of Unicode characters whose code range I know.
Edit2:
I am now using ranges instead : [\\u05d0-\\u05ea]{2,} but still can't match the text above
Edit3:
OK, now it's working. The problem was that I used two backslashes instead of one, both in the regex and in the text.
The solution, assuming I know there will be two or more characters, is:
[\u05d0-\u05ea]{2,}

Here is what is causing the exception:
\\u05[dDeE][0-9a-fA-F]{2,}
^^^^
The Java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN, so it throws an exception: \u requires four hexadecimal digits after it and there are only two of them, namely 05. You would need to change it to \\u0005 if that is what you actually want.
On the other hand, if you want to match \\u in the target string, then you need to quad-escape each backslash \ like this: \\\\. So to match \\u you need \\\\\\\\u.
\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}
Finally, if you want to match those Unicode code points literally in your target string then you need to modify our last expression a bit like this:
(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
Edit: Since there is only one backslash in your target string then your regular expression should be:
(?:\\\\u05[dDeE][0-9a-fA-F]){2,}
This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string
<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
Edit 2: If you want to match the literal text \u05db\u05d3\u05d5\u05e8\u05d2\u05dc then you can't use a range.
On the other hand, if you want to match Unicode code points between 05d0 and 05df then you can use:
(?:[\\u05d0-\\u05df]){2,}

It's not clear what you're trying to do. If your goal is to simplify matching a range of Unicode characters, then you need to realize that the hex digits in a \u escape are case insensitive, so your a-fA-F is redundant, even if you could split an escape across character classes. Try this to match all Unicode characters in the range:
[\\u05d0-\\u05ef]

Looks like you have unnecessary \\ in your input string. The following works, using your Unicode character range in the regex:
String text = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>\n<br style=\\";
System.out.println(text.replaceAll("[\u05d0-\u05ea]{2,}", "###"));
OUTPUT:
###
</span>
Note that your input text had \\n and \\u05db etc., which I have fixed.
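The two situations discussed in this thread (real Hebrew characters in the string versus the six-character text form backslash, u, four hex digits) need different patterns. A minimal sketch that puts them side by side, with class and method names invented for the example:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
class HebrewRunDemo {
    // The regex engine itself resolves these two escapes to code points.
    private static final Pattern REAL_CHARS =
            Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Matches the text form: a literal backslash, 'u', "05",
    // then d or e, then one hex digit; repeated at least twice.
    private static final Pattern ESCAPED_TEXT =
            Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    static boolean hasRealHebrewRun(String s) {
        return REAL_CHARS.matcher(s).find();
    }

    static boolean hasEscapedHebrewRun(String s) {
        return ESCAPED_TEXT.matcher(s).find();
    }

    public static void main(String[] args) {
        String real = "\u05db\u05d3\u05d5\u05e8";        // four actual Hebrew letters
        String escaped = "\\u05db\\u05d3\\u05d5\\u05e8"; // 24 ASCII characters
        System.out.println(hasRealHebrewRun(real));       // true
        System.out.println(hasRealHebrewRun(escaped));    // false
        System.out.println(hasEscapedHebrewRun(escaped)); // true
    }
}
```

Which pattern applies depends on whether the text being parsed has already been unescaped (for example by a JSON parser) or still contains the raw escape text.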

Remove everything from a string up to a certain character and optionally a string if it follows too

I am looking to write a regex that removes all characters up to the first &emsp and, if a (new section) follows the &emsp, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));

Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have the string split, just select the elements of the array that you need for the print statement.

Your regex is very long and I do not want to debug it. However, the tip is that some characters have a special meaning in regular expressions. For example, parentheses define groups, square brackets define character classes, and so on. Such characters must be escaped if you want them to be interpreted as plain characters and not as regex commands. To escape a special character you write \ in front of it. But \ is the escape character for Java strings too, so it has to be doubled.
For example, to replace a literal opening parenthesis with the letter A you would write str.replaceAll("\\(", "A")
Now you have all the information you need. Try to start from a simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but highly discouraged. Use a dedicated parser for such formats.

Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.
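The same idea can be written so that the separator is passed in and quoted with Pattern.quote, which avoids guessing which characters need escaping. This is only a sketch: the helper name is invented, and the em space (U+2003) used in main is an assumption about what the separator in the question actually is.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
class PrefixStripper {
    // Removes everything up to and including the first occurrence of
    // separator, plus a directly following "(new section)" if present.
    static String stripThrough(String text, String separator) {
        String regex = "(?s)^.*?"                 // shortest possible prefix, newlines included
                + Pattern.quote(separator)        // separator taken literally
                + "(?:\\(new section\\))?";       // optional literal "(new section)"
        return text.replaceFirst(regex, "");
    }

    public static void main(String[] args) {
        // Assumption: the separator is an em space, written here as a Java escape.
        String input = "§ 431:10A–126\u2003(new section)[<centd>]Chemotherapy services.";
        System.out.println(stripThrough(input, "\u2003"));
    }
}
```

If the separator does not occur in the text, replaceFirst finds no match and the input is returned unchanged.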
