Split on non-Arabic characters - Java

I have a String like this
أصبح::ينال::أخذ::حصل (على)::أحضر
I want to split it on non-Arabic characters using Java.
Here's my code:
String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
String[] arr = s.split("^\\p{InArabic}+");
System.out.println(Arrays.toString(arr));
And the output was
[, ::ينال::أخذ::حصل (على)::أحضر]
But I expect the output to be
[ينال,أخذ,حصل,على,أحضر]
What is wrong with my pattern?

You need a negated class, and to do that, you need square brackets [ ... ]. Try to split with this:
"[^\\p{InArabic}]+"
If \\p{InArabic} matches any character in the Unicode Arabic block, then [^\\p{InArabic}] matches any character outside it.
Another option is the equivalent syntax that uses P instead of p to negate the \\p{InArabic} character class, as @Pshemo mentioned:
"\\P{InArabic}+"
This works just like \\W is the opposite of \\w.
The only advantage of the first syntax over the second (again, as @Pshemo mentioned) is flexibility. If you want to add other characters to the set that shouldn't be matched, for example to match every non-\\p{InArabic} character except periods, you can simply add them to the class:
"[^\\p{InArabic}.]+"
                ^
Otherwise, if you really want to use \\P{InArabic}, you'll need subtraction within classes:
"[\\P{InArabic}&&[^.]]+"

The expression you want is "\\P{InArabic}+"
This matches one or more consecutive characters that are not in the Arabic block.
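Putting the answers above together, here is a minimal sketch of the split using the negated property class. The class and method names (ArabicSplit, splitOnNonArabic) are made up for this illustration, and the input is the string from the question.

```java
import java.util.Arrays;

class ArabicSplit {
    // Splits on every run of characters outside the Unicode Arabic block.
    // (Hypothetical helper name, used only for this illustration.)
    static String[] splitOnNonArabic(String s) {
        return s.split("\\P{InArabic}+");
    }

    public static void main(String[] args) {
        String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
        // Prints the Arabic words without the "::", the space or the parentheses.
        System.out.println(Arrays.toString(splitOnNonArabic(s)));
    }
}
```

If the input can start with a non-Arabic character, split returns an empty first element, which you may want to filter out.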

Related

Can I give Java a regular expression that says when it should not split a string?

Can I give the String.split method a parameter which tells it when it must not split the given string? In my particular case, I have text documents with lots of text and symbols. But in every file there are many different symbols. This is what I want to achieve:
string.split(not(A-Z,ß,ä,ö,ü));
So basically, I want String.split to only split whenever it finds a character that is not part of the German set of characters.
I hope you can help me.

There are three tokens in regular expressions that allow you to do exactly what you want to achieve:
[] creates a character class which contains all characters that are listed inside. In your particular case, you'd want this to be [a-zßäöü] as this character group contains all characters a through z, ß, ä, ö and ü.
^ negates the contents of a character class. So, using the character class from above, you'd use [^a-zßäöü] if you wanted to match any character that is not part of the character group.
Additionally, adding (?i) in front of your regular expression makes it case-insensitive, so it matches the uppercase letters as well without you having to list them. In Java, (?i) on its own only folds case for ASCII letters, so to cover Ä, Ö and Ü you also need the u flag: (?iu).
So, putting those three tokens together, you get the regular expression (?iu)[^a-zßäöü]. Now the only thing left is to pass it to String.split and you're done:
string.split("(?iu)[^a-zßäöü]");
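As a rough, runnable sketch of this approach: the class name, method name and sample text below are invented for the example. It assumes you want Unicode-aware case folding (the u flag, so that Ä, Ö and Ü count as letters) and that a run of several non-German characters should count as one split point (the trailing +).

```java
import java.util.Arrays;

class GermanSplit {
    // (?iu): case-insensitive with Unicode case folding, so uppercase umlauts are letters too.
    // The trailing + treats a run of delimiters as a single split point.
    // (Hypothetical helper name, used only for this illustration.)
    static String[] splitOnNonGerman(String s) {
        return s.split("(?iu)[^a-zßäöü]+");
    }

    public static void main(String[] args) {
        // Made-up sample text containing punctuation, digits and umlauts.
        System.out.println(Arrays.toString(splitOnNonGerman("Straße, Größe; ÄPFEL 42 über")));
    }
}
```

Without the +, every single delimiter character splits separately, so ", " would produce an empty string between the comma and the space.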

Mr.Human,
If I'm understanding your question correctly, you want to split a string on non-German characters?
So,
abcdöyüp
becomes
a, b, c, dö, yü, p
If that is the case, then unfortunately you need to specify the set of characters that are non-German, e.g. [A-Z] to split on. If you are trying to accomplish something other than this, please clarify and/or provide an example.

Matching a sequence of Unicode escape values in Java with a regular expression

I have a text file that contains sequences of Unicode escape values like
"{"\u0985\u0982\u09b6\u0998\u099f\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u0982\u09b6\u09bf","\u0985\u0982\u09b6\u09be\u0999\u09cd\u0995\u09bf\u09a4","\u0985\u0982\u09b6\u09be\u09a6\u09bf","\u0985\u0982\u09b6\u09be\u09a8\u09cb"}"
I am trying to match and group the values inside the quotes using the Pattern class in Java, as shown below, but I cannot find any match.
Pattern p = Pattern.compile("\"(\\[u]{1}\\w+)+\"");
I would like to find out where the technical error in my regexp is.

Try something more like this:
Pattern p = Pattern.compile("\"(\\\\u[0-9a-f]{4})+\"");
In order to match the string \u you need the regex \\u, and to express that regex as a Java string literal means \\\\u. Following the u there must be exactly four hex digits.
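To see that pattern in action, here is a small sketch that collects every quoted run of escapes from a line of text. The class and method names are hypothetical, the sample input is a shortened version of the data in the question, and two details are my own additions: the hex class also accepts uppercase digits, and the repeated part is wrapped in a non-capturing group so that group 1 holds the whole run rather than only its last escape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeFinder {
    // Regex: a quote, one or more "backslash, u, four hex digits" units, a closing quote.
    // Four backslashes in the Java literal become one literal backslash for the regex engine.
    private static final Pattern QUOTED_RUN =
            Pattern.compile("\"((?:\\\\u[0-9a-fA-F]{4})+)\"");

    // Returns the text between the quotes for every match (hypothetical helper).
    static List<String> findEscapedRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = QUOTED_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes as plain text instead of real characters.
        String text = "{\"\\u0985\\u0982\",\"\\u0985\\u0982\\u09b6\"}";
        System.out.println(findEscapedRuns(text));
    }
}
```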

First, [u]{1} means "match one character from this list, exactly once", so you can replace it with simply u.
More importantly, the Java string "\\[u]{1}\\w+" produces the regex \[u]{1}\w+. The \[ is an escaped, literal opening bracket, so the pattern looks for a quote, then the literal text [u], and then word characters. Your input never contains [u], so nothing matches.
Happy coding!
Edit
Try replacing the \\ before the u with \\\\ (and drop the square brackets). In a Java string literal, \\\\ becomes the regex \\, which matches one literal backslash, so the regex then looks for a backslash followed by u.

Java regular expression matching two consecutive consonants

I'm trying to match only strings with two consecutive consonants, but no matter what input I give to myString, this never evaluates to true, so I assume something is wrong with the syntax of my regex. Any ideas?
if (Pattern.matches("([^aeiou]&&[^AEIOU]){2}", myString)) {...}
Additional info:
myString is a substring of at most two characters
There is no whitespace, as this string is the output of a .split with a whitespace delimiter
I'm not worried about special characters, as the program just concatenates and prints the result, though if you'd like to show me how to include something like [b-z]&&[^eiou] in your answer I would appreciate it.
Edit:
After going through these answers and testing a little more, the code I finally used was
if (myString.matches("(?i)[b-z&&[^eiou]]{2}")) {...}

[^aeiou] matches non-letter characters as well, so you should use a different pattern:
Pattern rx = Pattern.compile("[bcdfghjklmnpqrstuvwxyz]{2}", Pattern.CASE_INSENSITIVE);
if (rx.matcher(myString).matches()) {...}
If you would like to use && for an intersection, you can do it like this:
"[a-z&&[^aeiou]]{2}"

To use character class intersection, you need to wrap your syntax inside of a bracketed expression. The below matches characters that are both lowercase letters and not vowels.
[a-z&&[^aeiou]]{2}
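A short sketch of the intersection approach follows. The class and method names are invented for the example, and the CASE_INSENSITIVE flag is added on the assumption that uppercase input should count as well.

```java
import java.util.regex.Pattern;

class ConsonantPair {
    // Letters a-z intersected with "not a vowel", exactly twice; the flag also accepts uppercase.
    private static final Pattern TWO_CONSONANTS =
            Pattern.compile("[a-z&&[^aeiou]]{2}", Pattern.CASE_INSENSITIVE);

    // True only if the whole string is exactly two consonants (hypothetical helper).
    static boolean isConsonantPair(String s) {
        return TWO_CONSONANTS.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"st", "ST", "ta", "s1", "str"}) {
            System.out.println(s + " -> " + isConsonantPair(s));
        }
    }
}
```

Because matches() must consume the entire input, a three-letter string such as "str" is rejected even though it contains two consecutive consonants. Use find() instead if you want to detect a pair anywhere in a longer string.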

Extract variable from a simple mathematical equation (Java, String)

I will be handling a bunch of strings that will be of the following format:
"2*salary"
"salary+2"
"2*salary/3"
My goal is to pull out just "salary". However, I do not want to simply strip out non-letter characters, because I might have something like "2*id3", where the variable name is a mixture of letters and digits (note: it will never be all digits). I currently use:
Pattern pattern = Pattern.compile("[\\w_]+");
However, for something like "2*salary" this results in "2" and "salary" being found.

You're probably looking for this:
Pattern.compile("[a-zA-Z]\\w+");
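In other words, match a sequence of word characters that begins with a letter. That matches 'salary' but not '2' (and in '2salary' it matches only 'salary'). Note that \\w+ requires at least one more character after the first letter, so a one-letter name such as 'x' is skipped; use \\w* if single-letter names can occur.
If you do need the match to include leading digits, as in '2salary', use this:
Pattern.compile("[0-9]*[A-Za-z]\\w+");
(I have replaced [\w_] with just \w, since \w already includes the underscore.)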

That is because 2*salary contains two matches for your "word" character definition \w, which is [a-zA-Z0-9_]: the first match is 2 and the second match is salary.
In your case you need something like "[a-zA-Z]\\w*"
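For illustration, a minimal sketch that applies a letter-first pattern to the sample expressions from the question. The class and method names are hypothetical, and it assumes variable names always start with a letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableExtractor {
    // A letter followed by zero or more word characters (letters, digits, underscore).
    private static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]\\w*");

    // Collects every identifier-like token found in the expression (hypothetical helper).
    static List<String> extractVariables(String expression) {
        List<String> names = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(expression);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        for (String e : new String[] {"2*salary", "salary+2", "2*salary/3", "2*id3"}) {
            System.out.println(e + " -> " + extractVariables(e));
        }
    }
}
```

Purely numeric tokens such as 2 or 3 never match, because the first character has to be a letter.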

Regex excluding square brackets

I am new to regex. I have this regex:
\[(.*[^(\]|\[)].*)\]
Basically it should take this:
[[a][b][[c]]]
And be able to replace with:
[dd[d]]
a, b, c and d are unrelated placeholders. Needless to say, the regex isn't working: it replaces the entire string with "d" in this case.
Any explanation or aid would be great!
EDIT:
I tried another regex,
\[([^\]]{0})\]
This one worked for the case where brackets contain no inner brackets and nothing else inside. But it doesn't work for the described case.

Note that . is a special character meaning "any character except a line terminator", and * is greedy, so it tries to find the longest possible match.
In your regex \[(.*[^(\]|\[)].*)\], the first .* grabs as many characters as possible after the opening [, the class [^(\]|\[)] then matches one character that is not (, ), |, [ or ], the second .* matches any remaining characters, and \] matches the final ]. So this regex matches your entire input.
To get rid of that problem, remove both .* from your regex. Also, | and ( ) have no special meaning inside [^...]; they are treated as literal characters, so you don't need them there.
System.out.println("[[a][b][[c]]]".replaceAll("\\[[^\\]\\[]\\]", "d"));
Output: [dd[d]]

\[(\[a\])(\[b\])\[(\[c\])\]\]
If you need to double backslashes in the current context (such as you are placing it in a "" style string):
\\[(\\[a\\])(\\[b\\])\\[(\\[c\\])\\]\\]
An example replacement for a, b and c is [^\]]*, or if you need to escape backslashes [^\\]]*.
Now you can replace capture one, capture two and capture three each with d.
If the string you are replacing in is not exactly of that format, then you want to do a global replacement with
(\[a\])
replacing a,
(\[[^\]]*\])
doubling backslashes,
(\\[[^\\]]*\\])

Try this:
System.out.println("[[a][b][[c]]]".replaceAll("\\[[^]\\[]]", "d"));
If a, b and c can be more than one character long in the real world, use this:
System.out.println("[[a][b][[c]]]".replaceAll("\\[[^]\\[]++]", "d"));
The idea is to use a character class that contains all characters except [ and ]. The class is [^]\\[] and the other square brackets in the pattern are literals.
Note that a literal closing square bracket doesn't need to be escaped when it is the first character in a character class, or outside a character class.
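The same idea as a small, self-contained sketch. The class and method names are made up, and this version escapes both brackets inside the class, which is a little more verbose but arguably easier to read.

```java
class InnermostBrackets {
    // Matches [ ... ] whose content is one or more characters that are neither [ nor ],
    // i.e. only the innermost bracket pairs. (Hypothetical helper name.)
    static String replaceInnermost(String s, String replacement) {
        return s.replaceAll("\\[[^\\]\\[]+\\]", replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceInnermost("[[a][b][[c]]]", "d"));
        System.out.println(replaceInnermost("[[abc][x]]", "d"));
    }
}
```

Keep in mind that replaceAll treats $ and \ in the replacement string specially; if the replacement can contain them, wrapping it in Matcher.quoteReplacement is one way to keep it literal.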
