Merging 2 regex that allow only English and Arabic characters - java

I have a string and I want to remove every non-alphabetic character (such as 0..9 ! # $ % ^ & * ( ) _ . ,) and keep only alphabetic characters.
After some searching and testing, I ended up with two regexes:
String str = "123hello!#$% مرحبا. ok";
str = str.replaceAll("[^a-zA-Z]", "");
str = str.replaceAll("\\P{InArabic}+", "");
System.out.println(str);
This should return "hello مرحبا ok".
But of course, this returns an empty string, because the first regex removes every non-Latin character and the second then removes every non-Arabic character.
My question is: how can I merge these two regexes into one that keeps only Arabic and English characters?

Use a lowercase p, since the negation is already handled by ^. No quantifier is needed (though it wouldn't hurt) because replaceAll replaces every match:
String str = "123hello!#$% مرحبا. ok";
str = str.replaceAll("[^a-zA-Z \\p{InArabic}]", "");
System.out.println(str);
Prints:
hello مرحبا ok
Note: based on your expected result you want spaces kept, so a space is included in the character class.
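A minimal sketch of the merged character class wrapped in a helper, assuming the same sample input as above. The class and method names (ArabicEnglishFilter, keepEnglishAndArabic) are made up for illustration. Keep in mind that \p{InArabic} selects the whole Arabic Unicode block, which also holds Arabic-Indic digits and punctuation, so those characters would be kept as well.

```java
class ArabicEnglishFilter {
    // Matches anything that is NOT an ASCII letter, a space,
    // or a character in the Arabic Unicode block
    private static final String UNWANTED = "[^a-zA-Z \\p{InArabic}]";

    // Hypothetical helper name, not from the original answer
    static String keepEnglishAndArabic(String input) {
        return input.replaceAll(UNWANTED, "");
    }

    public static void main(String[] args) {
        System.out.println(keepEnglishAndArabic("123hello!#$% مرحبا. ok"));
    }
}
```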

Related

Regex to find the first word in a string java without using the string name

I have a string that can contain a sentence with symbols and numbers, and the sentence can have different lengths.
For example:
String myString = " () Huawei manufactures phones";
And the next time myString can contain the following words:
String myString = " * Audi has amazing cars &^";
How can I use a regex to get the first word from the string, so that I get "Huawei" from the first myString and "Audi" from the second?
Below is what I have tried, but it fails when there are spaces or symbols before the first word:
String regexString = myString.replaceAll("\\s.*", "");
You may use this regex with a capture group for matching:
^\W*\b(\w+).*
and replace with: $1
Java Code:
s = s.replaceAll("^\\W*\\b(\\w+).*", "$1");
RegEx Details:
^: Start
\W*: Match 0 or more non-word characters
\b: Word boundary
(\w+): Match 1+ word characters and capture it in group #1
.*: Match anything afterwards
See how you get on with:
s = s.replaceAll("^[^\\p{Alpha}]*", "");
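As an alternative to replacing everything around the word, the first word can also be pulled out directly with a Matcher. This is only a sketch: the class and method names (FirstWord, firstWord) are invented here, and it assumes a "word" means a run of \w characters, as in the first answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstWord {
    // A run of one or more word characters (letters, digits, underscore)
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns the first word in the input, or an empty string if there is none
    static String firstWord(String input) {
        Matcher m = WORD.matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstWord(" () Huawei manufactures phones"));
        System.out.println(firstWord(" * Audi has amazing cars &^"));
    }
}
```

Because find() stops at the first match, nothing after the first word is scanned.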

JAVA: Replacing words in string

I want to replace words in a string, but I am having some difficulty. Here is what I want to do. I have this string:
String a = "I want to replace some words in this string";
It should work like some kind of translator. I am doing this with String.replaceAll(), but it doesn't work completely. Let's say I am translating from English to German; then this should be the replacement (Ich means I in German).
String toTranslate = "I";
String translated = "Ich";
a = a.replaceAll(toTranslate.toLowerCase(), translated.toLowerCase());
Now the output of the String a will be this:
"I want to replace some words ichn thichs strichng"
How to replace just the words, not the subwords in the words?
replaceAll uses regex, so you can add word boundaries or look-around assertions to check that no non-space characters surround the word you want to replace.
String toTranslate = "I";
String translated = "Ich";
a = a.replaceAll("(?<!\\S)"+toTranslate.toLowerCase()+"(?!\\S)", translated.toLowerCase());
You can also quote the word to escape any regex metacharacters such as + * ( inside it. By the way, you don't need to lower-case your strings; simply add the case-insensitive flag (?i) to the regex.
a = a.replaceAll("(?i)(?<!\\S)"+Pattern.quote(toTranslate)+"(?!\\S)", translated.toLowerCase());
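The look-around approach can be sketched as a reusable helper. The names WholeWordReplace and replaceWord are hypothetical. The sketch additionally passes the replacement through Matcher.quoteReplacement, an assumption beyond the answer above, so that a $ or \ in the translated text is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {
    // Replaces 'word' only where it is delimited by whitespace or the string ends
    static String replaceWord(String text, String word, String replacement) {
        // (?i) ignores case; the look-arounds reject neighbouring non-space characters
        String regex = "(?i)(?<!\\S)" + Pattern.quote(word) + "(?!\\S)";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String a = "I want to replace some words in this string";
        System.out.println(replaceWord(a, "I", "Ich"));
    }
}
```

With this pattern a word followed directly by punctuation, such as "I," is not replaced, because the comma counts as a non-space character.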
Use split(" ") to get each word in the sentence, then replace every word that matches as a whole.
String a = "I want to replace some words in this string";
String toTranslate = "I";
String translated = "Ich";
String[] words = a.split(" ");
for (int i = 0; i < words.length; i++) {
    // Compare the whole word, ignoring case, so "in" or "this" is left alone
    if (words[i].equalsIgnoreCase(toTranslate)) {
        words[i] = translated;
    }
}
System.out.println("after translation = " + String.join(" ", words));
String toTranslate = "I ";
String translated = "Ich ";
a = a.replaceAll(toTranslate.toLowerCase(), translated.toLowerCase());
Adding a space after the "I" means only an "i" that ends a word is replaced with "ich", but a longer word that ends in "i" would still be changed, which is another problem.
If you assume that "I" will always be capitalized in English, as it should be, then
a = a.replaceAll(toTranslate, translated);
will work; otherwise you need to replace both cases:
a = a.replaceAll(toTranslate, translated);
a = a.replaceAll("([^a-zA-Z])("+toTranslate.toLowerCase()+")([^a-zA-Z])", "$1"+translated.toLowerCase()+"$3");
Yes, the word boundaries are the solution. I just did this in the regex:
text.replaceAll("\\b" + parts1[i] + "\\b", map.element.value);
Don't be confused by the second argument: it is a string (taken from a hash table).
You can use the regex word boundary, which is \b:
String toTranslate = "\\bI\\b";
String translated = "Ich";
a = a.replaceAll(toTranslate.toLowerCase(), translated.toLowerCase());
This should ensure I is separated entirely into its own word
Edit: I misread the question and realized you want whole words. See above, as I have accounted for that
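When many words have to be translated from a table, the word-boundary idea can be applied in a single pass. The sketch below is one possible arrangement, not anyone's posted solution: the class name DictionaryTranslator and the method translate are invented, matching is case-sensitive, and the dictionary contents are sample data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryTranslator {
    // Replaces every whole word that appears as a key in the dictionary
    static String translate(String text, Map<String, String> dictionary) {
        if (dictionary.isEmpty()) {
            return text;
        }
        // Build \b(?:word1|word2|...)\b with every key quoted literally
        String alternation = dictionary.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile("\\b(?:" + alternation + ")\\b").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Look up the matched word; quoteReplacement keeps $ and \ literal
            m.appendReplacement(out, Matcher.quoteReplacement(dictionary.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("I", "Ich");
        dictionary.put("want", "will");
        System.out.println(translate("I want to win", dictionary));
    }
}
```

Scanning once avoids re-translating text that an earlier replacement has already produced.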

Checking if String contains another whole String

So, I've been trying to find out online whether there's a way to search a String for another whole String in Java. Unfortunately, I haven't found anything that works.
What I mean is this:
String str = "this is a test";
If I search for "this is" it should return true. But if I search for "this i" it should return false.
I've tried using String.matches(), but that won't work because some of the strings being searched for may contain [, ], ?, etc., which would throw it off. Using String.indexOf(search) != -1 won't work either, because it would return true for partial words.
Use \b, the zero-width word boundary delimiter, in a regex.
String str = "this is a test";
String search = "this is";
Pattern p = Pattern.compile(String.format("\\b%s\\b", Pattern.quote(search)));
boolean matches = p.matcher(str).find();
If words can also be separated by non-alphabetic characters, not only whitespace, you can use look-around assertions. Try it this way:
String str = "[this] is...";
String search = "[this] is";
Pattern p = Pattern.compile("(?<!\\p{IsAlphabetic})"
+ Pattern.quote(search) + "(?!\\p{IsAlphabetic})");
boolean matches = p.matcher(str).find();
It will check if matched part has no alphabetic characters before or after it.
Note: \\p{IsAlphabetic} includes all Unicode alphabetic characters like ż ź ć, not only a-z A-Z range.
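The two approaches can be compared side by side in a small sketch. The class and method names (WholePhraseSearch, containsWithBoundary, containsWithLookaround) are made up for this illustration. It shows why \b is not enough when the search text itself starts or ends with a non-word character such as [.

```java
import java.util.regex.Pattern;

class WholePhraseSearch {
    // Variant 1: word boundaries around the quoted search text
    static boolean containsWithBoundary(String text, String search) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(search) + "\\b");
        return p.matcher(text).find();
    }

    // Variant 2: no alphabetic character directly before or after the match
    static boolean containsWithLookaround(String text, String search) {
        Pattern p = Pattern.compile("(?<!\\p{IsAlphabetic})"
                + Pattern.quote(search) + "(?!\\p{IsAlphabetic})");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWithBoundary("this is a test", "this is"));
        System.out.println(containsWithBoundary("this is a test", "this i"));
        System.out.println(containsWithBoundary("[this] is...", "[this] is"));
        System.out.println(containsWithLookaround("[this] is...", "[this] is"));
    }
}
```

The third call fails to match because there is no word boundary between the start of the string and [, whereas the look-around variant only checks that no letter touches the match.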

Regular Expression - inserting space after comma only if succeeded by a letter or number

In Java I want to insert a space after a comma, but only if the character after the comma is a digit or letter. I am hoping to use the replaceAll method, which takes a regular expression as a parameter. So far I have the following:
String s1="428.0,chf";
s1 = s1.replaceAll(",(\\d|\\w)",", ");
This code does successfully distinguish between the String above and one where there is already a space after the comma. My problem is that I can't figure out how to write the expression so that the space is inserted. The code above will replace the c in the String shown above with a space. This is not what I want.
s1 should look like this after executing the replaceAll: "428.0 chf"
s1 = s1.replaceAll(",(?=[\\da-zA-Z])", " ");
(?=[\da-zA-Z]) is a positive lookahead that checks for a digit or a letter after the comma. The lookahead is not replaced, since it is never included in the match. It is just a check.
NOTE
\w includes digits, letters and _, so there is no need for \d.
A better way to represent it would be [\da-zA-Z] instead of \w, since \w also includes _, which you do not need to match.
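A short sketch of the lookahead in use, with two hypothetical helpers (CommaSpacing, commaToSpace, spaceAfterComma). The first replaces the comma with a space, which matches the expected output in the question. The second keeps the comma and adds a space after it, which matches the question's title. Which of the two is wanted is an assumption left to the reader.

```java
class CommaSpacing {
    // A comma that is directly followed by an ASCII letter or digit
    private static final String COMMA_BEFORE_ALNUM = ",(?=[\\da-zA-Z])";

    // "428.0,chf" becomes "428.0 chf"
    static String commaToSpace(String s) {
        return s.replaceAll(COMMA_BEFORE_ALNUM, " ");
    }

    // "428.0,chf" becomes "428.0, chf"
    static String spaceAfterComma(String s) {
        return s.replaceAll(COMMA_BEFORE_ALNUM, ", ");
    }

    public static void main(String[] args) {
        System.out.println(commaToSpace("428.0,chf"));
        System.out.println(spaceAfterComma("428.0,chf"));
    }
}
```

A comma that is already followed by a space, or by an underscore, is left untouched in both cases.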
Try this, and note that $1 refers to your matched grouping:
s1 = s1.replaceAll(",(\\d|\\w)", " $1");
Note that String.replaceAll() works in the same way as a Matcher.replaceAll(). From the doc:
The replacement string may contain references to captured subsequences
String s1="428.0,chf";
s1 = s1.replaceAll(",([^\\W_])"," $1"); //Match an alphanumeric character other than '_' after ','
System.out.println(s1);
Output:
428.0 chf
\w matches digits, letters and the underscore, so [^\W_] (anything that is neither a non-word character nor an underscore) matches exactly the letters and digits.
$1 represents the captured group. Here you captured the c after the comma, so ",c" is replaced with a space followed by $1, that is " c".
Try this:
public class Tes {
    public static void main(String[] args) {
        String s1 = "428.0,chf";
        String[] sArr = s1.split(",");
        // Join the parts with a single space; prepending " " to every part
        // would leave an unwanted leading space
        String finalStr = String.join(" ", sArr);
        System.out.println(finalStr);
    }
}

Remove doubled letter from a string using java

I need to remove a doubled letter from a string using regex operations in java.
Eg: PRINCEE -> PRINCE
APPLE -> APLE
Simple Solution (remove duplicate characters)
Like this:
final String str = "APPLEE";
String replaced = str.replaceAll("(.)\\1", "$1");
System.out.println(replaced);
Output:
APLE
Not just any characters, letters only
As @Jim correctly comments, the above matches any doubled character, not just letters. Here are a few variations that match only letters:
// the basics, ASCII letters. these two are equivalent:
str.replaceAll("([A-Za-z])\\1", "$1");
str.replaceAll("(\\p{Alpha})\\1", "$1");
// Unicode Letters
str.replaceAll("(\\p{L})\\1", "$1");
// anything where Character.isLetter(ch) returns true
str.replaceAll("(\\p{javaLetter})\\1", "$1");
References:
Character.isLetter(ch) (javadocs)
Any method in Character of the form Character.isXyz(char) enables a pattern named \p{javaXyz} (mind the capitalization). This mechanism is described in the Pattern javadocs.
Unicode blocks and categories can also be matched with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. This mechanism is also described in the Pattern javadocs.
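To make the difference between the letter classes concrete, here is a small sketch that runs the same replacement with \p{Alpha}, \p{L} and \p{javaLetter}. The class and method names are hypothetical. It assumes the default Pattern flags, under which \p{Alpha} covers ASCII letters only. The sample input is a doubled é written with Unicode escapes.

```java
class LetterClasses {
    // ASCII letters only (default behaviour of \p{Alpha})
    static String dedupeAscii(String s) {
        return s.replaceAll("(\\p{Alpha})\\1", "$1");
    }

    // Any Unicode letter (general category L)
    static String dedupeUnicode(String s) {
        return s.replaceAll("(\\p{L})\\1", "$1");
    }

    // Anything for which Character.isLetter returns true
    static String dedupeJavaLetter(String s) {
        return s.replaceAll("(\\p{javaLetter})\\1", "$1");
    }

    public static void main(String[] args) {
        String doubledE = "\u00e9\u00e9"; // two precomposed e-acute characters
        System.out.println(dedupeAscii(doubledE).length());      // still 2
        System.out.println(dedupeUnicode(doubledE).length());    // 1
        System.out.println(dedupeJavaLetter(doubledE).length()); // 1
    }
}
```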
String s = "...";
String replaced = s.replaceAll( "([A-Z])\\1", "$1" );
If you want to reduce only pairs ("AA" -> "A", "AAA" -> "AA"), use:
public String undup(String str) {
return str.replaceAll("(\\w)\\1", "$1");
}
To collapse triplicates and longer runs to a single character, use: str.replaceAll("(\\w)\\1+", "$1");
To remove only a single duplicate from a longer run (AAAA->AAA, AAA->AA), use: str.replaceAll("(\\w)(\\1+)", "$2");
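The three patterns behave differently on longer runs, which is easiest to see when they are run on the same input. The following sketch wraps each one in a method. The names DuplicateRuns, halvePairs, collapseRuns and dropOne are invented for the illustration.

```java
class DuplicateRuns {
    // Each adjacent pair becomes one character: "AAAA" -> "AA", "AAA" -> "AA"
    static String halvePairs(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    // A run of any length becomes one character: "AAAA" -> "A"
    static String collapseRuns(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");
    }

    // Exactly one character is removed from each run: "AAAA" -> "AAA"
    static String dropOne(String s) {
        return s.replaceAll("(\\w)(\\1+)", "$2");
    }

    public static void main(String[] args) {
        System.out.println(halvePairs("AAAA"));
        System.out.println(collapseRuns("AAAA"));
        System.out.println(dropOne("AAAA"));
    }
}
```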
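This can be done simply by iterating over the String instead of resorting to regexes. Note that this collapses a run of any length to a single character:
// The method name is illustrative
static String removeRepeats(String text) {
    if (text.isEmpty()) return "";
    StringBuilder ret = new StringBuilder(text.length());
    ret.append(text.charAt(0));
    for (int i = 1; i < text.length(); i++) {
        // Append only when the character differs from the previous one
        if (text.charAt(i) != text.charAt(i - 1))
            ret.append(text.charAt(i));
    }
    return ret.toString();
}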
