Split string by array of characters - java

I want to split a string by an array of characters,
so I have this code:
String target = "hello,any|body here?";
char[] delim = {'|',',',' '};
String regex = "(" + new String(delim).replaceAll("(.)", "\\\\$1|").replaceAll("\\|$", ")");
String[] result = target.split(regex);
Everything works fine except when I add a character like 'Q' to the delim[] array;
it then throws an exception:
java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 11
(\ |\,|\||\Q)
so how can i fix that to work with non-special characters as well?
thanks in advance

how can i fix that to work with non-special characters as well
Put square brackets around your characters instead of escaping each one. If ^ is included in your list of characters, make sure it is not the first character, or escape it separately if it is the only character in the list.
Dashes also need special treatment: they need to go at the beginning or at the end of the character class. Note that [, ] and \ would still need a backslash escape inside the brackets.
String delimStr = new String(delim);
String regex;
if (delimStr.equals("^")) {
    // A lone ^ cannot sit inside brackets unescaped, so escape it directly.
    regex = "\\^";
} else if (delimStr.charAt(0) == '^') {
    // Put the second character first so that ^ no longer opens the class.
    // This assumes that all characters are distinct.
    // You may need a stricter check to make this work in general case.
    regex = "[" + delimStr.charAt(1) + delimStr + "]";
} else {
    regex = "[" + delimStr + "]";
}

Using Pattern.quote and putting it in square brackets seems to work:
String regex = "[" + Pattern.quote(new String(delim)) + "]";
Tested with possible problem characters.
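As a minimal sketch of a related variation (not the exact code from the answers above), each delimiter can be quoted on its own with Pattern.quote and the pieces joined with |. The class name SplitByChars and the helper splitByAny are hypothetical names chosen for illustration. This sidesteps the position rules for ^ and - inside brackets, at the cost of a longer regex.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitByChars {
    // Hypothetical helper: builds an alternation such as \Q|\E|\Q,\E|\QQ\E
    // so that every delimiter is treated as a literal character.
    static String[] splitByAny(String target, char[] delims) {
        StringBuilder regex = new StringBuilder();
        for (char c : delims) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            regex.append(Pattern.quote(String.valueOf(c)));
        }
        return target.split(regex.toString());
    }

    public static void main(String[] args) {
        char[] delim = {'|', ',', ' ', 'Q'};
        // prints [hello, any, body, here?]
        System.out.println(Arrays.toString(splitByAny("hello,any|body here?", delim)));

        char[] tricky = {'Q', '^', '-', ']'};
        // prints [a, b, c, d, e]
        System.out.println(Arrays.toString(splitByAny("aQb^c-d]e", tricky)));
    }
}
```

The sketch assumes delims is non-empty; an empty array would produce an empty regex, which splits between every character.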

Q is not a metacharacter in a regex, so you do not have to put the \\ before it (the backslash only serves to mark that the following character must be interpreted as a literal, and not as a metacharacter).
Example
`\\.` in a regex means "a dot"
`.` in a regex means "any character"
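\\Q fails because Q is not a special character in a regex, so it must not be escaped: a backslash before a letter is reserved for escape constructs, and \Q in particular starts a quoted section.
I would make delim a String array and add the backslashes only to the values that need them.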
delim = {"\\|", ..... "Q"};

Related

Parsing special character using Java Regex

I have a requirement to remove from a string those special characters that are not in an allowed list. The current code removes every special character it finds:
String Modified_remark = final_remark.replaceAll("[^\\x00-\\x7F]", "");
This code removes all special characters from the string, but I want to retain certain ones, such as the Angstrom symbol (Å) and the micron symbol (μ).
For example, if I place the allowed special characters in an array, I want the code to skip them and replace everything else with "" (the empty string).
String[] allowedChar = {"Å", "μ"};
More will be added when users request them. Can anyone help with this logic?
Just add all the allowedChars to the exception list in your regex:
final_remark.replaceAll("[^\\x00-\\x7F" + String.join("", allowedChar) + "]", "");
Demo: https://ideone.com/iQWvHI
Update
As Wiktor Stribiżew rightly pointed out, this simple code breaks if allowedChar contains regex special characters. Since the requirements imply that allowedChar contains only non-ASCII characters, we can add a check on each entry as follows:
String[] allowedChar = {"Å", "μ", "]"};
String allowedChars = "";
for (String ch : allowedChar) {
    if (ch.matches("^[^\\x00-\\x7F]$")) {
        allowedChars += ch;
    }
}
String Modified_remark = final_remark.replaceAll("[^\\x00-\\x7F" + allowedChars + "]", "");
Demo: https://ideone.com/94513e
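The filtering idea above can be wrapped into a small self-contained sketch. The class name KeepAllowed and the method stripNonAscii are hypothetical, and the sketch assumes every allowed entry is a single character in the Basic Multilingual Plane. Unicode escapes are used in the source only to avoid file-encoding problems.

```java
class KeepAllowed {
    // Removes every non-ASCII character except those listed in allowed.
    // Entries that are not a single non-ASCII character (such as "]") are
    // ignored, so they can never break the character class.
    static String stripNonAscii(String input, String[] allowed) {
        StringBuilder keep = new StringBuilder();
        for (String ch : allowed) {
            if (ch.matches("[^\\x00-\\x7F]")) {
                keep.append(ch);
            }
        }
        return input.replaceAll("[^\\x00-\\x7F" + keep + "]", "");
    }

    public static void main(String[] args) {
        // \u00C5 is the Angstrom-style A with ring, \u03BC is Greek mu,
        // \u00A9 is the copyright sign, \u2122 is the trademark sign.
        String[] allowedChar = {"\u00C5", "\u03BC", "]"};
        String remark = "5 \u00C5 wide, 3 \u03BCm \u00A9\u2122";
        // The copyright and trademark signs are removed; the two allowed symbols stay.
        System.out.println(stripNonAscii(remark, allowedChar));
    }
}
```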

Java pattern regex with escape characters

I want to replace ";" with "\n" except when it's escaped with a leading '\'. I haven't figured out the correct regex.
Here is what I have:
String s = "abc;efg\\;hij;pqr;xyz\\;123";
s.replaceAll("\\[^\\\\];", "\\\\n");
I'd expect the above string to be replaced with "abc\nefg\;hij;pqr;xyz\;123"
Use a negative look behind:
s = s.replaceAll("(?<!\\\\);", "\n");
The expression (?<!\\) (coded as a java string literal "(?<!\\\\)") means "the previous character should not be a backslash"
Test code:
String s = "abc;efg\\;hij;pqr;xyz\\;123";
s = s.replaceAll("(?<!\\\\);", "\n");
System.out.println(s);
Output:
abc
efg\;hij
pqr
xyz\;123

Error when splitting a string in java

I am trying to split a string according to a certain set of delimiters.
My delimiters are: ,"():;.!? single spaces or multiple spaces.
This is the code i'm currently using,
String[] arrayOfWords= inputString.split("[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+");
which works fine for most cases, but I have a problem when the first word is surrounded by quotation marks. For example
String inputString = "\"Word\" some more text.";
Is giving me this output
arrayOfWords[0] = ""
arrayOfWords[1] = "Word"
arrayOfWords[2] = "some"
arrayOfWords[3] = "more"
arrayOfWords[4] = "text"
I want the output to give me an array with
arrayOfWords[0] = "Word"
arrayOfWords[1] = "some"
arrayOfWords[2] = "more"
arrayOfWords[3] = "text"
This code has been working fine when quotation marks are used in the middle of the sentence, I'm not sure what the trouble is when it's at the beginning.
EDIT: I just realized I have same problem when any of the delimiters are used as the first character of the string
Unfortunately, you won't be able to remove this empty first element using only split. You should probably first remove any leading delimiters from your string and split afterwards. Also, your regex seems to be incorrect because
by adding {2,} inside [...] you are making the characters {, 2, the comma and } delimiters,
you don't need to escape the rest of your delimiters (note that you don't have to escape - only because it is at the end of the character class [], so it can't be used as a range operator).
Try maybe this way
String regexDelimiters = "[\\s,\"():;.!?\\-]+";
String inputString = "\"Word\" some more text.";
String[] arrayOfWords = inputString.replaceAll(
"^" + regexDelimiters,"").split(regexDelimiters);
for (String s : arrayOfWords)
System.out.println("'" + s + "'");
output:
'Word'
'some'
'more'
'text'
A delimiter is interpreted as separating the strings on either side of it, thus the empty string on its left is added to the result as well as the string to its right ("Word"). To prevent this, you should first strip any leading delimiters, as described here:
How to prevent java.lang.String.split() from creating a leading empty string?
So in short form you would have:
String delim = "[\\s,\"():;.!?\\-]+";
String[] arrayOfWords = inputString.replaceFirst("^" + delim, "").split(delim);
Edit: Looking at Pshemo's answer, I realize he is correct regarding your regex. Inside the brackets it's unnecessary to specify the number of space characters, as they will be caught by the + operator.
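A different technique from the split-based answers above, shown here only as a sketch: instead of splitting on delimiters, match the runs of non-delimiter characters with Matcher.find. Because nothing is split, no leading empty string can appear. The class name WordTokens and the method words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokens {
    // Negated version of the delimiter class: one or more characters
    // that are NOT whitespace or one of , " ( ) : ; . ! ? -
    private static final Pattern WORD = Pattern.compile("[^\\s,\"():;.!?\\-]+");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Word, some, more, text]
        System.out.println(words("\"Word\" some more text."));
    }
}
```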

String Matches, Java

I have a sort of a problem with this code:
String[] paragraph;
if(paragraph[searchKeyword_counter].matches("(.*)(\\b)"+"is"+"(\\b)(.*)")){
If I am not mistaken, to use .matches() to search for a particular word in a string I need .* on both sides, but what I want is to find the keyword without matching it inside another word.
For example, if "is" is the keyword I am searching for, I do not want it to match words that merely contain "is", like ship, his, this. So I used \b for a word boundary, but the code above is not working for me.
Example:
String[] content = {"is,","his","fish","ish","its","is"};
String keyword = "is";
for(int i=0;i<content.length;i++){
if(content[i].matches("(.*)(\\b)"+keyword+"(\\b)(.*)")){
System.out.println("There are "+i+" is.");
}
}
What I want is for it to match only "is," and "is", but not "his" or "fish". That is, I want it to match even when the keyword is next to a non-alphanumeric character or a space.
What is the problem with the code above?
What if one of the entries contains uppercase characters, for example IS, and it is compared with is? It will not match. Correct me if I am wrong. How can I match regardless of case without changing the content of the source?
String string = "...";
String word = "is";
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(string);
if (m.find()) {
...
}
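A runnable sketch of the find-based approach above, extended with Pattern.CASE_INSENSITIVE to cover the uppercase case raised in the question. The class name WholeWord and the method containsWord are hypothetical names for illustration.

```java
import java.util.regex.Pattern;

class WholeWord {
    // True if word occurs in text as a whole word, ignoring case.
    // Pattern.quote keeps any regex metacharacters in word literal.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] content = {"is,", "his", "fish", "ish", "its", "is", "IS"};
        for (String entry : content) {
            // "is,", "is" and "IS" give true; the other entries give false
            System.out.println(entry + " -> " + containsWord(entry, "is"));
        }
    }
}
```

Using find() rather than matches() removes the need for the surrounding .* groups, since find() looks for the pattern anywhere in the input.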
Just add spaces like this.
Suppose message is your content string and pattern is your keyword:
if ((message).matches(".* " + pattern + " .*")||(message).matches("^" + pattern + " .*")
||(message).matches(".* " + pattern + "$")) {

Escaping special characters in Java Regular Expressions

Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?
This would be very handy in dynamically building a regular expression, without having to manually escape each individual character.
For example, consider a simple regex like \d+\.\d+ that matches numbers with a decimal point like 1.2, as well as the following code:
String digit = "d";
String point = ".";
String regex1 = "\\d+\\.\\d+";
String regex2 = Pattern.quote(digit + "+" + point + digit + "+");
Pattern numbers1 = Pattern.compile(regex1);
Pattern numbers2 = Pattern.compile(regex2);
System.out.println("Regex 1: " + regex1);
if (numbers1.matcher("1.2").matches()) {
System.out.println("\tMatch");
} else {
System.out.println("\tNo match");
}
System.out.println("Regex 2: " + regex2);
if (numbers2.matcher("1.2").matches()) {
System.out.println("\tMatch");
} else {
System.out.println("\tNo match");
}
Not surprisingly, the output produced by the above code is:
Regex 1: \d+\.\d+
Match
Regex 2: \Qd+.d+\E
No match
That is, regex1 matches 1.2 but regex2 (which is "dynamically" built) does not (instead, it matches the literal string d+.d+).
So, is there a method that would automatically escape each regex meta-character?
If there were, let's say, a static escape() method in java.util.regex.Pattern, the output of
Pattern.escape('.')
would be the string "\.", but
Pattern.escape(',')
should just produce ",", since it is not a meta-character. Similarly,
Pattern.escape('d')
could produce "\d", since 'd' is used to denote digits (although escaping may not make sense in this case, as 'd' could mean a literal 'd', which would not be misunderstood by the regex interpreter as something else, as would be the case with '.').
Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?
If you are looking for a way to create constants that you can use in your regex patterns, then just prepending them with "\\" should work but there is no nice Pattern.escape('.') function to help with this.
So if you are trying to match "\\d" (the string \d instead of a decimal character) then you would do:
// this will match on \d as opposed to a decimal character
String matchBackslashD = "\\\\d";
// as opposed to
String matchDecimalDigit = "\\d";
The 4 slashes in the Java string turn into 2 slashes in the regex pattern. 2 backslashes in a regex pattern matches the backslash itself. Prepending any special character with backslash turns it into a normal character instead of a special one.
String matchPeriod = "\\.";
String matchPlus = "\\+";
String matchParens = "\\(\\)";
...
In your post you use the Pattern.quote(string) method. This method wraps your pattern between "\\Q" and "\\E" so you can match a string even if it happens to have a special regex character in it (+, ., \\d, etc.)
I wrote this pattern:
Pattern SPECIAL_REGEX_CHARS = Pattern.compile("[{}()\\[\\].+*?^$\\\\|]");
And use it in this method:
String escapeSpecialRegexChars(String str) {
return SPECIAL_REGEX_CHARS.matcher(str).replaceAll("\\\\$0");
}
Then you can use it like this, for example:
Pattern toSafePattern(String text)
{
return Pattern.compile(".*" + escapeSpecialRegexChars(text) + ".*");
}
We needed to do that because, after escaping, we add some regex expressions. If not, you can simply use \Q and \E:
Pattern toSafePattern(String text)
{
return Pattern.compile(".*\\Q" + text + "\\E.*");
}
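To tie this back to the question's decimal-number example, here is a small sketch that reuses the SPECIAL_REGEX_CHARS pattern and escapeSpecialRegexChars method from the answer above inside a hypothetical class EscapeDemo. Only the dynamic part (the point) is escaped, while \d+ is added as real regex syntax.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    private static final Pattern SPECIAL_REGEX_CHARS =
            Pattern.compile("[{}()\\[\\].+*?^$\\\\|]");

    // Prefixes every listed metacharacter with a backslash.
    static String escapeSpecialRegexChars(String str) {
        return SPECIAL_REGEX_CHARS.matcher(str).replaceAll("\\\\$0");
    }

    public static void main(String[] args) {
        String point = ".";
        String regex = "\\d+" + escapeSpecialRegexChars(point) + "\\d+";
        System.out.println(regex);                                         // prints \d+\.\d+
        System.out.println(Pattern.compile(regex).matcher("1.2").matches()); // prints true
        System.out.println(Pattern.compile(regex).matcher("1x2").matches()); // prints false
    }
}
```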
The only way the regex matcher knows you are looking for a digit and not the letter d is to escape the letter (\d). To type the regex escape character in java, you need to escape it (so \ becomes \\). So, there's no way around typing double backslashes for special regex chars.
The Pattern.quote(String s) sort of does what you want. However it leaves a little left to be desired; it doesn't actually escape the individual characters, just wraps the string with \Q...\E.
There is not a method that does exactly what you are looking for, but the good news is that it is actually fairly simple to escape all of the special characters in a Java regular expression:
regex.replaceAll("[\\W]", "\\\\$0")
Why does this work? Well, the documentation for Pattern specifically says that it's permissible to escape non-alphabetic characters that don't necessarily have to be escaped:
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
For example, ; is not a special character in a regular expression. However, if you escape it, Pattern will still interpret \; as ;. Here are a few more examples:
> becomes \> which is equivalent to >
[ becomes \[ which is the escaped form of [
8 is still 8.
\) becomes \\\) which is the escaped forms of \ and ) concatenated.
Note: The key is the definition of "non-alphabetic", which in the documentation really means "non-word" characters, or characters outside the character set [a-zA-Z_0-9].
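A short sketch of the \W approach described above, using a hypothetical class NonWordEscape. It shows the escaped text and checks that the compiled pattern matches only the literal input.

```java
import java.util.regex.Pattern;

class NonWordEscape {
    // Puts a backslash before every non-word character (anything outside [a-zA-Z_0-9]).
    static String escapeNonWord(String s) {
        return s.replaceAll("\\W", "\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escapeNonWord("a.b(c)"));            // prints a\.b\(c\)

        Pattern literal = Pattern.compile(escapeNonWord("1+1=2"));
        System.out.println(literal.matcher("1+1=2").matches()); // prints true
        // Unescaped, "1+1=2" would match "11=2" because + would act as a quantifier.
        System.out.println(literal.matcher("11=2").matches());  // prints false
    }
}
```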
Use this utility function escapeQuotes() to escape strings that are placed inside groups and sets of a regular expression.
List of Regex Literals to escape <([{\^-=$!|]})?*+.>
public class RegexUtils {
static String escapeChars = "\\.?![]{}()<>*+-=^$|";
public static String escapeQuotes(String str) {
if(str != null && str.length() > 0) {
return str.replaceAll("[\\W]", "\\\\$0"); // \W designates non-word characters
}
return "";
}
}
From the Pattern class documentation: the backslash character ('\') serves to introduce escaped constructs. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
Example: the string to be matched is (hello) and the regex with a group is (\(hello\)). Here you only need to escape the matched string, as shown below.
public static void main(String[] args) {
String matched = "(hello)", regexExpGrup = "(" + RegexUtils.escapeQuotes(matched) + ")";
System.out.println("Regex : "+ regexExpGrup); // (\(hello\))
}
Agree with Gray: you may need your pattern to have both literals (\[, \]) and meta-characters ([, ]). So with some utility you should be able to escape all characters first, and then add the meta-characters you want to the same pattern.
use
Pattern p = Pattern.compile("\"");
String s= p.toString()+"yourcontent"+p.toString();
will give result as yourcontent as is
