How to replace special characters in a string? - java

I have a string with lots of special characters. I want to remove all those, but keep alphabetical characters.
How can I do this?

That depends on what you mean. If you just want to get rid of them, do this:
(Update: Apparently you want to keep digits as well; use the second line of each pair in that case.)
String alphaOnly = input.replaceAll("[^a-zA-Z]+","");
String alphaAndDigits = input.replaceAll("[^a-zA-Z0-9]+","");
or the equivalent:
String alphaOnly = input.replaceAll("[^\\p{Alpha}]+","");
String alphaAndDigits = input.replaceAll("[^\\p{Alpha}\\p{Digit}]+","");
(All of these can be significantly improved by precompiling the regex pattern and storing it in a constant)
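The precompilation mentioned above could be sketched roughly as follows. The class name PrecompiledCleaner and its method names are made up for illustration; only java.util.regex.Pattern is from the standard library.

```java
import java.util.regex.Pattern;

class PrecompiledCleaner {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern NON_ALPHA = Pattern.compile("[^a-zA-Z]+");
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    // Hypothetical helper: keeps ASCII letters only.
    public static String alphaOnly(String input) {
        return NON_ALPHA.matcher(input).replaceAll("");
    }

    // Hypothetical helper: keeps ASCII letters and digits.
    public static String alphaAndDigits(String input) {
        return NON_ALNUM.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(alphaOnly("a1-b2_c3!"));      // abc
        System.out.println(alphaAndDigits("a1-b2_c3!")); // a1b2c3
    }
}
```

String.replaceAll compiles its regex on every call, so the constants mainly pay off when the same cleanup runs many times.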
Or, with Guava:
private static final CharMatcher ALNUM =
CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z'))
.or(CharMatcher.inRange('0', '9')).precomputed();
// ...
String alphaAndDigits = ALNUM.retainFrom(input);
But if you want to turn accented characters into something sensible that's still ascii, look at these questions:
Converting Java String to ASCII
Java change áéőűú to aeouu
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n or Remove diacritical marks from unicode chars

I am using this:
s = s.replaceAll("\\W", "");
It removes all special characters from the string, except the underscore, which counts as a word character.
Here
\w : A word character, short for [a-zA-Z_0-9]
\W : A non-word character
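Because \w includes the underscore, \W leaves underscores in place. A small sketch of the difference from an explicit whitelist (WordCharDemo and its method names are hypothetical):

```java
class WordCharDemo {
    // Removes everything outside [a-zA-Z_0-9]; underscores survive.
    public static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // Removes everything outside [a-zA-Z0-9]; underscores are removed too.
    public static String stripNonAlnum(String s) {
        return s.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonWord("user_name-42!"));  // user_name42
        System.out.println(stripNonAlnum("user_name-42!")); // username42
    }
}
```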

You can use the following method to keep alphanumeric characters.
replaceAll("[^a-zA-Z0-9]", "");
And if you want to keep only alphabetical characters use this
replaceAll("[^a-zA-Z]", "");

Replace any single special character with
replaceAll("\\your special character","new character");
For example, to replace every occurrence of * with a space:
replaceAll("\\*"," ");
Note: this statement replaces only one kind of special character at a time.

Following Andrzej Doyle's answer, I think the better solution is to use org.apache.commons.lang3.StringUtils.stripAccents():
package bla.bla.utility;
import org.apache.commons.lang3.StringUtils;
public class UriUtility {
public static String normalizeUri(String s) {
String r = StringUtils.stripAccents(s);
r = r.replace(" ", "_");
r = r.replaceAll("[^\\.A-Za-z0-9_]", "");
return r;
}
}

In C#:
string Output = Regex.Replace(Input, @"[^ a-zA-Z0-9&,_]", "");
Here all the special characters except space, comma, and ampersand are removed. You can also remove space, comma and ampersand with the following regular expression.
string Output = Regex.Replace(Input, @"[^a-zA-Z0-9_]", "");
Where Input is the string in which we need to replace the characters.

Here is a statement I used to remove all possible special characters from the string (JavaScript):
name = name.replace(/[&\/\\#,+()$~%!.„'":*‚^_¤?<>|@ª{«»§}©®™ ]/g, '').toLowerCase();

You can use basic regular expressions on strings to find all special characters or use pattern and matcher classes to search/modify/delete user defined strings. This link has some simple and easy to understand examples for regular expressions: http://www.vogella.de/articles/JavaRegularExpressions/article.html

You can look up the Unicode code point of the junk character with the Character Map tool on a Windows PC and write it with a \u escape, e.g. \u00a9 for the copyright symbol.
You can then use the string with that particular character: instead of removing the junk character, replace it with the proper Unicode escape.

To keep spaces as well, use this pattern: "[^a-z A-Z 0-9]"

Related

Insert character before specific character Java

I have been taking a look at the regular expressions and how to use it in Java for the problem I have to solve. I have to insert a \ before every ". This is what I have:
public class TestExpressions {
public static void main (String args[]) {
String test = "$('a:contains(\"CRUCERO\")')";
test = test.replaceAll("(\")","$1%");
System.out.println(test);
}
}
The output is:
$('a:contains("%CRUCERO"%)')
What I want is:
$('a:contains(\"CRUCERO\")')
I have changed % to \\ but I get a StringIndexOutOfBounds error and I don't know why. If someone can help me I would appreciate it. Thank you in advance.
I have to insert a \ before every "
You can try replace, which treats both of its arguments literally: regex metacharacters in the target are not interpreted and the replacement has no special characters either, so you can simply write the string literals you want.
So let's just replace " with the \" literal. You can write it as
test = test.replace("\"", "\\\"");
If you want to insert a backslash before each quote using replaceAll, then use:
test = test.replaceAll("(\")","\\\\$1"); // $('a:contains(\"CRUCERO\")')
Or if you want to avoid already escaped quote then use negative lookbehind:
String test = "$('a:contains(\\\"CRUCERO\")')";
test = test.replaceAll("((?<!\\\\)\")","\\\\$1"); // $('a:contains(\"CRUCERO\")')
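The error mentioned in the question comes from how replaceAll reads its replacement string: there a backslash escapes the next character and $ refers to a group, so a replacement that ends in a lone backslash is rejected with an exception. One way around this is Matcher.quoteReplacement from the standard library, sketched below. The class and method names QuoteEscaper and escapeQuotes are made up for illustration.

```java
import java.util.regex.Matcher;

class QuoteEscaper {
    // Hypothetical helper: inserts a backslash before every double quote.
    public static String escapeQuotes(String s) {
        // quoteReplacement escapes '\' and '$' so the replacement is taken literally.
        return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        String test = "$('a:contains(\"CRUCERO\")')";
        System.out.println(escapeQuotes(test)); // $('a:contains(\"CRUCERO\")')
    }
}
```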
String result = subject.replaceAll("(?i)\"CRUCERO\"", "\\\"CRUCERO\\\"");
EXPLANATION:
Match the character string “"CRUCERO"” literally (case insensitive) «"CRUCERO"»
Ignore unescaped backslash «\»
Insert the character string “"CRUCERO” literally «"CRUCERO»
Ignore unescaped backslash «\»
Insert the character “"” literally «"»
If your goal is to escape text for Java strings, then instead of regular expressions, consider using
String escaped = org.apache.commons.lang.StringEscapeUtils.
escapeJava("$('a:contains(\"CRUCERO\")')");
System.out.println(escaped);
Output:
$('a:contains(\"CRUCERO\")')
JavaDoc: http://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#escapeJava(java.lang.String)

How to get an alphanumeric String from any string in Java? [duplicate]

I would like to format some String such as "I>Télé" to something like "itele".
The idea is that I want my String to be lower case (done), without whitespaces (done), no accents or special characters (like >, <, /, %, ~, é, #, ï etc).
It is okay to delete occurrences of special characters, but I want to keep letters while removing accents (as I did in my example). Here is what I did, but I don't think that a good solution is to replace every é, è, ê, ë by "e", then do it again for "i", "a" etc., and then remove every special character...
String name = "I>télé"; //example
String result = name.toLowerCase().replace(" ", "").replace("é","e").........;
The purpose of that is to provide a valid filename for resources for an Android app, so if you have any other idea, I'll take it !
You can use the java.text.Normalizer class to convert your text into normal Latin characters followed by diacritic marks (accents), where possible. So for example, the single-character string "é" would become the two character string ['e', {COMBINING ACUTE ACCENT}].
After you've done this, your String would be a combination of unaccented characters, accent modifiers, and the other special characters you've mentioned. At this point you could filter the characters in your string using only a whitelist to keep what you want (which could be as simple as [A-Za-z0-9] for a regex, depending on what you're after).
An approach might look like:
String name ="I>télé"; //example
String normalized = Normalizer.normalize(name, Form.NFD);
String result = normalized.replaceAll("[^A-Za-z0-9]", "");
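Putting the two steps above together, a minimal self-contained sketch might look like this. The class and method names (FileNameSlug, toSlug) are hypothetical, and lower-casing with Locale.ROOT is an added assumption based on the question's requirement for lower-case output.

```java
import java.text.Normalizer;
import java.util.Locale;

class FileNameSlug {
    // Hypothetical helper: accent-free, lower-case, ASCII letters and digits only.
    public static String toSlug(String name) {
        // Decompose accented letters into base letter + combining mark (NFD).
        String normalized = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Whitelist: drop everything that is not an ASCII letter or digit,
        // which also removes the combining marks and any whitespace.
        String ascii = normalized.replaceAll("[^A-Za-z0-9]", "");
        return ascii.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "I>T\u00e9l\u00e9" is "I>Télé" written with Unicode escapes.
        System.out.println(toSlug("I>T\u00e9l\u00e9")); // itele
    }
}
```

Letters that have no canonical decomposition, such as ø or ß, are removed by the whitelist rather than transliterated.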
You can do something like
String res = "";
for (char c : name.toCharArray()) {
    if (Character.isLetter(c) || Character.isDigit(c))
        res += c;
}
//Normalize using the method below
http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-6/
public static String stripAccents(String s) {
s = Normalizer.normalize(s, Normalizer.Form.NFD);
s = s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
return s;
}
Try using ASCII codes; this link may help.

Escaping special characters in Java Regular Expressions

Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?
This would be very handy in dynamically building a regular expression, without having to manually escape each individual character.
For example, consider a simple regex like \d+\.\d+ that matches numbers with a decimal point like 1.2, as well as the following code:
String digit = "d";
String point = ".";
String regex1 = "\\d+\\.\\d+";
String regex2 = Pattern.quote(digit + "+" + point + digit + "+");
Pattern numbers1 = Pattern.compile(regex1);
Pattern numbers2 = Pattern.compile(regex2);
System.out.println("Regex 1: " + regex1);
if (numbers1.matcher("1.2").matches()) {
System.out.println("\tMatch");
} else {
System.out.println("\tNo match");
}
System.out.println("Regex 2: " + regex2);
if (numbers2.matcher("1.2").matches()) {
System.out.println("\tMatch");
} else {
System.out.println("\tNo match");
}
Not surprisingly, the output produced by the above code is:
Regex 1: \d+\.\d+
Match
Regex 2: \Qd+.d+\E
No match
That is, regex1 matches 1.2 but regex2 (which is "dynamically" built) does not (instead, it matches the literal string d+.d+).
So, is there a method that would automatically escape each regex meta-character?
If there were, let's say, a static escape() method in java.util.regex.Pattern, the output of
Pattern.escape('.')
would be the string "\.", but
Pattern.escape(',')
should just produce ",", since it is not a meta-character. Similarly,
Pattern.escape('d')
could produce "\d", since 'd' is used to denote digits (although escaping may not make sense in this case, as 'd' could mean a literal 'd', which wouldn't be misunderstood by the regex interpreter to be something else, as would be the case with '.').
Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?
If you are looking for a way to create constants that you can use in your regex patterns, then just prepending them with "\\" should work but there is no nice Pattern.escape('.') function to help with this.
So if you are trying to match "\\d" (the string \d instead of a decimal character) then you would do:
// this will match on \d as opposed to a decimal character
String matchBackslashD = "\\\\d";
// as opposed to
String matchDecimalDigit = "\\d";
The 4 slashes in the Java string turn into 2 slashes in the regex pattern. 2 backslashes in a regex pattern matches the backslash itself. Prepending any special character with backslash turns it into a normal character instead of a special one.
matchPeriod = "\\.";
matchPlus = "\\+";
matchParens = "\\(\\)";
...
In your post you use the Pattern.quote(string) method. This method wraps your pattern between "\\Q" and "\\E" so you can match a string even if it happens to have a special regex character in it (+, ., \\d, etc.)
I wrote this pattern:
Pattern SPECIAL_REGEX_CHARS = Pattern.compile("[{}()\\[\\].+*?^$\\\\|]");
And use it in this method:
String escapeSpecialRegexChars(String str) {
return SPECIAL_REGEX_CHARS.matcher(str).replaceAll("\\\\$0");
}
Then you can use it like this, for example:
Pattern toSafePattern(String text)
{
return Pattern.compile(".*" + escapeSpecialRegexChars(text) + ".*");
}
We needed to do that because, after escaping, we add some regex expressions. If not, you can simply use \Q and \E:
Pattern toSafePattern(String text)
{
return Pattern.compile(".*\\Q" + text + "\\E.*");
}
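To make the difference between the two variants concrete, here is a small sketch that reuses the escaping pattern above together with the \d+\.\d+ example from the question. The class name RegexEscapeDemo and the buildDecimalRegex method are illustrative assumptions, not part of any library.

```java
import java.util.regex.Pattern;

class RegexEscapeDemo {
    private static final Pattern SPECIAL_REGEX_CHARS =
            Pattern.compile("[{}()\\[\\].+*?^$\\\\|]");

    // Prefixes every regex metacharacter in str with a backslash.
    public static String escapeSpecialRegexChars(String str) {
        return SPECIAL_REGEX_CHARS.matcher(str).replaceAll("\\\\$0");
    }

    // Hypothetical helper: builds \d+<separator>\d+ with a literal separator.
    public static String buildDecimalRegex(String separator) {
        return "\\d+" + escapeSpecialRegexChars(separator) + "\\d+";
    }

    public static void main(String[] args) {
        String regex = buildDecimalRegex(".");
        System.out.println(regex);                          // \d+\.\d+
        System.out.println(Pattern.matches(regex, "1.2"));  // true
        System.out.println(Pattern.matches(regex, "1x2"));  // false
        // Quoting only the separator works as well: \d+\Q.\E\d+
        String quoted = "\\d+" + Pattern.quote(".") + "\\d+";
        System.out.println(Pattern.matches(quoted, "1.2")); // true
    }
}
```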
The only way the regex matcher knows you are looking for a digit and not the letter d is to escape the letter (\d). To type the regex escape character in java, you need to escape it (so \ becomes \\). So, there's no way around typing double backslashes for special regex chars.
Pattern.quote(String s) sort of does what you want. However, it leaves a little to be desired; it doesn't actually escape the individual characters, it just wraps the string with \Q...\E.
There is not a method that does exactly what you are looking for, but the good news is that it is actually fairly simple to escape all of the special characters in a Java regular expression:
regex.replaceAll("[\\W]", "\\\\$0")
Why does this work? Well, the documentation for Pattern specifically says that it's permissible to escape non-alphabetic characters that don't necessarily have to be escaped:
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
For example, ; is not a special character in a regular expression. However, if you escape it, Pattern will still interpret \; as ;. Here are a few more examples:
> becomes \> which is equivalent to >
[ becomes \[ which is the escaped form of [
8 is still 8.
\) becomes \\\) which is the escaped forms of \ and ) concatenated.
Note: The key is the definition of "non-alphabetic", which in the documentation really means "non-word" characters, or characters outside the character set [a-zA-Z_0-9].
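As a quick check of the claim above, the following sketch escapes every non-word character and verifies that the result matches only the original text. The names NonWordEscaper and escapeNonWord are hypothetical.

```java
import java.util.regex.Pattern;

class NonWordEscaper {
    // Puts a backslash in front of every character outside [a-zA-Z_0-9].
    public static String escapeNonWord(String literal) {
        return literal.replaceAll("[\\W]", "\\\\$0");
    }

    public static void main(String[] args) {
        String escaped = escapeNonWord("a.b*c");
        System.out.println(escaped);                           // a\.b\*c
        System.out.println(Pattern.matches(escaped, "a.b*c")); // true
        // Unescaped, '.' and '*' act as metacharacters and match other text too.
        System.out.println(Pattern.matches("a.b*c", "axbc"));  // true
        System.out.println(Pattern.matches(escaped, "axbc"));  // false
    }
}
```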
Use this utility function escapeQuotes() in order to escape strings in between groups and sets of a regular expression.
List of Regex Literals to escape <([{\^-=$!|]})?*+.>
public class RegexUtils {
static String escapeChars = "\\.?![]{}()<>*+-=^$|";
public static String escapeQuotes(String str) {
if(str != null && str.length() > 0) {
return str.replaceAll("[\\W]", "\\\\$0"); // \W designates non-word characters
}
return "";
}
}
From the Pattern class the backslash character ('\') serves to introduce escaped constructs. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
Example: String to be matched (hello) and the regex with a group is (\(hello\)). From here you only need to escape the matched string as shown below.
public static void main(String[] args) {
String matched = "(hello)", regexExpGrup = "(" + escapeQuotes(matched) + ")";
System.out.println("Regex : "+ regexExpGrup); // (\(hello\))
}
Agree with Gray, as you may need your pattern to have both literals (\[, \]) and meta-characters ([, ]). So with some utility you should be able to escape all characters first, and then add the meta-characters you want to the same pattern.
Use
Pattern p = Pattern.compile("\"");
String s= p.toString()+"yourcontent"+p.toString();
This will give the result with yourcontent as is.

How to remove special characters from a string?

I want to remove special characters like:
- + ^ . : ,
from an String using Java.
That depends on what you define as special characters, but try replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since you'd then either have to escape it or it would mean "any but these characters".
Another note: the - character needs to be the first or last one in the list, otherwise you'd have to escape it or it would define a range (e.g. :-, would mean "all characters in the range : to ,").
So, in order to keep consistency and not depend on character positioning, you might want to escape all those characters that have a special meaning in regular expressions (the following list is not complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex: [\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape backslashes: "[\\p{P}\\p{S}]").
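A minimal sketch of the punctuation-and-symbol approach. The two properties are placed inside one character class here so that each punctuation or symbol character is matched on its own; the class and method names are illustrative only.

```java
class PunctuationStripper {
    // \p{P} = any Unicode punctuation, \p{S} = any Unicode symbol.
    public static String stripPunctuationAndSymbols(String s) {
        return s.replaceAll("[\\p{P}\\p{S}]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuationAndSymbols("a-b+c^d.e:f,g")); // abcdefg
    }
}
```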
A third way could be something like this, if you can exactly define what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
This means: replace everything that is not a word character (a-z in any case, 0-9 or _) or whitespace.
Edit: please note that there are a couple of other patterns that might prove helpful. However, I can't explain them all, so have a look at the reference section of regular-expressions.info.
Here's a less restrictive alternative to the "define allowed characters" approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and not a separator (whitespace, linebreak etc.). Note that you can't use [\P{L}\P{Z}] (upper case P means not having that property), since that would mean "everything that is not a letter or not whitespace", which almost matches everything, since letters are not whitespace and vice versa.
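The difference between the negated class and the \P form can be seen in a small sketch (LetterFilter and its method names are made up for this example):

```java
class LetterFilter {
    // Removes everything that is neither a letter nor a separator.
    public static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    // The variant warned about above: a union of "not a letter" and
    // "not a separator", which every character satisfies.
    public static String brokenVariant(String s) {
        return s.replaceAll("[\\P{L}\\P{Z}]", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + keepLettersAndSeparators("ab 1!") + "]"); // [ab ]
        System.out.println("[" + brokenVariant("ab 1!") + "]");            // []
    }
}
```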
Additional information on Unicode
Some unicode characters seem to cause problems due to different possible ways to encode them (as a single code point or a combination of code points). Please refer to regular-expressions.info for more information.
This will replace all the characters except alphanumeric
replaceAll("[^A-Za-z0-9]","");
As described here
http://developer.android.com/reference/java/util/regex/Pattern.html
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionTest {
public static void main(String[] args) {
System.out.println("String is = "+getOnlyStrings("!&(*^*(^(+one(&(^()(*)(*&^%$##!#$%^&*()("));
System.out.println("Number is = "+getOnlyDigits("&(*^*(^(+91-&*9hi-639-0097(&(^("));
}
public static String getOnlyDigits(String s) {
Pattern pattern = Pattern.compile("[^0-9]");
Matcher matcher = pattern.matcher(s);
String number = matcher.replaceAll("");
return number;
}
public static String getOnlyStrings(String s) {
Pattern pattern = Pattern.compile("[^a-z A-Z]");
Matcher matcher = pattern.matcher(s);
String number = matcher.replaceAll("");
return number;
}
}
Result
String is = one
Number is = 9196390097
Try replaceAll() method of the String class.
BTW here is the method, return type and parameters.
public String replaceAll(String regex,
String replacement)
Example:
String str = "Hello +-^ my + - friends ^ ^^-- ^^^ +!";
str = str.replaceAll("[-+^]*", "");
It should remove all the {'^', '+', '-'} chars that you wanted to remove!
To Remove Special character
String t2 = "!##$%^&*()-';,./?><+abdd";
t2 = t2.replaceAll("\\W+","");
Output will be : abdd.
This works perfectly.
Use the String.replaceAll() method in Java.
replaceAll should be good enough for your problem.
You can remove a single character as follows:
String str="+919595354336";
String result = str.replaceAll("\\+","");
System.out.println(result);
OUTPUT:
919595354336
If you just want to do a literal replace in java, use Pattern.quote(string) to escape any string to a literal.
myString.replaceAll(Pattern.quote(matchingStr), replacementStr)
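A hedged sketch of the literal-replace idea. It additionally wraps the replacement in Matcher.quoteReplacement, because $ and \ are special in replaceAll replacements too; String.replace is shown as the plain alternative. The class and method names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplace {
    // Hypothetical helper: treats both target and replacement as plain text.
    public static String replaceLiteral(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target),
                                Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("1+1=2", "+", "$")); // 1$1=2
        // String.replace performs the same literal replacement without regex.
        System.out.println("1+1=2".replace("+", "$"));         // 1$1=2
    }
}
```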

Remove doubled letter from a string using java

I need to remove a doubled letter from a string using regex operations in java.
Eg: PRINCEE -> PRINCE
APPLE -> APLE
Simple Solution (remove duplicate characters)
Like this:
final String str = "APPLEE";
String replaced = str.replaceAll("(.)\\1", "$1");
System.out.println(replaced);
Output:
APLE
Not just any characters, letters only
As @Jim correctly comments, the above matches any double character, not just letters. Here are a few variations that just match letters:
// the basics, ASCII letters. these two are equivalent:
str.replaceAll("([A-Za-z])\\1", "$1");
str.replaceAll("(\\p{Alpha})\\1", "$1");
// Unicode Letters
str.replaceAll("(\\p{L})\\1", "$1");
// anything where Character.isLetter(ch) returns true
str.replaceAll("(\\p{javaLetter})\\1", "$1");
References:
Character.isLetter(ch) (javadocs)
Any method in Character of the form Character.isXyz(char) enables a pattern named \p{javaXyz} (mind the capitalization). This mechanism is described in the Pattern javadocs.
Unicode blocks and categories can also be matched with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. This mechanism is also described in the Pattern javadocs.
String s = "...";
String replaced = s.replaceAll( "([A-Z])\\1", "$1" );
If you want to replace just duplicate ("AA"->"A", "AAA" -> "AA") use
public String undup(String str) {
return str.replaceAll("(\\w)\\1", "$1");
}
To replace triplicates etc use: str.replaceAll("(\\w)\\1+", "$1");
To replace only a single dupe in a long string (AAAA->AAA, AAA->AA) use: str.replaceAll("(\\w)(\\1+)", "$2");
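The three variants above behave as follows on a few sample inputs (the class name DedupDemo and the method names are only for illustration):

```java
class DedupDemo {
    // "AA" -> "A": each adjacent pair collapses to one character.
    public static String collapsePairs(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    // Any run of the same character collapses to a single character.
    public static String collapseRuns(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");
    }

    // Each run loses exactly one character.
    public static String dropOne(String s) {
        return s.replaceAll("(\\w)(\\1+)", "$2");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("PRINCEE")); // PRINCE
        System.out.println(collapsePairs("AAA"));     // AA
        System.out.println(collapseRuns("AAAA"));     // A
        System.out.println(dropOne("AAAA"));          // AAA
    }
}
```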
This can be done simply by iterating over the String instead of having to resort to regexes.
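static String removeRepeats(String text) {
    if (text.length() == 0) return "";
    StringBuilder ret = new StringBuilder(text.length());
    ret.append(text.charAt(0));
    for (int i = 1; i < text.length(); i++) {
        // append only when the character differs from the previous one
        if (text.charAt(i) != text.charAt(i - 1))
            ret.append(text.charAt(i));
    }
    return ret.toString();
}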
