Regular Expression: Replace everything except specific characters and whitespace - java

I am coding in Java and I have a string in which I want to keep letters, digits, ":", "-" and whitespace, and remove everything else. So I used this piece of code:
str=str.replaceAll("[^\\dA-Za-z#:-\\s*]", "");
It doesn't work.
It works fine as long as I only use
str=str.replaceAll("[^\\dA-Za-z#:-]", "");
where everything except letters, digits and the characters ":" and "-" is removed.
But when I try to add the condition for whitespace characters, I run into problems.
I would appreciate your help.
Thank you in advance.

Inside a character class, - denotes a range.
In your case you were actually trying to match the characters in the range from : to \s, which is an invalid range.
Move the - to the start:
[^-\\dA-Za-z#:\\s]
or to the end:
[^\\dA-Za-z#:\\s-]

The dash must be the first or last character in a character class (or be escaped), or it will be interpreted as a range indicator (as in [A-Z]); in your case [:-\\s] is a meaningless range. Use
str = str.replaceAll("[^\\dA-Za-z#:\\s-]+", "");
(Or did you want to keep asterisks in your text, too?)
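To make the fix from the answers above concrete, here is a minimal sketch. The class, field and method names (KeepAllowedChars, UNWANTED, clean) are made up for illustration; the pattern is the one suggested above, with the dash placed last so that it is a literal and not a range operator.

```java
import java.util.regex.Pattern;

class KeepAllowedChars {
    // Matches every run of characters that are NOT a digit, an ASCII letter,
    // '#', ':', whitespace or '-'. The '-' is last, so it is taken literally.
    private static final Pattern UNWANTED = Pattern.compile("[^\\dA-Za-z#:\\s-]+");

    // Hypothetical helper: removes everything the pattern above matches.
    static String clean(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("a*b c:d-e#f!")); // prints: ab c:d-e#f
    }
}
```

Note that '#' is kept as well, because it appears in the character class from the question, even though the question's text does not list it.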

Related

Java, Regex, strip unwanted characters [trailing, leading, between]

I need help with a regular expression to strip unwanted characters from a String (in Java).
I solved this issue with 4 regular expressions applied one after another.
The replacement is called many times (peaks: 50+ times/sec), and this decreases performance.
But I think it should be possible with a single expression, so that performance improves a little.
The test string is
" ! ... my-Cruc i#l_\\/Disp lay.Na#m3 ?;()! "
The tasks I would like to perform with the regex:
Remove all leading non-alpha characters (beginning of string)
Remove all trailing non-alphanumeric characters (end of string)
Remove all non-alphanumeric characters (except [_-.]) in between
So the result will be
my-Crucil_Display.Nam3
The problem is switching between the built-in classes Alnum and Alpha depending on the position in the string (beginning, end), plus the exception characters [_-.] in between.
I have tried this many times over the last few days, but I cannot get it to work.
Removing leading non-alpha characters works with the regex
^([^\\p{Alpha}]+)?
But if I append the "between" part, nothing works any longer.
Removing trailing non-alphanumeric characters with the regex
([^\\p{Alnum}]+$)
works, but not in combination with the other regexes.
One of my last attempts was
(^[^\\p{Alpha}]+)?[^\\p{Alnum}\\._-]+([^\\p{Alnum}]+$)
Can anyone help me get this working?
You may use
^\P{Alpha}+|\P{Alnum}+$|[^\p{Alnum}_.-]
Java:
s = s.replaceAll("^\\P{Alpha}+|\\P{Alnum}+$|[^\\p{Alnum}_.-]", "");
Or, to make it Unicode aware, add the (?U) flag:
s = s.replaceAll("(?U)^\\P{Alpha}+|\\P{Alnum}+$|[^\\p{Alnum}_.-]", "");
Details
^\P{Alpha}+ - any 1 or more chars other than alphabetic chars at the start of the string
| - or
\P{Alnum}+$ - any 1 or more chars other than alphanumeric chars at the end of the string
| - or
[^\p{Alnum}_.-] - any char other than alphanumeric, _, . and - chars anywhere in the string
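As a rough sketch of how the alternation above could be used when the replacement runs many times per second: the pattern is compiled once and reused, since String.replaceAll compiles its regex on every call. The names NameSanitizer, UNWANTED and sanitize are assumptions made for this example, not part of the original answer.

```java
import java.util.regex.Pattern;

class NameSanitizer {
    // Compiled once; each alternative handles one of the three tasks:
    // leading non-alpha, trailing non-alphanumeric, disallowed chars in between.
    private static final Pattern UNWANTED =
            Pattern.compile("^\\P{Alpha}+|\\P{Alnum}+$|[^\\p{Alnum}_.-]");

    // Hypothetical helper: strips everything matched by the pattern.
    static String sanitize(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The test string from the question (the Java literal \\ is a single backslash).
        String test = " ! ... my-Cruc i#l_\\/Disp lay.Na#m3 ?;()! ";
        System.out.println(sanitize(test)); // prints: my-Crucil_Display.Nam3
    }
}
```

Whether precompiling makes a measurable difference at 50 calls per second is something to verify by profiling; it mainly avoids repeated pattern compilation.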

Need to remove special characters from strings written in a file in java

I have a text file which contains data. Some special characters appear in the file. I need to get rid of all "special" characters, i.e.:
],à,>,¤,`,ƒ,Š,¥,Œ,^,>¤,°,ã,Ãé,–«»°,NÂ,N,º,?¿Ññ,ß,ä,º,ô5,ª,é ,ª,§,Á
These need to be replaced with a space char, not removed.
I have one constraint: I have to store the output in a String, because I need to pass that string on to TIBCO. I have written the following code, but it removes everything, and I need to keep the + and - symbols in the file.
str = str.replaceAll("[^\\w\\s]*", "");
Any help appreciated.
Firstly, if you need to replace with a space and not with an empty string, why are you replacing with an empty string?
You could just use a white list of all chars you want to keep by adding plus and minus signs to the character class:
.replaceAll("[^\\w\\s.,+-]", " ")
I also added the dot and comma, since you probably want these too.
But it looks like a blacklist covering a range of characters would be simpler, since most of the chars you don't want are above 127:
.replaceAll("[\u0080-\uffff]", " ")
You can add other chars you don't want to this character class as you need.
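Note: In both cases, I removed the quantifier *, because you want a 1-for-1 replacement. With *, the regex also matches the empty string between every pair of characters, and it matches a whole run of unwanted chars as a single match, so the replacement would insert spaces everywhere and collapse runs, which will mess up your file.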
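The difference between the two suggested patterns and the original * version can be sketched as follows. The class and method names are invented for this illustration, and the last method deliberately keeps the question's * quantifier (but replaces with a space) only to show the effect of empty matches.

```java
class SpecialCharReplacer {
    // Whitelist: any char other than word chars, whitespace, '.', ',', '+', '-'
    // is replaced by exactly one space.
    static String keepWhitelisted(String s) {
        return s.replaceAll("[^\\w\\s.,+-]", " ");
    }

    // Range-based: every char from U+0080 to U+FFFF is replaced by one space.
    static String blankNonAscii(String s) {
        return s.replaceAll("[\\u0080-\\uffff]", " ");
    }

    // The question's pattern with '*': it also matches the empty string,
    // so a space is inserted at every position.
    static String withStarQuantifier(String s) {
        return s.replaceAll("[^\\w\\s]*", " ");
    }

    public static void main(String[] args) {
        System.out.println(keepWhitelisted("a\u00a7b+c-d"));   // prints: a b+c-d
        System.out.println("[" + blankNonAscii("caf\u00e9") + "]");
        System.out.println("[" + withStarQuantifier("ab") + "]"); // prints: [ a b ]
    }
}
```

The non-ASCII characters are written as \u escapes so the example does not depend on the source file's encoding.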

regex strip spaces hyphen

I am unable to strip one space before and after a hyphen. I have tried:
sample.replaceAll("[\\s\\-\\s]","")
and permutations of it, to no avail. I don't want to strip all spaces, nor all the intervening spaces. I am trying to split a string on " " but want to eliminate "-". Any insight appreciated.
[\s\-\s] is a character class; it does not match a space followed by - followed by a space. It matches any single character that is either whitespace or -, and each such character is replaced with the empty string.
You can use this:
sample = sample.replaceAll("[ ]-[ ]","-");
Or even String.replace would work here. You don't really need replaceAll:
sample = sample.replace(" - ", "-");
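A small sketch contrasting the character-class attempt with the literal replacement; the class name, method names and sample string are assumptions chosen for illustration.

```java
import java.util.Arrays;

class HyphenSpaces {
    // The attempt from the question: a character class removes EVERY
    // whitespace char and EVERY hyphen, one character at a time.
    static String characterClassAttempt(String s) {
        return s.replaceAll("[\\s\\-\\s]", "");
    }

    // Literal replacement of the three-character sequence " - " by "-".
    static String tighten(String s) {
        return s.replace(" - ", "-");
    }

    public static void main(String[] args) {
        String sample = "alpha - beta gamma";
        System.out.println(characterClassAttempt(sample)); // prints: alphabetagamma
        String tightened = tighten(sample);
        System.out.println(tightened);                      // prints: alpha-beta gamma
        System.out.println(Arrays.toString(tightened.split(" "))); // prints: [alpha-beta, gamma]
    }
}
```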

Regular expressions: all words after my current one are gone

I need to remove all strings from my text file, such as:
flickr:user=32jdisffs
flickr:user=acssd
flickr:user=asddsa89
I'm currently using fields[i] = fields[i].replaceAll(" , flickr:user=.*", "");
However, the issue with this approach is that every word after flickr:user= is removed from the content, even the ones after the space.
Thanks
You probably need
replaceAll("flickr:user=[0-9A-Za-z]+", "");
flickr:user=\w+ should do it:
String noFlickerIdsHere = stringWithIds.replaceAll("flickr:user=\\w+", "");
Reference:
\w = A word character: [a-zA-Z_0-9]
Going by the question as stated, chances are that you want:
fields[i] = fields[i].replaceAll(" , flickr:user=[^ ]* ", ""); // or " "
This will match the string, including the value of user up to but not including the first space, followed by a space, and replace it either by a blank string, or a single space. However this will (barring the comma) net you an empty result with the input you showed. Is that really what you want?
I'm also not sure where the " , " at the beginning fits into the example you showed.
The reason for your difficulties is that an unbounded .* will match everything from that point up until the end of the input (even if that amounts to nothing; that's what the * is for). For a line-based regular expression parser, that's to the end of the line.
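To illustrate the difference between the unbounded .* and a bounded match, here is a minimal sketch. The class name, method names and the sample line are hypothetical; the two patterns are taken from the question and the answers above.

```java
class FlickrIdStripper {
    // Unbounded: .* runs to the end of the line, taking every later word with it.
    static String greedy(String s) {
        return s.replaceAll("flickr:user=.*", "");
    }

    // Bounded: \w+ stops at the first non-word character, such as a space.
    static String bounded(String s) {
        return s.replaceAll("flickr:user=\\w+", "");
    }

    public static void main(String[] args) {
        String line = "sunset flickr:user=acssd beach";
        // "beach" is lost; only "sunset " (with its trailing space) remains.
        System.out.println("[" + greedy(line) + "]");
        // "beach" survives; the spaces on both sides of the removed id remain.
        System.out.println("[" + bounded(line) + "]");
    }
}
```

If the leftover double space matters, one option is to let the pattern consume one adjacent space as well, as in the " , flickr:user=[^ ]* " variant above.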

Removing all whitespace characters except for " "

I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky: I want to remove all whitespace except the space character: ' '.
In Java, the RegEx I have tried is: [\s-[ ]], but this one also strips out ' '.
UPDATE:
Here is the particular string that I am attempting to strip spaces from:
project team manage key
Note: it would be the characters between "team" and "manage". They appear as a long space when editing this post but view as a single space in view mode.
Try using this regular expression:
[^\S ]+
It's a bit confusing to read because of the double negative. The regular expression [\S ] matches the characters you want to keep, i.e. either a space or anything that isn't a whitespace. The negated character class [^\S ] therefore must match all the characters you want to remove.
Using a Guava CharMatcher:
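String text = ...
// CharMatcher.whitespace() replaces the older WHITESPACE constant, which newer Guava versions no longer provide
String stripped = CharMatcher.whitespace().and(CharMatcher.isNot(' '))
    .removeFrom(text);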
If you actually just want that trimmed from the start and end of the string (like String.trim()) you'd use trimFrom rather than removeFrom.
There's no subtraction of character classes in Java, otherwise you could use [\s--[ ]], note the double dash. You can always simulate set subtraction using intersection with the complement, so
[\s&&[^ ]]
should work. It's no better than [^\S ]+ from the first answer, but the principle is different and it's good to know both.
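Both regex variants can be tried side by side with a sketch like the following. The class and method names are made up for this example; the two patterns are the ones from the answers above, with a + added to the intersection variant so that runs are removed in one match.

```java
class WhitespaceExceptSpace {
    // Double negative: anything that is neither non-whitespace nor a space.
    static String viaNegatedClass(String s) {
        return s.replaceAll("[^\\S ]+", "");
    }

    // Intersection with the complement: whitespace AND not a space.
    static String viaIntersection(String s) {
        return s.replaceAll("[\\s&&[^ ]]+", "");
    }

    public static void main(String[] args) {
        String input = "project team\t manage\r\nkey";
        System.out.println(viaNegatedClass(input)); // prints: project team managekey
        System.out.println(viaIntersection(input)); // prints: project team managekey
    }
}
```

One caveat: by default Java's \s only covers space, \t, \n, \x0B, \f and \r, so a character such as a non-breaking space is not touched by either pattern. If the "long space" in the question is such a character, the patterns would need the (?U) flag or an explicit entry for it; that is a guess, since the original characters are not visible here.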
I solved it with this (JavaScript regex literal syntax):
anyString.replace(/[\f\t\n\v\r]*/g, '');
It is just a collection of the common whitespace characters excluding the blank (so effectively \s without blanks). It includes the tab, carriage return, new line, vertical tab and form feed characters.
