I know that trimming whitespace is as easy as String.trim(), but my string contains tab (\t) characters, which I would like to keep.
Example:
"teststring\t\t\t ".trimSpaceNotTab() => "teststring\t\t\t"
My current implementation is to use split();
String[] arr = tabbedString.split("\t");
Then joining them somewhere as a string.
I find this implementation slow and ugly.
Is there a better way in Java where I can retain the tabs?
How about
tabbedString.replaceAll("[ \\n\\x0B\\f\\r]", "")
Function used - String.replaceAll()
If you also want to remove tabs, use the predefined character class \s instead.
Pattern Summary
Go through the string and check each char with Character.isSpaceChar() to see whether it is whitespace.
Use a regular expression that replaces all whitespace except tab (\t).
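For the trailing-space case in the question, one option is to anchor the character class at both ends of the string, so that only leading and trailing whitespace is removed and tabs are never touched. This is a minimal sketch rather than a definitive implementation: the class name TrimKeepTabs is made up for illustration, and trimSpaceNotTab is the hypothetical method name borrowed from the question.

```java
class TrimKeepTabs {
    // Whitespace other than tab: space, \n, vertical tab (\x0B), \f, \r.
    private static final String EDGES =
            "^[ \\n\\x0B\\f\\r]+|[ \\n\\x0B\\f\\r]+$";

    // Hypothetical helper: removes leading/trailing whitespace but keeps tabs.
    static String trimSpaceNotTab(String s) {
        return s.replaceAll(EDGES, "");
    }

    public static void main(String[] args) {
        String trimmed = trimSpaceNotTab("teststring\t\t\t ");
        System.out.println(trimmed.equals("teststring\t\t\t"));
    }
}
```

Because the class excludes \t, trimming stops as soon as a tab is reached from either end, and whitespace in the middle of the string is left alone.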
I have a text file which contains lot of permutations and combinations of special characters, white space and data.
I am storing the content of this file in an ArrayList, and as long as I do not call useDelimiter(), Java reads my text perfectly.
The only issue is that it's not accepting comma (,) and dot (.) as delimiters.
I know I can use input.useDelimiter(",|.| |\n") to use comma, dot, space and other options as delimiters, but then the results are not as correct as what Java gives me now.
Is there a way to instruct Java to use comma and dot as delimiters along with whatever default delimiter it uses?
Thanks in advance for your help :)
Regards,
Rahul
The default delimiter for Scanner is defined as the pattern \p{javaWhitespace}+, so if you want to also treat comma and dot as a delimiter, try
input.useDelimiter("(\\p{javaWhitespace}|\\.|,)+");
Note you need to escape dot, as that is a special character in regular expressions.
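As a rough illustration of that delimiter, here is a small sketch that reads tokens from an in-memory string instead of a file. The class name ScannerDelimiters and the helper tokens() are invented for this example; only the delimiter pattern comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiters {
    // Hypothetical helper: collects every token separated by whitespace,
    // commas, dots, or any run mixing the three.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            input.useDelimiter("(\\p{javaWhitespace}|\\.|,)+");
            while (input.hasNext()) {
                result.add(input.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("one, two.three\nfour"));
    }
}
```

The trailing + makes a run such as ", " count as a single delimiter, so no empty tokens appear between the words.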
Use escaped character like this:
input.useDelimiter("\\.");
You could do this:
String str = "...";
List<String> list = Arrays.asList(str.split(","));
Basically, the .split() method splits the string on the delimiter you pass (in this case, a comma) and returns an array of strings.
However, you seem to be after a List of Strings rather than an array, so the array must be turned into a list by using the Arrays.asList() utility. Just as an FYI you could also do something like so:
String str = "...";
ArrayList<String> list = new ArrayList<>(Arrays.asList(str.split(",")));
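One detail worth knowing here: Arrays.asList() returns a fixed-size List backed by the array, not a java.util.ArrayList, so elements cannot be added to it. The sketch below contrasts the two options; the class and method names (SplitToList, fixedView, mutableCopy) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitToList {
    // Fixed-size view over the array returned by split(); add() is unsupported.
    static List<String> fixedView(String csv) {
        return Arrays.asList(csv.split(","));
    }

    // Independent, resizable copy of the same tokens.
    static ArrayList<String> mutableCopy(String csv) {
        return new ArrayList<>(Arrays.asList(csv.split(",")));
    }

    public static void main(String[] args) {
        ArrayList<String> items = mutableCopy("a,b,c");
        items.add("d"); // fine on the copy; the fixed view would throw here
        System.out.println(items);
    }
}
```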
I am looking to write a regex that removes all characters up to the first &emsp and, if a (new section) follows the &emsp, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));
Have you tried String.split()? It creates an array of strings from a string, based on a delimiter.
Once you have the string split, just select the elements of the array that you need for print statement.
Your regex is very long and I do not want to debug it. However, the tip is that some characters have special meaning in regular expressions. For example, && means intersection inside a character class, and square brackets define character classes. Such characters must be escaped if you want them to be interpreted as plain characters and not as regex commands. To escape a special character, write \ in front of it. But \ is the escape character in Java string literals too, so it must be doubled.
For example to replace ampersand by letter A you should write str.replaceAll("\\&", "A")
Now you have all information you need. Try to start from simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.
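To illustrate the escaping advice in this answer: when the text to match is a fixed literal, Pattern.quote() can do the escaping instead of hand-written backslashes. The sketch below is only an illustration; the class EscapeDemo and the helper replaceLiteral() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: treats both arguments as plain text, so regex
    // metacharacters in 'literal' and '$' or '\' in 'replacement' are inert.
    static String replaceLiteral(String input, String literal, String replacement) {
        return input.replaceAll(Pattern.quote(literal),
                                Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Unescaped, the dot matches every character.
        System.out.println("a.b".replaceAll(".", "-"));
        // Quoted, only the literal dots are replaced.
        System.out.println(replaceLiteral("a.b.c", ".", "-"));
    }
}
```

For plain literal replacements, String.replace(CharSequence, CharSequence) avoids regular expressions altogether.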
Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.
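The separator character in the sample string did not survive formatting here, so the following sketch assumes, purely for illustration, that the marker is the literal text &emsp; and uses a shortened, made-up input. The class name RemoveUpToMarker and the helper strip() are hypothetical; the structure of the regex (lazy prefix, marker, optional group) follows the answer above.

```java
class RemoveUpToMarker {
    // Assumption: the marker is the literal text "&emsp;".
    // ^.*?  lazily consumes everything up to the first marker;
    // (...)? optionally consumes a directly following "(new section)".
    private static final String PREFIX = "^.*?&emsp;(\\(new section\\))?";

    static String strip(String s) {
        return s.replaceFirst(PREFIX, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("§ 431:10A-126&emsp;(new section)[x]Chemo"));
        System.out.println(strip("abc&emsp;def"));
    }
}
```

If the input contains no marker at all, the pattern does not match and the string is returned unchanged.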
I use this regular expression to validate many of the input fields of my Java web app:
"^[a-zA-Z0-9]+$"
But I need to modify it, because I have a couple of fields that need to allow blank spaces (for example: Address).
How can I modify it to allow blank spaces (if possible, not at the start)?
I think I need to use some escape character like \
I tried a few different combinations but none of them worked. Can somebody help me with this regex?
I'd suggest using this:
^[a-zA-Z0-9][a-zA-Z0-9 ]+$
It adds two things: first, you're guaranteed not to have a space at the beginning, while allowing characters you need. Afterwards, letters a-z and A-Z are allowed, as well as all digits and spaces (there's a space at the end of my regex).
If you want to allow only the space character, you can do:
^[a-zA-Z0-9 ]+$
If you also want to allow tabs (\t) and newline characters (\n, \r\n), you can do:
^[a-zA-Z0-9\s]+$
Also, as you asked, if you don't want the whitespace to be at the beginning:
^[a-zA-Z0-9][a-zA-Z0-9 ]+$
Use this: ^[a-zA-Z0-9]+[a-zA-Z0-9 ]+$. This should work. First atom ensures that there must be at least one character at beginning.
Try this: ^[a-zA-Z0-9 ]+$ (that is, add a space inside the character class).
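This regex doesn't allow spaces at the start or end of the string. One downside is that it also accepts the underscore character. (The alternation is wrapped in a group so that both anchors apply to both branches.)
^((\w+ )+\w+|\w+)$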
Try this one. I assume that any input with a length of at least one character is valid; the previously mentioned answers do not take that into account.
"^[a-zA-Z0-9][a-zA-Z0-9 ]*$"
If you want to allow all whitespace characters, replace the space by "\s"
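A quick way to compare these variants is to run the pattern against a few sample values. The sketch below uses the last pattern from this thread; the class FieldValidator and the method isValid() are made-up names, and the sample inputs are invented.

```java
import java.util.regex.Pattern;

class FieldValidator {
    // First character must be a letter or digit; the rest may include spaces.
    private static final Pattern ALNUM_WITH_SPACES =
            Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9 ]*$");

    // Hypothetical helper: true when the whole value matches the pattern.
    static boolean isValid(String value) {
        return ALNUM_WITH_SPACES.matcher(value).matches();
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"12 Main Street", " Main", "a", ""}) {
            System.out.println("[" + sample + "] -> " + isValid(sample));
        }
    }
}
```

Note that this pattern still accepts a trailing space, since the space is part of the second character class.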
I consider myself pretty good with Regular Expressions, but this one is appearing to be surprisingly tricky: I want to trim all whitespace, except the space character: ' '.
In Java, the RegEx I have tried is: [\s-[ ]], but this one also strips out ' '.
UPDATE:
Here is the particular string that I am attempting to strip spaces from:
project team manage key
Note: it would be the characters between "team" and "manage". They appear as a long space when editing this post but view as a single space in view mode.
Try using this regular expression:
[^\S ]+
It's a bit confusing to read because of the double negative. The regular expression [\S ] matches the characters you want to keep, i.e. either a space or anything that isn't a whitespace. The negated character class [^\S ] therefore must match all the characters you want to remove.
Using a Guava CharMatcher:
String text = ...
String stripped = CharMatcher.WHITESPACE.and(CharMatcher.isNot(' '))
.removeFrom(text);
If you actually just want that trimmed from the start and end of the string (like String.trim()) you'd use trimFrom rather than removeFrom.
There's no subtraction of character classes in Java, otherwise you could use [\s--[ ]], note the double dash. You can always simulate set subtraction using intersection with the complement, so
[\s&&[^ ]]
should work. It's no better than [^\S ]+ from the first answer, but the principle is different and it's good to know both.
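Both forms can be tried side by side, as in the rough sketch below (class and method names are invented for illustration). One caveat that may matter for the string in the question: by default, \s in Java covers only the ASCII whitespace characters, so Unicode spaces such as U+2003 are not matched unless the UNICODE_CHARACTER_CLASS flag is enabled, for example via the embedded flag (?U). Whether the original string contained such characters is an assumption.

```java
class KeepOnlySpaces {
    // Negated class: remove whatever is neither non-whitespace nor a space.
    static String stripNegated(String s) {
        return s.replaceAll("[^\\S ]+", "");
    }

    // Intersection: whitespace AND not a space.
    static String stripIntersection(String s) {
        return s.replaceAll("[\\s&&[^ ]]+", "");
    }

    // Same as stripNegated, but with Unicode-aware \s and \S.
    static String stripUnicode(String s) {
        return s.replaceAll("(?U)[^\\S ]+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNegated("project team\t\nmanage key"));
        System.out.println(stripUnicode("team\u2003manage key"));
    }
}
```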
I solved it with this:
anyString.replace(/[\f\t\n\v\r]*/g, '');
It is just a collection of all possible white space characters excluding blank (so actually \s without blanks). It includes tab, carriage return, new line, vertical tab and form feed characters.
I need to be able to split an input String by commas, semi-colons or white-space (or a mix of the three). I would also like to treat multiple consecutive delimiters in the input as a single delimiter. Here's what I have so far:
String regex = "[,;\\s]+";
return input.split(regex);
This works, except for when the input string starts with one of the delimiter characters, in which case the first element of the result array is an empty String. I do not want my result to have empty Strings, so that something like, ",,,,ZERO; , ;;ONE ,TWO;," returns just a three element array containing the capitalized Strings.
Is there a better way to do this than stripping out any leading characters that match my reg-ex prior to invoking String.split?
Thanks in advance!
No, there isn't. You can only ignore trailing delimiters by providing 0 as a second parameter to String's split() method:
return input.split(regex, 0);
but for leading delimiters, you'll have to strip them first:
return input.replaceFirst("^"+regex, "").split(regex, 0);
If by "better" you mean higher performance then you might want to try creating a regular expression that matches what you want to match and using Matcher.find in a loop and pulling out the matches as you find them. This saves modifying the string first. But measure it for yourself to see which is faster for your data.
If by "better" you mean simpler, then no I don't think there is a simpler way than the way you suggested: removing the leading separators before applying the split.
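The Matcher.find approach mentioned in this answer could look roughly like the sketch below. Instead of describing the delimiters, the pattern describes a token (one or more characters that are not a comma, semicolon or whitespace), so leading and trailing delimiters never produce empty strings. The class TokenFinder and the method tokens() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFinder {
    // A token is a maximal run of characters that are not delimiters.
    private static final Pattern TOKEN = Pattern.compile("[^,;\\s]+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens(",,,,ZERO; , ;;ONE ,TWO;,"));
    }
}
```

Whether this is faster than stripping and splitting depends on the data, so it is worth measuring, as the answer suggests.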
Pretty much all splitting facilities built into the JDK are broken one way or another. You'd be better off using a third-party class such as Splitter, which is both flexible and correct in how it handles empty tokens and whitespaces:
Splitter.on(CharMatcher.anyOf(";,").or(CharMatcher.WHITESPACE))
.omitEmptyStrings()
.split(",,,ZERO;,ONE TWO");
will yield an Iterable<String> containing "ZERO", "ONE", "TWO"
You could also potentially use StringTokenizer to build the list, depending what you need to do with it:
StringTokenizer st = new StringTokenizer(",,,ZERO;,ONE TWO", ",; ", false);
while (st.hasMoreTokens()) {
    String str = st.nextToken();
    // add to list, process, etc...
}
As a caveat, however, you'll need to define each potential whitespace character separately in the second argument to the constructor.