java split string with regex - java

I want to split a string, treating every non-alphabetic character as a separator.
String[] word_list = line.split("[^a-zA-Z]");
But with the following input
11:11 Hello World
word_list contains many empty strings before "Hello" and "World".
Please tell me why. Thank you.

Because your regular expression matches each individual non-alpha character. It would be like separating
",,,,,,Hello,World"
on commas.
You will want an expression that matches an entire sequence of non-alpha characters at once such as:
line.split("[^a-zA-Z][^a-zA-Z]*")
I still think you will get one leading empty string with your example since it would be like separating ",Hello,World" if comma were your separator.
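A minimal sketch of this suggestion, using the shorter but equivalent pattern "[^a-zA-Z]+". The class and method names are made up for illustration. It shows that a single empty string still appears at the front, because the input starts with a separator run.

```java
import java.util.Arrays;

public class SplitOnRuns {
    // Hypothetical helper: splits on whole runs of non-letters.
    // "[^a-zA-Z]+" matches the same text as "[^a-zA-Z][^a-zA-Z]*".
    static String[] splitOnNonLetters(String line) {
        return line.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String[] words = splitOnNonLetters("11:11 Hello World");
        // The leading run "11:11 " is one separator, so one empty string precedes "Hello".
        System.out.println(Arrays.toString(words)); // prints [, Hello, World]
    }
}
```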

Here's your string, where each ^ character shows a match for [^a-zA-Z]:
11:11 Hello World
^^^^^^     ^
The split method finds each of these matches and returns the substrings around them, including the one before the first match. Since there are six matches before any useful data, you end up with six empty substrings before you get the string "Hello".
To prevent this, you can manually filter the result to ignore any empty strings.
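One way to do that filtering is sketched below, assuming Java 8 or later for streams. The class and method names are illustrative only; a plain loop that skips empty elements would work just as well.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterEmptyTokens {
    // Hypothetical helper: keeps the original single-character pattern
    // and drops the empty strings produced by adjacent separators.
    static List<String> words(String line) {
        return Arrays.stream(line.split("[^a-zA-Z]"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("11:11 Hello World")); // prints [Hello, World]
    }
}
```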

Will the following do?
String[] word_list = line.replaceAll("[^a-zA-Z ]","").replaceAll(" +", " ").trim().split("[^a-zA-Z]");
What I am doing here is removing every character that is neither a letter nor a space, collapsing multiple spaces into a single space, trimming the ends, and only then doing the split.

Related

How to check and replace a sequence of characters in a String?

Here is what the program expects as output:
If originalString = "CATCATICATAMCATCATGREATCATCAT",
the output should be "I AM GREAT".
The code must find the sequence of characters (CAT in this case), and remove them. Plus, the resulting String must have spaces in between words.
String origString = remixString.replace("CAT", "");
I figured out that I have to use String.replace, but what logic would find the parts that are not CAT and produce the resulting string with spaces between the words?
First off, replace already replaces every occurrence of "CAT" within the String, but replacing with an empty String leaves nothing between the words. To introduce spaces, replace "CAT" with " " (a space) instead.
As pointed out in a comment, "CAT" can occur several times in a row, which would leave multiple spaces between words, so we use replaceAll with a regular expression to replace each run of "CAT" with a single space. The '+' symbol means "one or more".
Finally, trim the String to get rid of leading and trailing white space.
remixString.replaceAll("(CAT)+", " ").trim()
You can use replaceAll which accepts a regular expression:
String remixString = "CATCATICATAMCATCATGREATCATCAT";
String origString = remixString.replaceAll("(CAT)+", " ").trim();
Note: the naming of replace and replaceAll is very confusing. They both replace all instances of the matching string; the difference is that replace takes a literal text as an argument, while replaceAll takes a regular expression.
Maybe this will help
String result = remixString.replaceAll("(CAT){1,}", " ").trim();
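The difference between the literal and the regex approach can be seen side by side in this small sketch. The class and method names are invented for the example; only String.replace, String.replaceAll, and String.trim are standard library calls.

```java
public class RemoveCat {
    // Hypothetical helper: replaces each run of "CAT" with one space, then trims the ends.
    static String decode(String remixed) {
        return remixed.replaceAll("(CAT)+", " ").trim();
    }

    public static void main(String[] args) {
        String remixString = "CATCATICATAMCATCATGREATCATCAT";
        // replace() is literal and already global, but leaves no gaps between words.
        System.out.println(remixString.replace("CAT", "")); // prints IAMGREAT
        System.out.println(decode(remixString));            // prints I AM GREAT
    }
}
```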

String.split() not working as intended

I'm trying to split a string, however, I'm not getting the expected output.
String one = "hello 0xA0xAgoodbye";
String two[] = one.split(" |0xA");
System.out.println(Arrays.toString(two));
Expected output: [hello, goodbye]
What I got: [hello, , , goodbye]
Why is this happening and how can I fix it?
Thanks in advance! ^-^
If you'd like to treat consecutive delimiters as one, you could modify your regex as follows:
"( |0xA)+"
This means "a space or the string 0xA, repeated one or more times".
(\\s|0xA)+ matches one or more consecutive whitespace characters or occurrences of 0xA, so the whole run is treated as a single split point.
The result is caused by multiple consecutive matches in the string. You can wrap the pattern in a grouping construct and apply a + quantifier to it so that a run of consecutive delimiters is matched as one:
String one = "hello 0xA0xAgoodbye";
String two[] = one.split("(?:\\s|0xA)+");
System.out.println(Arrays.toString(two));
The (?:\s|0xA)+ regex matches one or more whitespace characters or literal 0xA character sequences.
However, you will still get an empty string as the first item in the resulting array if 0xA or whitespace appears at the start of the string. In that case, remove the leading delimiters first:
String two[] = one.replaceFirst("^(?:\\s|0xA)+", "").split("(?:\\s+|0xA)+");
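Putting the two steps together, a self-contained sketch might look like this. The class name, the DELIMS constant, and the tokens helper are assumptions made for the example; the second input is an invented variation with leading delimiters.

```java
import java.util.Arrays;

public class SplitConsecutiveDelimiters {
    // One or more whitespace characters or literal "0xA" sequences, as a single run.
    private static final String DELIMS = "(?:\\s|0xA)+";

    // Hypothetical helper: strips a leading delimiter run, then splits on delimiter runs.
    static String[] tokens(String input) {
        return input.replaceFirst("^" + DELIMS, "").split(DELIMS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("hello 0xA0xAgoodbye")));     // prints [hello, goodbye]
        System.out.println(Arrays.toString(tokens(" 0xAhello 0xA0xAgoodbye"))); // prints [hello, goodbye]
    }
}
```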

Check string contains whitespace along with some other char sequence using regex in java

I am using a regular expression to check whether a string contains whitespace.
My regex is: ^\\s+$
For example, if my string is "my name", the match should return true.
But it returns true only if the string consists entirely of whitespace, with no other characters.
How can I check whether a string contains a space, tab, or carriage return anywhere in it (start, middle, or end)?
^(.*\s+.*)+$ seems to work for me. It accepts anything as long as there is at least one whitespace character in the string, and it matches the entire string.
If you only want to check for the presence of whitespace, you can just use \s without any begin or end anchors. The difference is that this matches only the individual whitespace characters, not the whole string.
Your regex is not correct.
Strictly speaking, "^\\s+$" is a string representing a regular expression (as tchrist correctly pointed out).
The corresponding pattern that you get from Pattern.compile() matches only strings made up of one or more whitespace characters from beginning to end. Thus, a matching string consists of nothing but whitespace characters.
Try this string instead for Pattern.compile():
"\\s+"
The difference is that without the anchors "^" and "$" there may be other characters around the whitespace. The whitespace character(s) may be anywhere in the string.
Using this pattern-string the whitespace character(s) must be at the beginning:
"^\\s+"
And here the sequence of whitespace characters has to be at the end:
"\\s+$"
Use org.apache.commons.lang3.StringUtils.containsAny(). See http://commons.apache.org/lang/api-3.1/org/apache/commons/lang3/StringUtils.html.
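For the plain-JDK route, the anchored and unanchored patterns discussed in this thread can be compared with a short sketch. The class and method names are illustrative; the key point is that Matcher.find() looks for a match anywhere, while String.matches() requires the whole string to match.

```java
import java.util.regex.Pattern;

public class WhitespaceCheck {
    // Unanchored: a single whitespace character anywhere is enough.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Hypothetical helper: true if the string contains at least one whitespace character.
    static boolean containsWhitespace(String s) {
        return WHITESPACE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWhitespace("my name"));    // prints true
        System.out.println(containsWhitespace("myname"));     // prints false
        System.out.println(containsWhitespace("tab\there"));  // prints true
        // The anchored pattern from the question only accepts all-whitespace strings.
        System.out.println("my name".matches("^\\s+$"));      // prints false
        System.out.println("   ".matches("^\\s+$"));          // prints true
    }
}
```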

How should I split my string using regular expression?

I have string which should be split on "." (point) and " " (space). I have tried:
s.split("[\\s\\.]")
but it doesn't work: it does not split this string the way I expect - "123 456 . 11323 1".
How should I change my regular expression?
I think, what you want is this:
s.split("[\\s\\.]+");
Note the +. You don't seem to want to split on every single (!) occurrence of whitespace or a dot. You want to match a run of any length made up of whitespace and dots. That's why you have to match greedily, taking as many of those characters as possible.
Simply use "[\\s.]+" as the regex.
You will get a lot of empty strings if you only split on a single character.
s.split("[\\s\\.]+")
will produce "123", "456", "11323", "1".
The + causes it to treat any run of spaces and dots as a single break instead of returning a string between adjacent spaces and dots.
You might still get blank strings at either end of your results since given " 123" it will split between the start of the string and "123".
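The behaviour described in these answers can be checked with a short sketch. The class and method names are invented for the example, and the " 123" input is the one mentioned above.

```java
import java.util.Arrays;

public class SplitOnDotsAndSpaces {
    // Hypothetical helper: splits on runs of whitespace and dots.
    static String[] fields(String s) {
        return s.split("[\\s.]+");
    }

    public static void main(String[] args) {
        String s = "123 456 . 11323 1";
        // Single-character separators leave empty strings around the lone dot.
        System.out.println(Arrays.toString(s.split("[\\s.]"))); // prints [123, 456, , , 11323, 1]
        System.out.println(Arrays.toString(fields(s)));         // prints [123, 456, 11323, 1]
        // A leading separator still produces one empty string at the front.
        System.out.println(Arrays.toString(fields(" 123")));    // prints [, 123]
    }
}
```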

Regex for matching alternating sequences

I'm working in Java and having trouble matching a repeated sequence. I'd like to match something like:
a.b.c.d.e.f.g.
and be able to extract the text between the delimiters (e.g. return abcdefg) where the delimiter can be multiple non-word characters and the text can be multiple word characters. Here is my regex so far:
([\\w]+([\\W]+)(?:[\\w]+\2)*)
(Doesn't work)
I had intended to get the delimiter in group 2 with this regex and then use a replaceAll on group 1 to exchange the delimiter for the empty string giving me the text only. I get the delimiter, but cannot get all the text.
Thanks for any help!
Replace (\w+)\W+ by $1
Replace (\w+)(\W+|$) with $1. Make sure that the global flag is turned on (in Java, replaceAll already replaces every match).
It replaces each sequence of word characters followed by a sequence of non-word characters (or the end of the line) with just the word characters.
String line = "Am.$#%^ar.$#%^gho.$#%^sh";
line = line.replaceAll("(\\w+)(\\W+|$)", "$1");
System.out.println(line); // prints Amarghosh
Why not use String.split?
Why not:
find all occurrences of (\w+) and then concatenate them; or
find all non-word characters (\W+) and then use Matcher.replaceAll with an empty string?
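Both of these suggestions can be sketched in a few lines. The class and method names are made up for illustration; the second variant uses String.replaceAll, which delegates to the same regex engine as Matcher.replaceAll.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JoinWordRuns {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Option 1 (hypothetical helper): find every run of word characters and concatenate them.
    static String joinWords(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    // Option 2 (hypothetical helper): delete every run of non-word characters.
    static String stripNonWords(String input) {
        return input.replaceAll("\\W+", "");
    }

    public static void main(String[] args) {
        System.out.println(joinWords("a.b.c.d.e.f.g."));     // prints abcdefg
        System.out.println(stripNonWords("a.b.c.d.e.f.g.")); // prints abcdefg
    }
}
```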
