splitting string and keep characters (regex pattern) - java

I would like to split a String, but I can't get the regex pattern right.
I need to split a string like this: Hi I want "to split" this (String) into a String array like this:
String[] array = {"Hi", "I", "want", "\"", "to", "split", "\"", "this", "(", "String", ")"};
This is what I have tried, but it deletes the delimiter.
public static void main(String[] args) {
    String string = "Hi \"why should\" (this work)";
    String[] array;
    array = string.split("\\s"
            + "|\\s(?=\")"
            + "|\\w(?=\")"
            + "|\"(?=\\w)"
            + "|\\s(?=\\()"
            + "|\\w(?=\\))"
            + "|\\((?=\\w)");
    for (String str : array) {
        System.out.println(str);
    }
}
Result:
Hi
why
shoul
"
this
wor
)

You can match the tokens with the regex \w+|[^\w\s], assuming that you want each punctuation character to end up in its own token:
String input = "Hi I want \"to split\" this (String).";
Matcher matcher = Pattern.compile("\\w+|[^\\w\\s]").matcher(input);
List<String> out = new ArrayList<>();
while (matcher.find()) {
out.add(matcher.group());
}
The output ArrayList contains:
[Hi, I, want, ", to, split, ", this, (, String, ), .]
You might want to use the (?U) flag to make \w and \s follow the Unicode definitions of word and whitespace characters. By default, \w and \s only recognize word and whitespace characters in the ASCII range.
For the sake of completeness, here is the solution in split(), which works on Java 8 and above. There will be an extra empty string at the beginning in Java 7.
String tokens[] = input.split("\\s+|(?<![\\w\\s])(?=\\w)|(?<=\\w)(?![\\w\\s])|(?<=[^\\w\\s])(?=[^\\w\\s])");
The regex is rather complex, because the zero-width splits between punctuation characters and word characters must avoid the positions already split by \s+.
Since the regex in the split solution is quite a mess, please use the match solution instead.
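The match-based approach can be wrapped in a small reusable helper. The following is only a sketch: the class name Tokenizer and the method tokenize are made up for illustration, and it assumes you want the Unicode behaviour mentioned above, enabled here through Pattern.UNICODE_CHARACTER_CLASS (the flag form of (?U)).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class Tokenizer {
    // Either a run of word characters, or one character that is
    // neither a word character nor whitespace (i.e. punctuation).
    private static final Pattern TOKEN =
            Pattern.compile("\\w+|[^\\w\\s]", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            out.add(matcher.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hi I want \"to split\" this (String)."));
    }
}
```

Without the Unicode flag, an accented letter such as é would not count as a word character and would be emitted as a separate token.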

What language are you trying to write this in?
You could write regex groups something like: (.+)(\s)
This would match any quantity of characters followed by a space

Related

Replace all characters between two delimiters using regex

I'm trying to replace all characters between two delimiters with another character using regex. The replacement should have the same length as the removed string.
String string1 = "any prefix [tag=foo]bar[/tag] any suffix";
String string2 = "any prefix [tag=foo]longerbar[/tag] any suffix";
String output1 = string1.replaceAll(???, "*");
String output2 = string2.replaceAll(???, "*");
The expected outputs would be:
output1: "any prefix [tag=foo]***[/tag] any suffix"
output2: "any prefix [tag=foo]*********[/tag] any suffix"
I've tried "\\[tag=.*?](.*?)\\[/tag]" but this replaces the whole sequence with a single "*".
I think that "(.*?)" is the problem here because it captures everything at once.
How would I write something that replaces every character separately?
You can use the regex
\w(?=\w*?\[)
which matches each word character in a run of word characters that is directly followed by a "[".
You can capture the chars inside, one by one and replace them by * :
public static String replaceByStar(String str) {
    String pattern = "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)";
    while (str.matches(pattern)) {
        str = str.replaceAll(pattern, "$1*$2");
    }
    return str;
}
Use it like this, and it will print your two expected outputs:
public static void main(String[] args) {
System.out.println(replaceByStar("any prefix [tag=foo]bar[/tag] any suffix"));
System.out.println(replaceByStar("any prefix [tag=foo]loooongerbar[/tag] any suffix"));
}
So the pattern "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)" :
(.*\\[tag=.*\\].*) captures the beginning, possibly with some characters after the opening tag
\\w is the character you want to replace
(.*\\[\\/tag\\].*) captures the end, possibly with some characters before the closing tag
The substitution $1*$2:
The pattern is (text$1)oneChar(text$2), and it is replaced by (text$1)*(text$2)
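The loop above rescans the whole string once per replaced character. A single-pass alternative is to find each tagged region with a Matcher and build a replacement of the same length. This is only a sketch under some assumptions: the class name TagMasker and method mask are invented for illustration, the tag value is assumed not to contain "]", tags are assumed not to be nested, and, unlike the \w-based answers, it masks every character between the tags, not only word characters.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class TagMasker {
    // group 1: opening tag, group 2: content to mask, group 3: closing tag
    private static final Pattern TAG =
            Pattern.compile("(\\[tag=[^\\]]*\\])(.*?)(\\[/tag\\])");

    static String mask(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build a run of '*' as long as the content being replaced.
            char[] stars = new char[m.group(2).length()];
            Arrays.fill(stars, '*');
            String replacement = m.group(1) + new String(stars) + m.group(3);
            // quoteReplacement keeps any '$' or '\' in the tags literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("any prefix [tag=foo]bar[/tag] any suffix"));
        System.out.println(mask("any prefix [tag=foo]longerbar[/tag] any suffix"));
    }
}
```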

Splitting text by punctuation and special cases like :) or space

I have the following string:
Hello world!!!
or
Hello world:)
Now I want to split this string into an array of strings that contains Hello,world,!,!,! or Hello,world,:)
The problem is that if there were spaces between all the parts, I could use split(" "),
but here !!! or :) is attached to the word.
I also used this code :
String Text = "But I know. For example, the word \"can\'t\" should";
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s:Res){
System.out.println(s);
}
which I found here, but it is not really helpful in my case:
Splitting strings through regular expressions by punctuation and whitespace etc in java
Can anyone help?
Seems to me like you do not want to split but rather capture certain groups. The thing with split string is that it gets rid of the parts that you split by (so if you split by spaces, you don't have spaces in your output array), therefore if you split by "!" you won't get them in your output. Possibly this would work for capturing the things that you are interested in:
(\w+)|(!)|(:\))/g
Mind you don't use string split with it, but rather exec your regex against your string in whatever engine/language you are using. In Java it would be something like:
String input = "Hello world!!!:)";
Pattern p = Pattern.compile("(\\w+)|(!)|(:\\))");
Matcher m = p.matcher(input);
List<String> matches = new ArrayList<String>();
while (m.find()) {
matches.add(m.group());
}
Your matches list will contain:
["Hello", "world", "!", "!", "!", ":)"]
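The same idea can be generalized so that any single punctuation character becomes its own token, while the emoticon stays in one piece. This is a sketch, not the answerer's code: the class name EmoticonTokenizer is invented, and the only emoticon handled is :) as in the question. The order of the alternatives matters, because the emoticon must be tried before the generic \p{Punct} alternative, otherwise ":" and ")" would be emitted separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class EmoticonTokenizer {
    // Alternatives are tried left to right at each position:
    // a word, then the emoticon ":)", then any single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|:\\)|\\p{Punct}");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world!!!"));
        System.out.println(tokenize("Hello world:)"));
    }
}
```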

Splitting a sentence when a hyphen is encountered in Java

I have the following code in my program. It splits a line when a hyphen is encountered and stores each word in the String array 'tokens'. But I also want the hyphen to be stored in 'tokens' when it is encountered in a sentence.
String[] tokens = line.split("-");
The above code splits the sentence but also totally ignores the hyphen in the resulting array.
What can I do to store hyphen also in the resulting array?
Edit : -
It seems you want to split on both whitespace and hyphens but keep only the hyphen in the array (as I infer from your phrase "stores each word in the String Array"). In that case you can use this: -
String[] tokens = "abc this is-a hyphen def".split("((?<=-)|(?=-))|\\s+");
System.out.println(Arrays.toString(tokens));
Output: -
[abc, this, is, -, a, hyphen, def]
For handling spaces before and after hyphen, you can first trim those spaces using replaceAll method, and then do split: -
"abc this is - a hyphen def".replaceAll("[ ]*-[ ]*", "-")
.split("((?<=-)|(?=-))|\\s+");
Previous answer : -
You can use this: -
String[] tokens = "abc-efg".split("((?<=-)|(?=-))");
System.out.println(Arrays.toString(tokens));
OUTPUT : -
[abc, -, efg]
It splits at the zero-width positions immediately before and after the hyphen (-), so the hyphen itself is never consumed.
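The lookbehind/lookahead pair can be built for any single literal delimiter. The sketch below is illustrative only: the class name KeepDelimiterSplit and the method splitKeeping are hypothetical, and it assumes the delimiter is a literal string (it is escaped with Pattern.quote) and that the code runs on Java 8 or later.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class KeepDelimiterSplit {
    static String[] splitKeeping(String input, String literalDelimiter) {
        // Escape the delimiter so regex metacharacters are taken literally.
        String d = Pattern.quote(literalDelimiter);
        // Split at the position just after the delimiter (lookbehind)
        // or just before it (lookahead); both matches are zero-width.
        return input.split("(?<=" + d + ")|(?=" + d + ")");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("abc-efg", "-")));
        System.out.println(Arrays.toString(splitKeeping("a-b-c", "-")));
    }
}
```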
I suggest using a regular expression in combination with the Java Pattern and Matcher classes. Example:
String line = "a-b-c-d-e-f-";
Pattern p = Pattern.compile("[^-]+|-");
Matcher m = p.matcher(line);
while (m.find())
{
String match = m.group();
System.out.println("match:" + match);
}
To test your regular expression you could use an online regexp tester.

Problems with building this regex [1,2,3]

I have a problem building a regex for the following string:
[1,2,3,4]
I found a workaround, but I think it's ugly:
String stringIds = "[1,2,3,4]";
stringIds = stringIds.replaceAll("\\[", "");
stringIds = stringIds.replaceAll("\\]", "");
String[] ids = stringIds.split("\\,");
Can someone please help me build a single regex that I can use in the split function?
Thanks for the help.
Edit:
I want to get from the string "[1,2,3,4]" to an array with 4 entries. The entries are the 4 numbers in the string, so I need to eliminate "[", "]" and ",". The "," isn't the problem.
The first and last numbers contain [ or ], so I needed the fix with replaceAll. But I think that if I can pass split a regex for ",", I can also pass a regex that eliminates "[" and "]" too. I just can't figure out what this regex should look like.
This is almost what you're looking for:
String q = "[1,2,3,4]";
String[] x = q.split("\\[|\\]|,");
The problem is that it produces an extra element at the beginning of the array due to the leading open bracket. You may not be able to do what you want with a single regex sans shenanigans. If you know the string always begins with an open bracket, you can remove it first.
The regex itself means "(split on) any open bracket, OR any closed bracket, OR any comma."
Punctuation characters frequently have additional meanings in regular expressions. The double leading backslashes... ugh, the first backslash tells the Java String parser that the next backslash is not a special character (example: \n is a newline...) so \\ means "I want an honest to God backslash". The next backslash tells the regexp engine that the next character ([ for example) is not a special regexp character. That makes me lol.
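Both points, the double backslash and the extra leading element, can be checked with a few lines. This is a small sketch for illustration; the class name BracketSplitDemo and the method splitIds are made up.

```java
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
class BracketSplitDemo {
    static String[] splitIds(String s) {
        // In Java source, "\\[" is the two-character string \[ ,
        // which the regex engine reads as a literal open bracket.
        return s.split("\\[|\\]|,");
    }

    public static void main(String[] args) {
        System.out.println("\\[".length());               // prints 2
        String[] parts = splitIds("[1,2,3,4]");
        // The leading "[" is a delimiter at index 0, so the first element is "".
        // The trailing empty string after "]" is dropped by split.
        System.out.println(parts.length);                  // prints 5
        System.out.println(Arrays.toString(parts));        // prints [, 1, 2, 3, 4]
    }
}
```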
Maybe substring [ and ] from beginning and end, then split the rest by ,
String stringIds = "[1,2,3,4]";
String[] ids = stringIds.substring(1,stringIds.length()-1).split(",");
Looks to me like you're trying to make an array (not sure where you got 'regex' from; that means something different). In this case, you want:
String[] ids = {"1","2","3","4"};
If it's specifically an array of integer numbers you want, then instead use:
int[] ids = {1,2,3,4};
Your problem is not amenable to splitting by delimiter. It is much safer and more general to split by matching the integers themselves:
static String[] nums(String in) {
    final Matcher m = Pattern.compile("\\d+").matcher(in);
    final List<String> l = new ArrayList<String>();
    while (m.find()) l.add(m.group());
    return l.toArray(new String[l.size()]);
}
public static void main(String args[]) {
    System.out.println(Arrays.toString(nums("[1, 2, 3, 4]")));
}
If the first line of your code is the following:
String stringIds = "[1,2,3,4]";
and you're trying to iterate over all the number items, then the following code fragment should work:
try {
    Pattern regex = Pattern.compile("\\b(\\d+)\\b", Pattern.MULTILINE);
    Matcher regexMatcher = regex.matcher(stringIds);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

Java Split not working as expected

I am trying to use a simple split to break up the following string: 00-00000
My expression is: ^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])
And my usage is:
String s = "00-00000";
String pattern = "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";
String[] parts = s.split(pattern);
If I play around with the Pattern and Matcher classes I can see that my pattern does match and the matcher tells me my groupCount is 7 which is correct. But when I try and split them I have no luck.
String.split does not use capturing groups as its result. It finds whatever matches and uses that as the delimiter. So the resulting String[] are substrings in between what the regex matches. As it is the regex matches the whole string, and with the whole string as a delimiter there is nothing else left so it returns an empty array.
If you want to use regex capturing groups you will have to use Matcher.group(), String.split() will not do.
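To see the "pattern is the delimiter" behaviour concretely, here is a small sketch using the pattern from the question. The class name SplitAsDelimiterDemo and the method splitWithQuestionPattern are hypothetical, and the second input (with a trailing X) is an invented example to show what is left over when the pattern does not cover the whole string.

```java
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
class SplitAsDelimiterDemo {
    static final String PATTERN =
            "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";

    static String[] splitWithQuestionPattern(String s) {
        // Whatever the pattern matches is removed as a delimiter;
        // the capturing groups play no role in the result.
        return s.split(PATTERN);
    }

    public static void main(String[] args) {
        // The whole string is one delimiter, so nothing remains.
        System.out.println(splitWithQuestionPattern("00-00000").length);
        // Only the text outside the match survives (plus a leading empty string).
        System.out.println(Arrays.toString(splitWithQuestionPattern("00-00000X")));
    }
}
```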
for your example, you could simply do this:
String s = "00-00000";
String pattern = "-";
String[] parts = s.split(pattern);
I can't be sure, but I think what you are trying to do is to get each matched group into an array.
Matcher matcher = Pattern.compile(pattern).matcher(s);
if (matcher.matches()) {
    String[] groups = new String[matcher.groupCount()];
    for (int i = 0; i < matcher.groupCount(); i++) {
        // group(0) is the whole match, so the capturing groups start at 1
        groups[i] = matcher.group(i + 1);
    }
}
From the documentation:
String[] split(String regex) -- Returns: the array of strings computed by splitting this string around matches of the given regular expression
Essentially the regular expression is used to define delimiters in the input string. You can use capturing groups and backreferences in your pattern (e.g. for lookarounds), but ultimately what matters is what and where the pattern matches, because that defines what goes into the returned array.
If you want to split your original string into 7 parts using regular expression, then you can do something like this:
String s = "12-3456";
String[] parts = s.split("(?!^)");
System.out.println(parts.length); // prints "7"
for (String part : parts) {
    System.out.print("[" + part + "] ");
} // prints "[1] [2] [-] [3] [4] [5] [6] "
This splits on the zero-length assertion (?!^), which matches anywhere except before the first character in the string. This prevents an empty string from being the first element in the array, and the trailing empty string is already discarded because we use the default limit parameter of split.
Using a regular expression to get the individual characters of a string like this is overkill, though. If you have only a few characters, then the most concise option is to use foreach on toCharArray():
for (char ch : "12-3456".toCharArray()) {
System.out.print("[" + ch + "] ");
}
This is not the most efficient option if you have a longer string.
Splitting on -
This may also be what you're looking for:
String s = "12-3456";
String[] parts = s.split("-");
System.out.println(parts.length); // prints "2"
for (String part : parts) {
System.out.print("[" + part + "] ");
} // prints "[12] [3456] "
