Splitting text by punctuation and special cases like :) or space - java

I have the following string:
Hello world!!!
or
Hello world:)
Now I want to split this string into an array of strings that contains Hello,world,!,!,! or Hello,world,:)
The problem is that if there were spaces between all the parts I could use split(" "),
but here !!! or :) is attached to the word.
I also tried this code:
String Text = "But I know. For example, the word \"can\'t\" should";
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s:Res){
System.out.println(s);
}
which I found here, but it is not really helpful in my case:
Splitting strings through regular expressions by punctuation and whitespace etc in java
Can anyone help?

Seems to me like you do not want to split but rather capture certain groups. The thing with split string is that it gets rid of the parts that you split by (so if you split by spaces, you don't have spaces in your output array), therefore if you split by "!" you won't get them in your output. Possibly this would work for capturing the things that you are interested in:
/(\w+)|(!)|(:\))/g
Mind you don't use string split with it, but rather exec your regex against your string in whatever engine/language you are using. In Java it would be something like:
String input = "Hello world!!!:)";
Pattern p = Pattern.compile("(\\w+)|(!)|(:\\))"); // backslashes must be doubled inside a Java string literal
Matcher m = p.matcher(input);
List<String> matches = new ArrayList<String>();
while (m.find()) {
matches.add(m.group());
}
Your matches list will contain:
["Hello", "world", "!", "!", "!", ":)"]
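The same idea extends to arbitrary punctuation. The following is only a minimal sketch and not part of the original answer: the class name PunctuationTokenizer and the short emoticon list are assumptions chosen for illustration. The emoticons are listed before \p{Punct} so that they are tried first and win over single punctuation characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationTokenizer {
    // Alternatives are tried left to right: emoticons, then words,
    // then any single ASCII punctuation character.
    private static final Pattern TOKEN =
            Pattern.compile(":\\)|:\\(|;\\)|\\w+|\\p{Punct}");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world!!!")); // [Hello, world, !, !, !]
        System.out.println(tokenize("Hello world:)"));  // [Hello, world, :)]
    }
}
```

Whitespace is never matched by any alternative, so find() simply skips over it.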

Related

splitting string and keep characters (regex pattern)

I would like to split a String, but I am despairing over the regex pattern.
I need to split a string like this: Hi I want "to split" this (String) to a String array like this:
String[] array = {"Hi", "I", "want", "\"", "to", "split", "\"", "this", "(", "String", ")"};
This is what I have tried, but it deletes the delimiter.
public static void main(String[] args) {
String string = "Hi \"why should\" (this work)";
String[] array;
array = string.split("\\s"
+ "|\\s(?=\")"
+ "|\\w(?=\")"
+ "|\"(?=\\w)"
+ "|\\s(?=\\()"
+ "|\\w(?=\\))"
+ "|\\((?=\\w)");
for (String str : array) {
System.out.println(str);
}
}
Result:
Hi
why
shoul
"
this
wor
)
You can match the tokens with the regex \w+|[^\w\s], assuming that you want the punctuation characters to end up in different tokens:
String input = "Hi I want \"to split\" this (String).";
Matcher matcher = Pattern.compile("\\w+|[^\\w\\s]").matcher(input);
List<String> out = new ArrayList<>();
while (matcher.find()) {
out.add(matcher.group());
}
The output ArrayList contains:
[Hi, I, want, ", to, split, ", this, (, String, ), .]
You might want to use the (?U) flag to make \w and \s follow the Unicode definition of word and whitespace characters. By default, \w and \s only recognize word and whitespace characters in the ASCII range.
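To illustrate the effect of (?U), here is a small sketch. The class name UnicodeTokenDemo and the sample input are assumptions made for this example; the two regexes differ only in the leading flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokenDemo {
    // Collects every match of the given regex in the input.
    static List<String> tokenize(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "naïve café!" written with Unicode escapes to stay independent of file encoding
        String input = "na\u00EFve caf\u00E9!";
        // ASCII-only \w: the accented letters are treated as punctuation tokens,
        // giving the six tokens na, ï, ve, caf, é, !
        System.out.println(tokenize("\\w+|[^\\w\\s]", input));
        // With (?U) the accented letters count as word characters,
        // giving the three tokens naïve, café, !
        System.out.println(tokenize("(?U)\\w+|[^\\w\\s]", input));
    }
}
```

The same behaviour can be requested with the Pattern.UNICODE_CHARACTER_CLASS flag passed to Pattern.compile instead of the inline (?U).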
For the sake of completeness, here is the solution in split(), which works on Java 8 and above. There will be an extra empty string at the beginning in Java 7.
String tokens[] = input.split("\\s+|(?<![\\w\\s])(?=\\w)|(?<=\\w)(?![\\w\\s])|(?<=[^\\w\\s])(?=[^\\w\\s])");
The regex is rather complex, since the empty string splits between punctuation character and word character need to avoid the cases already split by \s+.
Since the regex in the split solution is quite a mess, please use the match solution instead.
What language are you trying to write this in?
You could write regex groups something like: (.+)(\s)
This would match one or more characters followed by a whitespace character.

String split regex [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
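For Java, the regex above could be applied roughly as follows. This is a sketch rather than part of the original answer: the class name EscapedQuoteTokenizer and the sample input are assumptions. Note that the backslash escapes are left in the captured text as they are.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokenizer {
    // Group 1: the opening quote, group 2: the quoted content,
    // group 3: an unquoted word. Every regex backslash is doubled
    // again for the Java string literal.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:(['\"])(.*?)(?<!\\\\)(?>\\\\\\\\)*\\1|([^\\s]+))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group 2 is non-null only when the quoted alternative matched
            tokens.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input is: say "he said \"hi\"" now
        System.out.println(tokenize("say \"he said \\\"hi\\\"\" now"));
    }
}
```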
The regex from Jan Goyvaerts is the best solution I have found so far, but it also creates empty (null) matches, which he excludes in his program. These empty matches also appear in regex testers (e.g. rubular.com).
If you turn the searches around (first look for the quoted parts and then the space-separated words), you might do it in one pass with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
Reason being, you can have it split at the spaces before and after "will be". But, I can't think of any way to specify ignoring the space between inside a split.
String string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
// search for a double-quoted group (backslash escapes allowed inside) or a run of non-space characters
Pattern regex = Pattern.compile("\"(?:\\\\.|[^\"\\\\])+\"|[^ ]+");
List<String> tokens = new ArrayList<>();
string = string.trim();
while (!string.isEmpty()) {
    Matcher m = regex.matcher(string);
    if (!m.lookingAt()) {
        break; // guard against an endless loop
    }
    tokens.add(m.group());
    string = string.substring(m.end()).trim(); // progress to next "word"
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
"(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0))); // quote so that $ and \ in the match are taken literally
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
If you are using C#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?<match>[\w\s]*)>" to highlight that you can specify any char to group phrases. (In this case I am using < > to group.)
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
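Such a "really simple function" might look like the sketch below. It is one possible reading of the suggestion, not the answerer's own code: the class name QuoteAwareTokenizer is an assumption, quotes are dropped from the output, and an apostrophe inside a word (as in Foo's) would be treated as an opening quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;          // 0 means "currently not inside quotes"
        boolean inToken = false; // true once a token has started (even an empty quoted one)
        for (char c : input.toCharArray()) {
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;             // closing quote: leave quoted mode
                } else {
                    current.append(c);     // keep everything inside quotes, spaces included
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote: remember which kind
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {             // whitespace outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```

Because the loop keeps explicit state, adding escape handling or further quote characters is a local change rather than a regex rewrite.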
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>
().Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();
When you come across a pattern like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches a space only if it is followed by an even number of double quotes.

Splitting String using RegEx in Android

I've been trying to split Strings using RegEx with no success. The idea is to split a given music file metadata from its file name in a way so that:
"01. Kodaline - Autopilot.mp3"
.. would result in..
metadata[0] = "01"
metadata[1] = "Kodaline"
metadata[2] = "Autopilot"
This is the RegEx I've been trying to use in its original form:
^(.*)\.(.*)\-(.*)\.(mp3|flac)
From what I've read, I need to format the RegEx for String.split(String regex) to work. So here's my formatted RegEx:
^(.*)\\.(.*)\\-(.*)\\.(mp3|flac)
..and this is what my code looks like:
String filename = "01. Kodaline - Autopilot.mp3";
String regex = "^(.*)\\.(.*)\\-(.*)\\.(mp3|flac)";
String[] metadata = filename.split(regex);
But I'm not receiving the result I expected. Can you help me on this?
Your regex is fine for matching the input string. Your problem is that you used split(), which expects a regex with a totally different purpose. For split(), the regex you give it matches the delimiters (separators) that separate parts of the input; they don't match the entire input. Thus, in a different situation (not your situation), you could say
String[] parts = s.split("[\\- ]");
The regex matches one character that is either a dash or a space. So this will look for dashes and spaces in your string and return the parts separated by the dashes and spaces.
To use your regex to match the input string, you need something like this:
String filename = "01. Kodaline - Autopilot.mp3";
String regex = "^(.*)\\.(.*)\\-(.*)\\.(mp3|flac)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(filename);
String[] metadata = new String[4];
if (matcher.find()) {
metadata[0] = matcher.group(1); // in real life I'd use a loop
metadata[1] = matcher.group(2);
metadata[2] = matcher.group(3);
metadata[3] = matcher.group(4);
// the rest of your code
}
which sets metadata to the strings "01", " Kodaline ", " Autopilot", "mp3", which is close to what you want except maybe for extra spaces (which you can look for in your regex). Unfortunately, I don't think there's a built-in Matcher function that returns all the groups in one array.
(By the way, in your regex, you don't need the backslashes in front of -, but they're harmless, so I left them in. The - doesn't normally have a special meaning, so it doesn't need to be escaped. Inside square brackets, however, a hyphen is special, so you should use backslashes if you want to match a set of characters and a hyphen is one of those characters. That's why I used backslashes in my split example above.)
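The two loose ends mentioned above (collecting the groups with a loop, and handling the extra spaces in the regex) can be sketched like this. The class name FilenameMetadata and the adjusted pattern are assumptions for illustration; the pattern requires whitespace around the separating hyphen so that hyphens inside a name are left alone.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FilenameMetadata {
    // track number, artist, title, extension; the surrounding whitespace
    // is matched outside the groups so the captured values come out trimmed
    private static final Pattern NAME =
            Pattern.compile("^(.*?)\\.\\s*(.*?)\\s+-\\s+(.*)\\.(mp3|flac)$");

    // Returns all captured groups in one array, or an empty array if the name does not match.
    static String[] parse(String filename) {
        Matcher m = NAME.matcher(filename);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i); // group 0 is the whole match, so start at 1
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [01, Kodaline, Autopilot, mp3]
        System.out.println(Arrays.toString(parse("01. Kodaline - Autopilot.mp3")));
    }
}
```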
this worked for me
str.split("\\.\\s+|\\s+-\\s+|\\.(mp3|flac)");
Try something like:
String filename = "01. Kodaline - Autopilot.mp3";
String fileWithoutExtension = filename.substring(0, filename.lastIndexOf('.'));
System.out.println(Arrays.toString(fileWithoutExtension.replaceAll("[^\\w\\s]", "").split("\\s+")));
Output:
[01, Kodaline, Autopilot]

How can I split a string except when the delimiter is protected by quotes or brackets?

I asked How to split a string with conditions. Now I know how to ignore the delimiter if it is between two characters.
How can I check multiple groups of two characters instead of one?
I found Regex for splitting a string using space when not surrounded by single or double quotes, but I don't understand where to change '' to []. Also, it works with two groups only.
Is there a regex that will split using , but ignore the delimiter if it is between "" or [] or {}?
For instance:
// Input
"text1":"text2","text3":"text,4","text,5":["text6","text,7"],"text8":"text9","text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}
// Output
"text1":"text2"
"text3":"text,4"
"text,5":["text6","text,7"]
"text8":"text9"
"text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}
You can use:
text = "\"text1\":\"text2\",\"text3\":\"text,4\",\"text,5\":[\"text6\",\"text,7\"],\"text8\":\"text9\",\"text10\":{\"text11\":\"text,12\",\"text13\":\"text14\",\"text,15\":[\"text,16\",\"text17\"],\"text,18\":\"text19\"}";
String[] toks = text.split("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(?![^{]*})(?![^\\[]*\\]),+");
for (String tok: toks)
System.out.printf("%s%n", tok);
OUTPUT:
"text1":"text2"
"text3":"text,4"
"text,5":["text6","text,7"]
"text8":"text9"
"text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}

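As an alternative to a lookahead-based regex for the question above, a small scanner that tracks quote state and bracket depth can split on top-level commas only. This is a sketch under assumptions: the class name TopLevelCommaSplitter is made up, double quotes are assumed never to be escaped, and brackets and braces are assumed to be balanced.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelCommaSplitter {
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        int depth = 0;            // nesting level of [] and {}
        boolean inQuotes = false; // true between a pair of double quotes
        int start = 0;            // start index of the current part
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes) {
                if (c == '[' || c == '{') {
                    depth++;
                } else if (c == ']' || c == '}') {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    // a comma outside quotes and outside any bracket ends a part
                    parts.add(text.substring(start, i));
                    start = i + 1;
                }
            }
        }
        parts.add(text.substring(start)); // the part after the last top-level comma
        return parts;
    }

    public static void main(String[] args) {
        String text = "\"a\":\"b,c\",\"d\":[\"e\",\"f,g\"],\"h\":{\"i\":[1,2],\"j\":\"k\"}";
        for (String part : split(text)) {
            System.out.println(part);
        }
    }
}
```

Unlike the lookaheads, the depth counter copes with arbitrarily nested brackets, since each comma is judged by the state accumulated so far.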
Escape comma when using String.split

I'm trying to perform some super simple parsing of log files, so I'm using the String.split method like this:
String [] parts = input.split(",");
And it works great for input like:
a,b,c
Or
type=simple, output=Hello, repeat=true
Just to say something.
How can I escape the comma, so it doesn't match intermediate commas?
For instance, if I want to include a comma in one of the parts:
type=simple, output=Hello, world, repeate=true
I was thinking of something like:
type=simple, output=Hello\, world, repeate=true
But I don't know how to create the split to avoid matching the comma.
I've tried:
String [] parts = input.split("[^\,],");
But, well, it is not working.
You can solve it using a negative look behind.
String[] parts = str.split("(?<!\\\\), ");
Basically it says: split on each ", " that is not preceded by a backslash.
String str = "type=simple, output=Hello\\, world, repeate=true";
String[] parts = str.split("(?<!\\\\), ");
for (String s : parts)
System.out.println(s);
Output:
type=simple
output=Hello\, world
repeate=true
If you happen to be stuck with the non-escaped comma-separated values, you could do the following (similar) hack:
String[] parts = str.split(", (?=\\w+=)");
Which says split on each ", " which is followed by some word-characters and an =
I'm afraid, there's no perfect solution for String.split. Using a matcher for the three parts would work. In case the number of parts is not constant, I'd recommend a loop with matcher.find. Something like this maybe
final String s = "type=simple, output=Hello, world, repeat=true";
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,|$)");
final Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group(1));
You'll probably want to skip the spaces after the comma as well:
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,\\s*|$)");
It's not really complicated, just note that you need four backslashes in order to match one.
Escaping works with the opposite of aioobe's answer (updated: aioobe now uses the same construct but I didn't know that when I wrote this), negative lookbehind
final String s = "type=simple, output=Hello\\, world, repeate=true";
final String[] tokens = s.split("(?<!\\\\),\\s*");
for(final String item : tokens){
System.out.println("'" + item.replace("\\,", ",") + "'");
}
Output:
'type=simple'
'output=Hello, world'
'repeate=true'
Reference:
Pattern: Special Constructs
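I think
input.split("[^\\\\],");
should work, with one caveat. It splits at every comma that is not preceded by a backslash, but the character class also consumes the character in front of the comma, so that character is dropped from the resulting token. A negative lookbehind such as (?<!\\\\) checks the preceding character without consuming it.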
BTW if you are working with Eclipse, I can recommend the QuickRex Plugin to test and debug Regexes.
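The difference between a negated character class and a negative lookbehind in front of the comma can be seen in the following sketch. The class name CommaSplitComparison and the sample input are assumptions for illustration.

```java
import java.util.Arrays;

class CommaSplitComparison {
    // Delimiter = "any character except a backslash, followed by a comma".
    // The character before the comma is part of the delimiter and is lost.
    static String[] withCharClass(String s) {
        return s.split("[^\\\\],");
    }

    // Delimiter = "a comma not preceded by a backslash".
    // The lookbehind is zero-width, so nothing but the comma is removed.
    static String[] withLookbehind(String s) {
        return s.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        // The input is: type=simple,output=Hello\, world,repeat=true
        String s = "type=simple,output=Hello\\, world,repeat=true";
        // prints [type=simpl, output=Hello\, worl, repeat=true]
        System.out.println(Arrays.toString(withCharClass(s)));
        // prints [type=simple, output=Hello\, world, repeat=true]
        System.out.println(Arrays.toString(withLookbehind(s)));
    }
}
```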
