Java split using regex lookahead - character not followed by character - java

I need to split a string into substrings in order to sort them into quoted and unquoted ones. The single quote character is used as a separator, and two consecutive single quotes represent an escape sequence, meaning they must not be used for splitting.
For example:
"111 '222''22' 3333"
should be split into
"111", "222''22", "3333"
regardless of surrounding whitespace.
I wrote the following code, but it does not work. I also tried a lookbehind, "\\'(?<!\\')", without success. Please help.
String rgxSplit = "\\'(?!\\')";
String text = "";
Scanner s = new Scanner(System.in);
System.out.println("\"" + rgxSplit + "\"");
text = s.nextLine();
while (!text.equals("")) {
    String[] splitted = text.split(rgxSplit);
    for (int i = 0; i < splitted.length; i++) {
        if (i % 2 == 0) {
            System.out.println("+" + splitted[i]);
        } else {
            System.out.println("-" + splitted[i]);
        }
    }
    text = s.nextLine();
}
Output:
$ java ParseTest
"\'(?!\')"
111 '222''22' 3333
+111
-222'
+22
- 3333

This should split on a single quote (when it is not doubled). In the case of three consecutive quotes, it groups the first two and splits on the third.
String [] splitted=text.split("(?<!') *' *(?!')|(?<='') *' *");

To split on single apostrophes, use lookarounds on both sides of the apostrophe:
String[] parts = str.split(" *(?<!')'(?!') *");
See live demo on ideone.
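To compare the two suggested patterns side by side, here is a minimal sketch that runs both against the sample input from the question. The class name QuoteSplitDemo and the helper split are made up for illustration; only String.split and the two regexes come from the answers above.

```java
import java.util.Arrays;
import java.util.List;

public class QuoteSplitDemo {
    // Regex from the first answer: the second alternative splits on the
    // third quote of three consecutive quotes.
    static final String WITH_TRIPLE = "(?<!') *' *(?!')|(?<='') *' *";
    // Regex from the second answer: a quote with no quote on either side,
    // plus any spaces around it.
    static final String LONE_QUOTE = " *(?<!')'(?!') *";

    // Hypothetical helper: split and wrap the result in a List for easy printing.
    static List<String> split(String text, String regex) {
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        String text = "111 '222''22' 3333";
        System.out.println(split(text, WITH_TRIPLE)); // [111, 222''22, 3333]
        System.out.println(split(text, LONE_QUOTE));  // [111, 222''22, 3333]
    }
}
```

Both patterns leave the doubled quote inside the middle token untouched; unescaping '' to ' would be a separate step.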

Related

Remove empty Strings after splitting a StringBuilder into Array Java

Sorry if this question has already been asked, but I could only find results for C#.
So I have this StringBuilder:
StringBuilder sb = new StringBuilder(" 111 11 ");
and I want to split it into an array using this method:
String[] ar = sb.toString().split(" ");
As expected, the resulting array has some empty entries. My question is whether I can remove these empty entries directly when I split the StringBuilder, or whether I have to do it afterwards.
split takes a regex. So:
String[] ar = sb.toString().split("\\s+");
The string \\s is regexp-ese for 'any whitespace', and the + is: 1 or more of it. If you want to split on spaces only (and not on newlines, tabs, etc), try: String[] ar = sb.toString().split(" +"); which is literally: "split on one or more spaces".
This trick works for just about any separator. For example, split on commas? Try: .split("\\s*,\\s*"), which is: 0 or more whitespace, a comma, followed by 0 or more whitespace (and regexes take as much as they can).
Note that this trick does NOT get rid of leading and trailing whitespace. But to do that, use trim. Putting it all together:
String[] ar = sb.toString().trim().split("\\s+");
and for commas:
String[] ar = sb.toString().trim().split("\\s*,\\s*");
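As a quick check of the trim-then-split advice, the sketch below compares splitting with and without trim. The class and method names (TrimSplitDemo, splitWords, splitCsv) are hypothetical; the calls themselves are plain String.trim and String.split.

```java
import java.util.Arrays;

public class TrimSplitDemo {
    // Trim first, then split on runs of whitespace.
    static String[] splitWords(String s) {
        return s.trim().split("\\s+");
    }

    // Trim first, then split on commas with optional whitespace around them.
    static String[] splitCsv(String s) {
        return s.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String raw = " 111 11 ";
        // Without trim, the leading space produces an empty first element.
        System.out.println(Arrays.toString(raw.split("\\s+")));        // [, 111, 11]
        System.out.println(Arrays.toString(splitWords(raw)));           // [111, 11]
        System.out.println(Arrays.toString(splitCsv(" a , b,c ")));     // [a, b, c]
    }
}
```

One edge case to keep in mind: an input that is empty after trimming still yields a one-element array containing the empty string, because split returns the whole input when the pattern never matches.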
I would use guava for this:
String t = " 111 11 ";
Splitter.on(Pattern.compile("\\s+"))
.omitEmptyStrings()
.split(t)
.forEach(System.out::println);
If you do not want to depend on any third-party libraries and do not want to filter with a regex, you can do it in one line with the Java 8 Streams API:
Arrays.stream(sb.toString().trim().split(" ")).filter(s-> !s.equals("")).map(s -> s.trim()).toArray();
For a detailed multiline version of the previous:
Arrays.stream(sb.toString()
    .trim() // Trim leading and trailing whitespace from the string
    .split(" ")) // Split on single spaces
    .filter(s -> !s.equals("")) // Keep only the non-empty elements
    .map(s -> s.trim()) // Trim leading and trailing whitespace from each element
    .toArray(); // Collect the elements into an Object array
Here is the working code for demonstration:
StringBuilder sb = new StringBuilder(" 111 11 ");
Object[] array = Arrays.stream(sb.toString().trim().split(" ")).filter(s-> !s.equals("")).map(s -> s.trim()).toArray();
System.out.println("(" + array[0] + ")");
System.out.println("(" + array[1] + ")");
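The stream version above collects into an Object[]. If a String[] is preferred, one possible variation (a sketch, with the hypothetical names StreamSplitDemo and nonEmptyTokens) passes an array constructor reference to toArray:

```java
import java.util.Arrays;

public class StreamSplitDemo {
    // Split on single spaces, drop the empty pieces, and collect into a String[].
    static String[] nonEmptyTokens(String s) {
        return Arrays.stream(s.split(" "))
                .filter(t -> !t.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(" 111   11 ");
        System.out.println(Arrays.toString(nonEmptyTokens(sb.toString()))); // [111, 11]
    }
}
```

Since the empty pieces are filtered out anyway, the extra trim calls are not needed in this variant.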
There are a couple of regexes that deal with this. I would also prefer @rzwitserloot's method, but if you would like to see more, check here: How do I split a string with any whitespace chars as delimiters?
glenatron has explained it:
In most regex dialects there are a set of convenient character summaries you can use for this kind of thing - these are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
Thanks to glenatron
You can use a ready-made solution from Apache Commons Lang.
Here is an example:
StringBuilder sb = new StringBuilder(" 111 11 ");
String trimmedString = StringUtils.normalizeSpace(sb.toString());
String[] trimmedAr = trimmedString.split(" ");
System.out.println(Arrays.toString(trimmedAr));
Output: [111, 11].

How can I split a string except when the delimiter is protected by quotes or brackets?

I asked How to split a string with conditions. Now I know how to ignore the delimiter if it is between two characters.
How can I check multiple groups of two characters instead of one?
I found Regex for splitting a string using space when not surrounded by single or double quotes, but I don't understand where to change '' to []. Also, it works with two groups only.
Is there a regex that will split using , but ignore the delimiter if it is between "" or [] or {}?
For instance:
// Input
"text1":"text2","text3":"text,4","text,5":["text6","text,7"],"text8":"text9","text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}
// Output
"text1":"text2"
"text3":"text,4"
"text,5":["text6","text,7"]
"text8":"text9"
"text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}
You can use:
String text = "\"text1\":\"text2\",\"text3\":\"text,4\",\"text,5\":[\"text6\",\"text,7\"],\"text8\":\"text9\",\"text10\":{\"text11\":\"text,12\",\"text13\":\"text14\",\"text,15\":[\"text,16\",\"text17\"],\"text,18\":\"text19\"}";
String[] toks = text.split("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(?![^{]*})(?![^\\[]*\\]),+");
for (String tok: toks)
System.out.printf("%s%n", tok);
- RegEx Demo
OUTPUT:
"text1":"text2"
"text3":"text,4"
"text,5":["text6","text,7"]
"text8":"text9"
"text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}

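If the regex for this question becomes hard to maintain, a different technique is a small hand-written scanner that tracks whether it is inside quotes or brackets. This is not the approach from the answer above, only a sketch under two assumptions: quotes inside strings are never escaped, and brackets are balanced. The names TopLevelSplit and splitTopLevel are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelSplit {
    // Split on commas that are outside double quotes and outside [] or {}.
    // Assumes no escaped quotes (\") inside the quoted strings.
    static List<String> splitTopLevel(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int depth = 0; // nesting level of [] and {}
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && (c == '[' || c == '{')) {
                depth++;
            } else if (!inQuotes && (c == ']' || c == '}')) {
                depth--;
            }
            if (c == ',' && !inQuotes && depth == 0) {
                // A top-level comma ends the current piece.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        String text = "\"a\":\"b,c\",\"d\":[\"e\",\"f,g\"],\"h\":{\"i\":[\"j,k\"]}";
        for (String part : splitTopLevel(text)) {
            System.out.println(part);
        }
    }
}
```

Because it counts nesting depth, this sketch also copes with brackets nested inside braces. For real JSON input, a JSON parser is usually the more robust choice.
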
Matching to a specific pattern

I am trying to extract all the text between double quotes and put it in the array results[].
for example I can have the following code
public static void main (String[] args){
int i = 0 ;
System.out.println ( "words to be printed" );
}
In this example, the results array should contain one string, "words to be printed".
The technique I am using is to split on the newline (\n), check whether each string contains double quotation marks, and put it in results.
I used "your string here".split("\"")[1] to extract the text between the quotation marks.
The problem is that some strings have quotation marks and some don't.
I tried:
if("your \"string\" here".split("\"")[1]) -> but this throws an exception if there are no quotation marks in the string
How can I check whether the string has quotation marks or not?
This is an appropriate time to use regular expressions to match everything between the ". So a line like this
"myWord" and somewhere else "myOther words"
Should output
myWord
myOther words
Example code for matching quoted text:
Pattern pattern = Pattern.compile("\"(.*?)\"");
for (String line : myLines) {
    Matcher matcher = pattern.matcher(line);
    while (matcher.find()) {
        System.out.println("found match '" + matcher.group(1) + "'");
    }
}
If you only want to match a single line ignore the for loop, and just match against one input.
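Since the question asks for the matches to end up in a results collection, here is a self-contained sketch of the same Pattern and Matcher idea that returns a list. The names QuotedTextDemo and extractQuoted are hypothetical. A line without quotes simply yields an empty list, so no separate check is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTextDemo {
    // Reluctant match of everything between a pair of double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Collect the text of every quoted section found in the line.
    static List<String> extractQuoted(String line) {
        List<String> results = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(line);
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(extractQuoted("System.out.println ( \"words to be printed\" );"));
        System.out.println(extractQuoted("int i = 0 ;")); // no quotes, empty list
    }
}
```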
Use myString.contains("\"") to check for the presence of double quotes.
If they exist, use split as you described.
If they don't exist, set yourString = "\"" + yourString; and use split after that.
If your string has two double quotes, then split("\"") yields up to three pieces (a trailing empty piece is dropped when the closing quote ends the string). So check that index 1 exists before reading it (assuming no more than one pair of double quotes is expected):
String[] s = input.split("\"");
if (s.length > 1)
    System.out.println(s[1]);
This is how you can check whether a string contains a quotation mark:
if (yourText.contains("\"")) {
    // do something
}
Instead of splitting on \n, use the platform line separator (if you are using Java 1.7):
String newLine = System.getProperty("line.separator");
then use "your string here".split(newLine). Hope this helps.
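If the text may come from a different platform than the one the program runs on, another option is the \R linebreak matcher, available in java.util.regex since Java 8, which matches \r\n, \n, and \r alike. A minimal sketch, with the hypothetical names LineSplitDemo and splitLines:

```java
import java.util.Arrays;

public class LineSplitDemo {
    // Split on any line break sequence, independent of the current platform.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String text = "first\r\nsecond\nthird";
        System.out.println(Arrays.toString(splitLines(text))); // [first, second, third]
    }
}
```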

Regular Expression - inserting space after comma only if succeeded by a letter or number

In Java I want to insert a space after a comma, but only if the comma is followed by a digit or a letter. I am hoping to use the replaceAll method, which takes a regular expression as a parameter. So far I have the following:
String s1="428.0,chf";
s1 = s1.replaceAll(",(\\d|\\w)",", ");
This code does successfully distinguish between the String above and one where there is already a space after the comma. My problem is that I can't figure out how to write the expression so that the space is inserted. The code above will replace the c in the String shown above with a space. This is not what I want.
s1 should look like this after executing the replaceAll: "428.0 chf"
s1 = s1.replaceAll(",(?=[\\da-zA-Z])", " ");
(?=[\da-zA-Z]) is a positive lookahead that checks for a digit or a letter after the comma. The lookahead is not replaced, since it is never included in the match; it is only a check.
NOTE
\w includes digits, letters, and _, so there is no need for \d alongside it.
A better way to represent it is [\da-zA-Z] instead of \w, since \w also includes _, which you do not need to match.
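To see the lookahead in action, here is a small sketch with two variants: one replaces the comma with a space (as the answer above does), and one keeps the comma and adds a space after it (as the question title describes). The class and method names are hypothetical. Note the doubled backslash needed for \d inside a Java string literal.

```java
public class CommaSpacingDemo {
    // Replace the comma with a space when a digit or ASCII letter follows.
    static String replaceComma(String s) {
        return s.replaceAll(",(?=[\\da-zA-Z])", " ");
    }

    // Keep the comma and insert a space after it when a digit or ASCII letter follows.
    static String insertSpace(String s) {
        return s.replaceAll(",(?=[\\da-zA-Z])", ", ");
    }

    public static void main(String[] args) {
        System.out.println(replaceComma("428.0,chf"));  // 428.0 chf
        System.out.println(insertSpace("428.0,chf"));   // 428.0, chf
        System.out.println(insertSpace("428.0, chf"));  // unchanged: 428.0, chf
    }
}
```

Because the lookahead consumes nothing, the character after the comma never needs to be put back with a group reference.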
Try this, and note that $1 refers to your matched grouping:
s1.replaceAll(",(\\d|\\w)"," $1");
Note that String.replaceAll() works in the same way as a Matcher.replaceAll(). From the doc:
The replacement string may contain references to captured subsequences
String s1 = "428.0,chf";
s1 = s1.replaceAll(",([^\\W_])", " $1"); // Match an alphanumeric character (no '_') after ','
System.out.println(s1);
Output:
428.0 chf
Since \w matches digits, letters, and the underscore, the negated class [^\W_] matches any word character except the underscore.
$1 represents the captured group. Here you captured the c after the comma, so ",c" is replaced with a space followed by $1, that is, " c".
Try this:
public class Tes {
    public static void main(String[] args) {
        String s1 = "428.0,chf";
        String[] sArr = s1.split(",");
        // Join the pieces with a single space (avoids a leading space)
        String finalStr = String.join(" ", sArr);
        System.out.println(finalStr);
    }
}

Java parsing a string with lots of whitespace

I have a string with multiple spaces, but when I use the tokenizer it breaks it apart at all of those spaces. I need the tokens to contain those spaces. How can I utilize the StringTokenizer to return the values with the tokens I am splitting on?
You'll note in the docs for StringTokenizer that its use is discouraged in new code, and that String.split(regex) is what you want:
String foo = "this is      some  data      in   a string";
String[] bar = foo.split("\\s+");
Edit to add: Or, if you have greater needs than a simple split, then use the Pattern and Matcher classes for more complex regular expression matching and extracting.
Edit again: If you want to preserve your spaces, knowing a bit about regular expressions really helps:
String[] bar = foo.split("\\b");
This will split on word boundaries, preserving the whitespace between words as its own String:
public static void main(String[] args)
{
    String foo = "this is      some  data      in   a string";
    String[] bar = foo.split("\\b");
    for (String s : bar)
    {
        System.out.print(s);
        if (s.matches("^\\s+$"))
        {
            System.out.println("\t<< " + s.length() + " spaces");
        }
        else
        {
            System.out.println();
        }
    }
}
Output:
this
<< 1 spaces
is
<< 6 spaces
some
<< 2 spaces
data
<< 6 spaces
in
<< 3 spaces
a
<< 1 spaces
string
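Splitting on \b treats punctuation as a boundary too, so a token such as "a.b" would be broken apart. If only whitespace should separate tokens, one alternative (a sketch, not the answer's method) is to scan for alternating runs of whitespace and non-whitespace with a Matcher. The names WhitespaceRunsDemo and runs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceRunsDemo {
    // Either a run of whitespace or a run of non-whitespace.
    private static final Pattern RUN = Pattern.compile("\\s+|\\S+");

    // Return every run in order, so the whitespace is kept as its own token.
    static List<String> runs(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RUN.matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : runs("some  data   in a.b")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Concatenating the returned tokens reproduces the original string exactly, which makes it easy to check that no whitespace was lost.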
Sounds like you may need to use regular expressions (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/package-summary.html) instead of StringTokenizer.
Use String.split("\\s+") instead of StringTokenizer.
Note that this will only extract the non-whitespace characters separated by at least one whitespace character. If you want leading or trailing whitespace included with the non-whitespace characters, that requires a completely different solution!
This requirement isn't clear from your original question, and there is an edit pending that tries to clarify it.
StringTokenizer in almost every non-contrived case is the wrong tool for the job.
I think it would be good to first use the replaceAll function to replace all runs of multiple spaces with a single space, and then tokenize using the split function.
