Why is String#split(...) implemented like this? [duplicate]


This question already has answers here:
Java String split removed empty values
(5 answers)
Closed 3 years ago.
I am working on software that needs to read text files (the details don't matter here). While testing my code, I found an anomaly that seems to come from the implementation of str.split("\r\n"), where str is a substring of the file's content.
When my substring ends with a succession of "\r\n" (several line breaks), the method completely ignores that part. For example, if I work with the following string:
"\r\nLine 1\r\n\r\nLine 2\r\n\r\n"
I would like to get the following array:
["", "Line 1", "", "Line 2", "", ""]
but it returns:
["", "Line 1", "", "Line 2"]
The String.split() Javadoc only mentions this without explaining it:
... Trailing empty strings are therefore not included in the resulting array.
I cannot understand this asymmetry: why did they drop empty strings at the end, but not at the beginning?

The Javadocs explain why it works the way it does; why this default was chosen is a question for the API designers. Why not just call split(regex, n) as per the docs? Passing -1 as the limit does what you say you want, just as the docs imply.
class Main {
    public static void main(String[] args) {
        String s = "\r\nLine 1\r\n\r\nLine 2\r\n\r\n";
        String[] r = s.split("\\r\\n", -1);
        for (int i = 0; i < r.length; i++) {
            System.out.println("i: " + i + " = \"" + r[i] + "\"");
        }
    }
}
Produces:
i: 0 = ""
i: 1 = "Line 1"
i: 2 = ""
i: 3 = "Line 2"
i: 4 = ""
i: 5 = ""

You missed the part of the doc that explains the "therefore", which states:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero.
Looking at the referenced two-arg doc shows
If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
So this is just not the special case you want. Call with a negative integer instead:
str.split("\r\n", -1)
It's unclear why the authors thought 0 would be a more popular use-case than -1, but it doesn't really matter since the option you want exists.
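To see the three kinds of limit side by side, here is a small sketch. The class and method names are made up for illustration; only String.split(String, int) comes from the JDK.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: splits on CRLF with the given limit
    static String[] pieces(String s, int limit) {
        return s.split("\r\n", limit);
    }

    public static void main(String[] args) {
        String s = "\r\nLine 1\r\n\r\nLine 2\r\n\r\n";
        // limit 0 (what split(regex) uses): trailing empty strings are discarded
        System.out.println(Arrays.toString(pieces(s, 0)));
        // negative limit: nothing is discarded
        System.out.println(Arrays.toString(pieces(s, -1)));
        // positive limit: at most that many pieces; the last one is the unsplit remainder
        System.out.println(pieces(s, 3).length);
    }
}
```

With the string from the question, the array lengths come out as 4, 6 and 3 respectively.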

Related

Java splitting string at index without cutting the word [duplicate]

This question already has answers here:
Large string split into lines with maximum length in java
(8 answers)
Closed 4 years ago.
I was just wondering if there is an API or some easy and quick way to split a String at a given index into a String[] array, such that if the index falls inside a word, the whole word goes into the next String.
So let's say I have a string: "I often used to look out of the window, but I rarely do that anymore"
The length of that string is 68 and I have to cut it at 36, which in this sentence is the letter n. The split should then not cut that word but happen before it, so that the array would be ["I often used to look out of the", "window, but I rarely do that anymore"].
And if the remaining part is longer than 36 then it should be split as well, so if I had a slightly longer sentence: "I often used to look out of the window, but I rarely do that anymore, even though I liked it"
the result would be ["I often used to look out of the", "window, but I rarely do that anymore", ",even though I liked it"]

Here's an old-fashioned, non-stream, non-regex solution:
// Requires: import java.util.ArrayList; import java.util.List;
public static List<String> chunk(String s, int limit)
{
    List<String> parts = new ArrayList<String>();
    while(s.length() > limit)
    {
        // Walk back from the limit to the nearest whitespace character
        int splitAt = limit-1;
        for(;splitAt>0 && !Character.isWhitespace(s.charAt(splitAt)); splitAt--);
        if(splitAt == 0)
            return parts; // can't be split
        parts.add(s.substring(0, splitAt));
        s = s.substring(splitAt+1);
    }
    parts.add(s);
    return parts;
}
This doesn't trim additional spaces either side of the split point. Also, if a string cannot be split, because it doesn't contain any whitespace in the first limit characters, then it gives up and returns the partial result.
Test:
public static void main(String[] args)
{
    String[] tests = {
        "This is a short string",
        "This sentence has a space at chr 36 so is a good test",
        "I often used to look out of the window, but I rarely do that anymore, even though I liked it",
        "I live in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
    };
    int limit = 36;
    for(String s : tests)
    {
        List<String> chunks = chunk(s, limit);
        for(String st : chunks)
            System.out.println("|" + st + "|");
        System.out.println();
    }
}
Output:
|This is a short string|
|This sentence has a space at chr 36|
|so is a good test|
|I often used to look out of the|
|window, but I rarely do that|
|anymore, even though I liked it|
|I live in|

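This matches between 1 and size characters repeatedly (greedily) and requires each match to be followed by a whitespace character, which becomes part of the match, or by the end of the string.
// Requires: import java.util.ArrayList; import java.util.List;
//           import java.util.regex.Matcher; import java.util.regex.Pattern;
public static List<String> chunk(String s, int size) {
    List<String> chunks = new ArrayList<>(s.length()/size+1);
    // Up to `size` characters, then one whitespace character or the end of the input
    Pattern pattern = Pattern.compile(".{1," + size + "}(?:\\s|$)");
    Matcher matcher = pattern.matcher(s);
    while (matcher.find()) {
        chunks.add(matcher.group());
    }
    return chunks;
}
Note that each chunk keeps its trailing whitespace character, and that it doesn't work if there's a long run of characters (longer than size) without whitespace.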
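If overlong words must not be dropped, one option is to fall back to a hard cut when no whitespace is found. The following is a rough sketch, not taken from either answer above: wrap is a made-up name, and trimming each chunk is my own assumption about what is wanted.

```java
import java.util.ArrayList;
import java.util.List;

class WrapDemo {
    // Hypothetical helper: breaks at the last whitespace at or before index `limit`,
    // and cuts the word itself only when there is no whitespace to break at.
    static List<String> wrap(String s, int limit) {
        List<String> parts = new ArrayList<>();
        s = s.trim();
        while (s.length() > limit) {
            // Start at index `limit` so that a chunk of exactly `limit` characters is allowed
            int splitAt = limit;
            while (splitAt > 0 && !Character.isWhitespace(s.charAt(splitAt))) {
                splitAt--;
            }
            if (splitAt == 0) {
                // No whitespace in range: hard-cut the word at the limit
                parts.add(s.substring(0, limit));
                s = s.substring(limit);
            } else {
                parts.add(s.substring(0, splitAt).trim());
                s = s.substring(splitAt + 1).trim();
            }
        }
        if (!s.isEmpty()) {
            parts.add(s);
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "I often used to look out of the window, but I rarely do that anymore";
        for (String part : wrap(text, 36)) {
            System.out.println("|" + part + "|");
        }
    }
}
```

For the sentence from the question and a limit of 36, this yields the two strings the asker expected. Whether a hard cut is acceptable for long words depends on the use case; hyphenating or rejecting such input are equally valid choices.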

Using scanner.next() to return the next n number of characters

I'm trying to use a Scanner to parse some text, but I keep getting an InputMismatchException. I'm using the scanner.next(Pattern pattern) method and I want to return the next n characters (including whitespace).
For example, when trying to parse
"21        SPAN 1101"
I want to store the first 4 characters ("21  ") in a variable, then the next 6 characters ("      ") in another variable, then the next 5 ("SPAN "), and finally the last 4 ("1101").
What I have so far is:
String input = "21        SPAN 1101";
Scanner parser = new Scanner(input);
avl = parser.next(".{4}");
cnt = parser.next(".{6}");
abbr = parser.next(".{5}");
num = parser.next(".{4}");
But this keeps throwing an InputMismatchException even though, according to the Java 8 documentation for scanner.next(Pattern pattern), it doesn't throw that type of exception. Even if I explicitly declare the pattern and then pass it into the method, I get the same exception.
Am I approaching this problem with the wrong class/method altogether? As far as I can tell my syntax is correct, but I still can't figure out why I'm getting this exception.

In the documentation of next(String pattern) we can find that it (emphasis mine)
Returns the next token if it matches the pattern constructed from the specified string.
But Scanner uses one or more whitespace characters as its default delimiter, so it doesn't consider spaces to be part of a token. The first token it returns is therefore "21", not "21  ", and the condition "...if it matches the pattern constructed from the specified string" is not fulfilled for .{4} because the token is too short.
The simplest solution would be to read the entire line with nextLine() and split it into separate parts via a regex like (.{4})(.{6})(.{5})(.{4}) or a series of substring calls.
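A sketch of the regex variant could look like this. It assumes every line is exactly 4 + 6 + 5 + 4 = 19 characters wide, as described in the question; the class name, the parse helper and the choice to throw on malformed lines are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedWidthDemo {
    // Field widths 4, 6, 5 and 4 are taken from the question
    private static final Pattern RECORD = Pattern.compile("(.{4})(.{6})(.{5})(.{4})");

    // Hypothetical helper: returns the four fixed-width fields, spaces included
    static String[] parse(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected a 19-character record: \"" + line + "\"");
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match
        }
        return fields;
    }

    public static void main(String[] args) {
        // Build the sample line with explicit padding so the widths are unambiguous
        String input = String.format("%-4s%-6s%-5s%-4s", "21", "", "SPAN", "1101");
        for (String field : parse(input)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Using matches() rather than find() makes the whole line count, so a line that is too short or too long is rejected instead of being silently truncated.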

You might want to consider creating a convenience method that cuts your input String into a variable number of pieces of variable length, since the approach with Scanner.next() fails: spaces are the default delimiter, so they are never part of a token. That way you can store the pieces of the input String in an array and assign specific elements of that array to other variables (I added some explanations as comments on the relevant lines):
// Requires: import java.util.Arrays;
public static void main(String[] args) {
    String input = "21        SPAN 1101";
    String[] result = cutIntoPieces(input, 4, 6, 5, 4);
    // You can assign elements of result to variables the following way:
    String avl = result[0];  // "21  "
    String cnt = result[1];  // "      "
    String abbr = result[2]; // "SPAN "
    String num = result[3];  // "1101"
    // Here is an example of how you can print the whole array to the console:
    System.out.println(Arrays.toString(result));
}
public static String[] cutIntoPieces(String input, int... howLongPiece) {
    String[] pieces = new String[howLongPiece.length]; // the pieces of the input String are stored here
    int startingIndex = 0;
    for (int i = 0; i < howLongPiece.length; i++) { // for each length passed as an argument...
        // ...store the substring that starts at startingIndex and is howLongPiece[i] characters long
        pieces[i] = input.substring(startingIndex, startingIndex + howLongPiece[i]);
        startingIndex += howLongPiece[i]; // move the start past the piece just taken
    }
    return pieces; // array containing all pieces
}
Output that you get:
[21  ,       , SPAN , 1101]

Java: Split lines that are unequal

I tried looking for answers online, but I don't know how to word it correctly to find what I'm looking for. I have a file that I need to split, but some lines are missing the delimiter I am trying to split on.
The file I need to split looks like this:
A,106,Chainsaw 12"
D,102
d,104
a,107,Chainsaw 10"
I need to split each line into three sections (Letter, ID, Tool), but the lines for 102 and 104 are missing the second comma and the Tool section. I've tried:
String[] sec = line.split(",");
and
String[] sec = line.split(",| \n");
And several other regex combinations, but none of them work. I get an ArrayIndexOutOfBoundsException on the marked line (below) because the element is missing.
...[0];
...[1];
String tool = sec[2]; //here
Any help is appreciated

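Use String[] sec = line.split(","); and then test the length of the sec array.
If the length is 2, you can use sec[0] and sec[1]; if it is 3, you can also use sec[2]. Note that an empty line does not give a length of 0: "".split(",") returns an array holding a single empty string, so a length of 1 means the line contains no comma at all.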

You were on the right path with
String[] sec = line.split(",");
This method returns an array with (in your example) either 2 or 3 elements. You can then simply check the length of the array:
String[] sec = line.split(",");
int length = sec.length;
From this you can see what type of entry you have. The length is either 2, which means it is a letter-ID pair, or 3, which means it is a letter, ID and tool entry.
If you need to distinguish between more categories, you will have to add an extra check. For example, let's say an entry can be missing either of the other two elements (not only the tool), so that you could encounter an entry like:
a,Chainsaw 10
In this case you will also need to work out the type of each element. The first thing that comes to mind is to check the length of the first element in your array (it should always be 1, since it is just a letter) and to parse the second one as an Integer (I assume the ID is always a number).
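A minimal sketch of the length check, assuming the tool is the only field that can be missing. The parse helper and the empty-string default are illustrative choices, not something from the question.

```java
import java.util.Arrays;

class ToolRecordDemo {
    // Hypothetical helper: always returns {letter, id, tool};
    // tool is "" when the line has no third field.
    static String[] parse(String line) {
        String[] sec = line.split(",");
        if (sec.length < 2) {
            throw new IllegalArgumentException("Expected at least letter,id but got: \"" + line + "\"");
        }
        // Only read sec[2] when it actually exists
        String tool = sec.length > 2 ? sec[2] : "";
        return new String[] { sec[0], sec[1], tool };
    }

    public static void main(String[] args) {
        String[] lines = { "A,106,Chainsaw 12\"", "D,102", "d,104", "a,107,Chainsaw 10\"" };
        for (String line : lines) {
            System.out.println(Arrays.toString(parse(line)));
        }
    }
}
```

Returning a fixed-size array keeps the calling code free of further length checks; an Optional or a small record class would work just as well.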

You are splitting on , but some of your input lines do not follow the same pattern.
So in this kind of situation, you have to check the array length every time after splitting the string.
If it's less than the desired length, then you cannot access the desired element, because it's not present in the array.
For example:
When you do String[] sec = line.split(","); for A,106,Chainsaw 12, the length will be 3, and you can access the elements sec[0],sec[1],sec[2].
When you split A,106, the length will be 2, and the elements present in the array are sec[0],sec[1].
Example code:
import java.util.*;
public class ArrayListDemo {
    public static void main(String args[]) {
        ArrayList<String> lines = new ArrayList<String>();
        lines.add("A,106,Chainsaw 12");
        lines.add("A,106");
        lines.add("A");
        for (String str : lines) {
            String[] parts = str.split(",");
            // Only indexes below parts.length exist, so loop up to the actual length
            for (int i = 0; i < parts.length; i++)
                System.out.println("Split item at index " + "[" + i + "]" + "::" + parts[i]);
        }
    }
}
Hope that helps..

Regex does not store the element in the first index

I have a function which takes a String containing a math expression such as 6+9*8 or 4+9 and evaluates it from left to right (without the normal order-of-operations rules).
I've been stuck on this problem for the past couple of hours and have finally found the culprit, BUT I have no idea why it does what it does. I split the string with regexes (.split("\\d") and .split("\\D")) into 2 arrays: an int[] containing the numbers involved in the expression and a String[] containing the operations.
What I've realized is that when I do the following:
String question = "5+9*8";
String[] mathOperations = question.split("\\d");
for(int i = 0; i < mathOperations.length; i++) {
    System.out.println("Math Operation at " + i + " is " + mathOperations[i]);
}
it does not put the first operation sign at index 0; instead it puts it at index 1. Why is this?
This is the System.out output on the console:
Math Operation at 0 is
Math Operation at 1 is +
Math Operation at 2 is *

Because at position 0 of mathOperations there's an empty String. In other words
mathOperations = {"", "+", "*"};
According to split documentation
The array returned by this method contains each substring of this
string that is terminated by another substring that matches the given
expression or is terminated by the end of the string. ...
Why isn't there an empty string at the end of the array too?
Trailing empty strings are therefore not included in the resulting
array.
In more detail, your regex matched the String like this:
"(5)+(9)*(8)" -> "" + (5) + "+" + (9) + "*" + (8) + ""
but the trailing empty string is discarded as specified by the documentation.
(hope this silly illustration helps)
Also worth noting: the regex you used, "\\d", would split the string "55+5" into
["", "", "+"]
That's because it matches only a single character at a time, so each digit is its own delimiter; you should probably use "\\d+".
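Putting both points together, a sketch along these lines gives the operators and operands without the stray empty string. The method names are made up, and it assumes the expression consists only of non-negative integers separated by single-character operators.

```java
import java.util.Arrays;

class ExpressionSplitDemo {
    // Hypothetical helper: the operators of the expression, in order
    static String[] operators(String expr) {
        // "\\d+" treats each run of digits as one delimiter
        String[] parts = expr.split("\\d+");
        // The expression starts with a number, so the first piece is "": skip it
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    // Hypothetical helper: the operands of the expression, in order
    static int[] numbers(String expr) {
        String[] parts = expr.split("\\D+");
        int[] result = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            result[i] = Integer.parseInt(parts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        String question = "55+9*8";
        System.out.println(Arrays.toString(operators(question)));
        System.out.println(Arrays.toString(numbers(question)));
    }
}
```

For "55+9*8" the operators are + and * and the numbers are 55, 9 and 8, so the two arrays line up for a left-to-right evaluation.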

You may find the following variation on your program helpful, as one split does the jobs of both of yours...
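public class zw {
    public static void main(String[] args) {
        String question = "85+9*8-900+77";
        // \b is zero-width, so no characters are consumed by the split
        String[] bits = question.split("\\b");
        for (int i = 0; i < bits.length; ++i) System.out.println("[" + bits[i] + "]");
    }
}
and its output (this is what Java 7 prints; from Java 8 on, a zero-width match at the very start of the input no longer produces a leading empty string, so the first line is absent):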
[]
[85]
[+]
[9]
[*]
[8]
[-]
[900]
[+]
[77]
In this program, I used \b as a "zero-width boundary" to do the splitting. No characters were harmed during the split, they all went into the array.
More info here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
and here: http://www.regular-expressions.info/wordboundaries.html

Regarding split() method in Java String API [duplicate]

This question already has answers here:
java - after splitting a string, what is the first element in the array?
(4 answers)
Closed 7 years ago.
I wrote the following Java code for testing the split() method in the String API.
import java.util.Scanner;
public class TestSplit {
public static void main(String[] args) {
String str = "10 5";
String[] integers = str.split(" ");
int numOfInt = integers.length;
for (int i = 0; i < numOfInt; i++) {
System.out.println(integers[i]);
}
}
}
I noticed that the above code gives me an output of
10
5
which is to be expected.
However, if I change the contents of str to " 10 5" (with a leading space) then I get
(an empty line)
10
5
as output. I don't understand why the output is different from the one above. If I am splitting str by using " " as a delimiter, then I thought that all " " will be ignored. So what is the extra space doing in my output?
EDIT: I tried "10  5" (with two spaces in the middle) and got
10
(an empty line)
5
as output.

If you have a delimiter as the first character, split will return an empty String as the first element of the output array (i.e. " 10 5".split(" ") returns the array {"","10","5"}).
Similarly, if you have two consecutive delimiters, split will produce an empty String between them. So "10  5".split(" ") (two spaces between 10 and 5) will produce the array {"10","","5"}.
If you wish leading and trailing whitespace to be ignored, change str.split(" "); to str.trim().split(" ");.
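Note that trim() only deals with the ends of the string. If runs of whitespace inside the string should also count as a single delimiter, a sketch like the following might be closer to what you want (tokens is a made-up helper name):

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Hypothetical helper: splits on runs of whitespace, ignoring leading and trailing whitespace
    static String[] tokens(String s) {
        String trimmed = s.trim();
        // "".split(...) returns an array holding one empty string, so handle blank input explicitly
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // "\\s+" matches one or more whitespace characters as a single delimiter
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens(" 10 \t 5 ")));
    }
}
```

Here the space, tab and space between 10 and 5 are treated as one delimiter, so no empty strings end up in the result.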
