String parsing in Java with delimiter tab "\t" using split - java

I'm processing a string which is tab delimited. I'm accomplishing this using the split function, and it works in most situations. The problem occurs when a field is missing, so instead of getting null in that field I get the next value. I'm storing the parsed values in a string array.
String[] columnDetail = new String[11];
columnDetail = column.split("\t");
Any help would be appreciated. If possible I'd like to store the parsed strings into a string array so that I can easily access the parsed data.

String.split takes a regular expression, and you don't need to allocate an array before calling it.
split returns a String array sized to the number of fields it finds. The problem is that you pre-define how many fields you expect, but how would you really know that? Try the Scanner or StringTokenizer and learn how string splitting works.
Let me explain how escaping works here, and why a literal backslash needs \\\\.
split takes a regex (regular expression). In Java source, "\t" is already a tab character by the time the regex engine sees it, and a tab has no special meaning in a regex, so it matches a tab literally. A backslash is different: it is the escape character both in Java string literals and in regex.
So you can also write the delimiter as:
\\t
Here the first \ escapes the second at the Java level, so the regex engine receives the two characters \t, which it interprets as a tab. Both "\t" and "\\t" therefore split on tabs.
Now let's say that you want to split on \ itself.
The regex for a literal backslash is \\, because a single \ would try to escape the next character. To get \\ into the regex you have to escape each backslash again in the Java literal, and therefore you need \\\\.
I hope the examples above help you understand how the escaping works and how to handle other delimiters!
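To check the escaping rules concretely, here is a minimal sketch. The class name, method names and sample strings are made up for illustration. It compares the Java literal "\t", the regex escape "\\t", and the four-backslash form needed for a literal backslash.

```java
import java.util.Arrays;

class EscapeDemo {
    // Java string escape: the regex engine receives a real tab character.
    static String[] splitOnLiteralTab(String s) {
        return s.split("\t", -1);
    }

    // Regex escape: the regex engine receives a backslash followed by 't'.
    static String[] splitOnRegexTab(String s) {
        return s.split("\\t", -1);
    }

    // "\\\\" in source is the two-character regex \\ which matches one backslash.
    static String[] splitOnBackslash(String s) {
        return s.split("\\\\", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnLiteralTab("a\tb")));   // [a, b]
        System.out.println(Arrays.toString(splitOnRegexTab("a\tb")));     // [a, b]
        System.out.println(Arrays.toString(splitOnBackslash("C:\\tmp"))); // [C:, tmp]
    }
}
```

Both tab forms give the same result, so the choice between them is a matter of readability.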
Now, I've given you this answer before, maybe you should start looking at them now.
OTHER METHODS
StringTokenizer
You should look into the StringTokenizer, it's a very handy tool for this type of work.
Example
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
This will output
this
is
a
test
You use the Second Constructor for StringTokenizer to set the delimiter:
StringTokenizer(String str, String delim)
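One caveat for the missing-field problem: StringTokenizer treats consecutive delimiters as a single separator, so it never returns an empty token. The sketch below illustrates this. The class and method names are hypothetical and the sample line is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Collects every token StringTokenizer produces for the given delimiter set.
    static List<String> tokens(String line, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two adjacent tabs: the empty field between them is skipped.
        System.out.println(tokens("a\t\tc", "\t")); // [a, c]
    }
}
```

If the position of each field matters, this behaviour loses information, which is why split with a negative limit is usually the better fit for tab-delimited records.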
Scanner
You could also use a Scanner, as one of the commenters said. It could look somewhat like this:
Example
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();
The output would be
1
2
red
blue
Meaning that it will cut out the word "fish" and give you the rest, using "fish" as the delimiter.
examples taken from the Java API

Try this:
String[] columnDetail = column.split("\t", -1);
Read the Javadoc on String.split(java.lang.String, int) for an explanation about the limit parameter of split function:
split
public String[] split(String regex, int limit)
Splits this string around matches of the given regular expression.
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
The string "boo:and:foo", for example, yields the following results with these parameters:
Regex  Limit  Result
:      2      { "boo", "and:foo" }
:      5      { "boo", "and", "foo" }
:      -2     { "boo", "and", "foo" }
o      5      { "b", "", ":and:f", "", "" }
o      -2     { "b", "", ":and:f", "", "" }
o      0      { "b", "", ":and:f" }
When the last few fields are missing (I guess that's your situation), the column looks like this:
field1\tfield2\tfield3\t\t
If no limit is passed to split(), the limit is 0, which means that "trailing empty strings will be discarded". So you get just 3 fields, {"field1", "field2", "field3"}.
When limit is set to -1, a non-positive value, trailing empty strings will not be discarded. So you can get 5 fields with the last two being empty string, {"field1", "field2", "field3", "", ""}.
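The two cases can be compared side by side. This is a small sketch with a hypothetical helper named fields, using the same sample line as above.

```java
import java.util.Arrays;

class LimitDemo {
    // Splits a tab-delimited line with the given limit argument.
    static String[] fields(String line, int limit) {
        return line.split("\t", limit);
    }

    public static void main(String[] args) {
        String column = "field1\tfield2\tfield3\t\t";
        // Limit 0 is what the one-argument split uses: trailing empties are dropped.
        System.out.println(Arrays.toString(fields(column, 0)));  // [field1, field2, field3]
        // A negative limit keeps the two trailing empty strings.
        System.out.println(Arrays.toString(fields(column, -1))); // [field1, field2, field3, , ]
    }
}
```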

Well nobody answered - which is in part the fault of the question : the input string contains eleven fields (this much can be inferred) but how many tabs ? Most possibly exactly 10. Then the answer is
String s = "\t2\t\t4\t5\t6\t\t8\t\t10\t";
String[] fields = s.split("\t", -1); // in your case s.split("\t", 11) might also do
for (int i = 0; i < fields.length; ++i) {
if ("".equals(fields[i])) fields[i] = null;
}
System.out.println(Arrays.asList(fields));
// [null, 2, null, 4, 5, 6, null, 8, null, 10, null]
// with s.split("\t") : [null, 2, null, 4, 5, 6, null, 8, null, 10]
If the fields happen to contain tabs this won't work as expected, of course.
The -1 means : apply the pattern as many times as needed - so trailing fields (the 11th) will be preserved (as empty strings ("") if absent, which need to be turned to null explicitly).
If on the other hand there are no tabs for the missing fields - so "5\t6" is a valid input string containing the fields 5,6 only - there is no way to get the fields[] via split.

String.split implementations will have serious limitations if the data in a tab-delimited field itself contains newline, tab and possibly " characters.
TAB-delimited formats have been around for donkey's years, but format is not standardised and varies. Many implementations don't escape characters (newlines and tabs) appearing within a field. Rather, they follow CSV conventions and wrap any non-trivial fields in "double quotes". Then they escape only double-quotes. So a "line" could extend over multiple lines.
Reading around I heard "just reuse apache tools", which sounds like good advice.
In the end I personally chose opencsv. I found it light-weight, and since it provides options for escape and quote characters it should cover most popular comma- and tab- delimited data formats.
Example:
CSVReader tabFormatReader = new CSVReader(new FileReader("yourfile.tsv"), '\t');

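You can use yourstring.split("\\x09"); here \x09 is the regex hex escape for the tab character. The backslash has to be doubled because Java string literals have no \x escape.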
I tested it, and it works.

I just had the same question and noticed the answer in some kind of tutorial. In general you need to use the second form of the split method, using the
split(regex, limit)
Here is the full tutorial http://www.rgagnon.com/javadetails/java-0438.html
If you pass a negative number as the limit parameter, you will get empty strings in the array where values are missing. For this to work, your input string must contain two consecutive delimiters wherever a value is missing, i.e. \t\t.
Hope this helps :)

String[] columnDetail = column.split("\t", -1); // unlimited
OR
String[] columnDetail = column.split("\t", 11); // if you are sure about the limit
* The {@code limit} parameter controls the number of times the
* pattern is applied and therefore affects the length of the resulting
* array. If the limit <i>n</i> is greater than zero then the pattern
* will be applied at most <i>n</i> - 1 times, the array's
* length will be no greater than <i>n</i>, and the array's last entry
* will contain all input beyond the last matched delimiter. If <i>n</i>
* is non-positive then the pattern will be applied as many times as
* possible and the array can have any length. If <i>n</i> is zero then
* the pattern will be applied as many times as possible, the array can
* have any length, and trailing empty strings will be discarded.

Related

Weird behavior of Java's String.split() [duplicate]

I am trying to split a value using a separator, but I am getting surprising results:
String data = "5|6|7||8|9||";
String[] split = data.split("\\|");
System.out.println(split.length);
I expect to get 8 values: [5,6,7,EMPTY,8,9,EMPTY,EMPTY]
But I am getting only 6 values.
Any idea how to fix this? No matter where an EMPTY value occurs, it should be in the array.
split(delimiter) by default removes trailing empty strings from the result array. To turn this off, use the overloaded version split(delimiter, limit) with limit set to a negative value, like
String[] split = data.split("\\|", -1);
Little more details:
split(regex) internally returns result of split(regex, 0) and in documentation of this method you can find (emphasis mine)
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array.
If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
If n is non-positive then the pattern will be applied as many times as possible and the array can have any length.
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
Exception:
It is worth mentioning that removing trailing empty strings makes sense only if those empty strings were created by the split mechanism. For "".split(anything), since "" cannot be split any further, the result is the array [""].
This happens because no split took place, so "", despite being empty and trailing, represents the original string rather than an empty string created by the splitting process.
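A short sketch of this exception, using a hypothetical count helper that returns the length of the array from the one-argument split:

```java
class EmptyInputDemo {
    // Number of elements returned by the one-argument split on a comma.
    static int count(String input) {
        return input.split(",").length;
    }

    public static void main(String[] args) {
        System.out.println(count(""));   // 1: nothing was split, the original "" is returned
        System.out.println(count(","));  // 0: both pieces are empty and trailing, so both are removed
        System.out.println(count("a,")); // 1: the trailing empty string after the comma is removed
    }
}
```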
From the documentation of String.split(String regex):
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So you will have to use the two argument version String.split(String regex, int limit) with a negative value:
String[] split = data.split("\\|",-1);
Doc:
If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
This will not leave out any empty elements, including the trailing ones.
String[] split = data.split("\\|",-1);
This is not always the actual requirement. The drawback of the above is shown below:
Scenario 1: All data present
String data = "5|6|7||8|9|10|";
String[] split = data.split("\\|");
String[] splt = data.split("\\|",-1);
System.out.println(split.length); //output: 7
System.out.println(splt.length); //output: 8
Scenario 2: Data missing
String data = "5|6|7||8|||";
String[] split = data.split("\\|");
String[] splt = data.split("\\|",-1);
System.out.println(split.length); //output: 5
System.out.println(splt.length); //output: 8
The real requirement is that the length should be 7 even though data is missing, for example when I need to insert the fields into a database. We can achieve this with the approach below.
String data = "5|6|7||8|||";
String[] split = data.split("\\|");
String[] splt = data.replaceAll("\\|$","").split("\\|",-1);
System.out.println(split.length); //output: 5
System.out.println(splt.length); //output:7
What I've done here is remove the "|" pipe at the end and then split the String. If you have "," as a separator, then you need to use ",$" inside replaceAll.
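The same idea can be wrapped in a helper that works for any delimiter. This is only a sketch: the name splitRecord is hypothetical, and it swaps the replaceAll regex for endsWith plus Pattern.quote so the delimiter does not have to be escaped by hand.

```java
import java.util.regex.Pattern;

class TrailingDelimiterDemo {
    // Removes at most one trailing delimiter, then splits keeping every empty field.
    // Assumes delimiter is a non-empty literal string (not a regex).
    static String[] splitRecord(String record, String delimiter) {
        String body = record.endsWith(delimiter)
                ? record.substring(0, record.length() - delimiter.length())
                : record;
        return body.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        System.out.println(splitRecord("5|6|7||8|9|10|", "|").length); // 7
        System.out.println(splitRecord("5|6|7||8|||", "|").length);    // 7
    }
}
```

Both scenarios now give 7 fields, whether or not the trailing values are present.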
From String.split() API Doc:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
Overloaded String.split(regex, int) is more appropriate for your case.
You may have multiple separators, including whitespace characters, commas, semicolons, etc. Put them in a repeatable character class with []+, like:
String[] tokens = "a , b, ,c; ;d, ".split( "[,; \t\n\r]+" );
You'll get 4 tokens: a, b, c, d.
Leading separators in the source string need to be removed before applying this split.
As an answer to the question asked:
String data = "5|6|7||8|9||";
String[] split = data.split("[\\| \t\n\r]+");
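Whitespace is added in case it appears as a separator along with |. Note that the + quantifier collapses consecutive separators, so empty fields are dropped rather than preserved.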
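To see how the character-class approach differs from a negative limit, here is a brief comparison on the string from the question. The class and method names are made up for illustration.

```java
class CollapseDemo {
    // Splits on single pipes and keeps every empty field.
    static int keepEmpties(String data) {
        return data.split("\\|", -1).length;
    }

    // Splits on runs of pipes or whitespace, so consecutive separators count as one.
    static int collapseSeparators(String data) {
        return data.split("[\\| \t\n\r]+").length;
    }

    public static void main(String[] args) {
        String data = "5|6|7||8|9||";
        System.out.println(keepEmpties(data));        // 8
        System.out.println(collapseSeparators(data)); // 5
    }
}
```

Which one is right depends on whether empty fields carry meaning in the data.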

Why aren't my last split patterns respected in this code

First thing first, here is my code:
String line = "Events|1005435529|7021370073||PAGELOAD|2017-06-19T12:04:40||JI||ServerHostName|ServerIPAddress|9P2_D2jB9Toct7PDTJ7zwLUmWfEYz6Y4akyOKn2g4CepveMH4wr3!46548593!1497854077121|||||||||||";
int offset = line.indexOf("Events");
String zeroIn = line.substring(offset);
String[] jsonElements = zeroIn.split("\\|");
System.out.println(Arrays.asList(jsonElements));
Output:
[Events, 1005435529, 7021370073, , PAGELOAD, 2017-06-19T12:04:40, , JI, , ServerHostName, ServerIPAddress, 9P2_D2jB9Toct7PDTJ7zwLUmWfEYz6Y4akyOKn2g4CepveMH4wr3!46548593!1497854077121]
I also notice spaces added to each array element at the beginning.
My question: there are about 10 empty pipe-delimited fields at the end of the String line. The first, second and third empty fields in the middle are respected, but the trailing ones are dropped and don't show up in the array. What am I missing here?
split(java.lang.String regex) calls split(java.lang.String regex ,int limit) with an argument of 0.
If n is zero then the pattern will be applied as many times as
possible, the array can have any length, and trailing empty strings
will be discarded.
You may call this method by yourself with a positive value (and large enough to be sure to include all tokens) to prevent empty tokens from being discarded :
String[] jsonElements = zeroIn.split("\\|", zeroIn.length());
Note : from the comments below, using a negative value is indeed a better way to do this :
String[] jsonElements = zeroIn.split("\\|", -1);
If n is non-positive then the pattern will be applied as many times as
possible and the array can have any length.
From String class and split method doc:
Trailing empty strings are therefore not included in the resulting array.
So everything after the last non-empty string is left out of the array.
The accepted answer explains the limitation you observed when splitting on a single-character delimiter. I thought I would offer this answer in case you need to retain empty tokens in your output. If you split using a lookaround, e.g. a lookbehind, you end up with distinct entries even when two pipes have nothing between them. Note that each entry keeps its trailing pipe, so an empty field shows up as a lone |:
String line = "Events|1005435529|7021370073||PAGELOAD|2017-06-19T12:04:40||JI||ServerHostName|ServerIPAddress|9P2_D2jB9Toct7PDTJ7zwLUmWfEYz6Y4akyOKn2g4CepveMH4wr3!46548593!1497854077121|||||||||||";
String[] parts = line.split("(?<=\\|)");
for (String part : parts) {
System.out.println(part);
}
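On a shorter input the effect of the lookbehind is easier to see. This is a sketch with a made-up sample string and a hypothetical helper name.

```java
import java.util.Arrays;

class LookbehindDemo {
    // Splits at every position that is preceded by a pipe; the pipe stays in the token.
    static String[] splitAfterPipe(String line) {
        return line.split("(?<=\\|)");
    }

    public static void main(String[] args) {
        // Empty fields appear as a lone "|" instead of disappearing.
        System.out.println(Arrays.toString(splitAfterPipe("a||b||"))); // [a|, |, b|, |]
    }
}
```

Since every token still ends with its delimiter, a caller would typically strip the trailing | from each entry before using it.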

Splitting an empty string in Java seems to violate documentation by not discarding trailing empty strings

System.out.println(",".split(",", 0).length);
System.out.println("".split(",", 0).length);
prints:
0
1
This seems odd. According to the documentation for String.split(pattern, n),
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
In the second case, when splitting an empty string, this rule seems to be ignored. Is this expected behavior?
From the docs:
If the expression does not match any part of the input then the
resulting array has just one element, namely this string
So "".split(",", 0).length is equivalent to this:
System.out.println(new String[]{""}.length);
There is no , in the string "", so the array contains a single element, the empty string "", which gives an array length of 1.
Another example:
System.out.println("aaa".split(",", 0).length); // 1
System.out.println("aaa".split("," , 0)[0]); // aaa

Confused with using split with multiple delimiters

I'm practicing reading input and then tokenizing it.
For example, if I have [882,337] I want to just get the numbers 882 and 337. I tried using the following code:
String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");
System.out.println(tokens[0]);
System.out.println(tokens[1]);
System.out.println(tokens[2]);
It kind of works, the output is:
(blank line)
882
337
What I don't understand is why tokens[0] is empty. I would expect there to be only two tokens, where tokens[0] = 882 and tokens[1] = 337.
I checked out some links but didn't find the answer.
Thanks for the help!
Split splits the given String. If you split "[882,337]" on "[" or "," or "]" then you actually have:
nothing
882
337
nothing
But, as you have called String.split(delimiter), this calls String.split(delimiter, limit) with a limit of zero.
From the documentation:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
(emphasis mine)
So in this configuration the final, empty, strings are discarded. You are therefore left with exactly what you have.
Usually, to tokenize something like this, one would go for a combination of replaceAll and split:
final String[] tokens = input.replaceAll("^\\[|\\]$", "").split(",");
This will first strip off the start (^[) and end (]$) brackets and then split on ,. This way you don't have to have somewhat obtuse program logic where you start looping from an arbitrary index.
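As a runnable sketch of the strip-then-split approach (the class and method names are hypothetical), note that replaceAll takes two arguments, the regex and the replacement:

```java
import java.util.Arrays;

class BracketDemo {
    // Strips one leading '[' and one trailing ']', then splits on commas.
    static String[] numbers(String input) {
        return input.replaceAll("^\\[|\\]$", "").split(",");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(numbers("[882,337]"))); // [882, 337]
    }
}
```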
As an alternative, for more complex tokenizations, one can use Pattern - might be overkill here, but worth bearing in mind before you get into writing multiple replaceAll chains.
First we need to define, in regex, the tokens we want (rather than those we're splitting on). In this case it's simple: it's just digits, so \d+.
So, in order to extract all digit-only values (no thousands/decimal separators) from an arbitrary String, one would do the following:
// needs java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
final List<Integer> tokens = new ArrayList<>();      // to hold the tokens
final Pattern pattern = Pattern.compile("\\d++");    // the compiled regex
final Matcher matcher = pattern.matcher(input);      // the matcher on input
while (matcher.find()) {                             // for each matched token
    tokens.add(Integer.parseInt(matcher.group()));   // parse an int and store it
}
N.B: I have used a possessive regex pattern for efficiency
So, you see, the above code is somewhat more complex than the simple replaceAll().split(), but it is much more extensible. You can use arbitrary complex regex to token almost any input.
The symbols where the string is split are here:
String test = "[882,337]";
               ^   ^   ^
Because the first char matches your delimiter, everything to the left of it becomes the first result. To the left of the first char there is nothing, so the result is the empty string.
One could expect the same behaviour for the end, since the last symbol also matches the delimiter. But:
Trailing empty strings are therefore not included in the resulting array.
See Javadoc.
Splitting creates two (or more) things from one thing. For instance if you split a,b by , you will get a and b.
But in case of ",b" you will get "" and "b". You can think of it this way:
"" exists at start, end and even in-between all characters of string:
""+","+"b" -> ",b" so if we split on this "," we are getting left and right part: "" and "b"
Something similar happens in the case of "a,": at first the result array is ["a",""], but then split removes trailing empty strings and returns only ["a"] (you can turn off this cleanup by using split(",", -1)).
So in case of
String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");
you are splitting:
      ""+"["+"882"+","+"337"+"]"+""
here:     ^         ^         ^
which at first creates the array ["", "882", "337", ""], but then the trailing empty string is removed and you finally receive:
["", "882", "337"]
The only case where an empty string is removed from the start of the result array is when both of the following hold:
you are using Java 8 (or newer) and splitting on a zero-length regex, like split("") or, say, before each x with split("(?=x)") (more info at: Why in Java 8 split sometimes removes empty strings at start of result array?)
and this empty string was produced by the split itself. For instance "".split("") will not remove "", more info here: https://stackoverflow.com/a/25058091/1393766
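The difference between zero-width and positive-width matches at the start can be checked with a short sketch. It assumes Java 8 or newer, and the sample strings are made up.

```java
import java.util.Arrays;

class ZeroWidthDemo {
    // Thin wrapper so each case can be printed and compared.
    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // Zero-width match at index 0: no leading empty string (Java 8+).
        System.out.println(Arrays.toString(split("abc", "")));        // [a, b, c]
        System.out.println(Arrays.toString(split("xaxb", "(?=x)")));  // [xa, xb]
        // Positive-width match at index 0: the leading empty string is kept.
        System.out.println(Arrays.toString(split(",b", ",")));        // [, b]
    }
}
```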
That's because each delimiter has a "before" and "after" result, even if it is empty. Consider
882,337
You expect that to produce two results.
Similarly, you expect
882,337,
to produce three, with the last one being empty (assuming your limit is big enough, or assuming you're using almost any other language / implementation of split()). Extending that logically,
,882,337,
must produce four, with the first and last results being empty. This is exactly the case you have, except you have multiple delimiters.
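The progression above can be verified directly once the limit is large enough, here by passing -1. The helper name parts is hypothetical.

```java
class BeforeAfterDemo {
    // Counts the pieces around each comma, keeping leading and trailing empty strings.
    static int parts(String s) {
        return s.split(",", -1).length;
    }

    public static void main(String[] args) {
        System.out.println(parts("882,337"));   // 2
        System.out.println(parts("882,337,"));  // 3
        System.out.println(parts(",882,337,")); // 4
    }
}
```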

Java : split a string that containing special characters

I have a string like ||81|||01|| and I want to split the string with | symbol.
I had done this way,
String str = "||81|||01||";
System.out.println(str.split("\\|").length); // prints 6, but I am expecting 8
What is wrong with this code? How can I split this string on that character so that I get the expected length (8)?
Using split("\\|") is the same as split("\\|", 0), where the limit parameter 0 tells the function "omit trailing empty strings". So you are missing the last two empty strings. Use the two-argument version and supply a negative number to obtain all parts (even trailing empty ones):
str.split("\\|", -1)
Print:
System.out.println(Arrays.toString(str.split("\\|")));
And you'll understand why it's printing 6.
You can try doing what you want using public String[] split(String regex, int limit):
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array.
So you should do:
System.out.println(str.split("\\|", -1).length);
Now, printing the array will print:
[, , 81, , , 01, , ] as you expected.
You can also use string.split(Pattern.quote("|"),-1) for spliting a string on a special character.
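Pattern.quote is handy when the delimiter is not known in advance, because it makes any string match literally. A minimal sketch, with a hypothetical helper name:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteDemo {
    // Splits on the delimiter taken literally, keeping all empty fields.
    static String[] splitLiteral(String s, String delimiter) {
        return s.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("||81|||01||", "|").length);      // 8
        // "." is a regex metacharacter too; quoting makes it a plain dot.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", "."))); // [a, b, c]
    }
}
```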
You need to use:
str.split("\\|", -1)
The second parameter is limit. From the javadoc:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
str.split("\\|", -1) will do the necessary.
String str = "||81|||01||";
System.out.println(str.split("\\|", 8).length);
The second argument to split is the limit: the maximum number of elements in the resulting array (the pattern is applied at most limit - 1 times). Single-argument split is like invoking split(regex, 0), which leaves out trailing empty strings. See the javadoc of both for more explanation.
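A positive limit only gives all fields if it is at least as large as the number of fields. The sketch below, with a hypothetical helper name, shows what happens when it is smaller: the last element absorbs the rest of the input.

```java
import java.util.Arrays;

class PositiveLimitDemo {
    // Splits on a pipe with an explicit limit.
    static String[] split(String s, int limit) {
        return s.split("\\|", limit);
    }

    public static void main(String[] args) {
        String str = "||81|||01||";
        System.out.println(split(str, 8).length);            // 8
        // With limit 3 the pattern is applied twice; the remainder stays unsplit.
        System.out.println(Arrays.toString(split(str, 3)));  // [, , 81|||01||]
    }
}
```

When the number of fields is not known for certain, a negative limit avoids this truncation.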
