Hadoop - Pipe delimiter not recognized - java

In the mapper, I want to split each line of a file on the pipe character. A line looks like number|twitter|abc.. and is a long string.
But the pipe delimiter isn't recognized when I do:
String[] columnArray = line.split("|");
If I try to split it with a space like line.split(" "), it works fine so I don't think there is a problem with it recognizing characters.
Is there any other character that can look like pipe? Why doesn't split recognize the | character?

As shared in another answer
"String.split expects a regular expression argument. An unescaped | is parsed as a regex meaning "empty string or empty string," which isn't what you mean."
https://stackoverflow.com/a/9808719/2623158
Here's a test example.
public class Test
{
    public static void main(String[] args)
    {
        String str = "test|pipe|delimiter";
        String[] tmpAr = str.split("\\|");
        for (String s : tmpAr)
        {
            System.out.println(s);
        }
    }
}

String.split takes a regular expression (as the javadoc states), and "|" is a special character in regular expressions. Try "[|]" instead.
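
To put the suggestions above side by side, here is a minimal sketch of three ways to split on a literal pipe. The class and method names are made up for illustration. Pattern.quote is the standard-library helper for turning an arbitrary string into a literal pattern, which is an option the answers above do not mention.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class PipeSplitDemo {
    // Escape the pipe with a backslash (doubled inside a Java string literal).
    static String[] withBackslash(String line) {
        return line.split("\\|");
    }

    // Put the pipe in a character class, where it loses its special meaning.
    static String[] withCharClass(String line) {
        return line.split("[|]");
    }

    // Let Pattern.quote build a literal pattern from the delimiter.
    static String[] withQuote(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String line = "number|twitter|abc";
        // Each call prints [number, twitter, abc]
        System.out.println(Arrays.toString(withBackslash(line)));
        System.out.println(Arrays.toString(withCharClass(line)));
        System.out.println(Arrays.toString(withQuote(line)));
    }
}
```

Pattern.quote is probably the most convenient of the three when the delimiter comes from configuration rather than from a literal in the code.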

Related

Replace all characters between two delimiters using regex

I'm trying to replace all characters between two delimiters with another character using regex. The replacement should have the same length as the removed string.
String string1 = "any prefix [tag=foo]bar[/tag] any suffix";
String string2 = "any prefix [tag=foo]longerbar[/tag] any suffix";
String output1 = string1.replaceAll(???, "*");
String output2 = string2.replaceAll(???, "*");
The expected outputs would be:
output1: "any prefix [tag=foo]***[/tag] any suffix"
output2: "any prefix [tag=foo]*********[/tag] any suffix"
I've tried "\\[tag=.*?](.*?)\\[/tag]" but this replaces the whole sequence with a single "*".
I think that "(.*?)" is the problem here because it captures everything at once.
How would I write something that replaces every character separately?
You can use the regex
\w(?=\w*?\[)
which matches every word character that is followed, after any further word characters, by a "[".
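
As a rough sketch of how that lookahead regex could be applied from Java (the class and method names are invented for illustration), the backslashes have to be doubled inside the string literal:

```java
// Hypothetical demo class applying the lookahead regex from the answer above.
class LookaheadMaskDemo {
    // Replaces each word character that is followed, after optional
    // further word characters, by a "[".
    static String mask(String input) {
        return input.replaceAll("\\w(?=\\w*?\\[)", "*");
    }

    public static void main(String[] args) {
        // prints: any prefix [tag=foo]***[/tag] any suffix
        System.out.println(mask("any prefix [tag=foo]bar[/tag] any suffix"));
        // prints: any prefix [tag=foo]*********[/tag] any suffix
        System.out.println(mask("any prefix [tag=foo]longerbar[/tag] any suffix"));
    }
}
```

One caveat: the regex does not look at the tags themselves, so a word that directly precedes any "[" is masked as well. For example, mask("abc[tag=x]y[/tag]") returns "***[tag=x]*[/tag]".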
You can capture the characters inside one at a time and replace each of them with *:
public static String replaceByStar(String str) {
    String pattern = "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)";
    // Each pass replaces one word character between the tags.
    while (str.matches(pattern)) {
        str = str.replaceAll(pattern, "$1*$2");
    }
    return str;
}
Used like this, it prints your two expected outputs:
public static void main(String[] args) {
    System.out.println(replaceByStar("any prefix [tag=foo]bar[/tag] any suffix"));
    System.out.println(replaceByStar("any prefix [tag=foo]loooongerbar[/tag] any suffix"));
}
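So the pattern "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)" breaks down as follows:
(.*\\[tag=.*\\].*) captures the beginning: everything through the opening tag, possibly followed by some characters of the content
\\w is the single word character (letter, digit, or underscore) to replace
(.*\\[\\/tag\\].*) captures the end: possibly some characters of the content, then the closing tag and everything after it
The substitution $1*$2:
The pattern has the shape (text$1)oneChar(text$2), and each pass replaces it with (text$1)*(text$2), so the loop runs once per word character between the tags.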
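
As an alternative to looping with matches(), here is a single-pass sketch using java.util.regex.Matcher. It is not taken from the answers above: the class name, the method name, and the assumption that the opening tag contains no "]" are mine. It replaces every character between the tags, not only word characters.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, shown as one possible alternative to the loop above.
class TagMasker {
    // group 1: opening tag, group 2: content (reluctant), group 3: closing tag
    private static final Pattern TAGGED =
            Pattern.compile("(\\[tag=[^\\]]*\\])(.*?)(\\[/tag\\])");

    static String mask(String input) {
        Matcher m = TAGGED.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build a run of '*' with the same length as the tagged content.
            char[] stars = new char[m.group(2).length()];
            Arrays.fill(stars, '*');
            String replacement = m.group(1) + new String(stars) + m.group(3);
            // quoteReplacement guards against '$' or '\' in the tag text.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: any prefix [tag=foo]***[/tag] any suffix
        System.out.println(mask("any prefix [tag=foo]bar[/tag] any suffix"));
        // prints: any prefix [tag=foo]*********[/tag] any suffix
        System.out.println(mask("any prefix [tag=foo]longerbar[/tag] any suffix"));
    }
}
```

Because the string is scanned only once, this avoids re-matching the whole input for every replaced character.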

Why am I not able to split a String in Java when the string contains a filename?

Basically there are some images in my folder called Patterns. All images are in png file format.
Below is the code I'm using:
import java.io.File;
public class IMG_List {
    public static void main(String[] args) {
        File file = new File("C:/images/Patterns");
        String[] str = file.list();
        for (String f_name : str) {
            String[] str_name = f_name.split(".");
            System.out.println(str_name[0]);
        }
    }
}
When I use the above code I get:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at IMG_List.main(IMG_List.java:11)
However, when I use the following code I get no error:
import java.io.File;
public class IMG_List {
    public static void main(String[] args) {
        File file = new File("C:/images/Patterns");
        String[] str = file.list();
        for (String f_name : str) {
            String[] str_name = f_name.split("png");
            System.out.println(str_name[0]);
        }
    }
}
Why am I not able to split the string on the dot?
Thank you,
MMK.
The '.' character in regular expressions means any character, according to the Pattern javadocs.
. Any character (may or may not match line terminators)
So every character is treated as a delimiter, and you get nothing but empty strings in between. The one-argument split method discards trailing empty strings, and since they're all empty, you get a 0-length array, which explains the exception you received.
You must escape the '.' character with a backslash. To create a backslash character, you must escape the backslash itself for Java. Try
String[] str_name = f_name.split("\\.");
Then you'll get 2 elements in your array, e.g. "example" and "png" for a file named example.png (File.list() returns bare file names, not full paths).
You have to escape the dot so that it is treated as a literal character, because split takes a regular expression:
public String[] split(String regex)
Use \\. in the regexp to represent a literal dot, because an unescaped . matches any character.
You have to escape the dot:
String[] str_name = f_name.split("\\.");
If all the images are in PNG format, you can also use String.substring() to cut off the four characters of ".png":
String st_name = f_name.substring(0,f_name.length()-4);
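
The behaviour described in the answers above can be checked with a small sketch. The class name and the baseName helper are hypothetical. The helper uses lastIndexOf instead of a regex, which is an alternative that also handles extensions of other lengths.

```java
// Hypothetical demo class for the dot-splitting behaviour discussed above.
class FileNameDemo {
    // Strip the last extension without any regex; names with no dot
    // (or only a leading dot) are returned unchanged.
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    public static void main(String[] args) {
        // Unescaped dot: every character is a delimiter, so the array is empty.
        System.out.println("example.png".split(".").length);   // prints 0
        // Escaped dot: splits on the literal '.' character.
        System.out.println("example.png".split("\\.")[0]);     // prints example
        // Only the last extension is removed.
        System.out.println(baseName("archive.tar.gz"));        // prints archive.tar
    }
}
```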

How to split string separated by | character

I have input string in the following format
first|second|third|<forth>|<fifth>|$sixth
I want to split this string into an array of strings with the values [first,second,third,,,$sixth]. I am using the following code to split the string, but it is not working. Please help me.
public String[] splitString(String input) {
    String[] resultArray = input.split("|");
    return resultArray;
}
Could you please tell me what I am doing wrong?
You need to escape | using a backslash, as it is a special character. This should work:
String[] resultArray = input.split("\\|");
| is a metacharacter, meaning it represents something else in a regex (alternation). Because split takes a regex as its argument, it interprets the argument as one. You need to escape each metacharacter by placing \\ before it in the Java string literal; the compiler turns \\ into the single backslash that the regex engine sees. In your case, you would do:
String[] resultArray = input.split("\\|");
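
Here is a short sketch of how the escaped split treats empty fields, assuming that the <forth> and <fifth> placeholders in the question stand for empty values (that reading is my assumption). The class and method names are hypothetical. Empty fields in the middle are kept, while trailing empty fields are dropped unless a negative limit is passed.

```java
import java.util.Arrays;

// Hypothetical demo class for empty fields in pipe-delimited input.
class EmptyFieldDemo {
    // Default behaviour: trailing empty strings are discarded.
    static String[] fields(String input) {
        return input.split("\\|");
    }

    // A negative limit keeps trailing empty strings as well.
    static String[] fieldsKeepTrailing(String input) {
        return input.split("\\|", -1);
    }

    public static void main(String[] args) {
        // prints: [first, second, third, , , $sixth]
        System.out.println(Arrays.toString(fields("first|second|third|||$sixth")));
        System.out.println(fields("a|b||").length);              // prints 2
        System.out.println(fieldsKeepTrailing("a|b||").length);  // prints 4
    }
}
```

The two-argument form is worth considering whenever the number of columns per line has to stay fixed.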

Python split semantics in Java

When I split a string in python, adjacent space delimiters are merged:
>>> str = "hi there"
>>> str.split()
['hi', 'there']
In Java, the delimiters are not merged:
$ cat Split.java
class Split {
    public static void main(String args[]) {
        String str = "hi              there";
        String result = "";
        for (String tok : str.split(" "))
            result += tok + ",";
        System.out.println(result);
    }
}
$ javac Split.java ; java Split
hi,,,,,,,,,,,,,,there,
Is there a straightforward way to get python space split semantics in java?
String.split accepts a regular expression, so provide it with one that matches adjacent whitespace:
str.split("\\s+")
If you want to emulate the exact behaviour of Python's str.split(), you'd need to trim as well:
str.trim().split("\\s+")
Quote from the Python docs on str.split():
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
So the above is still not an exact equivalent, because it will return [''] for the empty string, but it's probably okay for your purposes :)
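
The edge cases mentioned above can be illustrated with a small sketch (the class and method names are made up for this example):

```java
// Hypothetical demo class for the whitespace edge cases discussed above.
class WhitespaceSplitDemo {
    static String[] plain(String s) {
        return s.split("\\s+");
    }

    static String[] trimmed(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Leading whitespace yields an empty first element without trim().
        System.out.println(plain("  hi   there ").length);    // prints 3
        // With trim(), only the two words remain.
        System.out.println(trimmed("  hi   there ").length);  // prints 2
        // The empty string still gives one (empty) element, unlike Python's [].
        System.out.println(trimmed("").length);               // prints 1
    }
}
```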
Use str.split("\\s+") instead. This will do what you need.
Java uses a regex to split, so splitting on a single space will give you an empty array element for every extra space in a run.
When no parameter is passed, Python's split strips leading and trailing whitespace and then treats each run of whitespace as a single separator.
So it would more properly be
"my string".trim().split("\\s+");
The problem with Niklas B.'s answer is that trim has its own definition of whitespace, i.e., anything with code up to '\u0020'. The following should get close enough to the Python version, including the fix for the empty string:
class TestSplit {
    private static final String[] EMPTY = {};
    private static String[] pySplit(String s) {
        s = s.replaceAll("^\\s+", "").replaceAll("\\s+$", "");
        if (s.isEmpty()) return EMPTY;
        return s.split("\\s+");
    }
}
In Java, String.split takes a regex, so you can do str.split(" +") to merge runs of spaces, which gets close to Python's semantics (it does not cover tabs or leading spaces).
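
The difference between " +" and "\\s+" shows up as soon as the input contains a tab. A minimal sketch, with hypothetical class and method names:

```java
import java.util.Arrays;

// Hypothetical demo class comparing the two patterns suggested above.
class SpaceVsWhitespaceDemo {
    // Matches runs of the space character only.
    static String[] onSpaces(String s) {
        return s.split(" +");
    }

    // Matches runs of any whitespace (space, tab, newline, ...).
    static String[] onWhitespace(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String input = "hi\tthere  again";
        System.out.println(onSpaces(input).length);                  // prints 2
        System.out.println(Arrays.toString(onWhitespace(input)));    // prints [hi, there, again]
    }
}
```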

Why does String.split need pipe delimiter to be escaped?

I am trying to parse a file that has each line with pipe delimited values.
It did not work correctly when I did not escape the pipe delimiter in split method, but it worked correctly after I escaped the pipe as below.
private ArrayList<String> parseLine(String line) {
    ArrayList<String> list = new ArrayList<String>();
    String[] list_str = line.split("\\|"); // note the escape "\\" here
    System.out.println(list_str.length);
    System.out.println(line);
    for (String s : list_str) {
        list.add(s);
        System.out.print(s + "|");
    }
    return list;
}
Can someone please explain why the pipe character needs to be escaped for the split() method?
String.split expects a regular expression argument. An unescaped | is parsed as a regex meaning "empty string or empty string," which isn't what you mean.
Because that parameter to split is a regular expression, in which '|' has the special meaning of OR and '\|' means a literal '|'. So the Java string "\\|" denotes the regular expression '\|', which matches exactly the character '|'.
You can simply do this:
String[] arrayString = yourString.split("\\|");
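
The two layers of escaping, and the "empty string or empty string" reading of a bare pipe, can be checked with a few lines. This is a sketch, and the class and helper names are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the escaping layers described above.
class EscapeLayersDemo {
    // True if the whole input matches the given regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The literal "\\|" is two characters at runtime: a backslash and a pipe.
        System.out.println("\\|".length());               // prints 2
        // The escaped pattern matches a literal pipe.
        System.out.println(matchesWhole("\\|", "|"));     // prints true
        // The bare pattern means "empty or empty": it matches only the empty string.
        System.out.println(matchesWhole("|", "|"));       // prints false
        System.out.println(matchesWhole("|", ""));        // prints true
    }
}
```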
