How do I split input texts with multiple delimiters? - java

Update: The problem has been solved! Thanks everyone for helping!
I have a dataset that looks like this:
Title: The Importance of Being Earnest
A Trivial Comedy for Serious People
Author: Oscar Wilde
I would like to split the text on the whitespace characters " \t\n\r\f" (space, tab, newline, carriage return, form feed); any other characters such as ",.:" should be treated as part of the words. Is there an efficient way to split the tokens like this?
I tried this:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(), "\t");
I know you can split the tokens by a tab. Is there a way to split them by more than one delimiter?

You can use String.split() and separate all possible delimiters by "|". Note that some delimiters need to be escaped (e.g. "." -> "\\.")
Example:
String input = "123,45!67.890";
String[] splitted = input.split(",|\\.|!");
for (String split : splitted) {
System.out.println(split);
}
Output:
123
45
67
890
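For the delimiters named in the question (space, tab, newline, carriage return, form feed), a character class is shorter than listing alternatives with "|". The following is a minimal sketch, not the asker's final solution; the class name WhitespaceSplit and the method tokens are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Hypothetical helper: splits on runs of space, tab, newline,
    // carriage return, or form feed. Punctuation such as "," "." ":"
    // stays attached to the words.
    static String[] tokens(String text) {
        String trimmed = text.trim(); // avoids an empty first token
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("[ \\t\\n\\r\\f]+");
    }

    public static void main(String[] args) {
        String text = "Title: The Importance of Being Earnest\n"
                + "A Trivial Comedy\tfor Serious People";
        System.out.println(Arrays.toString(tokens(text)));
    }
}
```

StringTokenizer also accepts several delimiter characters at once: every character of its second argument is treated as a delimiter, so new StringTokenizer(value.toString(), " \t\n\r\f") would cover the same set.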

Related

Splitting text by punctuation and special cases like :) or space

I have a following string:
Hello world!!!
or
Hello world:)
Now I want to split this string into an array of strings that contains Hello,world,!,!,! or Hello,world,:)
The problem is that if there were spaces between all the parts, I could use split(" "),
but here !!! or :) is attached to the word.
I also tried this code:
String Text = "But I know. For example, the word \"can\'t\" should";
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s:Res){
System.out.println(s);
}
which I found in the question below, but it is not really helpful in my case:
Splitting strings through regular expressions by punctuation and whitespace etc in java
Can anyone help?
Seems to me like you do not want to split but rather capture certain groups. The thing with split string is that it gets rid of the parts that you split by (so if you split by spaces, you don't have spaces in your output array), therefore if you split by "!" you won't get them in your output. Possibly this would work for capturing the things that you are interested in:
/(\w+)|(!)|(:\))/g
Mind you don't use string split with it, but rather exec your regex against your string in whatever engine/language you are using. In Java it would be something like:
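import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String input = "Hello world!!!:)";
// Backslashes must be doubled inside a Java string literal
Pattern p = Pattern.compile("(\\w+)|(!)|(:\\))");
Matcher m = p.matcher(input);
List<String> matches = new ArrayList<String>();
// Each call to find() advances to the next word, "!" or ":)"
while (m.find()) {
matches.add(m.group());
}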
Your matches list will contain:
["Hello", "world", "!", "!", "!", ":)"]

Remove space except tab in a text line

I am reading an extremely messy tab-delimited text file line by line and trying to get the unique column names out of it.
The problem is that it uses tabs as the field separator, but some column names contain spaces! I am using
String[] cols = line.split("\\t");
which does not seem to work properly, since in some cases it treats the spaces as separators! Is using a regex a good solution? If so, could you advise which regex removes spaces from a string but keeps the tabs?
Data is like:
Sever ID Name
12221 zxsz
Tab in a string literal is just "\t". "\\t" is a literal backslash followed by a "t". Having said that, either method works for me:
public class Scratch2 {
public static void main(String[] args) {
String welk = "anna one\tanna two\tanna three";
System.out.println("\\t");
String[] annas = welk.split("\t");
for (String anna : annas) {
System.out.println(anna);
}
System.out.println("\\\\t");
annas = welk.split("\\t");
for (String anna : annas) {
System.out.println(anna);
}
}
}
Output:
\t
anna one
anna two
anna three
\\t
anna one
anna two
anna three
The simplest explanation is that your input strings don't contain the whitespace characters you think they do.
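One way to check what a line really contains is to make its whitespace visible before splitting. This is only a debugging sketch; the class name ShowWhitespace and the method reveal are hypothetical, and the sample line assumes a tab between "ID" and "Name".

```java
public class ShowWhitespace {
    // Hypothetical helper: replaces each whitespace character with a visible
    // tag so that tabs, spaces, and other space-like characters can be told apart.
    static String reveal(String line) {
        StringBuilder sb = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (c == '\t') {
                sb.append("<TAB>");
            } else if (c == ' ') {
                sb.append("<SP>");
            } else if (Character.isWhitespace(c) || Character.isSpaceChar(c)) {
                // Anything else that counts as white space, e.g. a no-break space
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reveal("Sever ID\tName"));
    }
}
```

If the header line prints with <SP> where a <TAB> was expected, the file is not consistently tab-delimited and the split is behaving correctly.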

Splitting string on multiple spaces in java [duplicate]

Possible Duplicate:
How to split a String by space
I need help while parsing a text file.
The text file contains data like
This is different type of file.
Can not split it using ' '(white space)
My problem is that the spacing between words is not uniform. Sometimes there is a single space and sometimes there are multiple spaces.
I need to split the string in such a way that I get only the words, not the spaces.
str.split("\\s+") would work. The + at the end of the regular expression treats multiple spaces the same as a single space. It returns an array of strings (String[]) without any " " results.
You can use quantifiers to specify the number of spaces you want to split on:
`+` - Represents 1 or more
`*` - Represents 0 or more
`?` - Represents 0 or 1
`{n,m}` - Represents n to m
So, \\s+ will split your string on one or more spaces
String[] words = yourString.split("\\s+");
Also, if you want to specify some specific numbers you can give your range between {}:
yourString.split("\\s{3,6}"); // Split String on 3 to 6 spaces
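To see the difference between the two quantifiers, here is a small sketch; the class and method names (QuantifierSplit, onAnyRun, onThreeToSix) are made up for the example.

```java
import java.util.Arrays;

public class QuantifierSplit {
    // Splits on any run of one or more whitespace characters.
    static String[] onAnyRun(String s) {
        return s.split("\\s+");
    }

    // Splits only on runs of three to six whitespace characters;
    // shorter runs stay inside the tokens.
    static String[] onThreeToSix(String s) {
        return s.split("\\s{3,6}");
    }

    public static void main(String[] args) {
        String s = "a b   c"; // one space, then three spaces
        System.out.println(Arrays.toString(onAnyRun(s)));     // [a, b, c]
        System.out.println(Arrays.toString(onThreeToSix(s))); // [a b, c]
    }
}
```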
Use a regular expression.
String[] words = str.split("\\s+");
You can use a regex pattern:
public static void main(String[] args)
{
String s="This is different type of file.";
String s1[]=s.split("[ ]+");
for(int i=0;i<s1.length;i++)
{
System.out.println(s1[i]);
}
}
Output:
This
is
different
type
of
file.
You can use the replaceAll(String regex, String replacement) method of the String class to replace multiple spaces with a single space and then use the split method. Splitting directly on "\\s+" gives the same words in one step:
String splitter="\\s+";
String[] temp;
temp=mystring.split(splitter);
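The two-step variant (collapse the spaces with replaceAll, then split on a single space) could look like the following sketch. The names CollapseThenSplit and words are hypothetical, and the sample string assumes the input has runs of plain spaces only.

```java
import java.util.Arrays;

public class CollapseThenSplit {
    // Hypothetical helper: collapses each run of spaces into one space,
    // then splits on that single space.
    static String[] words(String s) {
        String collapsed = s.trim().replaceAll(" +", " ");
        if (collapsed.isEmpty()) {
            return new String[0];
        }
        return collapsed.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("This  is   different type of file.")));
    }
}
```

This does the same work as split("\\s+") in two passes, so it is mainly useful when the collapsed string itself is needed.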
I am giving you another method to tokenize your string if you don't want to use the split method. Here is the method:
public static void main(String[] args) throws Exception
{
String str = "This is different type of file.Can not split it using ' '(white space)";
// Every space is a delimiter; StringTokenizer never returns empty tokens
StringTokenizer st = new StringTokenizer(str, " ");
while (st.hasMoreTokens())
System.out.println(st.nextToken());
}

Java parsing a string with lots of whitespace

I have a string with multiple spaces, but when I use the tokenizer it breaks it apart at all of those spaces. I need the tokens to contain those spaces. How can I utilize the StringTokenizer to return the values with the tokens I am splitting on?
You'll note in the docs for StringTokenizer that its use is discouraged in new code, and that String.split(regex) is what you want:
String foo = "this is some data in a string";
String[] bar = foo.split("\\s+");
Edit to add: Or, if you have greater needs than a simple split, then use the Pattern and Matcher classes for more complex regular expression matching and extracting.
Edit again: If you want to preserve your space, actually knowing a bit about regular expressions really helps:
String[] bar = foo.split("\\b");
This will split on word boundaries, preserving the space between each word as a String:
public static void main( String[] args )
{
String foo = "this is      some  data      in   a string";
String[] bar = foo.split("\\b");
for (String s : bar)
{
System.out.print(s);
if (s.matches("^\\s+$"))
{
System.out.println("\t<< " + s.length() + " spaces");
}
else
{
System.out.println();
}
}
}
Output:
this
<< 1 spaces
is
<< 6 spaces
some
<< 2 spaces
data
<< 6 spaces
in
<< 3 spaces
a
<< 1 spaces
string
Sounds like you may need to use regular expressions (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/package-summary.html) instead of StringTokenizer.
Use String.split("\\s+") instead of StringTokenizer.
Note that this will only extract the non-whitespace characters separated by at least one whitespace character. If you want leading or trailing whitespace included with the non-whitespace characters, that is a completely different solution!
This requirement isn't clear from your original question, and there is an edit pending that tries to clarify it.
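If the goal is to keep the whitespace attached to the tokens, one possible reading is "each word together with the whitespace that follows it". The sketch below assumes that reading; the class name KeepSpaces and the method tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeepSpaces {
    // Hypothetical helper: returns each run of non-whitespace characters
    // together with the whitespace that follows it. Whitespace before the
    // first word is not captured.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+\\s*").matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : tokens("this is      some  data")) {
            System.out.println("[" + t + "]");
        }
    }
}
```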
StringTokenizer in almost every non-contrived case is the wrong tool for the job.
I think it would be good to first use the replaceAll function to replace all runs of multiple spaces with a single space, and then tokenize using the split function.

Escape comma when using String.split

I'm trying to perform some super simple parsing of log files, so I'm using the String.split method like this:
String [] parts = input.split(",");
It works great for input like:
a,b,c
Or
type=simple, output=Hello, repeat=true
Just to say something.
How can I escape the comma, so it doesn't match intermediate commas?
For instance, if I want to include a comma in one of the parts:
type=simple, output=Hello, world, repeate=true
I was thinking of something like:
type=simple, output=Hello\, world, repeate=true
But I don't know how to create the split to avoid matching the comma.
I've tried:
String [] parts = input.split("[^\,],");
But, well, it is not working.
You can solve it using a negative lookbehind.
String[] parts = str.split("(?<!\\\\), ");
Basically it says: split on each ", " that is not preceded by a backslash.
String str = "type=simple, output=Hello\\, world, repeate=true";
String[] parts = str.split("(?<!\\\\), ");
for (String s : parts)
System.out.println(s);
Output:
type=simple
output=Hello\, world
repeate=true
If you happen to be stuck with the non-escaped comma-separated values, you could do the following (similar) hack:
String[] parts = str.split(", (?=\\w+=)");
Which says split on each ", " which is followed by some word-characters and an =
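A runnable sketch of the lookahead split on the unescaped sample line follows; LookaheadSplit and fields are hypothetical names used only for this example.

```java
import java.util.Arrays;

public class LookaheadSplit {
    // Splits on ", " only when the text that follows looks like "key=".
    // The lookahead is zero-width, so the key itself is not consumed.
    static String[] fields(String line) {
        return line.split(", (?=\\w+=)");
    }

    public static void main(String[] args) {
        String line = "type=simple, output=Hello, world, repeate=true";
        System.out.println(Arrays.toString(fields(line)));
    }
}
```

The comma inside "Hello, world" survives because "world" is followed by "," rather than "=". A value that itself contains something like ", x=" would still be split, which is why this remains a hack.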
I'm afraid there's no perfect solution for String.split. Using a matcher for the three parts would work. In case the number of parts is not constant, I'd recommend a loop with matcher.find. Something like this, maybe:
final String s = "type=simple, output=Hello, world, repeat=true";
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,|$)");
final Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group(1));
You'll probably want to skip the spaces after the comma as well:
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,\\s*|$)");
It's not really complicated, just note that you need four backslashes in order to match one.
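The four-backslash rule comes from two layers of escaping: the Java compiler turns each "\\" into one backslash, and the regex engine then reads the remaining two backslashes as one literal backslash. A small sketch (the class and method names are hypothetical):

```java
public class BackslashDemo {
    // Four backslashes in Java source become a two-character string,
    // which the regex engine reads as "one literal backslash".
    static String backslashRegex() {
        return "\\\\";
    }

    // True if the text contains a literal backslash, checked via the regex above.
    static boolean hasBackslash(String text) {
        return text.matches(".*" + backslashRegex() + ".*");
    }

    public static void main(String[] args) {
        System.out.println(backslashRegex().length());   // 2
        System.out.println(hasBackslash("Hello\\, world")); // true
        System.out.println(hasBackslash("Hello, world"));   // false
    }
}
```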
Escaping works with the opposite of aioobe's answer (updated: aioobe now uses the same construct but I didn't know that when I wrote this), negative lookbehind
final String s = "type=simple, output=Hello\\, world, repeate=true";
final String[] tokens = s.split("(?<!\\\\),\\s*");
for(final String item : tokens){
System.out.println("'" + item.replace("\\,", ",") + "'");
}
Output:
'type=simple'
'output=Hello, world'
'repeate=true'
Reference:
Pattern: Special Constructs
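You might try
input.split("[^\\\\],");
It splits at every comma that is not preceded by a backslash, but the character class also consumes the character before the comma, so each part loses its last character. A negative lookbehind, as in "(?<!\\\\),", avoids that.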
BTW if you are working with Eclipse, I can recommend the QuickRex Plugin to test and debug Regexes.
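One thing worth checking with the character-class pattern is what happens to the character in front of each comma. The sketch below compares it with the negative lookbehind on the same input; CommaSplitCompare and its method names are hypothetical.

```java
import java.util.Arrays;

public class CommaSplitCompare {
    // The character class matches (and therefore removes) the character
    // before the comma as part of the delimiter.
    static String[] withCharClass(String s) {
        return s.split("[^\\\\],");
    }

    // The lookbehind only inspects the previous character without consuming it.
    static String[] withLookbehind(String s) {
        return s.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        String s = "ab,c\\,d,ef"; // the text ab,c\,d,ef
        System.out.println(Arrays.toString(withCharClass(s)));  // [a, c\,, ef]
        System.out.println(Arrays.toString(withLookbehind(s))); // [ab, c\,d, ef]
    }
}
```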
