Splitting Java string with quotation marks [duplicate]


This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Can you recommend a Java library for reading (and possibly writing) CSV files?
I need to split a String in Java. The separator is the space character.
The string may contain paired quotation marks (with text and spaces inside); everything inside a pair of quotation marks should be treated as a single token.
Example:
Input:
token1 "token 2" token3
Output: array of 3 elements:
token1
token 2
token3
How to do it?
Thanks!

Split twice. First on quotes, then on spaces.

Assuming that the other solutions will not work for you, because they do not properly detect matching quotes or ignore spaces within quoted text, try something like:
private void addTokens(String tokenString, List<String> result) {
    String[] tokens = tokenString.split("[\\r\\n\\t ]+");
    for (String token : tokens) {
        // split() yields an empty first element when the text starts with whitespace
        if (!token.isEmpty()) {
            result.add(token);
        }
    }
}
List<String> result = new ArrayList<String>();
while (input.contains("\"")) {
    String prefixTokens = input.substring(0, input.indexOf("\""));
    input = input.substring(input.indexOf("\"") + 1);
    String literalToken = input.substring(0, input.indexOf("\""));
    // Drop the literal and its closing quote before the next pass
    input = input.substring(input.indexOf("\"") + 1);
    addTokens(prefixTokens, result);
    result.add(literalToken);
}
addTokens(input, result);
Note that this won't handle unbalanced quotes, escaped quotes, or other cases of erroneous/malformed input.
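As an alternative to repeated substring calls, the same tokens can be collected in a single pass with java.util.regex. The following is only a sketch: the class name QuotedSplitSketch and the method tokenize are made up for illustration, and it assumes quotes are balanced and that a quoted run is not glued to neighbouring text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSplitSketch {
    // Either a quoted run (group 1 holds the text without the quotes)
    // or a run of non-whitespace characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Hypothetical helper: returns the tokens in input order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("token1 \"token 2\" token3"));
    }
}
```

Like the loop above, this does not deal with escaped quotes; an unmatched quote simply ends up inside an ordinary token.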

import java.util.StringTokenizer;
class STDemo {
    static String in = "token1;token2;token3";
    public static void main(String args[]) {
        StringTokenizer st = new StringTokenizer(in, ";");
        while (st.hasMoreTokens()) {
            String val = st.nextToken();
            System.out.println(val);
        }
    }
}
This is an easy way to tokenize a string.

Related

Regex to remove pound sign and double commas java csv

I'm working with a CSV file that in places, has multiple commas and pound signs. My question is about how to remove the multiple commas and the pound signs, while leaving a single comma between fields.
The part of the task I am working on is to sort the rows of the CSV file by price, using only Java and no external libraries. The method takes a number as an input parameter and returns that many rows, ordered by price.
What I have currently is around 1000 lines of data that looks like this:
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,
I need to remove the double commas and the pound sign, but for the life of me haven't been able to get it to work.
This is the line I am using for the regex.
String currentLine = line.replaceAll("[,{2}|£]", "");
This outputs a line which looks like this:
100086 Norway Maple WayMadelleGeorgeotmgeorgeotrr#hao13.com417175.60
A larger chunk of the code looks like this and by no means is it nearly finished:
public String[] getTopProperties(int n){
String[] properties = new String[n];
String file = "data.csv";
String line = "";
String splitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
while ((line = br.readLine()) != null) {
String currentLine = line.replaceAll("[,{2}|£]", "");
System.out.println("Current line is: " + currentLine);
String[] user = currentLine.split(splitBy);
}
} catch (IOException e) {
e.printStackTrace();
}
return properties;
}
The problem is that this removes all the commas, so the fields, including the price and the ones around the double comma, now run together.
I could use some help finding a regex that keeps a single comma between fields and removes the pound sign.

You could simplify this by parsing the CSV file into a 2D array and ignoring the empty column which results from the double comma. Then parsing the currency column is a snap: just ignore the first character.
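A minimal sketch of that idea for a single line, using only the standard library. The class and method names (CsvLineSketch, fields, price) are invented for illustration, and the sketch assumes that an empty field never carries meaning and that the price is the last non-empty field. The pound sign is written as \u00A3 so the source does not depend on file encoding.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSketch {
    // Splits one line on commas and drops the empty fields left by doubled commas.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        for (String field : line.split(",")) {
            if (!field.isEmpty()) {
                out.add(field);
            }
        }
        return out;
    }

    // Parses a price such as "£307018.48" by skipping the leading currency sign.
    static double price(String field) {
        String digits = field.startsWith("\u00A3") ? field.substring(1) : field;
        return Double.parseDouble(digits);
    }

    public static void main(String[] args) {
        String line = "18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,\u00A3307018.48,";
        List<String> f = fields(line);
        System.out.println(f);
        System.out.println(price(f.get(f.size() - 1)));
    }
}
```

With the price available as a double, the rows can then be sorted with a Comparator before taking the first n.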

In your regex .replaceAll("[,{2}|£]", ""); the square brackets create a character class, so this means "replace any of the characters ,, {, 2, }, |, or £ with nothing".
What you really want is to replace the sequence ,,£ with a single comma, which would be .replaceAll(",,£", ",")
In JavaScript this would be...
var line="18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,";
console.log(' original line: ' + line);
console.log('replacement line: ' + line.replace(/,,£/, ","));
update
Converting this to Java as a stand-alone test program to demonstrate that this does work, I get the following:
public class so50419207
{
public static void main(String... args)
{
String input = "18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,";
String replaced = input.replace(",,£", ",");
System.out.println("original string: " + input);
System.out.println("replaced string: " + replaced);
}
}
Running this...
$ javac so50419207.java ; java so50419207
original string: 18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,,£307018.48,
replaced string: 18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,

I tried the regex (,,)(£)? and tested it on Ideone.
The code is below:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
final String regex = "(,,)(£)?";
final String string = "18,,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,\n"
+ "18,,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,\n"
+ "18,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,\n"
+ "18,,5 Ramsey Lane,,See,Amerighi,,samerighih#trellian.com,,£307018.48,,";
final String subst = ",";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);
System.out.println("Substitution result: " + result);
}
}
Output:
Substitution result: 18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,
18,5 Ramsey Lane,See,Amerighi,samerighih#trellian.com,307018.48,

Splitting string on spaces unless in double quotes but double quotes can have a preceding string attached

I need to split a string in Java: first remove the whitespace between quotes, then split on whitespace.
"abc test=\"x y z\" magic=\" hello \" hola"
becomes:
firstly:
"abc test=\"xyz\" magic=\"hello\" hola"
and then:
abc
test="xyz"
magic="hello"
hola
Scenario :
I receive a string like the one above as input and want to break it into the parts shown. One approach is to first remove the spaces between quotes and then split on spaces; the text attached before the opening quote complicates this. Another is to split on spaces that are not inside quotes and then remove the spaces from each piece. I tried capturing the quoted text with "\"([^\"]+)\"", but I was not able to capture just the spaces inside the quotes. I tried a few other things with no luck.

We can do this using a formal pattern matcher. The secret sauce of the answer below is the rarely used Matcher#appendReplacement method. We pause at each match and append a custom replacement for whatever appears between a pair of quotes. The custom method removeSpaces() strips all whitespace from each quoted term.
public static String removeSpaces(String input) {
return input.replaceAll("\\s+", "");
}
String input = "abc test=\"x y z\" magic=\" hello \" hola";
Pattern p = Pattern.compile("\"(.*?)\"");
Matcher m = p.matcher(input);
StringBuffer sb = new StringBuffer("");
while (m.find()) {
m.appendReplacement(sb, "\"" + removeSpaces(m.group(1)) + "\"");
}
m.appendTail(sb);
String[] parts = sb.toString().split("\\s+");
for (String part : parts) {
System.out.println(part);
}
abc
test="xyz"
magic="hello"
hola
The big caveat here, as the above comments hinted at, is that we are really using a regex engine as a rudimentary parser. To see where my solution fails, just remove one of the quotes from a quoted term. But if you are sure your input is well formed, as in the example you have shown us, this answer might work for you.

I wanted to mention Java 9's Matcher.replaceAll overload that takes a lambda:
// Find quoted strings and remove their whitespace:
s = Pattern.compile("\"[^\"]*\"").matcher(s)
.replaceAll(mr -> mr.group().replaceAll("\\s", ""));
// Turn the remaining whitespace into commas and wrap everything in braces.
s = '{' + s.trim().replaceAll("\\s+", ", ") + '}';
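To get the array of parts the question asks for rather than a braced string, the same lambda overload can be followed by a plain split. This is a sketch under the assumption that quotes are balanced; the class name QuotedWhitespaceSketch and the method split are illustrative only. Matcher.quoteReplacement is added so that a literal $ or backslash inside the quotes is not read as a group reference.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedWhitespaceSketch {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static String[] split(String s) {
        // Strip whitespace inside each quoted run.
        // Matcher.replaceAll(Function) requires Java 9 or later.
        String compact = QUOTED.matcher(s).replaceAll(
                mr -> Matcher.quoteReplacement(mr.group().replaceAll("\\s", "")));
        // Split what is left on runs of whitespace.
        return compact.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String input = "abc test=\"x y z\" magic=\" hello \" hola";
        System.out.println(Arrays.toString(split(input)));
    }
}
```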

The other answer is probably better, but I had already written this, so I will post it here ;) It takes a different approach:
public static void main(String[] args) {
String test="abc test=\"x y z\" magic=\" hello \" hola";
Pattern pattern = Pattern.compile("([^\\\"]+=\\\"[^\\\"]+\\\" )");
Matcher matcher = pattern.matcher(test);
int lastIndex=0;
while(matcher.find()) {
String[] parts=matcher.group(0).trim().split("=");
boolean newLine=false;
for (String string : parts[0].split("\\s+")) {
if(newLine)
System.out.println();
newLine=true;
System.out.print(string);
}
System.out.println("="+parts[1].replaceAll("\\s",""));
lastIndex=matcher.end();
}
System.out.println(test.substring(lastIndex).trim());
}
Result is
abc
test="xyz"
magic="hello"
hola

It sounds like you want to write a basic parser/tokenizer. My bet is that once you have something that can pretty-print this structure, you will soon want to validate that there aren't any mismatched quotation marks.
But in essence, you have a few stages for this particular problem, and Java has a built in tokenizer that can prove useful.
import java.util.LinkedList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.stream.Collectors;
public class Q50151376{
private static class Whitespace{
Whitespace(){ }
@Override
public String toString() {
return "\n";
}
}
private static class QuotedString {
public final String string;
QuotedString(String string) {
this.string = "\"" + string.trim() + "\"";
}
@Override
public String toString() {
return string;
}
}
public static void main(String[] args) {
String test = "abc test=\"x y z\" magic=\" hello \" hola";
StringTokenizer tokenizer = new StringTokenizer(test, "\"");
boolean inQuotes = false;
List<Object> out = new LinkedList<>();
while (tokenizer.hasMoreTokens()) {
final String token = tokenizer.nextToken();
if (inQuotes) {
out.add(new QuotedString(token));
} else {
out.addAll(TokenizeWhitespace(token));
}
inQuotes = !inQuotes;
}
System.out.println(joinAsStrings(out));
}
private static String joinAsStrings(List<Object> out) {
return out.stream()
.map(Object::toString)
.collect(Collectors.joining());
}
public static List<Object> TokenizeWhitespace(String in){
List<Object> out = new LinkedList<>();
StringTokenizer tokenizer = new StringTokenizer(in, " ", true);
boolean ignoreWhitespace = false;
while (tokenizer.hasMoreTokens()){
String token = tokenizer.nextToken();
boolean whitespace = token.equals(" ");
if(!whitespace){
out.add(token);
ignoreWhitespace = false;
} else if(!ignoreWhitespace) {
out.add(new Whitespace());
ignoreWhitespace = true;
}
}
return out;
}
}

How to use regex with String.split()

I have the following String:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
I want to convert it to an array of String which will look like this.
String[] Title = {"Title1 Title2","Title3 Title4","Title5 Title6","Title7"};
I am trying the following code.
String[] Title=fullPDFContex.split("\r\n\r\n|\r\n \r\n|\r\n");
But not getting the desired output.

You need to split with a pattern that matches any amount of whitespace that contains a line break:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
String separator = "\\p{javaWhitespace}*\\R\\p{javaWhitespace}*";
String results[] = fullPDFContex.split(separator);
System.out.println(Arrays.toString(results));
// => [Title1 Title2, Title3 Title4, Title5 Title6, Title7]
See the Java demo.
The \\p{javaWhitespace}*\\R\\p{javaWhitespace}* matches
\\p{javaWhitespace}* - 0+ whitespaces
\\R - a line break (you may replace it with [\r\n] for Java 7 and older)
\\p{javaWhitespace}* - 0+ whitespaces.
Alternatively, you may use a slightly more efficient pattern:
String separator = "[\\s&&[^\r\n]]*\\R\\s*";
See another demo
Unfortunately, the \R construct cannot be used in the character classes. The pattern will match:
[\\s&&[^\r\n]]* - zero or more whitespace chars other than CR and LF (character class subtraction is used here)
\\R - a line break
\\s* - any 0+ whitespace chars.

Here is a solution that uses StringTokenizer and collects the split values in a list, which helps when you do not know in advance how many values the string will yield.
package com.sujit;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class UserInput {
public static void main(String[] args) {
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
StringTokenizer token = new StringTokenizer(fullPDFContex, "\r\n");
List<String> list = new ArrayList<>();
while (token.hasMoreTokens()) {
list.add(token.nextToken());
}
for (String string : list) {
System.out.println(string);
}
}
}

With this code you get the output you want:
String[] Title = fullPDFContex.split(" *(\r\n ?)+ *");

Best way to convert list to comma separated string in java [duplicate]

This question already has answers here:
What's the best way to build a string of delimited items in Java?
(37 answers)
Closed 9 years ago.
I have Set<String> result & would like to convert it to comma separated string. My approach would be as shown below, but looking for other opinion as well.
List<String> slist = new ArrayList<String> (result);
StringBuilder rString = new StringBuilder();
Separator sep = new Separator(", ");
//String sep = ", ";
for (String each : slist) {
rString.append(sep).append(each);
}
return rString;

Since Java 8:
String.join(",", slist);
From Apache Commons library:
import org.apache.commons.lang3.StringUtils;
Use:
StringUtils.join(slist, ',');
Another similar question and answer here
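Since String.join accepts any Iterable<? extends CharSequence>, the Set from the question can be passed directly without copying it into a List first. The sketch below also shows Collectors.joining for cases where a prefix and suffix are wanted. The class name JoinSketch and its method names are made up for illustration, and a LinkedHashSet is used only so that the iteration order is predictable.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

class JoinSketch {
    // Joins the elements in the set's iteration order.
    static String joinPlain(Set<String> items) {
        return String.join(", ", items);
    }

    // Collectors.joining can also add a prefix and a suffix.
    static String joinBracketed(Set<String> items) {
        return items.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        Set<String> result = new LinkedHashSet<>(Arrays.asList("A", "B", "C"));
        System.out.println(joinPlain(result));
        System.out.println(joinBracketed(result));
    }
}
```

With a HashSet the elements come out in an unspecified order, so sort first if the order matters.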

You could count the total length of the string first, and pass it to the StringBuilder constructor. And you do not need to convert the Set first.
Set<String> abc = new HashSet<String>();
abc.add("A");
abc.add("B");
abc.add("C");
String separator = ", ";
int total = abc.size() * separator.length();
for (String s : abc) {
total += s.length();
}
StringBuilder sb = new StringBuilder(total);
for (String s : abc) {
sb.append(separator).append(s);
}
String result = sb.substring(separator.length()); // remove leading separator

The Separator you are using is a UI component. You would be better off using a simple String sep = ", ".

string tokenizer in Java

I have a text file which contains data separated by '|'. I need to get each field (separated by '|') and process it. A line of the text file looks like this:
ABC|DEF||FGHT
I am using StringTokenizer (JDK 1.4) to get each field value. The problem is that I should get an empty string after DEF, but I am not getting the empty field between DEF and FGHT.
My result should be - ABC,DEF,"",FGHT but I am getting ABC,DEF,FGHT

From StringTokenizer documentation :
StringTokenizer is a legacy class that
is retained for compatibility reasons
although its use is discouraged in new
code. It is recommended that anyone
seeking this functionality use the
split method of String or the
java.util.regex package instead.
The following code should work :
String s = "ABC|DEF||FGHT";
String[] r = s.split("\\|");
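One detail worth knowing when switching to split: empty strings between delimiters are kept, but trailing empty strings are dropped unless a negative limit is passed. A small sketch follows; the class name SplitEmptyFieldsSketch and the method fields are illustrative only.

```java
import java.util.Arrays;

class SplitEmptyFieldsSketch {
    // Splits on a literal '|' and keeps every empty field, including trailing ones.
    static String[] fields(String line) {
        // A negative limit tells split() not to discard trailing empty strings.
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("ABC|DEF||FGHT")));
        // Compare the array lengths with and without the limit for a line ending in "||".
        System.out.println(fields("ABC|DEF||").length);
        System.out.println("ABC|DEF||".split("\\|").length);
    }
}
```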

Use the returnDelims flag and check two subsequent occurrences of the delimiter:
String str = "ABC|DEF||FGHT";
String delim = "|";
StringTokenizer tok = new StringTokenizer(str, delim, true);
boolean expectDelim = false;
while (tok.hasMoreTokens()) {
String token = tok.nextToken();
if (delim.equals(token)) {
if (expectDelim) {
expectDelim = false;
continue;
} else {
// unexpected delim means empty token
token = null;
}
}
System.out.println(token);
expectDelim = true;
}
this prints
ABC
DEF
null
FGHT
The API isn't pretty and is therefore considered legacy (i.e. "almost obsolete"). Use it only where pattern matching is too expensive (which should only be the case for extremely long strings) or where an API expects an Enumeration.
In case you switch to String.split(String), make sure to quote the delimiter. Either manually ("\\|") or automatically using string.split(Pattern.quote(delim));
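A short sketch of the automatic quoting, wrapped in a helper so that the delimiter can be any literal string. The names QuotedDelimiterSketch and splitLiteral are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedDelimiterSketch {
    // Treats delim as literal text, so regex metacharacters such as '|' or '.'
    // need no manual escaping.
    static String[] splitLiteral(String text, String delim) {
        return text.split(Pattern.quote(delim));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("ABC|DEF||FGHT", "|")));
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
    }
}
```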

StringTokenizer ignores empty elements. Consider using String.split, which is also available in 1.4.
From the javadocs:
StringTokenizer is a legacy class that
is retained for compatibility reasons
although its use is discouraged in new
code. It is recommended that anyone
seeking this functionality use the
split method of String or the
java.util.regex package instead.

You can use the constructor that takes an extra 'returnDelims' boolean and pass true to it.
This way you will receive the delimiters, which allows you to detect this condition.
Alternatively, you can implement your own string tokenizer that does what you need; it's not that hard.
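A hand-written tokenizer along those lines might look like the sketch below. It is not taken from any library: the class name SimpleTokenizerSketch and the method tokenize are invented for illustration, and it assumes a single-character delimiter. It uses generics for brevity, so on JDK 1.4 the raw List type would have to be used instead.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleTokenizerSketch {
    // Splits on a single delimiter character and keeps empty fields,
    // including leading and trailing ones.
    static List<String> tokenize(String text, char delim) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int pos;
        while ((pos = text.indexOf(delim, start)) >= 0) {
            tokens.add(text.substring(start, pos));
            start = pos + 1;
        }
        // Whatever follows the last delimiter is the final field.
        tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ABC|DEF||FGHT", '|'));
    }
}
```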

Here is another way to solve this problem
String str = "ABC|DEF||FGHT";
StringTokenizer s = new StringTokenizer(str,"|",true);
String currentToken="",previousToken="";
while(s.hasMoreTokens())
{
//Get the current token from the tokenize strings
currentToken = s.nextToken();
//Check for the empty token in between ||
if(currentToken.equals("|") && previousToken.equals("|"))
{
//We denote the empty token so we print null on the screen
System.out.println("null");
}
else
{
//We only print the tokens except delimiters
if(!currentToken.equals("|"))
System.out.println(currentToken);
}
previousToken = currentToken;
}

Here is a way to split a string into tokens (a token is one or more letters)
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
String s = scan.nextLine();
s = s.replaceAll("[^A-Za-z]", " ");
StringTokenizer arr = new StringTokenizer(s, " ");
int n = arr.countTokens();
System.out.println(n);
while(arr.hasMoreTokens()){
System.out.println(arr.nextToken());
}
scan.close();
}

package com.java.String;
import java.util.StringTokenizer;
public class StringWordReverse {
public static void main(String[] kam) {
String s;
String sReversed = "";
System.out.println("Enter a string to reverse");
s = "THIS IS ASHIK SKLAB";
StringTokenizer st = new StringTokenizer(s);
while (st.hasMoreTokens()) {
sReversed = st.nextToken() + " " + sReversed;
}
System.out.println("Original string is : " + s);
System.out.println("Reversed string is : " + sReversed);
}
}
Output:
Enter a string to reverse
Original string is : THIS IS ASHIK SKLAB
Reversed string is : SKLAB ASHIK IS THIS
