Java: CSV parser - skipping quotes

Is there any way in Java to parse a CSV file (with a variable number of columns) into a set of List<String> using a CSV parser (e.g. SuperCSV) without stripping the quotes? For the input:
id,name,text,sth
1,"John","Text with 'c,o,m,m,a,s' and \"",qwerty
2,Bob,"",,sth
after parsing, I'd like the set to contain the same text as the input, instead of:
id,name,text,sth
1,John,Text with 'c,o,m,m,a,s' and \",qwerty
2,Bob,null,null,sth
That is, the element
"John" should be parsed to the string "John" (instead of John)
"" --> ""
,, --> ,null,
etc.
I already asked about this here, but I probably didn't make it clear enough.
I want to parse a CSV file into a set of List<String>, do something with it, and print it to stdout, leaving the quotes where they were. Please help me.

Something like this? Not using any existing parser, doing it from scratch:
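// Requires: import java.util.ArrayList; import java.util.List;
public List<String> parse(String st) {
    List<String> result = new ArrayList<String>();
    boolean inText = false; // true while we are inside a quoted field
    StringBuilder token = new StringBuilder();
    char prevCh = 0;
    for (int i = 0; i < st.length(); i++) {
        char ch = st.charAt(i);
        if (ch == ',' && !inText) {
            // A comma outside quotes ends the current field.
            result.add(token.toString());
            token = new StringBuilder();
            continue;
        }
        if (ch == '"' && inText) {
            if (prevCh == '\\') {
                // Escaped quote: drop the backslash, keep the quote.
                token.deleteCharAt(token.length() - 1);
            } else {
                inText = false; // closing quote
            }
        } else if (ch == '"' && !inText) {
            inText = true; // opening quote
        }
        token.append(ch); // quotes are appended too, so they survive parsing
        prevCh = ch;
    }
    result.add(token.toString()); // last field has no trailing comma
    return result;
}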
Then
String st = "1,\"John\",\"Text with 'c,o,m,m,a,s' and \\\"\",qwerty";
List<String> result = parse(st);
System.out.println(result);
Will print out:
[1, "John", "Text with 'c,o,m,m,a,s' and "", qwerty]
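If the goal is to print the fields back exactly as they appeared in the input, a variation of the same idea is to copy every character untouched, including the backslash in front of an escaped quote. The following is only a minimal sketch under that assumption; the class name RawCsvSplitter and the method splitRaw are made up for illustration, and it treats a backslash inside quotes as the escape character, as in the question's sample data.

```java
import java.util.ArrayList;
import java.util.List;

public class RawCsvSplitter {

    /** Splits one CSV line on commas outside quotes, keeping each field's raw text. */
    public static List<String> splitRaw(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '\\' && inQuotes && i + 1 < line.length()) {
                // Backslash escape inside quotes: copy both characters unchanged.
                field.append(ch).append(line.charAt(++i));
            } else if (ch == '"') {
                inQuotes = !inQuotes;
                field.append(ch); // keep the quote itself
            } else if (ch == ',' && !inQuotes) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(ch);
            }
        }
        fields.add(field.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "1,\"John\",\"Text with 'c,o,m,m,a,s' and \\\"\",qwerty";
        List<String> fields = splitRaw(line);
        // Joining the raw fields with commas reproduces the input line.
        System.out.println(String.join(",", fields).equals(line)); // prints true
    }
}
```

With this sketch an empty field comes back as an empty string rather than null, so joining the list with commas gives back the original line.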

I have used this one:
http://opencsv.sourceforge.net/
and I was pretty satisfied with the results. I had a bunch of differently organized CSV files (it's sometimes funny what kinds of things people call CSV these days), and I managed to set up the reader for all of them. However, I don't think it will generate the commas for you, but it will leave blanks where there is an empty field. Since you can fetch the whole line as an array, you can iterate over it and put a comma between the elements.
Look up the settings; there are a bunch of them, including the quote character.

Related

Regex to Extract Arabic text Using Java

I want to extract only the Arabic text from a file that contains a lot of non-Arabic text and other elements (e.g. English, emoji, numbers, etc.) using a regex. I found many tutorials here and they work, but the problem is that the words come out joined together. For example:
String text = "123 اهلين و سهلين";
after applying regex
output:
"اهلينوسهلين"
The output I want:
"اهلين و سهلين"
I tried so many ways to solve this including:
"\\p{InArabic}+(?:\\s+\\p{InArabic}+)*"
"(?:[\\u0600-\\u06FF]+(?:\\s+[\\u0600-\\u06FF]+)*)"
"^[\\p\\{Arabic\\}\\s]+$"
But I was unable to get the result I need, even though others, judging by their questions, got the output structure I want using these regexes.
My code:
String regex = "\\p{InArabic}+";
String outString;
String cleaned = "";
Scanner in = new Scanner(new FileReader(path+"tweets.txt"));
StringBuilder sb = new StringBuilder();
while(in.hasNext()) {
sb.append(in.next());
}
in.close();
outString = sb.toString();
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE |
Pattern.UNICODE_CHARACTER_CLASS);
final Matcher matcher = pattern.matcher(outString);
while (matcher.find()) {
cleaned = cleaned +" "+ matcher.group();
}
I ran my code on another text file and it worked: it showed the right output in the right format. So I think the problem is with the text file I'm running the code on, which contains tweets retrieved using twitter4j. Perhaps there's a problem with that?

This outputs exactly the desired text in your question:
text.replaceAll("[^\\p{InARABIC} ]", "").trim()
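This uses a negated character class built on the Unicode block property for Arabic (\p{InArabic}) to remove every character that is neither in the Arabic block nor a space, and adds a call to trim().
If you absolutely must use a single regex (i.e. no call to trim()):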
text.replaceAll("^[\\P{InARABIC}\\d ]*|[\\P{InARABIC} ]*$", "")
This code:
System.out.println(" اهلين و سهلين 123".replaceAll("[^\\p{InARABIC} ]", "").trim());
Outputs:
اهلين و سهلين
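A likely cause of the joined letters in the question's own code is the reading loop: Scanner.next() returns whitespace-delimited tokens, so appending them one after another drops every space before the regex ever runs. Reading whole lines (for example with nextLine()) keeps the spaces. The sketch below then extracts the Arabic runs with a Matcher; the class name ArabicExtractor and the method extract are illustrative only, and the sample text is written with Unicode escapes so that the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArabicExtractor {

    // A run of Arabic-block words, including the whitespace between them.
    private static final Pattern ARABIC_RUN =
            Pattern.compile("\\p{InArabic}+(?:\\s+\\p{InArabic}+)*");

    /** Returns the Arabic runs found in text, joined by single spaces. */
    public static String extract(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = ARABIC_RUN.matcher(text);
        while (m.find()) {
            if (out.length() > 0) {
                out.append(' '); // separate runs that were split by other text
            }
            out.append(m.group());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "123 " followed by the three-word sample phrase from the question.
        String sample = "123 \u0627\u0647\u0644\u064A\u0646 \u0648 "
                + "\u0633\u0647\u0644\u064A\u0646";
        System.out.println(extract(sample));
    }
}
```

Because the pattern allows whitespace between Arabic words, the spaces inside a run are kept as they appear in the input.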

Try the regex [^\u0600-\u06FF\\s]+, which removes every character that is neither an Arabic character nor whitespace:
String text = "123 اهلين و سهلين, Welcome, Bienvenue, Hola";
text = text.replaceAll("[^\u0600-\u06FF\\s]+", "");
Output
اهلين و سهلين
You can also call trim() at the end to remove the leading and trailing whitespace:
text = text.replaceAll("[^\u0600-\u06FF\\s]+", "").trim();

public class HelloWorld
{
public static void main(String []args)
{
System.out.println("Hello World");
System.out.println (containsArabicLetters("بسيب سيبيس سيبسيبسي سشسشس"));
}
public static boolean containsArabicLetters(String text)
{
char[] ch1 = text.replaceAll(" ", "").toCharArray();
for (char c:ch1)
{
if (c >= 0x600 && c <= 0x6ff)
continue;
if (c >= 0x750 && c <= 0x77f)
continue;
if (c >= 0xfb50 && c <= 0xfc3f)
continue;
if (c >= 0xfe70 && c <= 0xfefc)
continue;
else
return false;
}
return true;
}
}

How to merge many List<String> elements in one based on double quote delimiter in java

I have a CSV file generated on another platform (Salesforce). By default, Salesforce does not seem to handle line breaks in some large text fields when it generates the file, so my CSV file has some rows broken across several lines, like this, which I need to fix:
"column1","column2","my column with text
here the text continues
more text in the same field
here we finish this","column3","column4"
The same idea, expressed with this piece of code:
List<String> listWords = new ArrayList<String>();
listWords.add("\"Hi all");
listWords.add("This is a test");
listWords.add("of how to remove");
listWords.add("");
listWords.add("breaklines and merge all in one\"");
listWords.add("\"This is a new Line with the whole text in one row\"");
In this case I would like to merge the elements. My first approach was to check for lines whose last character is not a double quote ("), concatenate the next line, and carry on like that until we reach a line whose last character is a double quote.
This is a non-working sample of what I was trying to achieve, but I hope it gives you an idea:
String[] csvLines = csvContent.split("\n");
Integer iterator = 0;
String mergedRows = "";
for(String row:csvLines){
newCsvfile.add(row);
if(row != null){
if(!row.isEmpty()){
String lastChar = String.valueOf(row.charAt(row.length()-1));
if(!lastChar.contains("\"")){
//row += row+" "+csvLines[iterator+1].replaceAll("\r", "").replaceAll("\n", "").replaceAll("","").replaceAll("\r\n?|\n", "");
mergedRows += row+" "+csvLines[iterator+1].replaceAll("\r", "").replaceAll("\n", "").replaceAll("","").replaceAll("\r\n?|\n", "");
row = mergedRows;
csvLines[iterator+1] = null;
}
}
newCsvfile.add(row);
}
iterator++;
}
My final result should look like (based on the list sample):
"Hi all This is a test of how to remove break lines and merge all in one"
"This is a new Line with the whole text in one row".
What is the best approach to achieve this?

In case you don't want to use a CSV reading library as @RealSkeptic suggested...
Going from your listWords to your expected solution is fairly simple:
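List<String> listSentences = new ArrayList<>();
StringBuilder tmp = new StringBuilder();
for (String s : listWords) {
    if (s.isEmpty()) {
        continue; // skip blank lines so they don't add extra spaces
    }
    if (tmp.length() > 0) {
        tmp.append(' '); // separate merged lines with a single space
    }
    tmp.append(s);
    if (s.endsWith("\"")) { // a trailing quote closes the sentence
        listSentences.add(tmp.toString());
        tmp.setLength(0);
    }
}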
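Checking only the last character works for the listWords sample, but in the real CSV a physical line can end with a quote while a field is still open, or end a record without a quote. A different technique, sketched here as an assumption rather than as part of the answer above, is to track quote parity: a record is complete once every double quote seen so far has been closed. The names RecordMerger and mergeBrokenLines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordMerger {

    /** Merges physical lines into records; a record ends when all quotes are balanced. */
    public static List<String> mergeBrokenLines(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : lines) {
            // Each double quote toggles the state; a doubled quote ("") toggles twice.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    insideQuotes = !insideQuotes;
                }
            }
            String piece = line.trim();
            if (!piece.isEmpty()) {
                if (current.length() > 0) {
                    current.append(' '); // replace the line break with one space
                }
                current.append(piece);
            }
            if (!insideQuotes && current.length() > 0) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // unterminated last record, kept as-is
        }
        return records;
    }
}
```

Applied to the listWords sample from the question, this yields two records: the first five lines merged into one sentence, and the last line on its own.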

StringIndexOutOfBoundsException when using delimiter

I want to split a string into multiple parts based on parentheses. So if I have the following string:
In fair (*NAME OF A CITY), where we lay our (*NOUN),
The string should be split as:
In fair
*NAME OF A CITY
, where we lay our
*NOUN
I set up a delimiter like so:
String delim = "[()]";
String [] inputWords = line.split (delim);
Because the strings in all caps with an * at the beginning are going to be replaced with user input, I set up a loop like so:
while (input.hasNextLine())
{
line = input.nextLine();
String [] inputWords = line.split (delim);
for (int i = 0; i < inputWords.length; i++)
{
if (inputWords[i].charAt(0) != '*')
{
newLine.append (inputWords[i]);
}
else
{
String userWord = JOptionPane.showInputDialog (null, inputWords[i].substring (1, inputWords[i].length()));
newLine.append (userWord);
}
}
output.println (newLine.toString());
output.flush();
newLine.delete (0, line.length());
}
Looks like I'm getting an error with this if statement:
if (inputWords[i].charAt(0) != '*')
When I run it, I get a StringIndexOutOfBoundsException: String index out of range: 0. Not sure why that's happening. Any advice? Thank you!

Apparently line = input.nextLine(); gives you an empty string, as @Marco already mentioned.
Handle empty lines before processing further.
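Note that an empty string can also come out of split itself: when a line starts with a parenthesis, or two parentheses are adjacent, split("[()]") produces an empty token, and charAt(0) on it throws the same exception. A minimal sketch of the guard follows; the class PlaceholderSplitter and its method are made up for illustration and only collect the placeholder names instead of prompting the user.

```java
import java.util.ArrayList;
import java.util.List;

public class PlaceholderSplitter {

    /** Returns the names of all *PLACEHOLDER tokens found between parentheses. */
    public static List<String> placeholders(String line) {
        List<String> names = new ArrayList<>();
        for (String part : line.split("[()]")) {
            // Check for emptiness first: charAt(0) on "" throws
            // StringIndexOutOfBoundsException.
            if (!part.isEmpty() && part.charAt(0) == '*') {
                names.add(part.substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(placeholders(
                "In fair (*NAME OF A CITY), where we lay our (*NOUN),"));
        // prints [NAME OF A CITY, NOUN]
        System.out.println(placeholders("")); // prints []
    }
}
```

Using part.startsWith("*") instead of charAt(0) would avoid the explicit emptiness check, since startsWith is safe on an empty string.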

How to disguise escape character - \" within a string

I am facing a little difficulty with a Syntax highlighter that I've made and is 90% complete. What it does is that it reads in the text from the source of a .java file, detects keywords, comments, etc and writes a (colorful) output in an HTML file. Sample output from it is:
(I couldn't upload a whole html page, so this is a screenshot.) As (I hope) you can see, my program seems to work correctly with keywords, literals and comments (see below) and hence can normally document almost all programs. But it seems to break apart when I store the escape sequence for " i.e. \" inside a String. An error case is shown below:
The string literal highlighting doesn't stop at the end of the literal, but continues until it finds another cue, like a keyword or another literal.
So, the question is how do I disguise/hide/remove this \" from within a String?
The stringFilter method of my program is:
public String stringFilter(String line) {
if (line == null || line.equals("")) {
return "";
}
StringBuffer buf = new StringBuffer();
if (line.indexOf("\"") <= -1) {
return keywordFilter(line);
}
int start = 0;
int startStringIndex = -1;
int endStringIndex = -1;
int tempIndex;
//Keep moving through String characters until we want to stop...
while ((tempIndex = line.indexOf("\"")) > -1 && !isInsideString(line, tempIndex)) {
//We found the beginning of a string
if (startStringIndex == -1) {
startStringIndex = 0;
buf.append( stringFilter(line.substring(start,tempIndex)) );
buf.append("</font>");
buf.append(literal).append("\"");
line = line.substring(tempIndex+1);
}
//Must be at the end
else {
startStringIndex = -1;
endStringIndex = tempIndex;
buf.append(line.substring(0,endStringIndex+1));
buf.append("</font>");
buf.append(normal);
line = line.substring(endStringIndex+1);
}
}
buf.append( keywordFilter(line) );
return buf.toString();
}
EDIT
in response to the first few comments and answers, here's what I tried:
A snippet from htmlFilter(String), but it doesn't work :(
//replace '&' i.e. ampersands with HTML escape sequence for ampersand.
line = line.replaceAll("&", "&amp;");
//line = line.replaceAll(" ", " ");
line = line.replaceAll("" + (char)35, "#");
// replace less-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll("<", "&lt;");
// replace greater-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll(">", "&gt;");
line = multiLineCommentFilter(line);
//replace the '\\' i.e. escape for backslash with HTML escape sequences.
//fixes a problem when backslashes precede quotes.
//line = line.replaceAll("\\\"", "\"");
//line = line.replaceAll("" + (char)92 + (char)92, "\\");
return line;

My idea is that when a backslash is met, ignore the next character.
String str = "blah\"blah\\blah\n";
int index = 0;
while (true) {
// find the beginning
while (index < str.length() && str.charAt(index) != '\"')
index++;
int beginIndex = index;
if (index == str.length()) // no string found
break;
index++;
// find the ending
while (index < str.length()) {
if (str.charAt(index) == '\\') {
// escape, ignore the next character
index += 2;
} else if (str.charAt(index) == '\"') {
// end of string found
System.out.println(beginIndex + " " + index);
break;
} else {
// plain content
index++;
}
}
if (index >= str.length())
throw new IllegalArgumentException(
"String literal is not properly closed by a double-quote");
index++;
}

Check the character at tempIndex-1; if it is \, don't treat the quote as the beginning or end of a string.
String originalLine=line;
if ((tempIndex = originalLine.indexOf("\"", tempIndex + 1)) > -1) {
if (tempIndex==0 || originalLine.charAt(tempIndex - 1) != '\\') {
...

Steps to follow:
First replace all \" with some temp string such as
String tempStr="forward_slash_followed_by_double_quote";
line = line.replaceAll("\\\\\"", tempStr);
//line = line.replaceAll("\\\"", tempStr);
Do whatever you are doing
Finally replace that temp string with \"
line = line.replaceAll(tempStr, "\\\\\"");
//line = line.replaceAll(tempStr, "\\\"");

The trouble with finding a quote and then trying to work out whether it's escaped is that it's not enough to simply look at the previous character to see if it's a backslash - consider
String basedir = "C:\\Users\\";
where the \" isn't an escaped quote, but is actually an escaped backslash followed by an unescaped quote. In general a quote preceded by an odd number of backslashes is escaped, one preceded by an even number of backslashes isn't.
A more sensible approach would be to parse through the string one character at a time from left to right rather than trying to jump ahead to quote characters. If you don't want to have to learn a proper parser generator like JavaCC or antlr then you can tackle this case with regular expressions using the \G anchor (to force each subsequent match to start at the end of the previous one with no gaps) - if we assume that str is a substring of your input starting with the character following the opening quote of a string literal then
Pattern p = Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");
StringBuilder buf = new StringBuilder();
Matcher m = p.matcher(str);
while(m.find()) buf.append(m.group());
will leave buf containing the content of the string literal up to but not including the closing quote, and will handle escapes like \", \\ and unicode escapes \uNNNN.
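To show how this pattern could plug into a highlighter, here is a small sketch that returns the index of the closing quote for a literal that opens at a given position. The class LiteralScanner and the method closingQuote are hypothetical names, and the sketch assumes the literal sits on a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralScanner {

    // One unit of string-literal content: a unicode escape, any other
    // backslash escape, or a plain character that is neither quote nor backslash.
    private static final Pattern BODY =
            Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");

    /** Index of the quote closing the literal opened at openQuote, or -1 if unclosed. */
    public static int closingQuote(String line, int openQuote) {
        String rest = line.substring(openQuote + 1);
        Matcher m = BODY.matcher(rest);
        int end = 0;
        while (m.find()) {
            end = m.end(); // \G forces each match to start where the last one ended
        }
        int close = openQuote + 1 + end;
        boolean closed = close < line.length() && line.charAt(close) == '"';
        return closed ? close : -1;
    }

    public static void main(String[] args) {
        // The line of source text being scanned is: String s = "a\"b" + x;
        String source = "String s = \"a\\\"b\" + x;";
        int open = source.indexOf('"');
        System.out.println(source.substring(open, closingQuote(source, open) + 1));
        // prints "a\"b"
    }
}
```

The highlighter can then emit the opening font tag at the opening quote and the closing tag right after the returned index, without ever mistaking an escaped quote for the end of the literal.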

Use double slash "\\"" instead of "\""... Maybe it works...

Equivalent to StringTokenizer with multiple characters delimiters

I am trying to split a String into tokens.
The token delimiters are not single characters, some delimiters are contained in others (for example, & and &&), and I need the delimiters to be returned as tokens.
StringTokenizer cannot deal with multi-character delimiters. I presume it's possible with String.split, but I can't work out the magic regular expression that suits my needs.
Any ideas?
Example:
Token delimiters: "&", "&&", "=", "=>", " "
String to tokenize: a & b&&c=>d
Expected result: a string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"
--- Edit ---
Thanks to all for your help; Dasblinkenlight gave me the solution. Here is the "ready to use" code I wrote with his help:
private static String[] wonderfulTokenizer(String string, String[] delimiters) {
// First, create a regular expression that matches the union of the delimiters
// Be aware that when one delimiter contains another (for example && and &),
// the longer one must come before the shorter one (&& before &), or the regex
// engine will recognize && as two &.
Arrays.sort(delimiters, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return -o1.compareTo(o2);
}
});
// Build a string that will contain the regular expression
StringBuilder regexpr = new StringBuilder();
regexpr.append('(');
for (String delim : delimiters) { // For each delimiter
if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
for (int i = 0; i < delim.length(); i++) {
// Escape every character with a backslash (fine for punctuation delimiters,
// but not valid for delimiters containing letters or digits)
regexpr.append('\\');
regexpr.append(delim.charAt(i));
}
}
regexpr.append(')'); // Close the union
Pattern p = Pattern.compile(regexpr.toString());
// Now, search for the tokens
List<String> res = new ArrayList<String>();
Matcher m = p.matcher(string);
int pos = 0;
while (m.find()) { // While there's a delimiter in the string
if (pos != m.start()) {
// If there's something between the current and the previous delimiter
// Add it to the tokens list
res.add(string.substring(pos, m.start()));
}
res.add(m.group()); // add the delimiter
pos = m.end(); // Remember end of delimiter
}
if (pos != string.length()) {
// If it remains some characters in the string after last delimiter
// Add this to the token list
res.add(string.substring(pos));
}
// Return the result
return res.toArray(new String[res.size()]);
}
If you have many strings to tokenize, this could be optimized by creating the Pattern only once.
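A sketch of that optimization, under a few assumptions of my own: the pattern is compiled once in a constructor, delimiters are ordered by length (longest first) so that && wins over &, and Pattern.quote is used instead of escaping each character by hand, which also makes delimiters containing letters or digits safe. The class name DelimiterTokenizer is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DelimiterTokenizer {

    private final Pattern pattern; // compiled once, reused for every input

    public DelimiterTokenizer(String... delimiters) {
        String regex = Arrays.stream(delimiters)
                // Longest delimiters first, so "&&" is tried before "&".
                .sorted(Comparator.<String>comparingInt(String::length).reversed())
                // Quote each delimiter so regex metacharacters are taken literally.
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile(regex);
    }

    /** Splits input into tokens, returning the delimiters as tokens too. */
    public List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int pos = 0;
        while (m.find()) {
            if (pos != m.start()) {
                tokens.add(input.substring(pos, m.start())); // text before the delimiter
            }
            tokens.add(m.group()); // the delimiter itself
            pos = m.end();
        }
        if (pos != input.length()) {
            tokens.add(input.substring(pos)); // text after the last delimiter
        }
        return tokens;
    }

    public static void main(String[] args) {
        DelimiterTokenizer t = new DelimiterTokenizer("&", "&&", "=", "=>", " ");
        System.out.println(t.tokenize("a & b&&c=>d"));
        // prints [a,  , &,  , b, &&, c, =>, d]
    }
}
```

Sorting by length rather than reverse alphabetical order is a design choice of this sketch; both put a longer delimiter ahead of any delimiter that is its prefix.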

You can use the Pattern and a simple loop to achieve the results that you are looking for:
List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
if (pos != m.start()) {
res.add(s.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != s.length()) {
res.add(s.substring(pos));
}
for (String t : res) {
System.out.println("'"+t+"'");
}
This produces the result below:
's'
'='
'a'
'&'
'=>'
'b'

Split won't do it for you, as it removes the delimiters. You probably need to tokenize the string yourself (e.g. with a for loop) or use a framework like
http://www.antlr.org/

Try this:
String test = "a & b&&c=>d=A";
String regEx = "(&[&]?|=[>]?)";
String[] res = test.split(regEx);
for(String s : res){
System.out.println("Token: "+s);
}
I added the '=A' at the end to show that it is also parsed.
As mentioned in another answer, if you need the atypical behaviour of keeping the delimiters in the result, you will probably need to write your parser yourself... but in that case you really have to think about what a "delimiter" is in your code.
