Parsing comma-separated values containing quoted commas and newlines - java

I have a string that contains some special characters.
The aim is to retrieve a String[] of the comma-separated values.
The special character is the double quote ("): a quoted value can contain \n and , characters.
For example, the main string is:
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL
Titi",God," timmy, tomy,tony,
tini".
You can see that there are \n characters inside the quotes.
Can anyone help me parse this?
Thanks
More explanation:
From the main string I need to separate these values:
Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie,KameL,Titi
God
timmy, tomy,tony,tini
The problem is that for Julie,KameL,Titi there is a line break (\n) between KameL and Titi.
There is a similar problem for timmy, tomy,tony,tini: there is a line break (\n) between tony and tini.
Now this text is in a file (reading line by line is compulsory):
Alpha,Beta Charli,Delta,Delta Echo ,Frank George,Henry
1234-5,"Ida, John
", 25/11/1964, 15/12/1964,"40,000,000.00",0.0975,2,"King, Lincoln
",Mary / New York,123456
12543-01,"Ocean, Peter
Output (I want to remove the lone " entries):
Alpha
Beta Charli
Delta
Delta Echo
Frank George
Henry
1234-5
Ida
John
"
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
"
Mary / New York
123456
12543-01
Ocean
Peter

Parsing CSV is a whole lot harder than one would imagine at first sight, which is why your best option is to use a well-designed and tested library to do that work for you. Two such libraries are opencsv and supercsv, and there are many others. Have a look at both and use the one that best fits your requirements and style.
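To illustrate why quoted fields make this harder than a plain split(","), here is a minimal hand-written sketch of a quote-aware splitter. It is not a replacement for a tested library. The class name QuotedCsvSplitter is hypothetical, and two behaviours are assumptions taken from the desired output in the question: line breaks inside quotes are dropped, and a line break outside quotes simply ends the current value.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsvSplitter {

    /** Splits CSV text into values; commas inside double quotes stay in the value. */
    static List<String> split(String text) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else if (c != '\n' && c != '\r') {
                    current.append(c);       // assumption: drop embedded line breaks
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote, not part of the value
            } else if (c == ',' || c == '\n') {
                fields.add(current.toString()); // separator outside quotes ends the value
                current.setLength(0);
            } else if (c != '\r') {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last value has no trailing separator
        return fields;
    }

    public static void main(String[] args) {
        String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\nTiti\",God";
        split(source).forEach(System.out::println);
    }
}
```

When the input must be read line by line, the same loop can be fed one line at a time as long as the inQuotes flag and the current buffer are kept between lines.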

Description
Consider the following PowerShell example of a universal regex, tested on a Java parser, which requires no extra processing to reassemble the data parts. The first capture group matches an optional quote and carries it to the end of the match, so you're assured of capturing the entire value between, but not including, the quotes. Commas are not captured unless they are embedded in a quote-delimited substring.
(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)
Example
$Matches = @()
$String = 'Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"'
$Regex = '(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)'
Write-Host start with
write-host $String
Write-Host
Write-Host found
([regex]"(?i)(?m)$Regex").matches($String) | foreach {
write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'`t= value at $($_.Groups[2].Index) = '$($_.Groups[2].Value)'"
} # next match
Yields
start with
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"
found
key at 0 = '' = value at 0 = 'Alpha'
key at 6 = '' = value at 6 = 'Beta'
key at 11 = '' = value at 11 = 'Gama'
key at 16 = '"' = value at 17 = '23-5-2013,TOM'
key at 32 = '' = value at 32 = 'TOTO'
key at 37 = '"' = value at 38 = 'Julie, KameL\n
Titi'
key at 60 = '' = value at 60 = 'God'
key at 64 = '"' = value at 65 = 'timmy, \n
tomy,tony,tini'
Summary
(?: start non capture group
^ require start of string
| or
,\s{0,} a comma followed by any number of white space
) close the non capture group
( start capture group 1
["]? consume a quote if it exists; I like doing it this way in case you want to include characters other than a quote
) close capture group 1
\s{0,} consume any spaces if they exist, this means you don't need to trim the value later
( start capture group 2
(?:.|\n|\r)*? capture all characters including a new line, non greedy
) close capture group 2
\1 if there was a quote it would be stored in group 1, so if there was one then require it here
(?= start zero-width look-ahead assertion
[,]\s{0,} must have a comma followed by optional whitespace
| or
$ end of the string
) close the zero-width look-ahead assertion
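Since the question is about Java, here is a sketch of how the same expression might be used with java.util.regex. The class name QuoteBackrefCsv is hypothetical and the pattern is only re-escaped for a Java string literal. No flags are set, so $ matches at the end of the input rather than at every line end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBackrefCsv {
    // Same expression as in the PowerShell example, escaped for a Java string literal.
    static final Pattern FIELD = Pattern.compile(
            "(?:^|,\\s{0,})([\"]?)\\s{0,}((?:.|\\n|\\r)*?)\\1(?=[,]\\s{0,}|$)");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(m.group(2)); // group 2 is the value without the surrounding quotes
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "Alpha,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\nTiti\",God";
        for (String field : fields(source)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Unlike the other Java answer, this keeps the embedded line break inside the value. On very long inputs the (?:.|\n|\r)*? construct can recurse deeply in Java; compiling with Pattern.DOTALL and a plain . is a common substitute.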

Try this:
String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
+ "Titi\",God,\" timmy, tomy,tony,\n"
+ "tini\".";
Pattern p = Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");
Matcher m = p.matcher(source);
while(m.find())
{
    if(m.group(2) != null)
        System.out.println( m.group(2).replace("\n", "") );
    else if(m.group(3) != null)
        System.out.println( m.group(3).replace("\n", "") );
}
If it matches a string without quotes, the result is returned in group 2.
Strings with quotes are returned in group 3. Hence I needed the distinction in the while block.
You might find a prettier way.
Output:
Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie, KameLTiti
God
timmy, tomy,tony,tini
.

See this related answer for a decent Java-compatible regex for parsing CSV.
It recognizes:
Newlines (after values or inside quoted values)
Quoted values containing escaped double-quotes like ""this""
In short, you will use this pattern: (?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))
Then collect each Matcher group(1) in a find() loop.
Note: Although I have posted this answer about a "decent" regex I discovered, just to save people searching for one, it is by no means robust. I still agree with the answer by user "fgv": a CSV parser is preferable.
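A minimal sketch of that find() loop follows. The class name CsvRegexFields is hypothetical, and the quote handling after the match is an addition that is not part of the pattern itself: group(1) still contains the surrounding quotes and any doubled quotes, so they are stripped and unescaped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRegexFields {
    // The pattern quoted above, escaped for a Java string literal.
    static final Pattern FIELD = Pattern.compile(
            "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))");

    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            String raw = m.group(1);
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                // Quoted value: remove the outer quotes and turn "" back into "
                raw = raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
            }
            out.add(raw);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String field : fields("a,\"b,c\",\"say \"\"hi\"\"\",d")) {
            System.out.println("[" + field + "]");
        }
    }
}
```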

Related

Regex expression is not working correctly

I am trying to find numbers in a string. Numbers are formatted as "4.97", but if they are smaller than 1 they appear as .97, .80, etc. I want to find these substrings in the String and replace them so that they start with a 0.
It's working for the string
String str = "Rate is : .97";
Result : "Rate is : 0.97"
But not for the string:
String str = "Rate is : .97 . XXXXXXXXX do you want . to perform another calculation . ";
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(.*\\D)(.\\d\\d.*)";
System.out.println(str.matches("(.*\\D)(.\\d\\d.*)"));
str = str.replaceAll(pattern, "$10$2");
Why is this happening?
In your second example, the .* after the last \\d matches any character except a newline, so it consumes the rest of the string.
You might do the replacement without a capturing group, using a negative lookbehind (?<!\S) to assert that what is on the left is not a non-whitespace character.
(?<!\S)\.[0-9]
In the replacement use a zero followed by the full match.
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(?<!\\S)\\.[0-9]";
System.out.println(str.replaceAll(pattern, "0$0"));
Output
Rate is : 0.97 . XXXXXXXXX do you want . 87 to perform another calculation .
If there should be a non digit before, you could make use of a positive lookbehind
(?<=\D)\.[0-9]
In Java
String regex = "(?<=\\D)\\.[0-9]";
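To make the difference concrete, the sketch below runs the pattern from the question and the lookbehind pattern on the same input. The class and method names are hypothetical. The greedy leading .* backtracks only as far as the last place where the rest of the pattern fits, so the single match covers the whole string and the 0 ends up in front of " 87" instead of ".97".

```java
class LeadingZeroDemo {

    /** The pattern from the question: one match spanning the whole string. */
    static String greedy(String s) {
        return s.replaceAll("(.*\\D)(.\\d\\d.*)", "$10$2");
    }

    /** The lookbehind pattern: matches each ".digit" that is not glued to a preceding character. */
    static String lookbehind(String s) {
        return s.replaceAll("(?<!\\S)\\.[0-9]", "0$0");
    }

    public static void main(String[] args) {
        String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
        System.out.println(greedy(str));
        System.out.println(lookbehind(str));
    }
}
```

In the replacement "$10$2", Java reads $1 followed by a literal 0 because the pattern has no group 10.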
It looks like you need to add some lazy matching to your regex.
? means it will attempt to match as few times as possible; in this case it picks up only the first number and does not go on to the second.
^(.*?\D)(.\d\d.*?)
I have also added the ^ start-of-string anchor to make sure only one match is created and it is not repeated for the second number.
First of all, your regex pattern seems to be wrong. I think you can just use:
(\D)(\.\d+)
Find a character that is not a digit, followed by a dot and at least one digit.
Second, for replacing, you could use more low-level features, such as:
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
final Pattern regex = Pattern.compile("(\\D)(\\.\\d+)");
final Matcher m = regex.matcher(str);
if (m.find()) {
    str = m.replaceFirst(m.group(1) + "0" + m.group(2));
}
System.out.println(str);
But of course, this works too:
str = str.replaceAll("(\\D)(\\.\\d+)", "$10$2");
You can use a positive lookahead so that you also catch whitespace between the . and the number.
(.(?=.\d)|(\d+))+
Then in your code you can do whatever operation you wish on group 1 and group 2.

Matching a whitespace or empty string using regex in Java

I have this regex in java
String pattern = "(\\s)(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})(\\s)";
It works as intended, but I have a new problem getting some valid dates:
1st problem:
If I have the String "It was at 22-febrero-1999 and 10-enero-2009 and 01-diciembre-2000", I should get another String, febrero-enero-diciembre, but I only get febrero-enero.
2nd problem:
If I have a single date in a String, like 12-octubre-1989, I get an empty String.
Why does my pattern require whitespace at the start and end of every date? Because I must catch only valid months: in a String like adsadasd 12-validMonth-2999 asd 11-validMonth-1989 I should get both validMonth values, but I should never get a validMonth from text like asdadsad12-validMonth-1989. In asdadsad12-validMonth-1989 asdadsad 23-validMonth-1989 I should get only the last validMonth.
PS: My Java code is
String resultado = "";
String pattern = "(\\s)(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})(\\s)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(fecha);
while (m.find()) {
    resultado += m.group().split("-")[1] + "-";
}
return (resultado.compareTo("") == 0 ? "" : resultado.substring(0, resultado.length() - 1));
You might want to use a word boundary instead:
\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b
And I believe some of the months can be optimized a little bit (it could reduce readability unfortunately, but should speed things up by a notch):
\\b(\\d{2}-)((?:en|febr)ero|ma(?:rz|y)o|abril|ju[ln]io|agosto|(?:septiem|octu|noviem|diciem)bre)(-\\d{4})\\b
Perhaps try using a \b instead of \s:
String pattern = "\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b";
This will only match strings where the first digit is not preceded by another word character (digit, letter, or underscore), and the last digit is not followed by a word character. I've also removed the capturing groups around the \b, because it would always be a zero-length string, if matched.
I wouldn't use a word boundary as a delimiter.
I'd suggest using either whitespace or NOT digit,
or no delimiter, and putting in a validation range of numbers for day/year.
This way you may catch more embedded dates that are in close
proximity (adjacent) to letters and underscores.
Something like:
# "(?<!\\d)\\d{2}-(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)-\\d{4}(?!\\d)"
(?<! \d ) # Not a digit before us
\d{2} - # Two digits followed by dash
(?: # A month
enero
| febrero
| marzo
| abril
| mayo
| junio
| julio
| agosto
| septiembre
| octubre
| noviembre
| diciembre
)
- \d{4} # Dash followed by four digits
(?! \d ) # Not a digit after us
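The two delimiter strategies behave differently on the inputs from the question, which the sketch below tries to show. The class name SpanishMonthExtractor and the helper months are hypothetical, and a capturing group was added around the month list so that group(1) returns the month directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanishMonthExtractor {
    static final String MONTHS =
            "enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre";

    // Word-boundary delimiters: a date glued to letters (asdadsad12-...) is rejected.
    static final Pattern WORD_BOUNDARY =
            Pattern.compile("\\b\\d{2}-(" + MONTHS + ")-\\d{4}\\b");

    // Not-a-digit delimiters: a date glued to letters is accepted, one glued to digits is not.
    static final Pattern NOT_DIGIT =
            Pattern.compile("(?<!\\d)\\d{2}-(" + MONTHS + ")-\\d{4}(?!\\d)");

    /** Returns the matched month names joined with '-', or "" when nothing matches. */
    static String months(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return String.join("-", found);
    }

    public static void main(String[] args) {
        String text = "It was at 22-febrero-1999 and 10-enero-2009 and 01-diciembre-2000";
        System.out.println(months(WORD_BOUNDARY, text));
        System.out.println(months(WORD_BOUNDARY, "12-octubre-1989"));
        System.out.println(months(NOT_DIGIT, "asdadsad12-enero-1989 asdadsad 23-marzo-1989"));
    }
}
```

Because neither pattern consumes the surrounding characters, a date at the very start or end of the input is matched as well, which addresses both problems in the question.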

(Pattern and Matcher) not discovering all pattern matches

I have this String object, which consists of tags (bounded by [$ and $]) and the rest of the text. I'm trying to isolate all of the tags. Pattern and Matcher recognize all of the tags properly, but two of them are combined into one. I don't have any idea why this is happening; it is probably some internal Pattern/Matcher business.
String docBody = "This is sample text.\r\n[$ FOR i 1 10 1 $]\r\n This is" +
"[$ i $]-th time this message is generated.\r\n[$END$]\r\n" +
"[$ FOR i 0 10 2 $]\r\n sin([$= i $]^2) = [$= i i * #sin \"0.000\"" +
" #decfmt $]" +
"\r\n[$END$] ";
Pattern p = Pattern.compile("(\\[\\$)(.)+(\\$\\])");
Matcher m = p.matcher(docBody);
while(m.find()){
    System.out.println(m.group());
}
output:
[$ FOR i 1 10 1 $]
[$ i $]
[$END$]
[$ FOR i 0 10 2 $]
[$= i $]^2) = [$= i i * #sin "0.000" #decfmt $]
[$END$]
As you can see, this part [$= i $]^2) = [$= i i * #sin "0.000" #decfmt $] is not split into these two tags [$= i $] and [$= i i * #sin "0.000" #decfmt $]
Any suggestions why this is happening?
You should use reluctant quantifier - ".+?" instead of greedy - ".+" :
"(\\[\\$).+?(\\$\\])" // Note `?` after `.+`
If you use .+, it will match everything except the line-terminator till the last $. Note that a dot (.) matches everything except a newline. With reluctant quantifier, .+? matches only till the first $] it encounters.
In your given string, you got all those matches, because you had \r\n in between, where the .+ stops matching. If you remove all those newlines, then you will just get a single match from 1st [$ to the last $].
A good way is to replace the dot by a negated character class, example:
Pattern p = Pattern.compile("(\\[\\$)([^$]++)(\\$])");
(note that you don't need to escape closing square brackets)
But perhaps you are only interested in the content of the tags:
Pattern p = Pattern.compile("(?<=\\[\\$)[^$]++(?=\\$])");
In this case the content is the whole match
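The three quantifier styles discussed in this thread can be compared side by side on a line that contains two tags and no line breaks. This is a small sketch with a hypothetical class name (TagMatcher) and helper (findAll). The patterns themselves are the ones from the question and the answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatcher {

    /** Collects every full match of regex in text. */
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "sin([$= i $]^2) = [$= i i $]";
        // Greedy: runs to the last $] on the line, so both tags merge into one match.
        System.out.println(findAll("(\\[\\$)(.)+(\\$\\])", line));
        // Reluctant: stops at the first $] after each [$.
        System.out.println(findAll("(\\[\\$).+?(\\$\\])", line));
        // Negated class: cannot run past a $, so it never crosses a tag end.
        System.out.println(findAll("(\\[\\$)([^$]++)(\\$])", line));
    }
}
```

Note that the negated class version assumes the tag content itself never contains a $ character.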

Latin Regex with symbols

I need to split a text and get only words, numbers and hyphenated compound words. I also need to get Latin words, so I used \p{L}, which gives me é, ú, ü, ã, and so forth. The example is:
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! # # $ % ^& * ( ) + - _ #$% " ' : ; > < / \ | , here some is wrong… * + () e -"
Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );
What is wrong with this regex? Why does it match symbols like "(", "+", "-", "*" and "|"?
Some of the results are:
dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
The regex explanation is:
[^\p{L}+(\-\p{L}+)*\d]+
* Word separator will be:
* [^ ... ] No sequence in:
* \p{L}+ Any latin letter
* (\-\p{L}+)* Optionally hyphenated
* \d or numbers
* [ ... ]+ once or more.
If my understanding of your requirement is correct, this regex will match what you want:
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
It will match:
A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script, since \p{L} will match letter in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
Or several such sequences, hyphenated
Or a contiguous sequence of decimal digits (0-9)
The regex above is to be used by calling Pattern.compile, and call matcher(String input) to obtain a Matcher object, and use a loop to find matches.
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);
while (matcher.find()) {
    System.out.println(matcher.group());
}
If you want to allow words with apostrophe ':
"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"
I also escape - in the character class ['\\-] just in case you want to add more. Actually - doesn't need escaping if it is the first or last in the character class, but I escape it anyway just to be safe.
If the opening bracket of a character class is followed by a ^ then the characters listed inside the class are not allowed. So your regex allows anything except unicode letter,+,(,-,),* and digit occurring one or more times.
Note that characters like +,(,),* etc. don't have any special meaning inside a character class.
What pattern.split does is that it splits the string at patterns matching the regex. Your regex matches whitespace and hence split occurs at each occurrence of one or more whitespace. So result will be this.
For example consider this
Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda a f g")) {
System.out.println("==>"+s);
}
Output will be
==>sd
==>
==> f g
A regular expression set description with [] can contain only letters, classes (\p{...}), sequences (e.g. a-z) and the complement symbol (^). You have to place the other magic characters you are using (+*()) outside the [ ] block.
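The point that +, (, ) and * are plain literals inside [...] can be checked with a short sketch. The class name CharClassDemo and its method names are hypothetical. The first method uses the pattern from the question as a separator for split, the second uses the find-based pattern from the first answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {

    /** Splits with the question's pattern: +, (, ), * and - are excluded from the separators. */
    static List<String> splitWithOriginal(String text) {
        Pattern separator = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
        return Arrays.asList(separator.split(text));
    }

    /** Finds words, hyphenated words and numbers instead of describing the separators. */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Only the spaces act as separators, so the symbols stay attached to the tokens.
        System.out.println(splitWithOriginal("a+b (c) d*e"));
        // \u00ee is the letter î, written as an escape to avoid source-encoding issues.
        System.out.println(words("sud-est (\u00eele) 987 + -"));
    }
}
```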

Replacing variable numbers of items... regex?

Ok... I have an unsatisfactory solution to a problem.
The problem is I have input like so:
{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F
and need output like so:
¹⁹F(³He,t)¹⁹Ne(p)¹⁸F
I use a series of replacements first to split each of the {sup xx} sections into {sup x}{sup x} and then use a regex to match each of those and replace the characters with their UTF-8 single equivalents. The "problem" is that the {sup} sections can have numbers 1, 2 or 3 digits long (maybe more, I don't know), and I want to "expand" them into separate {sup} sections with one digit each. ( I also have the same problem with {sub} for subscripts... )
My current solution looks like this (in java):
retval = retval.replaceAll("\\{sup ([1-9])([0-9])\\}", "{sup $1}{sup $2}");
retval = retval.replaceAll("\\{sup ([1-9])([0-9])([0-9])\\}", "{sup $1}{sup $2}{sup $3}");
My question: is there a way to do this in a single pass no matter how many digits ( or at least some reasonable number ) there are?
Yes, but it may be a bit of a hack, and you'll have to be careful it doesn't overmatch!
Regex:
(?:\{sup\s)?(\d)(?=\d*})}?
Replacement String:
{sup $1}
A short explanation:
(?: | start non-capturing group 1
\{ | match the character '{'
sup | match the substring: "sup"
\s | match any white space character
) | end non-capturing group 1
? | ...and repeat it once or not at all
( | start group 1
\d | match any character in the range 0..9
) | end group 1
(?= | start positive look ahead
\d | match any character in the range 0..9
* | ...and repeat it zero or more times
} | match the substring: "}"
) | end positive look ahead
} | match the substring: "}"
? | ...and repeat it once or not at all
In plain English: it matches a single digit, only when looking ahead there's a } with optional digits in between. If possible, the substrings {sup and } are also replaced.
EDIT:
A better one is this:
(?:\{sup\s|\G)(\d)(?=\d*})}?
That way, digits like in the string "set={123}" won't be replaced. The \G in my second regex matches the spot where the previous match ended.
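A small sketch comparing the two expressions with String.replaceAll follows. The class name SupSplitRegex and the method names loose and anchored are hypothetical. The closing braces are escaped here, which is equivalent to the unescaped form above.

```java
class SupSplitRegex {

    /** First version: any digit run followed by } is expanded, even outside {sup ...}. */
    static String loose(String s) {
        return s.replaceAll("(?:\\{sup\\s)?(\\d)(?=\\d*\\})\\}?", "{sup $1}");
    }

    /** Second version: a digit must follow "{sup " or the end of the previous match (\G). */
    static String anchored(String s) {
        return s.replaceAll("(?:\\{sup\\s|\\G)(\\d)(?=\\d*\\})\\}?", "{sup $1}");
    }

    public static void main(String[] args) {
        System.out.println(loose("{sup 19}F({sup 3}He,t)"));
        System.out.println(anchored("{sup 19}F({sup 3}He,t)"));
        System.out.println(loose("set={123}"));    // over-matches the digits in braces
        System.out.println(anchored("set={123}")); // left unchanged
    }
}
```

One caveat: before the first match, \G matches at the start of the input, so an input that begins with digits followed by } would still be rewritten by the second version.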
The easiest way to do this kind of thing is with something like PHP's preg_replace_callback or .NET's MatchEvaluator delegates. Java doesn't have anything like that built in, but it does expose the lower-level API that lets you implement it yourself. Here's one way to do it:
import java.util.regex.*;
public class Test
{
    static String sepsup(String orig)
    {
        Pattern p = Pattern.compile("(\\{su[bp] )(\\d+)\\}");
        Matcher m = p.matcher(orig);
        StringBuffer sb = new StringBuffer();
        while (m.find())
        {
            m.appendReplacement(sb, "");
            for (char ch : m.group(2).toCharArray())
            {
                sb.append(m.group(1)).append(ch).append("}");
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }
    public static void main (String[] args)
    {
        String s = "{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F";
        System.out.println(s);
        System.out.println(sepsup(s));
    }
}
result:
{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F
{sup 1}{sup 9}F({sup 3}He,t){sub 1}{sub 9}Ne(p){sup 1}{sup 8}F
If you wanted, you could go ahead and generate the superscript and subscript characters and insert those instead.
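As a sketch of that last step, the same appendReplacement loop can emit the Unicode superscript and subscript digits directly, which skips the intermediate one-digit tags. The class name SupSubConverter and the lookup strings are assumptions made for this illustration. The characters are written as \u escapes so the source file encoding does not matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SupSubConverter {
    // Index i holds the superscript / subscript form of digit i.
    static final String SUPERSCRIPTS =
            "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079";
    static final String SUBSCRIPTS =
            "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089";
    static final Pattern TAG = Pattern.compile("\\{(su[bp]) (\\d+)\\}");

    static String convert(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String table = m.group(1).equals("sup") ? SUPERSCRIPTS : SUBSCRIPTS;
            StringBuilder digits = new StringBuilder();
            for (char ch : m.group(2).toCharArray()) {
                digits.append(table.charAt(ch - '0')); // \d matches only 0-9 by default
            }
            // quoteReplacement guards against $ and \ being read as replacement syntax
            m.appendReplacement(out, Matcher.quoteReplacement(digits.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F"));
        System.out.println(convert("H{sub 2}O"));
    }
}
```

This handles any number of digits in a single pass over the input.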
Sure, this is a standard Regular Expression construct. You can find out about all the metacharacters in the Pattern Javadoc, but for your purposes, you probably want the "+" metacharacter, or the {1,3} greedy quantifier. Details in the link.
