Java: Splitting a String using Regex

I have to split a string using a comma (,) as the separator, ignoring any comma that is inside quotes (").
fieldSeparator : ,
fieldGrouper : "
The string to split is : "1","2",3,"4,5"
I am able to achieve it as follows :
String record = "\"1\",\"2\",3,\"4,5\"";
String[] tokens = record.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
Output :
"1"
"2"
3
"4,5"
Now the challenge is that the fieldGrouper(") should not be a part of the split tokens. I am unable to figure out the regex for this.
The expected output of the split is :
1
2
3
4,5

Update:
String[] tokens = record.split( "(,*\",*\"*)" );
Result:
Initial Solution:
(doesn't work with the .split method)
This RegEx pattern will isolate the sections you want:
(?:\\")(.*?)(?:\\")
It uses non-capturing groups to isolate the pairs of escaped quotes,
and a capturing group to isolate everything in between.
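As a rough illustration of the capturing-group idea above, here is a minimal sketch that applies it with Pattern and Matcher instead of split. The class name QuotedSections and the method extract are made up for this example, and the pattern is simplified to a plain quote, a lazy capture, and a closing quote. It only returns quoted sections, so the unquoted field 3 from the question's record is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class QuotedSections {
    // A quote, a lazily matched body captured in group 1, then the closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Returns the text found between each pair of double quotes.
    static List<String> extract(String record) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(record);
        while (matcher.find()) {
            sections.add(matcher.group(1)); // group 1 excludes the quotes themselves
        }
        return sections;
    }

    public static void main(String[] args) {
        String record = "\"1\",\"2\",3,\"4,5\"";
        // Only quoted fields are returned; the unquoted 3 is not matched.
        System.out.println(extract(record));
    }
}
```

Because unquoted fields are lost, this on its own does not produce the expected output from the question; it only shows what the capturing group isolates.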

My suggestion:
"([^"]+)"|(?<=,|^)([^,]*)
The pattern matches "..."-like strings and captures into Group 1 only what is between the quotes; otherwise it matches and captures into Group 2 a sequence of characters other than , at the start of the string or right after a comma.
Here is sample Java code:
String s = "value1,\"1\",\"2\",3,\"4,5\",value2";
Pattern pattern = Pattern.compile("\"([^\"]+)\"|(?<=,|^)([^,]*)");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<String>();
while (matcher.find()){ // Run the matcher
if (matcher.group(1) != null) { // If Group 1 matched
res.add(matcher.group(1)); // Add it to the resulting array
} else {
res.add(matcher.group(2)); // Add Group 2 as it got matched
}
}
System.out.println(res); // => [value1, 1, 2, 3, 4,5, value2]

I would try with this kind of workaround:
String record = "\"1\",\"2\",3,\"4,5\"";
record = record.replaceAll("\"?(?<!\"\\w{1,9999}),\"?|\""," ");
String[] tokens = record.trim().split(" ");
for(String str : tokens){
System.out.println(str);
}
Output:
1
2
3
4,5

My proposition:
record = record.replaceAll("\",", "|");
record = record.replaceAll(",\\\"", "|");
record = record.replaceAll("\"", "");
String[] tokens = record.split("\\|");
for (String token : tokens) {
System.out.println(token);
}

Related

Can't split a line in Java

I don't know how to split this line correctly. I only need RandomAdresas0 100 2018 1.
String line = Files.readAllLines(Paths.get(failas2)).get(userInp);
System.out.println(line);
arr = line.split("[\\s\\-\\.\\'\\?\\,\\_\\#]+");
Content in line:
[Pastatas{pastatoAdresas='RandomAdresas0',pastatoAukstuSkaicius=100,pastatoPastatymoData=2018, pastatoButuKiekis=1}]
You can try this code (basically extracting a string between two delimiters):
String ss = "[Pastatas{pastatoAdresas='RandomAdresas0',pastatoAukstuSkaicius=100,pastatoPastatymoData=2018, pastatoButuKiekis=1}]";
Pattern pattern = Pattern.compile("=(.*?)[,}]");
Matcher matcher = pattern.matcher(ss);
while (matcher.find()) {
System.out.println(matcher.group(1).replace("'", ""));
}
This outputs:
RandomAdresas0
100
2018
1
Remove all the characters before '{', including '{'.
Remove all the characters after '}', including '}'.
You can do both using the indexOf and substring methods.
You will then be left with only the following:
pastatoAdresas='RandomAdresas0',pastatoAukstuSkaicius=100,pastatoPastatymoData=2018, pastatoButuKiekis=1
After this, read this thread: Parse a string with key=value pair in a map?
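The steps above could be sketched roughly as follows. The class name KeyValueParser and the method parse are made up for this example, and it assumes that neither keys nor values contain a comma or an equals sign.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
class KeyValueParser {
    // Cuts out the text between '{' and '}' and splits it into key=value pairs.
    // Assumption: keys and values contain no ',' and no '='.
    static Map<String, String> parse(String line) {
        int start = line.indexOf('{');
        int end = line.lastIndexOf('}');
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no {...} section in: " + line);
        }
        String body = line.substring(start + 1, end);

        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        for (String pair : body.split(",")) {
            String[] keyValue = pair.split("=", 2);
            if (keyValue.length < 2) {
                continue; // skip pieces without '='
            }
            String value = keyValue[1].trim().replace("'", ""); // drop single quotes
            fields.put(keyValue[0].trim(), value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "[Pastatas{pastatoAdresas='RandomAdresas0',pastatoAukstuSkaicius=100,"
                + "pastatoPastatymoData=2018, pastatoButuKiekis=1}]";
        System.out.println(parse(line).values());
    }
}
```

Keeping the keys in a map means a single value can also be looked up by name, for example parse(line).get("pastatoButuKiekis").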
Here is a solution using a regular expression and the Pattern & Matcher classes. The values you are after can be retrieved using the group() method and you get all values by looping as long as find() returns true.
String data = "[Pastatas{pastatoAdresas='RandomAdresas0',pastatoAukstuSkaicius=100,pastatoPastatymoData=2018, pastatoButuKiekis=1}]";
Pattern pattern = Pattern.compile("=([^, }]*)");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
System.out.printf("[%d:%d] %s%n", matcher.start(), matcher.end(), matcher.group(1));
}
The matched value is in group 1; group 0 matches the whole regex.

Parsing comma separated string with prefix

I am getting a comma-separated string in the format below:
String codeList1 = "abc,pqr,100101,P101001,R108972";
or
String codeList2 = "mno, 100101,108972";
Expected result: check whether the code is numeric after removing its first letter. If yes, remove the prefix and return the number. If no, return the code unchanged.
codeList1 = "abc,pqr,100101,101001,108972";
or
codeList2 = "mno, 100101,108972";
As you can see, I can get codes in either format: P101001 or 101001, and R108972 or 108972. There will be only one prefix character.
If I get P101001, I want to remove the 'P' prefix and return the number 101001.
If I get 101001, do nothing.
Below is the working code, but is there an easier or more efficient way of achieving this? Please help.
for (String code : codeList.split(",")) {
if(StringUtils.isNumeric(code)) {
codes.add(code);
} else if(StringUtils.isNumeric(code.substring(1))) {
codes.add(Integer.toString(Integer.parseInt(code.substring(1))));
} else {
codes.add(code);
}
}
If you want to remove prefixes from the numbers, you can simply use:
String[] codes = {"abc,pqr,100101,P101001,R108972", "mno, 100101,108972"};
for (String code : codes){
System.out.println(
code.replaceAll("\\b[A-Z](\\d+)\\b", "$1")
);
}
Outputs
abc,pqr,100101,101001,108972
mno, 100101,108972
If you are using Java 8+, and want to extract only the numbers, you can just use :
String codeList1 = "abc,pqr,100101,P101001,R108972";
List<Integer> results = Arrays.stream(codeList1.split("\\D")) // split on non-digits
.filter(c -> !c.isEmpty()) // keep only non-empty results
.map(Integer::valueOf) // convert String to Integer
.collect(Collectors.toList()); // collect the results into a list
Outputs
100101
101001
108972
You can use regex to do it
String str = "abc,pqr,100101,P101001,R108972";
String regex = ",?[a-zA-Z]{0,}(\\d+)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
while(matcher.find()){
System.out.println(matcher.group(1));
}
Output
100101
101001
108972
Updated:
Regarding your comment (I want to add the codes; if a single-letter prefix is found, remove it and add the remainder), you can use the code below:
String str = "abc,pqr,100101,P101001,R108972";
String regex = "(?=,?)[a-zA-Z]{0,}(?=\\d+)|\\s";// \\s is used to remove space
String[] strs = str.replaceAll(regex,"").split(",");
Output:
abc
pqr
100101
101001
108972
How about this:
String codeList1 = "abc,pqr,100101,P101001,R108972";
String[] codes = codeList1.split(",");
for (String code : codes) {
if (code.matches("[A-Z]?\\d{6}")) {
String codeF = code.replaceAll("[A-Z]+", "");
System.out.println(codeF);
}
}
100101
101001
108972

java/scala: Regex for skipping odd number of backslash while splitting a String?

Here is my requirement:
Input1: adasd|adsasd\|adsadsadad|asdsad
output1: Array(adasd,adsasd\|adsadsadad,asdsad)
Input2: adasd|adsasd\\|adsadsadad|asdsad
output2: Array(adasd,adsasd\\,adsadsadad,asdsad)
Input3: adasd|adsasd\\\|adsadsadad|asdsad
output3: Array(adasd,adsasd\\\|adsadsadad,asdsad)
I was using this code:
val delimiter =Pattern.quote("|")
val esc = "\\"
val regex = "(?<!" + Pattern.quote(esc) + ")" + delimiter
But this is not working fine with all the cases.
What will be the best solution to deal with this?
Instead of splitting, use this regex for a match:
(?<=[|]|^)[^|\\]*(?:\\.[^|\\]*)*
Java code:
final String[] input = {"adasd|adsasd\\|adsadsadad|asdsad",
"adasd|adsasd\\\\|adsadsadad|asdsad",
"adasd|adsasd\\\\\\|adsadsadad|asdsad"};
final String regex = "(?<=[|]|^)[^|\\\\]*(?:\\\\.[^|\\\\]*)*";
final Pattern pattern = Pattern.compile(regex);
Matcher matcher;
for (String string: input) {
matcher = pattern.matcher(string);
System.out.println("\n*** Input: " + string);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
}
Output:
*** Input: adasd|adsasd\|adsadsadad|asdsad
adasd
adsasd\|adsadsadad
asdsad
*** Input: adasd|adsasd\\|adsadsadad|asdsad
adasd
adsasd\\
adsadsadad
asdsad
*** Input: adasd|adsasd\\\|adsadsadad|asdsad
adasd
adsasd\\\|adsadsadad
asdsad
For the sake of simplicity, let's take ";"(semicolon) instead of "\"(backslash) to avoid too many escape sequences here.
We can do this split with a look-behind as below:
String[] input = { "adasd|zook;|adsadsadad|asdsad", "adasd|zook;;|adsadsadad|asdsad",
"adasd|zook;;;|adsadsadad|asdsad", "blah;|blah;;;;|blah|blahblah;|blahbloooh;;|" };
String regex = "(?<!;)(;;)+\\||(?<!;)\\|";
for(String str : input) {
System.out.println("Input : "+ str);
System.out.println("Output: ");
String[] astr = str.split(regex);
for(String nres : astr)
System.out.print(nres+", ");
System.out.println("\n");
}
Let's have a deeper look at the regex. I will split this into 2 parts:
Split on even occurrence of semicolon(;) followed by a pipe("|"):
(?<!;)(;;)+\\| :
Here we make sure we match just even occurrence with (;;)+ and a look-behind to make sure we are not matching any unintended ";" before the set of even occurrences.
Split on pipe without a preceding semicolon:
(?<!;)\\| :
Here we will just match lone pipe symbols and use look-behind to make sure no ";" before the "|"
Output for the above snippet
Hope this helps! :)
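Translating the semicolon stand-in back to a backslash, the same look-behind split might look like the sketch below. The class name EscapedPipeSplit and the method split are made up for this example. Note one assumption that follows from how split works: whatever the delimiter pattern matches is discarded, so an even run of backslashes directly before a pipe is dropped from the token.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class EscapedPipeSplit {
    // Regex after Java string unescaping: (?<!\\)(\\\\)+\||(?<!\\)\|
    // Alternative 1: an even run of backslashes (not preceded by a backslash) plus a pipe.
    // Alternative 2: a pipe that is not preceded by a backslash.
    private static final Pattern DELIMITER =
            Pattern.compile("(?<!\\\\)(\\\\\\\\)+\\||(?<!\\\\)\\|");

    static String[] split(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "adasd|adsasd\\|adsadsadad|asdsad",        // one backslash before the pipe
            "adasd|adsasd\\\\|adsadsadad|asdsad",      // two backslashes
            "adasd|adsasd\\\\\\|adsadsadad|asdsad"     // three backslashes
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + Arrays.toString(split(input)));
        }
    }
}
```

With two backslashes the second token comes back as adsasd rather than adsasd\\, which differs from output2 in the question; if the backslashes must be kept, the matching-based regex from the other answer is the better fit.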

Extract values from string using regex groups

I have to extract values from string using regex groups.
Inputs are like this,
-> 1
-> 5.2
-> 1(2)
-> 3(*)
-> 2(3).2
-> 1(*).5
Now I write following code for getting values from these inputs.
String stringToSearch = "2(3).2";
Pattern p = Pattern.compile("(\\d+)(\\.|\\()(\\d+|\\*)\\)(\\.)(\\d+)");
Matcher m = p.matcher(stringToSearch);
if (m.find()) { // a match must be attempted before calling group()
System.out.println("1: "+m.group(1)); // O/P: 2
System.out.println("3: "+m.group(3)); // O/P: 3
System.out.println("5: "+m.group(5)); // O/P: 2
}
But my problem is that only the first group is compulsory; the others are optional.
That's why I need a regex that checks all of these patterns and extracts the values.
Use non-capturing groups and make them optional by adding the ? quantifier after those groups.
^(\d+)(?:\((\d+|\*)\))?(?:\.(\d+))?$
Java regex would be,
"(?m)^(\\d+)(?:\\((\\d+|\\*)\\))?(?:\\.(\\d+))?$"
Example:
String input = "1\n" +
"5.2\n" +
"1(2)\n" +
"3(*)\n" +
"2(3).2\n" +
"1(*).5";
Matcher m = Pattern.compile("(?m)^(\\d+)(?:\\((\\d+|\\*)\\))?(?:\\.(\\d+))?$").matcher(input);
while(m.find())
{
if (m.group(1) != null)
System.out.println(m.group(1));
if (m.group(2) != null)
System.out.println(m.group(2));
if (m.group(3) != null)
System.out.println(m.group(3));
}
Here is an alternate approach that is simpler to understand.
First replace all non-digit, non-* characters by a colon
Split by :
Code:
String repl = input.replaceAll("[^\\d*]+", ":");
String[] tok = repl.split(":");
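To see what the replace-and-split approach returns for each sample input from the question, here is a small self-contained sketch. The class name DigitTokens and the method tokens are made up for this example.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class DigitTokens {
    // Replaces every run of characters other than digits and '*' with ':',
    // then splits on ':'. Trailing empty strings are dropped by split.
    static String[] tokens(String input) {
        return input.replaceAll("[^\\d*]+", ":").split(":");
    }

    public static void main(String[] args) {
        String[] inputs = {"1", "5.2", "1(2)", "3(*)", "2(3).2", "1(*).5"};
        for (String input : inputs) {
            System.out.println(input + " -> " + Arrays.toString(tokens(input)));
        }
    }
}
```

One trade-off of this approach is that the position of a value is lost: 5.2 and 1(2) both produce two tokens, so the result alone does not tell whether the second value came from the parentheses or from after the dot.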

Pattern matching for character and end of line

I have a string which is in following format:
I am extracting this Hello:A;B;C, also Hello:D;E;F
How do I extract the strings A;B;C and D;E;F?
I have written the code snippet below, but it is not able to extract the last match, D;E;F.
Pattern pattern = Pattern.compile("(?<=Hello:).*?(?=,)");
The $ means end of the input (or end of each line when the MULTILINE flag is set).
Thus this should work:
Pattern pattern = Pattern.compile("(?<=Hello:).*?(?=,|$)");
So the look-ahead accepts either a comma or the end of the input.
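For completeness, here is a minimal runnable sketch that uses the look-ahead with the comma-or-end alternative on the string from the question. The class name HelloValues and the method extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class HelloValues {
    // Text preceded by "Hello:", matched lazily up to a comma or the end of the input.
    private static final Pattern VALUE = Pattern.compile("(?<=Hello:).*?(?=,|$)");

    // Collects every value that follows a "Hello:" marker.
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher matcher = VALUE.matcher(text);
        while (matcher.find()) {
            values.add(matcher.group());
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "I am extracting this Hello:A;B;C, also Hello:D;E;F";
        System.out.println(extract(text)); // prints [A;B;C, D;E;F]
    }
}
```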
Try this:
String test = "I am extracting this Hello:Word;AnotherWord;YetAnotherWord, also Hello:D;E;F";
// any word optionally followed by ";" three times, the whole thing followed by either two non-word characters or EOL
Pattern pattern = Pattern.compile("(\\w+;?){3}(?=\\W{2,}|$)");
Matcher matcher = pattern.matcher(test);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
Word;AnotherWord;YetAnotherWord
D;E;F
Assuming you mean omitting certain patterns in a string:
String s = "I am extracting this Hello:A;B;C, also Hello:D;E;F" ;
ArrayList<String> tokens = new ArrayList<String>();
tokens.add( "A;B;C" );
tokens.add( "D;E;F" );
for( String tok : tokens )
{
if( s.contains( tok ) )
{
s = s.replace( tok, "");
}
}
System.out.println( s );
