Need nested parenthesis for this regex - java

I have a dataset like this, with three columns:
Col1, Col2, Col3
aaa,Arizona DL USTATES,12
bbb,Idaho DL USTATES,35
ccc,Idaho DL USTATES,28
ddd,Wisconsin DL USTATES,11
eeee,Wisconsin DL USTATES,35
I want to extract the first word of the second column (which is a state name) and put it in the first column.
Expected Output:
Arizona,Arizona randam USTATES,12
Idaho,Idaho randam USTATES,35
Idaho,Idaho randam USTATES,28
Wisconsin,Wisconsin random USTATES,11
The regex that I have is
^[^,]+,([^ ]+) [^\n]+$
With my () I can extract the state name, but how do I get the output? What I want is nested parentheses, something like this
^[^,]+,(([^ ]+) [^\n]+)$
and then the output will be
\1,\2
I should point out that I want to do it using regex replace only.
Edit:
I have solved it by using regex to get all of the state names in a column and then merged it, but I want to know if there are any advanced regex which can be used here.

String s = "aaa,Arizona DL USTATES,12";
String st = s.split(",")[1].split(" ")[0];
s = s.replaceFirst("\\w+\\,", st + ",");

Your regex with nested parentheses works fine; you just need to use String's replaceFirst method and note that Java uses $ for group references. Also note that the groups are enumerated in the order they occur in the regex, so the outer group is $1 because it starts first:
String line = "aaa,Arizona DL USTATES,12";
String result = line.replaceFirst("^[^,]+,(([^ ]+) [^\n]+)$", "$2,$1");
// result is "Arizona,Arizona DL USTATES,12"
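To apply the same nested-group replacement to every row at once, one option is the MULTILINE flag, so that ^ and $ anchor at each line. This is only a minimal sketch, assuming the whole dataset is held in a single string with \n line breaks; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class StateColumnDemo {
    // Group 1 is the outer group (everything after the first comma),
    // group 2 is the nested group (the first word, i.e. the state name).
    // \n is excluded from the character classes so a match never spans rows.
    private static final Pattern ROW =
            Pattern.compile("^[^,\\n]+,(([^ \\n]+) [^\\n]+)$", Pattern.MULTILINE);

    static String moveStateToFirstColumn(String rows) {
        return ROW.matcher(rows).replaceAll("$2,$1");
    }

    public static void main(String[] args) {
        String rows = "aaa,Arizona DL USTATES,12\nbbb,Idaho DL USTATES,35";
        System.out.println(moveStateToFirstColumn(rows));
    }
}
```

Rows that do not match the pattern (for example a header line without a space in the second column) are left unchanged.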

Related

Regex: Remove postfix string in any word after occurrence of any of list of strings in a paragraph

I have a large string and a list of strings. I want to change the large string such that
- For any occurrence of a string from the list in the large string, the suffix that follows it (up to the next space) is removed.
Bigger String
WITH dataTab0 AS (SELECT TO_CHAR(to_date(tab_0_0.times),'YYYYMMDD') AS TIME_ID_CATEGORYe93bc60a0041,tab_0_0.request_id AS PAGE_IMPRESSIONf6beefc4b44e4b FROM full_contents_2
List
TIME_ID_CATEGORY
PAGE_IMPRESSION
...
I need to remove suffixes like e93bc60a0041 and f6beefc4b44e4b, which come after TIME_ID_CATEGORY and PAGE_IMPRESSION.
I expect the following result. I need an efficient regex-based solution in Java to achieve this.
WITH dataTab0 AS (SELECT TO_CHAR(to_date(tab_0_0.times),'YYYYMMDD') AS TIME_ID_CATEGORY,tab_0_0.request_id AS PAGE_IMPRESSION FROM full_contents_2
How about something like this? Essentially matching TIME_ID_CATEGORY or PAGE_IMPRESSION into Group 1, and anything that follows (i.e. suffix) as Group 2.
(TIME_ID_CATEGORY|PAGE_IMPRESSION)(\w+)
And then simply replace contents of Group 2 with empty string. Or just replace with Group 1, this will also get rid of the suffix (see below code snippet).
Example code snippet:
public static void main(String args[]) throws Exception {
String line = "WITH dataTab0 AS (SELECT TO_CHAR(to_date(tab_0_0.times),'YYYYMMDD') AS TIME_ID_CATEGORYe93bc60a0041,tab_0_0.request_id AS PAGE_IMPRESSIONf6beefc154b44e4b FROM full_contents_2";
Pattern p = Pattern.compile("(TIME_ID_CATEGORY|PAGE_IMPRESSION)(\\w+)");
Matcher m = p.matcher(line);
if (m.find()) {
String output = m.replaceAll("$1");
System.out.println(output);
//WITH dataTab0 AS (SELECT TO_CHAR(to_date(tab_0_0.times),'YYYYMMDD') AS TIME_ID_CATEGORY,tab_0_0.request_id AS PAGE_IMPRESSION FROM full_contents_2
}
}
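Since the question starts from a list of strings rather than two fixed names, the alternation can also be built from that list. The following is a sketch under that assumption; the class and method names are hypothetical, and each name is passed through Pattern.quote so that it is matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SuffixStripper {
    // Builds (name1|name2|...)\w+ and keeps only group 1, which drops the suffix.
    static String stripSuffixes(String text, List<String> names) {
        if (names.isEmpty()) {
            return text; // an empty alternation would match every word
        }
        String alternation = names.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile("(" + alternation + ")\\w+");
        return p.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String line = "AS TIME_ID_CATEGORYe93bc60a0041,tab_0_0.request_id AS PAGE_IMPRESSIONf6beefc4b44e4b FROM full_contents_2";
        System.out.println(stripSuffixes(line, Arrays.asList("TIME_ID_CATEGORY", "PAGE_IMPRESSION")));
    }
}
```

If one name in the list is a prefix of another, the order of the alternatives matters, so sorting the list longest-first before joining is probably safer.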
My guess is that a simple expression such as
[a-f0-9]{14}
replaced with an empty string might work here, but only if every suffix is exactly 14 hexadecimal characters. In the question, e93bc60a0041 has only 12, so a fixed count of 14 would miss it.
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.

IllegalArgumentException: Illegal group reference while replaceFirst

I'm trying to replace the first occurrence of a String matching my regex while iterating over those occurrences, like this:
(this code is heavily simplified, so don't look for any deeper sense in it)
Matcher tagsMatcher = Pattern.compile("\\{[sdf]\\}").matcher(value);
int i = 0;
while (tagsMatcher.find()) {
value = value.replaceFirst("\\{[sdf]\\}", "%" + i + "$s");
i++;
}
I'm getting IllegalArgumentException: Illegal group reference while executing replaceFirst. Why?
The replacement argument of replaceFirst(regex, replacement) can contain references to groups matched by the regex. To do this it uses
- $x syntax, where x is an integer representing the group number,
- ${name}, where name is the name of a named group (?<name>...)
Because of this, $ is treated as a special character in the replacement, so if you want a literal $ you need to
- escape it with \, as in replaceFirst(regex, "\\$whatever")
- or let Matcher escape it for you with the Matcher.quoteReplacement method: replaceFirst(regex, Matcher.quoteReplacement("$whatever"))
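The different meanings of $ in a replacement string can be seen side by side in a small sketch. The input "key=value" and all names below are invented for the illustration.

```java
import java.util.regex.Matcher;

final class DollarInReplacementDemo {
    // $1 and $2 refer to the numbered capturing groups.
    static String swapWithNumberedGroups(String s) {
        return s.replaceFirst("(\\w+)=(\\w+)", "$2=$1");
    }

    // ${name} refers to a named capturing group.
    static String swapWithNamedGroups(String s) {
        return s.replaceFirst("(?<key>\\w+)=(?<val>\\w+)", "${val}=${key}");
    }

    // A backslash-escaped $ is inserted literally.
    static String literalDollarEscaped(String s) {
        return s.replaceFirst("value", "\\$5");
    }

    // quoteReplacement escapes $ and \ so the whole string is taken literally.
    static String literalDollarQuoted(String s) {
        return s.replaceFirst("value", Matcher.quoteReplacement("$5"));
    }

    public static void main(String[] args) {
        System.out.println(swapWithNumberedGroups("key=value")); // value=key
        System.out.println(swapWithNamedGroups("key=value"));    // value=key
        System.out.println(literalDollarEscaped("key=value"));   // key=$5
        System.out.println(literalDollarQuoted("key=value"));    // key=$5
    }
}
```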
BUT you shouldn't be using
value = value.replaceFirst("\\{[sdf]\\}", "%" + i + "\\$s");
inside a loop, because each call has to traverse the string from the beginning to find the part to replace, which is very inefficient.
The regex engine has a solution for this inefficiency in the form of matcher.appendReplacement(StringBuffer, replacement) and matcher.appendTail(StringBuffer).
- appendReplacement appends to the StringBuffer all the text up to the current match, and lets you specify what should be put in place of the matched part
- appendTail appends the part that comes after the last match
So your code should look more like
StringBuffer sb = new StringBuffer();
int i = 0;
Matcher tagsMatcher = Pattern.compile("\\{[sdf]\\}").matcher(value);
while (tagsMatcher.find()) {
    tagsMatcher.appendReplacement(sb, Matcher.quoteReplacement("%" + (i++) + "$s"));
}
// append whatever follows the last match, otherwise it is lost
tagsMatcher.appendTail(sb);
value = sb.toString();
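On Java 9 or later, the same single-pass replacement can probably be written more compactly with the Matcher.replaceAll overload that takes a function. This is a sketch and not part of the original answer; the class and method names are hypothetical. The function's result is still interpreted as a replacement string, so it is passed through quoteReplacement.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagNumbering {
    private static final Pattern TAG = Pattern.compile("\\{[sdf]\\}");

    // Replaces each {s}, {d} or {f} tag with %<n>$s, counting from 0 as in the question.
    static String numberTags(String value) {
        AtomicInteger counter = new AtomicInteger();
        return TAG.matcher(value).replaceAll(
                match -> Matcher.quoteReplacement("%" + counter.getAndIncrement() + "$s"));
    }

    public static void main(String[] args) {
        System.out.println(numberTags("a {s} b {d} c")); // a %0$s b %1$s c
    }
}
```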
You need to escape the dollar symbol.
value = value.replaceFirst("\\{[sdf]\\}", "%" + i + "\\$s");
An illegal group reference error mainly occurs when the replacement refers to a group that does not exist.
The special character $ can be handled in a simple way. Check the example below:
public static void main(String args[]){
String test ="Other company in $ city ";
String test2 ="This is test company ";
try{
test2= test2.replaceFirst(java.util.regex.Pattern.quote("test"), Matcher.quoteReplacement(test));
System.out.println(test2);
test2= test2.replaceAll(java.util.regex.Pattern.quote("test"), Matcher.quoteReplacement(test));
System.out.println(test2);
}catch(Exception e){
e.printStackTrace();
}
}
Output:
This is Other company in $ city company
This is Other company in $ city company
I solved it by using apache commons, org.apache.commons.lang3.StringUtils.replaceOnce. This is regex safe.

Matching everything after the first comma in a string

I am using java to do a regular expression match. I am using rubular to verify the match and ideone to test my code.
I got a regex from this SO solution, and it matches the group as I want it to in rubular, but my implementation in Java is not matching. When it prints 'value', it prints the value of commaSeparatedString and not matcher.group(1). I want the captured group (the output of println) to be "v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso"
String commaSeparatedString = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
//match everything after first comma
String myRegex = ",(.*)";
Pattern pattern = Pattern.compile(myRegex);
Matcher matcher = pattern.matcher(commaSeparatedString);
String value = "";
if (matcher.matches())
value = matcher.group(1);
else
value = commaSeparatedString;
System.out.println(value);
(edit: I left out that commaSeparatedString will not always contain 2 commas. Rather, it will always contain 0 or more commas)
If you don't have to solve it with regex, you can try this:
int size = commaSeparatedString.length();
value = commaSeparatedString.substring(commaSeparatedString.indexOf(",")+1,size);
That is, the code above returns the substring that starts right after the first comma. If there is no comma, indexOf returns -1, so the whole string is returned.
EDIT:
Sorry, I omitted the simpler version. Thanks to one of the commenters, you can use this single line as well:
value = commaSeparatedString.substring(commaSeparatedString.indexOf(",") + 1);
The regex is wrong for use with matches(), which requires the pattern to cover the whole string. It should be:
String myRegex = "[^,]*,(.*)";
You are yet another victim of Java's misguided regex method naming.
.matches() automatically anchors the regex at the beginning and end (which is in total contradiction with the very definition of "regex matching"). The method you are looking for is .find().
However, for such a simple problem, it is better to go with @DelShekasteh's solution.
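The difference between the two methods can be checked with the regex from the question. A minimal sketch, with hypothetical class and method names, that falls back to the whole string when there is no comma:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AfterFirstComma {
    private static final Pattern AFTER_COMMA = Pattern.compile(",(.*)");

    static String afterFirstComma(String s) {
        Matcher m = AFTER_COMMA.matcher(s);
        // find() looks for the pattern anywhere in the input;
        // matches() would require the pattern to cover the entire input.
        return m.find() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        String input = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
        System.out.println(AFTER_COMMA.matcher(input).matches()); // false
        System.out.println(afterFirstComma(input));
    }
}
```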
I would do this like
String commaSeparatedString = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
System.out.println(commaSeparatedString.substring(commaSeparatedString.indexOf(",")+1));
Here is another approach with limited split
String[] spl = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso".split(",", 2);
if (spl.length == 2)
System.out.println(spl[1]);
But IMHO Del's answer is best for your case.
I would use replaceFirst
String commaSeparatedString = "Vtest7,v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso";
System.out.println(commaSeparatedString.replaceFirst(".*?,", ""));
prints
v123_gpbpvl-testpv1,v223_gpbpvl-testpv1-iso
or you could use the shorter but obtuse
System.out.println(commaSeparatedString.split(",", 2)[1]);

How to count a position of element, relative to another element using regex?

Given String
// 1 2 3
String a = "letters.1223434.more_letters";
I'd like to recognize that numbers come in a 2nd position after the first dot
I then would like to use this knowledge to replace "2nd position of"
// 1 2 3
String b = "someWords.otherwords.morewords";
with "hello" to effectively make
// 1 2 3
String b = "someWords.hello.morewords";
Substitution would have to be done based on the original position of matched element in String a
How can this be done using regex please?
For finding those numbers you can use the group mechanism (round brackets in regular expressions):
import java.util.regex.*;
...
String data = "letters.1223434.more_letters";
String pattern="(.+?)\\.(.+?)\\.(.+)";
Matcher m = Pattern.compile(pattern).matcher(data);
if (m.find()) //or while if needed
for (int i = 1; i <= m.groupCount(); i++)
//group 0 == whole String, so I ignore it and start from i=1
System.out.println(i+") [" + m.group(i) + "] start="+m.start(i));
// OUT:
//1) [letters] start=0
//2) [1223434] start=8
//3) [more_letters] start=16
BUT if your goal is just replacing text between two dots try maybe replaceFirst(String regex, String replacement) method on String object:
//find ALL characters between 2 dots once and replace them
String a = "letters.1223434abc.more_letters";
a=a.replaceFirst("\\.(.+)\\.", ".hello.");
System.out.println(a);// OUT => letters.hello.more_letters
The regex searches for all characters between two dots (including the dots themselves), so the replacement should be ".hello." (with dots).
If your String has more dots, it will replace ALL characters between the first and last dot. If you want the regex to match the minimum number of characters necessary to satisfy the pattern, you need to use a reluctant quantifier (+? instead of +), like:
String b = "letters.1223434abc.more_letters.another.dots";
b=b.replaceFirst("\\.(.+?)\\.", ".hello.");//there is "+?" instead of "+"
System.out.println(b);// OUT => letters.hello.more_letters.another.dots
What you want to do is not directly possible in RegExp, because you cannot get access to the number of the capture group and use this in the replacement operation.
Two alternatives:
- If you can use any programming language: split a into groups using a regexp. Check each group to see whether it matches your numeric identifier condition. Split the b string into groups. Replace the corresponding match.
- If you only want to use regexps, you can concatenate a and b using a unique separator (let's say |). Then match .*?\.\d+?\..*?\|.*?\.(.*?)\..*? (with the separator escaped as \| so it is not read as alternation) and replace $1. You need to apply this regexp in three variations: first position, second position, third position.
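The first alternative above can be sketched in Java roughly as follows. It assumes the segments are separated by dots and that "numeric" means digits only; the class and method names are hypothetical.

```java
final class PositionalReplace {
    // Finds the first all-digit segment in a and replaces the segment
    // at the same position in b.
    static String replaceAtNumericPosition(String a, String b, String replacement) {
        String[] aParts = a.split("\\.", -1);
        String[] bParts = b.split("\\.", -1);
        for (int i = 0; i < aParts.length && i < bParts.length; i++) {
            if (aParts[i].matches("\\d+")) {
                bParts[i] = replacement;
                return String.join(".", bParts);
            }
        }
        return b; // no numeric segment at a position that also exists in b
    }

    public static void main(String[] args) {
        System.out.println(replaceAtNumericPosition(
                "letters.1223434.more_letters",
                "someWords.otherwords.morewords",
                "hello")); // someWords.hello.morewords
    }
}
```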
The regex for string a would be
\w+\.(\d+)\.\w+
using the match group to grab the number.
The regex for the second string would be
\w+\.(\w+)\.\w+
to grab the match group for the second string.
Then use code like this to do what you please with the matches.
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
boolean matchFound = matcher.find();
where patternStr is the pattern I mentioned above and inputStr is the input string.
You can use variations of this to try each combination you want. So you can move the match group to the first position, try that. If it returns a match, then do the replacement in the second string at the first position. If not, go to position 2 and so on...

extracting specific but unknown values from a string in Java

I am trying to extract values from a MySQL insert command in Java. The insert command is just a string as far as Java is concerned. It will be of the format
INSERT INTO employees VALUES ("John Doe", "45", "engineer");
I need to pull the '45' out of that statement. I can't pinpoint its index because names and job titles will be different. I only need the age. Other than overly complex string manipulation, which I could probably figure out in time, is there a more straightforward way of isolating those characters? I just can't seem to wrap my mind around how to do it, and I am not very familiar with regular expressions.
If this is the specific format of your message, then a regex like this should help:
INSERT INTO employees VALUES \(".*?", "(.*?)", ".*?"\);
Then read the first group of the result and you should get the age.
In regular expressions, (X) defines a capturing group that captures X (where X can be any regular expression). This means that if the entire regular expression matches, you can easily find out the value within this group (using Matcher.group(int) in Java). Because parentheses are special, the literal ( and ) around the values have to be escaped as \( and \).
You can also have multiple capturing groups in a single regex, like this:
INSERT INTO employees VALUES \("(.*?)", "(.*?)", "(.*?)"\);
So your code could look like this:
String sql = "INSERT INTO employees VALUES (\"John Doe\", \"45\", \"engineer\");";
// \\( and \\) match the literal parentheses; the unescaped ( ) are capturing groups
final Pattern pattern = Pattern.compile("INSERT INTO employees VALUES \\(\"(.*?)\", \"(.*?)\", \"(.*?)\"\\);");
final Matcher matcher = pattern.matcher(sql);
if (matcher.matches()) {
    String name = matcher.group(1);
    String age = matcher.group(2);
    String job = matcher.group(3);
    // do stuff ...
}
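If only the age is needed, a slightly more tolerant variant is possible. The sketch below is an assumption-laden illustration, not the answer's own code: it allows optional whitespace around the parenthesis and comma, assumes values never contain a double quote, and uses hypothetical class and method names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InsertAgeExtractor {
    // \( matches the literal parenthesis; the only capturing group
    // is the content of the second quoted value.
    private static final Pattern SECOND_VALUE = Pattern.compile(
            "VALUES\\s*\\(\\s*\"[^\"]*\"\\s*,\\s*\"([^\"]*)\"");

    static Optional<String> secondValue(String sql) {
        Matcher m = SECOND_VALUE.matcher(sql);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO employees VALUES (\"John Doe\", \"45\", \"engineer\");";
        System.out.println(secondValue(sql).orElse("not found")); // 45
    }
}
```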
Assuming that the name doesn't contain any ", you can use the regex .*?".*?".*?"(\d+)".* and group(1) gives you the age.
As far as I understand, your insert command inserts into 3 columns only. What you can do is split the string on the comma character (,), take the second element of the array, trim the whitespace on both sides, and then drop its first and last characters (the quotes). That should give you the age. In code:
String insertQuery = "INSERT INTO employees VALUES (\"John Doe\", \"45\", \"engineer\")";
String[] splitQuery = insertQuery.split(",");
// second element, trimmed: still wrapped in double quotes
String age = splitQuery[1].trim();
// strip the surrounding quotes
age = age.substring(1, age.length() - 1);
If you are sure that there is only one instance of a number in the string, the regular expression you need is very simple:
//Assuming that str contains your insert statement
Pattern p = Pattern.compile("[0-9]+");
Matcher m = p.matcher(str);
if(m.find()) System.out.println(m.group());
How about String.split by ","?
final String insert = "INSERT INTO employees VALUES (\"John Doe\", \"45\", \"engineer\"); ";
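// the second element still carries its double quotes, so strip them
System.out.println(insert.split(",")[1].trim().replace("\"", "")); // prints 45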
