Split complex string with Regex in JAVA


I want to split the following strings into an array with a regex in Java, but I don't know how.
string1="advmod(likes-4, also-3)" ==> advmod, likes, also
string2="nsubj(likes-4, dog24-2)" ==> nsubj, likes, dog24
string3="num(dog24-3, 8-2)" ==> num, dog24, 8
How can I split a string like "num(dog24-3, 8-2)" into the three tokens num, dog24 and 8, and then put them into a String array?
Thanks a lot.

This is generic:
String string[] = {"advmod(likes-4, also-3)",// ==> advmod , likes , also
"nsubj(likes-4, dog24-2)",// ==> nsubj , likes , dog24
"num(dog24-3, 8-2)"};//==> num ,dog24 , 8
Pattern p = Pattern.compile("(\\w+)\\(([^-]+).*, ([^-]+)");
for (int i = 0; i < string.length; i++) {
Matcher m = p.matcher(string[i]);
while(m.find()) {
System.out.print(i+": ");
for(int j=1; j<= m.groupCount(); j++){
System.out.print(m.group(j));
if(j!=m.groupCount()) {
System.out.print(", ");
}
}
System.out.println("");
}
}
Hope this helps, it works for me.
This is the output:
0: advmod, likes, also
1: nsubj, likes, dog24
2: num, dog24, 8

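For the 3rd string only (this pattern hard-codes the tokens, so it does not generalize to the other inputs):
String txt = "num(dog24-3, 8-2)"; // input string (the original snippet used txt without defining it)
String re1="(num)"; // Word 1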
String re2=".*?"; // Non-greedy match on filler
String re3="(dog24)"; // Alphanum 1
String re4=".*?"; // Non-greedy match on filler
String re5="(8)"; // Integer Number 1
Pattern p = Pattern.compile(re1+re2+re3+re4+re5,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String word1=m.group(1);
String alphanum1=m.group(2);
String int1=m.group(3);
System.out.print("("+word1.toString()+")"+"("+alphanum1.toString()+")"+"("+int1.toString()+")"+"\n");
}

If you want to split, you could use this:
str.split("\\(|-[0-9]+(?:,\\s+|\\))");
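As a quick check of the split-based answer, here is a minimal sketch that applies the same expression to the three strings from the question. The class and method names (SplitTokensDemo, tokens) are made up for illustration.

```java
import java.util.Arrays;

class SplitTokensDemo {
    // Splits on "(" or on "-<digits>" followed by either a comma plus whitespace or ")".
    // Trailing empty strings are dropped by String.split, so only the tokens remain.
    static String[] tokens(String s) {
        return s.split("\\(|-[0-9]+(?:,\\s+|\\))");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "advmod(likes-4, also-3)",
            "nsubj(likes-4, dog24-2)",
            "num(dog24-3, 8-2)"
        };
        for (String in : inputs) {
            System.out.println(Arrays.toString(tokens(in)));
        }
    }
}
```

This relies on every argument ending in -<digits>; an argument without that suffix would not be separated from what follows it.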

You haven't really described your grammar, but assuming that it looks something like a Java method call or a Prolog statement, try
final static String TOKEN_CHARACTERS = "[\\w\\d-]";
final Pattern p = Pattern.compile("^(" + TOKEN_CHARACTERS + "+)\\((" + TOKEN_CHARACTERS + "+),\\s*(" + TOKEN_CHARACTERS + "+)\\)$");
Then split each captured argument on the -; I presume that it really is there for some reason, and it's not clear that it's always present (if it is, you can change the pattern to hard-code the single - instead of considering it part of the token). If you allow additional spaces or the like, adjust accordingly.
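A minimal sketch of how this answer could be completed, assuming the input always has the shape name(arg1, arg2). The names DependencyParser, parse and stripIndex are hypothetical, and stripping a trailing -<digits> suffix is one possible reading of "split on the -", not necessarily what the answerer intended.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DependencyParser {
    // A token is one or more word characters or hyphens (assumed grammar).
    private static final String TOKEN = "[\\w-]+";
    private static final Pattern P =
        Pattern.compile("^(" + TOKEN + ")\\((" + TOKEN + "),\\s*(" + TOKEN + ")\\)$");

    // Returns {name, firstArgument, secondArgument} with the "-<digits>" suffixes removed.
    static String[] parse(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + s);
        }
        return new String[] { m.group(1), stripIndex(m.group(2)), stripIndex(m.group(3)) };
    }

    // Drops a trailing "-<digits>" if present; otherwise returns the token unchanged.
    static String stripIndex(String token) {
        return token.replaceFirst("-\\d+$", "");
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", parse("num(dog24-3, 8-2)")));
    }
}
```

Unlike the split-based approach, this version rejects input that does not match the expected shape instead of silently returning odd pieces.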

Related

How would I replace this function with a regex replace

I have a file name with this format yy_MM_someRandomString_originalFileName.
example:
02_01_fEa3129E_my Pic.png
I want replace the first 2 underscores with / so that the example becomes:
02/01/fEa3129E_my Pic.png
That can be done with replaceAll, but the problem is that files may contain underscores as well.
@Test
void test() {
final var input = "02_01_fEa3129E_my Pic.png";
final var formatted = replaceNMatches(input, "_", "/", 2);
assertEquals("02/01/fEa3129E_my Pic.png", formatted);
}
private String replaceNMatches(String input, String regex,
String replacement, int numberOfTimes) {
for (int i = 0; i < numberOfTimes; i++) {
input = input.replaceFirst(regex, replacement);
}
return input;
}
I solved this using a loop, but is there a pure regex way to do this?
EDIT: the solution should let me change a parameter to increase the number of underscores from 2 to n.

You could use 2 capturing groups and use those in the replacement where the match of the _ will be replaced by /
^([^_]+)_([^_]+)_
Replace with:
$1/$2/
For example:
String regex = "^([^_]+)_([^_]+)_";
String string = "02_01_fEa3129E_my Pic.png";
String subst = "$1/$2/";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
String result = matcher.replaceFirst(subst);
System.out.println(result);
Result
02/01/fEa3129E_my Pic.png
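The capturing-group idea above can be extended to n underscores by building the pattern and the replacement in a loop. This is only a sketch: the names FirstNUnderscores and replaceFirstN are made up, and it assumes each of the first n segments is non-empty, as in the yy_MM_... format from the question.

```java
class FirstNUnderscores {
    // Builds "^([^_]+)_([^_]+)_..." with n groups and the replacement "$1/$2/...".
    static String replaceFirstN(String input, int n) {
        StringBuilder regex = new StringBuilder("^");
        StringBuilder subst = new StringBuilder();
        for (int i = 1; i <= n; i++) {
            regex.append("([^_]+)_");
            subst.append('$').append(i).append('/');
        }
        // If the input has fewer than n leading segments, nothing matches
        // and the input is returned unchanged.
        return input.replaceFirst(regex.toString(), subst.toString());
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstN("02_01_fEa3129E_my Pic.png", 2));
    }
}
```

One difference from the loop in the question: when fewer than n underscores are present, this version replaces none of them rather than as many as it can.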

Your current solution has a few problems:
It is inefficient: each replaceFirst call starts from the beginning of the string, so it iterates over the same leading characters many times.
It has a bug: because each call starts from the beginning instead of from the last modified position, it can replace a value that was inserted by a previous call.
For instance, if we want to replace a single character two times, each time with X (abc -> XXc), then after code like
String input = "abc";
input = input.replaceFirst(".", "X"); // replaces a with X -> Xbc
input = input.replaceFirst(".", "X"); // replaces X with X -> Xbc
we will end up with Xbc instead of XXc, because the second replaceFirst replaces X with X instead of b with X.
To avoid that kind of problem you can rewrite your code to use Matcher#appendReplacement and Matcher#appendTail, which ensure that we iterate over the input only once and can replace each matched part with the value we want (the StringBuilder overloads used here require Java 9 or later; earlier versions need a StringBuffer).
private static String replaceNMatches(String input, String regex,
String replacement, int numberOfTimes) {
Matcher m = Pattern.compile(regex).matcher(input);
StringBuilder sb = new StringBuilder();
int i = 0;
while(i++ < numberOfTimes && m.find() ){
m.appendReplacement(sb, replacement); // replaces currently matched part with replacement,
// and writes replaced version to StringBuilder
// along with text before the match
}
m.appendTail(sb); //lets add to builder text after last match
return sb.toString();
}
Usage example:
System.out.println(replaceNMatches("abcdefgh", "[efgh]", "X", 2)); //abcdXXgh

Java regex except combination of symbols

I'm trying to find a substring that contains any characters but does not include the combination "[%".
As examples:
Input: atrololo[%trololo
Output: atrololo
Input: tro[tro%tro[%trololo
Output: tro[tro%tro
I already wrote a regex that takes any symbol except [ or %:
[A-Za-z-0-9\s!-$-&/:-@\\-`\{-~]*
I think I need to put something like [^("[%")] at the end of my expression, but I can't work out how to write it.
You can check my regex at
https://www.regex101.com/
Use this as the test string:
sdfasdsdfasa##!55#321!2h/ хf[[[[[sds d
asgfdgsdf[[[%for (int i = 0; i < 5; i++){}%]
[% fo%][%r(int i = 0; i < 5; i++){ %]*[%}%]
[%for(int i = 0; i < 5; i++){%][%=i%][%}%]
[%#n%]<[%# n + m %]*[%#%]>[%#%]
%?s.equals(""TEST"")%]TRUE[%#3%]![%#%][%?%]
Kind regards.

You could use a negative-lookahead-based regex like the one below to get the part before the [%
^(?:(?!\[%).)*
(?:(?!\[%).)* matches, zero or more times, any single character that is not the start of a [% sequence.
String s = "tro[tro%tro[%trololo";
Pattern regex = Pattern.compile("^(?:(?!\\[%).)*");
Matcher matcher = regex.matcher(s);
while(matcher.find()){
System.out.println(matcher.group()); // output : tro[tro%tro
}
OR
A lookahead based regex,
^.*?(?=\[%)
Pattern regex = Pattern.compile("^.*?(?=\\[%)");
OR
You could split the input string based on the regex \[% and get the parts you want.
String s = "tro[tro%tro[%trololo";
String[] part = s.split("\\[%");
System.out.println(part[0]); // output : tro[tro%tro
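The three variants above agree when the input contains [%, but they differ when it does not. The sketch below compares them; the class and method names are invented for illustration, and null is used here only to signal "no match".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BeforeMarkerDemo {
    static final Pattern TEMPERED = Pattern.compile("^(?:(?!\\[%).)*");
    static final Pattern LOOKAHEAD = Pattern.compile("^.*?(?=\\[%)");

    // Consumes characters as long as "[%" does not start at the current position.
    static String tempered(String s) {
        Matcher m = TEMPERED.matcher(s);
        return m.find() ? m.group() : null;
    }

    // Requires "[%" to be present somewhere; otherwise there is no match at all.
    static String lookahead(String s) {
        Matcher m = LOOKAHEAD.matcher(s);
        return m.find() ? m.group() : null;
    }

    // Takes the first piece of a split on the literal "[%".
    static String bySplit(String s) {
        return s.split("\\[%")[0];
    }

    public static void main(String[] args) {
        for (String s : new String[] { "tro[tro%tro[%trololo", "no marker here" }) {
            System.out.println(tempered(s) + " | " + lookahead(s) + " | " + bySplit(s));
        }
    }
}
```

Note that . does not match line terminators by default, so on multi-line input such as the test string in the question these patterns stop at the first line break unless Pattern.DOTALL is set.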

Using your input/output pairs as the spec:
String input; // the starting string
String output = input.replaceAll("\\[%.*", "");

match ;ABC12;10;250.3 using regex java

String regex = "^;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}.[\\d]{1,}";
String str = ";ABC12;10;250.3";
System.out.println(str.matches(regex));
The above regex works fine.
Consider the following strings:
str1=;ABC12;10;250.3
str2=;ABB62;5;2.3
str3=;ABF02;8;25120.3
str4=;AKC12;11;2504.303
Now I have the string strToMatch = str1,str2,str3,str4 (the values joined with commas).
How do I change my regex above so that it matches this string?
Note: there can be any number of comma-separated values in the string, and I also need to make sure that strToMatch does not end with a comma.

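You can wrap the regex in a group with round brackets and repeat it one or more times:
String regex = "^(;[A-Z0-9]{5};\\d+;\\d+\\.\\d+){1,}";
Note that, as written, the repeated group contains no comma, so the separator between the records still has to be added to the pattern.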

Try this pattern instead: (;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}\\.[\\d]{1,},?)+
This has two differences to your pattern: first I use \\. to denote that this has to be a . because a single dot means "any character" in regex.
Then I used the grouping brackets (...) and the + at the end to say: "Look for this once or more". As the , is optional at the end, I added a ?
If you want to get single matches to process using a Matcher later on, a simple modification should do the trick: (;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}\\.[\\d]{1,}),?
The + is gone and the ,? is outside the grouping brackets, because those are now capturing brackets (as well).
Example:
final Pattern pattern = Pattern.compile("(;[A-Z0-9]{5};[\\d]{1,};[\\d]{1,}\\.[\\d]{1,}),?");
final Matcher matcher = pattern.matcher(";ABC12;10;250.3,;ABB62;5;2.3,;ABF02;8;25120.3,;AKC12;11;2504.303");
while (matcher.find()) {
System.out.println("Whole match: " + matcher.group());
for (int i = 1; i <= matcher.groupCount(); ++i) {
System.out.println("Group #" + i + ": " + matcher.group(i));
}
}

I have found the following way of solving the problem.
String strToMatch = ";ABC12;10;250.3,;ABB62;5;2.3,;ABF02;8;25120.3,;AKC12;11;2504.303";
if(strToMatch.endsWith(",") || strToMatch.startsWith(","))
return false;
else{
String[] str = strToMatch.split(",");
String regex = ";[A-Z0-9]{5};\\d+;\\d+\\.\\d+";
for (String s : str){
if(!s.matches(regex)) // reject as soon as one record does not match
return false;
}
return true;
}
Is there any simpler way than this?
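One possibly simpler option is a single pattern of the form record(,record)*, which by construction cannot start or end with a comma. The following is a sketch under that assumption; RecordListValidator and isValid are illustrative names, not part of any answer above.

```java
import java.util.regex.Pattern;

class RecordListValidator {
    // One record: ";" + 5 uppercase letters or digits + ";" + integer + ";" + decimal number.
    private static final String RECORD = ";[A-Z0-9]{5};\\d+;\\d+\\.\\d+";

    // A record followed by zero or more ",record" repetitions.
    private static final Pattern LIST = Pattern.compile(RECORD + "(?:," + RECORD + ")*");

    // matches() requires the whole input to fit the pattern, so no anchors are needed.
    static boolean isValid(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(";ABC12;10;250.3,;ABB62;5;2.3,;ABF02;8;25120.3,;AKC12;11;2504.303"));
        System.out.println(isValid(";ABC12;10;250.3,"));
    }
}
```

Because the comma only appears in front of a further record, a trailing or leading comma makes the whole match fail without a separate endsWith or startsWith check.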

Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.

This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't require iteration, only your question title assumes it's necessary, which it isn't.
Note also that to be absolutely safe, you should escape all characters in word in case any of them are special "regex" characters, so if you can't guarantee that, you need to use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
word = Pattern.quote(word); // add this line to be 100% safe
return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}
public static void main(String[] args) {
System.out.println(wordEnds("abcXY123XYijk", "XY"));
System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij

Use a positive lookbehind and a positive lookahead, which are zero-width assertions:
(?<=(.)|^)1(?=(.)|$)
(?<=(.)|^) looks for a character just before 1 and captures it in group 1. This is a zero-width assertion that doesn't move forward in the input; it is just a test, which is what allows us to capture the value.
1 is the word being matched; you can replace it with any word.
(?=(.)|$) looks for a character after 1 and captures it in group 2.
$1 and $2 contain your values. Keep finding until the end of the input.
So this should be like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
while(m.find()) s3 += m.group(1)+m.group(2);
//s3 now contains c13iij

Use a regex as follows (here a is the input string, b is the word, and c is the result string, initially empty):
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
for (int i = 1; m.find(); c += m.group(1) + m.group(2), i++);
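The underlying issue in the question is that find() consumes the characters around the word. A variation on the lookaround idea is to match only a zero-width lookahead for the word and read the neighbouring characters by index. The sketch below is one such variant, not the code of any answer above; it assumes that overlapping occurrences of the word should each be counted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEnds {
    static String wordEnds(String str, String word) {
        // Zero-width match at every position where the word starts; nothing is consumed,
        // so the character between two adjacent words is seen by both occurrences.
        Matcher m = Pattern.compile("(?=" + Pattern.quote(word) + ")").matcher(str);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int start = m.start();
            int end = start + word.length();
            if (start > 0) {
                out.append(str.charAt(start - 1)); // char just before the word
            }
            if (end < str.length()) {
                out.append(str.charAt(end));       // char just after the word
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("abcXY123XYijk", "XY"));
        System.out.println(wordEnds("abc1xyz1i1j", "1"));
    }
}
```

Reading the neighbours by index also avoids null groups when the word sits at the very start or end of the string.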

How does String.split work?

Why does the following code return the output below?
I would expect 2 and 3 to produce the same pieces as 1.
Log.d(TAG, " 1 ---------------------------");
String originalText = "hello. .hello1";
Pattern p = Pattern.compile("[a-zA-Z]+|\\s|\\W|\\d");
Matcher m = p.matcher(originalText);
while (m.find()) {
Log.d(TAG, m.group(0));
}
Log.d(TAG, "2 --------------------------- " + originalText);
String [] scrollString = p.split(originalText);
int i;
for (i=0; i<scrollString.length; i++)
Log.d(TAG, scrollString[i]);
Log.d(TAG, "3 --------------------------- " + originalText);
scrollString = originalText.split("[a-zA-Z]+|\\s|\\W|\\d");
for (i=0; i<scrollString.length; i++)
Log.d(TAG, scrollString[i]);
OUTPUT:
1 ---------------------------
hello
.
.
hello
1
2 ---------------------------
3 ---------------------------

No. 1 will find the pattern and return that, whereas No. 2 and 3 will return the text in between the found pattern (which serves as the delimiter in those cases).
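The difference between finding and splitting is easier to see on a smaller input. This is an illustrative sketch; the class name, method names and the sample string are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsSplit {
    // Collects the text that the pattern matches.
    static List<String> found(Pattern p, String s) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Collects the text between the matches; the matches act as delimiters.
    static List<String> between(Pattern p, String s) {
        return Arrays.asList(p.split(s));
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(found(digits, "a1b22c"));   // the digit runs
        System.out.println(between(digits, "a1b22c")); // the letters around them
    }
}
```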

Your subject doesn't match what you are asking.
The subject asks about String.split(), but you are calling Pattern.split(). Which one do you really want help with?
When using String.split(), you pass in the regular expression to apply to the string, not the string you want to split!
JavaDoc for String.split();
final String s = "this is the string I want to split";
final String[] sa = s.split(" ");
you are calling .split on p ( Pattern.split(); )
Pattern p = Pattern.compile("[a-zA-Z]+|\\s|\\W|\\d");
String [] scrollString = p.split(originalText);
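These two methods are called on different objects, but for the same regex they return the same result: the String.split(regex) documentation describes it as equivalent to Pattern.compile(regex).split(str).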

The split() methods don't add the matched part of the string (the delimiter) to the result array.
If you want the delimiters, you'll have to use lookahead and lookbehind (or use version 1).
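A small sketch of the lookaround approach mentioned above: splitting at the zero-width positions just before and just after each delimiter keeps the delimiters as separate elements. The delimiter set [,;] and the names are assumptions chosen for the example.

```java
import java.util.Arrays;

class SplitKeepingDelimiters {
    // Splits before each delimiter (lookahead) and after each delimiter (lookbehind).
    // Both assertions are zero-width, so no characters are removed from the result.
    static String[] splitKeeping(String s) {
        return s.split("(?<=[,;])|(?=[,;])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a,b;c")));
    }
}
```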

No. Every character in your string is covered by the split pattern, hence taken as something you don't want. Therefore, you get the empty result.
You can imagine that your pattern first finds "hello", then split hopes to find something, but alas!, it finds another "separation" character.
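This can be checked directly with the pattern and text from the question. In the sketch below (class and method names are illustrative), the default split drops trailing empty strings, which leaves nothing, while a negative limit keeps them and shows the empty pieces between the delimiters.

```java
import java.util.regex.Pattern;

class AllDelimitersDemo {
    static final Pattern P = Pattern.compile("[a-zA-Z]+|\\s|\\W|\\d");

    // Default behaviour: trailing empty strings are removed from the result.
    static int pieces(String text) {
        return P.split(text).length;
    }

    // A negative limit keeps every piece, including empty ones.
    static int piecesKeepingEmpty(String text) {
        return P.split(text, -1).length;
    }

    public static void main(String[] args) {
        String originalText = "hello. .hello1";
        System.out.println(pieces(originalText));
        System.out.println(piecesKeepingEmpty(originalText));
    }
}
```

Here the six matches (hello, ., a space, ., hello, 1) cover the whole string, so every piece between them is empty.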
