Parsing text using Regex

Parsing text using Regex - java

I am trying to parse a String that contains two key components: one gives the timing options and the other the position.
Here is what the text looks like:
KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif
The {iiii} is the position and the {ttt} is the timing options.
I need to separate the {ttt} and {iiii} out so I can get a full file name. For example, position 1 and time slice 1 = KB_H9Oct4GFP_20130305_p0000001t000000001z001c02.tif
So far, here is how I am parsing them:
int startTimeSlice = 1;
int startTile = 1;
String regexTime = "([^{]*)\\{([t]+)\\}(.*)";
Pattern patternTime = Pattern.compile(regexTime);
Matcher matcherTime = patternTime.matcher(filePattern);
if (!matcherTime.find() || matcherTime.groupCount() != 3)
{
throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
}
String timePrefix = matcherTime.group(1);
int tCount = matcherTime.group(2).length();
String timeSuffix = matcherTime.group(3);
String timeMatcher = timePrefix + "%0" + tCount + "d" + timeSuffix;
String timeFileName = String.format(timeMatcher, startTimeSlice);
String regex = "([^{]*)\\{([i]+)\\}(.*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(timeFileName);
if (!matcher.find() || matcher.groupCount() != 3)
{
throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
}
String prefix = matcher.group(1);
int iCount = matcher.group(2).length();
String suffix = matcher.group(3);
String nameMatcher = prefix + "%0" + iCount + "d" + suffix;
String fileName = String.format(nameMatcher, startTile);
Unfortunately my code is not working: it fails when checking whether the second matcher finds anything in timeFileName.
After the first regex check, timeFileName is 000000001z001c02.tif, so it is cutting off the beginning portion, including the {iiii}.
Unfortunately I cannot assume which group comes first ({iiii} or {ttt}), so I am trying to devise a solution that just handles {ttt} first and then processes {iiii}.
Also, here is another example of valid text that I am trying to parse: F_{iii}_{ttt}.tif

Steps to follow:
Find the string {ttt...} in the file name
Form a number format based on the number of "t" characters in that string
Find the string {iiii...} in the file name
Form a number format based on the number of "i" characters in that string
Use the String.replace() method to replace time and position
Here is the code:
String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
int startTimeSlice = 1;
int startTile = 1;
Pattern patternTime = Pattern.compile("(\\{[t]*\\})");
Matcher matcherTime = patternTime.matcher(filePattern);
if (matcherTime.find()) {
    String timePattern = matcherTime.group(0); // {ttt}
    NumberFormat timingFormat = new DecimalFormat(timePattern.replaceAll("t", "0")
            .substring(1, timePattern.length() - 1)); // 000
    Pattern patternPosition = Pattern.compile("(\\{[i]*\\})");
    Matcher matcherPosition = patternPosition.matcher(filePattern);
    if (matcherPosition.find()) {
        String positionPattern = matcherPosition.group(0); // {iiii}
        NumberFormat positionFormat = new DecimalFormat(positionPattern
                .replaceAll("i", "0").substring(1, positionPattern.length() - 1)); // 0000
        System.out.println(filePattern.replace(timePattern,
                timingFormat.format(startTimeSlice)).replace(positionPattern,
                positionFormat.format(startTile)));
    }
}

Okay, so after a bit of testing I found a way to handle the case:
For parsing the {ttt} I can use the regex: (.*)\\{t([t]+)\\}(.*)
Now this means I have to increment tCount by one to account for the t I grab from \\{t
Same goes for {iii}: (.*)\\{i([i]+)\\}(.*)

Your first pattern looks like this:
String regexTime = "([^{]*)\\{([t]+)\\}(.*)";
This finds a string consisting of a sequence of zero or more non-{ characters, followed by {t...t}, followed by other characters.
When your input is
KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif
the first substring that matches is
iiii}t00000{ttt}z001c02.tif
The { before the i's can't match, because you told it only to match non-{ characters. The result is that when you re-form the string to do the second match, it will start with iiii} and therefore won't match {iiii} like you're trying to do.
When you're looking for {ttt...}, I don't see any reason to exclude { or any other character from the first part of the string. So changing the regex to
"^(.*)\\{(t+)\\}(.*)$"
may be a simple way to fix this. Note that if you want to make sure you include the entire beginning of the string and the entire end of the string in your groups, you should include ^ and $ to match the beginning and end of the string, respectively; otherwise the matcher engine may decide not to include everything. In this case, it won't, but it's a good habit to get into anyway, because that makes things explicit and doesn't require anyone to know the difference between "greedy" and "reluctant" matching. Or use matches() instead of find(), since matches() automatically tries to match the entire string.
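As a minimal sketch of that fix, assuming the anchored pattern is used together with matches(): the class name TimePlaceholderDemo and the method fillTime are hypothetical names made up for this illustration, not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, written only to illustrate the anchored pattern.
class TimePlaceholderDemo {
    // Replaces the {ttt...} placeholder with a zero-padded number of the same width.
    static String fillTime(String filePattern, int timeSlice) {
        Matcher m = Pattern.compile("^(.*)\\{(t+)\\}(.*)$").matcher(filePattern);
        if (!m.matches()) {
            throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
        }
        int width = m.group(2).length();
        // Only the number goes through String.format, so a literal % elsewhere
        // in the file name cannot be misread as a format specifier.
        return m.group(1) + String.format("%0" + width + "d", timeSlice) + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(fillTime("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif", 1));
    }
}
```

The {iiii} placeholder is still present in the result, so the same approach can then be applied a second time with i in place of t.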

Perhaps an easier way to do this (as confirmed by http://regex101.com/r/vG7kY7) is
(\{i+\}).*(\{t+\})
You don't need the [] around a single character you are matching. Keep it simple. i+ means "one or more i's", and as long as the placeholders appear in the order given, this expression will work (with the first group capturing {iiii} and the second {ttt}).
In a Java string literal each backslash has to be doubled: "(\\{i+\\}).*(\\{t+\\})".
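Since the question says the order of the two placeholders is not known in advance, one possible alternative is to match either kind of placeholder with a single pattern and replace each one as it is found. The following is only a sketch under that assumption; PlaceholderFiller and fill are hypothetical names introduced for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: fills {iii...} and {ttt...} in whatever order they appear.
class PlaceholderFiller {
    // A run of i's or a run of t's inside braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(i+|t+)\\}");

    static String fill(String filePattern, int tile, int timeSlice) {
        Matcher m = PLACEHOLDER.matcher(filePattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = m.group(1);
            // The first letter of the run tells us which value to insert.
            int value = run.charAt(0) == 'i' ? tile : timeSlice;
            String padded = String.format("%0" + run.length() + "d", value);
            m.appendReplacement(out, Matcher.quoteReplacement(padded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fill("F_{iii}_{ttt}.tif", 2, 3));
        System.out.println(fill("F_{ttt}_{iii}.tif", 2, 3));
    }
}
```

Because each placeholder is handled independently, the result does not depend on which one comes first.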

Related

Parse out specific characters from java string

I have been trying to drop specific values from a String holding JDBC query results and column metadata. The format of the output is:
[{I_Col1=someValue1, I_Col2=someVal2}, {I_Col3=someVal3}]
I am trying to get it into the following format:
I_Col1=someValue1, I_Col2=someVal2, I_Col3=someVal3
I have tried just dropping everything before the "=", but some of the "someVal" data has "=" in them. Is there any efficient way to solve this issue?
below is the code I used:
for(int i = 0; i < finalResult.size(); i+=modval) {
String resulttemp = finalResult.get(i).toString();
String [] parts = resulttemp.split(",");
//below is only for
for(int z = 0; z < columnHeaders.size(); z++) {
String replaced ="";
replaced = parts[z].replace("*=", "");
System.out.println("Replaced: " + replaced);
}
}

You don't need any splitting here!
You can use replaceAll() and the power of regular expressions to simply replace all occurrences of those unwanted characters, like in:
someString.replaceAll("[\\[\\]\\{\\}]", "")
When you apply that to your strings, the resulting string should exactly look like required.

You could use a regular expression to replace the square and curly brackets like this [\[\]{}]
For example:
String s = "[{I_Col1=someValue1, I_Col2=someVal2}, {I_Col3=someVal3}]";
System.out.println(s.replaceAll("[\\[\\]{}]", ""));
That would produce the following output:
I_Col1=someValue1, I_Col2=someVal2, I_Col3=someVal3
which is what you expect in your post.
A better approach, however, might be to match instead of replace, if you know the character set that will be in the position of 'someValue'. Then you can design a regex that will match this particular string in such a way that no matter what separates I_Col1=someValue1 from the rest of the String, you will be able to extract it :-)
EDIT:
With regards to the matching approach, given that the value following I_Col1= consists of letters, digits and _ (which is what \w matches) you could use this pattern: (I_Col\d=\w+),?
For example:
String s = "[{I_Col1=someValue1, I_Col2=someVal2}, {I_Col3=someVal3}]";
Matcher m = Pattern.compile("(I_Col\\d=\\w+),?").matcher(s);
while (m.find())
System.out.println(m.group(1));
This will produce:
I_Col1=someValue1
I_Col2=someVal2
I_Col3=someVal3

You could chain four calls to replaceAll on the string, one per character to remove.
String query = "[{I_Col1=someValue1, I_Col2=someVal2}, {I_Col3=someVal3}]";
String queryWithoutBracesAndBrackets = query.replaceAll("\\{", "").replaceAll("\\}", "").replaceAll("\\]", "").replaceAll("\\[", "");
Or you could use a single regexp with alternation if you want the code to be more understandable.
String query = "[{I_Col1=someValue1, I_Col2=someVal2}, {I_Col3=someVal3}]";
String queryWithoutBracesAndBrackets = query.replaceAll("\\[|\\]|\\{|\\}", "");

Replacing Strings with a number in it without a for loop

So I currently have this code;
for (int i = 1; i <= this.max; i++) {
in = in.replace("{place" + i + "}", this.getUser(i)); // Get the place of a user.
}
Which works well, but I would like to just keep it simple (using Pattern matching)
so I used this code to check if it matches;
System.out.println(StringUtil.matches("{place5}", "\\{place\\d\\}"));
StringUtil's matches;
public static boolean matches(String string, String regex) {
if (string == null || regex == null) return false;
Pattern compiledPattern = Pattern.compile(regex);
return compiledPattern.matcher(string).matches();
}
Which returns true. Then comes the next part I need help with: replacing the {place5} so I can parse the number. I could replace "{place" and "}", but if there were several placeholders in a string ("{place5} {username}") I can't do that anymore, as far as I'm aware. If you know a simple way to do that, please let me know; if not, I can just stick with the for-loop.

then comes the next part I need help with, replacing the {place5} so I can parse the number
In order to obtain the number after {place, you can use
s = s.replaceAll(".*\\{place(\\d+)}.*", "$1");
The regex matches an arbitrary number of characters before the string we are searching for, then {place, then we match and capture 1 or more digits with (\d+), and then we match the rest of the string with .*. Note that if the string has newline symbols, you should prepend (?s) to the pattern. $1 in the replacement pattern "restores" the value we need.
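To replace every {placeN} in one pass instead of extracting a single number, one option is to let the Matcher walk the string and look up each captured number. This is a sketch only: PlaceReplacer, replacePlaces and the userForPlace parameter are hypothetical names, with userForPlace standing in for the question's this.getUser(i).

```java
import java.util.function.IntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class PlaceReplacer {
    private static final Pattern PLACE = Pattern.compile("\\{place(\\d+)\\}");

    // userForPlace is an assumed stand-in for the question's this.getUser(i).
    static String replacePlaces(String in, IntFunction<String> userForPlace) {
        Matcher m = PLACE.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int place = Integer.parseInt(m.group(1));
            // quoteReplacement keeps $ and \ in user names from being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(userForPlace.apply(place)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replacePlaces("{place5} {username} {place12}", i -> "user" + i));
    }
}
```

There is still a loop, but it runs once per placeholder actually present in the string rather than once per possible place up to max, and other placeholders such as {username} are left alone.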

Getting the last part of the string

I have a string :
"id=40114662&mode=Edit&reminderId=44195234"
All I want from this string is the final number 44195234. I can't use:
String reminderIdFin = reminderId.substring(reminderId.lastIndexOf("reminderId=")+1);
as I can't have the = sign as the point where it splits the string. Is there any other way?

Try String.split(),
String reminderIdFin = reminderId.split("=")[3];

You can use the indexOf() method to find where this part starts:
int index = reminderId.indexOf("Id=") + 3;
The plus 3 makes it jump over these three characters. Then you can use substring to pull out your wanted string:
String reminderIdFin = reminderId.substring(index);

Remove everything else and you'll be left with your target content:
String reminderIdFin = reminderId.replaceAll(".*=", "");
The regex matches everything up to the last = (the .* is "greedy").
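The attempt in the question fails because lastIndexOf returns the index where "reminderId=" starts, and adding 1 skips only a single character; the whole marker has to be skipped. A small sketch of that idea without regex follows, where LastParam and valueAfter are hypothetical names chosen for the example.

```java
// Hypothetical helper class for illustration.
class LastParam {
    // Returns the text after the last "key=" up to the next '&' or the end of the string.
    static String valueAfter(String query, String key) {
        String marker = key + "=";
        int start = query.lastIndexOf(marker);
        if (start < 0) {
            throw new IllegalArgumentException("No " + marker + " in " + query);
        }
        start += marker.length(); // skip the whole marker, not just one character
        int end = query.indexOf('&', start);
        return end < 0 ? query.substring(start) : query.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(valueAfter("id=40114662&mode=Edit&reminderId=44195234", "reminderId"));
    }
}
```

Unlike taking everything after the last =, this also works when the wanted parameter is not the last one in the string.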

How to find expression, evaluate and replace in Java?

I have the following expressions inside a String (that comes from a text file):
{gender=male#his#her}
{new=true#newer#older}
And I would like to:
1. Find the occurrences of that pattern {variable=value#if_true#if_false}
2. Temporarily store those variables in fields such as variableName, variableValue, ifTrue, ifFalse as Strings.
3. Evaluate an expression based on variableName and variableValue according to local variables (like String gender = "male" and String new = "true").
4. And finally replace the pattern with ifTrue or ifFalse according to (3).
Should I use String.replaceAll() in some way, or how do I look for this expression and save the strings that are inside? Thanks for your help
UPDATE
It would be something like PHP's preg_match_all.
UPDATE 2
I solved this by using Pattern and Matcher as I post as an answer below.

If the strings always take this format, then string.split("#") is probably the way to go. This will return an array of the strings between the '#' separators (e.g. "{gender=male#his#her}".split("#") returns {"{gender=male", "his", "her}"}; use substring to remove the first and last character to get rid of the braces).

After struggling for a while I managed to get this working using Pattern and Matcher as follows:
// \{variable=value#if_true#if_false\}
Pattern pattern = Pattern.compile(Pattern.quote("{") + "([\\w\\s]+)=([\\w\\s]+)#([\\w\\s]+)#([\\w\\s]+)" + Pattern.quote("}"));
Matcher matcher = pattern.matcher(doc);
// if we'll make multiple replacements we should keep an offset
int offset = 0;
// perform the search
while (matcher.find()) {
// by default, replacement is the same expression
String replacement = matcher.group(0);
String field = matcher.group(1);
String value = matcher.group(2);
String ifTrue = matcher.group(3);
String ifFalse = matcher.group(4);
// verify if field is gender
if (field.equalsIgnoreCase("Gender")) {
replacement = value.equalsIgnoreCase("Female")?ifTrue:ifFalse;
}
// replace the string
doc = doc.substring(0, matcher.start() + offset) + replacement + doc.substring(matcher.end() + offset);
// adjust the offset
offset += replacement.length() - matcher.group(0).length();
}
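The manual offset bookkeeping can be avoided with Matcher.appendReplacement, which builds the output as the matches are found. The sketch below also looks the variable up in a map instead of hard-coding the gender check. That is an assumption about the semantics intended in the question (pick if_true when the variable's current value equals the value in the expression), and ConditionalTemplate, evaluate and the variables map are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class ConditionalTemplate {
    // {variable=value#if_true#if_false}
    private static final Pattern EXPR =
            Pattern.compile("\\{([\\w\\s]+)=([\\w\\s]+)#([\\w\\s]+)#([\\w\\s]+)\\}");

    static String evaluate(String doc, Map<String, String> variables) {
        Matcher m = EXPR.matcher(doc);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String actual = variables.get(m.group(1));
            String replacement;
            if (actual == null) {
                replacement = m.group(0); // unknown variable: leave the expression unchanged
            } else {
                replacement = actual.equalsIgnoreCase(m.group(2)) ? m.group(3) : m.group(4);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = new HashMap<>();
        variables.put("gender", "male");
        variables.put("new", "true");
        System.out.println(evaluate("He lost {gender=male#his#her} {new=true#newer#older} phone.", variables));
    }
}
```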

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

So, I need to write a compiler scanner for a homework, and thought it'd be "elegant" to use regex. Fact is, I have seldom used them before, and it was a long time ago. So I forgot most of the stuff about them and needed to have a look around. I used them successfully for the identifiers (or at least I think so, I still need to do some further tests but for now they all look ok), but I have a problem with the number recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
    t.str = "" + ch; // force conversion char --> String
    final Pattern pattern = Pattern.compile("[0-9]*");
    nextCh(); // get next char and check if it is a digit
    Matcher match = pattern.matcher("" + ch);
    while (match.find() && ch != EOF) {
        t.str += ch;
        nextCh();
        match = pattern.matcher("" + ch);
    }
    t.kind = Kind.number;
    try {
        int value = Integer.parseInt(t.str);
        t.val = value;
    } catch(NumberFormatException e) {
        error(t, Message.BIG_NUM, t.str);
    }
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex expression.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
t.str += ch;
nextCh();
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
EDIT: turns out my regex also doesn't work for the identifiers recognition (again, includes blanks), so I had to switch to a system similar to my "solution" (while with a lot of conditions). Guess I'll need to study the regex again :O

I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more digits anywhere in the string, because of the asterisk. The space is reported as a match because find() succeeds with an empty match (zero digits) at the start of the input. If you wanted to make sure the char was a digit, you would have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match one digit:
Pattern.compile("^[0-9]$");

You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
So in other words, find only needs some subsequence of the input to match, and because [0-9]* can match the empty string it succeeds on any input, even a single space; if you use matches, the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true

Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");
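If the scanner has the whole input available as a String, another possibility is to run the pattern over that input once instead of testing one character at a time. This is only a sketch under that assumption, since the question's scanner reads via nextCh(), and NumberScanner and numbers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class NumberScanner {
    // Collects every maximal run of digits in the input as a separate token.
    static List<String> numbers(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(numbers("123 456")); // [123, 456]
    }
}
```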
