We receive XML containing invalid durations such as PT10HMS (note the missing numbers before M and S). I currently handle this by reading the file and walking the duration string character by character, inserting a 0 between any two adjacent letters (except between P and T). Is there a more elegant solution, perhaps using a regex with sed or something else?
Thanks for any suggestions.
An idea for a Java solution here (sure sed can be used too).
String incorrectDuration = "PT10HMS";
String dur = incorrectDuration.replaceAll("(?<!\\d+)[HMS]", "0$0");
This produces
PT10H0M0S
Personally I would prefer deleting the letters that do not have a number in front of them:
String dur = incorrectDuration.replaceAll("(?<!\\d+)[HMS]", "");
Now I get
PT10H
In both cases Duration.parse(dur) works and gives the expected result.
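(?<!\\d+) is a negative lookbehind: with this the regex only matches if the H, M or S is not preceded by a string of digits. Since a single digit directly before the letter is enough to block the match, (?<!\\d) selects exactly the same positions.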
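To check both variants end to end, here is a minimal sketch. The class and method names are made up for illustration, and it uses the single-digit lookbehind (?<!\\d) rather than (?<!\\d+).

```java
import java.time.Duration;

class DurationRepair {
    // Insert a 0 before every H, M or S that has no digit directly before it.
    static String padWithZeros(String s) {
        return s.replaceAll("(?<!\\d)[HMS]", "0$0");
    }

    // Delete every H, M or S that has no digit directly before it.
    static String dropEmptyUnits(String s) {
        return s.replaceAll("(?<!\\d)[HMS]", "");
    }

    public static void main(String[] args) {
        String incorrect = "PT10HMS";
        System.out.println(padWithZeros(incorrect));   // PT10H0M0S
        System.out.println(dropEmptyUnits(incorrect)); // PT10H
        // Both repaired strings parse to the same ten-hour duration.
        System.out.println(Duration.parse(padWithZeros(incorrect))
                .equals(Duration.parse(dropEmptyUnits(incorrect))));
    }
}
```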
Edit: I am probably overdoing it in the following. I was just curious how I could produce my preferred string also in the case where you have got for example PTHMS as you mentioned in the comment. For production code you will probably want to stick with the simpler solution above.
String durationString = "PTHMS";
// if no digits, insert 0 before last letter
if (!durationString.matches(".*\\d.*")) {
    durationString = durationString.replaceFirst("(?=[HMS]$)", "0");
}
// then delete letters that do not have a digit before them
durationString = durationString.replaceAll("(?<!\\d)[HMS]", "");
This produces
PT0S
(?=[HMS]$) is a lookahead. It matches the empty string but only if this empty string is followed by either H, M or S and then the end of the string. So replacing this empty string with 0 gives us PTHM0S. Confident that there is now at least one digit in the string, we can proceed to delete letters that don’t have a digit before them.
It still wouldn't work if you had just PT. As I understand it, this doesn't happen. If it did, you would probably prefer something like durationString = "PT0S"; inside the if statement instead.
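Putting the pieces together, a small helper could cover PT10HMS, PTHMS and a bare PT alike. This is only a sketch: the name normalize is hypothetical, and it assumes that a duration string without any digit should be read as a zero duration.

```java
import java.time.Duration;

class DurationNormalizer {
    static String normalize(String s) {
        // Assumption: no digits at all (e.g. "PTHMS" or "PT") means zero.
        if (!s.matches(".*\\d.*")) {
            return "PT0S";
        }
        // Otherwise delete units that have no digit directly before them.
        return s.replaceAll("(?<!\\d)[HMS]", "");
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"PT10HMS", "PTHMS", "PT", "PTH5MS"}) {
            String fixed = normalize(raw);
            // Duration.parse throws DateTimeParseException if the repair failed.
            System.out.println(raw + " -> " + fixed + " -> " + Duration.parse(fixed));
        }
    }
}
```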
I have three different sentences which contain repetitive parts.
I want to merge three different regex groups into one, and then replace all matches with white space.
How should I merge these groups?
String locked = "LOCKED (center)"; //LOCKED() - always the same part
String idle = "Idle (second)"; // Idle() - always the same part
String OK = "example-OK"; // -OK - always the same part
I've built three regular expressions, but they are separate. How should I merge them?
String forLocked = locked.replaceAll("^LOCKED\\s\\((.*)\\)", "$1");
String forIdle = idle.replaceAll("^Idle\\s\\((.*)\\)", "$1");
String forOK = OK.replaceAll("(.*)\\-OK", "$1");
I think this technically works, but it doesn't "feel great."
private static final String REGEX =
"^((Idle|LOCKED) *)?\\(?([a-z]+)\\)?(-OK)?$";
... your code ...
System.out.println(locked.replaceAll(REGEX, "$3"));
System.out.println(idle.replaceAll(REGEX, "$3"));
System.out.println(OK.replaceAll(REGEX, "$3"));
Output is:
center
second
example
Breaking down the expression:
^((Idle|LOCKED) *)? - Possibly starts with Idle or LOCKED followed by zero or more spaces
\\(?([a-z]+)\\)? - Has a sequence of lowercase characters, possibly embedded inside optional parentheses (this sequence is the group we want to keep)
(-OK)?$ - Possibly ends with the literal -OK.
There are still some issues though. The optional parentheses aren't in any way tied together, for example. Also, this would give false positives for compounds like Idle (second)-OK --> second.
I had a more stringent regex at first, but one of the additional challenges is to keep a consistent match index on the group you want to replace with (here, $3.) In other words, there's a whole set of regex where if you could use, say $k and $j in different situations, it would be easier. But, that goes against the whole point of having a single regex to begin with (if you need some pre-existing knowledge of the input you're about to match.) Better would be to assume that we know nothing about what is inside the identifiers locked, idle, and OK.
You can merge them with | like this:
String regex = "^LOCKED\\s\\((.*)\\)|^Idle\\s\\((.*)\\)|(.*)\\-OK$";
String forLocked = locked.replaceAll(regex, "$1");
String forIdle = idle.replaceAll(regex, "$2");
String forOK = OK.replaceAll(regex, "$3");
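With the alternation above, the caller still has to know which group number belongs to which input. One way around that, sketched here with a made-up class name, relies on the fact that Java substitutes the empty string for a group that took no part in the match, so "$1$2$3" works for all three inputs.

```java
class MergedRegex {
    private static final String REGEX =
            "^LOCKED\\s\\((.*)\\)|^Idle\\s\\((.*)\\)|(.*)\\-OK$";

    static String strip(String input) {
        // Only one of the three groups participates in any given match;
        // the other two expand to "" in the replacement.
        return input.replaceAll(REGEX, "$1$2$3");
    }

    public static void main(String[] args) {
        System.out.println(strip("LOCKED (center)")); // center
        System.out.println(strip("Idle (second)"));   // second
        System.out.println(strip("example-OK"));      // example
    }
}
```

Input that matches none of the three alternatives is returned unchanged.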
I have numbers like these that need their leading zeros removed.
Here is what I need:
00000004334300343 -> 4334300343
0003030435243 -> 3030435243
I can't figure this out as I'm new to regular expressions. This does not work:
(^0)
You're almost there. You just need a quantifier:
str = str.replaceAll("^0+", "");
It replaces one or more occurrences of 0 at the beginning of the string with the empty string. The + quantifier means "one or more" (similarly, the * quantifier means "zero or more"), and the caret ^ anchors the match to the beginning of the string.
The accepted solution fails if you need to get "0" from "00", because it turns an all-zero string into the empty string. This one handles that case:
str = str.replaceAll("^0+(?!$)", "");
^0+(?!$) means match one or more zeros if it is not followed by end of string.
Thank you to the commenter - I have updated the formula to match the description from the author.
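The difference between the two patterns only shows up for all-zero input. A quick sketch (class and method names are illustrative only):

```java
class LeadingZeros {
    // Plain version: removes every leading zero, even if nothing is left.
    static String stripAll(String s) {
        return s.replaceAll("^0+", "");
    }

    // Lookahead version: the run of zeros must not reach the end of the string,
    // so the engine backtracks and leaves the last zero in place.
    static String stripKeepLast(String s) {
        return s.replaceAll("^0+(?!$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripKeepLast("00000004334300343")); // 4334300343
        System.out.println("[" + stripAll("00") + "]");          // []
        System.out.println("[" + stripKeepLast("00") + "]");     // [0]
    }
}
```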
If you know that the input strings contain only digits, then you can do:
String s = "00000004334300343";
System.out.println(Long.valueOf(s));
// 4334300343
By converting to Long it will automatically strip off all leading zeroes.
Another solution (might be more intuitive to read)
str = str.replaceFirst("^0+", "");
^ - match the beginning of a line
0+ - match the zero digit character one or more times
An exhaustive list of pattern constructs can be found in the Pattern documentation.
\b0+\B will do the job. \b anchors the match to a word boundary, 0+ matches a sequence of one or more zeros, and \B requires the match to end at a position that is not a word boundary, so the last 0 is kept when the input consists only of zeros (00...000).
The correct regex to strip leading zeros is
str = str.replaceAll("^0+", "");
This regex matches one or more 0 characters at the beginning of the string.
There is no reason to worry about using replaceAll here: the ^ (beginning of input) anchor ensures the replacement happens at most once.
Alternatively, you can use Java's built-in number parsing to do the same:
String str = "00000004334300343";
long number = Long.parseLong(str);
// outputs 4334300343
The leading zeros will be stripped for you automatically.
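One caveat with the parsing approach: Long.parseLong throws a NumberFormatException when the value does not fit into a long or the string contains anything other than digits (and an optional sign). If the inputs can be longer than that, java.math.BigInteger is one alternative. A sketch, with a made-up class name:

```java
import java.math.BigInteger;

class NumericStrip {
    static String strip(String digits) {
        // BigInteger has no upper bound, unlike long; it still throws
        // NumberFormatException for non-numeric input.
        return new BigInteger(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("00000004334300343")); // 4334300343
        System.out.println(strip("000"));               // 0
        System.out.println(strip("000123456789012345678901234567890"));
    }
}
```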
I know this is an old question, but I think the best way to do this is actually
str = str.replaceAll("(^0+)?(\\d+)", "$2");
The reason I suggest this is because it splits the string into two groups. The second group is at least one digit. The first group matches 1 or more zeros at the start of the line. However, the first group is optional, meaning that if there are no leading zeros, you just get all of the digits. And, if str is only a zero, you get exactly one zero (because the second group must match at least one digit).
So if it's any number of 0s, you get back exactly one zero. If it starts with any number of 0s followed by any other digit, you get no leading zeros. If it starts with any other digit, you get back exactly what you had in the first place.
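The three cases described above can be checked directly. This is just a sketch with an illustrative class name; note that the backslash in \d has to be doubled inside a Java string literal.

```java
class TwoGroupStrip {
    static String strip(String s) {
        // Group 1: optional leading zeros. Group 2: at least one digit, which
        // forces group 1 to give back one zero when the input is all zeros.
        return s.replaceAll("(^0+)?(\\d+)", "$2");
    }

    public static void main(String[] args) {
        System.out.println(strip("000"));  // 0
        System.out.println(strip("0012")); // 12
        System.out.println(strip("12"));   // 12
    }
}
```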
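Here is a simple solution for JavaScript (note that this is not Java syntax, which is what the question asks about):
str = str.replaceAll(/^0+/g, "");
In JavaScript, the global flag g is required when calling replaceAll with a regex.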
I would like to know if it is possible (and if possible, how can i implement it) to manipulate an String value (Java) using one regex.
For example:
String content = "111122223333";
String regex = "?";
Expected result: "1111 2222 3333 ##";
With one regex only, I don't think it is possible. But you can:
first, replace (?<=(.))(?!\1) with a space;
then, append "##" (the regex also matches after the last character, where nothing follows, so the separating space is already there).
ie:
Pattern p = Pattern.compile("(?<=(.))(?!\\1)");
String ret = p.matcher(input).replaceAll(" ") + "##";
If what you meant was to separate all groups, then drop the second operation.
Explanation: (?<=...) is a positive lookbehind, and (?!...) a negative lookahead. Here, you are telling that you want to find a position where there is one character behind, which is captured, and where the same character should not follow. And if so, replace with a space. Lookaheads and lookbehinds are anchors, and like all anchors (including ^, $, \A, etc), they do not consume characters, this is why it works.
OK, since the OP has redefined the problem (i.e., a group of 12 digits which should be separated into 3 groups of 4, then followed by ##), the solution becomes this:
Pattern p = Pattern.compile("(?<=\\d)(?=(?:\\d{4})+$)");
String ret = p.matcher(input).replaceAll(" ") + " ##";
The regex changes quite a bit:
(?<=\d): there should be one digit behind;
(?=(?:\d{4})+$): there should be one or more groups of 4 digits afterwards, until the end of line (the (?:...) is a non capturing grouping -- not sure it really makes a difference for Java).
Validating that the input is 12 digits long can easily be done with methods which are not regex-related at all. And this validation is, in fact, necessary: unfortunately, this regex will also turn 12345 into 1 2345, but there is no way around that, for the reason that lookbehinds cannot match arbitrary length regexes... Except with the .NET languages. With them, you could have written:
(?<=^(?:\d{4})+)(?=(?:\d{4})+$)
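Back in Java, the length validation mentioned above can be combined with the replacement in one small helper. This is a sketch only: the class name and the choice to throw IllegalArgumentException on bad input are assumptions, not part of the original answer.

```java
import java.util.regex.Pattern;

class DigitGrouper {
    // A position with a digit behind it and a multiple of 4 digits ahead of it.
    private static final Pattern GAP =
            Pattern.compile("(?<=\\d)(?=(?:\\d{4})+$)");

    static String group(String input) {
        // Validate without lookarounds: exactly 12 digits, nothing else.
        if (!input.matches("\\d{12}")) {
            throw new IllegalArgumentException("expected exactly 12 digits: " + input);
        }
        return GAP.matcher(input).replaceAll(" ") + " ##";
    }

    public static void main(String[] args) {
        System.out.println(group("111122223333")); // 1111 2222 3333 ##
    }
}
```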
I'm trying to count the number of 0s in a string of numbers. Not exactly just the character 0, but the number zero. e.g. I want to count 0, 0.0, 0.000 etc. The numbers will be separated by spaces, e.g.:
1.0 5.0 1 5.4 12 0.1 14.2675 0.0 0.00005
A simple search for " 0" in the string nearly does the job (I have to first insert a leading space in the string for this to work - in case the first number is a zero). However it doesn't work for numbers in the form 0.x e.g. 0.1, 0.02 etc. I suppose I need to check for 0 and see if there is a decimal point and then non-zero numbers after it, but I have no idea how to do that. Something like:
" 0*|(0\\.(?!\\[1-9\\]))"
Does anyone have any ideas how I might accomplish this, preferably using a regular expression? Or, if it's easier, I'm happy to count the number of non-zero elements. Thank you.
NOTE: I'm using split in Java to do this (split the string using the regular expression and then count with .length).
How about this:
(?<=^|\s)[0.]+(?=\s|$)
Explanation:
(?<=^|\s) # Assert position after a space or the start of the string
[0.]+ # Match one or more zeroes/decimal points
(?=\s|$) # Assert position before a space or the end of the string
Remember to double the backslashes in Java strings.
You should instead split by whitespace and use Double.parseDouble() on each fragment, then if it indeed is a double, compare it to 0.
String[] parts = numbers.split("\\s+");
int numZeros = 0;
for (String s : parts) {
    try {
        if (Double.parseDouble(s) == 0) {
            numZeros++;
        }
    } catch (NumberFormatException e) {
        // not a number: ignore this fragment
    }
}
There is no easy solution for the regex anyway. The easiest thought would be to use the \b boundary operator, but it fails badly. Also, the Double.parseDouble means that things like -0 are supported too.
split() is not the solution to this problem, though it can be part of the solution, as Antti's answer demonstrated. You'll find it much easier to match the zero-valued numbers with find() in a loop and count the matches, like this:
String s = "1.0 5.0 1 5.4 12 0.1 14.2675 0.0 0.00005 0. .0 0000 -0.0";
Pattern p = Pattern.compile("(?<!\\S)-?(?:0+(?:\\.?0*)|\\.0+)(?!\\S)");
Matcher m = p.matcher(s);
int n = 0;
while (m.find()) {
    System.out.printf("%n%s ", m.group());
    n++;
}
System.out.printf("%n%n%d zeroes total%n", n);
output:
0.0
0.
.0
0000
-0.0
5 zeroes total
This is how Tim meant for you to use the regex in his answer, too (I think). Breaking down my regex, we have:
(?<!\\S) is a negative lookbehind that matches a position that's not preceded by a non-whitespace character. It's equivalent to Tim's positive lookbehind, (?<=^|\s), which explicitly matches the beginning of the string or right after a whitespace character.
-?(?:0+(?:\\.?0*)|\\.0+) matches an optional minus sign followed by at least one zero and at most one decimal point.
(?!\\S) is equivalent to (?=\s|$) - it matches right before a whitespace character or at the end of the string.
The lookbehind and lookahead ensure that you always match the whole token, just like you would if you were splitting on whitespace. Without those, it would also match zeros that are part of non-zero tokens like 1230.0456.
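To see what the lookarounds buy you, the same core pattern can be compiled with and without them and the matches counted. The class, constant and method names below are hypothetical; the pattern itself is the one from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroCounter {
    private static final String ZERO = "-?(?:0+(?:\\.?0*)|\\.0+)";
    // Whole-token version: nothing but whitespace (or the string edge) around.
    static final Pattern WHOLE_TOKEN =
            Pattern.compile("(?<!\\S)" + ZERO + "(?!\\S)");
    // Unanchored version: may match inside a larger number.
    static final Pattern ANYWHERE = Pattern.compile(ZERO);

    static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(WHOLE_TOKEN, "1230.0456")); // 0
        System.out.println(count(ANYWHERE, "1230.0456"));    // 1 (the "0.0" inside)
    }
}
```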
EDIT (in response to a comment): My main objection to using split() is that it's needlessly convoluted. You're creating an array of strings comprising all the parts of the string you don't care about, then doing some math on the array's length to get the information you want. Sure, it's only one line of code, but it does a very poor job of communicating its intent. Anyone who's not already familiar with the idiom could have a very difficult time sussing out what it does.
Then there's the trailing empty tokens issue: if you use the split technique on my revised sample string you'll get a count of 4, not 5. That's because the last chunk of the string matches the split regex, meaning the last token should be an empty string. But Java (following Perl's lead) silently drops trailing empty tokens by default. You can override that behavior by passing a negative integer as the second argument, but what if you forget to do that? It's a very easy mistake to make, and potentially a very difficult one to troubleshoot.
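The trailing-empty-token behavior is easy to reproduce in isolation. A minimal sketch with a comma as the separator (the class and method names are illustrative):

```java
class SplitTrailing {
    // Default split: trailing empty strings are dropped from the result.
    static int defaultCount(String s) {
        return s.split(",").length;
    }

    // Negative limit: trailing empty strings are kept.
    static int keepEmptyCount(String s) {
        return s.split(",", -1).length;
    }

    public static void main(String[] args) {
        System.out.println(defaultCount("a,b,,"));   // 2
        System.out.println(keepEmptyCount("a,b,,")); // 4
    }
}
```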
As for performance, the two approaches are virtually identical in speed (I don't know about memory they use). It's not likely to be a problem when working with reasonably-sized texts.
I need to convert a string like
"string"
to
"*s*t*r*i*n*g*"
What's the regex pattern? Language is Java.
You want to match an empty string, and replace with "*". So, something like this works:
System.out.println("string".replaceAll("", "*"));
// "*s*t*r*i*n*g*"
Or better yet, since the empty string can be matched literally without regex, you can just do:
System.out.println("string".replace("", "*"));
// "*s*t*r*i*n*g*"
Why this works
It's because any string startsWith(""), endsWith(""), and contains(""). Between any two characters in any string, there's an empty string. In fact, there is an infinite number of empty strings at these locations.
(And yes, this is true for the empty string itself. That is an "empty" string contains itself!).
The regex engine and String.replace automatically advance the index when looking for the next match in these kinds of cases, to prevent an infinite loop.
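These claims about the empty string can be verified with a few lines; the class and method names here are made up for the example.

```java
class EmptyStringFacts {
    static String starred(String s) {
        // Literal (non-regex) replacement of the empty string.
        return s.replace("", "*");
    }

    public static void main(String[] args) {
        System.out.println("string".startsWith("")); // true
        System.out.println("string".endsWith(""));   // true
        System.out.println("string".contains(""));   // true
        System.out.println("".contains(""));         // true
        System.out.println(starred("string"));       // *s*t*r*i*n*g*
    }
}
```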
A "real" regex solution
There's no need for this, but it's shown here for educational purpose: something like this also works:
System.out.println("string".replaceAll(".?", "*$0"));
// "*s*t*r*i*n*g*"
This works by matching "any" character with ., and replacing it with * and that character, by backreferencing to group 0.
To add the asterisk after the last character, we allow . to be matched optionally with .?. This works because ? is greedy and will always take a character if possible, i.e. everywhere except at the very end of the string, where it matches the empty string.
If the string may contain newline characters, then use Pattern.DOTALL/(?s) mode.
References
regular-expressions.info/Dot Matches (Almost) Any Character and Grouping and Backreferences
I think "" is the regex you want.
System.out.println("string".replaceAll("", "*"));
This prints *s*t*r*i*n*g*.
If this is all you're doing, I wouldn't use a regex:
public static String glitzItUp(String text) {
    return insertPeriodically(text, "*", 1);
}
Putting char into a java string for each N characters
public static String insertPeriodically(
        String text, String insert, int period)
{
    StringBuilder builder = new StringBuilder(
            text.length() + insert.length() * (text.length() / period) + 1);
    int index = 0;
    while (index <= text.length())
    {
        builder.append(insert);
        builder.append(text.substring(index,
                Math.min(index + period, text.length())));
        index += period;
    }
    return builder.toString();
}
Another benefit (besides simplicity) is that it's about ten times faster than a regex.
Just to be a jerk, I'm going to say use J:
I've spent a school year learning Java, and taught myself a bit of J over the course of the summer. If you're going to be doing this for yourself, it's probably most productive to use J, simply because this whole inserting-an-asterisk thing is easily done with one simple verb definition using one loop.
asterisked =: 3 : 0
i =. 0
running_String =. '*'
while. i < #y do.
NB. #y returns tally, or number of items in y: right operand to the verb
running_String =. running_String, (i{y) , '*'
i =. >: i
end.
]running_String
)
This is why I would use J: I know how to do this, and have only studied the language for a couple months loosely. This isn't as succinct as the whole .replaceAll() method, but you can do it yourself quite easily and edit it to your specifications later. Feel free to delete this/ troll this/ get inflamed at my suggestion of J, I really don't care: I'm not advertising it.