Ok... I have an unsatisfactory solution to a problem.
The problem is I have input like so:
{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F
and need output like so:
¹⁹F(³He,t)¹⁹Ne(p)¹⁸F
I first use a series of replacements to split each {sup xx} section into {sup x}{sup x}, and then use a regex to match each of those and replace the digits with their single-character Unicode superscript equivalents. The "problem" is that the {sup} sections can contain numbers 1, 2 or 3 digits long (maybe more, I don't know), and I want to "expand" them into separate {sup} sections with one digit each. (I also have the same problem with {sub} for subscripts.)
My current solution looks like this (in Java):
retval = retval.replaceAll("\\{sup ([1-9])([0-9])\\}", "{sup $1}{sup $2}");
retval = retval.replaceAll("\\{sup ([1-9])([0-9])([0-9])\\}", "{sup $1}{sup $2}{sup $3}");
My question: is there a way to do this in a single pass no matter how many digits ( or at least some reasonable number ) there are?
Yes, but it may be a bit of a hack, and you'll have to be careful it doesn't overmatch!
Regex:
(?:\{sup\s)?(\d)(?=\d*})}?
Replacement String:
{sup $1}
A short explanation:
(?: | start non-capturing group 1
\{ | match the character '{'
sup | match the substring: "sup"
\s | match any white space character
) | end non-capturing group 1
? | ...and repeat it once or not at all
( | start group 1
\d | match any character in the range 0..9
) | end group 1
(?= | start positive look ahead
\d | match any character in the range 0..9
* | ...and repeat it zero or more times
} | match the substring: "}"
) | stop positive look ahead
} | match the substring: "}"
? | ...and repeat it once or not at all
In plain English: it matches a single digit, only when looking ahead there's a } with optional digits in between. If possible, the substrings {sup and } are also replaced.
EDIT:
A better one is this:
(?:\{sup\s|\G)(\d)(?=\d*})}?
That way, digits like in the string "set={123}" won't be replaced. The \G in my second regex matches the spot where the previous match ended.
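As a rough sketch of how the second regex could be wired into Java: the class and method names below are made up for illustration, and the closing braces are escaped as \\} purely as a precaution.

```java
public class SupSplitter {
    // Either "{sup " or the end of the previous match (\G), then one digit
    // that is followed by optional digits and a closing brace.
    private static final String REGEX = "(?:\\{sup\\s|\\G)(\\d)(?=\\d*\\})\\}?";

    // Hypothetical helper: expands {sup 19} into {sup 1}{sup 9}.
    static String splitSup(String input) {
        return input.replaceAll(REGEX, "{sup $1}");
    }

    public static void main(String[] args) {
        System.out.println(splitSup("{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F"));
        // Digits in braces without a leading "{sup " are left alone.
        System.out.println(splitSup("set={123}"));
    }
}
```

One caveat with this sketch: \G also matches at the very start of the input, so a string that begins with digits followed by } would still be rewritten.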
The easiest way to do this kind of thing is with something like PHP's preg_replace_callback or .NET's MatchEvaluator delegates. Older versions of Java don't have anything like that built in (Java 9 added Matcher.replaceAll(Function)), but Java does expose the lower-level API that lets you implement it yourself. Here's one way to do it:
import java.util.regex.*;
public class Test
{
static String sepsup(String orig)
{
Pattern p = Pattern.compile("(\\{su[bp] )(\\d+)\\}");
Matcher m = p.matcher(orig);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb, "");
for (char ch : m.group(2).toCharArray())
{
sb.append(m.group(1)).append(ch).append("}");
}
}
m.appendTail(sb);
return sb.toString();
}
public static void main (String[] args)
{
String s = "{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F";
System.out.println(s);
System.out.println(sepsup(s));
}
}
result:
{sup 19}F({sup 3}He,t){sub 19}Ne(p){sup 18}F
{sup 1}{sup 9}F({sup 3}He,t){sub 1}{sub 9}Ne(p){sup 1}{sup 8}F
If you wanted, you could go ahead and generate the superscript and subscript characters and insert those instead.
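A minimal sketch of that last suggestion, using the same appendReplacement/appendTail loop but emitting the Unicode superscript and subscript digits directly. The class name, method name and lookup strings are assumptions made for this example, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptDigits {
    // Lookup strings indexed by digit value 0..9 (written as escapes so the
    // source file encoding does not matter).
    private static final String SUP = "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079";
    private static final String SUB = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089";
    private static final Pattern TAG = Pattern.compile("\\{(su[bp]) (\\d+)\\}");

    // Hypothetical helper: replaces each {sup N} / {sub N} with script digits.
    static String convert(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String table = m.group(1).equals("sup") ? SUP : SUB;
            StringBuilder digits = new StringBuilder();
            for (char ch : m.group(2).toCharArray()) {
                digits.append(table.charAt(ch - '0')); // \d is ASCII-only by default
            }
            m.appendReplacement(out, Matcher.quoteReplacement(digits.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("{sup 19}F({sup 3}He,t){sup 19}Ne(p){sup 18}F"));
    }
}
```

This does the whole conversion in one pass, so the intermediate one-digit {sup x} form is not needed at all.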
Sure, this is a standard Regular Expression construct. You can find out about all the metacharacters in the Pattern Javadoc, but for your purposes, you probably want the "+" metacharacter, or the {1,3} greedy quantifier. Details in the link.
I am trying to mask a CC number so that only the third character and the last three characters are left unmasked.
For example, 7108898787654351 becomes **0**********351
I have tried (?<=.{3}).(?=.*...). It leaves the last three characters unmasked, but it also leaves the first three unmasked.
Can you give me some pointers on how to unmask the 3rd character alone?
You can use this regex with a lookahead and lookbehind:
str = str.replaceAll("(?<!^..).(?=.{3})", "*");
//=> **0**********351
RegEx Details:
(?<!^..): Negative lookbehind to assert that we are not exactly 2 characters past the start of the string (to exclude the 3rd character from matching)
.: Match a character
(?=.{3}): Positive lookahead to assert that we have at least 3 characters ahead
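For reference, a small self-contained sketch of the replacement above. The class and method names are invented for this example.

```java
public class CardMask {
    // Masks every character except the 3rd one and the last three.
    static String mask(String cc) {
        return cc.replaceAll("(?<!^..).(?=.{3})", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351")); // prints **0**********351
    }
}
```

Because the pattern uses . rather than \d, any separators in the input would be masked as well.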
I would suggest that regex isn't the only way to do this.
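import java.util.Arrays;
String cc = "7108898787654351"; // example input from the question
char[] m = new char[16]; // Or whatever length.
Arrays.fill(m, '*');
m[2] = cc.charAt(2); // keep the 3rd character
m[13] = cc.charAt(13); // keep the last three characters
m[14] = cc.charAt(14);
m[15] = cc.charAt(15);
String masked = new String(m);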
It might be more verbose, but it's a heck of a lot more readable (and debuggable) than a regex.
Here is another regular expression:
(?!(?:\D*\d){14}$|(?:\D*\d){1,3}$)\d
It may seem a bit unwieldy, but since a credit card number should have 16 digits, I opted to use a negative lookahead that counts the digits up to the end of the string, allowing any number of non-digits before each digit.
(?! - Negative lookahead
(?: - Open 1st non capture group.
\D*\d - Match zero or more non-digits and a single digit.
){14} - Close 1st non capture group and match it 14 times.
$ - End string anchor.
| - Alternation/OR.
(?: - Open 2nd non capture group.
\D*\d - Match zero or more non-digits and a single digit.
){1,3} - Close 2nd non capture group and match it 1 to 3 times.
$ - End string anchor.
) - Close negative lookahead.
\d - Match a single digit.
This would now mask any digit other than the third and last three regardless of their position (due to delimiters) in the formatted CC-number.
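A short sketch of how this pattern might be applied in Java to both plain and delimited numbers. The class and method names are assumptions for illustration only.

```java
import java.util.regex.Pattern;

public class FormattedCardMask {
    // A digit is masked unless exactly 14 digits (counting itself) or
    // 1 to 3 digits remain before the end of the string.
    private static final Pattern MASKED_DIGIT =
            Pattern.compile("(?!(?:\\D*\\d){14}$|(?:\\D*\\d){1,3}$)\\d");

    static String mask(String cc) {
        return MASKED_DIGIT.matcher(cc).replaceAll("*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351"));
        System.out.println(mask("7108-8987-8765-4351"));
    }
}
```

Delimiters are never matched by the trailing \d, so they pass through unchanged.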
Assuming dashes can only appear after the first 3 digits, leave the 3rd digit unmatched and make sure that there are always 3 digits at the end of the string:
(?<!^\d{2})\d(?=[\d-]*\d-?\d-?\d$)
Explanation
(?<! Negative lookbehind, assert what is on the left is not
^\d{2} Match 2 digits from the start of the string
) Close lookbehind
\d Match a digit
(?= Positive lookahead, assert what is on the right is
[\d-]* 0+ occurrences of either - or a digit
\d-?\d-?\d Match 3 digits with optional hyphens
$ End of string
) Close lookahead
Example code
String regex = "(?<!^\\d{2})\\d(?=[\\d-]*\\d-?\\d-?\\d$)";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
String strings[] = { "7108898787654351", "7108-8987-8765-4351"};
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
System.out.println(matcher.replaceAll("*"));
}
Output
**0**********351
**0*-****-****-*351
I don't think you should use a regex for this. You could use a StringBuilder to create the required string:
String str = "7108-8987-8765-4351";
StringBuilder sb = new StringBuilder("*".repeat(str.length()));
for (int i = 0; i < str.length(); i++) {
if (i == 2 || i >= str.length() - 3) {
sb.replace(i, i + 1, String.valueOf(str.charAt(i)));
}
}
System.out.print(sb.toString()); // output: **0*************351
You may add a ^.{0,1} alternative to allow matching . when it is the first or second char in the string:
String s = "7108898787654351"; // **0**********351
System.out.println(s.replaceAll("(?<=.{3}|^.{0,1}).(?=.*...)", "*"));
// => **0**********351
The regex can be written as a PCRE compliant pattern, too: (?<=.{3}|^|^.).(?=.*...).
It is equivalent to
System.out.println(s.replaceAll("(?<!^..).(?=.*...)", "*"));
Regex details
(?<=.{3}|^.{0,1}) - there must be any three chars other than line break chars immediately to the left of the current location, or start of string, or a single char at the start of the string
(?<!^..) - a negative lookbehind that fails the match if exactly two chars other than line break chars lie between the start of the string and the current location
. - any char but a line break char
(?=.*...) - there must be at least three chars other than line break chars to the right of the current location.
If the CC number always has 16 digits, as it does in the example, and as do Visa and MasterCard CC's, matches of the following regular expression can be replaced with an asterisk.
\d(?!\d{0,2}$|\d{13}$)
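A minimal sketch of that replacement in Java (class and method names are hypothetical). It relies on the stated assumption of exactly 16 digits with no separators.

```java
public class FixedLengthCardMask {
    // Masks a digit unless 0-2 digits or exactly 13 digits follow it
    // up to the end of the string.
    static String mask(String cc) {
        return cc.replaceAll("\\d(?!\\d{0,2}$|\\d{13}$)", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("7108898787654351")); // prints **0**********351
    }
}
```

With a number of a different length the fixed count of 13 no longer points at the 3rd digit; for a 15-digit number, for instance, the 2nd digit is the one left unmasked.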
Basically my problem is to validate a boolean expression, so I do not want a single or more than two & or | to appear between other expressions.
Ex. I want Pattern.compile(Regex).matcher("A || B").find() to be false, but Pattern.compile(Regex).matcher("A | B").find() and Pattern.compile(Regex).matcher("A ||| B").find() to be true, is there any Regex that can achieve this?
To match 'a' or 'aaa' but not 'aa' you need a regex with a negative look-behind and a negative look-ahead; e.g.
((?<!a)a(?!a))|(a{3,})
That says "find either an 'a' that is not preceded by an 'a' and not followed by an 'a', or a sequence of 3 or more 'a's".
However, find with the above regex and this string "a bb aa" will give a hit. If you want to check that the string contains some 'a's and no 'aa', you will need to test the two conditions separately.
To match '|' characters instead of 'a' characters, replace 'a' with '\|' in the above.
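Applied to the '|' case from the question, the substitution might look like the following sketch. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

public class PipeRunCheck {
    // A lone '|' (no '|' on either side) or a run of three or more '|'.
    private static final Pattern BAD_PIPES =
            Pattern.compile("((?<!\\|)\\|(?!\\|))|(\\|{3,})");

    static boolean hasBadPipeRun(String expr) {
        return BAD_PIPES.matcher(expr).find();
    }

    public static void main(String[] args) {
        System.out.println(hasBadPipeRun("A || B"));  // false
        System.out.println(hasBadPipeRun("A | B"));   // true
        System.out.println(hasBadPipeRun("A ||| B")); // true
    }
}
```

The same shape should work for & by swapping the escaped pipe for an ampersand, which needs no escaping.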
Use regex \\s\\|(\\|\\|)*\\s for your use-case.
Since the pipe (|) is a special character in regex, we need to escape it (\|). Below is the regex explanation:
\\s - A whitespace character before the operator
\\| - At least one | must be present
(\\|\\|)* - Then zero or more additional pairs of |, so the run has an odd length
\\s - A whitespace character after the operator
public static void main(String[] args) {
String regex = "\\s\\|(\\|\\|)*\\s";
System.out.println(Pattern.compile(regex).matcher("A || B").find());
System.out.println(Pattern.compile(regex).matcher("A | B").find());
System.out.println(Pattern.compile(regex).matcher("A ||| B").find());
}
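This pattern assumes that | will have whitespace before and after it. Note that it only reports runs of odd length, so an even run such as |||| is not flagged. If there is a use-case where the string doesn't have whitespace before and after |, you can try out the \b word boundary in the regex and see if it helps.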
The following regex matches on either 1 or 3 or more of | or & :
([^|]\\|[^|])|([^&]&[^&])|[|&]{3,}
This gives the following:
public static void main(String[] args) {
String[] strings = {"A|B", "A||B", "A|||B", "A&B", "A&&B", "A&&&B", "A&&&&&&&B"};
String regex = "([^|]\\|[^|])|([^&]&[^&])|[|&]{3,}";
Pattern pattern = Pattern.compile(regex);
Arrays.asList(strings)
.forEach(x -> System.out.println(x + ": " + pattern.matcher(x).find()));
}
Output:
A|B: true
A||B: false
A|||B: true
A&B: true
A&&B: false
A&&&B: true
A&&&&&&&B: true
Note that a single | or & is a valid Java operator (bitwise on integers, non-short-circuit logical on booleans). To match only on actual invalid Java, matching on the following would do:
[|&]{3,}
How to check whether a String contains any of '\r' '\t' '\n'... other than spaces?
For example, String a = "a\nb", String b = "a b". I want return true for string a, false for string b.
I know there is Character.isWhitespace(char c), and Pattern.compile("\\s").matcher(string).find(). But they both take the space (' ') into account. What I want is to find all escape characters that are considered whitespace by the Character.isWhitespace(char c) method, except for ' '.
And I don't want to check char by char; ideally there is a proper regex that I can use with Pattern.compile.
Like this?
@Test
public void testLines() {
assertTrue(Pattern.compile("[\n\r\t]").matcher("a\nb").find());
assertFalse(Pattern.compile("[\n\r\t]").matcher("a b").find());
}
You could use [^\S ], which matches any character that is neither \S (non-whitespace) nor ' ' (space).
Pattern pattern = Pattern.compile("[^\\S ]");
String a = "a\nb";
String b = "a b";
System.out.println(pattern.matcher(a).find()); // true
System.out.println(pattern.matcher(b).find()); // false
I assume that when you say "all '\r' '\t' '\n'... other than spaces", what you mean is "any whitespace character other than U+0020" (where U+0020 is a simple space). Is this correct?
If so, then the following regex (general form) should work:
(?! )\s
This will match any whitespace character that is not a simple space. This regex makes use of negative lookahead.
EDIT:
As @Bubletan states in their answer, the following regex will also work:
[^\S ]
Both of these regexes are equivalent. This is because (?! )\s ≣ "(is NOT the character U+0020) AND (is whitespace)" and [^\S ] ≣ "is NOT (non-whitespace OR the character U+0020)" have the same truth table:
Let P(x) be the predicate "x is the character U+0020"
Let Q(x) be the predicate "x is whitespace"
P | Q | (¬P)∧Q | ¬(¬Q∨P)
–– ––– –––––––– ––––––––
T T F F
T F F F
F T T T
F F F F
Although for the sake of efficiency, you are probably better off using @Bubletan's solution ([^\S ]). Lookaround is generally slower than the alternative.
This is how you could implement it:
// Create the pattern. (do only once)
Pattern pattern = Pattern.compile("[^\\S ]");
// Test an input string. (do for each input)
Matcher matcher = pattern.matcher(string);
boolean result = matcher.find();
result will then indicate whether string contains any whitespace other than a simple space.
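The claimed equivalence of the two patterns can also be checked by brute force. The following is only a sketch (class and method names are invented here) that tests every single-char string against both patterns with their default flags.

```java
import java.util.regex.Pattern;

public class WhitespaceEquivalence {
    private static final Pattern LOOKAHEAD = Pattern.compile("(?! )\\s");
    private static final Pattern NEGATED_CLASS = Pattern.compile("[^\\S ]");

    // Returns true if both patterns agree on every one-char string.
    static boolean agreeOnAllChars() {
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            String s = String.valueOf((char) c);
            boolean viaLookahead = LOOKAHEAD.matcher(s).matches();
            boolean viaClass = NEGATED_CLASS.matcher(s).matches();
            if (viaLookahead != viaClass) {
                return false;
            }
        }
        return true;
    }

    // Convenience wrapper around the lookahead form.
    static boolean hasNonSpaceWhitespace(String s) {
        return LOOKAHEAD.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAllChars());
        System.out.println(hasNonSpaceWhitespace("a\nb"));
        System.out.println(hasNonSpaceWhitespace("a b"));
    }
}
```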
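In Java, \h matches any horizontal whitespace character (space, tab and similar), so [^\\h]+ matches a run of characters that are not horizontal whitespace. Be aware that this also matches ordinary non-whitespace characters such as letters, so on its own it does not isolate whitespace other than the space. \h is also available in some other regex flavors, such as Perl and PCRE.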
Is it possible to subtract the characters in a Java regex back reference from a character class?
e.g., I want to use String#matches(regex) to match either:
any group of characters that are [a-z'] that are enclosed by "
Matches: "abc'abc"
Doesn't match: "1abc'abc"
Doesn't match: 'abc"abc'
any group of characters that are [a-z"] that are enclosed by '
Matches: 'abc"abc'
Doesn't match: '1abc"abc'
Doesn't match: "abc'abc"
The following regex won't compile because [^\1] isn't supported:
(['"])[a-z'"&&[^\1]]*\1
Obviously, the following will work:
'[a-z"]*'|"[a-z']*"
But, this style isn't particularly legible when a-z is replaced by a much more complex character class that must be kept the same in each side of the "or" condition.
I know that, in Java, I can just use String concatenation like the following:
String charClass = "a-z";
String regex = "'[" + charClass + "\"]*'|\"[" + charClass + "']*\"";
But, sometimes, I need to specify the regex in a config file, like XML or JSON, where Java code is not available.
I assume that what I'm asking is almost definitely not possible, but I figured it wouldn't hurt to ask...
One approach is to use a negative look-ahead to make sure that every character in between the quotes is not the quotes:
(['"])(?:(?!\1)[a-z'"])*+\1
The key addition is the negative look-ahead (?!\1), which is checked before each character.
(I also make the quantifier possessive, since there is no use for backtracking here.)
This approach is, however, rather inefficient, since the pattern will check for the quote character for every single character, on top of checking that the character is one of the allowed characters.
The alternative with 2 branches in the question '[a-z"]*'|"[a-z']*" is better, since the engine only checks for the quote character once and goes through the rest by checking that the current character is in the character class.
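For completeness, a small sketch of the look-ahead version used with String#matches, as in the question. The class and method names are assumptions made for this example.

```java
public class QuoteMatch {
    // Opening quote, then allowed characters that are not that same quote,
    // then the same quote again.
    private static final String REGEX = "(['\"])(?:(?!\\1)[a-z'\"])*+\\1";

    static boolean isQuoted(String s) {
        return s.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isQuoted("\"abc'abc\""));  // true
        System.out.println(isQuoted("'abc\"abc'"));   // true
        System.out.println(isQuoted("\"1abc'abc\"")); // false
    }
}
```

Since this is a single regex string, it can be stored in a config file without any Java-side concatenation, at the cost of the extra per-character check described above.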
You could use two patterns in one OR-separated pattern, expressing both your cases:
// | case 1: [a-z'] enclosed by "
// | | OR
// | | case 2: [a-z"] enclosed by '
Pattern p = Pattern.compile("(?<=\")([a-z']+)(?=\")|(?<=')([a-z\"]+)(?=')");
String[] test = {
// will match group 1 (for case 1)
"abcd\"efg'h\"ijkl",
// will match group 2 (for case 2)
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
Output
efg'h
null
null
efg"h
Note
There is nothing stopping you from specifying the enclosing characters or the character class itself somewhere else, then building your Pattern with components unknown at compile-time.
Something along the lines of:
// both strings are emulating unknown-value arguments
String unknownEnclosingCharacter = "\"";
String unknownCharacterClass = "a-z'";
// probably want to catch a PatternSyntaxException here for potential
// issues with the given arguments
Pattern p = Pattern.compile(
String.format(
"(?<=%1$s)([%2$s]+)(?=%1$s)",
unknownEnclosingCharacter,
unknownCharacterClass
)
);
String[] test = {
"abcd\"efg'h\"ijkl",
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
// note: only main group here
System.out.println(m.group());
}
}
Output
efg'h
I have this regex in java
String pattern = "(\\s)(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})(\\s)";
It works as intended but I have a new problem to get some valid dates:
1st problem:
If I have the String It was at 22-febrero-1999 and 10-enero-2009 and 01-diciembre-2000, I should get febrero-enero-diciembre, but I only get febrero-enero.
2nd problem:
If I have a single date in a String, like 12-octubre-1989, I get an empty String.
Why does my pattern require whitespace at the start and end of each date? Because I must only catch valid months that belong to standalone dates. For a String like adsadasd 12-validMonth-2999 asd 11-validMonth-1989 I should get both validMonth values, but I should never get a validMonth from a date glued to other text: for asdadsad12-validMonth-1989 asdadsad 23-validMonth-1989 I should only get the last validMonth.
PS: My Java code is
String resultado = "";
String pattern = "(\\s)(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})(\\s)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(fecha);
while (m.find()) {
resultado += m.group().split("-")[1] + "-";
}
return (resultado.compareTo("") == 0 ? "" : resultado.substring(0, resultado.length() - 1));
You might want to use a word boundary instead:
\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b
And I believe some of the months can be optimized a little bit (it could reduce readability unfortunately, but should speed things up by a notch):
\\b(\\d{2}-)((?:en|febr)ero|ma(?:rz|y)o|abril|ju[ln]io|agosto|(?:septiem|octu|noviem|diciem)bre)(-\\d{4})\\b
Perhaps try using a \b instead of \s:
String pattern = "\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b";
This will only match strings where the first digit is not preceded by another word character (digit, letter, or underscore), and the last digit is not followed by a word character. I've also removed the capturing groups around the \b, because it would always be a zero-length string, if matched.
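As a sketch of how the word-boundary pattern could replace the original loop, the version below collects the month from capturing group 2 instead of splitting the match on "-". The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonthExtractor {
    private static final Pattern DATE = Pattern.compile(
            "\\b(\\d{2}-)(enero|febrero|marzo|abril|mayo|junio|julio|agosto"
            + "|septiembre|octubre|noviembre|diciembre)(-\\d{4})\\b");

    // Returns the months of all standalone dates, joined with '-'.
    static String months(String fecha) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(fecha);
        while (m.find()) {
            found.add(m.group(2)); // group 2 is the month name
        }
        return String.join("-", found);
    }

    public static void main(String[] args) {
        System.out.println(months("It was at 22-febrero-1999 and 10-enero-2009 and 01-diciembre-2000"));
        System.out.println(months("12-octubre-1989"));
    }
}
```

Because \b consumes no characters, a date at the very start or end of the input and dates separated by a single space are all found.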
I wouldn't use a word boundary as a delimiter.
I'd suggest using either whitespace or NOT digit,
or no delimiter and putting in a validation range of numbers for day/year.
This way you may catch more embedded dates that are in close
proximity (adjacent) to letters and underscores.
Something like:
# "(?<!\\d)\\d{2}-(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)-\\d{4}(?!\\d)"
(?<! \d ) # Not a digit before us
\d{2} - # Two digits followed by dash
(?: # A month
enero
| febrero
| marzo
| abril
| mayo
| junio
| julio
| agosto
| septiembre
| octubre
| noviembre
| diciembre
)
- \d{4} # Dash followed by four digits
(?! \d ) # Not a digit after us
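A brief sketch of the compact form of this pattern in Java, to show how it differs from the word-boundary version. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmbeddedDates {
    private static final Pattern DATE = Pattern.compile(
            "(?<!\\d)\\d{2}-(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto"
            + "|septiembre|octubre|noviembre|diciembre)-\\d{4}(?!\\d)");

    // Returns the first date found, or null if there is none.
    static String firstDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDate("x12-enero-1999y")); // adjacent letters are allowed
        System.out.println(firstDate("112-enero-1999"));  // an extra leading digit is rejected
    }
}
```

Note that this deliberately accepts dates glued to letters or underscores, which is the opposite of what the \b version does.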