Java Regex replacing every digit in beginning - java

How can I replace with regex each digit at the beginning of word with the underscore character, as well as in the rest part of the word to replace all characters except letters, digits, dashes and dots to underscores?
I tried this regex:
^(\d+)|[^\w-.]
However, it replaces all digits in the beginning with a single underscore character.
So, 34567fgf-kl.)*/676hh is converted to _fgf-kl.___676hh while I need every digit in the beginning to be replaced with one underscore character like _____fgf-kl.___676hh.
Is it possible to achieve using a regex?

You can do it like this with Matcher.appendReplacement used with Matcher.find:
String fileText = "34567fgf-kl.)*/676hh";
String pattern = "^\\d+|[^\\w.-]+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(fileText);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, repeat("_", m.group(0).length()));
}
m.appendTail(sb); // append the rest of the contents
System.out.println(sb);
And the repeat is
public static String repeat(String s, int n) {
if(s == null) {
return null;
}
final StringBuilder sb = new StringBuilder(s.length() * n);
for(int i = 0; i < n; i++) {
sb.append(s);
}
return sb.toString();
}
See IDEONE demo
Also, repeat can be replaced with String repeated = StringUtils.repeat("_", m.group(0).length()); using Commons Lang StringUtils.repeat().

You can use a negative-lookbehind to individually match each leading digit, i.e. any digit that doesn't have a non-digit before it.
(?<!\D.{0,999})\d|[^\w-.]
Due to constraints in lookbehind, it cannot be unlimited. The above code can handle at most 999 leading digits.

You can also use replaceAll() with regex:
(^\d)|(?<=\d\G)\d|[^-\w.\n]
which means match:
(^\d) - digit on beginning of a line,
| - or
(?<=\d\G)\d - digit if it is preceded by previously matched digit,
| - or
[^-\w.\n] - not dash, word character (\w is [A-Za-z_0-9]), point or
new line (\n). As a [^-\w.\n] is rather broad category, maybe you will like to add some more characters, or character groups, to exclude from matching, it is enough to add it inside brackets,
DEMO
\n is added if string could be multiline. If there is just one-line string, \n is redundant.
Example in Java:
public class Test {
public static void main(String[] args) {
String example = "34567fgf-kl.)*/676hh";
System.out.println(example.replaceAll("(^\\d)|(?<=\\d\\G)\\d|[^\\w.-]", "_"));
}
}
with output:
_____fgf-kl.___676hh

Related

How to substring before nth occurence of a separator?

first;snd;3rd;4th;5th;6th;...
How can I split the above after the third occurence of the ; separator? Especially without having to value.split(";") the whole string as an array, as I won't need the values separated. Just the first part of the string up until nth occurence.
Desired output would be:
first;snd;3rd.
I just need that as a string substring, not as split separated values.
Use StringUtils.ordinalIndexOf() from Apache
Finds the n-th index within a String, handling null. This method uses String.indexOf(String).
Parameters:
str - the String to check, may be null
searchStr - the String to find, may be null
ordinal - the n-th searchStr to find
Returns:
the n-th index of the search String, -1 (INDEX_NOT_FOUND) if no match or null string input
Or this way, no libraries required:
public static int ordinalIndexOf(String str, String substr, int n) {
int pos = str.indexOf(substr);
while (--n > 0 && pos != -1)
pos = str.indexOf(substr, pos + 1);
return pos;
}
I would go with this, easy and basic:
String test = "first;snd;3rd;4th;5th;6th;";
int result = 0;
for (int i = 0; i < 3; i++) {
result = test.indexOf(";", result) +1;
}
System.out.println(test.substring(0, result-1));
Output:
first;snd;3rd
You can ofc change the 3 in the loop with the number of arguments you need
If you want to use regular expressions, it is pretty straightforward:
import re
value = "first;snd;3rd;4th;5th;6th;"
reg = r'^([\w]+;[\w]+;[\w]+)'
re.match(reg, value).group()
Outputs:
"first;snd;3rd"
More options here .
You could use a regex that uses a negated character class to match from the start of the string not a semicolon.
Then repeat a grouping structure 2 times that matches a semicolon followed by not a semicolon 1+ times.
^[^;]+(?:;[^;]+){2}
Explanation
^ Assert the start of the string
[^;]+ Negated character class to match not a semicolon 1+ times
(?: Start non capturing group
;[^;]+ Match a semicolon and 1+ times not a semi colon
){2} Close non capturing group and repeat 2 times
For example:
String regex = "^[^;]+(?:;[^;]+){2}";
String string = "first;snd;3rd;4th;5th;6th;...";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(0)); // first;snd;3rd
}
See the Java demo
If you don't want to use split, just use indexOf in a for loop to know the index of the 3rd and 4th ";" then do a substring between these index.
Also you can do a split with a regex that match the 3rd ; but it's probably not the best solution.
If you need to do this frequently it is best to compile the regex upfront in a static Pattern instance:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class NthOccurance {
static Pattern pattern=Pattern.compile("^(([^;]*;){3}).*");
public static void main(String[] args) {
String in="first;snd;3rd;4th;5th;6th;";
Matcher m=pattern.matcher(in);
if (m.matches())
System.out.println(m.group(1));
}
}
Replace the '3' by the number of elements you want.
Below code find index of 3rd occurence of ';' character and make substring.
String s = "first;snd;3rd;4th;5th;6th;";
String splitted = s.substring(0, s.indexOf(";", s.indexOf(";", s.indexOf(";") + 1) + 1));

Java regex: Replace all characters with `+` except instances of a given string

I have the following problem which states
Replace all characters in a string with + symbol except instances of the given string in the method
so for example if the string given was abc123efg and they want me to replace every character except every instance of 123 then it would become +++123+++.
I figured a regular expression is probably the best for this and I came up with this.
str.replaceAll("[^str]","+")
where str is a variable, but its not letting me use the method without putting it in quotations. If I just want to replace the variable string str how can I do that? I ran it with the string manually typed and it worked on the method, but can I just input a variable?
as of right now I believe its looking for the string "str" and not the variable string.
Here is the output its right for so many cases except for two :(
List of open test cases:
plusOut("12xy34", "xy") → "++xy++"
plusOut("12xy34", "1") → "1+++++"
plusOut("12xy34xyabcxy", "xy") → "++xy++xy+++xy"
plusOut("abXYabcXYZ", "ab") → "ab++ab++++"
plusOut("abXYabcXYZ", "abc") → "++++abc+++"
plusOut("abXYabcXYZ", "XY") → "++XY+++XY+"
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
plusOut("--++ab", "++") → "++++++"
plusOut("aaxxxxbb", "xx") → "++xxxx++"
plusOut("123123", "3") → "++3++3"
Looks like this is the plusOut problem on CodingBat.
I had 3 solutions to this problem, and wrote a new streaming solution just for fun.
Solution 1: Loop and check
Create a StringBuilder out of the input string, and check for the word at every position. Replace the character if doesn't match, and skip the length of the word if found.
public String plusOut(String str, String word) {
StringBuilder out = new StringBuilder(str);
for (int i = 0; i < out.length(); ) {
if (!str.startsWith(word, i))
out.setCharAt(i++, '+');
else
i += word.length();
}
return out.toString();
}
This is probably the expected answer for a beginner programmer, though there is an assumption that the string doesn't contain any astral plane character, which would be represented by 2 char instead of 1.
Solution 2: Replace the word with a marker, replace the rest, then restore the word
public String plusOut(String str, String word) {
return str.replaceAll(java.util.regex.Pattern.quote(word), "#").replaceAll("[^#]", "+").replaceAll("#", word);
}
Not a proper solution since it assumes that a certain character or sequence of character doesn't appear in the string.
Note the use of Pattern.quote to prevent the word being interpreted as regex syntax by replaceAll method.
Solution 3: Regex with \G
public String plusOut(String str, String word) {
word = java.util.regex.Pattern.quote(word);
return str.replaceAll("\\G((?:" + word + ")*+).", "$1+");
}
Construct regex \G((?:word)*+)., which does more or less what solution 1 is doing:
\G makes sure the match starts from where the previous match leaves off
((?:word)*+) picks out 0 or more instance of word - if any, so that we can keep them in the replacement with $1. The key here is the possessive quantifier *+, which forces the regex to keep any instance of the word it finds. Otherwise, the regex will not work correctly when the word appear at the end of the string, as the regex backtracks to match .
. will not be part of any word, since the previous part already picks out all consecutive appearances of word and disallow backtrack. We will replace this with +
Solution 4: Streaming
public String plusOut(String str, String word) {
return String.join(word,
Arrays.stream(str.split(java.util.regex.Pattern.quote(word), -1))
.map((String s) -> s.replaceAll("(?s:.)", "+"))
.collect(Collectors.toList()));
}
The idea is to split the string by word, do the replacement on the rest, and join them back with word using String.join method.
Same as above, we need Pattern.quote to avoid split interpreting the word as regex. Since split by default removes empty string at the end of the array, we need to use -1 in the second parameter to make split leave those empty strings alone.
Then we create a stream out of the array and replace the rest as strings of +. In Java 11, we can use s -> String.repeat(s.length()) instead.
The rest is just converting the Stream to an Iterable (List in this case) and joining them for the result
This is a bit trickier than you might initially think because you don't just need to match characters, but the absence of specific phrase - a negated character set is not enough. If the string is 123, you would need:
(?<=^|123)(?!123).*?(?=123|$)
https://regex101.com/r/EZWMqM/1/
That is - lookbehind for the start of the string or "123", make sure the current position is not followed by 123, then lazy-repeat any character until lookahead matches "123" or the end of the string. This will match all characters which are not in a "123" substring. Then, you need to replace each character with a +, after which you can use appendReplacement and a StringBuffer to create the result string:
String inputPhrase = "123";
String inputStr = "abc123efg123123hij";
StringBuffer resultString = new StringBuffer();
Pattern regex = Pattern.compile("(?<=^|" + inputPhrase + ")(?!" + inputPhrase + ").*?(?=" + inputPhrase + "|$)");
Matcher m = regex.matcher(inputStr);
while (m.find()) {
String replacement = m.group(0).replaceAll(".", "+");
m.appendReplacement(resultString, replacement);
}
m.appendTail(resultString);
System.out.println(resultString.toString());
Output:
+++123+++123123+++
Note that if the inputPhrase can contain character with a special meaning in a regular expression, you'll have to escape them first before concatenating into the pattern.
You can do it in one line:
input = input.replaceAll("((?:" + str + ")+)?(?!" + str + ").((?:" + str + ")+)?", "$1+$2");
This optionally captures "123" either side of each character and puts them back (a blank if there's no "123"):
So instead of coming up with a regular expression that matches the absence of a string. We might as well just match the selected phrase and append + the number of skipped characters.
StringBuilder sb = new StringBuilder();
Matcher m = Pattern.compile(Pattern.quote(str)).matcher(input);
while (m.find()) {
for (int i = 0; i < m.start(); i++) sb.append('+');
sb.append(str);
}
int remaining = input.length() - sb.length();
for (int i = 0; i < remaining; i++) {
sb.append('+');
}
Absolutely just for the fun of it, a solution using CharBuffer (unexpectedly it took a lot more that I initially hoped for):
private static String plusOutCharBuffer(String input, String match) {
int size = match.length();
CharBuffer cb = CharBuffer.wrap(input.toCharArray());
CharBuffer word = CharBuffer.wrap(match);
int x = 0;
for (; cb.remaining() > 0;) {
if (!cb.subSequence(0, size < cb.remaining() ? size : cb.remaining()).equals(word)) {
cb.put(x, '+');
cb.clear().position(++x);
} else {
cb.clear().position(x = x + size);
}
}
return cb.clear().toString();
}
To make this work you need a beast of a pattern. Let's say you you are operating on the following test case as an example:
plusOut("abXYxyzXYZ", "XYZ") → "+++++++XYZ"
What you need to do is build a series of clauses in your pattern to match a single character at a time:
Any character that is NOT "X", "Y" or "Z" -- [^XYZ]
Any "X" not followed by "YZ" -- X(?!YZ)
Any "Y" not preceded by "X" -- (?<!X)Y
Any "Y" not followed by "Z" -- Y(?!Z)
Any "Z" not preceded by "XY" -- (?<!XY)Z
An example of this replacement can be found here: https://regex101.com/r/jK5wU3/4
Here is an example of how this might work (most certainly not optimized, but it works):
import java.util.regex.Pattern;
public class Test {
public static void plusOut(String text, String exclude) {
StringBuilder pattern = new StringBuilder("");
for (int i=0; i<exclude.length(); i++) {
Character target = exclude.charAt(i);
String prefix = (i > 0) ? exclude.substring(0, i) : "";
String postfix = (i < exclude.length() - 1) ? exclude.substring(i+1) : "";
// add the look-behind (?<!X)Y
if (!prefix.isEmpty()) {
pattern.append("(?<!").append(Pattern.quote(prefix)).append(")")
.append(Pattern.quote(target.toString())).append("|");
}
// add the look-ahead X(?!YZ)
if (!postfix.isEmpty()) {
pattern.append(Pattern.quote(target.toString()))
.append("(?!").append(Pattern.quote(postfix)).append(")|");
}
}
// add in the other character exclusion
pattern.append("[^" + Pattern.quote(exclude) + "]");
System.out.println(text.replaceAll(pattern.toString(), "+"));
}
public static void main(String [] args) {
plusOut("12xy34", "xy");
plusOut("12xy34", "1");
plusOut("12xy34xyabcxy", "xy");
plusOut("abXYabcXYZ", "ab");
plusOut("abXYabcXYZ", "abc");
plusOut("abXYabcXYZ", "XY");
plusOut("abXYxyzXYZ", "XYZ");
plusOut("--++ab", "++");
plusOut("aaxxxxbb", "xx");
plusOut("123123", "3");
}
}
UPDATE: Even this doesn't quite work because it can't deal with exclusions that are just repeated characters, like "xx". Regular expressions are most definitely not the right tool for this, but I thought it might be possible. After poking around, I'm not so sure a pattern even exists that might make this work.
The problem in your solution that you put a set of instance string str.replaceAll("[^str]","+") which it will exclude any character from the variable str and that will not solve your problem
EX: when you try str.replaceAll("[^XYZ]","+") it will exclude any combination of character X , character Y and character Z from your replacing method so you will get "++XY+++XYZ".
Actually you should exclude a sequence of characters instead in str.replaceAll.
You can do it by using capture group of characters like (XYZ) then use a negative lookahead to match a string which does not contain characters sequence : ^((?!XYZ).)*$
Check this solution for more info about this problem but you should know that it may be complicated to find regular expression to do that directly.
I have found two simple solutions for this problem :
Solution 1:
You can implement a method to replace all characters with '+' except the instance of given string:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
for(int i = 0; i < str.length(); i++){
// exclude any instance string of exWord from replacing process in str
if(str.substring(i, str.length()).indexOf(exWord) + i == i){
i = i + exWord.length()-1;
}
else{
str = str.substring(0,i) + "+" + str.substring(i+1);//replace each character with '+' symbol
}
}
Note : str.substring(i, str.length()).indexOf(exWord) + i this if statement will exclude any instance string of exWord from replacing process in str.
Output:
+++++++XYZ
Solution 2:
You can try this Approach using ReplaceAll method and it doesn't need any complex regular expression:
String exWord = "XYZ";
String str = "abXYxyzXYZ";
str = str.replaceAll(exWord,"*"); // replace instance string with * symbol
str = str.replaceAll("[^*]","+"); // replace all characters with + symbol except *
str = str.replaceAll("\\*",exWord); // replace * symbol with instance string
Note : This solution will work only if your input string str doesn't contain any * symbol.
Also you should escape any character with a special meaning in a regular expression in phrase instance string exWord like : exWord = "++".

Replace all characters between two delimiters using regex

I'm trying to replace all characters between two delimiters with another character using regex. The replacement should have the same length as the removed string.
String string1 = "any prefix [tag=foo]bar[/tag] any suffix";
String string2 = "any prefix [tag=foo]longerbar[/tag] any suffix";
String output1 = string1.replaceAll(???, "*");
String output2 = string2.replaceAll(???, "*");
The expected outputs would be:
output1: "any prefix [tag=foo]***[/tag] any suffix"
output2: "any prefix [tag=foo]*********[/tag] any suffix"
I've tried "\\\\\[tag=.\*?](.\*?)\\\\[/tag]" but this replaces the whole sequence with a single "\*".
I think that "(.\*?)" is the problem here because it captures everything at once.
How would I write something that replaces every character separately?
you can use the regex
\w(?=\w*?\[)
which would match all characters before a "[\"
see the regex demo, online compiler demo
You can capture the chars inside, one by one and replace them by * :
public static String replaceByStar(String str) {
String pattern = "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)";
while (str.matches(pattern)) {
str = str.replaceAll(pattern, "$1*$2");
}
return str;
}
Use like this it will print your tx2 expected outputs :
public static void main(String[] args) {
System.out.println(replaceByStar("any prefix [tag=foo]bar[/tag] any suffix"));
System.out.println(replaceByStar("any prefix [tag=foo]loooongerbar[/tag] any suffix"));
}
So the pattern "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)" :
(.*\\[tag=.*\\].*) capture the beginning, with eventually some char in the middle
\\w is for the char you want to replace
(.*\\[\\/tag\\].*) capture the end, with eventually some char in the middle
The substitution $1*$2:
The pattern is (text$1)oneChar(text$2) and it will replace by (text$1)*(text$2)

Removing duplicate same characters in a row

I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
private String removeDuplicates(String word) {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < word.length(); i++) {
char letter = word.charAt(i);
if (buffer.length() == 0 && letter != buffer.charAt(buffer.length() - 1)) {
buffer.append(letter);
}
}
return buffer.toString();
}
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.
why not just use a regex?
public class RemoveDuplicates {
public static void main(String[] args) {
System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
}
public String result(String input, boolean doubleLetter){
String pattern = null;
if(doubleLetter) pattern = "(.)(?=\\1{2})";
else pattern = "(.)(?=\\1)";
return input.replaceAll(pattern, "");
}
}
(.) --> matches any character and puts in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself. For example aa or bb, etc. It is important to note that only the first character is part of the match actually, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT PART of the match). So the first l is replaced by an empty String and we are left with helo, which does not match the regex
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, for example helllo. On the other hand 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Fill free to accept the answer if it helped.
try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo
Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any letter and place it in a regex capture group. This group is later accessed through the \\1+. Meaning that it will match one or more repetitions of the previous letter.
(\\w)\\1{2,} is the same as above the only difference being that it looks after only characters which are repeated more than 2 times. This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");
Try this, this will be most efficient way[Edited after comment]:
public static String removeDuplicates(String str) {
int checker = 0;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < str.length(); ++i) {
int val = str.charAt(i) - 'a';
if ((checker & (1 << val)) == 0)
buffer.append(str.charAt(i));
checker |= (1 << val);
}
return buffer.toString();
}
I am using bits to identify uniqueness.
EDIT:
Whole logic is that if a character has been parsed then its corrresponding bit is set and next time when that character comes up then it will not be added in String Buffer the corresponding bit is already set.

Split a String based on regex

I have a string that needs to be split based on the occurrence of a ","(comma), but need to ignore any occurrence of it that comes within a pair of parentheses.
For example, B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3
Should be split into
B2B,
(A2C,AMM),
(BNC,1NF),
(106,A01),
AAA,
AX3
FOR NON NESTED
,(?![^\(]*\))
FOR NESTED(parenthesis inside parenthesis)
(?<!\([^\)]*),(?![^\(]*\))
Try below:
var str = 'B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3';
console.log(str.match(/\([^)]*\)|[A-Z\d]+/g));
// gives you ["B2B", "(A2C,AMM)", "(BNC,1NF)", "(106,A01)", "AAA", "AX3"]
Java edition:
String str = "B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3";
Pattern p = Pattern.compile("\\([^)]*\\)|[A-Z\\d]+");
Matcher m = p.matcher(str);
List<String> matches = new ArrayList<String>();
while(m.find()){
matches.add(m.group());
}
for (String val : matches) {
System.out.println(val);
}
One simple iteration will be probably better option then any regex, especially if your data can have parentheses inside parentheses. For example:
String data="Some,(data,(that),needs),to (be, splited) by, comma";
StringBuilder buffer=new StringBuilder();
int parenthesesCounter=0;
for (char c:data.toCharArray()){
if (c=='(') parenthesesCounter++;
if (c==')') parenthesesCounter--;
if (c==',' && parenthesesCounter==0){
//lets do something with this token inside buffer
System.out.println(buffer);
//now we need to clear buffer
buffer.delete(0, buffer.length());
}
else
buffer.append(c);
}
//lets not forget about part after last comma
System.out.println(buffer);
output
Some
(data,(that),needs)
to (be, splited) by
comma
Try this
\w{3}(?=,)|(?<=,)\(\w{3},\w{3}\)(?=,)|(?<=,)\w{3}
Explanation: There are three parts separated by OR (|)
\w{3}(?=,) - matches the 3 any alphanumeric character (including underscore) and does the positive look ahead for comma
(?<=,)\(\w{3},\w{3}\)(?=,) - matches this pattern (ABC,E4R) and also does a positive lookahead and look behind for the comma
(?<=,)\w{3} - matches the 3 any alphanumeric character (including underscore) and does the positive look behind for comma

Categories

Resources