Give look behind the priority over the actual regular expression - java

I am looking for a regular expression that strips all 'a' characters from the beginning of an input word (consisting only of English letters).
How would I do this using a regular expression?
The following lookbehind-based regex fails to do the job:
(?<=a*?)(\w)+
as for input abc the above regular expression would return abc.
Is there a clean way to do this using lookbehinds?
A (brute-force-ish) regular expression that does work uses negation:
(?<=a*)([[^a]&&\w])*
which returns the correct answer of bc for an input word abc.
But I was wondering if there could be a more elegant regular expression, say, using the correct quantifier?

Pattern removeWords = Pattern.compile("\\b(?:a)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher fix = removeWords.matcher(YourWord);
String fixedString = fix.replaceAll("");
This removes every standalone a (case-insensitively, together with any whitespace that follows it) from the string. If you want to remove some other letters as well, use an alternation:
Pattern removeWords = Pattern.compile("\\b(?:a|b|c)\\b\\s*", Pattern.CASE_INSENSITIVE);

I think that a regex for this problem is overkill.
You could instead do:
str = str.startsWith("a") ? str.substring(1) : str;
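Note that the ternary above drops at most one leading a. If every leading a should go, as the question asks, a plain loop is one regex-free option. This is only a minimal sketch; the class and method names (LeadingAStripper, stripLeadingA) are hypothetical:

```java
final class LeadingAStripper {
    // Hypothetical helper: drops every leading 'a' without using a regex.
    static String stripLeadingA(String s) {
        int i = 0;
        // Advance past the run of 'a' characters at the start of the word.
        while (i < s.length() && s.charAt(i) == 'a') {
            i++;
        }
        return s.substring(i);
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingA("aaabc")); // bc
        System.out.println(stripLeadingA("bca"));   // bca
    }
}
```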

Try with:
(?i)\\ba?(\\w+)\\b
and replace a word with captured group 1.
Code example:
String word = "aWord Another";
word = word.replaceAll("(?i)\\ba?(\\w+)\\b", "$1");
System.out.println(word);
with output:
Word nother

There are much simpler ways to do this, but since you insist on using lookbehinds, here is one. The regex is
(?<=\b)a+(\w*)
Regex Breakdown
(?<=\b) #Find all word boundaries
a+ #Match the character a literally at least once. The word boundary already ensures that only the a's at the start of a word are matched
(\w*) #Find remaining characters
Java Code
String str = "abc cdavbvhsza aaabcd";
System.out.println(str.replaceAll("(?<=\\b)a+(\\w*)", "$1"));
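Since \b is itself zero-width, the same effect can probably be had without the lookbehind wrapper or the capturing group. The sketch below assumes that simplification; the class and method names are made up for illustration:

```java
final class LeadingAWords {
    // Strips the leading a's of every word in the text. \b is zero-width,
    // so no lookbehind and no capturing group are needed here.
    static String stripInText(String text) {
        return text.replaceAll("\\ba+", "");
    }

    // Single-word variant: anchor at the start of the input instead.
    static String stripInWord(String word) {
        return word.replaceFirst("^a+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripInText("abc cdavbvhsza aaabcd")); // bc cdavbvhsza bcd
        System.out.println(stripInWord("aaabca"));                // bca
    }
}
```

Note that a word consisting only of a's becomes empty with either variant.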

Related

Java regex, replace certain characters except

I have this string "u2x4m5x7" and I want replace all the characters but a number followed by an x with "".
The output should be:
"2x5x"
Just the number followed by the x.
But I am getting this:
"2x45x7"
I'm doing this:
String string = "u2x4m5x7";
String s = string.replaceAll("[^0-9+x]","");
Please help!!!
Here is a one-liner using String#replaceAll with two replacements:
System.out.println(string.replaceAll("\\d+(?!x)", "").replaceAll("[^x\\d]", ""));
Here is another working solution. We can iterate over the input string using a formal pattern matcher with the pattern \d+x. This is the whitelist approach: match only the combinations we want to keep.
String input = "u2x4m5x7";
Pattern pattern = Pattern.compile("\\d+x");
Matcher m = pattern.matcher(input);
StringBuilder b = new StringBuilder();
while(m.find()) {
b.append(m.group(0));
}
System.out.println(b);
This prints:
2x5x
It looks like this would be much simpler by searching to get the match rather than replacing all non matches, but here is a possible solution, though it may be missing a few cases:
\d(?!x)|[^0-9x]|(?<!\d)x
https://regex101.com/r/v6udph/1
Basically it will:
\d(?!x) -- remove any digit not followed by an x
[^0-9x] -- remove all non-x/digit characters
(?<!\d)x -- remove all x's not preceded by a digit
But then again, grabbing from \dx would be much simpler
Capture what you need in group 1 OR match any other single character, then replace each match with $1 (which is empty whenever the . alternative matched).
String s = string.replaceAll("(\\d+x)|.", "$1");
See this demo at regex101 or a Java demo at tio.run
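To compare the approaches from this thread side by side, here is a small sketch that wraps three of them in methods. The class and method names are hypothetical; the patterns are the ones quoted in the answers:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeepDigitX {
    // Whitelist approach: collect every "digits followed by x" match.
    static String byFind(String input) {
        Matcher m = Pattern.compile("\\d+x").matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    // Alternation approach: keep group 1 when it matched, drop any other character.
    static String byAlternation(String input) {
        return input.replaceAll("(\\d+x)|.", "$1");
    }

    // Two-pass approach: drop digits not followed by x, then everything that is not x or a digit.
    static String byTwoPasses(String input) {
        return input.replaceAll("\\d+(?!x)", "").replaceAll("[^x\\d]", "");
    }

    public static void main(String[] args) {
        String input = "u2x4m5x7";
        System.out.println(byFind(input));        // 2x5x
        System.out.println(byAlternation(input)); // 2x5x
        System.out.println(byTwoPasses(input));   // 2x5x
    }
}
```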

Regex to catch all the words and the "i'm you're etc" in Java

I am trying to split lines of a document, by creating a Pattern in Java.
The default Pattern in WordCount example is something like this: "\\s*\\b\\s*".
The problem with this pattern, however, is that it splits everything into single words, while I want to keep things such as (I'm, You're, it's) together. So far, what I've tried is [a-zA-Z]+'{0,1}[a-zA-Z]*,
the problem is that when I have a test string, for example:
Pattern BOUNDARY = Pattern.compile("[a-zA-Z]+'{0,1}[a-zA-Z]*");
String test = "Hello i'm #£$#you ##can !!be.";
and run
for (String word : BOUNDARY.split(test)) {
System.out.println(word);
}
I get no results. Ideally, I want to get
Hello
i'm
you
can
be
Any ideas are welcome. In the regex101.com the regex I've put up works like a charm, so I'm guessing I have misunderstood something in the Java part.
Your initial pattern splits at a word boundary optionally surrounded by whitespace. Your second pattern matches the words themselves, so it should be used for matching, not for splitting.
Use it like this:
String BOUNDARY_STR = "[a-zA-Z]+(?:'[a-zA-Z]+)?";
String test = "Hello i'm #£$#you ##can !!be.";
Matcher matcher = Pattern.compile(BOUNDARY_STR).matcher(test);
List<String> results = new ArrayList<>();
while (matcher.find()){
results.add(matcher.group(0));
}
System.out.println(results); // => [Hello, i'm, you, can, be]
Note I used [a-zA-Z]+(?:'[a-zA-Z]+)? that matches
[a-zA-Z]+ - 1 or more ASCII letters
(?:'[a-zA-Z]+)? - an optional substring of
' - an apostrophe
[a-zA-Z]+ - 1 or more ASCII letters
You may also wrap the pattern with word boundaries to only match words that are enclosed with non-word chars, "\\b[a-zA-Z]+(?:'[a-zA-Z]+)?\\b".
To find all Unicode letters, use "\\p{L}+(?:'\\p{L}+)?".
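The Unicode variant can be wrapped in a reusable method. This is just a sketch: the class and method names (WordExtractor, words) are invented for the example, and the non-ASCII characters are written as \u escapes only to keep the source file ASCII-safe:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordExtractor {
    // One or more Unicode letters, optionally followed by an apostrophe and more letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)?");

    // Collects the matches themselves instead of splitting on them.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello i'm #\u00a3$#you ##can !!be.")); // [Hello, i'm, you, can, be]
        System.out.println(words("C'est d\u00e9j\u00e0 l'\u00e9t\u00e9").size()); // 3
    }
}
```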

Why does this regex capture the excluded character?

I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
What I am trying to do is find a pattern that starts with an # and has two or more characters after, however it can't start in the middle of a word.
I'm new to regex but was under the impression ?: matches but then excludes the character however my regex seems to match but include the characters. Ideally I'd like for "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.
Your understanding is incorrect. The difference between (...) and (?:...) is only that the former also creates a numbered match group which can be referred to with a backreference from within the regex, or as a captured match group from code following the match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put ([A-Za-z0-9]{2,}) inside regular parentheses, like I have done here, and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
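In Java, the capture-group fix described above might look like the following sketch. The pattern is a simplified form of the one in the question (the redundant \A and literal-space alternatives are dropped), and the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashTagCapture {
    // The prefix (whitespace or start of input) and the '#' are matched but not captured;
    // only the characters after '#' land in group 1.
    private static final Pattern TAG = Pattern.compile("(?:\\s|^)#([A-Za-z0-9]{2,})");

    // Returns the text after '#', or null when no tag starts at the beginning of a word.
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("#test"));     // test
        System.out.println(firstTag("test#test")); // null
    }
}
```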
Try this: you could use a non-word boundary (\B) in front of the # to specify your condition. A plain \b there would require a word character directly before the #, which is the opposite of what you want.
public static void main(String[] args) {
String s1 = "#test";
String s2 = "test#test";
String pattern = "\\B#\\w{2,}\\b";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s1);
m.find();
System.out.println(m.group());
}
Output:
#test
In the second case (s2) there is no match, so m.group() throws an `IllegalStateException`.
How about:
\W#[\S]{2}[\S]*
The strings matched by this regular expression then need to be trimmed and have their first character removed.
I guess you would be better off with the following one:
(?<=(?<!\w)#)\w{2,}
Don't forget to escape the backslashes in Java, since the regex is written as a string literal:
(?<=(?<!\\w)#)\\w{2,}
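With the lookbehind version, the whole match is already the text after the #, so no group is needed. A minimal sketch, with hypothetical class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashTagLookbehind {
    // The outer lookbehind requires a '#' just before the match; the inner negative
    // lookbehind requires that the '#' is not preceded by a word character.
    private static final Pattern TAG = Pattern.compile("(?<=(?<!\\w)#)\\w{2,}");

    // Returns the first tag without its '#', or null when there is none.
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("#test"));     // test
        System.out.println(firstTag("test#test")); // null
    }
}
```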

java regular expression [A-Z]{6}-[A-Z]{4}-[A-Z]{4}

I'm trying to write a regular expression in Java for this:
"/[A-Z]{6}-[A-Z]{4}-[A-Z]{4}/"
But it is not working. For example
"AASAAA-AAAA-AAAA".matches("/[A-Z]{6}-[A-Z]{4}-[A-Z]{4}/")
returns false.
What is the correct way?
Java != JavaScript: here you don't need to surround the regex with /, so try
"AASAAA-AAAA-AAAA".matches("[A-Z]{6}-[A-Z]{4}-[A-Z]{4}")
Otherwise your regex would look for a string that also has / at its start and end.
BTW, you need to know that matches checks whether the regex matches the entire String, so
"aaa".matches("aa")
is same as
"aaa".matches("^aa$")
which would return false since the String couldn't be fully matched by the regex.
If you would like to find substrings that match the regex, you would need to use
String input = "abcd";
Pattern regex = Pattern.compile("\\w{2}");
Matcher matcher = regex.matcher(input);
while (matcher.find()) { // each call to find() looks for the next match
System.out.println(matcher.group());
}
Output:
ab
cd
^[A-Z]{6}-[A-Z]{4}-[A-Z]{4}$
It's just that you shouldn't put slashes at the start and the end. Put ^ and $ there instead.
Oh, and I didn't notice at first that you had used JavaScript's syntax! Java != JavaScript
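If the check runs often, the pattern could be compiled once and reused. The sketch below contrasts a whole-string check with a substring check; the class and method names are assumptions made for the example:

```java
import java.util.regex.Pattern;

final class CodeFormat {
    private static final Pattern CODE = Pattern.compile("[A-Z]{6}-[A-Z]{4}-[A-Z]{4}");

    // Whole-string check, like String.matches with the same regex.
    static boolean isExactly(String s) {
        return CODE.matcher(s).matches();
    }

    // Substring check: true when the pattern occurs anywhere in s.
    static boolean contains(String s) {
        return CODE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactly("AASAAA-AAAA-AAAA"));   // true
        System.out.println(isExactly("/AASAAA-AAAA-AAAA/")); // false
        System.out.println(contains("/AASAAA-AAAA-AAAA/"));  // true
    }
}
```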

Finding duplicate words within a string regex C/W

I'm currently dabbling in regex in Java, and want to try and find duplicate words in strings, for example in an input string such as 'This this is great.'. I was using \\b(\\w+) \\1\\b, but that only recognizes two duplicate words, such as 'this this', in a string.
Any help regarding this?
Add the "ignore case" switch (?i) to your regex:
(?i)\\b(\\w+) \\1\\b
Alternatively, you could fold the input to lower case first:
input.toLowerCase()
Note: If you're using String.matches(), the regex must match the entire input, so you'd add .* to both ends of your regex:
.*(?i)\\b(\\w+) \\1\\b.*
String pattern = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
You can use Matcher.group() and Matcher.group(1) to replace all duplicate words with this approach.
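One way to act on that last suggestion is to replace each match with its first group, which keeps the first occurrence of a repeated word and drops the repeats. A minimal sketch using the pattern above; the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

final class DuplicateWords {
    // A word, followed by zero or more repetitions of the same word (ignoring case),
    // each separated by at least one non-word character.
    private static final Pattern REPEATED =
            Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.CASE_INSENSITIVE);

    // Replaces every run of repeated words with its first occurrence.
    static String collapse(String input) {
        return REPEATED.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("This this is great.")); // This is great.
        System.out.println(collapse("the the the end"));     // the end
    }
}
```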
