Java Regex replaceAll() with lookahead - java

I am fairly new to using regex with Java. My goal is to escape every occurrence of '*' with a backslash.
This was the statement that I tried:
String replacementStr= str.replaceAll("(?=\\[*])", "\\\\");
This does not seem to work. After some tinkering, I found that the following does work:
String replacementStr= str.replaceAll("(?=[]\\[*])", "\\\\");
Based on what I know of regular expressions, I thought '[]' represents an empty character class. Am I missing something here? Can someone please help me understand this?
Note: The point of my trial was to learn the lookahead feature of regex. The purpose stated in the question does not warrant a lookahead; I am just using it for educational purposes. Sorry for not making that clear!

Most metacharacters do not need to be escaped when they are placed inside a character class (square brackets).
Also, I am not sure whether you mean replacing * with \*. If so, try this:
String newStr = str.replace("*", "\\*");
EDIT: There is something curious in your regular expressions.
(?=\[*]) Look ahead for the character [ (0 or more times), followed by ]
(?=[]\[*]) Look ahead for one of the next characters: [, ], *
Perhaps the regex that you are looking for is the following:
(?=\*)
In Java, "(?=\\*)"
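To make the difference between these lookaheads concrete, here is a small sketch that runs each of them through replaceAll. The class name, helper method and sample string are made up for illustration.

```java
final class LookaheadEscapeDemo {
    // Inserts a backslash at every position where the (zero-width) lookahead matches.
    static String insertBackslash(String input, String lookaheadRegex) {
        // "\\\\" in Java source is the replacement text \\ , i.e. one literal backslash.
        return input.replaceAll(lookaheadRegex, "\\\\");
    }

    public static void main(String[] args) {
        String sample = "a*b]";
        // Zero or more '[' followed by ']': only the position before ']' matches.
        System.out.println(insertBackslash(sample, "(?=\\[*])"));   // a*b\]
        // Character class containing ']', '[' and '*'.
        System.out.println(insertBackslash(sample, "(?=[]\\[*])")); // a\*b\]
        // Lookahead for a literal '*' only.
        System.out.println(insertBackslash(sample, "(?=\\*)"));     // a\*b]
        // Same result without any regex.
        System.out.println(sample.replace("*", "\\*"));             // a\*b]
    }
}
```

The second pattern only appeared to work because '*' happens to be one of the three characters in the class; it also escapes '[' and ']'.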

Instead of replaceAll("(?=\\[*])", "\\\\"); simply use
String newStr = str.replace("*", "\\*");
There is no need for a regex here.
For example
String str = "abc*123*";
String newStr = str.replace("*", "\\*");
System.out.println(newStr);
prints
abc\*123\*
See the documentation for String.replace

The code below will work
Code
String strTest = "jhgfg*gfb*gfhh";
strTest = strTest.replaceAll("\\*", "\\\\*"); // or: strTest = strTest.replace("*", "\\*");
System.out.println("String is : "+strTest);
OUTPUT
String is : jhgfg\*gfb\*gfhh

If the regex engine finds [], it treats the ] as a literal ]. This is never a problem because an empty character class is useless anyway, and it means you can avoid some character escaping.
There are a few rules for characters you don't have to escape in character classes:
in []...] (or [^]...]), a ] placed directly after the opening [ (or [^) is literal
in [-.....] or [^-.....] or [.....-] or [^.....-], the - is literal
^ is literal unless it is at the start of the character class
So you'll never need to escape ], - or ^ if you don't want to.
This is down to the Perl origins of the regex syntax. It's a very Perl-style way of doing things.
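A quick way to check these rules against java.util.regex is a sketch like the one below. The class and helper names are hypothetical; the patterns are the ones the rules describe.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class CharClassLiteralsDemo {
    // Returns true if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("]".matches("[]]"));  // ] right after [ is literal
        System.out.println("]".matches("[^]]")); // negated class that excludes ]
        System.out.println("-".matches("[-a]")); // - at the start is literal
        System.out.println("-".matches("[a-]")); // - at the end is literal
        System.out.println("^".matches("[a^]")); // ^ not at the start is literal
        System.out.println(compiles("[]"));      // a truly empty class is rejected
    }
}
```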

Related

How do i check if string contains char sequence and backslash "\"?

I'm trying to get true in the following test. I have a string with a backslash that for some reason is not recognized.
String s = "Good news\\ everyone!";
Boolean test = s.matches("(.*)news\\.");
System.out.println(test);
I've tried a lot of variants, but only (.*)news(.*) works. That, however, allows any characters after news, and I need to match only when news is followed by \.
How can I do that?
Group the elements at the end: (.*)news\\(.*)
In Java source that becomes:
Boolean test = s.matches("(.*)news\\\\(.*)");
Try something like:
Boolean test = s.matches(".*news\\\\.*");
Here .* means any number of characters, followed by news, followed by a single literal backslash (written as four backslashes in Java source: escaped once for the regex and once more for the string literal), and then any number of characters after that (possibly zero).
What your regex means is:
.* Any number of characters
news - the literal text "news"
\\. - in Java source this is the regex \. , which matches a literal dot, not a backslash
That does not match the String in your program, "Good news\ everyone!", because no dot follows news.
You are testing for an escaped occurrence of a literal dot: ".".
Refactor your pattern as follows (inferring the last part as you need it for a full match):
String s = "Good news\\ everyone!";
System.out.println(s.matches("(.*)news\\\\.*"));
Output
true
Explanation
The backslash is used to escape characters, including the backslash itself, in Java String literals
In Java Pattern representations, you need to double-escape a literal backslash ("\\\\"), because doubled backslashes are already used to introduce special constructs (e.g. \\p{Punct}) or to escape metacharacters (e.g. the literal dot \\.).
String.matches will attempt to match the whole String against your pattern, so you need the terminal part of the pattern I've added
you can try this :
String s = "Good news\\ everyone!";
Boolean test = s.matches("(.*)news\\\\(.*)");
System.out.println(test);
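For comparison, here is a small sketch that puts the failing pattern, the working pattern and a regex-free alternative side by side. The class and method names are invented for the example; String.contains is suggested only as an assumption about what the title ("contains") is really asking for.

```java
final class BackslashMatchDemo {
    // Regex version: four backslashes in source = one literal backslash in the regex.
    static boolean matchesNewsBackslash(String s) {
        return s.matches(".*news\\\\.*");
    }

    // Regex-free version: two backslashes in source = one backslash in the string.
    static boolean containsNewsBackslash(String s) {
        return s.contains("news\\");
    }

    public static void main(String[] args) {
        String s = "Good news\\ everyone!"; // the text is: Good news\ everyone!
        System.out.println(s.matches("(.*)news\\."));  // false: looks for "news."
        System.out.println(matchesNewsBackslash(s));   // true
        System.out.println(containsNewsBackslash(s));  // true
    }
}
```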

Java Split on Spaces and Special Characters

I am trying to split a string on spaces and some specific special characters.
Given the string "john - & + $ ? . # boy"
I want to get the array:
array[0]="john";
array[1]="boy";
I've tried several regular expressions and gotten nowhere. Here is my current stab:
String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.#&].*");
Which preserves "john" but not "boy". Can anyone get me the rest of this?
Just use:
String[] terms = input.split("[\\s#&.?$+-]+");
You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, ^ and \. However, & is meaningful only when it comes as the pair &&, ^ only directly after the opening [, and - is treated as a literal character if it is placed at the beginning or the end of the character class.
Other languages may have different rules for parsing the pattern, but the rule about - applies to most engines.
As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. \w in Java is equivalent to [a-zA-Z0-9_] (English letters in upper and lower case, digits and underscore), and therefore \W consists of all other characters. If you want to consider Unicode letters and digits, you may want to look at Unicode character classes.
You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting characters instead of blacklisting them, which is usually a good idea.
And of course, things could be made more efficient by using Guava's Splitter class.
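The sketch below compares the explicit character class with "\\W+" on the sample string, and shows where the ASCII-only \W starts to matter. The class name, method names and the accented sample are assumptions added for illustration; the Unicode variant uses the \p{L} and \p{Nd} properties.

```java
import java.util.Arrays;

final class SplitOnSpecialsDemo {
    // Blacklist: whitespace plus the listed special characters.
    static String[] splitExplicit(String input) {
        return input.split("[\\s#&.?$+-]+");
    }

    // Whitelist (ASCII): anything that is not [a-zA-Z0-9_] is a delimiter.
    static String[] splitNonWord(String input) {
        return input.split("\\W+");
    }

    // Whitelist (Unicode): anything that is not a letter or decimal digit.
    static String[] splitNonLetterDigit(String input) {
        return input.split("[^\\p{L}\\p{Nd}]+");
    }

    public static void main(String[] args) {
        String ugly = "john - & + $ ? . # boy";
        System.out.println(Arrays.toString(splitExplicit(ugly))); // [john, boy]
        System.out.println(Arrays.toString(splitNonWord(ugly)));  // [john, boy]

        String accented = "caf\u00e9 & na\u00efve";
        // \W treats the accented letters as delimiters.
        System.out.println(Arrays.toString(splitNonWord(accented)));
        // The Unicode-aware class keeps both words intact.
        System.out.println(Arrays.toString(splitNonLetterDigit(accented)));
    }
}
```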
Try this:
String[] terms = input.replaceAll("[-&+$?.#]", " ").trim().split("\\s+");
Breaking it down step by step:
For your case, you remove the non-word characters (as pointed out) while preserving the spaces, which makes the later String split easy.
String ugly = "john - & + $ ? . # boy";
String words = ugly.replaceAll("[^\\w\\s]", "");
There are a lot of spaces in the resulting String which you might want to generally trim to just 1 space:
String formatted = words.trim().replaceAll(" +", " ");
Now you can easily split the String into the words to a String Array:
String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
To add to what has been said about Splitter, you can do something like this:
String str = "john - & + $ ? . # boy";
Iterable<String> ttt = Splitter.on(Pattern.compile("\\W")).trimResults().omitEmptyStrings().split(str);
Use this format.
String s = "john - & + $ ? . # boy";
String reg = "[-&+$?.# ]+";
String[] res = s.split(reg);
Here, include every character that you want to split on inside the [ ] brackets; the trailing + treats a run of them as a single delimiter.
You can use something like the following:
arrayOfStringType = string.split(" |'|,|\\.|\\+|_");
'|' works as an OR operator here; . and + are regex metacharacters, so they are escaped.

Java String Split on any character (including regex special characters)

I'm sure I'm just overlooking something here...
Is there a simple way to split a String on an explicit character without applying RegEx rules?
For instance, I receive a string with a dynamic delimiter, I know the 5th character defines the delimiter.
String s = "This,is,a,sample";
For this, it's simple to do
String delimiter = String.valueOf(s.charAt(4));
String[] result = s.split(delimiter);
However, when I have a delimiter that's a special RegEx character, this doesn't work:
String s = "This*is*a*sample";
So... is there a way to split the string on an explicit character without trying to apply extra RegEx rules? I feel like I must be missing something pretty simple.
split uses a regular expression as its argument. * is a metacharacter that means "zero or more of the preceding element" in regular expressions. You can use Pattern#quote so that the character is not interpreted:
String[] result = s.split(Pattern.quote(delimiter));
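Put together, the approach might look like the following sketch. The class and method names are hypothetical, and it assumes the input always has at least five characters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class LiteralSplitDemo {
    // Assumption: the fifth character (index 4) of the input names the delimiter.
    static String[] splitOnFifthChar(String s) {
        String delimiter = String.valueOf(s.charAt(4));
        // Pattern.quote wraps the text so split treats it as a literal.
        return s.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnFifthChar("This,is,a,sample")));
        System.out.println(Arrays.toString(splitOnFifthChar("This*is*a*sample")));
        System.out.println(Arrays.toString(splitOnFifthChar("This|is|a|sample")));
        System.out.println(Arrays.toString(splitOnFifthChar("This.is.a.sample")));
    }
}
```

Each call should print [This, is, a, sample], regardless of whether the delimiter is a regex metacharacter.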
You do not need to worry about which character it is if you quote it before compiling a Pattern:
Pattern regex = Pattern.compile(Pattern.quote(String.valueOf(s.charAt(4))));
Matcher matcher = regex.matcher(yourString);
if (matcher.find()){
//do something
}
You can run Pattern.quote on the delimiter before feeding it in. This will create a string literal and escape any regex specific chars:
delimiter = Pattern.quote(delimiter);
Alternatively, pass the original (unquoted) delimiter to StringUtils.split:
StringUtils.split(s, delimiter);
That treats the delimiter as plain characters rather than as a regex.
StringUtils is part of the Apache Commons Lang library, which has tons of useful methods. It is worth a look and could save you some time in the future.
Simply put your delimiter between []
String delimiter = "["+s.charAt(4)+"]";
String[] result = s.split(delimiter);
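This works because [ ] matches any single character listed between the brackets. You can also specify a list of delimiters like [*,.+-]. Note that a few characters, such as \ and ^, are still special inside brackets, so Pattern.quote is the safer choice for an arbitrary delimiter.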
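One caveat with the bracket trick, shown as a rough sketch with invented names: it handles * fine, but a backslash delimiter produces a pattern that does not compile, because \] is read as an escaped bracket and the class is never closed.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class BracketDelimiterDemo {
    // Wraps the delimiter in a character class, as suggested above.
    static String[] splitWithBrackets(String s, char delimiter) {
        return s.split("[" + delimiter + "]");
    }

    // Reports whether the bracketed form is a valid regex for this delimiter.
    static boolean bracketFormCompiles(char delimiter) {
        try {
            Pattern.compile("[" + delimiter + "]");
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithBrackets("This*is*a*sample", '*')));
        System.out.println(bracketFormCompiles('*'));  // true
        System.out.println(bracketFormCompiles('\\')); // false: "[\]" is unclosed
    }
}
```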

the correct regex for replacing em-dash with a basic "-" in java

My question concerns the replaceAll method of String class.
My purpose is to replace all the em-dashes in a text with a basic "-".
I know the unicode character of em-dash is \u2014.
I tried it in the following way:
String s = "asd – asd";
s = s.replaceAll("\u2014", "-");
Still, the em-dash is not replaced. What is it I'm doing wrong?
Minor edit after question edit:
You might not be using an em-dash at all. If you're not sure what you have, a nice solution is to simply find and replace all dashes... em or otherwise. Take a look at this answer, you can try to use the Unicode dash punctuation property for all dashes ==> \\p{Pd}
String s = "asd – asd";
s = s.replaceAll("\\p{Pd}", "-");
Working example replacing an em dash and regular dash both with the above code.
References:
public String replaceAll(String regex, String replacement)
Unicode Regular Expressions
Based on what you posted, the problem may not actually lie with your code, but with your assumed dash. What you have looks like an en dash (width of a capital N) rather than an em dash (width of a capital M). The Unicode for the en dash is U+2013, try using that instead and see if it updates properly.
String.replaceAll takes a regex as its first parameter. If you just want to replace all occurrences of a single char with another char, consider using String.replace(char, char):
String s = "asd – asd";
s = s.replace('\u2014', '-');
It works fine for me. My guess is that you're not using an em-dash. Try copy-pasting the em-dash character from the character map instead of from Word.
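The sketch below illustrates why the original code appears to do nothing: the sample text contains an en dash (U+2013), while the code looks for an em dash (U+2014). The class and method names are hypothetical, and the dashes are written as Unicode escapes so the source file's encoding does not matter.

```java
final class DashNormalizeDemo {
    // Replaces every character in the Unicode "dash punctuation" category with '-'.
    static String normalizeDashes(String s) {
        return s.replaceAll("\\p{Pd}", "-");
    }

    public static void main(String[] args) {
        String enDash = "asd \u2013 asd"; // en dash, as in the question's sample
        String emDash = "asd \u2014 asd"; // em dash

        // Looking for an em dash leaves the en dash untouched.
        System.out.println(enDash.replace('\u2014', '-').equals(enDash)); // true
        // \p{Pd} covers both kinds of dash.
        System.out.println(normalizeDashes(enDash)); // asd - asd
        System.out.println(normalizeDashes(emDash)); // asd - asd
    }
}
```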

java regex help

I can have one string with the following two formats:
"[HardCOdeText1 (HardCodeText2)].[HardCodeText3].[MatchString] between changeValue1 and changeValue2";
"[MatchString] between changeValue1 and changeValue2";
I would like to check whether the string contains the "[MatchString] between" expression.
Depending on which format matched, the changed string should be one of the following:
"[HardCOdeText1 (HardCodeText2)].[HardCodeText3].[MatchString] between changed1 and changed2"; or
"[MatchString] between changed1 and changed2";
I started by trying to match the "[MatchString] between" expression and got stuck there.
Any help is appreciated.
[ and ] are reserved chars in regular expressions and you need to escape them.
Also, here is a really nice online regular expression tester that uses java regexp:
http://www.regexplanet.com/simple/index.html
Are you using "[MatchString] between" as your pattern without escaping the [] brackets? The characters [, ], ., (, and ) are all special characters in RegEx. If you want to refer to those as literal characters, you need to escape them in the pattern.
Your question is a little unclear. Maybe if you provided specific examples of a real input string you're using, and what you want the matches to look like?
I'd also check out regular-expressions.info for more information and tutorials, including info about and syntax for Java's implementation of RegEx.
This will match both kinds of line examples that you give.
String inputLine;
String outputLine;
String regex1 = "(\\[MatchString\\] between )changeValue1 and";
String regex2 = "(\\[MatchString\\] between )[^ ]+ and";
do {
inputLine = readTheInput();
outputLine = inputLine.replaceFirst(regex1, "$1changed1 and");
writeTheOutput(outputLine);
} while (thereIsStillInput());
regex1 looks specifically for changeValue1, while regex2 accepts any run of non-space characters between "between" and "and"; swap it in for regex1 if the first value varies.
This should get you started.
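Building on that, here is one possible self-contained sketch that rewrites both values in a single pass. The class name, method name and the assumption that neither value contains whitespace are mine, not part of the question.

```java
import java.util.regex.Matcher;

final class BetweenRewriteDemo {
    // Group 1 keeps "[MatchString] between "; \S+ stands for each old value.
    // Assumption: the two values contain no whitespace.
    private static final String REGEX = "(\\[MatchString\\] between )\\S+ and \\S+";

    static String rewrite(String line, String first, String second) {
        // quoteReplacement guards against '$' or '\' inside the new values.
        String tail = Matcher.quoteReplacement(first + " and " + second);
        return line.replaceFirst(REGEX, "$1" + tail);
    }

    public static void main(String[] args) {
        String longForm = "[HardCOdeText1 (HardCodeText2)].[HardCodeText3]."
                + "[MatchString] between changeValue1 and changeValue2";
        String shortForm = "[MatchString] between changeValue1 and changeValue2";
        System.out.println(rewrite(longForm, "changed1", "changed2"));
        System.out.println(rewrite(shortForm, "changed1", "changed2"));
    }
}
```

Lines that do not contain "[MatchString] between" are returned unchanged, since replaceFirst only substitutes when the pattern is found.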
