placeholders for regular expressions

I'm pretty new to regular expressions, and I don't really know how to use them correctly yet.
As input I have a string in which I want to look for a certain pattern, let's say a word enclosed in !, like this: "Hello, my name is !John!". Now I want to replace the substring between the exclamation points with something different. How do I look for the substring without knowing what is inside?
String str = "I don't !know! how to do this";
str = str.replace("!placeholder!", "X");
Just like that.

str.replaceAll("!.*!", "X") would be one way to do it. Here . matches any character and * means "any number of those", so the expression reads as "replace an exclamation point, followed by any number of characters, followed by another exclamation point, with the letter X". Keep in mind that many characters have a special meaning in a regular expression and must be escaped when you want them literally. Also note that .* is greedy: if the string contains several placeholders, the match runs from the first ! to the last one, so "![^!]*!" or the reluctant "!.*?!" is safer when each placeholder should be replaced separately.
That would also replace the exclamation points themselves, so perhaps you want to write str.replaceAll("!.*!", "!X!"). Or maybe you don't want to replace the string "!!", in which case you'd use "!.+!". To explore all the possibilities, you should really read a tutorial like this one: https://www.vogella.com/tutorials/JavaRegularExpressions/article.html
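A minimal sketch of how the greedy pattern behaves when a string holds more than one placeholder, compared with a pattern that stops at the next exclamation point. The class name, method names, and sample sentence are made up for illustration.

```java
final class PlaceholderReplace {
    // Greedy: ".*" extends to the LAST '!' in the input, so everything
    // between the first and the last exclamation point is one match.
    static String greedy(String input, String replacement) {
        return input.replaceAll("!.*!", replacement);
    }

    // "[^!]+" cannot cross an exclamation point, so each placeholder
    // is matched and replaced on its own.
    static String perPlaceholder(String input, String replacement) {
        return input.replaceAll("![^!]+!", replacement);
    }

    public static void main(String[] args) {
        String s = "Hello !John!, meet !Jane! today";
        System.out.println(greedy(s, "X"));         // Hello X today
        System.out.println(perPlaceholder(s, "X")); // Hello X, meet X today
    }
}
```

The replacement text here contains no $ or \ characters; if it might, wrapping it in Matcher.quoteReplacement avoids surprises.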

Maybe,
!\w+!\s*
or,
!\w+!
might simply work OK for those examples.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "!\\w+!\\s*";
        final String string = "I don't !know! how to do this\n"
                + "Hello, my name is !John! ";
        final String subst = "something_else ";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        final String result = matcher.replaceAll(subst);
        System.out.println(result);
    }
}
Output
I don't something_else how to do this
Hello, my name is something_else
If you wish to simplify, modify, or explore the expression, regex101.com explains each part of it in its top-right panel and shows how it matches against sample inputs.

Related

Trying to understand and reproduce Regex Pattern

I am currently trying to understand Pattern and Matcher a little bit more and found the following code:
private static final Pattern PATTERN = Pattern.compile(
String.format("addPart%s(?<assembly>%s)\\+(?<amount>%s)%s(?<part>%s)",
InOutputStrings.COMMAND_SEPARATOR,
InOutputStrings.NAME_PATTERN,
InOutputStrings.NUMBER_PATTERN,
InOutputStrings.INNER_SEPARATOR,
InOutputStrings.NAME_PATTERN));
private String assemblyName;
private int amount;
private String partName;
...
assemblyName = matcher.group("assembly");
amount = tryParse(matcher.group("amount"));
partName = matcher.group("part");
whereby
NAME_PATTERN("[a-zA-Z]+"),
NUMBER_PATTERN("(?!(0[0-9]))[0-9]+"),
COMMAND_SEPARATOR(" "),
ARGUMENT_SEPARATOR(";"),
INNER_SEPARATOR(":")
What would be a valid input here?
Could someone show me what this would look like for the input pattern
"add track <startPoint> -> <endPoint>"?
I am working on a Command-line pattern and this would be a good way of implementing the input parsing.
Also, what is the meaning of "?", "\\+" and "<assembly>"...?

What would be a valid input here?
addPart foo+42:bar
...how this would look like for the input-pattern...
add track (?<startPoint>[a-zA-Z]+) -> (?<endPoint>[a-zA-Z]+)
...what is the meaning of...
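(?!...) is a negative lookahead: the match fails at that position if the enclosed pattern would match next. Here (?!(0[0-9])) rejects numbers with a leading zero, such as 042.
(?<assembly>...) is a named capturing group; its content can be read with matcher.group("assembly").
\\+ is a literal plus sign (not a regex quantifier); the backslash is doubled because it sits inside a Java string literal.
Also note that %s is a placeholder for String.format. It is not a regex operator either.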
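The two patterns can be tried out with a small sketch like the following. It assumes the separators and sub-patterns listed in the question have been substituted by hand, and that track endpoints are plain names (as the answer above does); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommandPatterns {
    // The question's pattern with the %s placeholders filled in manually.
    static final Pattern ADD_PART = Pattern.compile(
            "addPart (?<assembly>[a-zA-Z]+)\\+(?<amount>(?!(0[0-9]))[0-9]+):(?<part>[a-zA-Z]+)");

    // Same idea for "add track <startPoint> -> <endPoint>".
    static final Pattern ADD_TRACK = Pattern.compile(
            "add track (?<startPoint>[a-zA-Z]+) -> (?<endPoint>[a-zA-Z]+)");

    static boolean isValidAddPart(String input) {
        return ADD_PART.matcher(input).matches();
    }

    // Returns "start,end" when the whole input matches, otherwise null.
    static String parseAddTrack(String input) {
        Matcher m = ADD_TRACK.matcher(input);
        return m.matches() ? m.group("startPoint") + "," + m.group("endPoint") : null;
    }

    public static void main(String[] args) {
        System.out.println(isValidAddPart("addPart foo+42:bar"));  // true
        System.out.println(isValidAddPart("addPart foo+042:bar")); // false: leading zero
        System.out.println(parseAddTrack("add track A -> B"));     // A,B
    }
}
```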

How to match a string of tuples in Java?

I have strings like "(C,D) (E,F) (G,H) (J,K)" and "(C,D) (E,F) (G,H) (J,K)" or "((C,D) (E,F) (G,H) (J,K)". How can I return true only if the string matches the pattern of the first one (a single tuple, or a series of tuples separated by exactly one whitespace)? I tried something like "(\([A-Z],[A-Z]\)[ |$])+?", but it does not capture the final tuple. For the 2nd and 3rd strings it should return false.

Here is the problem of your regex:
(\([A-Z],[A-Z]\)[ |$])+?
                ^^^^^
You thought that meant "space or end of string", didn't you? It actually means "space or | or dollar sign". A lot of special characters lose their special meaning when placed inside a character class.
You should replace it with (?: |$) instead. Also, the +? at the end should be a greedy +:
(\([A-Z],[A-Z]\)(?: |$))+
Personally, I don't really like this "space or end of string" thing. I would prefer repeating the tuple pattern (especially when the repeated pattern is not long):
(?:\([A-Z],[A-Z]\) )*(?:\([A-Z],[A-Z]\))
Needless to say, you should match with matches, not find.
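As a rough sketch, the repeated-tuple pattern can be wrapped in a validator that uses matches(), so the whole string has to fit. The class name and the sample inputs are illustrative; the double-space input is an assumption about what the question's second string looked like.

```java
import java.util.regex.Pattern;

final class TupleValidator {
    // Zero or more "tuple + single space", followed by one final tuple.
    private static final Pattern TUPLES =
            Pattern.compile("(?:\\([A-Z],[A-Z]\\) )*\\([A-Z],[A-Z]\\)");

    static boolean isValid(String s) {
        // matches() requires the entire input to match, unlike find().
        return TUPLES.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("(C,D) (E,F) (G,H) (J,K)"));  // true
        System.out.println(isValid("(C,D)  (E,F)"));             // false: two spaces
        System.out.println(isValid("((C,D) (E,F) (G,H) (J,K)")); // false: stray parenthesis
    }
}
```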

If you want to match a string of parenthesised pairs of comma-separated capital letters, with a single space between each pair, you could use a pattern like this:
^\\([A-Z],[A-Z]\\)( \\([A-Z],[A-Z]\\))*$
That is: letter, comma, letter, all in parentheses, followed by zero or more occurrences of the same parenthesised expression, each preceded by a space. (The backslashes are doubled because the pattern is written as a Java string literal.)

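I guess you might be able to do that with:
\([^()\r\n]+\)|\s*
Replace every match with an empty string; if the result is not empty, the input is not a valid series of tuples. Note that \s* also accepts repeated whitespace, and [^()\r\n]+ accepts any content between the parentheses, so this check is looser than the patterns above.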
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "\\([^()\\r\\n]+\\)|\\s*";
        final String string = "(C,D) (E,F) (G,H) (J,K)\n"
                + "(C,D) (E,F) (G,H) (J,K)\n"
                + "((C,D) (E,F) (G,H) (J,K)";
        final String subst = "";
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        final String result = matcher.replaceAll(subst);
        System.out.println(result);
    }
}
Output
(
Source
Regular expression to match balanced parentheses

Java regex to match the start of the word?

Objective: for a given term, I want to check whether that term exists at the start of a word. For example, if the term is 't', then in the sentence:
"This is the difficult one Thats it"
I want it to return "true" because of :
This, the, Thats
so consider:
public class HelloWorld {
    public static void main(String[] args) {
        String term = "t";
        String regex = "/\\b"+term+"[^\\b]*?\\b/gi";
        String str = "This is the difficult one Thats it";
        System.out.println(str.matches(regex));
    }
}
I am getting following Exception:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 7
/\bt[^\b]*?\b/gi
       ^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.escape(Pattern.java:2416)
at java.util.regex.Pattern.range(Pattern.java:2577)
at java.util.regex.Pattern.clazz(Pattern.java:2507)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.util.regex.Pattern.matches(Pattern.java:1128)
at java.lang.String.matches(String.java:2063)
at HelloWorld.main(HelloWorld.java:8)
Also the following does not work:
import java.util.regex.*;

public class HelloWorld {
    public static void main(String[] args) {
        String term = "t";
        String regex = "\\b"+term+"gi";
        //String regex = ".";
        System.out.println(regex);
        String str = "This is the difficult one Thats it";
        System.out.println(str.matches(regex));
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        System.out.println(m.find());
    }
}
Example:
{ This , one, Two, Those, Thanks }
for words This Two Those Thanks; result should be true.
Thanks

Since you're using the Java regex engine, you need to write the expression in a way Java understands. That means removing the leading and trailing slashes and adding flags as (?flags), for example (?i), at the beginning of the expression.
Thus you'd need this instead:
String regex = "(?i)\\b"+term+".*?\\b";
Have a look at regular-expressions.info/java.html for more information. A comparison of supported features can be found here (just as an entry point): regular-expressions.info/refbasic.html

In Java we don't surround a regex with /, so instead of "/regex/flags" we just write the regex. If you want to add flags, you can do it with the (?flags) syntax, placed in the regex at the position from which the flags should apply. For instance, a(?i)a will be able to find aa and aA but not Aa, because the flag was added after the first a.
You can also compile your regex into Pattern like this
Pattern pattern = Pattern.compile(regex, flags);
where regex is a String (again, not enclosed in /) and flags is an int built from the constants in Pattern, such as Pattern.DOTALL; when you need several flags, combine them with |, as in Pattern.CASE_INSENSITIVE|Pattern.MULTILINE.
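To make the two ways of setting flags concrete, here is a small sketch. The class name, the helper method, and the sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

final class FlagDemo {
    // Inline flag: case-insensitivity only applies after the (?i).
    static final Pattern INLINE = Pattern.compile("a(?i)a");

    // Flag constants combined with bitwise OR.
    static final Pattern COMBINED =
            Pattern.compile("^hello$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(found(INLINE, "aA")); // true
        System.out.println(found(INLINE, "Aa")); // false: the first a is case-sensitive
        // MULTILINE lets ^ and $ match at line breaks inside the input.
        System.out.println(found(COMBINED, "first line\nHELLO\nlast line")); // true
    }
}
```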
The next thing that may confuse you is the matches method. Most people are misled by its name: they assume it checks whether the string contains something the regex can match, but in reality it checks whether the entire string can be matched by the regex.
What you seem to want is a way to test whether the regex can be found at least once in the string. In that case you may either
add .* at the start and end of your regex so that the characters around the element you are looking for can also be matched, although this way matches must process the entire string
use a Matcher object built from a Pattern (representing your regex) and call its find() method, which scans until it finds a match or reaches the end of the string. I prefer this approach because it does not need to process the entire string; it stops as soon as a match is found.
So your code could look like
String str = "This is the difficult one Thats it";
String term = "t";
Pattern pattern = Pattern.compile("\\b"+term, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.find());
In case your term could contain some regex special characters but you want regex engine to treat them as normal characters you need to make sure that they will be escaped. To do this you can use Pattern.quote method which will add all necessary escapes for you, so instead of
Pattern pattern = Pattern.compile("\\b"+term, Pattern.CASE_INSENSITIVE);
for safety you should use
Pattern pattern = Pattern.compile("\\b"+Pattern.quote(term), Pattern.CASE_INSENSITIVE);
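Putting the pieces of this answer together, a minimal sketch could look like this. The class and method names are hypothetical; the sentence is the one from the question.

```java
import java.util.regex.Pattern;

final class WordStartSearch {
    // True if any word in text starts with term, ignoring case.
    // Pattern.quote keeps regex metacharacters in term literal.
    static boolean anyWordStartsWith(String text, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String str = "This is the difficult one Thats it";
        System.out.println(anyWordStartsWith(str, "t")); // true
        // matches() demands that the WHOLE string fit the regex, so this is false.
        System.out.println(str.matches("(?i)\\bt"));     // false
    }
}
```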

String regex = "(?i)\\b"+term;
In Java, the modifiers must be inserted between "(?" and ")" and there is a variant for turning them off again: "(?-" and ")".
For finding all words beginning with "T" or "t", you may want to use Matcher's find method repeatedly. If you just need the offset, Matcher's start method returns the offset.
If you need to match the full word, use
String regex = "(?i)\\b"+term + "\\w*";
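Calling find repeatedly, as described above, might be sketched like this. The class and method names are made up, and Pattern.quote is an addition to guard against special characters in the term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordFinder {
    // Collects every whole word that starts with term, ignoring case.
    static List<String> wordsStartingWith(String text, String term) {
        Pattern p = Pattern.compile("(?i)\\b" + Pattern.quote(term) + "\\w*");
        Matcher m = p.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {          // each call continues after the previous match
            words.add(m.group());   // m.start() would give the offset instead
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsStartingWith("This is the difficult one Thats it", "t"));
        // [This, the, Thats]
    }
}
```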

String str = "This is the difficult one Thats it";
String term = "t";
// Matches a word that starts with the term; Pattern.quote keeps any
// regex metacharacters in the term literal.
Pattern pattern = Pattern.compile("^" + Pattern.quote(term) + ".*", Pattern.CASE_INSENSITIVE);
// Split the sentence into words and test each word as a whole.
for (String s : str.split(" ")) {
    System.out.println(s + "-->" + pattern.matcher(s).matches());
}

Replace all spaces except the ones with in HTML tags

I need to replace all spaces in a string with the HTML entity &nbsp. The following does the replacement, but it also replaces the spaces within HTML tags like <a href="http://google.com" />.
string.replaceAll(" ", "&nbsp")
But I need it to not change the tags.
Example:
String s1 = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>"
After replacement, it should look like this:
String s1 = "Hello!,&nbspCheck&nbspout&nbspthis&nbsp<^a href=\"http://www.entrepreneur.com/article/234538\">10&nbspMovies&nbspEvery&nbspEntrepreneur&nbspNeeds&nbspto&nbspWatch&nbsp<^/a>"
Can anybody suggest a more intelligent regex to accomplish the task?

I know you have already accepted an answer, but your problem has another simple solution that wasn't mentioned. This situation sounds very similar to this question to "regex-match a pattern, excluding..."
With all the disclaimers about using regex to parse html, here is a simple way to do it.
We can solve it with a beautifully-simple regex:
<[^<>]*>|( )
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
This full Java program shows how to use the regex:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>";
        Pattern regex = Pattern.compile("<[^<>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            // Group 1 is set only when the right-hand alternative matched,
            // i.e. the match is a space outside a tag.
            if (m.group(1) != null) m.appendReplacement(b, "&nbsp;");
            // Otherwise a whole tag matched: copy it back unchanged,
            // escaping any $ or \ it may contain.
            else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
        }
        m.appendTail(b);
        String replaced = b.toString();
        System.out.println(replaced);
    } // end main
} // end Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
How to match a pattern unless...

If we can assume that the only use of > and < in the string is for the tags, then this regex (which starts with a literal space) will work:
 (?![^<]*>)
It works for your example.
How it works:
The leading space matches the space character. This is exactly like what you did.
(?! starts a negative lookahead. This means that this regex will match only if it is not followed by something that matches the regex in the lookahead.
[^<]* matches any character that is not <, any number of times
> matches >
) closes the lookahead.
In other words, this regex matches any space unless the next angle bracket after it is a >, that is, unless the space is inside a tag.
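A short sketch of the lookahead approach in use, under the same assumption that < and > appear only as tag delimiters. The class name, method name, and sample markup are invented; the replacement uses the full entity &nbsp; with its terminating semicolon.

```java
final class NbspOutsideTags {
    // Replaces each space that is not inside a <...> tag.
    // The pattern begins with a literal space, followed by the lookahead.
    static String replaceSpaces(String html) {
        return html.replaceAll(" (?![^<]*>)", "&nbsp;");
    }

    public static void main(String[] args) {
        String in = "Hello World <a href=\"x y\">a b</a>";
        System.out.println(replaceSpaces(in));
        // Hello&nbsp;World&nbsp;<a href="x y">a&nbsp;b</a>
    }
}
```

The spaces inside the tag, including the one inside the attribute value, are left alone because the next angle bracket after them is the closing >.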

How to remove special characters from a string?

I want to remove special characters like:
- + ^ . : ,
from a String using Java.

That depends on what you define as special characters, but try replaceAll(...):
String result = yourString.replaceAll("[-+.^:,]","");
Note that the ^ character must not be the first one in the list, since you'd then either have to escape it or it would mean "any but these characters".
Another note: the - character needs to be the first or last one in the list, otherwise you'd have to escape it or it would define a range (e.g. +-: would mean "all characters in the range + to :"; a reversed range such as :-, is rejected with a PatternSyntaxException).
So, in order to keep consistency and not depend on character positioning, you might want to escape all those characters that have a special meaning in regular expressions (the following list is not complete, so be aware of other characters like (, {, $ etc.):
String result = yourString.replaceAll("[\\-\\+\\.\\^:,]","");
If you want to get rid of all punctuation and symbols, try this regex: [\p{P}\p{S}] (keep in mind that in Java strings you'd have to escape the backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define what should be left in your string:
String result = yourString.replaceAll("[^\\w\\s]","");
This means: replace everything that is not a word character (a-z in any case, 0-9 or _) or whitespace.
Edit: please note that there are a couple of other patterns that might prove helpful. However, I can't explain them all, so have a look at the reference section of regular-expressions.info.
Here's a less restrictive alternative to the "define allowed characters" approach, as suggested by Ray:
String result = yourString.replaceAll("[^\\p{L}\\p{Z}]","");
The regex matches everything that is not a letter in any language and not a separator (whitespace, linebreak etc.). Note that you can't use [\P{L}\P{Z}] (upper case P means not having that property), since that would mean "everything that is not a letter or not whitespace", which almost matches everything, since letters are not whitespace and vice versa.
Additional information on Unicode
Some unicode characters seem to cause problems due to different possible ways to encode them (as a single code point or a combination of code points). Please refer to regular-expressions.info for more information.
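The three approaches from this answer can be compared side by side with a sketch like the one below. The class name, method names, and sample strings are illustrative only; the non-ASCII letters are written as Unicode escapes so the source stays encoding-independent.

```java
final class SpecialCharStripper {
    // Removes exactly the characters listed in the question.
    static String stripListed(String s) {
        return s.replaceAll("[-+.^:,]", "");
    }

    // Removes all Unicode punctuation (P) and symbols (S).
    static String stripPunctuationAndSymbols(String s) {
        return s.replaceAll("[\\p{P}\\p{S}]", "");
    }

    // Keeps only letters (L) and separators (Z); digits are removed too.
    static String keepLettersAndSeparators(String s) {
        return s.replaceAll("[^\\p{L}\\p{Z}]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripListed("a-b+c^d.e:f,g"));                      // abcdefg
        System.out.println(stripPunctuationAndSymbols("Hello, World! 1+1=2")); // Hello World 112
        System.out.println(keepLettersAndSeparators("Gr\u00fc\u00dfe, Welt! 42"));
    }
}
```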

This will remove all characters except ASCII letters and digits:
replaceAll("[^A-Za-z0-9]","");

As described here
http://developer.android.com/reference/java/util/regex/Pattern.html
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionTest {
    public static void main(String[] args) {
        System.out.println("String is = "+getOnlyStrings("!&(*^*(^(+one(&(^()(*)(*&^%$##!#$%^&*()("));
        System.out.println("Number is = "+getOnlyDigits("&(*^*(^(+91-&*9hi-639-0097(&(^("));
    }

    // Removes everything that is not an ASCII digit.
    public static String getOnlyDigits(String s) {
        Pattern pattern = Pattern.compile("[^0-9]");
        Matcher matcher = pattern.matcher(s);
        return matcher.replaceAll("");
    }

    // Removes everything that is not an ASCII letter or a space.
    public static String getOnlyStrings(String s) {
        Pattern pattern = Pattern.compile("[^a-z A-Z]");
        Matcher matcher = pattern.matcher(s);
        return matcher.replaceAll("");
    }
}
Result
String is = one
Number is = 9196390097

Try the replaceAll() method of the String class.
Here is its signature, with return type and parameters:
public String replaceAll(String regex,
String replacement)
Example:
String str = "Hello +-^ my + - friends ^ ^^-- ^^^ +!";
str = str.replaceAll("[-+^]+", "");
This removes all the '^', '+' and '-' characters that you wanted to remove; the spaces and the final ! remain.

To Remove Special character
String t2 = "!##$%^&*()-';,./?><+abdd";
t2 = t2.replaceAll("\\W+","");
Output will be : abdd.
This works perfectly.

Use the String.replaceAll() method in Java.
replaceAll should be good enough for your problem.

You can remove a single character as follows:
String str="+919595354336";
String result = str.replaceAll("\\+","");
System.out.println(result);
OUTPUT:
919595354336

If you just want to do a literal replace in Java, use Pattern.quote(string) to turn any string into a literal pattern.
myString.replaceAll(Pattern.quote(matchingStr), replacementStr)
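A small sketch of literal replacement, with hypothetical class and method names. It additionally wraps the replacement in Matcher.quoteReplacement, because $ and \ are special in the replacement string of replaceAll, and shows String.replace, which treats both arguments literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    // Regex-based replace with both arguments forced to be literal.
    static String viaReplaceAll(String text, String target, String replacement) {
        return text.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    // String.replace(CharSequence, CharSequence) never interprets a regex.
    static String viaReplace(String text, String target, String replacement) {
        return text.replace(target, replacement);
    }

    public static void main(String[] args) {
        System.out.println(viaReplaceAll("price: 5+5", "5+5", "$10")); // price: $10
        System.out.println(viaReplace("price: 5+5", "5+5", "$10"));    // price: $10
    }
}
```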
