Java Regex with Pattern and Matcher

I am using Java's Pattern and Matcher classes.
I am reading a template's text and I want to replace:
1. src="scripts/test.js" with src="scripts/test.js?Id=${Id}"
2. src="Servlet?Template=scripts/test.js" with src="Servlet?Id=${Id}&Template=scripts/test.js"
I'm using the code below to handle case 2:
//strTemplateText is the Template's text
Pattern p2 = Pattern.compile("(?i)(src\\s*=\\s*[\"'])(.*?\\?)");
Matcher m2 = p2.matcher(strTemplateText);
strTemplateText = m2.replaceAll("$1$2Id=" + CurrentESSession.getAttributeString("Id", "") + "&");
The above code works correctly for case 2, but how can I write a regex that handles both cases 1 and 2?
Thank you

You don't need a regular expression. If you change case 2 to
replace Servlet?Template=scripts/test.js with Servlet?Template=scripts/test.js&Id=${Id}
then all you need to do is check whether the source string contains a ?. If it does not, add ?Id=${Id}; otherwise add &Id=${Id}.
After all,
if (strTemplateText.contains("?")) {
    strTemplateText += "&Id=${Id}";
}
else {
    strTemplateText += "?Id=${Id}";
}
does the job.
Or, even shorter:
strTemplateText += strTemplateText.contains("?") ? "&Id=${Id}" : "?Id=${Id}";
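The same check can be wrapped in a small helper. This is only a sketch: it assumes the method receives a single URL rather than the whole template, and the names QueryIdAppender and appendId are made up for illustration.

```java
class QueryIdAppender {
    // Hypothetical helper: appends an Id parameter to a single URL string.
    static String appendId(String url, String id) {
        // Use '&' when the URL already has a query string, '?' otherwise.
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "Id=" + id;
    }

    public static void main(String[] args) {
        System.out.println(appendId("scripts/test.js", "42"));
        System.out.println(appendId("Servlet?Template=scripts/test.js", "42"));
    }
}
```

Applied to a full template, the URLs would still have to be located first, which is where a regex comes back in.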

Your actual question doesn't match up so well with your example code. The example code seems to handle a more general case, and it substitutes an actual session Id value instead of a reference to one. The code below takes the example code to be more indicative of what you really want, but the same approach could be adapted to what you asked in the question text (using a simpler regex, even).
With that said, I don't see any way to do this with a single replaceAll() because the replacement text for the two cases is too different. You could nevertheless do it with one regex, in one pass, if you used a different approach:
Pattern p2 = Pattern.compile("(src\\s*=\\s*)(['\"])([^?]*?)(\\?.*?)?\\2",
Pattern.CASE_INSENSITIVE);
Matcher m2 = p2.matcher(strTemplateText);
StringBuffer revisedText = new StringBuffer();
while (m2.find()) {
// Append the whole match except the closing quote
m2.appendReplacement(revisedText, "$1$2$3$4");
// group 4 is the optional query string; null if none was matched
revisedText.append((m2.group(4) == null) ? '?' : '&');
revisedText.append("Id=");
revisedText.append(CurrentESSession.getAttributeString("Id", ""));
// append a copy of the opening quote
revisedText.append(m2.group(2));
}
m2.appendTail(revisedText);
strTemplateText = revisedText.toString();
That relies on BetaRide's observation that query parameter order is not significant, although the same general approach could accommodate a requirement to make Id the first query parameter, as in the question. It also matches the end of the src attribute to the correct closing delimiter, which your pattern does not do (though it needs to, to avoid matching text that spans more than one src attribute).
Do note that nothing in the above prevents a duplicate query parameter 'Id' being added; this is consistent with the regex presented in the question. If you want to avoid that with the above approach then in the loop you need to parse the query string (when there is one) to determine whether an 'Id' parameter is already present.
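For the variant that keeps Id as the first query parameter, a sketch along the same lines might look like the following. The class name SrcIdRewriter, the method addId, and the id parameter (standing in for the session lookup) are assumptions for illustration, and the regex captures the query string without its leading '?'. Like the code above, it does not check for an existing Id parameter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcIdRewriter {
    // Groups: 1 = "src=", 2 = opening quote, 3 = path, 4 = query string without '?' (null if absent)
    private static final Pattern SRC = Pattern.compile(
            "(src\\s*=\\s*)(['\"])([^?]*?)(?:\\?(.*?))?\\2",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: 'id' stands in for CurrentESSession.getAttributeString("Id", "")
    static String addId(String template, String id) {
        Matcher m = SRC.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String query = m.group(4);
            // Put any existing parameters after the new Id parameter
            String rest = (query == null || query.isEmpty()) ? "" : "&" + query;
            String replacement = m.group(1) + m.group(2) + m.group(3)
                    + "?Id=" + id + rest + m.group(2);
            // quoteReplacement stops '$' and backslashes from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<script src=\"scripts/test.js\"></script>"
                + "<script src='Servlet?Template=scripts/test.js'></script>";
        System.out.println(addId(template, "42"));
    }
}
```

Building the replacement as a plain string and passing it through Matcher.quoteReplacement avoids surprises if the session value ever contains a '$'.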

You can do the following:
//strTemplateText is the Template's text
String strTemplateText = "src=\"scripts/test.js\"";
strTemplateText = "src=\"Servlet?Template=scripts/test.js\"";
java.util.regex.Pattern p2 = java.util.regex.Pattern.compile("(src\\s*=\\s*[\"'])(.*?)((?:[\\w\\s\\d.\\-\\#]+\\/?)+)(?:[?]?)(.*?\\=.*)*(['\"])");
java.util.regex.Matcher m2 = p2.matcher(strTemplateText);
System.out.println(m2.matches());
strTemplateText = m2.replaceAll("$1$2$3?Id=" + CurrentESSession.getAttributeString("Id", "") + (m2.group(4)==null? "":"&") + "$4$5");
System.out.println(strTemplateText);
It works on both cases.
If you are using Java 7 or later, you can use named capturing groups to make the regex more readable and easier to debug.
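As a rough illustration of named groups (Java 7+), here is a sketch that uses a simpler pattern than the one above. The group names attr, quote, path and query, and the class NamedGroupDemo, are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // \k<quote> is a back-reference to the named group "quote"
    static final Pattern SRC = Pattern.compile(
            "(?<attr>src\\s*=\\s*)(?<quote>['\"])(?<path>[^?]*?)(?:\\?(?<query>.*?))?\\k<quote>",
            Pattern.CASE_INSENSITIVE);

    // Returns a short description of the first src attribute found in the text
    static String describe(String text) {
        Matcher m = SRC.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        // group("query") is null when the URL has no query string
        return "path=" + m.group("path") + ", query=" + m.group("query");
    }

    public static void main(String[] args) {
        System.out.println(describe("src=\"scripts/test.js\""));
        System.out.println(describe("src=\"Servlet?Template=scripts/test.js\""));
    }
}
```

In a replacement string the same groups can be referenced as ${path}, ${query} and so on.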

Related

RegEx for 4 values within a link (This or this or this etc)

Having a bit of trouble with this.
Say I have a link that could contain these values:
Balearic|Ibiza|Majorca|Menorca
Example1: http://site.com/Menorca
Example2: http://site.com/Ibiza
I just need a RegEx to say if the link contains any of those 4 (case insensitive as well)
Can someone point me in the right direction? It is not in a particular language, but the software I work in is Java based.
Thanks a lot - and I'll keep trying in the meantime! :)

You can just use:
// assuming url is your URL variable
if (url.matches("(?i)^http://site\\.com/(Balearic|Ibiza|Majorca|Menorca)\\b.*$")) {
// match succeeded
}
(?i) will make sure case is ignored while doing this comparison.

Your list of values is already a valid regex; just add the "i" option to make it case insensitive, per http://rubular.com/r/euscuu7Fwj

Here you can find a regexp that matches what you ask:
^http://site\.com/(Balearic|Ibiza|Majorca|Menorca)$
To use it in Java, you may want to do the following:
String url = "http://site.com/Ibiza";
Pattern p = Pattern.compile("^http://site\\.com/(Balearic|Ibiza|Majorca|Menorca)$", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(url); // get a matcher object
boolean matched = m.matches(); // true if the whole URL matches the pattern

Running a regex against a whole URL is usually not good practice.
Here's a mixed approach. It will look for your values only in the path of your URLs, thus both simplifying the regular expression and validating the URL. Case-insensitive.
try {
URI menorca = new URI("http://site.com/Menorca");
System.out.println(menorca.getPath().substring(1));
URI ibiza = new URI("http://site.com/Ibiza");
System.out.println(ibiza.getPath().substring(1));
Pattern pattern = Pattern.compile("Balearic|Ibiza|Majorca|Menorca", Pattern.CASE_INSENSITIVE);
System.out.println(pattern.matcher(menorca.getPath()).find());
System.out.println(pattern.matcher(ibiza.getPath()).find());
}
catch (URISyntaxException use) {
use.printStackTrace();
}
Output:
Menorca
Ibiza
true
true

BASIC Lexer with regex written in Java

I have to code a Lexer in Java for a dialect of BASIC.
I group all the token types in an enum:
public enum TokenType {
INT("-?[0-9]+"),
BOOLEAN("(TRUE|FALSE)"),
PLUS("\\+"),
MINUS("\\-"),
//others.....
}
Each constant's name is the token type's name, and the string in parentheses is the regex I use to match that type.
If I want to match the INT type I use "-?[0-9]+".
But now I have a problem. I put the regexes of all the token types into a StringBuffer like this:
private String pattern() {
StringBuffer tokenPatternsBuffer = new StringBuffer();
for(TokenType token : TokenType.values())
tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now I use this string to create a Pattern:
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher:
Matcher match = pattern.matcher("line of code");
Now I want to match all the token types and collect them into an ArrayList of Token. If the code syntax is correct, it returns an ArrayList of Token (token name, value).
But I don't know how to exit the while loop if the syntax is incorrect and then print an error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
ArrayList<Token> tokens = new ArrayList<Token>();
int tokenSize = TokenType.values().length;
int counter = 0;
//Iterate over the arrayLinee (ArrayList of String) to get matches of pattern
for(String linea : arrayLinee) {
counter = 0;
Matcher match = pattern.matcher(linea);
while(match.find()) {
System.out.println(match.group(1));
counter = 0;
for(TokenType token : TokenType.values()) {
counter++;
if(match.group(token.name()) != null) {
tokens.add(new Token(token , match.group(token.name())));
counter = 0;
continue;
}
}
if(counter==tokenSize) {
System.out.println("Syntax Error in line : " + linea);
break;
}
}
tokenList.add("EOL");
}
}
The code doesn't break out when the for loop iterates over every TokenType without matching any of their regexes. How can I report an error if the syntax isn't correct?
Or do you know where I can find information on developing a lexer?

All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the alternatives in the combined regex are tried in order, INVALID will only match if no other token matched at that position. You then check whether the last token in your list is the INVALID token.
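A minimal sketch of the catch-all idea, assuming a cut-down token set: the WHITESPACE type, the MiniLexer class and the "TYPE:text" string tokens are additions for this example and are not from the question. Instead of inspecting the last token afterwards, this version throws as soon as INVALID matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer {
    enum TokenType {
        INT("-?[0-9]+"),
        BOOLEAN("TRUE|FALSE"),
        PLUS("\\+"),
        MINUS("-"),
        WHITESPACE("\\s+"),   // assumption: whitespace separates tokens and is skipped
        INVALID(".+");        // catch-all, must stay last

        final String pattern;
        TokenType(String pattern) { this.pattern = pattern; }
    }

    static final Pattern TOKENS = buildPattern();

    // Joins every token regex into one alternation of named groups
    static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (TokenType t : TokenType.values()) {
            if (sb.length() > 0) sb.append('|');
            sb.append("(?<").append(t.name()).append('>').append(t.pattern).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns tokens as "TYPE:text"; throws when only the catch-all matched
    static List<String> lex(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKENS.matcher(line);
        while (m.find()) {
            for (TokenType t : TokenType.values()) {
                if (m.group(t.name()) == null) continue;
                if (t == TokenType.INVALID) {
                    throw new IllegalArgumentException("Syntax error at: " + m.group());
                }
                if (t != TokenType.WHITESPACE) tokens.add(t + ":" + m.group());
                break; // exactly one alternative matched; stop scanning token types
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("1 + TRUE - 2"));
    }
}
```

Because ".+" swallows the rest of the line, the error message also shows where the unrecognized input starts.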

If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.

If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which lets you write grammars in pure Java.

Not sure if this was answered, or you came to an answer, but a lexer is broken into two distinct phases, the scanning phase and the parsing phase. You can combine them into one single pass (regex matching) but you'll find that a single pass lexer has weaknesses if you need to do anything more than the most basic of string translations.
In the scanning phase you're breaking the character sequence apart based on specific tokens that you've specified. You should have included an example of the text you were trying to parse. But Wiki has a great example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you're going to tokenize the block of text into chunks by spaces (this should almost always be the first action) and then you're going to tokenize even further based on other tokens, such as what you're attempting to match.
Then the parsing/evaluation phase will iterate over each token and decide what to do with each token depending on the business logic, syntax rules etc., whatever you set it. This could be expressing some sort of math function to perform (eg. max(3,2)), or a more common example is for query language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDE's for code hinting and auto-completion as well.
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.

Canonical equivalence in Pattern

I am referring to the test harness listed here http://docs.oracle.com/javase/tutorial/essential/regex/test_harness.html
The only change I made to the class is that the pattern is created as below:
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex(Pattern.CANON_EQ set): "),Pattern.CANON_EQ);
As the tutorial at http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html suggests, I entered a\u030A as the regex and \u00E5 as the string to match, but it ends with No Match Found. As far as I can tell, both strings are a lowercase 'a' with a ring on top.
Have I not understood the use case correctly?

The behavior you're seeing has nothing to do with the Pattern.CANON_EQ flag.
Input read from the console is not the same as a Java string literal. When the user (presumably you, testing out this flag) types \u00E5 into the console, the resultant string read by console.readLine is equivalent to "\\u00E5", not "å". See for yourself: http://ideone.com/lF7D1
As for Pattern.CANON_EQ, it behaves exactly as described:
Pattern withCE = Pattern.compile("^a\u030A$",Pattern.CANON_EQ);
Pattern withoutCE = Pattern.compile("^a\u030A$");
String input = "\u00E5";
System.out.println("Matches with canon eq: "
+ withCE.matcher(input).matches()); // true
System.out.println("Matches without canon eq: "
+ withoutCE.matcher(input).matches()); // false
http://ideone.com/nEV1V
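If the test harness should accept typed escapes, the input has to be converted before it is compiled. The sketch below is one way to do that, not part of the tutorial: ConsoleEscapeDemo and unescape are hypothetical names, and only four-digit backslash-u escapes are handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsoleEscapeDemo {
    // Matches a literal backslash, 'u', and four hex digits as typed at the console
    private static final Pattern TYPED_ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Hypothetical helper: converts each typed escape into the character it names
    static String unescape(String typed) {
        Matcher m = TYPED_ESCAPE.matcher(typed);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String typed = "a\\u030A";       // what readLine returns: 7 characters
        String regex = unescape(typed);  // 2 characters: 'a' plus the combining ring
        System.out.println(typed.length() + " -> " + regex.length());

        // Compile the converted regex with CANON_EQ and test it against the converted input
        Pattern p = Pattern.compile(regex, Pattern.CANON_EQ);
        System.out.println(p.matcher(unescape("\\u00E5")).matches());
    }
}
```

In the harness, both console.readLine results would be passed through unescape before being handed to Pattern.compile and matcher.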

How to modify this regular expression to be case insensitive while searching for curse words?

At the moment, this profanity filter finds darn and golly but not Darn or Golly or DARN or GOLLY.
List<String> bannedWords = Arrays.asList("darn", "golly", "gosh");
StringBuilder re = new StringBuilder();
for (String bannedWord : bannedWords)
{
if (re.length() > 0)
re.append("|");
String quotedWord = Pattern.quote(bannedWord);
re.append(quotedWord);
}
inputString = inputString.replaceAll(re.toString(), "[No cursing please!]");
How can it be modified to be case insensitive?

Start the expression with (?i).
I.e., change re.toString() to "(?i)" + re.toString().
From the documentation of Pattern
(?idmsux-idmsux) Nothing, but turns match flags i d m s u x on - off
where i is the CASE_INSENSITIVE flag.

You need to set the CASE_INSENSITIVE flag, or simply add (?i) to the beginning of your regex.
StringBuilder re = new StringBuilder("(?i)");
You'll also need to change your conditional to
if (re.length() > 4)
Setting the flag as in @ratchetFreak's answer is probably best, however. It allows your condition to stay the same (which is more intuitive) and gives you a clear idea of what's going on in the code.
For more info, see this question and in particular this answer, which gives a decent explanation of using regexes in Java.

Use a precompiled java.util.regex.Pattern:
Pattern p = Pattern.compile(re.toString(),Pattern.CASE_INSENSITIVE);//do this only once
inputString = p.matcher(inputString).replaceAll("[No cursing please!]");
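Putting the pieces together, a self-contained sketch could look like this. The class name ProfanityFilter and the method names are assumptions; the word list and replacement text come from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ProfanityFilter {
    private static final List<String> BANNED = Arrays.asList("darn", "golly", "gosh");

    // Compiled once and reused for every input string
    private static final Pattern BANNED_PATTERN = buildPattern(BANNED);

    // Joins the quoted words with '|' and compiles the result case-insensitively
    static Pattern buildPattern(List<String> words) {
        StringBuilder re = new StringBuilder();
        for (String word : words) {
            if (re.length() > 0) {
                re.append('|');
            }
            re.append(Pattern.quote(word));
        }
        return Pattern.compile(re.toString(), Pattern.CASE_INSENSITIVE);
    }

    static String censor(String input) {
        return BANNED_PATTERN.matcher(input).replaceAll("[No cursing please!]");
    }

    public static void main(String[] args) {
        System.out.println(censor("Darn it, GOLLY!"));
    }
}
```

By default CASE_INSENSITIVE only folds US-ASCII letters; combining it with Pattern.UNICODE_CASE extends it to other alphabets.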

Getting value of $1 from matcher.replaceAll()

In my application I need to take a link and shorten it if it is longer than 10 characters (for example).
The problem is, if I send the whole text, for example: "this is my website www.stackoverflow.com", directly to this matcher
Pattern patt = Pattern.compile("(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\'\".,<>?«»“”‘’]))");
Matcher matcher = patt.matcher(text);
matcher.replaceAll("$1");
it shows the whole website, without breaking it.
What I am trying to do is get the value of $1, so I can break the second one while keeping the first one intact.
I already have another method that breaks the string up.
UPDATE
What I want to get is only the website, so that I can break it afterwards. It would help me a lot.

You can't use replaceAll; you should iterate through the matches and process each one individually. Java's Matcher already has an API for this:
// expanding on the example in the 'appendReplacement' JavaDoc:
Pattern p = Pattern.compile("..."); // your URL regexp
Matcher m = p.matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String truncatedURL = m.group(1).replaceFirst("^(.{10}).*","$1..."); // keep the first 10 characters and append "..."; shorter URLs are left unchanged
m.appendReplacement(sb,
"<a href=\"http://$1\" target=\"_blank\">"); // simple replacement for $1
sb.append(truncatedURL);
sb.append("</a>");
}
m.appendTail(sb);
System.out.println(sb.toString());
(For performance, you should factor out compiled Patterns for the replace* calls inside the loop.)
Edit: use sb.append() so not to worry about escaping $ and \ in 'truncatedURL'.
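Here is a runnable sketch of the same loop with the pattern compiled once. The URL regex is a deliberately simplified stand-in (it only recognizes www. hosts), and LinkShortener, linkify and MAX are names invented for the example. Truncation uses substring rather than a second regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkShortener {
    // Simplified stand-in for the URL regex in the question; an assumption for illustration only
    private static final Pattern URL =
            Pattern.compile("\\b(www\\.[^\\s<>]+)", Pattern.CASE_INSENSITIVE);
    private static final int MAX = 10;

    static String linkify(String text) {
        Matcher m = URL.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            // Full URL goes into href; only the visible text is shortened
            String shown = url.length() > MAX ? url.substring(0, MAX) + "..." : url;
            String html = "<a href=\"http://" + url + "\" target=\"_blank\">" + shown + "</a>";
            // quoteReplacement keeps '$' and backslashes in the URL literal
            m.appendReplacement(sb, Matcher.quoteReplacement(html));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("this is my website www.stackoverflow.com"));
    }
}
```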

I think you have a problem similar to the one in this question:
Java : replacing text URL with clickable HTML link
They suggested something like this:
// Java does not support the [:space:] bracket syntax; use \p{Space} and \p{Alnum} instead
String basicUrlRegex = "(.*://[^<>\\p{Space}]+[\\p{Alnum}/])";
myString = myString.replaceAll(basicUrlRegex, "$1");
