How to write "any string of characters" in a Java regex? - java

I have a text and I'd like to write a regular expression to extract the string after the second #. For example:
# some text with letter, digit 123 1234 and symbols {[ #text_to_extract.
How would I write a regular expression to extract only the string after the second #? This code seems like a step in the right direction:
Pattern p = Pattern.compile("##(.+?)");
Matcher m = p.matcher("asdasdas##textToExtract");
This works when text between # is empty, but how do I specify any text in a regex?
Pattern.compile("#(*)#(.+?)"); ?
Edited:
One more condition: there may be text between the first # and the second #, but there doesn't have to be.

Don't capture the first group.
Change the plain * to .*.
Make the second wildcard greedy, since otherwise it will capture only a single character.
Pattern.compile("#.*#(.+)");

The "non-greedy" operator should be removed: (.+?) should be (.+). Otherwise you match only the minimum of the text after the second #, which is a single character. You definitely need a "." in front of the *, because * means "zero or more of the preceding element". Actually, maybe you want [^#]* instead, so that it matches anything but the hash symbol. Unlike ., [^#] also matches newlines, and it cannot run past a later #. Anyway, here's working code.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Ideone {
    public static void main(String[] args) {
        // Original attempt; the bare * has nothing to repeat, so it is not a valid regex:
        // Pattern p = Pattern.compile("#(*)#(.+?)");
        Pattern p = Pattern.compile("#.*#(.+)");
        Matcher m = p.matcher("asdasdas##textToExtract");
        while (m.find()) {
            // Group 1 holds everything after the last # that the greedy .* can reach
            System.out.println(m.group(1));
        }
    }
}
Play with the code here: http://ideone.com/rxB5Zy

You should do it this way
Matcher m =Pattern.compile("^[^#]*#[^#]*#([^#]*)").matcher(input);
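A minimal sketch of how the pattern above could be used. The class name SecondHashDemo and the helper afterSecondHash are made up for illustration. Because each [^#]* cannot cross a #, the two literal # characters in the pattern are exactly the first and second ones in the input, and group 1 stops at the next # or at the end of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondHashDemo {
    // [^#]* cannot run past a '#', so the two literal '#' are the first and second ones.
    private static final Pattern AFTER_SECOND_HASH =
            Pattern.compile("^[^#]*#[^#]*#([^#]*)");

    // Hypothetical helper: returns the text after the second '#',
    // or null when the input contains fewer than two '#'.
    static String afterSecondHash(String input) {
        Matcher m = AFTER_SECOND_HASH.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Nothing between the two '#'
        System.out.println(afterSecondHash("asdasdas##textToExtract"));
        // Arbitrary text between the two '#'
        System.out.println(afterSecondHash("# some text 123 {[ #text_to_extract."));
        // Only one '#': no match, so null
        System.out.println(afterSecondHash("only#one"));
    }
}
```

Returning null for "no match" is just one possible convention; an Optional<String> would work equally well.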

Related

Regular expression to handle two different file extensions

I am trying to create a regular expression that takes a file of name
"abcd_04-04-2020.txt" or "abcd_04-04-2020.txt.gz"
How can I handle the "OR" condition for the extension. This is what I have so far
if(fileName.matches("([\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}.[a-zA-Z]{3})")){
Pattern.compile("[._]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.");
}
This handles only the .txt. How can I handle ".txt.gz"
Thanks
Why not just use endsWith instead of a complex regex?
if(fileName.endsWith(".txt") || fileName.endsWith(".txt.gz")){
Pattern.compile("[._]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.");
}
You can use the below regex to achieve your purpose:
^[\w-]+\d{2}-\d{2}-\d{4}\.txt(?:\.gz)?$
Explanation of the above regex:
^,$ - Match the start and end of the test string, respectively.
[\w-]+ - Matches a word character or a hyphen, one or more times.
\d{n} - Matches exactly n digits, where n is the number given in the curly braces.
(?:\.gz)? - A non-capturing group matching .gz zero or one time because of the ? quantifier. You could have used | alternation (the OR you were expecting), but this is more legible and more efficient too.
IMPLEMENTATION IN JAVA:
import java.util.regex.*;
public class Main
{
private static final Pattern pattern = Pattern.compile("^[\\w-]+\\d{2}-\\d{2}-\\d{4}\\.txt(?:\\.gz)?$", Pattern.MULTILINE);
public static void main(String[] args) {
String testString = "abcd_04-04-2020.txt\nabcd_04-04-2020.txt.gz\nsomethibsnfkns_05-06-2020.txt\n.txt.gz";
Matcher matcher = pattern.matcher(testString);
while(matcher.find()){
System.out.println(matcher.group(0));
}
}
}
You can replace .[a-zA-Z]{3} with \.txt(\.gz)?
if(fileName.matches("([\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4})\\.txt(\\.gz)?")){
    Pattern.compile("[._]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.");
}
The ? quantifier will do the job of the | you were looking for. Try adding
(\.[a-zA-Z]{2})?
to your original regex (with the dots escaped so that they match only a literal dot):
([\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\.[a-zA-Z]{3}(\.[a-zA-Z]{2})?)
A possible way of doing it:
Pattern pattern = Pattern.compile("^[\\w._-]+_\\d{2}-\\d{2}-\\d{4}(\\.txt(\\.gz)?)$");
Then you can run the following test:
String[] fileNames = {
"abcd_04-04-2020.txt",
"abcd_04-04-2020.tar",
"abcd_04-04-2020.txt.gz",
"abcd_04-04-2020.png",
".txt",
".txt.gz",
"04-04-2020.txt"
};
Arrays.stream(fileNames)
.filter(fileName -> pattern.matcher(fileName).find())
.forEach(System.out::println);
// output
// abcd_04-04-2020.txt
// abcd_04-04-2020.txt.gz
I think what you want (following from the direction you were going) is this:
[\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.[a-zA-Z]{3}(?:$|\\.[a-zA-Z]{2}$)
At the end, I have an alternation. It has to either match the end of the string ($) OR match a literal dot followed by 2 letters and then the end of the string (\\.[a-zA-Z]{2}$). Remember to escape the ., because in regex an unescaped . means "match any character".
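To illustrate why the escaped dot matters, here is a small sketch comparing two variants of the file-name pattern. The class name, the constants LOOSE and STRICT, and the file name abcd_04-04-2020xtxt are invented for this example; the only difference between the two patterns is whether the dot before txt is escaped.

```java
class DotEscapeDemo {
    // Unescaped '.' before the extension: matches ANY single character there.
    static final String LOOSE = "[\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}.txt(\\.gz)?";
    // Escaped dot: matches only a literal '.'.
    static final String STRICT = "[\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.txt(\\.gz)?";

    public static void main(String[] args) {
        String[] names = {
            "abcd_04-04-2020.txt",
            "abcd_04-04-2020.txt.gz",
            "abcd_04-04-2020xtxt" // no dot at all before "txt"
        };
        for (String name : names) {
            // String.matches requires the whole name to match, so no ^ or $ is needed.
            System.out.println(name
                    + " loose=" + name.matches(LOOSE)
                    + " strict=" + name.matches(STRICT));
        }
    }
}
```

The first two names pass both patterns; the third passes only the loose one, which is the kind of false positive the escaping avoids.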

Regex matching word that is in the middle of any character except a letter

I'd like to know how to detect a word that is surrounded by any characters except letters of the alphabet. I need this because I'm working on a custom import organizer for Java. This is what I have already tried:
The regex expression:
[^(a-zA-Z)]InitializationEvent[^(a-zA-Z)]
I'm searching for the word "InitializationEvent".
The code snippet I've been testing on:
public void load(InitializationEvent event) {
It looks like adding space before the word helps... is the parenthesis inside of alphabet range?
I tested this in my program and it didn't work. Also I checked it on regexr.com, showing same results - class name not recognized.
Am I doing something wrong? I'm new to regex, so it might be a really basic mistake, or not. Let me know!
Lose the parentheses:
[^a-zA-Z]InitializationEvent[^a-zA-Z]
Inside [], parentheses are taken literally, so by negating the character class (^) you also exclude ( and ). That prevents a match here, because a ( precedes InitializationEvent in your string.
Note, however, that the above regex will only match if InitializationEvent is neither at the beginning nor at the end of the tested string. To allow that, you can use:
(^|[^a-zA-Z])InitializationEvent([^a-zA-Z]|$)
Or, without creating any capturing groups (which is supposed to be cleaner, and perform better):
(?:^|[^a-zA-Z])InitializationEvent(?:[^a-zA-Z]|$)
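A short sketch of the difference between the two forms above, using sample strings chosen for illustration (the class and constant names are made up). The plain character-class version needs an actual character on both sides, while the version with ^ and $ also accepts the word at the very start or end of the input.

```java
import java.util.regex.Pattern;

class EdgeMatchDemo {
    // Requires one non-letter character on each side of the word.
    static final Pattern SURROUNDED =
            Pattern.compile("[^a-zA-Z]InitializationEvent[^a-zA-Z]");
    // Also accepts the start or end of the input in place of that character.
    static final Pattern SURROUNDED_OR_EDGE =
            Pattern.compile("(?:^|[^a-zA-Z])InitializationEvent(?:[^a-zA-Z]|$)");

    public static void main(String[] args) {
        String[] samples = {
            "public void load(InitializationEvent event) {", // found by both
            "InitializationEvent",                            // found only by SURROUNDED_OR_EDGE
            "PreInitializationEvent"                          // found by neither: a letter precedes it
        };
        for (String s : samples) {
            System.out.println(s + " -> " + SURROUNDED.matcher(s).find()
                    + " / " + SURROUNDED_OR_EDGE.matcher(s).find());
        }
    }
}
```

Note that both patterns consume the neighbouring characters, so the matched text includes them; that is one reason a lookaround-based pattern can be more convenient.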
"how to detect word that is between any characters except a letter from alphabet"
This is a case where lookarounds come in handy. You can use:
(?<![a-zA-Z])InitializationEvent(?![a-zA-Z])
(?<![a-zA-Z]) is a negative lookbehind asserting that there is no letter at the previous position
(?![a-zA-Z]) is a negative lookahead asserting that there is no letter at the next position
The parentheses are causing the problem, just skip them:
"[^a-zA-Z]InitializationEvent[^a-zA-Z]"
or use the predefined non-word character class \W, which is slightly different because it also treats digits and the underscore as word characters:
"\\WInitializationEvent\\W"
But since you seem to want to match a class name, this might be OK, because letters, digits, and the underscore are the characters a class name normally consists of.
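The difference between the two character classes can be seen in a small sketch like the following. The sample strings, including the made-up identifiers InitializationEvent2 and my_InitializationEvent, are assumptions chosen to hit the digit and underscore cases.

```java
import java.util.regex.Pattern;

class WordClassDemo {
    // Neighbours must be anything except an ASCII letter.
    static final Pattern NON_LETTER =
            Pattern.compile("[^a-zA-Z]InitializationEvent[^a-zA-Z]");
    // Neighbours must be non-word characters: not a letter, digit, or underscore.
    static final Pattern NON_WORD =
            Pattern.compile("\\WInitializationEvent\\W");

    public static void main(String[] args) {
        String[] samples = {
            "load(InitializationEvent event)",  // both patterns find it
            "load(InitializationEvent2 event)", // only NON_LETTER: '2' is a word character
            "load(my_InitializationEvent e)"    // only NON_LETTER: '_' is a word character
        };
        for (String s : samples) {
            System.out.println(s + " -> nonLetter=" + NON_LETTER.matcher(s).find()
                    + " nonWord=" + NON_WORD.matcher(s).find());
        }
    }
}
```

For an import organizer, the \W behaviour is arguably the safer one, since InitializationEvent2 is a different class name.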
I'm not sure about your application but from a regexp perspective you can use negative lookaheads and negative lookbehinds to define what cannot surround the String to specify a match.
I have added the negative lookahead (?![a-zA-Z]) and the negative lookbehind (?<![a-zA-Z]) in place of your [^(a-zA-Z)] originally supplied to create: (?<![a-zA-Z])InitializationEvent(?![a-zA-Z])
Quick Fiddle I created:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld{
public static void main(String []args){
String pattern = "(?<![a-zA-Z])InitializationEvent(?![a-zA-Z])";
String sourceString = "public void load(InitializationEvent event) {";
String sourceString2 = "public void load(BInitializationEventA event) {";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(sourceString);
if (m.find( )) {
System.out.println("Found value of pattern in sourceString: " + m.group(0) );
} else {
System.out.println("NO MATCH in sourceString");
}
Matcher m2 = r.matcher(sourceString2);
if (m2.find( )) {
System.out.println("Found value of pattern in sourceString2: " + m2.group(0) );
} else {
System.out.println("NO MATCH in sourceString2");
}
}
}
output:
sh-4.3$ java -Xmx128M -Xms16M HelloWorld
Found value of pattern in sourceString: InitializationEvent
NO MATCH in sourceString2
You seem really close:
[^(a-zA-Z)]*(InitializationEvent)[^(a-zA-Z)]*
I think this is what you are looking for. The asterisk provides a match for zero or many of the character or group before it.
EDIT/UPDATE
My apologies on the initial response.
[^a-zA-Z]+(InitializationEvent)[^a-zA-Z]+
My regex is a little rusty, but this will match on any non-alphabet character one or many times prior to the InitializationEvent and after.

Replace all spaces except the ones within HTML tags

I need to replace all spaces in a string with the HTML entity &nbsp;. Currently the following does the replacement, but it also replaces the spaces within HTML tags like <a href="http://google.com" />.
string.replaceAll(" ", "&nbsp;")
But I need it to leave the tags unchanged.
Example:
String s1 = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>";
After the replacement, it should look like this:
String s1 = "Hello!,&nbsp;Check&nbsp;out&nbsp;this&nbsp;<^a href=\"http://www.entrepreneur.com/article/234538\">10&nbsp;Movies&nbsp;Every&nbsp;Entrepreneur&nbsp;Needs&nbsp;to&nbsp;Watch&nbsp;<^/a>";
Can anybody suggest a more intelligent regex to accomplish the task?
I know you have already accepted an answer, but your problem has another simple solution that wasn't mentioned. This situation sounds very similar to the question about how to "regex-match a pattern, excluding..."
With all the disclaimers about using regex to parse html, here is a simple way to do it.
We can solve it with a beautifully-simple regex:
<[^<>]*>|( )
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
This full Java program shows how to use the regex:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Program {
    public static void main(String[] args) {
        String subject = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>";
        Pattern regex = Pattern.compile("<[^<>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // Group 1 is set only for a space outside a tag: replace it
                m.appendReplacement(b, "&nbsp;");
            } else {
                // A whole tag matched: copy it back unchanged
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        System.out.println(replaced);
    } // end main
} // end Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
How to match a pattern unless...
If we can assume that the only use of > and < in the string is for the tags, then this regex will work:
" (?![^<]*>)" (note the leading space; the quotes are only there to make it visible)
It works for your example.
How it works:
The leading " " matches the space character. This is exactly like what you did.
(?! starts a negative lookahead. This means that this regex will match only if it is not followed by something that matches the regex in the lookahead.
[^<]* matches any character that is not <, multiple times
> matches >
) closes the lookahead.
In other words, this regex matches any space, but with the requirement there must be a < before every > after the space.
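As a rough sketch, the lookahead regex can be passed straight to String.replaceAll. The class name, the helper escapeSpaces, and the sample markup are made up for illustration, and the sketch relies on the same assumption as above: < and > appear only as tag delimiters.

```java
class NbspOutsideTagsDemo {
    // A space that is NOT followed by "...>" without an intervening '<',
    // i.e. a space that is not inside a tag.
    private static final String SPACE_OUTSIDE_TAG = " (?![^<]*>)";

    // Hypothetical helper: replaces every space outside of tags with &nbsp;
    static String escapeSpaces(String html) {
        return html.replaceAll(SPACE_OUTSIDE_TAG, "&nbsp;");
    }

    public static void main(String[] args) {
        String input = "a b <i class=\"x y\">c d</i> e";
        // The spaces inside <i class="x y"> are left alone; all others are replaced.
        System.out.println(escapeSpaces(input));
    }
}
```

If the text can contain a bare > outside a tag, this approach will skip spaces it should have replaced, so it is only suitable for well-behaved input.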

Replace a comma that is not in parentheses using regex

I have this string:
john(man,24,engineer),smith(man,23),lucy(female)
How do I replace each comma that is not inside parentheses with #?
The result should be:
john(man,24,engineer)#smith(man,23)#lucy(female)
My code:
String str = "john(man,24,engineer),smith(man,23),lucy(female)";
Pattern p = Pattern.compile(".*?(?:\\(.*?\\)).+?");
Matcher m = p.matcher(str);
System.out.println(m.matches()+" "+m.find());
Why is m.matches() true and m.find() false? How can I achieve this?
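On the matches()/find() part of the question, here is a small sketch of what appears to be going on. A Matcher keeps its position between calls: after a successful matches(), which consumes the whole input, the next find() resumes at the end of that match and has nothing left to search. Calling reset() rewinds it. The class name and the probe helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesThenFindDemo {
    // Hypothetical helper: returns { matches(), find() right after, find() after reset() }.
    static boolean[] probe(String input) {
        Matcher m = Pattern.compile(".*?(?:\\(.*?\\)).+?").matcher(input);
        boolean whole = m.matches();    // the pattern is stretched over the entire input
        boolean next = m.find();        // resumes where the previous match ended
        m.reset();                      // rewind to the start of the input
        boolean afterReset = m.find();  // searches from position 0 again
        return new boolean[] { whole, next, afterReset };
    }

    public static void main(String[] args) {
        boolean[] r = probe("john(man,24,engineer),smith(man,23),lucy(female)");
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

So the two calls in the original println are not independent tests of the pattern; the second one starts from the state the first one left behind.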
Use a negative lookahead to achieve this:
,(?![^()]*\))
Explanation:
, # Match a literal ','
(?! # Start of negative lookahead
[^()]* # Match any character except '(' & ')', zero or more times
\) # Followed by a literal ')'
) # End of lookahead
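A minimal sketch of applying the lookahead with String.replaceAll, assuming the parentheses in the input are balanced and not nested. The class name and the helper replaceTopLevelCommas are made up for illustration.

```java
class CommaOutsideParensDemo {
    // A comma that is NOT followed by "...)" without an intervening parenthesis,
    // i.e. a comma that is not inside a (...) group.
    private static final String COMMA_OUTSIDE_PARENS = ",(?![^()]*\\))";

    // Hypothetical helper: replaces every comma outside parentheses with '#'.
    static String replaceTopLevelCommas(String input) {
        return input.replaceAll(COMMA_OUTSIDE_PARENS, "#");
    }

    public static void main(String[] args) {
        System.out.println(
                replaceTopLevelCommas("john(man,24,engineer),smith(man,23),lucy(female)"));
        // prints john(man,24,engineer)#smith(man,23)#lucy(female)
    }
}
```

A stray closing parenthesis in the input, as in x,smiley:)(man,23), makes the lookahead treat the preceding comma as "inside" a group, so that comma is left untouched.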
A simple regex for another approach, in case we encounter unbalanced parentheses as in smiley:) or escape\)
While the lookahead approach works (and I too am a fan), it breaks down with input such as ,smiley:)(man,23), so I'll give you an alternative simple regex just in case. For the record, it's hard to find a simple approach that works all of the time because of potential nesting.
This situation is very similar to this question about "regex-matching a pattern unless...".
We can solve it with a beautifully-simple regex:
\([^()]*\)|(,)
Of course we can avoid more unpleasantness by allowing the parentheses matched on the left to roll over escaped parentheses:
\((?:\\[()]|[^()])*\)|(,)
The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
This program shows how to use the regex:
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "john(man,24,engineer),smith(man,23),smiley:)(notaperson) ";
Pattern regex = Pattern.compile("\\([^()]*\\)|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "#");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced);
} // end main
} // end Program
For more information about the technique
How to match (or replace) a pattern except in situations s1, s2, s3...

regex for letters or numbers in brackets

I am using Java to process text using regular expressions. I am using the following regular expression
^[\([0-9a-zA-Z]+\)\s]+
to match one or more letters or numbers in parentheses one or more times. For instance, I like to match
(aaa) (bb) (11) (AA) (iv)
or
(111) (aaaa) (i) (V)
I tested this regular expression on http://java-regex-tester.appspot.com/ and it is working. But when I use it in my code, the code does not compile. Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("^[\([0-9a-zA-Z]+\)\s]+");
String[] words = pattern.split("(a) (1) (c) (xii) (A) (12) (ii)");
String w = pattern.
for(String s:words){
System.out.println(s);
}
}
}
I tried to use \\ instead of \, but the regex gave different results than I expected: it matches only one group like (aaa), not multiple groups like (aaa) (111) (ii).
Two questions:
How can I fix this regex and be able to match multiple groups?
How can I get the individual matches separately (like (aaa) alone and then (111) and so on). I tried pattern.split but did not work for me.
Firstly, you want to escape any backslashes inside the quotation marks with another backslash. The regex engine will then see each pair as a single backslash. (E.g. write a word character as \\w inside quotation marks, etc.)
Secondly, you got to finish the line that reads:
String w = pattern.
That line explains why it doesn't compile.
Here is my final solution to match the individual groups of letters/numbers in brackets that appear at the beginning of a line and ignore the rest:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
static ArrayList<String> listOfEnums;
public static void main(String[] args) {
listOfEnums = new ArrayList<String>();
Pattern pattern = Pattern.compile("^\\([0-9a-zA-Z]+\\)");
String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
Matcher matcher = pattern.matcher(p);
boolean isMatch = matcher.find();
int index = 0;
//once you find a match, remove it and store it in the arrayList.
while (isMatch) {
String s = matcher.group();
System.out.println(s);
//Store it in an array
listOfEnums.add(s);
//Remove it from the beginning of the string.
p = p.substring(listOfEnums.get(index).length(), p.length()).trim();
matcher = pattern.matcher(p);
isMatch = matcher.find();
index++;
}
}
}
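As an alternative to trimming the string after every match, the same result can probably be had with the \G anchor, which ties each match to the end of the previous one. This is a different technique from the loop above, shown only as a sketch; the class name and the helper leadingGroups are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingGroupsDemo {
    // \G matches at the start of the input for the first find(), and afterwards
    // only where the previous match ended, so the chain stops at the first gap.
    private static final Pattern LEADING_GROUP =
            Pattern.compile("\\G\\s*(\\([0-9a-zA-Z]+\\))");

    // Hypothetical helper: collects the bracketed groups at the beginning of the line.
    static List<String> leadingGroups(String line) {
        List<String> groups = new ArrayList<>();
        Matcher m = LEADING_GROUP.matcher(line);
        while (m.find()) {
            groups.add(m.group(1)); // the bracketed group without leading whitespace
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(
                leadingGroups("(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)"));
        // prints [(a), (1), (c), (xii), (A), (12), (ii)]
    }
}
```

The trailing (1) is ignored because the words before it break the chain of consecutive matches.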
1) Your regex is incorrect. You want to match individual groups of letters / numbers in brackets, and the current regex will match only a single string of one or more such groups. I.e. it will match
(abc) (def) (123)
as a single group rather than three separate groups.
A better regex that would match only up to the closing bracket would be
\([0-9a-zA-Z]+\)
2) In a Java string literal, every backslash must be escaped with another backslash.
3) The split() method will not do what you want. It treats every match as a delimiter, throws the matches away, and returns an array of what is left over. You want to use matcher() instead:
Pattern pattern = Pattern.compile("\\([0-9a-zA-Z]+\\)");
Matcher matcher = pattern.matcher("(a) (1) (c) (xii) (A) (12) (ii)");
while (matcher.find()) {
System.out.println(matcher.group());
}
