Java regex question

I have a text something like
ab1ab2ab3ab4cd
Can one create a Java regular expression to obtain all substrings that start with "ab" and end with "cd"? For example:
ab1ab2ab3ab4cd
ab2ab3ab4cd
ab3ab4cd
ab4cd
Thanks

The regex (?=(ab.*cd)) will group such matches in group 1 as you can see:
import java.util.regex.*;
public class Main {
public static void main(String[] args) throws Exception {
Matcher m = Pattern.compile("(?=(ab.*cd))").matcher("ab1ab2ab3ab4cd");
while (m.find()) {
System.out.println(m.group(1));
}
}
}
which produces:
ab1ab2ab3ab4cd
ab2ab3ab4cd
ab3ab4cd
ab4cd
You need the lookahead, (?= ... ), otherwise you'll just get one match. Note that this regex will fail to produce the desired results if there is more than one cd in your string, because the greedy .* always runs to the last cd. In that case, you'll have to resort to a manual string algorithm.
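For the case with several cd's, a manual approach could look roughly like the sketch below. It is only an illustration, not part of the original answer: the class name AbCdSubstrings and the method findAll are made up here, and it assumes plain indexOf scanning is acceptable for your input sizes.

```java
import java.util.ArrayList;
import java.util.List;

class AbCdSubstrings {
    // Collects every substring of text that starts with prefix and ends with suffix.
    static List<String> findAll(String text, String prefix, String suffix) {
        List<String> result = new ArrayList<>();
        // Visit every position where the prefix occurs.
        for (int start = text.indexOf(prefix); start >= 0; start = text.indexOf(prefix, start + 1)) {
            // For this start, visit every suffix occurrence that begins after the prefix ends.
            int from = start + prefix.length();
            for (int end = text.indexOf(suffix, from); end >= 0; end = text.indexOf(suffix, end + 1)) {
                result.add(text.substring(start, end + suffix.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One cd: same four results as the lookahead regex.
        System.out.println(findAll("ab1ab2ab3ab4cd", "ab", "cd"));
        // Two cd's: also finds ab1cd, which the greedy lookahead misses.
        System.out.println(findAll("ab1cdab2cd", "ab", "cd"));
    }
}
```

Each start position is paired with every later end position, so the work grows with the product of the two counts.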

Looks like you want either ab\w+?cd or \bab\w+?cd\b

/^ab[a-z0-9]+cd$/gm
If only a b c and digits 0-9 can appear in the middle as in the examples:
/^ab[a-c\d]+cd$/gm
See it in action: http://regexr.com?2tpdu

Related

Regular expression to handle two different file extensions

I am trying to create a regular expression that takes a file of name
"abcd_04-04-2020.txt" or "abcd_04-04-2020.txt.gz"
How can I handle the "OR" condition for the extension. This is what I have so far
if(fileName.matches("([\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}.[a-zA-Z]{3})")){
Pattern.compile("[._]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.");
}
This handles only the .txt. How can I handle ".txt.gz"
Thanks

Why not just use endsWith instead of a complex regex?
if(fileName.endsWith(".txt") || fileName.endsWith(".txt.gz")){
Pattern.compile("[._]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.");
}

You can use the below regex to achieve your purpose:
^[\w-]+\d{2}-\d{2}-\d{4}\.txt(?:\.gz)?$
Explanation of the above regex:
^, $ - Match the start and end of the test string, respectively.
[\w-]+ - Matches a word character or hyphen one or more times.
\d{n} - Matches exactly n digits, where n is the number in the curly braces.
(?:\.gz)? - A non-capturing group that matches .gz zero or one time because of the ? quantifier. You could have used | alternation (the OR you were expecting), but this is more legible and more efficient too.
IMPLEMENTATION IN JAVA:
import java.util.regex.*;
public class Main
{
private static final Pattern pattern = Pattern.compile("^[\\w-]+\\d{2}-\\d{2}-\\d{4}\\.txt(?:\\.gz)?$", Pattern.MULTILINE);
public static void main(String[] args) {
String testString = "abcd_04-04-2020.txt\nabcd_04-04-2020.txt.gz\nsomethibsnfkns_05-06-2020.txt\n.txt.gz";
Matcher matcher = pattern.matcher(testString);
while(matcher.find()){
System.out.println(matcher.group(0));
}
}
}
NOTE: This regex does not check that the date is valid; for example, it accepts 99-99-2020.

You can replace .[a-zA-Z]{3} with \.txt(\.gz)?
if(fileName.matches("([\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4})\\.txt(\\.gz)?")){
Pattern.compile("[._]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.");
}

A ? quantifier on an optional group gives you the OR you need. Try adding
(\.[a-zA-Z]{2})?
to your original regex:
([\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\.[a-zA-Z]{3}(\.[a-zA-Z]{2})?)

A possible way of doing it:
Pattern pattern = Pattern.compile("^[\\w._-]+_\\d{2}-\\d{2}-\\d{4}(\\.txt(\\.gz)?)$");
Then you can run the following test:
String[] fileNames = {
"abcd_04-04-2020.txt",
"abcd_04-04-2020.tar",
"abcd_04-04-2020.txt.gz",
"abcd_04-04-2020.png",
".txt",
".txt.gz",
"04-04-2020.txt"
};
Arrays.stream(fileNames)
.filter(fileName -> pattern.matcher(fileName).find())
.forEach(System.out::println);
// output
// abcd_04-04-2020.txt
// abcd_04-04-2020.txt.gz

I think what you want (following from the direction you were going) is this:
[\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.[a-zA-Z]{3}(?:$|\\.[a-zA-Z]{2}$)
At the end, I have an alternation. It has to match either the end of the string ($) or a literal dot followed by 2 letters and then the end of the string (\\.[a-zA-Z]{2}$). Remember to escape the ., because in regex . means "match any character".
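To compare the two ways of expressing the OR discussed in these answers, here is a small sketch. The class name FileNameCheck and the method isValid are invented for illustration, and the patterns are restricted to .txt and .txt.gz as in the question; treat it as one possible arrangement, not the definitive one.

```java
import java.util.regex.Pattern;

class FileNameCheck {
    // Optional-group form: ".txt" followed by an optional ".gz".
    static final Pattern OPTIONAL =
            Pattern.compile("[\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.txt(?:\\.gz)?");
    // Alternation form: the same two extensions listed explicitly.
    static final Pattern ALTERNATION =
            Pattern.compile("[\\w._-]+[0-9]{2}-[0-9]{2}-[0-9]{4}\\.(?:txt|txt\\.gz)");

    static boolean isValid(Pattern p, String fileName) {
        // matches() must consume the whole input, so no ^ or $ anchors are needed.
        return p.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"abcd_04-04-2020.txt", "abcd_04-04-2020.txt.gz", "abcd_04-04-2020.tar"};
        for (String name : names) {
            System.out.println(name + " optional=" + isValid(OPTIONAL, name)
                    + " alternation=" + isValid(ALTERNATION, name));
        }
    }
}
```

Both patterns accept the same file names here; the optional group is simply shorter to write.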

Regular Expression for filtering invalid windows characters in Java

I am looking for a regular expression which will allow me to check if the String has invalid (Windows) Characters.
Here is my sample code:-
public class Test {
public static void main(String[] args) {
String folderName = ">aa?|<";
Pattern p = Pattern.compile(".[\\\\/:\"*<>|].*$");
Matcher m = p.matcher(folderName);
if (m.matches()) {
System.out.println("Match");
} else {
System.out.println("Un-match");
}
}
}
The pattern works fine if the special characters are between letters (for example, "a>a").
Can anyone please suggest the appropriate expression.
I have searched many links but couldn't get a solution.
Thanks in advance!

This is because your initial dot matches exactly one character. Change it to .* so that it matches zero or more characters.
So change .[\\\\/:\"*<>|].*$ to .*[\\\\/:\"*<>|].*$.
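As an alternative sketch, Matcher.find() searches anywhere in the input, so the surrounding .* parts can be dropped entirely. The class name InvalidCharCheck and the method hasInvalidChar are hypothetical, and the character class is copied from the question as-is (so it does not treat ? as invalid).

```java
import java.util.regex.Pattern;

class InvalidCharCheck {
    // Same character class as in the question: \ / : " * < > |
    static final Pattern INVALID = Pattern.compile("[\\\\/:\"*<>|]");

    static boolean hasInvalidChar(String name) {
        // find() succeeds if the class matches at any position in the input.
        return INVALID.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(hasInvalidChar(">aa?|<") ? "Match" : "Un-match");
        System.out.println(hasInvalidChar("folder") ? "Match" : "Un-match");
    }
}
```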

Find string that does not contain some substring

I have a one liner string that looks like this:
My db objects are db.main_flow_tbl, 'main_flow_audit_tbl',
main_request_seq and MAIN_SUBFLOW_TBL.
I want to use regular expressions to return the database tables that start with main but do not contain the words audit or seq, irrespective of case. So in the above example, the strings main_flow_tbl and MAIN_SUBFLOW_TBL should be returned. Can someone help me with this please?

Here is a fully regex based solution:
public static void main(String[] args) throws Exception {
final String in = "My db objects are db.main_flow_tbl, 'main_flow_audit_tbl', main_request_seq and MAIN_SUBFLOW_TBL.";
final Pattern pat = Pattern.compile("main_(?!\\w*?(?:audit|seq))\\w++", Pattern.CASE_INSENSITIVE);
final Matcher m = pat.matcher(in);
while(m.find()) {
System.out.println(m.group());
}
}
Output:
main_flow_tbl
MAIN_SUBFLOW_TBL
This assumes that table names contain only word characters; \w is shorthand for [A-Za-z0-9_].
Pattern breakdown:
main_ is the literal "main_" that you want table names to start with.
(?!\\w*?(?:audit|seq)) is a negative lookahead (not followed by) which takes any number of \w characters (lazily) followed by either "audit" or "seq". This excludes table names that contain those sequences.
\\w++ consumes the remaining name characters possessively.
EDIT
OP's comment: names may contain numbers as well.
No change is needed in that case, because \w already matches digits:
main_(?!\\w*?(?:audit|seq))\\w++
i.e. [\\d\\w] is equivalent to \\w.

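String str = "main_flow_tbl"; // one candidate table name
String lower = str.toLowerCase(); // compare case-insensitively
if (lower.startsWith("main") && !lower.contains("audit") && !lower.contains("seq")) {
//your code here
}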
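To apply a check like this to the whole sentence from the question, the text first has to be split into candidate names. The sketch below is one possible way to do that; the class name TableFilter and the method mainTables are made up for illustration, and it assumes table names consist only of \w characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TableFilter {
    static List<String> mainTables(String text) {
        List<String> result = new ArrayList<>();
        // Split on runs of non-word characters so each token is a candidate name.
        for (String token : text.split("\\W+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.startsWith("main") && !lower.contains("audit") && !lower.contains("seq")) {
                result.add(token); // keep the original casing
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String in = "My db objects are db.main_flow_tbl, 'main_flow_audit_tbl', main_request_seq and MAIN_SUBFLOW_TBL.";
        System.out.println(mainTables(in));
    }
}
```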

If the string matches
^main_(\w_)*(?!(?:audit|seq))
it should be what you want...

How write in java regex any string of characters?

I have a text and I'd like to write a regular expression to extract the string after second #. For example:
# some text with letter, digit 123 1234 and symbols {[ #text_to_extract.
How would I write a regular expression to extract only the string after second #. This code seems like a step in the right direction:
Pattern p = Pattern.compile("##(.+?)");
Matcher m = p.matcher("asdasdas##textToExtract");
This works when text between # is empty, but how do I specify any text in a regex?
Pattern.compile("#(*)#(.+?)"); ?
Edited:
One more condition, text can be between # and # but doesn't have to.

Don't capture the first group
Change the plain * to .*.
Make the second wildcard greedy, since it will otherwise capture only a single character
Pattern.compile("#.*#(.+)");

The "non-greedy" operator should be removed: (.+?) should be (.+). Otherwise you match just the minimum of the text after the second #. You definitely need a "." in front of the *, which means "0 or more of the preceding character". Actually, maybe you want [^#]* instead, so it matches anything but the hash symbol and you're guaranteed to get everything, even if . doesn't match newlines. Anyway, here's working code.
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.*;
class Ideone {
public static void main(String[] args) throws java.lang.Exception {
// Pattern p = Pattern.compile("#(*)#(.+?)");
Pattern p = Pattern.compile("#.*#(.+)");
Matcher m = p.matcher("asdasdas##textToExtract");
while (m.find()) {
System.out.println(m.group(1));
}
}
}
Play with the code here: http://ideone.com/rxB5Zy

You should do it this way
Matcher m =Pattern.compile("^[^#]*#[^#]*#([^#]*)").matcher(input);
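A runnable sketch around that pattern might look like this. The class name AfterSecondHash and the method extract are hypothetical, and returning null when there are fewer than two # characters is a choice made here, not something the answer specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterSecondHash {
    // [^#]* skips non-# text, so the two literal # are the first and second ones.
    static final Pattern P = Pattern.compile("^[^#]*#[^#]*#([^#]*)");

    // Returns the text after the second '#', or null if there are fewer than two '#'.
    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("asdasdas##textToExtract"));
        System.out.println(extract("# some text with letter, digit 123 1234 and symbols {[ #text_to_extract."));
    }
}
```

Because the group is [^#]*, the capture stops at a third # if one is present.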

Get substring between two characters

How do you build a regex to return the characters between < and # in a string?
For example <1001#10.2.2.1> would return 1001.
Would something using <.?> work?

Would something using "<.?>" work?
A slightly modified version of it would work: <.*?# (you need a # at the end, and you need a reluctant quantifier *? in place of the optional mark ?). However, it could be inefficient because of backtracking. Something like this would be better:
<([^#]*)#
This expression starts by finding <, then captures as many non-# characters as it can, and stops after matching the #.
Parentheses denote a capturing group. Use regex API to extract it:
Pattern p = Pattern.compile("<([^#]*)#");
Matcher m = p.matcher("<1001#10.2.2.1>");
if (m.find()) {
System.out.println(m.group(1));
}
This prints 1001.

What about the following:
(?<=<)[^#]*
e.g.:
private static final Pattern REGEX_PATTERN =
Pattern.compile("(?<=<)[^#]*");
public static void main(String[] args) {
String input = "<1001#10.2.2.1>";
Matcher matcher = REGEX_PATTERN.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Output:
1001

Um.
<([0-9]*?)#
I'm assuming it's numbers only.
If it can be any characters, use this:
<(.*?)#
Maybe I'm lacking knowledge, but my understanding of regex is that you need () to get capture groups; otherwise you just match characters without actually capturing them.
So this:
<.?>
won't match anything here, because .? allows at most one character between < and >.
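A quick way to see the difference between these patterns is to run each against the sample input. This is only a sketch; the class name AngleHashDemo and the helper firstMatch are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AngleHashDemo {
    // Returns the first substring matched by regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<1001#10.2.2.1>";
        // .? is "zero or one character", so < and > would have to be almost adjacent.
        System.out.println(firstMatch("<.?>", input));
        // The reluctant *? stops at the first #, but the match includes < and #.
        System.out.println(firstMatch("<.*?#", input));
    }
}
```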
