Java Regex file extension

I have to check if a file name ends with a gzip extension. In particular I'm looking for two extensions: ".tar.gz" and ".gz". I would like to capture the file name (and path) as a group using a single regular expression excluding the gzip extension if any.
I tested the following regular expressions on this example path
String path = "/path/to/file.txt.tar.gz";
Expression 1:
String rgx = "(.+)(?=([\\.tar]?\\.gz)$)";
Expression 2:
String rgx = "^(.+)[\\.tar]?\\.gz$";
Extracting group 1 in this way:
Matcher m = Pattern.compile(rgx).matcher(path);
if(m.find()){
System.out.println(m.group(1));
}
Both regular expressions give me the same result: /path/to/file.txt.tar and not /path/to/file.txt.
Any help will be appreciated.
Thanks in advance

You can use the following idiom to match both your path + file name and the gzip extension in one go:
String[] inputs = {
"/path/to/foo.txt.tar.gz",
"/path/to/bar.txt.gz",
"/path/to/nope.txt"
};
// ┌ group 1: any character reluctantly quantified
// | ┌ group 2
// | | ┌ optional ".tar"
// | | | ┌ compulsory ".gz"
// | | | | ┌ end of input
Pattern p = Pattern.compile("(.+?)((\\.tar)?\\.gz)$");
for (String s: inputs) {
Matcher m = p.matcher(s);
if (m.find()) {
System.out.printf("Found: %s --> %s %n", m.group(1), m.group(2));
}
}
Output
Found: /path/to/foo.txt --> .tar.gz
Found: /path/to/bar.txt --> .gz

You need to make the part that matches the file name reluctant, i.e. change (.+) to (.+?):
String rgx = "^(.+?)(\\.tar)?\\.gz$";
// ^^^
Now you get:
Matcher m = Pattern.compile(rgx).matcher(path);
if(m.find()){
System.out.println(m.group(1)); // /path/to/file.txt
}
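To see the difference between the greedy and the reluctant quantifier side by side, here is a minimal sketch. The class and method names (GreedyVsReluctant, baseName) are made up for illustration, and both patterns are anchored with $ so that only names ending in .gz match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Greedy: (.+) takes as much as it can, so the optional ".tar" group stays empty.
    static final Pattern GREEDY = Pattern.compile("^(.+)(\\.tar)?\\.gz$");
    // Reluctant: (.+?) takes as little as it can, so ".tar" falls to the optional group.
    static final Pattern RELUCTANT = Pattern.compile("^(.+?)(\\.tar)?\\.gz$");

    // Returns group 1 of the match, or null when the input does not end in ".gz".
    static String baseName(Pattern p, String path) {
        Matcher m = p.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String path = "/path/to/file.txt.tar.gz";
        System.out.println(baseName(GREEDY, path));    // /path/to/file.txt.tar
        System.out.println(baseName(RELUCTANT, path)); // /path/to/file.txt
    }
}
```

Note that [\\.tar]? in the original expressions is a character class (one optional character out of ".", "t", "a", "r"), not the optional literal ".tar", which is why a group such as (\\.tar)? is needed.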

Use a capturing group based regex.
^(.+)/(.+?)(?:\\.tar)?\\.gz$
And,
Get the path from index 1.
Get the filename from index 2.
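A possible way to use this from Java is sketched below. The class and method names (PathAndName, split) are hypothetical, and the file-name group is written as reluctant, (.+?), so that it does not swallow the optional ".tar". The sketch also assumes the input contains at least one "/"; a bare name such as "file.gz" will not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathAndName {
    // Group 1: directory part (greedy, so it runs up to the last "/").
    // Group 2: file name without the gzip extension (reluctant).
    static final Pattern P = Pattern.compile("^(.+)/(.+?)(?:\\.tar)?\\.gz$");

    // Returns {path, fileName}, or null when the input is not a gzip path.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("/path/to/file.txt.tar.gz");
        System.out.println(parts[0]); // /path/to
        System.out.println(parts[1]); // file.txt
    }
}
```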

Related

Splitting the string in key=value groups using Regex (Java)

I am not an expert in regular expressions, so I am asking for an efficient way to split this string into key=value groups.
The input string:
x-x="11111" y-y="John-Doe 23" db {rty='Y453'} code {codeDate='2000-03-01T00:00:00'}
What I need is to get these key=value pairs:
key=x-x, value="11111"
key=y-y, value="John-Doe 23"
key=rty, value='Y453'
key=codeDate, value='2000-03-01T00:00:00'
My solution is below, but I fear it's not the simplest one.
String str = "x-x=\"11111\" y-y=\"John-Doe 23\" db {rty='Y453'} code {codeDate='2000-03-01T00:00:00'}";
Matcher m = Pattern.compile("(\\w+-*\\w*)=((\"|')(\\w+( |-|:)*)+(\"|'))").matcher(str);
while(m.find()) {
String key = m.group(1);
String value = m.group(2);
System.out.printf("key=%s, value=%s\n", key, value);
}
Thanks in advance for your help.
You can use this regex with 3 capturing groups and a back-reference:
([\w-]+)=((['"]).*?\3)
RegEx Breakup:
([\w-]+): Match and capture key name in group #1
=: Match =
(: Start group #2
(['"]): Match and capture a quote in group #3
.*?: Match 0 or more of any character (lazy match)
\3: Back-reference to group #3 to match closing quote of same type
): End of capture group #2
You will get your matches in .group(1) and .group(2).
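Applied to the input string from the question, the regex could be used roughly as follows. This is only a sketch: the class and method names (KeyValuePairs, parse) are invented for the example, and it assumes that values never contain an escaped quote of the same type as the one that opens them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValuePairs {
    // Group 1: key, group 2: quoted value, group 3: the opening quote character.
    static final Pattern PAIR = Pattern.compile("([\\w-]+)=((['\"]).*?\\3)");

    // Collects every key="value" or key='value' pair in order of appearance.
    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String str = "x-x=\"11111\" y-y=\"John-Doe 23\" db {rty='Y453'} "
                   + "code {codeDate='2000-03-01T00:00:00'}";
        for (Map.Entry<String, String> e : parse(str).entrySet()) {
            System.out.printf("key=%s, value=%s%n", e.getKey(), e.getValue());
        }
    }
}
```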
To select only the value between single or double quotes, capture it in group 1:
String ResultString = null;
try {
Pattern regex = Pattern.compile("[\"'](.*?[^\\\\])[\"']", Pattern.DOTALL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group(1);
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}

Regular expression for extracting instance ID, AMI ID, Volume ID

Given the following string
Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305
I want to be able to extract the following using a regular expression
i-b9b4ffaa
ami-dbcf88b1
vol-e97db305
This is the regular expression I came up with, which currently doesn't do what I need :
Pattern p = Pattern.compile("Created by CreateImage([a-z]+[0.9]+)([a-z]+[0.9]+)([a-z]+[0.9]+)",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305");
System.out.println(m.matches()); --> false
You may match all words starting with letters, followed with a hyphen, and then having alphanumeric chars:
String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
Pattern pattern = Pattern.compile("(?i)\\b[a-z]+-[a-z0-9]+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(0));
}
// => i-b9b4ffaa, ami-dbcf88b1, vol-e97db305
Pattern details:
(?i) - a case insensitive modifier (embedded flag option)
\\b - a word boundary
[a-z]+ - 1 or more ASCII letters
- - a hyphen
[a-z0-9]+ - 1 or more alphanumerics.
To make sure these values appear on the same line after Created by CreateImage, use a \G-based regex:
String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
Pattern pattern = Pattern.compile("(?i)(?:Created by CreateImage|(?!\\A)\\G)(?:(?!\\b[a-z]+-[a-z0-9]+).)*\\b([a-z]+-[a-z0-9]+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1));
}
Note that this pattern relies on \G, which matches at the position where the previous successful match ended (so a match can only start at Created by CreateImage or directly after an earlier match), and on a tempered greedy token, (?:(?!\\b[a-z]+-[a-z0-9]+).)*, which consumes any character other than a line terminator as long as it does not start a sequence of word boundary + letters + hyphen + alphanumerics. The tempered greedy token runs a lookahead at every position, which makes it expensive.
A cheaper alternative is a two-step approach: first check that the string starts with Created by CreateImage, then extract the IDs:
String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
if (s.startsWith("Created by CreateImage")) {
Matcher n = Pattern.compile("(?i)\\b[a-z]+-[a-z0-9]+").matcher(s);
while(n.find()) {
System.out.println(n.group(0));
}
}
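If the message always has exactly the shape shown in the question, a single anchored pattern with three capturing groups is another option. The sketch below is closer to the original attempt, which fails because the unescaped parentheses act as groups instead of matching "(" and ")", [0.9] matches only the characters 0, . and 9, and the hyphen and the " for " / " from " text are not accounted for. The class and method names are hypothetical, and the sketch assumes the IDs are lowercase hexadecimal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CreateImageIds {
    // Literal parentheses are escaped; each ID is captured in its own group.
    static final Pattern P = Pattern.compile(
        "Created by CreateImage\\((i-[0-9a-f]+)\\) for (ami-[0-9a-f]+) from (vol-[0-9a-f]+)");

    // Returns {instanceId, amiId, volumeId}, or null when the text has another shape.
    static String[] extract(String description) {
        Matcher m = P.matcher(description);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
        for (String id : extract(s)) {
            System.out.println(id);
        }
    }
}
```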

String#replaceAll() to replace *anything but a =* group

I have a parameter of key-value like this:
sign="aaaabbbb="
And I want to get the parameter name sign and the value "aaaabbbb=" (including the quote signs).
I thought I could split the string with = to get the first elem of the array which is the parameter name and do a String.replaceAll() to remove the sign= to get the value. Anyway here is my sample code:
public class TestStringReplace {
public static void main(String[] argvs){
String s = "sign=\"aaaabbbb=\"";
String[] ss = s.split("=");
String value = s.replaceAll("\\[^=]+=","");
//EDIT: s.replaceAll("[^=]+=","") will not do the job either.
System.out.println(ss[0]);
System.out.println(value);
}
}
but the output shows this:
sign
sign="aaaabbbb="
Why does \\[^=]+= not match sign= and replace it with an empty string here? I'm quite new to Java regex and need some help.
Thanks in advance.
In Java you can use the following:
String str = "sign=\"aaaabbbb=\"";
String var1 = str.substring(0, str.indexOf('='));
String var2 = str.substring(str.indexOf('=')+1);
System.out.println("var1="+var1+", var2="+var2);
The above would have the following output:
var1=sign, var2="aaaabbbb="
Try the following regex ^\\w+= with replaceAll() instead of your regex:
public class TestStringReplace {
public static void main(String[] argvs){
String s = "sign=\"aaaabbbb=\"";
String[] ss = s.split("=");
String value = s.replaceAll("^\\w+=","");
System.out.println(ss[0]);
System.out.println(value);
}
}
This will remove the sign=.
Note that with your "\\[^=]+=" regex you were trying to match the character [ literally in the beginning of your regex.
And it explains why you got sign="aaaabbbb=" as a result with replaceAll() which didn't replace anything because there's no match.
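The three variants discussed so far can be compared with a small sketch like this one (the class and method names are invented for the example):

```java
public class EscapedBracketDemo {
    // Removes every match of the given regex from the input.
    static String strip(String regex, String s) {
        return s.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String s = "sign=\"aaaabbbb=\"";

        // "\\[" is a literal "[", which the input does not contain: nothing is replaced.
        System.out.println(strip("\\[^=]+=", s)); // sign="aaaabbbb="

        // Unanchored character class: matches twice and removes the value as well.
        System.out.println(strip("[^=]+=", s));   // "

        // Anchored at the start: removes only the leading "sign=".
        System.out.println(strip("^\\w+=", s));   // "aaaabbbb="
    }
}
```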
You're probably better off with an actual Pattern and back-references here.
For instance:
String[] test = {
"sign=\"aaaabbbb=\"",
// assuming a HTTP GET-styled parameter list
"blah?sign=\"aaaabbbb=\"",
"foo?sign=\"aaaabbbb=\"&blah=\"hodor\""
};
// | group 1: literal "sign"
// | | literal key-value delimiter and double quote
// | | | group 2: any character reluctantly quantified
// | | | | literal ending double quote
// | | | | | look-ahead for either "&" or end
// | | | | |
Pattern p = Pattern.compile("(sign)=\"(.+?)\"(?=$|&)");
Matcher m = null;
for (String s: test) {
m = p.matcher(s);
while (m.find()) {
System.out.printf(
"Found key: \"%s\" and value: \"%s\"%n", m.group(1), m.group(2)
);
}
}
Output
Found key: "sign" and value: "aaaabbbb="
Found key: "sign" and value: "aaaabbbb="
Found key: "sign" and value: "aaaabbbb="
Notes
I'm assuming a HTTP GET styled parameter list, but maybe you don't need to actually check for a next parameter key-value pair delimiter (i.e. &) - in which case you can remove the & part
I'm also assuming you want the "s out of your value back-reference, which kind of makes the following & check useless
Your current pattern for the replaceAll invocation will match as follows:
// | literal "[" (double-escaped), so no character class is opened
// || "^" start-of-input anchor, which cannot match after a "["
// ||| literal "="
// |||| literal "]" greedily quantified (1+ occurrences)
// |||| | literal "="
"\\[^=]+="
Finally, if you really, really want to use String#replaceAll for this, here's a slightly different pattern than the one above:
for (String s: test) {
System.out.println(
s.replaceAll(
".*(sign)=\"(.+?)\"(?=$|&).*",
"Found key: \"$1\" and value: \"$2\""
)
);
}
It still uses back-references and will produce the same result, albeit in an uglier way: you can't reuse the $1 and $2 group values, since you're creating a new String replacing the original one.
Last possible solution, using String#split. This is the ugliest as it won't work well with a list of parameters:
for (String s: test) {
System.out.println(
// | negative look-behind for start of input
// | | literal "="
// | | | literal "
// | | |
Arrays.toString(s.split("(?<!^)=\""))
);
}
Output
[sign, aaaabbbb]
[blah?sign, aaaabbbb] --> yuck
[foo?sign, aaaabbbb, &blah, hodor"] --> yuck again
The double backslash is a mistake, because it escapes the [ to a literal [, which will never match.
Instead, do this:
String name = s.replaceAll("=.*", "");
String value = s.replaceFirst("[^=]*=", "");
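Since the question already splits on "=", one more option is the two-argument String#split(String, int) with a limit of 2, which stops after the first "=" and keeps the rest of the string intact. This is a sketch only; the class and method names are hypothetical, and an input without any "=" yields a one-element array.

```java
public class SplitOnFirstEquals {
    // Splits at the first "=" only; any later "=" stays inside the value.
    static String[] nameAndValue(String s) {
        return s.split("=", 2);
    }

    public static void main(String[] args) {
        String[] parts = nameAndValue("sign=\"aaaabbbb=\"");
        System.out.println(parts[0]); // sign
        System.out.println(parts[1]); // "aaaabbbb="
    }
}
```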

Regex pattern to match certain url

I have a large text and I only want to use certain information from it. The text looks like this:
Some random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_1_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_2_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_3_av.m3u8
I only want the http text. There are several of them in the text but I only need one of them. The regular expression should be "starts with http and ends with .m3u8".
I looked at the glossary of all the different expression but it is very confusing to me. I tried "/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{12,30})([\/\w \.-]*)*\/?$/" as my pattern. But is that enough?
All help is appreciated. Thank you.
Assuming each item in your example sits on its own line, here's a snippet that will work:
String text =
"Some random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
"More random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
// removed some for brevity
"More random text here" +
System.getProperty("line.separator") +
// added counter-example ending with "NOPE"
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.NOPE";
// Multi-line pattern:
// ┌ line starts with http
// | ┌ any 1+ character reluctantly quantified
// | | ┌ dot escape
// | | | ┌ ending text
// | | | | ┌ end of line marker
// | | | | |
Pattern p = Pattern.compile("^http.+?\\.m3u8$", Pattern.MULTILINE);
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group());
}
Output
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
Edit
For a refined "filter" by the "index_x" file of the URL, you can simply add it in the Pattern between the protocol and ending of the line, e.g.:
Pattern.compile("^http.+?index_0.+?\\.m3u8$", Pattern.MULTILINE);
I didn't test it, but this should do the trick:
^(http:\/\/.*\.m3u8)
This is the answer by #capnibishop, with two small changes:
^(http://).*(/index_1)[^/]*\.m3u8$
Added the missing "$" sign at the end. This ensures it matches
http://something.m3u8
and not
http://something.m3u81
Added the condition to match index_1 at the end of the line, which means it will match:
http://something/index_1_something_else.m3u8
and not
http://something/index_1/something_else.m3u8
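Because the question only needs one of the URLs, a single find() call is enough. The sketch below assumes the text uses one item per line; the class and method names and the shortened sample URLs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstPlaylistUrl {
    // MULTILINE makes ^ and $ match at the start and end of each line.
    static final Pattern URL = Pattern.compile("^https?://.+?\\.m3u8$", Pattern.MULTILINE);

    // Returns the first line that is an http(s) URL ending in ".m3u8", or null if none.
    static String first(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Some random text here\n"
                    + "http://host.invalid/open/index_0_av.m3u8\n"
                    + "More random text here\n"
                    + "http://host.invalid/open/index_1_av.m3u8";
        System.out.println(first(text)); // http://host.invalid/open/index_0_av.m3u8
    }
}
```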

Java Regex for custom function

I'm looking for a Regex pattern that matches the following, but I'm kind of stumped so far. I'm not sure how to grab the results of the two groups I want, marked by id, and attr.
Should match:
account[id].attr
account[anotherid].anotherattr
These should respectively return id, attr,
and anotherid, anotherattr
Any tips?
Here's a complete solution mapping your id -> attributes:
String[] input = {
"account[id].attr",
"account[anotherid].anotherattr"
};
// | literal for "account"
// | | escaped "["
// | | | group 1: any character
// | | | | escaped "]"
// | | | | | escaped "."
// | | | | | | group 2: any character
Pattern p = Pattern.compile("account\\[(.+)\\]\\.(.+)");
Map<String, String> output = new LinkedHashMap<String, String>();
// iterating over input Strings
for (String s: input) {
// matching
Matcher m = p.matcher(s);
// finding only once per input String. Change to a while-loop if multiple instances
// within single input
if (m.find()) {
// back-referencing group 1 and 2 as key -> value
output.put(m.group(1), m.group(2));
}
}
System.out.println(output);
Output
{id=attr, anotherid=anotherattr}
Note
In this implementation, "incomplete" inputs such as "account[anotherid]." will not be put in the Map as they don't match the Pattern at all.
In order to have these cases put as id -> null, you only need to add a ? at the end of the Pattern.
That will make the last group optional.
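A minimal sketch of that variant follows (the class and method names are hypothetical). When the optional group does not participate in the match, Matcher#group returns null, and LinkedHashMap accepts null values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalAttr {
    // Same pattern as above, with a trailing "?" that makes group 2 optional.
    static final Pattern P = Pattern.compile("account\\[(.+)\\]\\.(.+)?");

    // Maps each id to its attribute, or to null when the attribute is missing.
    static Map<String, String> parse(String... inputs) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String s : inputs) {
            Matcher m = P.matcher(s);
            if (m.find()) {
                output.put(m.group(1), m.group(2)); // group(2) may be null
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(parse("account[id].attr", "account[anotherid]."));
        // {id=attr, anotherid=null}
    }
}
```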
