Regex pattern to match certain url - java

I have a large text and I only want to use certain information from it. The text looks like this:
Some random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_1_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_2_av.m3u8
More random text here
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_3_av.m3u8
I only want the http URLs. There are several of them in the text, but I only need one. The regular expression should express "starts with http and ends with .m3u8".
I looked at the glossary of all the different expressions, but it is very confusing to me. I tried "/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{12,30})([\/\w \.-]*)*\/?$/" as my pattern. Is that enough?
All help is appreciated. Thank you.

Assuming each URL sits on its own line, as in your example, here's a snippet that will work:
String text =
"Some random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
"More random text here" +
System.getProperty("line.separator") +
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8" +
System.getProperty("line.separator") +
// removed some for brevity
"More random text here" +
System.getProperty("line.separator") +
// added counter-example ending with "NOPE"
"http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.NOPE";
// Multi-line pattern:
// ┌ line starts with http
// | ┌ any 1+ character reluctantly quantified
// | | ┌ dot escape
// | | | ┌ ending text
// | | | | ┌ end of line marker
// | | | | |
Pattern p = Pattern.compile("^http.+?\\.m3u8$", Pattern.MULTILINE);
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group());
}
Output
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
http://xxx-f.xxx.net/i/xx/open/xxxx/1370235-005A/EPISOD-1370235-005A-xxx_,892,144,252,360,540,1584,xxxx,.mp4.csmil/index_0_av.m3u8
Edit
For a refined "filter" by the "index_x" file of the URL, you can simply add it in the Pattern between the protocol and ending of the line, e.g.:
Pattern.compile("^http.+?index_0.+?\\.m3u8$", Pattern.MULTILINE);
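Since the question asks for only one of the URLs, a single find() call is enough and no loop is needed. The following is a minimal sketch under that assumption; the class name FirstM3u8Url, the method findFirst and the example.invalid URLs are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstM3u8Url {
    // Same multi-line pattern as above: a line starting with "http" and ending with ".m3u8".
    private static final Pattern M3U8_LINE =
            Pattern.compile("^http.+?\\.m3u8$", Pattern.MULTILINE);

    // Returns the first matching line, or an empty Optional if there is none.
    public static Optional<String> findFirst(String text) {
        Matcher m = M3U8_LINE.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Some random text here\n"
                + "http://example.invalid/a/index_0_av.m3u8\n"
                + "More random text here\n"
                + "http://example.invalid/a/index_1_av.m3u8\n";
        // Prints http://example.invalid/a/index_0_av.m3u8
        System.out.println(findFirst(text).orElse("no match"));
    }
}
```

Returning an Optional keeps the "no URL found" case explicit instead of relying on null.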

I didn't test it, but this should do the trick:
^(http:\/\/.*\.m3u8)

It is the answer of #capnibishop, but with a little change.
^(http://).*(/index_1)[^/]*\.m3u8$
Added the missing "$" sign at the end. This ensures it matches
http://something.m3u8
and not
http://something.m3u81
Added the condition to match index_1 at the end of the line, which means it will match:
http://something/index_1_something_else.m3u8
and not
http://something/index_1/something_else.m3u8
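The effect of the trailing "$" and of the [^/]* part can be checked with a few test strings. This is only a sketch; the class name Index1Check and the method isIndex1 are hypothetical, and the pattern is the one above written as a Java string literal.

```java
import java.util.regex.Pattern;

class Index1Check {
    // "\\." in the Java literal is the regex "\." (a literal dot).
    private static final Pattern INDEX_1 =
            Pattern.compile("^(http://).*(/index_1)[^/]*\\.m3u8$");

    public static boolean isIndex1(String url) {
        return INDEX_1.matcher(url).find();
    }

    public static void main(String[] args) {
        // index_1 is part of the last path segment: prints true
        System.out.println(isIndex1("http://something/index_1_something_else.m3u8"));
        // a "/" follows index_1, which [^/]* cannot cross: prints false
        System.out.println(isIndex1("http://something/index_1/something_else.m3u8"));
        // extra character after ".m3u8" is rejected by "$": prints false
        System.out.println(isIndex1("http://something/index_1_av.m3u81"));
    }
}
```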

Related

String#replaceAll() to replace *anything but a =* group

I have a parameter of key-value like this:
sign="aaaabbbb="
And I want to get the parameter name sign and the value "aaaabbbb=" (with the quote signs).
I thought I could split the string with = to get the first elem of the array which is the parameter name and do a String.replaceAll() to remove the sign= to get the value. Anyway here is my sample code:
public class TestStringReplace {
public static void main(String[] argvs){
String s = "sign=\"aaaabbbb=\"";
String[] ss = s.split("=");
String value = s.replaceAll("\\[^=]+=","");
//EDIT: s.replaceAll("[^=]+=","") will not do the job either.
System.out.println(ss[0]);
System.out.println(value);
}
}
but the output shows this:
sign
sign="aaaabbbb="
Why does \\[^=]+= not match sign= and replace it with an empty string here? I'm quite new to Java regex and need some help.
Thanks in advance.
In Java you can use the following:
String str = "sign=\"aaaabbbb=\"";
String var1 = str.substring(0, str.indexOf('='));
String var2 = str.substring(str.indexOf('=')+1);
System.out.println("var1="+var1+", var2="+var2);
The above would have the following output:
var1=sign, var2="aaaabbbb="
Try the following regex ^\\w+= with replaceAll() instead of your regex:
public class TestStringReplace {
public static void main(String[] argvs){
String s = "sign=\"aaaabbbb=\"";
String[] ss = s.split("=");
String value = s.replaceAll("^\\w+=","");
System.out.println(ss[0]);
System.out.println(value);
}
}
This will remove the sign=.
Note that with your "\\[^=]+=" regex you were trying to match the character [ literally in the beginning of your regex.
And it explains why you got sign="aaaabbbb=" as a result with replaceAll() which didn't replace anything because there's no match.
You're probably better off with an actual Pattern and back-references here.
For instance:
String[] test = {
"sign=\"aaaabbbb=\"",
// assuming a HTTP GET-styled parameter list
"blah?sign=\"aaaabbbb=\"",
"foo?sign=\"aaaabbbb=\"&blah=\"hodor\""
};
// | group 1: literal "sign"
// | | literal key-value delimiter and double quote
// | | | group 2: any character reluctantly quantified
// | | | | literal ending double quote
// | | | | | look-ahead for either "&" or end
// | | | | |
Pattern p = Pattern.compile("(sign)=\"(.+?)\"(?=$|&)");
Matcher m = null;
for (String s: test) {
m = p.matcher(s);
while (m.find()) {
System.out.printf(
"Found key: \"%s\" and value: \"%s\"%n", m.group(1), m.group(2)
);
}
}
Output
Found key: "sign" and value: "aaaabbbb="
Found key: "sign" and value: "aaaabbbb="
Found key: "sign" and value: "aaaabbbb="
Notes
I'm assuming a HTTP GET styled parameter list, but maybe you don't need to actually check for a next parameter key-value pair delimiter (i.e. &) - in which case you can remove the & part
I'm also assuming you want the "s out of your value back-reference, which kind of makes the following & check useless
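Your current pattern for the replaceAll invocation is interpreted as follows:
// | literal "[" (escaped by the double backslash)
// ||start-of-input anchor "^" (no character class is open)
// |||literal "="
// ||||literal "]", greedily quantified (1+ occurrences)
// |||| | literal "="
"\\[^=]+="
Because the "^" anchor can never match right after a consumed "[", this pattern matches nothing.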
Finally, if you really, really want to use String#replaceAll for this, here's a slightly different pattern than the one above:
for (String s: test) {
System.out.println(
s.replaceAll(
".*(sign)=\"(.+?)\"(?=$|&).*",
"Found key: \"$1\" and value: \"$2\""
)
);
}
It still uses back-references and will produce the same result, albeit in an uglier way: you can't reuse the $1 and $2 group values, since you're creating a new String replacing the original one.
Last possible solution, using String#split. This is the ugliest as it won't work well with a list of parameters:
for (String s: test) {
System.out.println(
// | negative look-behind for start of input
// | | literal "="
// | | | literal "
// | | |
Arrays.toString(s.split("(?<!^)=\""))
);
}
Output
[sign, aaaabbbb]
[blah?sign, aaaabbbb] --> yuck
[foo?sign, aaaabbbb, &blah, hodor"] --> yuck again
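The double backslash is a mistake: it escapes the [ into a literal [, so no character class is opened and the pattern never matches.
Instead, do this:
String name = s.replaceAll("=.*", "");
// replaceFirst, so that only the text up to the first "=" is removed
String value = s.replaceFirst(".*?=", "");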
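Another way to get both parts in one call is String#split with a limit of 2, which splits at the first "=" only and leaves later "=" characters in the value. A minimal sketch, assuming the input always contains at least one "="; the class name KeyValueSplit and the method splitFirst are made up for illustration.

```java
class KeyValueSplit {
    // Splits "key=value" at the first '=' only; the rest stays in the second element.
    public static String[] splitFirst(String s) {
        return s.split("=", 2);
    }

    public static void main(String[] args) {
        String[] parts = splitFirst("sign=\"aaaabbbb=\"");
        System.out.println(parts[0]); // prints sign
        System.out.println(parts[1]); // prints "aaaabbbb="
    }
}
```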

Java Regex file extension

I have to check if a file name ends with a gzip extension. In particular I'm looking for two extensions: ".tar.gz" and ".gz". I would like to capture the file name (and path) as a group using a single regular expression excluding the gzip extension if any.
I tested the following regular expressions on this example path
String path = "/path/to/file.txt.tar.gz";
Expression 1:
String rgx = "(.+)(?=([\\.tar]?\\.gz)$)";
Expression 2:
String rgx = "^(.+)[\\.tar]?\\.gz$";
Extracting group 1 in this way:
Matcher m = Pattern.compile(rgx).matcher(path);
if(m.find()){
System.out.println(m.group(1));
}
Both regular expressions give me the same result: /path/to/file.txt.tar and not /path/to/file.txt.
Any help will be appreciated.
Thanks in advance
You can use the following idiom to match both your path and file name and the gzip extension in one go:
String[] inputs = {
"/path/to/foo.txt.tar.gz",
"/path/to/bar.txt.gz",
"/path/to/nope.txt"
};
// ┌ group 1: any character reluctantly quantified
// | ┌ group 2
// | | ┌ optional ".tar"
// | | | ┌ compulsory ".gz"
// | | | | ┌ end of input
Pattern p = Pattern.compile("(.+?)((\\.tar)?\\.gz)$");
for (String s: inputs) {
Matcher m = p.matcher(s);
if (m.find()) {
System.out.printf("Found: %s --> %s %n", m.group(1), m.group(2));
}
}
Output
Found: /path/to/foo.txt --> .tar.gz
Found: /path/to/bar.txt --> .gz
You need to make the part that matches the file name reluctant, i.e. change (.+) to (.+?), and turn the character class [\\.tar] (which matches a single character) into a group (\\.tar):
String rgx = "^(.+?)(\\.tar)?\\.gz$";
// ^^^
Now you get:
Matcher m = Pattern.compile(rgx).matcher(path);
if(m.find()){
System.out.println(m.group(1)); // /path/to/file.txt
}
Use a capturing group based regex.
^(.+)/(.+?)(?:\\.tar)?\\.gz$
And,
Get the path from index 1.
Get the filename from index 2.
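The reluctant-group idea from these answers can be wrapped in a small helper. This is a sketch only; the class name GzipName and the method stripGzipExtension are hypothetical, and returning the input unchanged when there is no gzip extension is an assumption about the desired behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GzipName {
    // Group 1 (reluctant): everything before the extension,
    // followed by an optional ".tar" and a mandatory ".gz" at the end.
    private static final Pattern GZ = Pattern.compile("^(.+?)(?:\\.tar)?\\.gz$");

    // Returns the path without ".gz" or ".tar.gz"; other paths are returned unchanged.
    public static String stripGzipExtension(String path) {
        Matcher m = GZ.matcher(path);
        return m.matches() ? m.group(1) : path;
    }

    public static void main(String[] args) {
        System.out.println(stripGzipExtension("/path/to/file.txt.tar.gz")); // /path/to/file.txt
        System.out.println(stripGzipExtension("/path/to/bar.txt.gz"));      // /path/to/bar.txt
        System.out.println(stripGzipExtension("/path/to/nope.txt"));        // /path/to/nope.txt
    }
}
```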

java indexOf returns -1 when it's supposed to return a positive number

I'm new to Network programming and I never used Java for network programming before.
I'm writing a server in Java and I have a problem processing messages from the client. I used:
DataInputStream inputFromClient = new DataInputStream( socket.getInputStream() );
while ( true ) {
// Receive radius from the client
byte[] r=new byte[256000];
inputFromClient.read(r);
String Ffss =new String(r);
System.out.println( "Received from client: " + Ffss );
System.out.print("Found Index :" );
System.out.println(Ffss.indexOf( '\a' ));
System.out.print("Found Index :" );
System.out.println(Ffss.indexOf( ' '));
String Str = new String("add 12341\n13243423");
String SubStr1 = new String("\n");
System.out.print("Found Index :" );
System.out.println( Str.indexOf( SubStr1 ));
}
If I do this, and have a sample input asg 23\aag, it will return:
Found Index :-1
Found Index :3
Found Index :9
It's clear that if the String object is created from scratch, indexOf can locate "\n".
Why would the code have a problem locating \a if the String is obtained from the DataInputStream?
Try String abc = new String("\\a"); you need \\ to get a backslash in a string, otherwise the \ starts an "escape sequence".
It looks like the a is being escaped.
Have a look at this article to understand how the back slash affects a string.
Escape Sequences
A character preceded by a backslash (\) is an escape
sequence and has special meaning to the compiler. The following table
shows the Java escape sequences:
| Escape Sequence | Description|
|:----------------|------------:|
| \t | Insert a tab in the text at this point.|
| \b | Insert a backspace in the text at this point.|
| \n | Insert a newline in the text at this point.|
| \r | Insert a carriage return in the text at this point.|
| \f | Insert a formfeed in the text at this point.|
| \' | Insert a single quote character in the text at this point.|
| \" | Insert a double quote character in the text at this point.|
| \\ | Insert a backslash character in the text at this point.|
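To search received text for a backslash followed by "a", the search string has to be written as "\\a" in Java source. The sketch below uses a ByteArrayInputStream as a stand-in for the socket stream and assumes the client sends UTF-8; the class name BackslashSearch and the method readChunk are made up for illustration. It also decodes only the bytes actually read, because new String(r) on the whole buffer would include the unused trailing zero bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class BackslashSearch {
    // Reads one chunk and decodes only the bytes that were actually received.
    public static String readChunk(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        int n = in.read(buffer);
        return n <= 0 ? "" : new String(buffer, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // The "client" sends the characters: a s g space 2 3 backslash a a g
        byte[] wire = "asg 23\\aag".getBytes(StandardCharsets.UTF_8);
        String received = readChunk(new ByteArrayInputStream(wire));
        System.out.println(received.indexOf("\\a")); // prints 6
        System.out.println(received.indexOf(' '));   // prints 3
    }
}
```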

How to replace multiple words with space in a string using Java

I tried to replace a list of words from a given string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still has the last word "that". Is there any way to modify the regex so that it also handles the last word?
Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz
The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
While replaceAll is running, the alternative [ of ] matches
[ of that ]
^ ^
and is replaced with a single space. The next character to be matched is then t, not the space expected by
... | that | ...
What I think you can do to fix this is add word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.
It doesn't work because the " of " part is replaced first; that match consumes the space in front of "that", so " that" no longer matches.
You can change it in two ways:
String regex = "'s | he | of| to | a | and | in | that";
or
String regex = "'s | he | of | to | a | and | in |that ";
Or you can simply call Sample = Sample.replaceAll(regex, " "); a second time.
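The word-boundary approach from the answers above can be combined with a whitespace clean-up so that no double spaces are left behind. A minimal sketch; the class name StopWordRemoval and the method strip are hypothetical, and collapsing and trimming the whitespace is an assumption about the desired output.

```java
class StopWordRemoval {
    // \b matches a position, not a character, so neighbouring
    // stop words no longer compete for the same space.
    private static final String STOP_WORDS =
            "'s\\b|\\b(?:he|of|to|a|and|in|that)\\b";

    public static String strip(String input) {
        return input.replaceAll(STOP_WORDS, "") // drop the listed words
                    .replaceAll("\\s+", " ")    // collapse runs of whitespace
                    .trim();                    // drop leading/trailing space
    }

    public static void main(String[] args) {
        // Prints: saw cat running pat mat
        System.out.println(strip(" he saw a cat running of that pat's mat "));
    }
}
```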

Java regExp get sub-string before last quote

I have a string:
String text = "\"Alaska \"adaa\" asdas\" at [2013-10-298 13:36.062];";
I need to get the substring
//"Alaska "adaa" asdas"
String text = "\"Alaska \"adaa\" asdas\"";
How to?
Why not just use lastIndexOf?
text = text.substring(0, text.lastIndexOf("\"") + 1);
One way would be replacing everything after the last quote with an empty string:
text = text.replaceAll("(?<=\")[^\"]*$", "");
// ^^^^^^^ ^^^ ^
// | | |
// Preceded by a quote ----+ | |
// Does not contain a quote -----+ |
// Goes all the way to the end ------+
Try this:
text = text.replaceAll("\"[^\"]*$", "\"");
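The lastIndexOf approach from the first answer can be made safe for input without any quote. This is a sketch; the class name UpToLastQuote and the method cut are made up for illustration, and returning the input unchanged when no quote is present is an assumption.

```java
class UpToLastQuote {
    // Keeps everything up to and including the last double quote;
    // input without any quote is returned unchanged.
    public static String cut(String text) {
        int last = text.lastIndexOf('"');
        return last < 0 ? text : text.substring(0, last + 1);
    }

    public static void main(String[] args) {
        String text = "\"Alaska \"adaa\" asdas\" at [2013-10-298 13:36.062];";
        System.out.println(cut(text)); // prints "Alaska "adaa" asdas"
    }
}
```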
