Want to extract values from text file using regex

"00.00.00.00" 00.00.00.00 - - [07/Jun/2016:00:00:00 -0700] "Hey /acd?bg=1 HTTP/1.1" 200 2 "-" "00.00.00.00:0000" "Java/1.8.0_66" - - 2000
There are records like the one above. I want to extract the values of all the fields; the fields are separated by spaces. Please help.
I am using the following code:
String p;
Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
if (matcher.find()){
System.out.println(matcher.group(1));
}
But I am not getting the correct output. I am new to regex.
The desired output is:
00.00.00.00
00.00.00.00
-
-
07/Jun/2016:00:00:01 -0700
Hey /acd?bg=1 HTTP/1.1
200

I've got a pattern that does what you want, but it isn't pretty:
^"((?:\d\d?\d?\.){3}\d\d?\d?)" ((?:\d\d?\d?\.){3}\d\d?\d?) (-) (-) (\[\d\d\/\w+\/\d{4}(?::\d\d){3} -\d{4}\]) "(.*?)" (\d{3})
To break it down a bit (because it's nasty):
^ makes it start at the beginning of the string.
((?:\d\d?\d?\.){3}\d\d?\d?) will match and capture the first IP address, with each element being composed of between 1 and 3 digits. The same pattern is then used to match the second IP address as well.
(-) will capture the hyphens - not sure why you want them, but they're in your desired output.
(\[\d\d\/\w+\/\d{4}(?::\d\d){3} -\d{4}\]) captures the timestamp (the bit in the square brackets).
"(.*?)" will match and capture the text string.
Finally, (\d{3}) will capture the HTTP status code.
Taken together, this pattern will match the stuff you want from the string you provided.
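As a rough sketch, the pattern can be wired into Java like this. The class name AccessLogFields and the method parse are made up for illustration. One deliberate change from the pattern above: the square brackets sit outside the timestamp group, so the timestamp comes back without them, as in the desired output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original question.
final class AccessLogFields {
    // One IPv4-looking field: three "1-3 digits plus dot" parts, then 1-3 digits.
    private static final String IP = "((?:\\d\\d?\\d?\\.){3}\\d\\d?\\d?)";

    // Same pattern as in the answer, except that \[ and \] are moved
    // outside capture group 5 (an adjustment made for this sketch).
    private static final Pattern LINE = Pattern.compile(
            "^\"" + IP + "\" " + IP + " (-) (-) "
            + "\\[(\\d\\d/\\w+/\\d{4}(?::\\d\\d){3} -\\d{4})\\] "
            + "\"(.*?)\" (\\d{3})");

    // Returns the captured fields in order, or an empty list if the line does not match.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = LINE.matcher(line);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                fields.add(m.group(i));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"00.00.00.00\" 00.00.00.00 - - [07/Jun/2016:00:00:00 -0700] "
                + "\"Hey /acd?bg=1 HTTP/1.1\" 200 2 \"-\" \"00.00.00.00:0000\" "
                + "\"Java/1.8.0_66\" - - 2000";
        // Prints one captured field per line.
        parse(line).forEach(System.out::println);
    }
}
```

Looping from 1 to groupCount() prints every captured field, whereas the code in the question only ever prints group(1).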

Related

How to remove everything after specific character in string using Java

I have a string that looks like this:
analitics#gmail.com#5
And it represents my userId.
I have to pass that userId as a parameter to a function, but first I need to remove the number 5 after the second # and append a new number.
I started with something like this:
userService.getUser(user.userId.substring(0, userAfterMigration.userId.indexOf("#") + 1) + 3
What is the best way of removing everything that comes after the second # character in string above using Java?

Here is a splitting option:
String input = "analitics#gmail.com#5";
String output = String.join("#", input.split("#")[0], input.split("#")[1]) + "#";
System.out.println(output); // analitics#gmail.com#
Assuming your input only has two # symbols, you could also use a regex replacement here:
String input = "analitics#gmail.com#5";
String output = input.replaceAll("#[^#]*$", "#");
System.out.println(output); // analitics#gmail.com#

You can capture in group 1 what you want to keep, and match what comes after it to be removed.
In the replacement use capture group 1 denoted by $1
^((?:[^#\s]+#){2}).+
^ Start of string
( Capture group 1
(?:[^#\s]+#){2} Repeat 2 times matching 1+ chars other than # or whitespace, and then match the #
) Close group 1
.+ Match 1 or more characters that you want to remove
String s = "analitics#gmail.com#5";
System.out.println(s.replaceAll("^((?:[^#\\s]+#){2}).+", "$1"));
Output
analitics#gmail.com#
If the string can also start with ##1 and you want to keep ## then you might also use:
^((?:[^#]*#){2}).+
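To cover the whole task from the question (drop what follows the second # and append a new number), the capture-group idea could be wrapped in a small helper. This is only a sketch: the class UserIdSuffix and the method withSuffix are invented names, and it uses [^#]* and .* so that inputs such as ##1 or a userId with nothing after the second # are also handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class UserIdSuffix {
    // Group 1: everything up to and including the second '#'.
    private static final Pattern PREFIX = Pattern.compile("^((?:[^#]*#){2}).*");

    // Replaces whatever follows the second '#' with newSuffix.
    // Assumption: a userId with fewer than two '#' characters is returned unchanged.
    static String withSuffix(String userId, int newSuffix) {
        Matcher m = PREFIX.matcher(userId);
        if (!m.matches()) {
            return userId;
        }
        return m.group(1) + newSuffix;
    }

    public static void main(String[] args) {
        System.out.println(withSuffix("analitics#gmail.com#5", 3)); // analitics#gmail.com#3
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call, which replaceAll would otherwise do.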

The simplest way that would seem to work for you:
str = str.replaceAll("#[^.]*$", "");
This matches (and replaces with blank to delete) # and any non-dot chars to the end.

Java regular expression match two same number

I want to use a regex to match file paths like the ones below:
../90804/90804_0.jpg
../89246/89246_8.jpg
../89247/89247_14.jpg
Currently, I use the code as below to match:
Pattern r = Pattern.compile("^(.*?)[/](\\d+?)[/](\\d+?)[_](\\d+?).jpg$");
Matcher m = r.matcher(file_path);
But I found that it also matches unexpected paths such as:
../90804/89246_0.jpg
Is it possible for a regex to require the same number in both places?

You may use a \2 backreference instead of the second \d+ here:
s.matches("(.*?)/(\\d+)/(\\2)_(\\d+)\\.jpg")
Note that if you use the matches method, you don't need the ^ and $ anchors.
Details
(.*?) - Group 1: any 0+ chars other than line break chars as few as possible
/ - a slash
(\\d+) - Group 2: one or more digits
/ - a slash
(\\2) - Group 3: the same value as in Group 2
_ - an underscore
(\\d+) - Group 4: one or more digits
\\.jpg - .jpg.
Java demo:
Pattern r = Pattern.compile("(.*?)/(\\d+)/(\\2)_(\\d+)\\.jpg");
Matcher m = r.matcher(file_path);
if (m.matches()) {
System.out.println("Match found");
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
System.out.println(m.group(4));
}
Output:
Match found
..
90804
90804
0

You can use this regex with a capture group and back-reference of the same:
(\d+)/\1
Equivalent Java regex string will be:
final String regex = "(\\d+)/\\1";
Details:
(\d+): Match 1+ digits and capture it in group #1
/: Match literal /
\1: Using back-reference #1, match same number as in group #1
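Used with find(), the bare (\d+)/\1 can also match in the middle of a longer number, so a slightly stricter variant may be safer. The sketch below is an assumption on top of the answer: it adds a leading / and a trailing _\d+\.jpg$ around the back-reference, and the names SameNumberPath and hasMatchingNumbers are made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class SameNumberPath {
    // "/<number>/<same number>_<index>.jpg" at the end of the path.
    // The leading "/" and trailing "_" stop the back-reference from
    // matching only part of a longer number.
    private static final Pattern SAME_NUMBER =
            Pattern.compile("/(\\d+)/\\1_\\d+\\.jpg$");

    static boolean hasMatchingNumbers(String path) {
        return SAME_NUMBER.matcher(path).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMatchingNumbers("../90804/90804_0.jpg")); // true
        System.out.println(hasMatchingNumbers("../90804/89246_0.jpg")); // false
    }
}
```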

This regex, ^(.*)\/(\d+?)\/(\d+?)_(\d+?)\.jpg$, matches strings like these:
../90804/90804_0.jpg
../89246/89246_8.jpg
../89247/89247_14.jpg
and splits each into 4 parts. Note that it does not check that the two numbers are the same.

How to extract the session id from an RTSP message's content?

I have a string like this:
RTSP/1.0 200 OK
CSeq: 3
Server: Ants Rtsp Server/1.0
Date: 21 Oct 2016 15:55:30 GMT
Session: 980603187; timeout=60
Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800
I want to extract the session number(980603187)
Could someone please provide some help?

Simply use a regular expression with a group, then extract the value of the group as follows:
String content ="RTSP/1.0 200 OK\n" +
"CSeq: 3\n" +
"Server: Ants Rtsp Server/1.0\n" +
"Date: 21 Oct 2016 15:55:30 GMT\n" +
"Session: 980603187; timeout=60\n" +
"Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800\n";
Pattern pattern = Pattern.compile("Session: ([a-zA-Z0-9$\\-_.+]+)");
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output:
980603187
Explanation:
Session: ([a-zA-Z0-9$\\-_.+]+)
Session: matches the characters Session: literally (case sensitive)
([a-zA-Z0-9$\\-_.+]+): Capturing group that matches with several consecutive ALPHA, DIGIT or SAFE characters (at least one) (cf RFC 2326 chapter 3.4 Session Identifiers)

Use a regex! Given String str = ..., extract the number with a pattern that captures everything between Session: and ;:
Session: (.+);
Feel free to restrict the group to word characters (\\w+) or digits (\\d+). Mind the double escaping in Java. The result is in m.group(1):
Pattern p = Pattern.compile("Session: (.+);");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
This outputs 980603187.
In some cases the ; timeout part is optional, so you need to amend the regex:
Session: (.+?)[\n;]

Once you have each header you can look up the specification in RFC 2326, which specifies the RTSP protocol.
First of all, you should split your string into lines. The lines end with CR/LF according to the specification. The first line indicates the response; the others should be header fields.
The definition is:
Session = "Session" ":" session-id [ ";" "timeout" "=" delta-seconds ]
where session-id is specified as:
session-id = 1*( ALPHA | DIGIT | safe )
which means you should not confuse it with a number. The definition of safe is
safe = "\$" | "-" | "_" | "." | "+"
and alpha means all upper- and lowercase letters. This means it is possible to put in a base 64 url encoded binary session-id, by the way.
OK, now it becomes a question of looking for the session ID. You step through all lines (except the first one) and then look for the line that matches:
^Session[ \t]*:[ \t]*([a-zA-Z0-9\$\-_.+]+).*$
this will match only valid session headers / valid session identifiers. Note that the standard is vague about white-space, so I skipped over space and tab characters before and after the colon ':'. The session identifier is then in group 1 of the regular expression.
You can of course easily extend this by including the timeout in the regular expression, once you need it.
Note that you will have to double escape the backslash characters before using the regular expression in Java. It's also possible to use the Posix character classes defined in the Pattern class to make the regular expression more readable.
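The line-by-line approach described above might look roughly like this in Java. It is a sketch, not a full RTSP parser: the class RtspSession and the method sessionId are invented names, and splitting on \r?\n (to tolerate bare LF as in the question's string) is an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class RtspSession {
    // The header regex from above, with backslashes doubled for a Java string.
    private static final Pattern SESSION_HEADER =
            Pattern.compile("^Session[ \\t]*:[ \\t]*([a-zA-Z0-9$\\-_.+]+).*$");

    // Returns the session identifier, or Optional.empty() if no Session header is present.
    static Optional<String> sessionId(String response) {
        // Assumption: accept both CR/LF and bare LF line endings.
        String[] lines = response.split("\\r?\\n");
        // Start at 1: the first line is the status line, not a header.
        for (int i = 1; i < lines.length; i++) {
            Matcher m = SESSION_HEADER.matcher(lines[i]);
            if (m.matches()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String response = "RTSP/1.0 200 OK\r\n"
                + "CSeq: 3\r\n"
                + "Session: 980603187; timeout=60\r\n";
        System.out.println(sessionId(response).orElse("(no session)")); // 980603187
    }
}
```

Returning the identifier as a String rather than a number follows the point above that a session-id is not necessarily numeric.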

If you have Apache Commons Lang among your dependencies, then you can do it in one line:
StringUtils.substringBetween(string, "Session: ", ";");

Making a Regex More Dynamic

I posted this question a couple weeks ago pertaining to extracting a capture group using regex in Java, Extracting Capture Group Using Regex, and I received a working answer. I also posted this question a couple weeks ago pertaining to character replacement in Java using regex, Replace Character in Matching Regex, and received an even better answer that was more dynamic than the one I got from my first post. I'll quickly illustrate by example. I have a string like this that I want to extract the "ID" from:
String idInfo = "Any text up here\n" +
"Here is the id\n" +
"\n" +
"?a0 12 b5\n" +
"&Edit Properties...\n" +
"And any text down here";
And in this case I want the output to just be:
a0 12 b5
But it turns out the ID could be any number of octets (just has to be 1 or more octets), and I want my regex to be able to basically account for an ID of 1 octet then any number of subsequent octets (from 0 to however many). The person I received an answer from in my Replace Character in Matching Regex post did this for a similar but different use case of mine, but I'm having trouble porting this "more dynamic" regex over to the first use case.
Currently, I have ...
Pattern p = Pattern.compile("(?s)?:Here is the id\n\n\\?([a-z0-9]{2})|(?<!^)\\G:?([a-z0-9]{2})|.*?(?=Here is the id\n\n\\?)|.+");
Matcher m = p.matcher(certSerialNum);
String idNum = m.group(1);
System.out.println(idNum);
But it's throwing an exception. In addition, I would actually like it to use all known adjacent text in the pattern including "Here is the id\n\n\?" and "\n&Edit Properties...". What corrections do I need to get this working?

Seems like you want something like this,
String idInfo = "Any text up here\n" +
"Here is the id\n" +
"\n" +
"?a0 12 b5\n" +
"&Edit Properties...\n" +
"And any text down here";
Pattern regex = Pattern.compile("Here is the id\\n+\\?([a-z0-9]{2}(?:\\s[a-z0-9]{2})*)(?=\\n&Edit Properties)");
Matcher matcher = regex.matcher(idInfo);
while(matcher.find()){
System.out.println(matcher.group(1));
}
Output:
a0 12 b5

Regular Expression for string in java

I am trying to write a regular expression for these kinds of strings:
05 IMA-POLICY-ID PIC X(15). 00020068
05 (AMENT)-GROUPCD PIC X(10).
I want to parse everything between 05 and the first tab.
The line might start with tabs or spaces, followed by the digits.
The initial number can be anything: 05, 10, 15.
So in the first line I need to parse IMA-POLICY-ID, and in the second line (AMENT)-GROUPCD.
This is the code I have written, and it is not finding the pattern. Where am I going wrong?
Pattern p1 = Pattern.compile("^[0-9]+\\s\\S+\t$");
Matcher m1 = p1.matcher(line);
System.out.println("m1 =="+m1.group());

Pattern p1 = Pattern.compile("\\b(?:05|1[05])\\b[^\\t]*\\t");
will match anything from 05, 10 or 15 until the nearest \t.
Explanation:
\b # Start of number/word
(?:05|1[05]) # Match 05, 10 or 15
\b # End of number/word
[^\t]* # Match any number of characters except tab
\t # Match a tab

^\d+\s+([^\s]+)
this will match your requirement
demo here : http://regex101.com/r/rQ7fT3

Your regex is almost correct. Just remove the \t$ at the end of your regex and capture the \\S+ as a group.
Pattern p1 = Pattern.compile("^[0-9]+\\s(\\S+)");
Now print it as:
Matcher m1 = p1.matcher(line);
if (m1.find()) {
System.out.println(m1.group(1));
}

Your pattern expects the line to end after IMA-POLICY-ID etc, because of the $ at the end.
If there is no white space in the string you want to match (I assume there isn't because of your use of \S+), I'd change the pattern to ^\d+\s+(\S+), which should be sufficient to match any number at the start of a line, followed by whitespace and then the group of non-whitespace characters you want to match (note that a tab is whitespace as well).
If you need to match until the first tab or the end of the input and include other whitespace, replace (\S+) with ([^\t]+).
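A minimal sketch of the ^\d+\s+(\S+) suggestion in Java follows. The leading \s* is an addition here, to allow for the leading tabs or spaces mentioned in the question; the class CopybookField and the method fieldName are made-up names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class CopybookField {
    // Optional leading whitespace, the level number, whitespace,
    // then the field name as a run of non-whitespace characters (group 1).
    private static final Pattern FIELD = Pattern.compile("^\\s*\\d+\\s+(\\S+)");

    // Returns the field name, or Optional.empty() if the line has no level number.
    static Optional<String> fieldName(String line) {
        Matcher m = FIELD.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fieldName("05 IMA-POLICY-ID\tPIC X(15).\t00020068").orElse(""));
        System.out.println(fieldName("\t05 (AMENT)-GROUPCD\tPIC X(10).").orElse(""));
    }
}
```

Calling find() before group(1) matters: the code in the question calls group() on a fresh Matcher, which throws IllegalStateException.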

I can see two things that might prevent your Pattern from working.
Firstly, your input Strings contain multiple tab-separated values, so the $ "end-of-input" anchor at the end of your Pattern will fail to match the String.
Secondly, you want to find what's in between 05 (etc.) and the 1st tab. Therefore you need to wrap your desired expression in parentheses (e.g. (\\S+)) and refer to it by its group number (in this case, it would be group 1).
Here's an example:
String input = "05 IMA-POLICY-ID\tPIC X(15).\t00020068" +
"\r\n05 (AMENT)-GROUPCD\tPIC X(10).";
// Pattern parts, left to right:
//   [015]{2}  0, 1, or 5 twice (refine here if needed)
//   \\s       1 whitespace
//   (.+?)     your queried expression (here I use a reluctant dot search)
//   \t        tab
//   .+?       anything after, reluctant
Pattern p = Pattern.compile("[015]{2}\\s(.+?)\t.+?");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Found: " + m.group(1));
}
Output
Found: IMA-POLICY-ID
Found: (AMENT)-GROUPCD

This is what I came up with, and it worked:
String re = "^\\s+\\d+\\s+([^\\s]+)";
Pattern p1 = Pattern.compile(re, Pattern.MULTILINE);
Matcher m1 = p1.matcher(line);
