Regular expression for multiline java - java

I need to parse and extract values from a sql log similar to the one below.
SQL^^0001^^ABCDEF^^26^^XYZ
SQL^^0002^^ABCDEF^^26^^XYZ
abc
<>()_asc wHERE
SQL^^0003^^ABCDEF^^12^^XYZ
SQL^^0004^^ABCDEF^^28^^XYZ
But the log entries do not always fit on a single line. I have a regex that captures an entry when it is on a single line. All fields have a fixed length except the last one, which can vary in length.
(\w{3})\W{2}(\d{4})\W{2}(\w{6})\W{2}(\d{2})\W{2}(.*)
^^ is the delimiter, but it can be any other value.
There is no fixed end-of-record character; I need to capture everything up to the next line that starts with SQL in this case.
How can I parse the log and extract the values when an entry spans multiple lines? I'm trying this in Java; a Java or Scala solution is preferred.

You may leverage the fact that each record starts with exactly 3 word chars followed with ^^. Thus, the last field you match should match any lines that do not start with that pattern. If the ^^ are just an example, you may just use the whole \w{3}\W{2}\d{4}\W{2}\w{6}\W{2}\d{2}\W{2} pattern as the delimiter instead of ^^.
Use
(?m)^(\w{3})\W{2}(\d{4})\W{2}(\w{6})\W{2}(\d{2})\W{2}(.*(?:\r?\n(?!\w{3}\^\^).*)*)
See the regex demo. If the ^^ is just a placeholder, as mentioned above, replace (?!\w{3}\^\^) with (?!\w{3}\W{2}\d{4}\W{2}\w{6}\W{2}\d{2}\W{2}). Or, perhaps, a shorter one will do, too: (?!\w{3}\W{2}\d{4}\b).
Details
(?m)^ - start of a line ((?m) is a Pattern.MULTILINE embedded flag option that makes ^ match a line start rather than a string start position)
(\w{3}) - Group 1: three word chars
\W{2} - 2 non-word chars
(\d{4}) - Group 2: four digits
\W{2} - 2 non-word chars
(\w{6}) - Group 3: six word chars
\W{2} - 2 non-word chars
(\d{2}) - Group 4: 2 digits
\W{2} - 2 non-word chars
(.*(?:\r?\n(?!\w{3}\^\^).*)*) - Group 5:
.* - any 0+ chars other than line break chars, as many as possible
(?:\r?\n(?!\w{3}\^\^).*)* - zero or more consecutive occurrences of:
\r?\n(?!\w{3}\^\^) - CRLF or LF line break not followed with 3 word chars and then ^^
.* - the rest of the line
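The pattern above can be applied in Java roughly as follows. This is a minimal sketch, assuming the whole log is already in memory as one string; the class name MultilineLogParser and the method parse are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineLogParser {
    // Same pattern as above; group 5 absorbs continuation lines
    // that do not start with three word chars followed by ^^.
    private static final Pattern RECORD = Pattern.compile(
        "(?m)^(\\w{3})\\W{2}(\\d{4})\\W{2}(\\w{6})\\W{2}(\\d{2})\\W{2}"
        + "(.*(?:\\r?\\n(?!\\w{3}\\^\\^).*)*)");

    // Returns one String[5] per record, holding groups 1 to 5.
    static List<String[]> parse(String log) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(log);
        while (m.find()) {
            String[] fields = new String[5];
            for (int i = 0; i < 5; i++) {
                fields[i] = m.group(i + 1);
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "SQL^^0001^^ABCDEF^^26^^XYZ",
            "SQL^^0002^^ABCDEF^^26^^XYZ",
            "abc",
            "<>()_asc wHERE",
            "SQL^^0003^^ABCDEF^^12^^XYZ",
            "SQL^^0004^^ABCDEF^^28^^XYZ");
        for (String[] r : parse(log)) {
            System.out.println(String.join(" | ", r));
        }
    }
}
```

With the sample log, the second record's last field keeps its embedded line breaks, so it may need further cleanup depending on how the value is used.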

Related

Regex help in android

I have two lines in an ArrayList that contain numbers:
line1 1234 5694 7487
line2 10/02/1992 or 1992
I used a different regex for each line. With the regex ([0-9]{4}//s?)([0-9]{4}//s?)([0-9]{4}//n) I get the first line fine.
But to check line2 I used ([0-9]{2}[/-])?([0-9]{2}[/-])?([0-9]{4}).
Instead of returning the last line, this regex returns the first 4 digits of line1.
As stated in the comments, you are using .matches, which returns true only if the whole string can be matched.
In your pattern ([0-9]{2}[/-])?([0-9]{2}[/-])?([0-9]{4}) it would also match only 4 digits as the first 2 groups ([0-9]{2}[/-])?([0-9]{2}[/-])? are optional due to the question mark ? leaving the 3rd group ([0-9]{4}) able to match 4 digits.
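The effect of the optional groups can be seen in a small sketch. It assumes the pattern is searched with Matcher.find() rather than matched against the whole string; the class name OptionalGroupsDemo and the helper firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupsDemo {
    // The second pattern from the question: both date parts are optional.
    static final Pattern DATE =
        Pattern.compile("([0-9]{2}[/-])?([0-9]{2}[/-])?([0-9]{4})");

    // Returns the first substring the pattern finds, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = DATE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both optional groups are skipped, so four digits are enough.
        System.out.println(firstMatch("1234 5694 7487"));     // prints 1234
        System.out.println(firstMatch("10/02/1992 or 1992")); // prints 10/02/1992
    }
}
```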
What you might do instead is use an alternation to either match a date-like format in which the first 2 parts, including the delimiter, are required, or match 3 groups of 4 digits.
.*?(?:(?:[0-9]{2}[/-]){2}[0-9]{4}|[0-9]{4}(?:\h[0-9]{4}){2}).*
Explanation
.*? Match any character except a newline non greedy
(?: Non capturing group
(?:[0-9]{2}[/-]){2} Repeat 2 times matching 2 digits and / or -
[0-9]{4} Match 4 digits
| Or
[0-9]{4} Match 4 digits
(?:\h[0-9]{4}){2} Repeat 2 times matching a horizontal whitespace char and 4 digits
) Close non capturing group
.* Match 0+ times any character except a newline
For example
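import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<String> list = Arrays.asList(
            "10/02/1992 or 1992",
            "10/02/1992",
            "10/1992",
            "02/1992",
            "1992",
            "1234 5694 7487"
        );
        // Either a full date with 2 delimited parts or three 4-digit groups
        String regex = ".*?(?:(?:[0-9]{2}[/-]){2}[0-9]{4}|[0-9]{4}(?:\\h[0-9]{4}){2}).*";
        for (String str : list) {
            // matches() must cover the whole string, hence the leading .*? and trailing .*
            if (str.matches(regex)) {
                System.out.println(str);
            }
        }
    }
}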
Result
10/02/1992 or 1992
10/02/1992
1234 5694 7487
Note that in your first pattern I think you mean \\s instead of //s.
The \\s will also match a newline. If you want to match a single space you could just match that or use \\h to match a horizontal whitespace character.

RegEx of underscore delimited string

I have a string with 5 pieces of data delimited by underscores:
AAA_BBB_CCC_DDD_EEE
I want a different regex for each component.
The regex needs to return just the one component.
For example, the first would return just AAA, the second for BBB, etc.
I am able to parse out AAA with the following:
^([^_]*)?
I see that I can do a look-around like this to find:
(?<=[^_]*_).*
BBB_CCC_DDD_EEE
But the following can not find just BBB
(?<=[^_]*_)[^_]*(?=_)
Mixing lookbehind and lookahead
^([^_]+)? // 1st
(?<=_)[^_]+ // 2nd
(?<=_)[^_]+(?=_[^_]+_[^_]+$) // 3rd
(?<=_)[^_]+(?=_[^_]+$) // 4th
[^_]+$ // 5th
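As a rough check, the five patterns above can be run with Java's Matcher.find(), taking only the first match of each. This is a sketch under that assumption; the class name UnderscoreLookarounds and the helper first are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreLookarounds {
    // The five patterns listed above, one per component.
    static final String[] PATTERNS = {
        "^([^_]+)?",
        "(?<=_)[^_]+",
        "(?<=_)[^_]+(?=_[^_]+_[^_]+$)",
        "(?<=_)[^_]+(?=_[^_]+$)",
        "[^_]+$"
    };

    // Returns the first match of the given pattern, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String p : PATTERNS) {
            System.out.println(p + " -> " + first(p, "AAA_BBB_CCC_DDD_EEE"));
        }
    }
}
```

Note that the 2nd pattern only isolates BBB because the first match is taken; repeated find() calls would also return the later components.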
If the lengths of the strings between the "_" are known, it can be done like this:
1st match
^([^_]+)?
2nd match
(?<=_)\K[^_]+
3rd match
(?<=_[A-Za-z]{3}_)\K[^_]+
4th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
5th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
Each {3} expresses the length of the string between the "_" characters.
If your string always uses underscores, you might use 1 regex to capture your values in a capturing group by repeating the pattern of what comes before (in this case NOT an underscore, followed by an underscore) using a quantifier that you can change, like {3}.
This way the quantifier specifies how many times to repeat the preceding pattern before capturing your match. For your example string AAA_BBB_CCC_DDD_EEE you could use {0}, {1}, {2}, {3} or {4}.
^(?:[^_\n]+_){3}([0-9A-Za-z]+)(?:_[^_\n]+)*$
That would match:
^ Assert position at start of the line
(?:[^_\n]+_){3} In a non capturing group (?:, match NOT an underscore or a new line one or more times [^_\n]+ followed by an underscore, and repeat that n times (in this example n is 3)
([0-9A-Za-z]+) Capture your characters in a group using for example a character class (or use [^_]+ to match not an underscore, but that will also match any whitespace characters)
(?:_[^_\n]+)* Following your captured values, repeat a non capturing group that matches an underscore and then NOT an underscore or a new line one or more times; repeat that group zero or more times to get a full match
$ Assert position at the end of the line
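The quantifier can be filled in at run time. The following is a minimal sketch in Java, assuming a non-negative zero-based index; the class name UnderscoreField and the method field are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreField {
    // Returns the component at the given zero-based index, or null if the
    // input does not have that many underscore-separated components.
    // Assumes index >= 0; a negative value would produce an invalid quantifier.
    static String field(String input, int index) {
        String regex = "^(?:[^_\\n]+_){" + index + "}([0-9A-Za-z]+)(?:_[^_\\n]+)*$";
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(i + " -> " + field("AAA_BBB_CCC_DDD_EEE", i));
        }
    }
}
```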

Regex string validation

Trying to write some regex to validate a string, where null and empty strings are not allowed, but characters + new line should be allowed. The string I'm trying to validate is as follows:
First line \n
Second line \n
This is as far as I got:
^(?!\s*$).+
This fails my validation because of the new line. Any ideas? I should add, I cannot use awk.
Code
The following regex matches the entire line.
See regex in use here
^[^\r\n]*?\S.*$
The following regexes do the same as above, except that they're used for validation purposes only (they don't match the whole line; they simply ensure it's properly formed). The benefit of using these regexes over the one above is the number of steps (performance). In the regex101 links below they show as 28 steps as opposed to 34 for the pattern above.
See regex in use here
^[^\r\n]*?\S
See regex in use here
^.*?\S
Results
Input
First line \n
Second line \n
s
Output
Matches only
First line \n
Second line \n
s
Explanation
^ Assert position at the start of the line
[^\r\n]*? Match any character not present in the set (any character except the carriage return or line-feed characters) any number of times, but as few as possible (making this lazy increases performance: fewer steps)
\S Match any non-whitespace character
.* Match any character (excludes newline characters) any number of times
$ Assert position at the end of the line
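In Java, the validation-only variant could be used along these lines. This is a sketch that assumes multiline mode (the (?m) flag), so that ^ matches at the start of every line; the null check, the class name NonBlankValidator and the method isValid are additions for illustration.

```java
import java.util.regex.Pattern;

class NonBlankValidator {
    // (?m) makes ^ match at the start of every line, so any line that
    // contains a non-whitespace character satisfies the pattern.
    private static final Pattern HAS_TEXT = Pattern.compile("(?m)^[^\\r\\n]*?\\S");

    // Rejects null, empty and whitespace-only input; accepts anything else,
    // including text that contains line breaks.
    static boolean isValid(String s) {
        return s != null && HAS_TEXT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("First line \nSecond line \n"));
        System.out.println(isValid("   \n \t"));
        System.out.println(isValid(""));
    }
}
```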
Try this pattern:
([\S ]*(\n)*)*

Regex for only two comma separated values, keeping second value optional

I am creating a regex for two comma-separated values (for example, coordinates). I am using a regex like the one below:
^(\-?\d+(\.\d+)?),\s*(\-?\d+(\.\d+)?)$
The above regex requires two comma-separated values, but I want the second value, including the comma, to be optional, so I tried changing the regex like this:
^(\-?\d+(\.\d+)?)(,\s*(\-?\d+(\.\d+)?)$)?
This works and keeps the second value optional, but it also allows a comma without any second value, as below:
3456,
What can be added to the regex to disallow the comma if the second value is not present? Thanks.
You placed the $ anchor inside the optional group, before the ? quantifier. Move it after the quantifier.
Use
^(-?\d+(\.\d+)?)(,\s*(-?\d+(\.\d+)?))?$
                                     ^^
See the regex demo.
You may adjust the number of capturing groups in your pattern and convert the optional group into a non-capturing one by adding ?: after the opening (. I'd use it like
^(-?\d+(?:\.\d+)?)(?:,\s*(-?\d+(?:\.\d+)?))?$
See another demo.
Also note you do not need to escape a hyphen outside a character class.
When using it in Java, do not forget to use double backslashes to define a literal backslash in the string literal and omit ^ and $ if you use the pattern with .matches() method:
s.matches("-?\\d+(?:\\.\\d+)?(?:,\\s*-?\\d+(?:\\.\\d+)?)?")
Details:
^ - start of string anchor
(-?\d+(\.\d+)?) - Group 1 matching an optional hyphen, 1+ digits, then an optional sequence (Group 2) of a dot followed with one or more digits
(,\s*(-?\d+(\.\d+)?))? - an optional sequence (Group 3) matching one or zero occurrences of:
, - comma
\s* - zero or more whitespaces
(-?\d+(\.\d+)?) - Group 4 matching
-? - an optional hyphen
\d+ - one or more digits
(\.\d+)? - Group 5 matching an optional sequence of a dot followed with 1 or more digits
$ - end of string
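Putting the non-capturing variant to work in Java might look like this. It is a sketch only: the class name CoordinateParser and the method parse are hypothetical, and returning null for invalid input is just one possible design.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CoordinateParser {
    // No ^ and $ needed because Matcher.matches() must cover the whole input.
    private static final Pattern VALUES = Pattern.compile(
        "(-?\\d+(?:\\.\\d+)?)(?:,\\s*(-?\\d+(?:\\.\\d+)?))?");

    // Returns {first, second}; second is null when it is absent.
    // Returns null when the input is not valid.
    static String[] parse(String s) {
        Matcher m = VALUES.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(parse("12.5, -7.25")[1]); // second value present
        System.out.println(parse("3456")[1]);        // second value absent
        System.out.println(parse("3456,") == null);  // trailing comma rejected
    }
}
```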

Match first occurrence of semicolon in string, only if not preceded by '--'

I'm trying to write a regular expression for Java that matches if there is a semicolon that does not have two (or more) leading '-' characters.
I'm only able to get the opposite working: A semicolon that has at least two leading '-' characters.
([\-]{2,}.*?;.*)
But I need something like
([^([\-]{2,})])*?;.*
I'm somehow not able to express 'not at least two - characters'.
Here are some examples I need to evaluate with the expression:
; -- a : should match
-- a ; : should not match
-- ; : should not match
--; : should not match
-;- : should match
---; : should not match
-- semicolon ; : should not match
bla ; bla : should match
bla : should not match (; is mandatory)
-;--; : should match (the first occurring semicolon must not have two or more consecutive leading '-')
It seems that this regex matches what you want
String regex = "[^-]*(-[^-]+)*-?;.*";
Explanation: matches will accept string that:
[^-]* can start with non dash characters
(-[^-]+)*-?; is a bit tricky because before we match ; we need to make sure that no - has another - right after it, so:
(-[^-]+)* each - has at least one non - character after it
-? or a single - is placed right before ;
;.* if the earlier conditions were fulfilled we can accept ; and any .* characters after it.
A more readable version, but probably a little slower:
((?!--)[^;])*;.*
Explanation:
To make sure that there is ; in string we can use .*;.* in matches.
But we need to add some conditions to characters before first ;.
So to make sure that matched ; will be first one we can write such regex as
[^;]*;.*
which means:
[^;]* zero or more non semicolon characters
; first semicolon
.* zero or more of any characters (actually . can't match line separators like \n or \r)
So now all we need to do is make sure that the character matched by [^;] is not part of --. To do so we can use look-around mechanisms, for instance:
(?!--)[^;] before matching [^;], (?!--) checks that the next two characters are not --; in other words, the character matched by [^;] can't be the first - in a series of two --
[^;](?<!--) after matching [^;], (?<!--) checks that the two characters just consumed are not --; in other words, the character matched by [^;] can't be the last character in a series of --.
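Both patterns from this answer can be checked against the examples in the question. The sketch below assumes they are used with String.matches(); the class name SemicolonCheck, the constant names and the helper accepts are hypothetical.

```java
class SemicolonCheck {
    // Both patterns come from the answer above and are meant for String.matches().
    static final String FAST = "[^-]*(-[^-]+)*-?;.*";
    static final String READABLE = "((?!--)[^;])*;.*";

    // True when the whole input is accepted by the given pattern.
    static boolean accepts(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // The sample inputs from the question.
        String[] samples = {
            "; -- a", "-- a ;", "-- ;", "--;", "-;-",
            "---;", "-- semicolon ;", "bla ; bla", "bla", "-;--;"
        };
        for (String s : samples) {
            System.out.println("[" + s + "] fast=" + accepts(FAST, s)
                + " readable=" + accepts(READABLE, s));
        }
    }
}
```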
How about just splitting the string along -- and if there are two or more sub strings, checking if the last one contains a semicolon?
How about using this regex in Java:
[^;]*;(?<!--[^;]{0,999};).*
Only caveat is that it works with up to 999 character length between -- and ;
I think this is what you're looking for:
^(?:(?!--).)*;.*$
In other words, match from the start of the string (^), zero or more characters (.*) followed by a semicolon. But replacing the dot with (?:(?!--).) causes it to match any character unless it's the beginning of a two-hyphen sequence (--).
If performance is an issue, you can exclude the semicolon as well, so it never has to backtrack:
^(?:(?!--|;).)*;.*$
EDIT: I just noticed your comment that the regex should work with the matches() method, so I padded it out with .*. The anchors aren't really necessary, but they do no harm.
You need a negative lookahead!
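This regex rejects any string that your original pattern would match from the start:
(?!-{2,}.*?;.*).*?;.*
It matches a string that contains a semicolon, unless the string begins with 2 or more dashes followed later by a semicolon. Because the lookahead is only tested at the start of the match, dashes that appear later in the string (as in bla -- ;) are not rejected when this is used with matches().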