Graylog Regex Replace Extractor Ungreedy - java

I'm trying to create a Regex Replace Graylog Extractor that can allow me to get an ID passed as path parameters.
The two cases I need to manage are the followings:
/v1/api2/5eb98050122d484001708a11
/v1/api1/5eb98050122d484001708a11/61b3330151e541232146bfb7/
The ID is always a 24 alphanumerical string.
First case is easy:
^.*([A-Za-z0-9]{24}).*$
First group matches the regex (https://regex101.com/r/Idu5Mp/1).
I need to always match the first ID: 5eb98050122d484001708a11
Also, I need it to match with the first group since in the configuration of the extractor I would use the replacement with $1.
Only solution I could find is to make the Regex Ungreedy, this way the first ID encountered will resolve the regex. Sadly I don't think it's possible to add Regex Flags in Graylog Regex Patterns.
Is there an alternative way to make the regex ungreedy?
Edit:
I've also tried the following one without any success. I don't understand why it always gets the second id within the first group.
^.*\/([A-Za-z0-9]{24})(?:\/[A-Za-z0-9]{24})?.*$

You can use
.*/([A-Za-z0-9]{24,})(?:/.*)?$
Replace with $1. See the regex demo.
Details:
.* - any zero or more chars other than line break chars as many as possible
/ - a / char
([A-Za-z0-9]{24,}) - Group 1: 24 or more alphanumeric ASCII chars
(?:/.*)? - an optional sequence of / and any zero or more chars other than line break chars as many as possible
$ - end of string.

Related

Regex match optional string greedy inbetween two random strings

I am looking for a way to match an optional ABC in the following strings.
Both strings should be matched either way, if ABC is there or not:
precedingstringwithundefinedlenghtABCsubsequentstringwithundefinedlength
precedingstringwithundefinedlenghtsubsequentstringwithundefinedlength
I've tried
.*(ABC).*
which doesn't work for an optional ABC but making ABC non greedy doesn't work either as the .* will take all the pride:
.*(ABC)?.*
This is NOT a duplicate to e.g. Regex Match all characters between two strings as I am looking for a certain string inbetween two random string, kind of the other way around.
You can use
.*(ABC).*|.*
This works like this:
.*(ABC).* pattern is searched for first, since it is the leftmost part of an alternation (see "Remember That The Regex Engine Is Eager"), it looks for any zero or more chars other than line break chars as many as possible, then captures ABC into Group 1 and then matches the rest of the line with the right-hand .*
| - or
.* - is searched for if the first alternation part does not match.
Another solution without the need to use alternation:
^(?:.*(ABC))?.*
See this regex demo. Details:
^ - start of string
(?:.*(ABC))? - an optional non-capturing group that matches zero or more chars other than line break chars as many as possible and then captures into Group 1 an ABC char sequence
.* - zero or more chars other than line break chars as many as possible.
I’ve come up with an answer myself:
Using the OR operator seems to work:
(?:(?:.*(ABC))|.*).*
If there’s a better way, feel free to answer and I will accept it.
You could use this regex: .*(ABC){0,1}.*. It means any, optional{min,max}, any. It is easier to read. I can' t say if your solution or mine is faster due to the processing speed.
Options:
{value} = n-times
{min,} = min to infinity
{min,max} = min to max
.+([ABC])?.+ should do the job

Java Regex to replace only part of string (url)

I want to replace only numeric section of a string. Most of the cases it's either full URL or part of URL, but it can be just a normal string as well.
/users/12345 becomes /users/XXXXX
/users/234567/summary becomes /users/XXXXXX/summary
/api/v1/summary/5678 becomes /api/v1/summary/XXXX
http://example.com/api/v1/summary/5678/single becomes http://example.com/api/v1/summary/XXXX/single
Notice that I am not replacing 1 from /api/v1
So far, I have only following which seem to work in most of the cases:
input.replaceAll("/[\\d]+$", "/XXXXX").replaceAll("/[\\d]+/", "/XXXXX/");
But this has 2 problems:
The replacement size doesn't match with the original string length.
The replacement character is hardcoded.
Is there a better way to do this?
In Java you can use:
str = str.replaceAll("(/|(?!^)\\G)\\d(?=\\d*(?:/|$))", "$1X");
RegEx Demo
RegEx Details:
\G asserts position at the end of the previous match or the start of the string for the first match.
(/|(?!^)\\G): Match / or end of the previous match (but not at start) in capture group #1
\\d: Match a digit
(?=\\d*(?:/|$)): Ensure that digits are followed by a / or end.
Replacement: $1X: replace it with capture group #1 followed by X
Not a Java guy here but the idea should be transferrable. Just capture a /, digits and / optionally, count the length of the second group and but it back again.
So
(/)(\d+)(/?)
becomes
$1XYZ$3
See a demo on regex101.com and this answer for a lambda equivalent to e.g. Python or PHP.
First of all you need something like this :
String new_s1 = s3.replaceAll("(\\/)(\\d)+(\\/)?", "$1XXXXX$3");

Java - Multiline regex bad performance with large string

I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.
I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.
My current regex:
(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$
Valid wildcard examples:
*test.com
*.test.com
*test
test.*
test*
*test*
I compile the pattern with:
private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);
I look for matches with:
// Wildcards
while (wildcardPatternMatch.find()) {
String wildcard = wildcardPatternMatch.group();
myProperty.add(new property(wildcard, providerId));
System.out.println(wildcard);
}
Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll several times to remove all of the clutter before passing for regex matching?
The pattern you need is
(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$
See the regex demo
Main points:
The first lookahead should be after ^. If it is before, the check is done before and after each char in the string. Once it is after ^, it is only performed once at the start of a line
The (?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ part, although short, is actually killing performance since the (?:(?<!-)\.)? is optional pattern, and the whole pattern gets reduced to (a+)+, a known type of pattern that causes catastrphic backtracking granted there are other subpatterns to the right of it. You need to unwrap it, the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*.
The rest is OK.
Details
(?im) - case insensitive and multiline modifiers
^ - start of a line
(?=\*|.+\*$) - the string should either start or end with *
(?:\*[.-]?)? - an optional substring matching a * and an optional . or -` char
[a-z0-9](?:[a-z0-9-]*[a-z0-9])? - an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and - followed with an alphanumeric char
(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)* - 0 or more sequences of a dot followed with the pattern described above
(?:[.-]?\*)? - an optional substring matching an optional . or -char and then a*`
$ - end of a line.
I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes
Your regex contains a repeated pattern:
(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+
Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times
Edit: Another good reference

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string and only if it is before the first .and after the start of the expression. www can be any string but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?
Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match for -test without capturing the preceding stuff. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that.. this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Well your example string shows a dot as the very next thing after your "-test" part. So even if #1 didn't fail it, this would fail it.
The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Using regex to match beginning and end of string [Java]

I have a list of files in a folder:
maze1.in.txt
maze2.in.txt
maze3.in.txt
I've used substring to remove the .txt extensions.
How do I use regex to match the front and the back of the file name?
I need it to match "maze" at the front and ".in" at the back, and the middle must be a digit (can be single or double digit).
I've tried the following
if (name.matches("name\\din")) {
//dosomething
}
It doesn't match anything. What is the correct regex expression to use?
I'm a little confused what you are asking for in particular
^(maze[0-9]*\.in)$
This will match maze(any number).in
^(maze[0-9]*\.in)\.txt$
this will match maze(any number).in.txt -- excludes the .txt NO NEED FOR USING SUB STRING!
Edit live on Debuggex
The think i would be wary about as of right now is the capture groups... I'm not particularly sure what you are doing with this regex. However, I believe explaining capture groups could benefit you.
A capture group for instance is denoted by () this is basically store them in the pattern array and is a way to parse stuff.
example maze1.in.txt
So if you want to capture the entire line minus .txt i would use this ^(maze[0-9]*\.in\.txt)$
However, if I wanted to capture things separately I would do this ^(maze)([0-9]*)(\.in)\.txt$ this will exclude .txt but include maze, the number, and .in IN separate indexes of the pattern array.
Your original solution doesn't work because string "name" is not in your text. It is "maze".
You can try this
name.matches("maze\\d{1,2}\\.in")
d{1,2} is used to match a digit(can be single or double digit).
You need regex anchors that tell the regex to
start at the beginning: ^
and signal the end of the string: $
^maze[\d]{0,2}\.in$
or in Java:
name.matches("^maze[\\d]{0,2}\\.in$");
Also, your regex wasn't matching strings with a dot (.) which would not accept your examples given. You need to add \. to the regex to accept dots because . is a special character.
It is always good to think of what you are trying to do in english, before you create regular expressions.
You want to match a word maze followed by a digit, followed by a literal period . followed by another word.
word `\w` matches a word character
digit `\d` matches a single digit
period `\.` matches a literal period
word `\w` matches a word character
putting it all together into a single string you get (keep in mind the double backslash for the Java escape and the pluses to repeat the previous match one or more times):
"\\w+\\d\\.\\w+"
The above is the generic case for any file name in the format xxx1.yyy, if you wanted to match maze and in specifically, you can just add those in as literal strings.
"maze\\d+\\.in"
example: http://ideone.com/rS7tw1
name.matches("^maze[0-9]+\\.in\\.txt$")

Categories

Resources