Making a Regex More Dynamic

A couple of weeks ago I posted a question about extracting a capture group using regex in Java, Extracting Capture Group Using Regex, and received a working answer. Around the same time I posted a question about character replacement in Java using regex, Replace Character in Matching Regex, and received an even better answer that was more dynamic than the one from my first post. I'll quickly illustrate by example. I have a string like this that I want to extract the "ID" from:
String idInfo = "Any text up here\n" +
"Here is the id\n" +
"\n" +
"?a0 12 b5\n" +
"&Edit Properties...\n" +
"And any text down here";
And in this case I want the output to just be:
a0 12 b5
But it turns out the ID can be any number of octets (at least one), so I want my regex to match one octet followed by zero or more subsequent octets. The person who answered my Replace Character in Matching Regex post did this for a similar but different use case of mine, but I'm having trouble porting this "more dynamic" regex over to the first use case.
Currently, I have ...
Pattern p = Pattern.compile("(?s)?:Here is the id\n\n\\?([a-z0-9]{2})|(?<!^)\\G:?([a-z0-9]{2})|.*?(?=Here is the id\n\n\\?)|.+");
Matcher m = p.matcher(certSerialNum);
String idNum = m.group(1);
System.out.println(idNum);
But it throws an exception. In addition, I would like the pattern to use all of the known adjacent text, including "Here is the id\n\n\?" before the ID and "\n&Edit Properties..." after it. What do I need to correct to get this working?

It seems you want something like this:
String idInfo = "Any text up here\n" +
"Here is the id\n" +
"\n" +
"?a0 12 b5\n" +
"&Edit Properties...\n" +
"And any text down here";
Pattern regex = Pattern.compile("Here is the id\\n+\\?([a-z0-9]{2}(?:\\s[a-z0-9]{2})*)(?=\\n&Edit Properties)");
Matcher matcher = regex.matcher(idInfo);
while(matcher.find()){
System.out.println(matcher.group(1));
}
Output:
a0 12 b5
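To check that the pattern really accepts one octet followed by any number of further octets, the same regex can be wrapped in a small helper and tried on inputs of different lengths. This is only a sketch: the class name OctetIdExtractor and the method extractId are hypothetical, and the sample strings are shortened versions of the one in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class OctetIdExtractor {
    // One octet, then zero or more whitespace-separated octets, anchored by the
    // known text before the ID ("Here is the id", newlines, "?") and by a
    // lookahead for the known text after it ("\n&Edit Properties").
    private static final Pattern ID_PATTERN = Pattern.compile(
            "Here is the id\\n+\\?([a-z0-9]{2}(?:\\s[a-z0-9]{2})*)(?=\\n&Edit Properties)");

    static Optional<String> extractId(String text) {
        Matcher m = ID_PATTERN.matcher(text);
        // find() has to succeed before group(1) is read; calling group()
        // without a successful match throws IllegalStateException.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String oneOctet = "Here is the id\n\n?a0\n&Edit Properties...";
        String threeOctets = "Any text up here\nHere is the id\n\n?a0 12 b5\n"
                + "&Edit Properties...\nAnd any text down here";
        System.out.println(extractId(oneOctet).orElse("no match"));
        System.out.println(extractId(threeOctets).orElse("no match"));
    }
}
```

Returning an Optional is a design choice of this sketch, so that a missing ID does not have to be signalled with null or an exception.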

Related

Masking sensitive logs using regex

I am trying to mask sensitive values in my logs by chaining regex replacements in the logback.xml file.
%replace(%replace(%msg){'"email":(.*?),','"email":"****"'}){'"phone":(.*?),','"phone":"****"'}))
It works, but is there a regex solution that avoids chaining the replacements?
Can we use a regex like this?
(%replace(%msg){'"(email|phone)":(:*?)','"***",'}
I tried the above but the format is not proper.
Required output is:
{"email":"****","phone":"****"}

You can use
(%replace(%msg){'"(email|phone)":[^,]*,?','"$1":"****"'})
The "(email|phone)":[^,]*,? regex matches
" - a " char
(email|phone) - Group 1 ($1): email or phone string
": - a ": string
[^,]* - zero or more chars other than a comma
,? - an optional , char.
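The replacement is "$1":"****": " + Group 1 value + ":"****".
Note that ,? consumes the comma after the value, and for the last field [^,]* also runs to the end of the message, so the separator and a closing brace are not preserved. If they must be kept, match only the quoted value instead, e.g. "(email|phone)":"[^"]*".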
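The single-alternation idea can be tried outside logback with plain String.replaceAll. The sketch below is not logback configuration; the class name LogMasker and the sample message are hypothetical. It assumes every value is a double-quoted string without escaped quotes, and it matches only the quoted value so that commas and braces around it are left alone.

```java
// Hypothetical helper; not part of logback.
final class LogMasker {
    // Group 1 is the key name (email or phone). The value is assumed to be
    // a double-quoted string that contains no escaped quotes.
    static String mask(String message) {
        return message.replaceAll("\"(email|phone)\":\"[^\"]*\"", "\"$1\":\"****\"");
    }

    public static void main(String[] args) {
        String msg = "{\"email\":\"a@b.com\",\"phone\":\"12345\"}";
        System.out.println(mask(msg));
    }
}
```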

Java replace all occurrences of regex with another regex

Let's say I have a string containing XML with many occurrences of <tagA>:
String example = " (...) some xml here (...)
<tagA>283940</tagA>
(...) some xml here (...)
<tagA>& 9940</tagA>
<tagA>- 99440</tagA>
<tagA>< 99440</tagA>
<tagA>99440</tagA>
(...) more xml here (...) "
The content should contain only digits, but sometimes it has a random character followed by a whitespace and then the digits.
I want to remove the unwanted character and the whitespace. How can I do that?
So far I know I should be looking for a regex like "<tagA>. [0-9]*<\/tagA>", but I am stuck here.
I want to remove these characters because they include "&", ">" and "<" signs, which make the XML invalid (and prevent me from processing it as XML).

The regex that you're looking for is:
<(\w+)>(\D{0,})(\d+)
In the match, Group 1 holds the tag name, Group 2 holds the unwanted characters (everything that is not a digit) and Group 3 holds the number.
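As a minimal sketch of how these three groups could be used, the replacement below rebuilds the opening tag from Group 1 and keeps only Group 3, dropping Group 2. The class name TagDigitsCleaner is hypothetical, and the sketch assumes that every opening tag it should touch is followed by at least one digit before the next tag.

```java
// Hypothetical helper built around the short regex <(\w+)>(\D{0,})(\d+).
final class TagDigitsCleaner {
    static String clean(String xml) {
        // "<$1>" rebuilds the opening tag and "$3" keeps the digits;
        // Group 2 (the non-digit characters) is dropped.
        return xml.replaceAll("<(\\w+)>(\\D{0,})(\\d+)", "<$1>$3");
    }

    public static void main(String[] args) {
        System.out.println(clean("<tagA>& 9940</tagA>"));
        System.out.println(clean("<tagA>< 99440</tagA>"));
        System.out.println(clean("<tagA>283940</tagA>"));
    }
}
```

Closing tags are left untouched because \w+ cannot match the "/" that follows their "<".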
There's an "enhanced version" of this regex that might work in more situations: (\s{0,})(<\w+>)(\D{0,})(\d+)(\D{0,})(<\/\w+>)(\s{0,})
Group 1 captures any whitespace before the tag, and Group 7 captures any trailing whitespace.
Groups 2 and 6 match the opening tag and the closing tag.
Groups 3 and 5 match any unwanted characters around your value.
Group 4 contains your value.
With String::replaceAll you can filter and sanitize the input by emitting only groups 2, 4 and 6 and discarding the rest.
//input data
String s = "<tagA>283940</tagA>\n" +
" <tagA>& 9940<</tagA>\n" +
" <tagA>- 99440</tagA>\n" +
" <tagA>< 99440</tagA>\n" +
" <tagA>99440</tagA>"
+ "<13243> asdfasdf </>";
String replaced = s.replaceAll("(\\s{0,})(<\\w+>)(\\D{0,})(\\d+)(\\D{0,})(<\\/\\w+>)(\\s{0,})", "$2$4$6");
System.out.println(replaced);
Output: <tagA>283940</tagA><tagA>9940</tagA><tagA>99440</tagA><tagA>99440</tagA><tagA>99440</tagA><13243> asdfasdf </>

Search a string with spaces similar to full string in JAVA

I am facing an issue and would really appreciate your help.
I am using Java and connecting to a Postgres DB. I tried writing a query with LIKE and it works, but what I am looking for is a regex that works like LIKE while ignoring differences in white space.
For example, let's say the array of results from the DB contains the following entries:
"ca ts", "cats", "ca ts"
etc. When I type
"c a ts"
in the search filter, I should retrieve all of the above entries from that array.

You may try with replacing spaces from input and search pattern:
String input = "a b c";
String searchPattern = "ab c";
Pattern pat = Pattern.compile(searchPattern.replace(" ", ""));
System.out.println(pat.matcher(input.replace(" ", "")).matches());
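The same idea can be applied to a whole list of results. The sketch below is one possible variant, not the only way to do it: the class name SpaceInsensitiveSearch is hypothetical, the candidate list stands in for the array loaded from the database, and contains() is used to approximate LIKE '%...%' (that choice is an assumption). Comparing plain strings also avoids having to escape regex metacharacters in the user's input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; the candidate list stands in for the DB results.
final class SpaceInsensitiveSearch {
    // Removes every run of whitespace so "c a ts" and "ca ts" both become "cats".
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Keeps the candidates that contain the query once whitespace is ignored.
    static List<String> filter(List<String> candidates, String query) {
        String needle = stripWhitespace(query);
        return candidates.stream()
                .filter(c -> stripWhitespace(c).contains(needle))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("ca ts", "cats", "dogs");
        System.out.println(filter(results, "c a ts"));
    }
}
```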

You don't need a regex for this. E.g., here 'c a ts' is "like" 'ca ts':
b=# select replace('c a ts',' ','') like replace('ca ts',' ','') as example;
example
---------
t
(1 row)

How to extract the session id from an RTSP message's content?

I have a string like this:
RTSP/1.0 200 OK
CSeq: 3
Server: Ants Rtsp Server/1.0
Date: 21 Oct 2016 15:55:30 GMT
Session: 980603187; timeout=60
Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800
I want to extract the session number (980603187).
Could someone please help?

Simply use a regular expression with a group, then extract the value of the group as follows:
String content ="RTSP/1.0 200 OK\n" +
"CSeq: 3\n" +
"Server: Ants Rtsp Server/1.0\n" +
"Date: 21 Oct 2016 15:55:30 GMT\n" +
"Session: 980603187; timeout=60\n" +
"Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800\n";
Pattern pattern = Pattern.compile("Session: ([a-zA-Z0-9$\\-_.+]+)");
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output:
980603187
Explanation:
Session: ([a-zA-Z0-9$\\-_.+]+)
Session: matches the characters Session: literally (case sensitive)
([a-zA-Z0-9$\\-_.+]+): Capturing group that matches one or more consecutive ALPHA, DIGIT or SAFE characters (cf. RFC 2326, section 3.4, Session Identifiers)

Use a regex! Given String str = .., extract the number with a regex that captures anything between Session: and ;:
Session: (.+);
Feel free to restrict the group to word characters (\\w+) or digits (\\d+). Mind the double escaping in Java. The first match's m.group(1) is your result:
Pattern p = Pattern.compile("Session: (.+);");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
Outputs 980603187.
In some cases the ; timeout part is optional, so the regex needs to be amended:
Session: (.+?)[\n;]
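A short sketch of how the amended pattern behaves with and without the timeout part. The class name LazySessionId is hypothetical, and the sample responses assume that lines end with a bare \n; with CR/LF line endings this particular pattern would need a further adjustment.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the lazy pattern Session: (.+?)[\n;]
final class LazySessionId {
    // (.+?) stops at the first ';' or '\n', whichever comes first.
    private static final Pattern SESSION = Pattern.compile("Session: (.+?)[\\n;]");

    static Optional<String> find(String response) {
        Matcher m = SESSION.matcher(response);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(find("CSeq: 3\nSession: 980603187; timeout=60\n").orElse("none"));
        System.out.println(find("CSeq: 3\nSession: 980603187\nTransport: RTP/AVP/TCP\n").orElse("none"));
    }
}
```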

The RTSP protocol is specified in RFC 2326, where you can look up the definition of each header.
First of all, you should split your string into lines. According to the specification, lines end with CR/LF. The first line is the response status line; the others should be header fields.
The definition is:
Session = "Session" ":" session-id [ ";" "timeout" "=" delta-seconds ]
where session-id is specified as:
session-id = 1*( ALPHA | DIGIT | safe )
which means you should not confuse it with a number. The definition of safe is
safe = "\$" | "-" | "_" | "." | "+"
and ALPHA means all upper- and lowercase letters. This means it is possible to put in a base 64 url encoded binary session-id, by the way.
OK, now it becomes a question of looking for the session ID. You step through all lines (except the first one) and then look for the line that matches:
^Session[ \t]*:[ \t]*([a-zA-Z0-9\$\-_.+]+).*$
This will match only valid session headers with valid session identifiers. Note that the standard is vague about white space, so the expression skips over space and tab characters before and after the colon ':'. The session identifier is then in group 1 of the regular expression.
You can of course easily extend this by including the timeout in the regular expression, once you need it.
Note that you will have to double escape the backslash characters before using the regular expression in Java. It's also possible to use the Posix character classes defined in the Pattern class to make the regular expression more readable.
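The line-by-line approach described above could look roughly like this in Java. It is a sketch under a few assumptions: the class name RtspSessionHeader is hypothetical, lines are split on CR/LF or a bare LF, and the first line is treated as the status line and skipped.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; applies the Session header grammar line by line.
final class RtspSessionHeader {
    // Same expression as above, with backslashes doubled for a Java string.
    private static final Pattern SESSION_LINE = Pattern.compile(
            "^Session[ \\t]*:[ \\t]*([a-zA-Z0-9$\\-_.+]+).*$");

    static Optional<String> sessionId(String response) {
        // Accept CR/LF as in the specification, and a bare LF as a convenience.
        String[] lines = response.split("\\r?\\n");
        // Index 0 is the status line, so header fields start at index 1.
        for (int i = 1; i < lines.length; i++) {
            Matcher m = SESSION_LINE.matcher(lines[i]);
            if (m.matches()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String response = "RTSP/1.0 200 OK\r\nCSeq: 3\r\n"
                + "Session: 980603187; timeout=60\r\n";
        System.out.println(sessionId(response).orElse("no session header"));
    }
}
```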

If you have Apache Commons Lang among your dependencies, you can do it in one line:
StringUtils.substringBetween(string, "Session: ", ";");

Java String Replace Regex

I am doing some string replacement in SQL on the fly.
MySQLString = " a.account=b.account ";
MySQLString = " a.accountnum=b.accountnum ";
Now, if I do this:
MySQLString.replaceAll("account", "account_enc");
the result will be
a.account_enc=b.account_enc
(This is good)
But look at the second result:
a.account_encnum=b.account_encnum
(This is not good; it should be a.accountnum_enc=b.accountnum_enc)
Please advise how I can achieve what I want with Java string replacement.
Many Thanks.

From your comment:
Is there anyway to tell in Regex only replace a.account=b.account or a.accountnum=b.accountnum. I do not want accountname to be replace with _enc
If I understand correctly, you want to add the _enc suffix only to account or accountnum. To do this you can use
MySQLString = MySQLString.replaceAll("\\baccount(num)?\\b", "$0_enc");
(num)? means that num is optional, so the regex will accept account or accountnum.
\\b at the start means that there can be no letters, digits or "_" before account, so it won't match (affect) something like myaccount or my_account.
\\b at the end prevents a match when other letters, digits or "_" follow account or accountnum.
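A quick way to check these three points is to run the replacement on a few sample clauses. The class name EncSuffix is hypothetical, and the accountname and my_account inputs are made up for illustration.

```java
// Hypothetical helper; applies the word-boundary regex from this answer.
final class EncSuffix {
    static String addEncSuffix(String sql) {
        // $0 is the whole match, so "_enc" is appended to account or accountnum.
        return sql.replaceAll("\\baccount(num)?\\b", "$0_enc");
    }

    public static void main(String[] args) {
        System.out.println(addEncSuffix(" a.account=b.account "));
        System.out.println(addEncSuffix(" a.accountnum=b.accountnum "));
        // Left unchanged: "accountname" continues with word characters.
        System.out.println(addEncSuffix(" a.accountname=b.accountname "));
    }
}
```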

It's hard to extrapolate from so few examples, but maybe what you want is:
MySQLString = MySQLString.replaceAll("account\\w*", "$0_enc");
which will append _enc to any sequence of letters, digits, and underscores that starts with account.

Try:
String s = " a.accountnum=b.accountnum ".replaceAll("(account[^ =]*)", "$1_enc");
This replaces any sequence that starts with the word "account" and continues with characters other than ' ' or '=' with the sequence found + "_enc".
$1 is a reference to group 1 in the regex; group 1 is the expression in parentheses, (account[^ =]*), i.e. our sequence.
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html for details
