Regex to find tokens - Java Scanner or another alternative

Hi, I'm trying to write a class that splits some text into well-defined tokens.
The strings are somewhat similar to code, like: (brown) "fox" 'c';. What I would like to get (either as tokens from a Scanner or as an array after splitting; I think both would work just fine) is ( , brown , ) , "fox" , 'c' , ; separately, as they are potential tokens. The tokens include:
quoted text with ' and "
number with or without a decimal point
parentheses, braces, semicolon, equals, sharp (#), ||, <=, &&
Currently I'm doing it with a Scanner. I had some problems with the delimiter not giving me ( ) etc. separately, so I used the following delimiter: \s+|(?=[;\{\}\(\)]|\b). The problem now is that I get " and ' as separate tokens as well, and I'd really like to avoid that. I've tried adding some negative lookaheads for variations of ", but no luck.
I've also tried using StreamTokenizer, but it does not keep the different quotes.
P.S.
I did search the site and tried to google it but even though there are many Scanner related/Regex related questions, I couldn't find something that will solve my problem.
EDIT 1:
So far I came up with \s+|^|(?=[;{}()])|(?<![.\-/'"])(?=\b)(?![.\-/'"])
I might not have been clear enough, but when I have something like:
"foo";'bar')(;{
gray fox=-56565.4546;
foo boo="hello"{
I'd like to get:
"foo" ,; ,'bar',) , (,; ,{
gray,fox,=,-56565.4546,;
foo,boo,=,"hello",{
But instead I have:
"foo" ,;'bar',) , (,; ,{
gray,fox,=-56565.4546,;
foo,boo,="hello",{
Note that when there are spaces between the = and the rest, e.g. gray fox = -56565.4546;, the result is:
gray,fox,=,-56565.4546,;
What I'm doing with the above mentioned regex is :
Scanner scanner = new Scanner(line);
scanner.useDelimiter(MY_MENTIONED_REGEX_HERE);
while (scanner.hasNext()) {
System.out.println("Got: `" + scanner.next() +"`");
//Some work here
}

Description
Since you are looking for all alphanumeric text, which might include a decimal point, why not just "ignore" the delimiters and match the tokens directly? The following regex pulls every alphanumeric chunk (including decimal points) and every standalone symbol from your input string. It was written against your sample text:
"foo";'bar')(;{
gray fox=-56565.4546;
foo boo="hello"{
Regex: (?:(["']?)[-]?[a-z0-9-.]*\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)))
Summary
The regex has three paths which are:
(["']?)[-]?[a-z0-9-.]*\1 captures an optional opening quote, followed by an optional minus sign, followed by any run of letters, digits, hyphens, or dots, and finally the same quote again through the backreference. This captures any text or numbers with a decimal point. The numbers are not validated, so 12.32.1 would match. If your input text also contains numbers prefixed with a plus sign, change [-] to [+-].
(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$)) if the previous character is a symbol, the current character is a symbol, and the next character is also a symbol or the end of the string, then grab the current symbol. This captures any free-floating symbols which are not quotes, or multiple symbols in a row like )(;{.
(?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)) if the current character is not an alphanumeric or a quote, then look behind for an alphanumeric or quote and look ahead for a non-alphanumeric, a quote, or the end of the string. This captures any symbols after a quote which would not be captured by the previous expressions, like the { after "hello".
Full Explanation
(?: starts a non-capturing group. Inside this group each alternative is separated by an or | character
1st alternative: (["']?)[-]?[a-z0-9-.]*\1
1st Capturing group (["']?)
Char class ["'] 0 to 1 times, matches one of the following chars: "'
Char class [-] 0 to 1 times, matches one of the following chars: -
Char class [a-z0-9-.] 0 to infinite times, matches one of the following chars: a-z0-9-.
\1 Matches text saved in BackRef 1
2nd alternative: (?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))
(?<=[^a-z0-9]) Positive LookBehind
Negated char class [^a-z0-9] matches any char except: a-z0-9
Negated char class [^a-z0-9] matches any char except: a-z0-9
(?=(?:[^a-z0-9]|$)) Positive LookAhead, each sub alternative is separated by an or | character
Group (?:[^a-z0-9]|$)
1st alternative: [^a-z0-9]
Negated char class [^a-z0-9] matches any char except: a-z0-9
2nd alternative: $End of string
3rd alternative: (?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$))
(?<=[a-z0-9"']) Positive LookBehind
Char class [a-z0-9"'] matches one of the following chars: a-z0-9"'
Negated char class [^a-z0-9"'] matches any char except: a-z0-9"'
(?=(?:[^a-z0-9]|['"]|$)) Positive LookAhead, each sub alternative is separated by an or | character
Group (?:[^a-z0-9]|['"]|$)
1st alternative: [^a-z0-9]
Negated char class [^a-z0-9] matches any char except: a-z0-9
2nd alternative: ['"]
Char class ['"] matches one of the following chars: '"
3rd alternative: $End of string
) ends the non-capturing group
Groups
Group 0 gets the entire matched string, whereas group 1 gets the quote delimiter if it exists to ensure it'll match a close quote.
Java Code Example:
Note some of the empty values in the array are from the new line character, and some are introduced from the expression. You can apply the expression and some basic logic to ensure your output array only has non empty values.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
    public static void main(String[] args) {
        // Sample input; a Java string literal cannot span lines, so line breaks are written as \n.
        String sourcestring = "\"foo\";'bar')(;{\n"
                + "gray fox=-56565.4546;\n"
                + "foo boo=\"hello\"{";
        // Double quotes and the backreference \1 must be escaped inside the Java string literal.
        Pattern re = Pattern.compile(
                "(?:([\"']?)[-]?[a-z0-9-.]*\\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9\"'])[^a-z0-9\"'](?=(?:[^a-z0-9]|['\"]|$)))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            // Group 0 is the whole match; group 1 is the opening quote, if there was one.
            for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
$matches Array:
(
[0] => Array
(
[0] => "foo"
[1] =>
[2] => ;
[3] => 'bar'
[4] =>
[5] => )
[6] =>
[7] => (
[8] =>
[9] => ;
[10] =>
[11] => {
[12] =>
[13] =>
[14] =>
[15] => gray
[16] =>
[17] => fox
[18] =>
[19] => =
[20] => -56565.4546
[21] =>
[22] => ;
[23] =>
[24] =>
[25] =>
[26] => foo
[27] =>
[28] => boo
[29] =>
[30] => =
[31] => "hello"
[32] =>
[33] => {
[34] =>
)
[1] => Array
(
[0] => "
[1] =>
[2] =>
[3] => '
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
[17] =>
[18] =>
[19] =>
[20] =>
[21] =>
[22] =>
[23] =>
[24] =>
[25] =>
[26] =>
[27] =>
[28] =>
[29] =>
[30] =>
[31] => "
[32] =>
[33] =>
[34] =>
)
)

The idea is to start from particular cases to general. Try this expression:
Java string:
"([\"'])(?:[^\"']+|(?!\\1)[\"'])*\\1|\\|\\||<=|&&|[()\\[\\]{};=#]|[\\w.-]+"
Raw pattern:
(["'])(?:[^"']+|(?!\1)["'])*\1|\|\||<=|&&|[()\[\]{};=#]|[\w.-]+
The goal here isn't to split on a hypothetical delimiter, but to match entity by entity. Note that the order of the alternatives defines the priority (you can't put = before =>, because the shorter alternative would always win).
Example with your new specifications (you need to import Pattern and Matcher):
String s = "(brown) \"fox\" 'c';foo bar || 55.555;\"foo\";'bar')(;{ gray fox=-56565.4546; foo boo=\"hello\"{";
Pattern p = Pattern.compile("([\"'])(?:[^\"']+|(?!\\1)[\"'])*\\1|\\|\\||<=|&&|[()\\[\\]{};=#]|[\\w.-]+");
Matcher m = p.matcher(s) ;
while (m.find()) {
System.out.println("item = `" + m.group() + "`");
}

Your problem is largely that you are trying to do too much with one regular expression, and consequently you are not able to understand the interactions of its parts. As humans, we all have this trouble.
What you are doing has a standard treatment in the compiler business, called "lexing". A lexer generator accepts a regular expression for each individual token of interest to you, and builds a complex set of states that will pick out the individual lexemes, if they are distinguishable. Separate lexical definitions per token make them easy and unconfusing to write individually. The lexer generator makes it "easy" and efficient to recognize all the members. (If you want to define a lexeme that has specific quotes included, it is easy to do that.)
See any of the widely available parser generators; they all include lexing engines, e.g., JCup, ANTLR, JavaCC, ...
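To make the "one regular expression per token" idea concrete without a generator, here is a minimal hand-rolled sketch. It is not what JCup, ANTLR, or JavaCC produce; the class name TinyLexer, the token kind names, and the individual patterns are assumptions chosen for illustration, and the rules are tried in insertion order at the current position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyLexer {
    // One small regex per token kind (names are hypothetical); insertion order is the priority.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("STRING", Pattern.compile("\"[^\"]*\"|'[^']*'"));
        RULES.put("NUMBER", Pattern.compile("-?\\d+(?:\\.\\d+)?"));
        RULES.put("WORD", Pattern.compile("\\w+"));
        RULES.put("OPERATOR", Pattern.compile("\\|\\||&&|<=|[=#]"));
        RULES.put("PUNCT", Pattern.compile("[(){};]"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace only separates tokens
                continue;
            }
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                // lookingAt() only succeeds if the rule matches exactly at pos.
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt()) {
                    tokens.add(rule.getKey() + ":" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("gray fox=-56565.4546;"));
        // [WORD:gray, WORD:fox, OPERATOR:=, NUMBER:-56565.4546, PUNCT:;]
    }
}
```

Because each rule is tiny, a change to one token kind cannot silently break another. One simplification to be aware of: a minus sign is always treated as part of the following number, so this sketch has no subtraction operator.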

Perhaps it will be easier to achieve your goal with a scanner generator such as JFlex than with a regular expression.
Even if you prefer to write the code by hand, I think it would be better to structure it somewhat more. One simple solution would be to create separate methods which each try to "consume" one type of token from your text. Each such method could tell whether it succeeded or not. This way you have several smaller chunks of code, each responsible for a different token, instead of just one big piece of code which is harder to understand and to write.
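The "one consume method per token type" structure could look roughly like the sketch below. Everything here is an assumption made for illustration: the class name HandLexer, the method names, and the exact token rules (for example, quoted text has no escape handling and a minus sign always belongs to the number that follows).

```java
import java.util.ArrayList;
import java.util.List;

class HandLexer {
    private final String text;
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private HandLexer(String text) {
        this.text = text;
    }

    static List<String> tokenize(String text) {
        HandLexer lexer = new HandLexer(text);
        while (lexer.pos < text.length()) {
            if (Character.isWhitespace(text.charAt(lexer.pos))) {
                lexer.pos++;
                continue;
            }
            // Each consume method either emits one token and advances, or returns false.
            if (lexer.consumeQuoted() || lexer.consumeNumber()
                    || lexer.consumeWord() || lexer.consumeSymbol()) {
                continue;
            }
            throw new IllegalArgumentException("Unexpected character at index " + lexer.pos);
        }
        return lexer.tokens;
    }

    private boolean consumeQuoted() {
        char quote = text.charAt(pos);
        if (quote != '"' && quote != '\'') return false;
        int close = text.indexOf(quote, pos + 1);
        return close >= 0 && emit(close + 1); // keep the quotes in the token
    }

    private boolean consumeNumber() {
        int i = pos;
        if (text.charAt(i) == '-') i++;
        int digitsStart = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        if (i == digitsStart) return false; // no digits, so not a number
        if (i + 1 < text.length() && text.charAt(i) == '.' && Character.isDigit(text.charAt(i + 1))) {
            i++;
            while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        }
        return emit(i);
    }

    private boolean consumeWord() {
        int i = pos;
        while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
        return i > pos && emit(i);
    }

    private boolean consumeSymbol() {
        for (String op : new String[] {"||", "&&", "<="}) {
            if (text.startsWith(op, pos)) return emit(pos + op.length());
        }
        return "(){};=#".indexOf(text.charAt(pos)) >= 0 && emit(pos + 1);
    }

    private boolean emit(int end) {
        tokens.add(text.substring(pos, end));
        pos = end;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo boo=\"hello\"{")); // [foo, boo, =, "hello", {]
    }
}
```

The order of the calls in tokenize plays the same role as the order of alternatives in a single regex, but each rule can now be read and tested on its own.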

Related

Strip list of sensitive query string values, also encoded, with regex

What I am trying to achieve
I would like to replace the values of all request parameters whose name is password, secret, token, or similar. URL-encoded values should be replaced as well, such as password%3D%22test%22, which is the encoded form of password="test".
Here is my current regex
((authtoken|api_secret|token|password|secret|\bkey\b|private\-?_?key|pswd?)(=|%3D%22))([^&|\"|%22]*)(%22)?
Current implementation
The current implementation partially works, apart from the last use case, where the request parameter contains encoded XML. The issue is that the value matches only up to the first occurrence of 2. The substitution used is "$1[PROTECTED]$5"
Question
How can the regex be changed so that the negated set considers the whole sequence %22?
Expected result
for=bar => for=bar
password=value => password=[PROTECTED]
?password=value => ?password=[PROTECTED]
?password=value& => ?password=[PROTECTED]&
?password=value&password=value => ?password=[PROTECTED]&password=[PROTECTED]
foo=bar&password=value&foo=bar => foo=bar&password=[PROTECTED]&foo=bar
{"url":"https://www.host.com/p?password=myKey&password=mySecret","b":"a"}} => {"url":"https://www.host.com/p?password=[PROTECTED]&password=[PROTECTED]","b":"a"}}
https://host?api_key={$your_key}&password={$your_secret}&password={$your_secret}&a=b => https://host?api_key={$your_key}&password=[PROTECTED]&password=[PROTECTED]&a=b
https://host?&password=xyz => https://host?&password=[PROTECTED]
https://host:post?password=xyz => https://host:post?password=[PROTECTED]
http://host:post?password=xyz => http://host:post?password=[PROTECTED]
http://host:post?&password=xyz => http://host:post?&password=[PROTECTED]
http://host:post?password=xyz& => http://host:post?password=[PROTECTED]&
http://host:post?a=b&password=xyz => http://host:post?a=b&password=[PROTECTED]
http://host:post?a=b => http://host:post?a=b
http://host:post?password=xyz&a=b#hash => http://host:post?password=[PROTECTED]&a=b#hash
http://host?foo=bar&xml=%3C%3Fxml+id%3D%220abc987%22+password%3D%22secreT12345%22+binds%3D%222%22 => http://host?foo=bar&xml=%3C%3Fxml+id%3D%220abc987%22+password%3D%22[PROTECTED]%22+binds%3D%222%22

[...] is a character class. It matches a single character that is in the set ... (or, with [^...], a single character not in the set ...).
[^&|\"|%22] is equivalent to [^2&"%|]
It's better to handle the two cases separately:
...(=[^&"\s]*|%3D%22[^"\s]*?%22)
To get the replacement working correctly, you can do this:
(...words...)(?:(=)[^&"\s]*|(%3D%22)[^"\s]*?(%22))
and replace by
$1$2$3[PROTECTED]$4
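For reference, a minimal Java sketch of this two-branch pattern and replacement. The keyword list is copied from the question; the class and method names are hypothetical. It relies on Java substituting an empty string for a group that did not take part in the match, so $2 and $3/$4 never both contribute.

```java
import java.util.regex.Pattern;

class QueryRedactor {
    // Keywords from the question, then either a plain "=value" or an encoded "%3D%22value%22".
    private static final Pattern SENSITIVE = Pattern.compile(
            "(authtoken|api_secret|token|password|secret|\\bkey\\b|private-?_?key|pswd?)"
            + "(?:(=)[^&\"\\s]*|(%3D%22)[^\"\\s]*?(%22))");

    static String redact(String input) {
        // $2 is "=" for plain values; $3 and $4 are the encoded delimiters for encoded values.
        return SENSITIVE.matcher(input).replaceAll("$1$2$3[PROTECTED]$4");
    }

    public static void main(String[] args) {
        System.out.println(redact("foo=bar&password=value&foo=bar"));
        // foo=bar&password=[PROTECTED]&foo=bar
        System.out.println(redact("xml=%3C%3Fxml+password%3D%22secreT12345%22+binds%3D%222%22"));
        // xml=%3C%3Fxml+password%3D%22[PROTECTED]%22+binds%3D%222%22
    }
}
```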

Replace (whitespace added for readability)
(
(authtoken|api_secret|token|password|secret|\bkey\b|private\-?_?key|pswd?)
(=|%3D)
(%22.*?%22 | \".*?\" | [^&]*)
)
with
$2=[PROTECTED]
Where %3D is =, and %22 is ".

Regex: Match any word that is not the one defined by regex

I want to extract the words between the two bracket "blocks" and also the word in first brackets (RUNNING or STOPPED).
Example (extract the bolded part):
[ **RUNNING** ] **My First Application** [Pid: 4194]
[ **RUNNING** ] **Second app (some data)** [Pid: 5248]
[ **STOPPED** ] **Logger App**
So, as you can see, the [Pid: X] part is optional. I can write the regex as follows:
\[\s+(RUNNING|STOPPED)\s+\]\s+([^\[]+).*
and it will work. But this would fail if App name would contain the '[' character. I tried the following, but it won't work:
\[\s+(RUNNING|STOPPED)\s+\]\s+(?!\[Pid)+.*
My idea was to match any words/characters that are not starting with "[Pid", but I guess this would match any words that are not followed by "[Pid".
Is there any way to do exactly that: match any word that is not "[Pid", i.e. match everything up to the first occurrence of the "[Pid" substring?

You may use
\[\s+(RUNNING|STOPPED)\s+\]\s+([^\[]*(?:\[(?!Pid:)[^\[]*)*)
See the regex demo
Details:
\[ - a literal [
\s+ - 1+ whitespaces
(RUNNING|STOPPED) - Group 1 capturing either RUNNING or STOPPED
\s+ - 1+ whitespaces
\] - a literal ]
\s+ - 1 or more whitespaces
([^\[]*(?:\[(?!Pid:)[^\[]*)*) - Group 2 capturing:
[^\[]* - zero or more chars other than [
(?:\[(?!Pid:)[^\[]*)* - zero or more sequences of:
\[(?!Pid:) - a [ not followed with Pid:
[^\[]* - zero or more chars other than [.
Java code:
String rx = "\\[\\s+(RUNNING|STOPPED)\\s+\\]\\s+([^\\[]*(?:\\[(?!Pid:)[^\\[]*)*)";
Pattern p = Pattern.compile(rx);
Matcher m = p.matcher("[ RUNNING ] My First Application");
if (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
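As a further check, the sketch below runs the same pattern against a name that itself contains a [ character. The class name, the parse helper, and the sample lines with "[beta]" are made up for illustration; trim() is added only to drop the space left before [Pid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StatusLineParser {
    private static final Pattern LINE = Pattern.compile(
            "\\[\\s+(RUNNING|STOPPED)\\s+\\]\\s+([^\\[]*(?:\\[(?!Pid:)[^\\[]*)*)");

    // Returns {state, name}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2).trim()};
    }

    public static void main(String[] args) {
        String[] parsed = parse("[ RUNNING ] Second app [beta] [Pid: 5248]");
        System.out.println(parsed[0]); // RUNNING
        System.out.println(parsed[1]); // Second app [beta]
    }
}
```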

You can specify the end of the match as either [Pid or the end of the line. The name group has to be lazy, otherwise .* runs to the end of the line and swallows the [Pid part:
\[\s+(RUNNING|STOPPED)\s+\]\s+(.*?)(\[Pid|$)

You could achieve it with:
\[\ (RUNNING|STOPPED)\ \] # RUNNING or STOPPED -> group 1
(.+?) # everything afterwards in the same line, lazily -> group 2
(?:\[Pid:\ (\d+)\]|$) # [Pid: followed by digits -> group 3 (optional), or end of line
See it working on regex101.com.

Weird password check matching using regex in Java

I'm trying to check a password with the following constraint:
at least 9 characters
at least 1 upper case
at least 1 lower case
at least 1 special character from the following list:
~ ! # # $ % ^ & * ( ) _ - + = { } [ ] | : ; " ' < > , . ?
no accented letters
Here's the code I wrote:
Pattern pattern = Pattern.compile(
"(?!.*[âêôûÄéÆÇàèÊùÌÍÎÏÐîÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ€£])"
+ "(?=.*\\d)"
+ "(?=.*[a-z])"
+ "(?=.*[A-Z])"
+ "(?=.*[`~!##$%^&*()_\\-+={}\\[\\]\\\\|:;\"'<>,.?/])"
+ ".{9,}");
Matcher matcher = pattern.matcher(myNewPassword);
if (matcher.matches()) {
//do what you've got to do when you
}
The issue is that some characters like € or £ don't make the password invalid.
I don't understand why it works that way, since I explicitly exclude € and £ from the allowed characters.

Rather than trying to disallow those non-ASCII characters, why not make your regex accept only ASCII characters, like this:
Pattern pattern = Pattern.compile(
"(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\\p{Punct})\\p{ASCII}{9,}");
Also note the use of \p{Punct} instead of the big character class (\p{Print} would not do here, because it also matches letters and digits). I believe that should suffice for you.
Check the Pattern Javadoc for more details.

This just allows printable Ascii. Note that it allows space character, but you could disallow space by setting \x21 instead.
Edit - I didn't see a number in the requirement, saw it in your regex, but wasn't sure.
# "^(?=.*[A-Z])(?=.*[a-z])(?=.*[`~!##$%^&*()_\\-+={}\\[\\]|:;\"'<>,.?])[\\x20-\\x7E]{9,}$"
^
(?= .* [A-Z] )
(?= .* [a-z] )
(?= .* [`~!##$%^&*()_\-+={}\[\]|:;"'<>,.?] )
[\x20-\x7E]{9,}
$
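A minimal sketch of this whitelist approach as a reusable check. The class and method names are hypothetical, and \p{Punct} is used here as an assumption in place of the question's explicit list of special characters, so it accepts a few characters (such as / and \) that the list may not. Non-ASCII test characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class PasswordCheck {
    // Upper case, lower case, one ASCII punctuation character, printable ASCII only, 9+ chars.
    private static final Pattern POLICY = Pattern.compile(
            "(?=.*[A-Z])(?=.*[a-z])(?=.*\\p{Punct})[\\x20-\\x7E]{9,}");

    static boolean isValid(String password) {
        // matches() anchors the pattern to the whole input, so ^ and $ are not needed.
        return POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Passw0rd!x"));       // true
        System.out.println(isValid("Pass\u20ACword!x")); // false: the euro sign is not printable ASCII
        System.out.println(isValid("password!x"));       // false: no upper-case letter
    }
}
```

Because anything outside the allowed range fails the final character class, there is no need to enumerate accented letters or currency symbols at all.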

Regular expression - Negative lookahead

Using the following expression:
(?<!XYZ\d{8})(?>REF[A-Z]*)?(\d{3}+)(\d{6}+)(\d{3}+)
I am getting unexpected matches. Could you please explain why the following matches occur:
Input XYZ12345678123456789123 matches on 123456781234. I was expecting it to match only on 123456789123, because that is the only sequence that satisfies (?<!XYZ\d{8}).
Weirdly enough, if I use XYZ12345678REF123456789876 as input, it returns a match on 123456789876 but not REF123456789876. It correctly ignored the XYZ12345678, but it didn't pick up the optional REF characters.
Basically, what I want to achieve is to extract a 12-digit identifier from a string that contains two identifiers. The first identifier has the format XYZ\d{8} and the second identifier has the format (?>REF[A-Z]*)?(\d{3}+)(\d{6}+)(\d{3}+)
To avoid a match on the wrong 12 digits in a string such as XYZ12345678123456789123, I want to say: get the twelve digits as long as the digits are not part of an XYZ\d{8} type identifier.
Edit
Here are a couple of examples of what I want to achieve
XYZ12345678123456789123 match on 123456789123
123456789123 match on 123456789123
XYZ12345678REF123456789123 should match on REF123456789123
12345678912 no match because not 12 digits
REF123456789123 match on REF123456789123
REF12345678912 no match because not 12 digits
XYZ12345678123456789123ABC match on 123456789123
XYZ123456789123 No match
XYZ1234567891234 no match

You were almost there. Change (?<!XYZ\\d{8}) to (?<!XYZ\\d{0,7}). You need to check that your match does not start inside the previous identifier XYZ\\d{8}, which means it can't have
XYZ
XYZ1
XYZ12
...
XYZ1234567
before it.
Demo based on your examples
String[] data ={
"XYZ12345678123456789123", //123456789123
"123456789123", //123456789123
"XYZ12345678REF123456789123 ", //REF123456789123
"12345678912", //no match because not 12 digits
"REF123456789123", //REF123456789123
"REF12345678912", //no match because not 12 digits
"XYZ12345678123456789123ABC", //123456789123
"XYZ123456789123", //no match
"XYZ1234567891234", //no match
};
Pattern p = Pattern.compile("(?<!XYZ\\d{0,7})(?>REF[A-Z]*)?(\\d{3}+)(\\d{6}+)(\\d{3}+)");
for (String s:data){
System.out.printf("%-30s",s);
Matcher m = p.matcher(s);
while (m.find())
System.out.print("match: "+m.group());
System.out.println();
}
output:
XYZ12345678123456789123 match: 123456789123
123456789123 match: 123456789123
XYZ12345678REF123456789123 match: REF123456789123
12345678912
REF123456789123 match: REF123456789123
REF12345678912
XYZ12345678123456789123ABC match: 123456789123
XYZ123456789123
XYZ1234567891234

The engine starts looking at the first character in the string.
If the string is "ABCDEF" and the regex is (?<!C)...
Looking at A, it sees there is no C to the left of it.
The assertion being satisfied, it then matches ABC.
Assertions just test the characters around the position the engine is currently at.
They don't force the engine to find C first and match the characters after it.
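The behaviour described above is easy to reproduce. The following sketch (class and helper names are hypothetical) lists each match with its start index, first for the ABCDEF example, then for a simplified version of the question's pattern with the 12 digits written as \d{12}, and finally with the {0,7} variant of the lookbehind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Collects every match as "startIndex:matchedText".
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // At index 0 nothing precedes the position, so the assertion holds immediately.
        System.out.println(findAll("(?<!C)...", "ABCDEF")); // [0:ABC]

        String id = "XYZ12345678123456789123";
        // At index 3 only "XYZ" precedes the position, not XYZ plus 8 digits, so the match starts there.
        System.out.println(findAll("(?<!XYZ\\d{8})\\d{12}", id));   // [3:123456781234]
        // Rejecting XYZ plus 0 to 7 digits blocks every start position inside the first identifier.
        System.out.println(findAll("(?<!XYZ\\d{0,7})\\d{12}", id)); // [11:123456789123]
    }
}
```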
edit
From your examples, you would need something like this, which is anchored.
If not anchored, it could be harder.
Also, Java doesn't have branch reset, so you will have to see which group
cluster matched.
# "^(?:(?:XYZ\\d{8})(\\d{3})(\\d{6})(\\d{3})|(?:REF)(\\d{3})(\\d{6})(\\d{3})|(\\d{3})(\\d{6})(\\d{3}))"
^
(?:
(?: XYZ \d{8} )
( \d{3} ) # (1)
( \d{6} ) # (2)
( \d{3} ) # (3)
|
(?: REF )
( \d{3} ) # (4)
( \d{6} ) # (5)
( \d{3} ) # (6)
|
( \d{3} ) # (7)
( \d{6} ) # (8)
( \d{3} ) # (9)
)
alternative,
# "^(?:(?:XYZ\\d{8})|(?:REF))?(\\d{3})(\\d{6})(\\d{3})"
^
(?:
(?: XYZ \d{8} )
| (?: REF )
)?
( \d{3} ) # (1)
( \d{6} ) # (2)
( \d{3} ) # (3)

You can check that the match is not part of the previous identifier XYZ\d{8}, which means it can't have
XYZ
XYZ1
XYZ12
...
XYZ1234567
before it.
Also, Java doesn't have branch reset, so you will have to check which group
cluster matched.
I would change
(?<!XYZ\\d{8}) to (?<!XYZ\\d{0,7}).
Hope this helps.

regular expressions in java

How can I match only single dot characters, i.e. dots that are not part of a run of dots?
For example, if I have the string "trjb....fsf..ib.bi." then it should return only the dots at index 15 and 18. If I use Pattern p=Pattern.compile("(\\.)+"); I get
4 ....
11 ..
15 .
18 .

This seems to do the trick:
String input = "trjb....fsf..ib.bi.";
Pattern pattern = Pattern.compile("[^\\.]\\.([^\\.]|$)");
Matcher matcher = pattern.matcher(" " + input);
while (matcher.find()) {
System.out.println(matcher.start());
}
The extra space in front of the input does two things:
Allows for a . to be detected as the first character of the input string
Offsets the matcher.start() by one to account for the character in front of the matched .
Result is:
15
18

add a blank at the beginning and at the end of the string and then use the pattern
"[^\\.]\\.[^\\.]"

You need to use negative lookarounds.
Something like Pattern.compile("(?<!\\.)\\.(?!\\.)");

Try
Pattern.compile("(?<=[^\\.])\\.(?=[^\\.])")
or even better...
Pattern.compile("(?<![\\.])\\.(?![\\.])")
This uses negative lookaround.
(?<![\\.]) => not preceded by a .
\\. => a .
(?![\\.]) => not followed by a .
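A short sketch applying the negative-lookaround pattern to the string from the question. The class and method names are hypothetical. Since the lookarounds consume nothing, no padding of the input is needed and matcher.start() is already the index of the dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleDots {
    // A dot that is neither preceded nor followed by another dot.
    private static final Pattern SINGLE_DOT = Pattern.compile("(?<!\\.)\\.(?!\\.)");

    static List<Integer> singleDotIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = SINGLE_DOT.matcher(input);
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println(singleDotIndexes("trjb....fsf..ib.bi.")); // [15, 18]
    }
}
```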
