Match pattern which is not wrapped by some character - java

I have an input string like this:
one `two three` four five `six` seven
where some parts can be wrapped by grave accent character (`).
I want to match only the parts that are not wrapped by it, i.e. one, four five and seven in the example (skipping two three and six).
I tried to do it using lookarounds ((?<=) and (?=)), but it treated the four five group just like two three and six. Is it possible to solve this problem using regex only, or do I have to do it programmatically? (I'm using Java 1.8)

If you are sure that there are no unclosed backticks, you could do this:
((?:\w| )+)(?=(?:[^`]*`[^`]*`)*[^`]*$)
This will match:
"one "
" four five "
" seven"
But it's somewhat expensive: for every candidate match, the lookahead rescans the rest of the line to check that the number of remaining backticks is even, so the whole search can take O(n^2) time.
Note that this works regardless of where the whitespace is. It really counts the backticks and does not care about their relative position. If you don't need this kind of robustness, @anubhava's answer is certainly more performant.
Demo: regex101.
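A minimal Java sketch of how the pattern above could be applied with Matcher.find(). The class and method names (OutsideBackticks, unwrappedParts) are made up for illustration, and the sketch assumes, as the answer does, that every backtick is closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class OutsideBackticks {
    // A run of word characters/spaces that is followed by an even number
    // of backticks up to the end of the input.
    static final Pattern OUTSIDE =
            Pattern.compile("((?:\\w| )+)(?=(?:[^`]*`[^`]*`)*[^`]*$)");

    // Collects every run that lies outside a backtick pair.
    static List<String> unwrappedParts(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = OUTSIDE.matcher(input);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(unwrappedParts("one `two three` four five `six` seven"));
    }
}
```

The surrounding spaces are part of each match, so call trim() on the results if they are not wanted.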

You may use this regex using a lookahead and lookbehind:
(?<!`)\b\w+(?:\s+\w+)*\b(?!`)
RegEx Demo
Explanation:
- (?<!`): Negative Lookbehind to assert that we don't have ` at previous position
- \b\w+(?:\s+\w+)*\b: Match our text surrounded by word boundaries
- (?!`): Negative Lookahead to assert that we don't have ` at next position
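As a rough sketch, the lookaround pattern could be used from Java like this. The names LookaroundBackticks and unwrappedWords are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LookaroundBackticks {
    // Words (separated by whitespace) that neither start right after a
    // backtick nor end right before one.
    static final Pattern UNWRAPPED =
            Pattern.compile("(?<!`)\\b\\w+(?:\\s+\\w+)*\\b(?!`)");

    static List<String> unwrappedWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = UNWRAPPED.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(unwrappedWords("one `two three` four five `six` seven"));
    }
}
```

Because this pattern only looks at the characters directly next to the match, a wrapped part with three or more words, such as `a b c`, still yields the middle word b. It fits inputs like the one in the question, where wrapped parts have at most two words.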

I solve issues like this by specifying to exclude closing characters (in your case whitespace) like so:
`[^\s]+`

Related

Regex match optional string greedy inbetween two random strings

I am looking for a way to match an optional ABC in the following strings.
Both strings should be matched either way, if ABC is there or not:
precedingstringwithundefinedlenghtABCsubsequentstringwithundefinedlength
precedingstringwithundefinedlenghtsubsequentstringwithundefinedlength
I've tried
.*(ABC).*
which doesn't work when ABC is absent, but making ABC optional doesn't work either, as the greedy .* consumes everything:
.*(ABC)?.*
This is NOT a duplicate of e.g. Regex Match all characters between two strings, as I am looking for a certain string in between two random strings, kind of the other way around.
You can use
.*(ABC).*|.*
This works like this:
.*(ABC).* pattern is searched for first, since it is the leftmost part of an alternation (see "Remember That The Regex Engine Is Eager"), it looks for any zero or more chars other than line break chars as many as possible, then captures ABC into Group 1 and then matches the rest of the line with the right-hand .*
| - or
.* - is searched for if the first alternation part does not match.
Another solution without the need to use alternation:
^(?:.*(ABC))?.*
See this regex demo. Details:
^ - start of string
(?:.*(ABC))? - an optional non-capturing group that matches zero or more chars other than line break chars as many as possible and then captures into Group 1 an ABC char sequence
.* - zero or more chars other than line break chars as many as possible.
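The difference between the patterns discussed here can be checked with a short Java sketch. The class name OptionalAbcDemo and the helper groupOne are made up for illustration, and the shortened input strings are stand-ins for the ones in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the three patterns from this thread.
class OptionalAbcDemo {
    static final Pattern NAIVE = Pattern.compile(".*(ABC)?.*");
    static final Pattern ALTERNATION = Pattern.compile(".*(ABC).*|.*");
    static final Pattern OPTIONAL_PREFIX = Pattern.compile("^(?:.*(ABC))?.*");

    // Returns Group 1 if the whole input matches, or null when the input
    // does not match or the group did not participate in the match.
    static String groupOne(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String with = "precedingABCsubsequent";
        String without = "precedingsubsequent";
        for (Pattern p : new Pattern[] {NAIVE, ALTERNATION, OPTIONAL_PREFIX}) {
            System.out.println(p.pattern() + " -> "
                    + groupOne(p, with) + " / " + groupOne(p, without));
        }
    }
}
```

All three patterns match both inputs; they differ only in whether Group 1 is filled when ABC is present.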
I’ve come up with an answer myself:
Using the OR operator seems to work:
(?:(?:.*(ABC))|.*).*
If there’s a better way, feel free to answer and I will accept it.
You could use this regex: .*(ABC){0,1}.*. It means any, optional{min,max}, any, and it is easier to read. Note, however, that {0,1} is equivalent to ?, so the greedy leading .* still consumes ABC and Group 1 stays empty. I can't say whether your solution or mine is faster.
Options:
{value} = n-times
{min,} = min to infinity
{min,max} = min to max
.+([ABC])?.+ should do the job

Regex + sign followed by numbers

Hi, I want to find strings like "+19" in Java,
i.e. a + sign followed by any number of digits.
How do I do this?
Tried "+[0123456789]"
and "\+[0123456789]"
thank you :)
This is the regex you want to use:
\\+\\d+
Two kinds of plus are being used here. The first is escaped with two backslashes so that it is treated as a literal. The second one means match 1 or more times (i.e. match any digit one or more times).
Code:
String input = "+19";
if (input.matches("\\+\\d+")) {
    System.out.println("input string matches");
}
Yes, to match a plus you need to escape it with two backslashes in a Java string literal (which uses C-style escapes). A literal plus needs to be either escaped or put into a character class, [+]. If you just use a plus symbol, it becomes a quantifier that matches the previous symbol or group one or more times.
Also, note that the \d shorthand digit class can match more than just ASCII digits if Pattern.UNICODE_CHARACTER_CLASS flag is passed to Pattern.compile (or embedded (?U) flag is added at the start of the pattern). It is advised to use unambiguous patterns in case the code might be maintained or enhanced/adjusted by different developers later.
Most people prefer patterns without escaping backslashes where possible, since that avoids issues like the one you faced.
Here is a version of the regex that does not require any escaping:
"[+][0-9]+"
Also, the plus quantifier does not match an infinite number of digits, only MAX_UINT number of times.
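To find such tokens inside a longer text rather than test a whole string, Matcher.find() can be used with the escape-free pattern. The following is a small sketch; the names PlusNumbers, findAll and compiles are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class for illustration only.
class PlusNumbers {
    // Literal plus (inside a character class) followed by one or more ASCII digits.
    static final Pattern PLUS_NUMBER = Pattern.compile("[+][0-9]+");

    // Returns every "+digits" token found in the text, in order.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PLUS_NUMBER.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Reports whether a regex compiles; a leading unescaped + is a dangling quantifier.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(findAll("Call +19 or +4420, not 7 or + alone"));
        System.out.println(compiles("+[0123456789]"));
    }
}
```

The compiles helper shows why the first attempt in the question fails: a pattern that starts with an unescaped + is rejected by Pattern.compile.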

^ and $ in Java regular expression

I know that ^ and $ mean "matches the beginning of the line" and "matches the end of the line".
However, when I did some coding today, I didn't notice any difference between including them and excluding them in a regular expression used in Java.
For example, I want to match a positive Integer using
^[1-9]\\d*$
, and when I exclude them in the regular expression like
[1-9]\\d*
, it seems that there is no difference. I have tried to test with a String that "contains" an integer like ###123###, and the second regular expression can still recognize it is not valid like the first one.
So are the two regular expressions above completely equal to the other one? Thanks!
Do you need to search a string like 2343, or [SPACE]2345, or abc234?
The anchored regex will only find the number in the first string. The un-anchored will find them in all strings.
It all depends on what your requirements are. Are you analyzing lines in a text file, where each line contains only digits? Or are you analyzing the text in a prose document or source code, where digits may be interspersed among a whole bunch of other stuff?
In the former case, the anchors are good. In the latter, they are bad.
More info: http://www.regular-expressions.info/anchors.html
They are different: the first regex checks the whole line, from the beginning to the end, while the second doesn't care where in the line the match occurs.
For more check: regex-bounds
Well...no, the regular expressions aren't equivalent. They're also not doing what you think they are.
You intend to match a positive digit - what your regular expression aims to do is to match some character between 1 and 9, then match any number of digit characters after that (which includes zero).
The difference between the two is the anchoring, as you've noted - the first regex will only match values that literally begin with a 1 through 9, then zero or more digits, then expect there to be nothing else in the string.
A regex that matches a positive integer anywhere in the string would look like this:
[1-9]\\d*
...and a regex that matches only a line consisting of a positive integer would be this:
^[1-9]\\d*$
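One possible explanation for the observation in the question, assuming the test was done with String.matches or Matcher.matches (the question does not say): those methods require the whole input to match, so the anchors make no difference there. With Matcher.find(), which searches for a matching substring, the anchors do matter. A small sketch with made-up names (AnchorDemo, wholeMatch, containsMatch):

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the question does not show how the regex was applied.
class AnchorDemo {
    // String.matches succeeds only if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return input.matches(regex);
    }

    // Matcher.find succeeds if any substring of the input matches the regex.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "###123###";
        System.out.println(wholeMatch("[1-9]\\d*", input));
        System.out.println(wholeMatch("^[1-9]\\d*$", input));
        System.out.println(containsMatch("[1-9]\\d*", input));
        System.out.println(containsMatch("^[1-9]\\d*$", input));
    }
}
```

With matches, both patterns reject ###123###; with find, only the anchored pattern rejects it.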

Parse content-page using Regex?

I'm writing a Java code using regex to parse a content-page extracted from a PDF document.
In a string the regex must match: a number (up to three digits) followed by one or more spaces, followed by one or more words (word: any sequence of characters). And vice versa (word(s), space(s), digit(s)); all of them must be in the string. It should also allow leading spaces and be case-insensitive.
The extracted content-page could look something like this:
Directors’ responsibilities 8
Corporate governance 9
Remuneration report 10
the numbering-style is not consistent and number of spaces between digit and string do vary, so it could also look like:
01 Contents
02 Strategy and highlights
04 Chairman’s statement
The regex I'm using matches any number of words followed by any number of spaces and then a number of no more than 3 digits:
(?i)([a-z\\s])*[0-9]{1,3}(?i)
It works, but not quite well, and I can't tell what I'm doing wrong. I also wish there were a way to detect both numbering styles (page numbers to the left or to the right of the string) instead of repeating the regex with the order flipped.
Cheers
If you want to match phrases you should include any punctuation you want to match in your regex. AFAIK there is no way in regex to say if a phrase is "before or after", so you should flip one and append it with a |. Something along the lines of:
[a-zA-Z'".,!\s]+\d{1,3}|\d{1,3}[a-zA-Z'".,!\s]+
Also, you don't need two instances of (?i), as the regex will apply the case insensitivity until the end of the string or if it encounters a (?-i).
You can use this pattern with multiline mode, if there is always a number before or after each item:
"^(?:(?<nb1>\\d{1,3}) +)?(?<item>\\S+(?: +\\S+)*?)(?: +(?<nb2>\\d{1,3})|$)"
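Then you can take m.group("nb1") or m.group("nb2"), whichever is not null, to obtain the number for each whole match (in Java, a group that did not participate in the match returns null, so the two cannot simply be concatenated).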
But if you must check that there is at least one number, you must repeat the whole pattern:
"^(?:(?<nb1>\\d{1,3}) +(?<item1>\\S+(?: +\\S+)*)|(?<item2>\\S+(?: +\\S+)*) +(?<nb2>\\d{1,3}))$"
Then:
item = m.group("item1") != null ? m.group("item1") : m.group("item2");
nb = m.group("nb1") != null ? m.group("nb1") : m.group("nb2");
Note: since the patterns are anchored at the beginning and at the end, you may have to add some optional spaces to make them work: ^\\s* and \\s*$
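Put together, the two-branch approach might look like the following Java sketch. It is an illustration under assumptions, not the answerer's code: the class name TocParser and the method parse are invented, the alternation is closed before the end anchor, and optional plain spaces (rather than \\s, which would also match line breaks in multiline mode) are allowed at both ends of each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for table-of-contents lines in either numbering style.
class TocParser {
    // Branch 1: number, spaces, title. Branch 2: title, spaces, number.
    static final Pattern LINE = Pattern.compile(
            "^ *(?:(?<nb1>\\d{1,3}) +(?<item1>\\S+(?: +\\S+)*)"
            + "|(?<item2>\\S+(?: +\\S+)*) +(?<nb2>\\d{1,3})) *$",
            Pattern.MULTILINE);

    // Returns one {number, title} pair per matching line.
    static List<String[]> parse(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            // Only one branch participates; the other branch's groups are null.
            boolean numberFirst = m.group("nb1") != null;
            String number = numberFirst ? m.group("nb1") : m.group("nb2");
            String title = numberFirst ? m.group("item1") : m.group("item2");
            entries.add(new String[] {number, title});
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = "Corporate governance 9\n  01 Contents\n02 Strategy and highlights";
        for (String[] entry : parse(text)) {
            System.out.println(entry[0] + " -> " + entry[1]);
        }
    }
}
```

Lines without a page number on either side are simply skipped by find().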

Need regular expression for pattern this

I need a regular expression for below pattern
It can start with / or number
It can only contain numbers, no text
Numbers can have space in between them.
It can contain /*, at least 1 number and space or numbers and /*
Valid Strings:
3232////33 43/323//
3232////3343/323//
/3232////343/323//
Invalid Strings:
/sas/3232/////dsds/
/ /34343///// /////
///////////
My problem is that it can have a space between numbers, like /3232 323/, but not / /.
How can I validate it?
I have tried so far:
(\\d[\\d ]*/+) , (/*\\d[\\d ]*/+) , (/*)(\\d*)(/*)
This regex should work for you:
^/*(?:\\d(?: \\d)*/*)+$
Live Demo: http://www.rubular.com/r/pUOYFwV8SQ
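A quick way to try this pattern against the strings from the question is a small Java check. The class name SlashNumberValidator and the method isValid are made up for the sketch.

```java
import java.util.regex.Pattern;

// Hypothetical validator wrapping the pattern from this answer.
class SlashNumberValidator {
    // Optional leading slashes, then one or more groups of digits
    // (single spaces allowed only between digits), each optionally
    // followed by slashes.
    static final Pattern VALID = Pattern.compile("^/*(?:\\d(?: \\d)*/*)+$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "3232////33 43/323//", "3232////3343/323//", "/3232////343/323//",
            "/sas/3232/////dsds/", "/ /34343///// /////", "///////////"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because a space is only accepted when a digit follows it directly, strings with a space next to a slash, or with two spaces in a row, are rejected.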
My solution is not so simple but it works
^(((\d[\d ]*\d)|\d)|/)*((\d[\d ]*\d)|\d)(((\d[\d ]*\d)|\d)|/)*$
Just use lookarounds for the last criteria.
^(?=.*?\\d)([\\d/]*(?:/ ?(?!/)|\\d ?))+$
The best would have been to use conditional regex, but I think Java doesn't support them.
Explanation:
Basically, numbers or slashes, followed by one number and a space, or one slash and a space which is not followed by another slash. Repeat that. The space is made optional because I assume there's none at the end of your string.
Try this java regex
/*(\\d[\\d ]*(?<=\\d)/+)+
It meets all your criteria.
Although you didn't specifically state it, I have assumed that a space may not appear as the first or last character for a number (ie spaces must be between numbers)
"(?![A-z])(?=.*[0-9].*)(?!.*/ /.*)[0-9/ ]{2,}(?![A-z])"
This will match what you want, but keep in mind that it will also match this:
/3232///// from /sas/3232/////dsds/
This is because part of the invalid string is valid on its own.
If you are reading line by line, anchor the pattern with ^ and $; if you are reading an entire block of text, search for \r\n around the regex above to match each line.
