Parse content-page using Regex? - java

I'm writing Java code that uses a regex to parse a content-page extracted from a PDF document.
In each line the regex must match a number (up to three digits), followed by one or more spaces, followed by one or more words (a word being any sequence of characters), or vice versa (words, spaces, number). All three parts must be present in the string. The match should also tolerate leading spaces and be case-insensitive.
The extracted content-page could look something like this:
Directors’ responsibilities 8
Corporate governance 9
Remuneration report 10
The numbering style is not consistent, and the number of spaces between the digits and the text varies, so it could also look like:
01 Contents
02 Strategy and highlights
04 Chairman’s statement
The regex I'm using matches any number of words followed by any number of spaces and then a number of no more than 3 digits:
(?i)([a-z\\s])*[0-9]{1,3}(?i)
It works, but not quite well, and I can't tell what I'm doing wrong. I also wish there were a way to detect both numbering styles (page numbers to the left or to the right of the text) instead of repeating the regex with the order flipped.
Cheers

If you want to match phrases you should include any punctuation you want to match in your regex. AFAIK there is no way in regex to say if a phrase is "before or after", so you should flip one and append it with a |. Something along the lines of:
[a-zA-Z'".,!\s]+\d{1,3}|\d{1,3}[a-zA-Z'".,!\s]+
Also, you don't need two instances of (?i): the flag applies from where it appears until the end of the pattern (or of the enclosing group), or until it encounters a (?-i).
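To illustrate the scope of the inline flag, here is a minimal sketch. The class and helper names (InlineFlagDemo, fullMatch) are made up for this example, and the patterns are simplified versions of the ones discussed above.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // One leading (?i) covers the rest of the pattern.
        System.out.println(fullMatch("(?i)[a-z\\s]+[0-9]{1,3}", "Corporate Governance 9"));
        // Without the flag, the capital letters are rejected by [a-z].
        System.out.println(fullMatch("[a-z\\s]+[0-9]{1,3}", "Corporate Governance 9"));
        // (?-i) switches case sensitivity back on for whatever follows it.
        System.out.println(fullMatch("(?i)abc(?-i)def", "ABCdef"));
        System.out.println(fullMatch("(?i)abc(?-i)def", "ABCDEF"));
    }
}
```

Passing Pattern.CASE_INSENSITIVE as the second argument of Pattern.compile is an equivalent alternative to the leading (?i).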

You can use this pattern in multiline mode, if there is always a number before or after each item:
"^(?:(?<nb1>\\d{1,3}) +)?(?<item>\\S+(?: +\\S+)*?)(?: +(?<nb2>\\d{1,3})|$)"
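Then read the number for each match from whichever of m.group("nb1") or m.group("nb2") is not null. In Java a group that did not take part in the match returns null, so concatenating the two directly would produce text such as "null8".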
But if you must check that there is at least one number, you must repeat the whole pattern:
"^(?:(?<nb1>\\d{1,3}) +(?<item1>\\S+(?: +\\S+)*)|(?<item2>\\S+(?: +\\S+)*) +(?<nb2>\\d{1,3}))$"
Then:
String item = m.group("item1") != null ? m.group("item1") : m.group("item2");
String nb = m.group("nb1") != null ? m.group("nb1") : m.group("nb2");
Notice: since the patterns are anchored at the beginning and at the end, you may have to add some optional spaces to make them work: ^\\s* and \\s*$
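Putting the pieces together, a minimal sketch of the two-alternative approach might look like the following. The class name TocParser, the "title -> page" output format and the use of [ \t]* (rather than \s*) for the optional padding are assumptions made for this example, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TocParser {
    // Either "<number> <title>" or "<title> <number>" on one line,
    // with optional spaces or tabs before and after.
    private static final Pattern ENTRY = Pattern.compile(
        "^[ \\t]*(?:(?<nb1>\\d{1,3}) +(?<item1>\\S+(?: +\\S+)*)"
        + "|(?<item2>\\S+(?: +\\S+)*) +(?<nb2>\\d{1,3}))[ \\t]*$",
        Pattern.MULTILINE);

    // Returns one "title -> page" string per line that has a page number.
    static List<String> parse(String text) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            // Only one alternative takes part in a match; the other groups are null.
            String nb = m.group("nb1") != null ? m.group("nb1") : m.group("nb2");
            String item = m.group("item1") != null ? m.group("item1") : m.group("item2");
            entries.add(item + " -> " + Integer.parseInt(nb));
        }
        return entries;
    }

    public static void main(String[] args) {
        String page = "Corporate governance 9\n  01 Contents\n02 Strategy and highlights";
        System.out.println(parse(page));
    }
}
```

Lines without any page number are simply skipped, because neither alternative can match them.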

Related

Match pattern which is not wrapped by some character

I have an input string like this:
one `two three` four five `six` seven
where some parts can be wrapped by grave accent character (`).
I want to match only these parts which are not wrapped by it, it is one, four five and seven in example (skip two three and six).
I tried to do it using lookarounds ((?<=) and (?=)), but it treated the four five group as wrapped, like two three and six. Is it possible to solve this problem using regex only, or do I have to do it programmatically? (I'm using Java 1.8)
If you are sure that there are no unclosed backticks, you could do this:
((?:\w| )+)(?=(?:[^`]*`[^`]*`)*[^`]*$)
This will match:
"one "
" four five "
" seven"
But it's a little bit expensive: the lookahead that checks whether the number of backticks in the remaining part of the line is divisible by 2 scans to the end of the string every time it is tried, so the whole search can take O(n^2) time.
Note that this works regardless of where the whitespace is: it really counts the backticks and does not care about their relative position. If you don't need this kind of robustness, @anubhava's answer is certainly more performant.
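As a rough Java sketch of the backtick-counting approach (the class and method names are made up for this example), the pattern can be applied with Matcher.find() and the first group collected:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktickCounter {
    // Runs of word characters and spaces that are followed by an even
    // number of backticks up to the end of the input.
    private static final Pattern OUTSIDE = Pattern.compile(
        "((?:\\w| )+)(?=(?:[^`]*`[^`]*`)*[^`]*$)");

    static List<String> outside(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = OUTSIDE.matcher(input);
        while (m.find()) {
            parts.add(m.group(1)); // surrounding spaces are kept, as in the answer
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(outside("one `two three` four five `six` seven"));
    }
}
```

Calling trim() on each part would drop the surrounding spaces if they are not wanted.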
You may use this regex using a lookahead and lookbehind:
(?<!`)\b\w+(?:\s+\w+)*\b(?!`)
Explanation:
- (?<!`): Negative Lookbehind to assert that we don't have ` at previous position
- \b\w+(?:\s+\w+)*\b: Match our text surrounded by word boundaries
- (?!`): Negative Lookahead to assert that we don't have ` at next position
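A small Java sketch of the lookaround pattern explained above (class and method names are hypothetical). Note that, as far as I can tell, the pattern only inspects the characters directly next to the match, so a word in the middle of a wrapped run of three or more words would still be reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnwrappedWords {
    // Word runs that neither start right after a backtick nor end right before one.
    private static final Pattern UNWRAPPED = Pattern.compile(
        "(?<!`)\\b\\w+(?:\\s+\\w+)*\\b(?!`)");

    static List<String> find(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = UNWRAPPED.matcher(input);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(find("one `two three` four five `six` seven"));
    }
}
```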
I solve issues like this by specifying to exclude closing characters (in your case whitespace) like so:
`[^\s]+`

Extract exactly n digits in a sentence using REGEX

Example
The no.s 1234 65
Input: n
For n=4, the output should be 1234
For n=2, the output should be : 65 (not 12)
I tried \d{n}, which gives 12, and \d{n,}, which gives 1234, but I want only the run that matches exactly.
Pattern p = Pattern.compile("//\d{n,}");
You need negative lookaround assertions: (?<!..) is a negative lookbehind and (?!..) is a negative lookahead:
(?<!\d)\d{4}(?!\d)
However, not all regex engines support them. A workaround is to also match the preceding and following characters (unlike lookarounds, which are zero-width matches) and capture the digits in a group; \D matches any character except a digit:
(?:^|\D)(\d{4})(?:\D|$)
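Since n is an input, the pattern has to be built at run time. A minimal sketch of the lookaround version in Java, with made-up class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactDigits {
    // Returns every run of exactly n digits, i.e. n digits that are
    // neither preceded nor followed by another digit.
    static List<String> runsOfLength(String text, int n) {
        Pattern p = Pattern.compile("(?<!\\d)\\d{" + n + "}(?!\\d)");
        List<String> runs = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(runsOfLength("The no.s 1234 65", 4));
        System.out.println(runsOfLength("The no.s 1234 65", 2));
    }
}
```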
I think what you meant is the \b character.
Hence, the regex you're looking for would be (for n=2):
\b\d{2}\b
From what I understand, you're looking for a regex that will match a number in a string which has n digits, taking into account the spacing between the numbers. If that's the case, you're looking for something like this:
\b\d{4}\b
The \b ensures the match is constrained to the start/end of a 'word': \b matches at the boundary between anything matched by \w (which includes digits) and anything matched by its opposite, \W (which includes spaces).
I don't code in java but I can try to answer this using regex in general.
If your number is in the format d1d2d3d4 d5d6 and you want to extract the digits d5d6, create three groups, as in ([0-9]+)(\s)([0-9]+), where each pair of parentheses () represents one group. Then extract only the third group into another object; that is your required output.

Java Regex to validate String

I have just bought a book on regex to try and get my head around it, but I'm still really struggling with it. I am trying to create a Java regex that validates a string against the following rules:
Can contain lowercase letters ([a-z])
Can contain commas (,) but only between words
Can contain colon (:) but must be separated by words or multiply (*)
Can contain hyphens (-) but must be separated by words
Can contain multiply (*) but if used it must be the only character before/between/after the colon
Cannot contain spaces; 'words' are delimited by a hyphen (-), a comma (,), a colon (:) or the end of the string
So for example the following would be true:
foo:bar
foo-bar:foo
foo,bar:foo
foo-bar,foo:bar,foo-bar
foo:bar:foo,bar
*:foo
foo:*
*:*:*
But the following would be false:
foo :bar
,foo:bar
foo-:bar
-foo:bar
foo,:bar-
foo:bar,
foo,*:bar
foo-*:bar
This is what I have so far:
^[a-z-]|*[:?][a-z-]|*[:?][a-z-]|*
Here is a regex that will work for all your cases:
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))*
Here is a detailed analysis:
One of the basic structures used to build complicated regular expressions like this is actually pretty simple, and has the form text(separator text)*. A regex of that form will match:
one text
one text, a separator, and another text
one text, a separator, another text, another separator, and yet another text
or more, just add another separator and a text to the end.
So here is a breakdown of the code:
[a-z]+([,-][a-z]+)* is an instance of the pattern I discussed above: the text here is [a-z]+, and the separator is [,-].
([a-z]+([,-][a-z]+)*|\*) allows an asterisk to be matched instead.
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))* is another instance of the pattern I discussed above: the text is ([a-z]+([,-][a-z]+)*|\*), and the separator is :.
If you plan to use this as a component of an even larger regex, in which the group matches will be important, I would recommend making the internal parens non-grouping, and place grouping parens around the entire regex, like so:
((?:[a-z]+(?:[,-][a-z]+)*|\*)(?::(?:[a-z]+(?:[,-][a-z]+)*|\*))*)
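To check the full pattern from the breakdown above against the test cases in the question, a minimal sketch could use Matcher.matches(), which requires the whole string to match. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class ConfigValidator {
    // text(separator text)* applied twice: words joined by , or - form a
    // segment (or the segment is a lone *), and segments are joined by :.
    private static final Pattern VALID = Pattern.compile(
        "([a-z]+([,-][a-z]+)*|\\*)(:([a-z]+([,-][a-z]+)*|\\*))*");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"foo:bar", "foo-bar,foo:bar,foo-bar", "*:*:*",
                            "foo :bar", "foo-:bar", "foo,*:bar"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Using matches() rather than find() matters here: with find(), a valid fragment inside an invalid string would still be reported.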
We rarely see somebody here who can define positive and negative test cases. That makes life really easier.
Here's my regex with a 95% solution:
"(([a-z]+|\\*)[:,-])*([a-z]+|\\*)" (JAVA-Version)
(([a-z]+|\*)[:,-])*([a-z]+|\*) (plain regex)
It simply differentiates between words (a-z or *) and separators (one of :-,): the string must contain at least one word, and words must be separated by a separator. It works for the positive cases and for the negative cases except the last two negative ones.
One remark: such a complex "syntax" would in real life be implemented with a grammar definition tool like ANTLR (or a few years ago with lex/yacc, flex/bison). Regex can do that but will not be easy to maintain.

Regular expression to check if string has certain number of digits

I have address information and some junk in my DB, and I need to check whether a string contains a zip code before I process it. Can you explain how to check whether a string contains 5 consecutive digits? For example,
String addr = "10100 Trinity Parkway, 5th Floor Stockton, CA 95219";
So it has to match this string as it is having 5 digit zip code. Any way to check using Java Regular Expression?
Update:
String addr = "10100 Trinity Parkway, 5th Floor Stockton, CA 95219";
String addressMatcher = "\\d{5}";
if(addr.matches(addressMatcher)){
System.out.println(addr);
}
Above is the code I am using after getting the answers but none of the regex matches and prints the addr. Am I doing anything wrong?
Regards,
Karthik
The simple expression is ".*\\d{5}$", which says that you want any character 0 or more times, then any digit 5 times, and then the end of the string. Note that this accounts for needing to escape the backslash in the Java string.
If you may have more characters at the end of the string, then you can replace the $ with .* to match those. However, that may end up matching numbers elsewhere in the address as well, so make sure your data is in a consistent, expected format.
Regular expressions may not be sufficient in this case, since not all 5-digit numbers are actually zip codes. As well, some zip codes may include additional numbers (usually following a - after the first five digits.)
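The snippet in the question prints nothing because String.matches() only succeeds when the entire string matches the pattern. A short sketch of the difference, assuming a hypothetical helper class ZipCheck; Matcher.find() looks for the pattern anywhere in the input:

```java
import java.util.regex.Pattern;

class ZipCheck {
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    // True when five consecutive digits occur anywhere in the input.
    static boolean hasFiveDigitRun(String s) {
        return FIVE_DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        String addr = "10100 Trinity Parkway, 5th Floor Stockton, CA 95219";
        // Whole-string match against just five digits: fails.
        System.out.println(addr.matches("\\d{5}"));
        // Whole-string match that allows anything before the final five digits.
        System.out.println(addr.matches(".*\\d{5}$"));
        // Search anywhere in the string.
        System.out.println(hasFiveDigitRun(addr));
    }
}
```

Note that find() is also satisfied by the street number 10100, which is the concern about non-zip numbers raised above.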
I am not quite sure if this is what you are looking for:
(\w\s*)+,(\s*(\w\s*)+,)*\s*[A-Z]{2}\s*[1-9]{5}
This expression will match:
1 - any words with any spaces, followed by a comma,
2 - then optional repetitions of words separated by spaces, each followed by a comma,
3 - finally a two-letter word (the state code), followed by a five-digit zip code. Note that [1-9] only accepts non-zero digits; use \d{5} if zip codes containing 0 must match.
I hope this helps
[Edit]
This expression will filter out numbers in the address that are not in the correct place to be a zip code, for example the 10100 in your example.

How can I express such requirement using Java regular expression?

I need to check that a file contains some amounts that match a specific format:
between 1 and 15 characters (numbers or ",")
may contain at most one "," separator for decimals
must at least have one number before the separator
this amount is supposed to be in the middle of a string, bounded by alphabetical characters (but we have to exclude the malformed files).
I currently have this:
\d{1,15}(,\d{1,14})?
But it does not match with the requirement as I might catch up to 30 characters here.
Unfortunately, for some reasons that are too long to explain here, I cannot simply pick a substring or use any other java call. The match has to be in a single, java-compatible, regular expression.
^(?=.{1,15}$)\d+(,\d+)?$
^ start of the string
(?=.{1,15}$) positive lookahead to make sure that the total length of string is between 1 and 15
\d+ one or more digit(s)
(,\d+)? optionally followed by a comma and more digits
$ end of the string (still required when searching with find(): the lookahead only checks the total length, not that the digits run all the way to the end).
You might have to escape backslashes for Java: ^(?=.{1,15}$)\\d+(,\\d+)?$
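As a quick check of the anchored pattern, here is a minimal sketch; the class and method names are assumptions for this example, and it tests a standalone amount rather than one embedded in a longer string.

```java
import java.util.regex.Pattern;

class AmountCheck {
    // Lookahead caps the total length at 15; the rest enforces
    // digits, optionally followed by one comma and more digits.
    private static final Pattern AMOUNT = Pattern.compile("^(?=.{1,15}$)\\d+(,\\d+)?$");

    static boolean isAmount(String s) {
        return AMOUNT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"123,45", "123456789012345", "1234567890123456",
                            "12345678,1234567", ",45", "12,34,56"};
        for (String s : samples) {
            System.out.println(s + " -> " + isAmount(s));
        }
    }
}
```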
update: If you're looking for this in the middle of another string, use word boundaries \b instead of string boundaries (^ and $).
\b(?=[\d,]{1,15}\b)\d+(,\d+)?\b
For java:
"\\b(?=[\\d,]{1,15}\\b)\\d+(,\\d+)?\\b"
More readable version:
"\\b(?=[0-9,]{1,15}\\b)[0-9]+(,[0-9]+)?\\b"

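One caveat on the word-boundary version, as far as I can tell: \b does not match between a letter and a digit (both are word characters), and it does match between a comma and a digit, so an amount glued directly to letters is missed and an amount with a comma can slip past the 15-character limit. The following is a hedged alternative sketch, not the answerer's pattern: it replaces \b with lookarounds on the character class [0-9,]. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedAmount {
    // The run of digits/commas must not be preceded or followed by another
    // digit or comma, and the whole run must be 1 to 15 characters long.
    private static final Pattern AMOUNT = Pattern.compile(
        "(?<![0-9,])(?=[0-9,]{1,15}(?![0-9,]))[0-9]+(,[0-9]+)?(?![0-9,])");

    // Returns the first well-formed amount in the text, or null if there is none.
    static String first(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("abc123,45def"));
        System.out.println(first("x12345678,1234567y"));
        System.out.println(first("a12,34,56b"));
    }
}
```

Because the lookarounds refer only to digits and commas, the amount may be bounded by letters, spaces or the ends of the string alike.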