Get word between two numbers java regex - java

I want to extract the name (one or more words) that sits between the leading row number and the ID number that follows it.
I want to do this with a Java regex.
E.g. Car Care or Car Electronics & Accessories
# Name Id Child nodes
1 Car Care 15718271 Browse
2 Car Electronics & Accessories 2230642011 Browse
3 Exterior Accessories 15857511 Browse
I tried splitting the line with .split(" ")[1], but that splits on every space, so it only gives one word of the phrase, e.g. Car.

Try this one: ^\d*[a-zA-Z &+-]*(\d*)[a-zA-Z ]*
The capturing group '(\d*)' will contain the number you are after.
If the strings before and after the number contain special characters, add them to the appropriate [] sections.
Explanation: '^' anchors the match at the beginning, '\d*' takes zero or more leading digits, [a-zA-Z &+-]* takes a string made of these characters, (\d*) captures the number, and [a-zA-Z ]* takes the string after the number. Use a regex editor to try this out.
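The pattern above captures the ID, while the question asks for the name. As a minimal sketch (the class name, the helper method and the NAME pattern are assumptions for illustration, not part of the answer), the capturing group can be moved onto the name, assuming the name contains only letters, spaces, &, + and -:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameBetweenNumbers {
    // Pattern from the answer above: group 1 holds the ID number.
    static final Pattern ID = Pattern.compile("^\\d*[a-zA-Z &+-]*(\\d*)[a-zA-Z ]*");
    // Hypothetical variant: group 1 holds the name between the row number and the ID.
    static final Pattern NAME = Pattern.compile("^\\d+ ([a-zA-Z &+-]+?) \\d+");

    // Returns group 1 of the first match, or null when the line does not match.
    static String firstGroup(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "2 Car Electronics & Accessories 2230642011 Browse";
        System.out.println(firstGroup(ID, line));   // the ID
        System.out.println(firstGroup(NAME, line)); // the name
    }
}
```

The lazy quantifier +? stops the name at the first space that is followed by digits, so the sketch is applied one line at a time.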

You can use
(?m)^\d+\s+(.*?)\s+\d{6,}\s
See the regex demo. Details:
(?m) - a multiline option
^ - start of a line
\d+ - one or more digits
\s+ - one or more whitespaces
(.*?) - Group 1: zero or more chars other than line break chars as few as possible
\s+ - one or more whitespaces
\d{6,} - six or more digits
\s - a whitespace.
See the Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "# Name Id Child nodes\n1 Car Care 15718271 Browse \n2 Car Electronics & Accessories 2230642011 Browse\n3 Exterior Accessories 15857511 Browse";
Pattern pattern = Pattern.compile("(?m)^\\d+\\s+(.*?)\\s+\\d{6,}\\s");
Matcher matcher = pattern.matcher(s);
// Group 1 holds the name on each matching line
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
Car Care
Car Electronics & Accessories
Exterior Accessories

Related

Regex doesn't stop after sign

Hi, I have a regex like this:
(.*(?=\sI+)*) (.*)
But it doesn't capture the groups the way I need.
For this example data:
Vladimir Goth
Langraab II Landgraab
Léa Magdalena III Rouault Something
Anna Maria Teodora
Léa Maria Teodora II
Only examples 1 and 2 are captured correctly.
So what I need is:
If there is no I+, split at the first space.
If other words follow I+, the first group should contain everything up to and including I+. So group 1 for the 3rd example should be Léa Magdalena III.
If no other words follow I+, as in example 5, group 1 should capture up to the first space.
#Edit
I+ should be replaced by Roman numerals.
If you want to support any Roman numerals, you can use
^(\S+(?:.*\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)
If you only need to support Roman numerals below XX:
^(\S+(?:.*\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)
See the regex demo #1 and demo #2. In Java code, replace the literal spaces with \h or \s, and double the backslashes in the string literal.
Details:
^ - start of string
( - Group 1 start:
\S+ - one or more non-whitespaces
(?: - a non-capturing group:
.* - any zero or more chars other than line break chars as many as possible
\b - a word boundary
(?=[MDCLXVI]) - require at least one Roman digit immediately to the right
M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) - a Roman number pattern
\b - a word boundary
(?= +\S) - a positive lookahead that requires one or more spaces and then one non-whitespace right after the current position
)? - end of the non-capturing group, repeat one or zero times (it is optional)
) - end of the first group
[space]+ - one or more spaces (a literal space followed by +)
(.*) - Group 2: the rest of the line.
In Java:
String regex = "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)";
// Or
String regex = "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)";
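As a minimal sketch of how the full Roman numeral pattern behaves on the sample names from the question (the class and method names are assumptions for illustration; \u00e9 is the escaped form of é):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RomanNameSplitter {
    // Full Roman numeral variant from above, with \h for horizontal whitespace.
    static final Pattern NAME = Pattern.compile(
        "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
        + "(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)");

    // Returns {group 1, group 2}, or null when the line does not match.
    static String[] split(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Vladimir Goth",
            "Langraab II Landgraab",
            "L\u00e9a Magdalena III Rouault Something",
            "Anna Maria Teodora",
            "L\u00e9a Maria Teodora II"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println(parts[0] + " | " + parts[1]);
        }
    }
}
```

In the last sample the trailing II is not followed by another word, so the optional part fails and group 1 falls back to the first word only.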

Parsing number information from an ingredient string using regex

I am trying to extract the quantity information from an ingredient string where the unit has already been removed.
175 risotto rice
a little hot vegetable stock (optional)
1 coriander
salt pepper
1 0.5 extra virgin olive oil
1 mild onion
300 split red lentils
1.7 well-flavoured vegetable stock
4 carrots
1 head celery
100 stilton cheese
4 snipped chives
salt pepper
225 dried flageolet beans
These are examples of the strings I am parsing, and the results should look like:
175
1
1 0.5
1
300
1.7
4
1
100
4
225
My current thinking is to use [0-9]+[ ]*[0-9]*.?[0-9]* as the regex; however, this picks up the first character after the numerical values. For example, 175 risotto rice returns "175 r".
The problem here is that the . in .? is not escaped, so it matches any character instead of a literal dot. In 175 risotto rice, [0-9]+ matches 175, [ ]* matches the space, and .? then consumes the r. Escaping it as \.? should already give you the desired matching behavior.
Note that you can shorten [0-9] to \d:
^\d+\s*\d*\.?\d*
If you want to access each number separately, you need capture groups.
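The effect of escaping the dot can be checked with a small sketch (the class and method names are assumptions for illustration). Note that the escaped pattern may still include a trailing space from \s*, so the result is trimmed when printed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityPrefix {
    // Pattern from the question: the unescaped dot matches any character.
    static final Pattern UNESCAPED = Pattern.compile("[0-9]+[ ]*[0-9]*.?[0-9]*");
    // Pattern from the answer: anchored, with the dot escaped.
    static final Pattern ESCAPED = Pattern.compile("^\\d+\\s*\\d*\\.?\\d*");

    // Returns the first match, or an empty string when nothing matches.
    static String firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String[] samples = { "175 risotto rice", "1 0.5 extra virgin olive oil", "salt pepper" };
        for (String s : samples) {
            System.out.println("[" + firstMatch(UNESCAPED, s) + "] vs ["
                + firstMatch(ESCAPED, s).trim() + "]");
        }
    }
}
```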
In your regex, .? matches an optional character (any character except a newline), which in your data is, for example, the r in risotto or the c in coriander.
You could use an anchor to assert the start of the string and match 1+ digits followed by an optional part that matches a dot and 1+ digits.
After that match you could add the same optional pattern with a leading 1+ spaces or tabs:
^\d+(?:\.\d+)?(?:[ \t]+\d+(?:\.\d+)?)?
In Java
String regex = "^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?";
That will match
^ Start of the string
\d+(?:\.\d+)? Match 1+ digits followed by an optional part ? that matches a dot and 1+ digits
(?: Non-capturing group
[ \t]+\d+(?:\.\d+)? Match 1+ spaces or tabs, then 1+ digits, again followed by an optional part that matches a dot and 1+ digits
)? Close the non-capturing group and make it optional
Note that if you want to match the second pattern 0+ times instead of making it optional, you could use * instead of ?
Regex demo | Java demo
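To read the one or two numbers separately, each number can be wrapped in its own capturing group. This is a minimal sketch under that assumption (the class name, method name and the added capturing groups are illustrative, and the fractional part of the second number is treated as optional):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityParser {
    // Group 1: first number; group 2: optional second number after spaces or tabs.
    static final Pattern QTY =
        Pattern.compile("^(\\d+(?:\\.\\d+)?)(?:[ \\t]+(\\d+(?:\\.\\d+)?))?");

    // Returns the leading numbers of an ingredient line (zero, one or two entries).
    static List<String> quantities(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = QTY.matcher(s);
        if (m.find()) {
            out.add(m.group(1));
            if (m.group(2) != null) { // group 2 is null when there is no second number
                out.add(m.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(quantities("1 0.5 extra virgin olive oil"));
        System.out.println(quantities("1 head celery"));
        System.out.println(quantities("salt pepper"));
    }
}
```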

RegEx of underscore delimited string

I have a string with 5 pieces of data delimited by underscores:
AAA_BBB_CCC_DDD_EEE
I want a different regex for each component.
The regex needs to return just the one component.
For example, the first would return just AAA, the second for BBB, etc.
I am able to parse out AAA with the following:
^([^_]*)?
I see that I can do a look-around like this to find:
(?<=[^_]*_).*
BBB_CCC_DDD_EEE
But the following can not find just BBB
(?<=[^_]*_)[^_]*(?=_)
Mixing lookbehind and lookahead
^([^_]+)? // 1st
(?<=_)[^_]+ // 2nd
(?<=_)[^_]+(?=_[^_]+_[^_]+$) // 3rd
(?<=_)[^_]+(?=_[^_]+$) // 4th
[^_]+$ // 5th
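A minimal Java sketch of these five patterns, taking only the first match of each (the class and method names are assumptions for illustration). The 2nd pattern would also match the later components on repeated calls to find(), so it only isolates BBB when the first match is used:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreLookarounds {
    // The five patterns listed above, in order (1st to 5th component).
    static final String[] PATTERNS = {
        "^([^_]+)?",
        "(?<=_)[^_]+",
        "(?<=_)[^_]+(?=_[^_]+_[^_]+$)",
        "(?<=_)[^_]+(?=_[^_]+$)",
        "[^_]+$"
    };

    // Returns the first match of the regex in s, or null when there is none.
    static String firstMatch(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String regex : PATTERNS) {
            System.out.println(firstMatch(regex, "AAA_BBB_CCC_DDD_EEE"));
        }
    }
}
```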
If the lengths of the strings between the "_" are known, it can be done like this (note that \K is a PCRE feature and is not supported by java.util.regex):
1st match
^([^_]+)?
2nd match
(?<=_)\K[^_]+
3rd match
(?<=_[A-Za-z]{3}_)\K[^_]+
4th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
5th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
Each {3} expresses the length of the string between the "_".
If your string always uses underscores as delimiters, you can use a single regex and capture the value in a capturing group: repeat the pattern for what comes before it (in this case, non-underscore characters followed by an underscore) using a quantifier such as {3}, which you can change.
This way the quantifier specifies how many times to repeat the preceding pattern before capturing your match. For your example string AAA_BBB_CCC_DDD_EEE you could use {0}, {1}, {2}, {3} or {4}.
^(?:[^_\n]+_){3}([0-9A-Za-z]+)(?:_[^_\n]+)*$
That would match:
^ Assert position at start of the line
(?:[^_\n]+_){3} In a non-capturing group (?:, match one or more characters that are not an underscore or a newline [^_\n]+ followed by an underscore, and repeat that n times (in this example n is 3)
([0-9A-Za-z]+) Capture your characters in a group using, for example, a character class (or use [^_]+ to match anything that is not an underscore, but that will also match whitespace characters)
(?:_[^_\n]+)* After the captured value, repeat a non-capturing group that matches an underscore followed by one or more characters that are not an underscore or a newline, zero or more times, to get a full match
$ Assert position at the end of the line
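The quantifier can be filled in at runtime to select a component by position. A minimal sketch, assuming a zero-based index (the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreField {
    // index is zero-based: 0 selects AAA and 4 selects EEE in AAA_BBB_CCC_DDD_EEE.
    // Returns null when the string has no component at that index.
    static String field(String s, int index) {
        Pattern p = Pattern.compile(
            "^(?:[^_\\n]+_){" + index + "}([0-9A-Za-z]+)(?:_[^_\\n]+)*$");
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(field("AAA_BBB_CCC_DDD_EEE", i));
        }
    }
}
```

Compiling a pattern on every call keeps the sketch short; when the same index is used repeatedly, the compiled Pattern could be cached instead.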

How to regex a string representing a city or its postal code with accent?

I'm trying to write Java code that shows a list of cities based on the name of the city or its postal code.
I wrote many expressions but they didn't work 100%.
This is my last expression:
([A-Z_]+)(:)([0-9]+)
The expression should match a city name, which could be something like Lonéy' ed, or its postal code, such as 57000.
Does anyone have an idea how to improve my expression?
Thanks.
Since Java 7 you can do the following:
Pattern.compile("([\\p{Alpha} '_-]+):(\\d{5})", Pattern.UNICODE_CHARACTER_CLASS)
Keep adding connecting characters (here [ '_-]) to cater for all your needs, keeping the hyphen last so that it is treated as a literal hyphen rather than as a range operator.
The pattern doesn't make any assumptions about the case of the name of a place as in some non-Latin scripts there are no cases.
EDIT: added 5 digits postal code detection and a SPACE for name detection
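The position of the hyphen inside the character class matters: between two characters it forms a range, so '-_ covers every character from ' to _, including digits and the colon. A minimal sketch of the difference, plus the pattern with the hyphen placed last (the class and method names are assumptions for illustration; \u00e9 is the escaped form of é):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenInCharClass {
    static final Pattern RANGE = Pattern.compile("[ '-_]");   // ' through _ is a range
    static final Pattern LITERAL = Pattern.compile("[ '_-]"); // trailing hyphen is literal
    static final Pattern CITY = Pattern.compile(
        "([\\p{Alpha} '_-]+):(\\d{5})", Pattern.UNICODE_CHARACTER_CLASS);

    // True when the single character c is accepted by the character class p.
    static boolean accepts(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }

    // Returns the city name (group 1) when the whole input matches, else null.
    static String city(String input) {
        Matcher m = CITY.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(accepts(RANGE, ':') + " " + accepts(LITERAL, ':'));
        System.out.println(accepts(RANGE, '5') + " " + accepts(LITERAL, '5'));
        System.out.println(city("Lon\u00e9y' ed:57000"));
    }
}
```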
I suggest using
"(?U)(\\p{Lu}[\\p{L}\\p{M}\\s'-]*):(\\d{5})\\b"
It means:
(?U) - a Pattern.UNICODE_CHARACTER_CLASS inline flag that makes the \b word boundary and the \d digit character class Unicode-aware in the pattern
(\\p{Lu}[\\p{L}\\p{M}\\s'-]*) - Group 1 capturing:
\\p{Lu} - an uppercase Unicode letter
[\\p{L}\\p{M}\\s'-]* - 0 or more characters that are either Unicode letters (\\p{L}), diacritics (\\p{M}), whitespace (\\s), ' or - (NOTE that the hyphen must be at the end of the character class so that it is treated as a literal hyphen)
: - a literal colon
(\\d{5}) - Group 2: five digits
\\b - a word boundary so that we only match 5 digits not followed by a word char (not the first 5 digits of a longer digit run); it can be replaced with "(?!\\d)"
See Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Lonéy' ed:57000";
Pattern pattern = Pattern.compile("(?U)(\\p{Lu}[\\p{L}\\p{M}\\s'-]*):(\\d{5})\\b");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1)); // city name
    System.out.println(matcher.group(2)); // postal code
}

Java: Match City Regex

I'm using the following regex to match city:
[a-zA-Z]+(?:[ '-][a-zA-Z]+)*
The problem is it not only matches the city but also part of the street name.
How can I make it match only the city (such as Brooklyn and Columbia City)?
UPDATE:
The data is represented in 1 line of text (each address will be fed to regex engine separately):
2778 Ray Ridge Pkwy,
Brooklyn NY 1194-5954
1776 99th St,
Brooklyn NY 11994-1264
1776 99th St,
Columbia City OR 11994-1264
I suggest the following approach: match the words from the beginning of the string up to the first occurrence of 2 uppercase letters followed by the ZIP (see the look-ahead (?=\s+[A-Z]{2}\s+\d+-\d+) below):
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\s+\d+-\d+)
See demo
The regex:
^ - starts looking from the beginning of the string
[A-Za-z]+ - matches a word
(?:[\s'-]+[A-Za-z]+)* - matches 0 or more words that...
(?=\s+[A-Z]{2}\s+\d+-\d+) - are right before a space + 2 uppercase letters, space, 1 or more digits, hyphen and 1 or more digits.
If the ZIP (or whatever the numbers stand for) is optional, you may just rely on the 2 uppercase letters:
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\b)
Note that \b in \s+[A-Z]{2}\b is a word boundary that will force a non-word (space or punctuation or even end of string) to appear after 2 uppercase letters.
Just do not forget to use double backslash in Java to escape regex special metacharacters.
Here is a Java code demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Brooklyn NY 1194-5954";
Pattern pattern = Pattern.compile("^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\b)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0)); // the city name
}
OK.. I think I got it after hrs of tweaking and testing. May be helpful for someone else. This did the trick:
(?<=\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+
If all your data looks like the example in the question, the part you need is everything from the comma after the street up to a run of at least 2 uppercase letters, which represents the state.
This pattern matches that and captures a group which should represent the city (the group may include trailing whitespace):
,\s+([a-zA-Z\s]*)[A-Z]{2,}?\s+
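A minimal Java sketch of this pattern on two addresses from the question (the class and method names are assumptions for illustration). Group 1 keeps the whitespace before the state, so the result is trimmed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityAfterComma {
    static final Pattern CITY = Pattern.compile(",\\s+([a-zA-Z\\s]*)[A-Z]{2,}?\\s+");

    // Returns the trimmed city name, or null when the address does not match.
    static String city(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(city("2778 Ray Ridge Pkwy,\nBrooklyn NY 1194-5954"));
        System.out.println(city("1776 99th St,\nColumbia City OR 11994-1264"));
    }
}
```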
